[Figure: Normal Distribution Standard Deviations]
So, if you stopped a random woman on the street, you could use these facts to form a likely guess for her height. A guess of around five feet four inches (162 centimeters) would be best, as that’s the mean. Additionally, you could, with about two-to-one odds, guess that she will have a height between five feet one inch and five feet seven inches. That’s because the standard deviation of women’s heights is slightly less than three inches, so about two-thirds of women’s heights will fall within that range (within one standard deviation of the mean). By contrast, women shorter than four feet ten inches or taller than five feet ten inches together make up only about 5 percent of all women (falling outside two standard deviations from the mean).
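Those percentages come straight from the normal distribution’s cumulative distribution function, and they are easy to check. Here is a minimal Python sketch (the mean of 64 inches and standard deviation of 2.9 inches are assumed values, matching the rough figures above):

```python
import math

def fraction_within(k):
    """Fraction of any normal distribution within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

mean, sd = 64.0, 2.9  # assumed: women's heights in inches, per the text above

for k in (1, 2):
    low, high = mean - k * sd, mean + k * sd
    print(f"within {k} SD ({low:.1f} to {high:.1f} inches): "
          f"{fraction_within(k):.1%} of women")
# within 1 SD (61.1 to 66.9 inches): 68.3% of women
# within 2 SD (58.2 to 69.8 inches): 95.4% of women
```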
[Figure: Probability Distributions]
Log-normal distribution: applies to phenomena that follow a power law relationship, such as wealth, the size of cities, and insurance losses.
Poisson distribution: applies to independent and random events that occur in an interval of time or space, such as lightning strikes or numbers of murders in a city.
Exponential distribution: applies to the timing of events, such as the survival of people and products, service times, and radioactive particle decay.
There are many other common probability distributions besides the normal distribution that are useful across a variety of circumstances. A few are depicted in the figure on the previous page.
We called this section “The Bell Curve,” however, because the normal distribution is especially useful due to one of the handiest results in all of statistics, called the central limit theorem. This theorem states that when numbers are drawn from the same distribution and then are averaged, this resulting average approximately follows a normal distribution. This is the case even if the numbers originally came from a completely different distribution.
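To see the theorem in action before turning to the polling example, here is a minimal simulation sketch in Python: it repeatedly averages draws from an exponential distribution, which is heavily skewed and looks nothing like a bell curve, and shows that the averages nonetheless behave like a normal distribution:

```python
import random
import statistics

random.seed(42)

# Exponential draws are heavily skewed: lots of small values, a long tail of
# large ones -- nothing like a bell curve. Their true mean here is 1.0.
def average_of_draws(n):
    return statistics.fmean(random.expovariate(1.0) for _ in range(n))

# Average 30 draws at a time, many times over.
averages = [average_of_draws(30) for _ in range(10_000)]

spread = statistics.stdev(averages)
within_one_sd = sum(abs(a - 1.0) <= spread for a in averages) / len(averages)

print(f"mean of the averages: {statistics.fmean(averages):.3f}")  # ~1.000
print(f"share within one standard deviation of the mean: {within_one_sd:.1%}")
# ~68%, the signature one-standard-deviation share of a normal distribution
```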
To appreciate what this theorem means and why it is so useful, consider the familiar opinion poll that determines an approval rating, such as for the U.S. Congress. Each person is asked whether they approve of Congress or not. That means the individual data points are each just a yes or a no.
This type of data looks nothing like a normal distribution, as each data point can take only one of two possible values. Binary data like this is often analyzed using a different probability distribution, called the Bernoulli distribution, which represents the result of a single yes/no-type experiment or question, such as from a survey or poll. This distribution is useful in a wide variety of situations, such as analyzing advertising campaigns (whether someone purchased or not), clinical trials (responded to treatment or not), and A/B testing (clicked or not).
The estimated approval rating is just an average of all of the different individual answers (1 for approval and 0 otherwise). For example, if 1,000 people were polled and 240 approved, then the approval rating would be 24.0 percent. The central limit theorem tells us that this statistical average (sample mean) is approximately normally distributed (assuming enough people participate in the survey). The figure on the next page illustrates how this works visually with the Bernoulli distribution and two others that also initially look nothing like the normal distribution.
The middle column shows how the distribution of the sample mean from a Bernoulli distribution, made up of a series of ones and zeros, ends up looking like a bell curve. The first row depicts a distribution with a 75 percent chance of disapproval (the spike at 0 on the left) and a 25 percent chance of approval (the spike at 1 on the right). This 25 percent chance is based on the approval rating across the whole country, if you polled everyone. Each person in a poll comes from this population distribution.
When you take a poll, you get only an estimate of the overall approval rating (like the 24 percent approval rating estimate mentioned earlier). When you do that, you are taking a sample from the entire population (e.g., asking one thousand people) and averaging the results to calculate the estimate. This sample mean has a distribution itself, called the sample distribution, which describes the chances of getting each possible approval rating from the sample. You can think of this distribution as the result of plotting the different approval ratings (sample means) obtained from many, many polls.
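A short simulation makes this concrete. The sketch below assumes a true approval rating of 25 percent (as in the figure) and runs many hypothetical polls of 1,000 people each:

```python
import random
import statistics

random.seed(0)
TRUE_APPROVAL = 0.25  # assumed population approval rating, as in the figure
POLL_SIZE = 1_000

def run_poll():
    """One poll: the average of 1,000 yes/no answers (1 = approve, 0 = not)."""
    return sum(random.random() < TRUE_APPROVAL for _ in range(POLL_SIZE)) / POLL_SIZE

# The sample distribution: approval ratings from many, many repeated polls.
ratings = [run_poll() for _ in range(10_000)]

print(f"average poll result: {statistics.fmean(ratings):.3f}")  # ~0.250
print(f"spread (standard deviation): {statistics.stdev(ratings):.4f}")
# ~0.0137, matching sqrt(p * (1 - p) / n) = sqrt(0.25 * 0.75 / 1000)
```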
[Figure: Central Limit Theorem. Columns show three population distributions: uniform, Bernoulli, and exponential. Rows show each population distribution (over values of x), followed by the sample distribution of the sample mean (over values of the sample mean) for sample sizes of 2, 5, and 30.]
The second row shows a plot of this sample distribution for an approval rating based on polling two randomly selected people. This plot looks different from the original distribution, but still nothing like a normal distribution, as it can have only three outcomes: two approvals (the spike at 1), two disapprovals (the spike at 0), or one approval and one disapproval (the spike at 0.5).
If you base the polls on asking five people, the sample distribution starts to look a bit more like a bell shape with six possible outcomes (third row). With thirty people (thirty-one outcomes, depicted in the fourth row), it starts to look a lot like the characteristic bell-curve shape of the normal distribution.
As you ask more and more people, the sample distribution becomes more and more like a normal distribution, with a mean of 25 percent, the true approval rating from the population distribution. Just as in the case of body temperatures or heights, while this mean is the most likely value obtained by the poll, values close to it are also likely, such as 24 percent. Values further and further away are less and less likely, with probabilities following the normal distribution.
How much less likely, exactly? It depends on how many people you ask: the more people, the tighter the distribution. To convey this information, polls like this usually report a margin of error. An article describing the poll results might include something like “Congress has a 24 percent approval rating with a margin of error of ±3 percent.” The “±3 percent” is the margin of error, but where this margin of error comes from, or what it really means, is rarely explained. With the mental models above, you can now work it out yourself.
The margin of error is really a type of confidence interval: an estimated range of numbers that you think may include the true value of the parameter you are studying, e.g., the approval rating. This range has a corresponding confidence level, which quantifies how confident you are that the true value of the parameter falls in the range you estimated. For example, a confidence level of 95 percent tells you that if you ran the poll many times and calculated many confidence intervals (one for each poll), on average 95 percent of them would include the true approval rating (i.e., 25 percent).
Most media reports don’t mention the confidence level used to calculate their margin of error, but it is usually safe to assume they used 95 percent. Research publications, by contrast, are usually more explicit in stating what confidence levels they used to represent the uncertainty in their estimates (again typically, though not always, 95 percent).
For the approval-rating scenario, the range is calculated using the central limit theorem: since the sample mean is approximately normally distributed, we should expect 95 percent of possible poll results to fall within two standard deviations of the true mean (i.e., the real approval rating).
The part that hasn’t been explained yet is that the standard deviation of this distribution, also called the standard error, is not the same as the sample standard deviation calculated earlier. However, these two values are directly related: the standard error is the sample standard deviation divided by the square root of the sample size. This means that if you want to reduce the margin of error by a factor of two, you need to increase the sample size by a factor of four. For a yes/no poll like the approval rating, a margin of error of 10 percent is achieved with just 96 people, 5 percent at 384 people, 3 percent at 1,067 people, and 2 percent at 2,401 people. Since the margin of error is an expression of how confident the pollsters are in their estimate, it makes sense that it is directly related to the size of the sample group.
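Those sample sizes follow from the standard-error formula. Here is a sketch of the arithmetic in Python, using the two conventions such tables typically assume: a worst-case 50/50 split, and the precise 1.96 standard errors behind the two-standard-deviations rule of thumb for 95 percent confidence:

```python
import math

Z = 1.96  # standard errors spanned by a 95% confidence interval
P = 0.5   # worst-case proportion for a yes/no question

def margin_of_error(n):
    """Half-width of the 95% confidence interval for a poll of n people."""
    return Z * math.sqrt(P * (1 - P) / n)

def people_needed(moe):
    """Sample size whose margin of error is moe, rounded to the nearest person."""
    return round((Z / (2 * moe)) ** 2)

for moe in (0.10, 0.05, 0.03, 0.02):
    n = people_needed(moe)
    print(f"±{moe:.0%} needs about {n:,} people "
          f"(check: margin of error at that size = {margin_of_error(n):.1%})")
# ±10% needs about 96 people (check: margin of error at that size = 10.0%)
# ±5% needs about 384 people (check: margin of error at that size = 5.0%)
# ±3% needs about 1,067 people (check: margin of error at that size = 3.0%)
# ±2% needs about 2,401 people (check: margin of error at that size = 2.0%)
```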
The illustration on the next page shows how confidence intervals work for repeated experiments. It depicts one hundred 95 percent confidence intervals for the probability of flipping heads. Each was calculated from an experiment that involved simulating flipping a fair coin one hundred times. These confidence intervals are represented graphically in the figure by error bars, which are a visual way to display a measure of uncertainty for an estimate.
[Figure: 95% Confidence Intervals from 100 Fair Coin Flips, Experiment Repeated 100 Times]
Error bars are not always confidence intervals; they could be derived from other types of error calculations too. On an error bar, the dot in the middle is the parameter estimate, in this case the sample mean, and the lines at the end indicate the top and bottom of the range, in this case the confidence interval.
The error bars in the plot vary due to what was seen in the different experiments, but they each span a range of about twenty percentage points, which corresponds to the ±10 percent mentioned above (with a sample size of one hundred flips). Given the 95 percent confidence level, you would expect ninety-five of these confidence intervals to include the true mean of 50 percent. In this case, ninety-three of the intervals included 50 percent. (The seven intervals that didn’t are highlighted in black.)
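The experiment behind the figure is easy to reproduce. Here is a minimal sketch, using the normal-approximation confidence interval described above:

```python
import math
import random

random.seed(7)
FLIPS, EXPERIMENTS, Z = 100, 100, 1.96

contains_truth = 0
for _ in range(EXPERIMENTS):
    heads = sum(random.random() < 0.5 for _ in range(FLIPS))
    p_hat = heads / FLIPS                        # sample mean for this experiment
    se = math.sqrt(p_hat * (1 - p_hat) / FLIPS)  # standard error
    low, high = p_hat - Z * se, p_hat + Z * se   # 95% confidence interval
    contains_truth += low <= 0.5 <= high

print(f"{contains_truth} of {EXPERIMENTS} intervals contain the true 50%")
# Around 95 on a typical run; any single run can land a bit above or below
# (the run shown in the figure happened to get 93).
```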
Confidence intervals like these are often used as estimates of reasonable values for a parameter, such as the probability of getting heads. However, as you just saw, the true value of the parameter (in this case 50 percent) is sometimes outside a given confidence interval. The lesson here: a confidence interval is not a definitive range for all possible values, and the true value is not guaranteed to fall inside it.
One thing that really bothers us is when statistics are reported in the media without error bars or confidence intervals. Always remember to look for them when reading reports and to include them in your own work. Without an error estimate, you have no idea how confident to be in that number—is the true value likely really close to it, or could it be really far away from it? The confidence interval tells you that!
IT DEPENDS
As you saw in the last section, the average woman’s height is five feet four inches. If you had to guess the height of a random stranger, but you didn’t know for a fact that they were a woman, five feet four inches wouldn’t be a great guess because the average man is closer to five feet nine inches, and so something in the middle would be better. But if you had the additional information that the person was a woman, then five feet four inches would be the best guess. The additional information changes the probability.
This is an example of a model called conditional probability, the probability of one thing happening under the condition that something else also happened. Conditional probability allows us to better estimate probabilities by using this additional information.
Conditional probabilities are common in everyday life. For example, home insurance rates are tailored to the differing conditional probabilities of insurance claims (e.g., premiums are higher in coastal Florida, where hurricane damage is more likely, relative to where we live in Pennsylvania). Similarly, genetic testing can tell you if you are at higher risk for certain diseases; women with abnormal BRCA1 or BRCA2 genes have up to an 80 percent risk of developing breast cancer by age ninety.
Conditional probability is denoted with a | symbol. For example, the probability (P) that you will get breast cancer by age ninety given that you are a woman with a BRCA mutation would be denoted as P(breast cancer by ninety | woman with BRCA mutation).
Some people find conditional probabilities confusing. They mix up the probability that an event A will happen given a condition that event B happened—P(A|B)—with the probability that an event B will happen given the condition that event A happened—P(B|A). This is known as the inverse fallacy, whereby people think that P(A|B) and P(B|A) must have similar probabilities. While you just saw that P(breast cancer by ninety | woman with BRCA mutation) is about 80 percent, by contrast P(woman with BRCA mutation | breast cancer by ninety) is only 5 to 10 percent, because most women who develop breast cancer by ninety do not have these mutations.
Let’s walk through a longer example to see this fallacy in action. Suppose the police pull someone over at random at a drunk-driving checkpoint and administer a Breathalyzer test that indicates they are drunk. Further, suppose the test is wrong on average 5 percent of the time, saying that a sober person is drunk. What is the probability that this person is wrongly accused of drunk driving?
Your first inclination might be to say 5 percent. However, you have been given the probability that the test says someone is drunk given they are sober, or P(Test=drunk | Person=sober) = 5 percent. But what you have been asked for is the probability that the person is sober given that the test says they are drunk, or P(Person=sober | Test=drunk). These are not the same probabilities!
What you haven’t considered is how the results depend on the base rate of the percentage of drunk drivers. Consider the scenario where everyone makes the right decision, and no one ever drives drunk. In this case the probability that a person is sober is 100 percent, regardless of what the Breathalyzer test results say. When a probability calculation fails to account for the base rate (like the base rate of drunk drivers), the mistake that is made is called the base rate fallacy.
Let’s consider a more realistic base rate, where one in a thousand drivers is drunk, meaning there is a small chance (0.1 percent) that a person the police randomly pull over is drunk. And since one in twenty tests of a sober driver will wrongly say they are drunk (the 5 percent error rate), the police will most likely go through a lot of wrong tests before they find someone who is actually driving drunk.
In fact, if the police stop a thousand people, they would on average conduct nearly fifty wrong tests along the way to finding one actual drunk driver. So there is only about a 2 percent chance that a failed Breathalyzer in this scenario indicates that the person is actually drunk. Alternatively, this can be stated as a 98 percent chance that the person is sober. That’s way, way more than 5 percent!
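The counting in that paragraph can be made explicit. Here is a minimal sketch (assuming, as the example implies, that the test never misses an actually drunk driver):

```python
DRIVERS = 1_000
P_DRUNK = 1 / 1_000    # base rate: one in a thousand drivers is drunk
FALSE_POSITIVE = 0.05  # chance the test calls a sober driver drunk
TRUE_POSITIVE = 1.0    # assumption: the test always flags an actually drunk driver

actually_drunk = DRIVERS * P_DRUNK * TRUE_POSITIVE       # 1 correct positive
false_alarms = DRIVERS * (1 - P_DRUNK) * FALSE_POSITIVE  # ~49.95 wrong tests

p_drunk_given_positive = actually_drunk / (actually_drunk + false_alarms)
print(f"P(drunk | test says drunk) = {p_drunk_given_positive:.1%}")      # ~2.0%
print(f"P(sober | test says drunk) = {1 - p_drunk_given_positive:.1%}")  # ~98.0%
```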
So, P(A|B) does not equal P(B|A), but how are they related? There is a very useful result in probability called Bayes’ theorem, which tells us the relationship between these two conditional probabilities. On the next page, you will see how Bayes’ theorem relates these probabilities and how, in the drunk-driving example, Bayes’ theorem could be applied to calculate the 2 percent result.
[Figure: Bayes’ Theorem and the Base Rate Fallacy]
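In formula form, Bayes’ theorem relates the two conditional probabilities like this (with the denominator expanded via the law of total probability):

```latex
P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)},
\qquad
P(B) = P(B \mid A)\,P(A) + P(B \mid \neg A)\,P(\neg A)
```

Plugging in the drunk-driving numbers, with A = drunk and B = the test says drunk (and again assuming the test always flags an actually drunk driver): P(drunk | test says drunk) = (1.0 × 0.001) / (1.0 × 0.001 + 0.05 × 0.999) ≈ 2 percent, the result quoted above.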
Now that you know about Bayes’ theorem, you should also know that there are two schools of thought in statistics, based on different ways to think about probability: frequentist and Bayesian. Most studies you hear about in the news are based on frequentist statistics, which requires many observations of an event before it can make reliable statistical determinations. Frequentists view probability as fundamentally tied to the frequency of events.
By observing the frequency of results over a large sample (e.g., asking a large number of people if they approve of Congress), frequentists estimate an unknown quantity. If there are very few data points, however, they can’t say much of anything, since the confidence intervals they can calculate will be extremely large. In their view, probability without observations makes no sense.
Bayesians, by contrast, allow probabilistic judgments about any situation, regardless of whether any observations have yet occurred. To do this, Bayesians begin by bringing related evidence to statistical determinations. For example, picking a penny up off the street, you’d probably initially estimate a fifty-fifty chance that it would come up heads if you flipped it, even if you’d never observed a flip of that particular coin before. In Bayesian statistics, you can bring such knowledge of base rates to a problem. In frequentist statistics, you cannot.
Many people find this Bayesian way of looking at probability more intuitive because it is similar to how your beliefs naturally evolve. In everyday life, you aren’t starting from scratch every time, as you would in frequentist statistics. For instance, on policy issues, your starting point is what you currently know on that topic—what Bayesians call a prior—and then when you get new data, you (hopefully) update your prior based on the new information. The same is true for relationships, with your starting point being your previous experiences with that person; a strong prior would be a lifelong relationship, whereas a weak prior would be just a first impression.
You saw in the last section that frequentist statistics produce confidence intervals. These statistics tell you that if you ran an experiment many times (e.g., the one-hundred-coin-flips example we presented), the confidence intervals calculated should contain the parameter you are studying (e.g., 50 percent probability of getting heads) at the level of confidence specified (e.g., 95 percent of the time). To many people’s dismay, a confidence interval does not say there is a 95 percent chance that the true value of the parameter is in the interval. By contrast, Bayesian statistics analogously produce credible intervals, which do say that: a 95 percent credible interval is a range that, given the data and the prior, contains the true value of the parameter with 95 percent probability. As such, this Bayesian way of doing things is again more intuitive.
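For contrast with the earlier coin-flip confidence intervals, here is a minimal sketch of the Bayesian version. It assumes a uniform prior over the coin’s heads probability (so the posterior after the flips is a Beta distribution) and hypothetical data of 55 heads in 100 flips:

```python
import random
import statistics

random.seed(1)

# Hypothetical data: 100 flips of a coin of unknown bias, 55 of them heads.
heads, tails = 55, 45

# With a uniform Beta(1, 1) prior, the posterior for the heads probability is
# Beta(heads + 1, tails + 1). Sample from it to approximate the posterior.
posterior = sorted(random.betavariate(heads + 1, tails + 1) for _ in range(100_000))

# A 95% credible interval: the central 95% of the posterior draws.
low, high = posterior[2_500], posterior[97_500]
print(f"posterior mean: {statistics.fmean(posterior):.3f}")  # ~0.549
print(f"95% credible interval: ({low:.3f}, {high:.3f})")     # roughly (0.45, 0.65)
# This interval can be read directly: given the data and the prior, there is
# a 95% probability that the coin's heads probability lies in this range.
```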