How Not to Be Wrong: The Power of Mathematical Thinking
by Jordan Ellenberg

  You can see from the picture that the significance test isn’t the problem. It’s doing exactly the job it’s built to do. The genes that don’t affect schizophrenia very rarely pass the test, while the genes we’re really interested in pass half the time. But the nonactive genes are so massively preponderant that the circle of false positives, while small relative to the true negatives, is much larger than the circle of true positives.
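  To put rough numbers on that picture, here is a minimal sketch in Python. The counts are invented for illustration, since the chapter doesn't give exact figures: ten thousand genes screened, ten of them truly active, a test with 50% power and a 5% false positive rate.

```python
# A minimal sketch of the base-rate problem described above.
# All counts are hypothetical, chosen only to illustrate the shape of the argument.

genes_tested = 10_000   # genes screened (assumption)
truly_active = 10       # genes that really affect the disorder (assumption)
power        = 0.5      # chance a real effect passes the test ("half the time")
alpha        = 0.05     # chance a null gene passes by luck (p < .05)

inactive = genes_tested - truly_active
true_positives  = truly_active * power    # ~5 real hits
false_positives = inactive * alpha        # ~500 lucky hits

print(f"true positives:  {true_positives:.0f}")
print(f"false positives: {false_positives:.0f}")
print(f"share of 'significant' genes that are real: "
      f"{true_positives / (true_positives + false_positives):.1%}")
```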

  DOCTOR, IT HURTS WHEN I P

  And it gets worse. A low-powered study is only going to be able to see a pretty big effect. But sometimes you know that the effect, if it exists, is small. In other words, a study that accurately measures the effect of a gene is likely to be rejected as statistically insignificant, while any result that passes the p < .05 test is either a false positive or a true positive that massively overstates the gene’s effect. Low power is a special danger in fields where small studies are common and effect sizes are typically modest. A recent paper in Psychological Science, a premier psychological journal, found that married women were significantly more likely to support Mitt Romney, the Republican presidential candidate, when they were in the fertile portion of their ovulatory cycle: of those women queried during their peak fertility period, 40.4% expressed support for Romney, while only 23.4% of the married women polled at infertile times were pulling the lever for Mitt.* The sample is small, just 228 women, but the difference is big, big enough that the result passes the p-value test with a score of .03.
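  As a rough check on that arithmetic, here is a minimal two-proportion test in Python. The study reports a total sample of 228 and the two percentages; the even split between the fertile and infertile groups below is an assumption, not the study's actual design.

```python
import math

# A rough re-computation of the significance test described above.
# Only the total sample (228) and the two percentages are given,
# so the 50/50 split between the groups is an assumption.

n1, p1 = 114, 0.404   # fertile group (size assumed)
n2, p2 = 114, 0.234   # infertile group (size assumed)

# Standard two-proportion z-test with a pooled estimate under the null.
pooled = (n1 * p1 + n2 * p2) / (n1 + n2)
se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
z = (p1 - p2) / se
p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided tail probability

print(f"z = {z:.2f}, p = {p_value:.3f}")
# With these assumed group sizes the p-value comes out well under .05;
# the published figure of .03 reflects the study's actual design and group sizes.
```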

  Which is just the problem—the difference is too big. Is it really plausible that, among married women who dig Mitt Romney, nearly half spend a large part of each month supporting Barack Obama? Wouldn’t anyone notice?

  If there’s really a political swing to the right once ovulation kicks in, it seems likely to be substantially smaller. But the relatively small size of the study means a more realistic assessment of the strength of the effect would have been rejected, paradoxically, by the p-value filter. In other words, we can be quite confident that the large effect reported in the study is mostly or entirely just noise in the signal.

  But noise is just as likely to push you in the opposite direction from the real effect as it is to tell the truth. So we’re left in the dark by a result that offers plenty of statistical significance but very little confidence.

  Scientists call this problem “the winner’s curse,” and it’s one reason that impressive and loudly touted experimental results often melt into disappointing sludge when the experiments are repeated. In a representative case, a team of scientists led by psychologist Christopher Chabris* studied thirteen single-nucleotide polymorphisms (SNPs) in the genome that had been observed in previous studies to have statistically significant correlations with IQ scores. We know that the ability to do well on IQ-type tests is somewhat heritable, so it’s not unreasonable to look for genetic markers. But when Chabris’s team tested those SNPs against IQ measures in large data sets, like the ten-thousand-person Wisconsin Longitudinal Study, every single one of these associations vanished into insignificance; if they’re real at all, they’re almost certainly too small for even a big trial to detect. Genomicists nowadays believe that heritability of IQ scores is probably not concentrated in a few smarty-pants genes, but rather accumulates from numerous genetic features, each one having a tiny effect. Which means that if you go hunting for large effects of individual polymorphisms, you’ll succeed—at the same 1-in-20 rate as do the entrail readers.
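  A small simulation makes the winner's curse visible. The numbers below, a true effect of 0.1 standard deviations measured in studies of twenty subjects, are illustrative assumptions rather than figures from any of the studies above; the point is only that the studies which happen to clear the significance bar report effects several times larger than the truth, and sometimes in the wrong direction.

```python
import math, random

# A minimal simulation of the "winner's curse": a small true effect,
# a low-powered study, and attention paid only to significant results.
# The true effect, sample size, and noise level are illustrative assumptions.

random.seed(0)
true_effect = 0.1      # small real effect, in standard-deviation units
n           = 20       # small study -> low power
trials      = 100_000

significant_estimates = []
for _ in range(trials):
    # Observed mean effect of one small study: the true effect plus sampling noise.
    estimate = random.gauss(true_effect, 1 / math.sqrt(n))
    z = estimate * math.sqrt(n)
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided p-value
    if p < 0.05:
        significant_estimates.append(estimate)

avg = sum(significant_estimates) / len(significant_estimates)
wrong_sign = sum(e < 0 for e in significant_estimates)
print(f"power: {len(significant_estimates) / trials:.1%}")
print(f"average estimate among significant results: {avg:.2f} "
      f"(true effect is {true_effect})")
print(f"significant results with the wrong sign: "
      f"{wrong_sign / len(significant_estimates):.0%}")
```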

  Even Ioannidis doesn’t really think that only one in a thousand published papers is correct. Most scientific studies don’t consist of blundering around the genome at random; they test hypotheses that the researchers have some preexisting reason to think might be true, so the bottom row of the box is not quite so enormously dominant over the top. But the crisis of replicability is real. In a 2012 study, scientists at the California biotech company Amgen set out to replicate some of the most famous experimental results in the biology of cancer, fifty-three studies in all. In their independent trials, they were able to reproduce only six.

  How can this have happened? It’s not because genomicists and cancer researchers are dopes. In part, the replicability crisis is simply a reflection of the fact that science is hard and that most ideas we have are wrong—even most of those ideas that survive a first round of prodding.

  But there are practices in the world of science that make the crisis worse, and those can be changed. For one thing, we’re doing publishing wrong. Consider the profound xkcd cartoon below. Suppose you tested twenty genetic markers to see whether they were associated with some disorder of interest, and you found just one result that achieved p < .05 significance. Being a mathematical sophisticate, you’d recognize that one success in twenty is exactly what you’d expect if none of the markers had any effect, and you’d scoff at the misguided headline, just as the cartoonist intends you to.

  All the more so if you tested the same gene, or the green jelly bean, twenty times and got a statistically significant effect just once.

  But what if the green jelly bean were tested twenty times by twenty different research groups in twenty different labs? Nineteen of the labs find no statistically significant effect. They don’t write up their results—who’s going to publish the bombshell “green jelly beans irrelevant to your complexion” paper? The scientists in the twentieth lab, the lucky ones, find a statistically significant effect, because they got lucky—but they don’t know they got lucky. For all they can tell, their green-jellybeans-cause-acne theory has been tested only once, and it passed.
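  The arithmetic behind the twenty labs is easy to check: with twenty independent tests of a nonexistent effect, the chance that at least one of them clears p < .05 is 1 - 0.95^20, or about 64%. The short sketch below does the same calculation by brute force.

```python
import random

# Twenty independent tests of a null effect, each with a 5% chance of a
# spurious "significant" result: how often does some lab get a hit?

prob_at_least_one_hit = 1 - 0.95 ** 20
print(f"chance that at least one of 20 null tests passes p < .05: "
      f"{prob_at_least_one_hit:.0%}")   # about 64%

# The same thing by simulation.
random.seed(0)
trials = 100_000
hits = sum(any(random.random() < 0.05 for _ in range(20)) for _ in range(trials))
print(f"simulated: {hits / trials:.0%}")
```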

  If you decide what color jelly beans to eat based just on the papers that get published, you’re making the same mistake the army made when they counted the bullet holes on the planes that came back from Germany. As Abraham Wald pointed out, if you want an honest view of what’s going on, you also have to consider the planes that didn’t come back.

  This is the so-called file drawer problem—a scientific field has a drastically distorted view of the evidence for a hypothesis when public dissemination is cut off by a statistical significance threshold. But we’ve already given the problem another name. It’s the Baltimore stockbroker. The lucky scientist excitedly preparing a press release about dermatological correlates of Green Dye #16 is just like the naive investor mailing off his life savings to the crooked broker. The investor, like the scientist, gets to see the one rendition of the experiment that went well by chance, but is blind to the much larger group of experiments that failed.

  There’s one big difference, though. In science, there’s no shady con man and no innocent victim. When the scientific community file-drawers its failed experiments, it plays both parts at once. They’re running the con on themselves.

  And all this is assuming that the scientists in question are playing fair. But that doesn’t always happen. Remember the wiggle-room problem that ensnared the Bible coders? Scientists, subject to the intense pressure to publish lest they perish, are not immune to the same wiggly temptations. If you run your analysis and get a p-value of .06, you’re supposed to conclude that your results are statistically insignificant. But it takes a lot of mental strength to stuff years of work in the file drawer. After all, don’t the numbers for that one experimental subject look a little screwy? Probably an outlier, maybe try deleting that line of the spreadsheet. Did we control for age? Did we control for the weather outside? Did we control for age and the weather outside? Give yourself license to tweak and shade the statistical tests you carry out on your results, and you can often get that .06 down to a .04. Uri Simonsohn, a professor at Penn who’s a leader in the study of replicability, calls these practices “p-hacking.” Hacking the p isn’t usually as crude as I’ve made it out to be, and it’s seldom malicious. The p-hackers truly believe in their hypotheses, just as the Bible coders do, and when you’re a believer, it’s easy to come up with reasons that the analysis that gives a publishable p-value is the one you should have done in the first place.

  But everybody knows it’s not really right. When they don’t think anyone’s listening, scientists call this practice “torturing the data until it confesses.” And the reliability of the results is about what you’d expect from confessions extracted by force.

  Assessing the scale of the p-hacking problem is not so easy—you can’t examine the papers that are hidden in the file drawer or were simply never written, just as you can’t examine the downed planes in Germany to see where they were hit. But you can, like Abraham Wald, make some inferences about data you can’t measure directly.

  Think again about the International Journal of Haruspicy. What would you see if you looked at every paper ever published there and recorded the p-values you found? Remember, in this case the null hypothesis is always true, because haruspicy doesn’t work; so 5% of experiments will record a p-value of .05 or below; 4% will score .04 or below; 3% will score .03 or below, and so on. Another way to say this is that the number of experiments yielding a p-value between .04 and .05 should be about the same as the number scoring between .03 and .04, between .02 and .03, and so on. If you plotted all the p-values reported in all the papers you’d see a flat graph like this:
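  You can generate that flat graph yourself. The sketch below simulates a journal full of haruspicy, with every null hypothesis true, and tallies the resulting p-values into bins of width .1; the bins come out essentially equal.

```python
import math, random

# A sketch of the flat graph described above: when the null hypothesis is
# always true, p-values land uniformly between 0 and 1. Each "experiment"
# here is just a two-sided z-test on pure noise.

random.seed(0)
trials = 50_000
bins = [0] * 10   # counts of p-values in [0, .1), [.1, .2), ..., [.9, 1]

for _ in range(trials):
    z = random.gauss(0, 1)                   # test statistic under the null
    p = math.erfc(abs(z) / math.sqrt(2))     # two-sided p-value
    bins[min(int(p * 10), 9)] += 1

for i, count in enumerate(bins):
    print(f"p in [{i / 10:.1f}, {(i + 1) / 10:.1f}): {count}")
# Each bin holds roughly 10% of the results: the graph is flat.
```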

  Now what if you looked at a real journal? Hopefully, a lot of the phenomena you’re hunting for are actually real, which makes it more likely that your experiments will get a good (which means low) p-value score. So the graph of the p-values ought to slope downward:

  Except that’s not exactly what happens in real life. In fields ranging from political science to economics to psychology to sociology, statistical detectives have found a noticeable upward slope as the p-value approaches the .05 threshold:

  That slope is the shape of p-hacking. It tells you that a lot of experimental results that belong over on the unpublishable side of the p = .05 boundary have been cajoled, prodded, tweaked, or just plain tortured until, at last, they end up just on the happy side of the line. That’s good for the scientists who need publications, but it’s bad for science.
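  One way to see where that bump comes from is to simulate a stylized p-hacker. In the sketch below, the “experimenter” tests pure noise and, whenever the result misses significance, collects a bit more data and tests again, which is one common form of hacking. The batch sizes and the number of allowed second looks are arbitrary assumptions.

```python
import math, random

# A stylized model of p-hacking: test noise, and if the result isn't
# significant yet, collect a little more data and re-test, up to five times.
# All numbers are illustrative assumptions.

random.seed(0)

def one_hacked_experiment(batches=5, batch_size=20):
    data = []
    for _ in range(batches):
        data += [random.gauss(0, 1) for _ in range(batch_size)]
        mean = sum(data) / len(data)
        z = mean * math.sqrt(len(data))
        p = math.erfc(abs(z) / math.sqrt(2))   # two-sided p-value so far
        if p < 0.05:
            break          # stop as soon as the result looks publishable
    return p

pvals = [one_hacked_experiment() for _ in range(50_000)]
just_below = sum(0.04 <= p < 0.05 for p in pvals)
just_above = sum(0.05 <= p < 0.06 for p in pvals)
print(f"p in [.04, .05): {just_below}")
print(f"p in [.05, .06): {just_above}")
# Without the re-testing these two counts would be about equal;
# optional stopping piles results up just under the threshold.
```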

  What if an author refuses to torture the data, or the torture fails to deliver the desired result, and the p-value stays stuck just above the all-important .05? There are workarounds. Scientists will twist themselves into elaborate verbal knots trying to justify reporting a result that doesn’t make it to statistical significance: they say the result is “almost statistically significant,” or “leaning toward significance,” or “well-nigh significant,” or “at the brink of significance,” or even, tantalizingly, that it “hovers on the brink of significance.”* It’s easy to make fun of the anguished researchers who resort to such phrases, but we should be hating on the game, not the players—it’s not their fault that publication is conditioned on an all-or-nothing threshold. To live or die by the .05 is to make a basic category error, treating a continuous variable (how much evidence do we have that the drug works, the gene predicts IQ, fertile women like Republicans?) as if it were a binary one (true or false? yes or no?). Scientists should be allowed to report statistically insignificant data.

  In some settings, they may even be compelled to. In a 2010 opinion, the U.S. Supreme Court ruled unanimously that Matrixx, the maker of the cold remedy Zicam, was required to reveal that some users of its product had suffered anosmia, a loss of the sense of smell. The court’s opinion, written by Sonia Sotomayor, held that even though the reports of anosmia didn’t pass the significance test, they still contributed to the “total mix” of information investors in a company can reasonably expect to have available. A result with a weak p-value may provide only a little evidence, but a little is better than none; a result with a strong p-value might provide more evidence, but as we’ve seen, it’s far from a certification that the claimed effect is real.

  There is nothing special, after all, about the value .05. It’s purely arbitrary, a convention chosen by Fisher. There’s value in convention; a single threshold, agreed on by all, ensures that we know what we’re talking about when we say the word “significant.” I once read a paper by Robert Rector and Kirk Johnson of the conservative Heritage Foundation complaining that a rival team of scientists had falsely claimed that abstinence pledges made no difference in teen rates of sexually transmitted diseases. In fact, the teens in the study who’d pledged to wait for their wedding night did have a slightly lower rate of STDs than the rest of the sample, but the difference wasn’t statistically significant. The Heritagists had a point; the evidence that pledges worked was weak, but not entirely absent.

  On the other hand, Rector and Johnson write in another paper, concerning a statistically insignificant relationship between race and poverty that they wish to dismiss, “If a variable is not statistically significant, it means that the variable has no statistically discernable difference between the coefficient value and zero, so there is no effect.” What’s good for the abstinent goose is good for the racially charged gander! The value of convention is that it enforces some discipline on researchers, guarding them from the temptation to let their own preferences determine which results count and which don’t.

  But a conventional boundary, obeyed long enough, can be easily mistaken for an actual thing in the world. Imagine if we talked about the state of the economy this way! Economists have a formal definition of a “recession,” which depends on arbitrary thresholds just as “statistical significance” does. One doesn’t say, “I don’t care about the unemployment rate, or housing starts, or the aggregate burden of student loans, or the federal deficit; if it’s not a recession, we’re not going to talk about it.” One would be nuts to say so. The critics—and there are more of them, and they are louder, each year—say that a great deal of scientific practice is nuts in just this way.

  THE DETECTIVE, NOT THE JUDGE

  It’s clear that it’s wrong to use “p < .05” as a synonym for “true” and “p > .05” to mean “false.” Reductio ad unlikely, intuitively appealing as it is, just doesn’t work as a principle for inferring the scientific truth underlying the data.

  But what’s the alternative? If you’ve ever run an experiment, you know scientific truth doesn’t pop out of the clouds blowing a flaming trumpet at you. Data is messy, and inference is hard.

  One simple and popular strategy is to report confidence intervals in addition to p-values. This involves a slight widening of conceptual scope, asking us to consider not only the null hypothesis but a whole range of alternatives. Perhaps you operate an online store that sells artisanal pinking shears. Being a modern person (except insofar as you make artisanal pinking shears) you set up an A/B test, where half your users see the current version of your website (A) and half see a revamped version (B) with an animated pair of shears that does a little song and dance on top of the “Buy Now” button. And you find that purchases go up 10% with option B. Great! Now, if you’re a sophisticated type, you might be worried about whether this increase was merely a matter of random fluctuation—so you compute a p-value, finding that the chance of getting a result this good if the redesign weren’t actually working (i.e., if the null hypothesis were correct) is a mere 0.03.*
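  For concreteness, here is roughly what that significance test might look like in code. The story gives only the 10% lift and the p-value of .03, so the traffic and the baseline conversion rate below are assumptions chosen to land in that neighborhood.

```python
import math

# A sketch of the A/B significance test described above. Visitor counts and
# the baseline conversion rate are assumptions, not figures from the text.

n_a, rate_a = 18_000, 0.050   # option A: assumed traffic and conversion rate
n_b, rate_b = 18_000, 0.055   # option B: a 10% relative lift

# Two-proportion z-test with a pooled estimate under the null.
pooled = (n_a * rate_a + n_b * rate_b) / (n_a + n_b)
se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
z = (rate_b - rate_a) / se
p_value = math.erfc(abs(z) / math.sqrt(2))    # two-sided p-value

print(f"z = {z:.2f}, p = {p_value:.3f}")      # roughly .03 with these numbers
```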

  But why stop there? If I’m going to pay a college kid to superimpose dancing cutlery on all my pages, I want to know not only whether it works, but how well. Is the effect I saw consistent with the hypothesis that the redesign, in the long term, is really only improving my sales by 5%? Under that hypothesis, you might find that observing growth as big as 10% is much more likely, with a probability of, say, 0.2. In other words, the hypothesis that the redesign is 5% better is not ruled out by the reductio ad unlikely. On the other hand, you might optimistically wonder whether you got unlucky, and the redesign was actually making your shears 25% more appealing. You compute another p-value and get 0.01, unlikely enough to induce you to throw out that hypothesis.

  The confidence interval is the range of hypotheses that the reductio doesn’t demand that you trash, the ones that are reasonably consistent with the outcome you actually observed. In this case, the confidence interval might be the range from +3% to +17%. The fact that zero, the null hypothesis, is not included in the confidence interval is just to say that the results are statistically significant in the sense we described earlier in the chapter.
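  You can carry out that construction directly: scan over candidate effect sizes, test each one against the data you observed, and keep the hypotheses the reductio doesn’t reject. The sketch below does exactly that, reusing the assumed A/B numbers from the previous sketch; with different assumed inputs the endpoints naturally differ from the +3% and +17% in the text.

```python
import math

# The confidence interval as the set of hypothesized effects that are NOT
# rejected at the .05 level, built from the assumed A/B numbers above.

n_a, rate_a = 18_000, 0.050
n_b, rate_b = 18_000, 0.055
observed_diff = rate_b - rate_a
# Unpooled standard error of the difference, used when testing each alternative.
se = math.sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)

kept = []
for step in range(-100, 301):             # hypothesized lifts from -10% to +30%
    lift = step / 1000                    # relative lift, e.g. 0.05 = 5%
    hypothesized_diff = rate_a * lift     # what that lift means in raw conversion
    z = (observed_diff - hypothesized_diff) / se
    p = math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value against this hypothesis
    if p >= 0.05:                         # not rejected -> stays in the interval
        kept.append(lift)

print(f"95% confidence interval for the lift: "
      f"about {min(kept):+.1%} to {max(kept):+.1%}")
```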

  But the confidence interval tells you a lot more. An interval of [+3%, +17%] licenses you to be confident that the effect is positive, but not that it’s particularly large. An interval of [+9%, +11%], on the other hand, suggests much more strongly that the effect is not only positive but sizable.

  The confidence interval is also informative in cases where you don’t get a statistically significant result—that is, where the confidence interval contains zero. If the confidence interval is [−0.5%, 0.5%], then the reason you didn’t get statistical significance is because you have good evidence the intervention doesn’t do anything. If the confidence interval is [−20%, 20%], the reason you didn’t get statistical significance is because you have no idea whether the intervention has an effect, or in which direction it goes. Those two outcomes look the same from the viewpoint of statistical significance, but have quite different implications for what you should do next.

  The development of the confidence interval is generally ascribed to Jerzy Neyman, another giant of early statistics. Neyman was a Pole who, like Abraham Wald, started as a pure mathematician in Eastern Europe before taking up the then-new practice of mathematical statistics and moving to the West. In the late 1920s, Neyman began collaborating with Egon Pearson, who had inherited from his father Karl both an academic position in London and a bitter academic feud with R. A. Fisher. Fisher was a difficult type, always ready for a fight, about whom his own daughter said, “He grew up without developing a sensitivity to the ordinary humanity of his fellows.” In Neyman and Pearson he found opponents sharp enough to battle him for decades.

 
