How Not to Be Wrong: The Power of Mathematical Thinking


by Jordan Ellenberg


  DEFEATING THE NULL

  We’ve been circling around the fundamental question: How surprised should I be by what I see in the world? This is a book about math, and you must suspect that there’s a numerical way to get at this. There is. But it is fraught with danger. We need to talk about p-values.

  But first we need to talk about improbability, about which we’ve been unacceptably vague so far. There’s a reason for that. There are parts of math, like geometry and arithmetic, that we teach to children and that children, to some extent, teach themselves. Those are the parts that are closest to our native intuition. We are born almost knowing how to count, and how to categorize objects by their location and shape, and the formal, mathematical renditions of these concepts are not so different from the ones we start with.

  Probability is different. We certainly have built-in intuition for thinking about uncertain things, but it’s much harder to articulate. There’s a reason that the mathematical theory of probability came so late in mathematical history, and appears so late in the math curriculum, when it appears at all. When you try to think carefully about what probability means, you get a little woozy. When we say, “The probability that a flipped coin will land heads is 1/2,” we’re invoking the Law of Large Numbers from chapter 4, which says that if you flip the coin many, many times, the proportion of heads will almost inevitably approach 1/2, as if constrained by a narrowing channel. This is what’s called the frequentist view of probability.
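  If you would rather watch the Law of Large Numbers at work than take it on faith, a few lines of Python will do it. The sketch below is only an illustration (the checkpoint counts are ones I picked, not anything from the chapter), but it shows the running proportion of heads settling into that narrowing channel around 1/2.

    import random

    random.seed(0)                          # fix the randomness so the run is repeatable
    heads = 0
    for n in range(1, 100_001):
        heads += random.random() < 0.5      # one fair coin flip: heads with probability 1/2
        if n in (10, 100, 1_000, 10_000, 100_000):
            print(n, heads / n)             # the running proportion creeps toward 1/2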

  But what can we mean when we say, “The probability that it will rain tomorrow is 20%”? Tomorrow only happens once; it’s not an experiment we can repeat like a coin flip again and again. With some effort, we can shoehorn the weather into the frequentist model; maybe we mean that among some large population of days with conditions similar to this one, the following day was rainy 20% of the time. But then you’re stuck when asked, “What’s the probability that the human race will go extinct in the next thousand years?” This is, almost by definition, an experiment you can’t repeat. We use probability even to talk about events that cannot possibly be thought of as subject to chance. What’s the probability that consuming olive oil prevents cancer? What’s the probability that Shakespeare was the author of Shakespeare’s plays? What’s the probability that God wrote the Bible and cooked up the earth? It’s hard to license talking about these things in the same language we use to assess the outcomes of coin flips and dice rolls. And yet—we find ourselves able to say, of questions like this, “It seems improbable” or “It seems likely.” Once we’ve done so, how can we resist the temptation to ask, “How likely?”

  It’s one thing to ask, another to answer. I can think of no experiment that directly assesses the likelihood that the Man Upstairs actually is Upstairs (or is a Man, for that matter). So we have to do the next best thing—or at least, what traditional statistical practice holds to be the next best thing. (As we’ll see, there’s controversy on this point.)

  We said it was improbable that the names of medieval rabbis are hidden in the letters of the Torah. But is it? Many religious Jews start from the view that everything there is to know is contained, somehow or other, in the Torah’s words. If that’s the case, the presence of the rabbis’ names and birthdays there is not improbable at all; indeed, it’s almost required.

  You can tell a similar story about the North Carolina lottery. It sounds improbable that an identical set of winning numbers would come up twice in a single week. And that’s true, if you agree with the hypothesis that the numbers are drawn from the cage completely at random. But maybe you don’t. Maybe you think the randomization system is malfunctioning, and the numbers 4, 21, 23, 34, 39 are more likely to come up than others. Or maybe you think a corrupt lottery official is picking the numbers to match his own favorite ticket. Under either of those hypotheses, the amazing coincidence is not improbable at all. Improbability, as described here, is a relative notion, not an absolute one; when we say an outcome is improbable, we are always saying, explicitly or not, that it is improbable under some set of hypotheses we’ve made about the underlying mechanisms of the world.

  Many scientific questions can be boiled down to a simple yes or no: Is something going on, or not? Does a new drug make a dent in the illness it proposes to cure, or does it do nothing? Does a psychological intervention make you happier/peppier/sexier or does it do nothing at all? The “does nothing” scenario is called the null hypothesis. That is, the null hypothesis is the hypothesis that the intervention you’re studying has no effect. If you’re the researcher who developed the new drug, the null hypothesis is the thing that keeps you up at night. Unless you can rule it out, you don’t know whether you’re on the trail of a medical breakthrough or just barking up the wrong metabolic pathway.

  So how do you rule it out? The standard framework, called the null hypothesis significance test, was developed in its most commonly used form by R. A. Fisher, the founder of the modern practice of statistics,* in the early twentieth century.

  It goes like this. First, you have to run an experiment. You might start with a hundred subjects, then randomly select half to receive your proposed wonder drug while the other half gets a placebo. Your hope, obviously, is that the patients on the drug will be less likely to die than the ones getting the sugar pill.

  From here, the protocol might seem simple: if you observe fewer deaths among the drug patients than the placebo patients, declare victory and file a marketing application with the FDA. But that’s wrong. It’s not enough that the data be consistent with your theory; they have to be inconsistent with the negation of your theory, the dreaded null hypothesis. I may assert that I possess telekinetic abilities so powerful that I can drag the sun out from beneath the horizon—if you want proof, just go outside at about five in the morning and see the results of my work! But this kind of evidence is no evidence at all, because, under the null hypothesis that I lack psychic gifts, the sun would come up just the same.

  Interpreting the result of a clinical trial requires similar care. Let’s make this numerical. Suppose we’re in null hypothesis land, where the chance of death is exactly the same (say, 10%) for the fifty patients who got your drug and the fifty who got the placebo. But that doesn’t mean that five of the drug patients die and five of the placebo patients die. In fact, the chance that exactly five of the drug patients die is about 18.5%; not very likely, just as it’s not very likely that a long series of coin tosses would yield precisely as many heads as tails. In the same way, it’s not very likely that exactly the same number of drug patients and placebo patients expire during the course of the trial. I computed:

  13.3% chance equally many drug and placebo patients die

  43.3% chance fewer placebo patients than drug patients die

  43.3% chance fewer drug patients than placebo patients die.

  Seeing better results among the drug patients than the placebo patients says very little, since this isn’t at all unlikely even under the null hypothesis that your drug doesn’t work.
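  If you want to check those percentages yourself, a short Python sketch reproduces them. It assumes, as in the example, fifty patients per arm and a 10% chance of death under the null hypothesis.

    from scipy.stats import binom

    n, p = 50, 0.10                        # fifty patients per arm, 10% death rate under the null
    pmf = [binom.pmf(k, n, p) for k in range(n + 1)]

    print(binom.pmf(5, n, p))              # chance that exactly five drug patients die: about 18.5%

    p_equal = sum(q * q for q in pmf)      # chance both arms lose the same number of patients
    print(p_equal)                         # about 13.3%

    print((1 - p_equal) / 2)               # by symmetry, each arm does better about 43.3% of the time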

  But things are different if the drug patients do a lot better. Suppose five of the placebo patients die during the trial, but none of the drug patients do. If the null hypothesis is right, both classes of patients should have a 90% chance of survival. But in that case, it’s highly unlikely that all fifty of the drug patients would survive. The first of the drug patients has a 90% chance; now the chance that not only the first but also the second patient survives is 90% of that 90%, or 81%—and if you want the third patient to survive as well, the chance of that happening is only 90% of that 81%, or 72.9%. Each new patient whose survival you stipulate shaves a little off the chances, and by the end of the process, where you’re asking about the probability that all fifty will survive, the slice of probability that remains is pretty slim:

  (0.9) × (0.9) × (0.9) × . . . fifty times! . . . × (0.9) × (0.9) = 0.00515 . . .

  Under the null hypothesis, there’s only one chance in two hundred of getting results this good. That’s much more compelling. If I claim I can make the sun come up with my mind, and it does, you shouldn’t be impressed by my powers; but if I claim I can make the sun not come up, and it doesn’t, then I’ve demonstrated an outcome very unlikely under the null hypothesis, and you’d best take notice.
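  The multiplication is easy to mimic in code. This little loop is nothing more than a toy check: it shaves off 10% of the remaining probability for each stipulated survivor, just as in the text.

    p = 1.0
    for patient in range(50):
        p *= 0.9          # each additional survivor keeps only 90% of what is left
    print(p)              # about 0.00515: roughly one chance in two hundred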

  —

  So here’s the procedure for ruling out the null hypothesis, in executive bullet-point form:

  Run an experiment.

  Suppose the null hypothesis is true, and let p be the probability (under that hypothesis) of getting results as extreme as those observed.

  The number p is called the p-value. If it is very small, rejoice; you get to say your results are statistically significant. If it is large, concede that the null hypothesis has not been ruled out.

  How small is “very small”? There’s no principled way to choose a sharp dividing line between what is significant and what is not; but there’s a tradition, which starts with Fisher himself and is now widely adhered to, of taking p = 0.05, or 1/20, to be the threshold.
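  Here is the whole procedure in a short Monte Carlo sketch that mirrors the chapter’s drug trial. The trial counts, and the choice to measure “results this good” by a death-free drug arm, are mine, made to match the example above; a different choice of what counts as “extreme” would give a different p.

    import random

    random.seed(1)

    def drug_arm_deaths(n_patients=50, death_rate=0.10):
        """One simulated drug arm, assuming the null hypothesis: the drug does nothing."""
        return sum(random.random() < death_rate for _ in range(n_patients))

    # Steps 1 and 2: pretend the null hypothesis is true, and ask how often the results
    # come out as extreme as those observed (here, zero deaths among fifty drug patients).
    trials = 200_000
    as_extreme = sum(drug_arm_deaths() == 0 for _ in range(trials))
    p_value = as_extreme / trials

    # Step 3: compare p to the conventional threshold of 0.05.
    print(p_value)        # hovers around 0.005
    print("statistically significant" if p_value < 0.05 else "null hypothesis not ruled out")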

  Null hypothesis significance testing is popular because it captures our intuitive way of reasoning about uncertainty. Why do we find the Bible codes compelling, at least at first glance? Because codes like the ones Witztum uncovered are very unlikely under the null hypothesis that the Torah doesn’t know the future. The value of p—the likelihood of finding so many equidistant letter sequences, so accurate in their demographic profiling of notable rabbis—is very close to 0.

  Versions of this argument for divine creation predate Fisher’s formal development by a great while. The world is so richly structured and so perfectly ordered—how tremendously unlikely it would be for there to be a world like this one, under the null hypothesis that there’s no primal designer who put the thing together!

  The first person to have a go at making this argument mathematical was John Arbuthnot, royal physician, satirist, correspondent of Alexander Pope, and part-time mathematician. Arbuthnot studied the records of children born in London between 1629 and 1710, and found there a remarkable regularity: in every single one of those eighty-two years, more boys were born than girls. What are the odds, Arbuthnot asked, that such a coincidence could arise, under the null hypothesis that there was no God and all was random chance? Then the probability in any given year that London would welcome more boys than girls would be 1/2; and the p-value, the probability of the boys winning eighty-two times in a row, is

  (1/2) × (1/2) × (1/2) ×. . . 82 times . . . × (1/2)

  or a little worse than 1 in 4 septillion. In other words, more or less zero. Arbuthnot published his findings in a paper called “An Argument for Divine Providence, Taken from the Constant Regularity Observed in the Births of Both Sexes.”
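  You can verify Arbuthnot’s number in a couple of lines of Python, using exact fractions so nothing gets rounded away.

    from fractions import Fraction

    p = Fraction(1, 2) ** 82
    print(p)              # 1/4835703278458516698824704
    print(float(p))       # about 2.1e-25: "a little worse than 1 in 4 septillion"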

  Arbuthnot’s argument was widely praised and repeated by clerical worthies, but other mathematicians quickly pointed to flaws in his reasoning. Chief among them was the unreasonable specificity of his null hypothesis. Arbuthnot’s data certainly puts the boot to the hypothesis that the sex of children is determined at random, with each child having an equal chance of being born male or female. But why should the chance be equal? Nicholas Bernoulli proposed a different null hypothesis: that the sex of a child is determined by chance, with an 18/35 chance of being a boy and 17/35 of being a girl. Bernoulli’s null hypothesis is just as atheistic as Arbuthnot’s, and it fits the data perfectly. If you flip a coin 82 times and get 82 heads, you ought to be thinking, “Something is biased about this coin,” not “God loves heads.”*
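  To see how thoroughly Bernoulli’s null hypothesis defuses the streak, try the sketch below. The figure of 12,000 births a year is my own placeholder, not Arbuthnot’s data, but the qualitative point does not depend on it: with that many births, a small tilt toward boys makes a boy-heavy year nearly certain.

    from scipy.stats import binom

    births_per_year = 12_000              # hypothetical annual birth count, for illustration only
    years = 82

    for p_boy in (1/2, 18/35):
        # probability that boys outnumber girls in a single year...
        p_more_boys = binom.sf(births_per_year // 2, births_per_year, p_boy)
        # ...and that they do so every year for 82 years running
        print(p_boy, p_more_boys, p_more_boys ** years)

    # Under p = 1/2 the 82-year streak is astronomically unlikely;
    # under Bernoulli's 18/35 it comes out close to a sure thing.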

  Though Arbuthnot’s argument wasn’t widely accepted, its spirit carried on. Arbuthnot is intellectual father not only to the Bible coders but to the “creation scientists,” who argue, even today, that mathematics demands there must be a god, on the grounds that a godless world would be highly unlikely to look like the one we have.*

  But significance testing is not restricted to theological apologetics. In some sense, Darwin, the creation scientists’ shaggy godless devil, made arguments of substantially the same form on behalf of his own work:

  It can hardly be supposed that a false theory would explain, in so satisfactory a manner as does the theory of natural selection, the several large classes of facts above specified. It has recently been objected that this is an unsafe method of arguing; but it is a method used in judging of the common events of life, and has often been used by the greatest natural philosophers.

  In other words: if natural selection were false, think how unlikely it would be to encounter a biological world so thoroughly consistent with its predictions!

  The contribution of R. A. Fisher was to make significance testing into a formal endeavor, a system by which the significance, or not, of an experimental result was a matter of objective fact. In the Fisherian form, the null hypothesis significance test has been a standard method for assessing the results of scientific research for nearly a century. A standard textbook calls the method “the backbone of psychological research.” It’s the standard by which we separate experiments into successes and failures. Every time you encounter the results of a medical, psychological, or economic research study, you’re very likely reading about something that was vetted by a significance test.

  But the unease Darwin noted about this “unsafe method of arguing” has never really receded. For almost as long as the method has been standard, there have been people who branded it a colossal mistake. Back in 1966, the psychologist David Bakan wrote about the “crisis of psychology,” which in his view was a “crisis in statistical theory”:

  The test of significance does not provide the information concerning psychological phenomena characteristically attributed to it . . . a great deal of mischief has been associated with its use. . . . To say it “out loud” is, as it were, to assume the role of the child who pointed out that the emperor was really outfitted only in his underwear.

  And here we stand, almost fifty years later, with the emperor still in office and still cavorting in the same birthday suit, despite the ever larger and more clamorous group of children broadcasting the news about his state of undress.

  THE INSIGNIFICANCE OF SIGNIFICANCE

  What’s wrong with significance? To start with, there’s the word itself. Mathematics has a funny relationship with the English language. Mathematical research articles, sometimes to the surprise of outsiders, are not predominantly composed of numerals and symbols; math is made of words. But the objects we refer to are often entities uncontemplated by the editors at Merriam-Webster. New things require new vocabulary. There are two ways to go. You can cut new words from fresh cloth, as we do when we speak of cohomology, syzygies, monodromy, and so on; this has the effect of making our work look forbidding and unapproachable. More commonly, we adapt existing words for our own purposes, based on some perceived resemblance between the mathematical object to be described and a thing in the so-called real world. So a “group,” to a mathematician, is indeed a group of things, but a very special kind of group, like the group of whole numbers or the group of symmetries of a geometric figure; we mean by it not just an arbitrary collection of things, like OPEC or ABBA, but rather a collection of things with the property that any pair of them can be combined into a third, as a pair of numbers can be added, or a pair of symmetries can be carried out one after the other.* So too for schemes, bundles, rings, and stacks, mathematical objects which stand in only the most tenuous relation to the ordinary things referred to by those words. Sometimes the language we choose has a pastoral flavor: modern algebraic geometry, for instance, is largely concerned with fields, sheaves, kernels, and stalks. Other times it’s more aggressive—it is not at all unusual to speak of an operator killing something, or, for a little more va-voom, annihilating it. I once had an uneasy moment with a colleague in an airport when he made the remark, unexceptional in a mathematical context, that it might be necessary to blow up the plane at one point.

  So: significance. In common language it means something like “important” or “meaningful.” But the significance test that scientists use doesn’t measure importance. When we’re testing the effect of a new drug, the null hypothesis is that there is no effect at all; so to reject the null hypothesis is merely to make a judgment that the effect of the drug is not zero. But the effect could still be very small—so small that the drug isn’t effective in any sense that an ordinary non-mathematical Anglophone would call significant.
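  A concrete, made-up example makes the gap between the two senses of “significant” plain: imagine a truly enormous trial in which the drug nudges the death rate from 10.5% down to 10.0%. The counts below are invented for illustration, and the calculation is an ordinary two-proportion z-test.

    from math import sqrt
    from scipy.stats import norm

    deaths_placebo, n_placebo = 5_250, 50_000    # hypothetical: 10.5% death rate on placebo
    deaths_drug, n_drug = 5_000, 50_000          # hypothetical: 10.0% death rate on the drug

    p1, p2 = deaths_placebo / n_placebo, deaths_drug / n_drug
    pooled = (deaths_placebo + deaths_drug) / (n_placebo + n_drug)
    se = sqrt(pooled * (1 - pooled) * (1 / n_placebo + 1 / n_drug))

    z = (p1 - p2) / se
    p_value = 2 * norm.sf(z)                     # two-sided p-value

    print(z, p_value)   # z is about 2.6 and p is about 0.009: "statistically significant,"
                        # even though the drug shaves only half a percentage point off the risk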

  The lexical double booking of “significance” has consequences beyond making scientific papers hard to read. On October 18, 1995, the UK Committee on Safety of Medicines (CSM) issued a “Dear Doctor” letter to nearly 200,000 doctors and public health workers around Great Britain, with an alarming warning about certain brands of “third-generation” oral contraceptives. “New evidence has become available,” the letter read, “indicating that the chance of a thrombosis occurring in a vein is increased around two-fold for some types of pill compared with others.” A venous thrombosis is no joke; it means a clot is impeding the flow of the blood through the vein. If the clot breaks free, the bloodstream can carry it all the way to your lung, where, under its new identity as a pulmonary embolism, it can kill you.

  The Dear Doctor letter was quick to assure readers that oral contraception was safe for most women, and no one should stop taking the pill without medical advice. But details like that are easy to lose when the top-line message is “Pills kill.” The AP story that ran October 19 led with “The government warned Thursday that a new type of birth control pill used by 1.5 million British women may cause blood clots. . . . It considered withdrawing the pills but decided not to, partly because some women cannot tolerate any other kind of pills.”

 
