Their scientific differences are perhaps most starkly displayed in Neyman and Pearson’s approach to the problem of inference.* How to determine the truth from the evidence? Their startling response is to unask the question. For Neyman and Pearson, the purpose of statistics isn’t to tell us what to believe, but to tell us what to do. Statistics is about making decisions, not answering questions. A significance test is no more or less than a rule, which tells the people in charge whether to approve a drug, undertake a proposed economic reform, or tart up a website.
It sounds crazy at first to deny that the goal of science is to find out what’s true, but the Neyman-Pearson philosophy is not so far from reasoning we use in other spheres. What’s the purpose of a criminal trial? We might naively say it’s to find out whether the defendant actually committed the crime he’s on trial for. But that’s obviously wrong. There are rules of evidence, which forbid the jury from hearing testimony obtained improperly, even if it might help them accurately determine the defendant’s innocence or guilt. The purpose of a court is not truth, but justice. We have rules, the rules must be obeyed, and when we say that a defendant is “guilty” we mean, if we are careful about our words, not that he committed the crime he’s accused of, but that he was convicted fair and square according to those rules. Whatever rules we choose, we’re going to let some criminals go free and imprison some of the blameless. The less you do of the first, the more you’re likely to do of the second. So we try to design the rules in whatever way society thinks best handles that fundamental trade-off.
For Neyman and Pearson, science is like the court. When a drug fails a significance test, we don’t say, “We are quite certain the drug didn’t work,” but merely “The drug wasn’t shown to work.” And then we dismiss it, just as we would a defendant whose presence at the crime scene couldn’t be established beyond a reasonable doubt, even if every man and woman in the courthouse thinks he’s guilty as sin.
Fisher wanted none of this—for him, Neyman and Pearson stunk of pure mathematics, insisting on an austere rationalism at the expense of anything resembling scientific practice. Most judges wouldn’t have the stomach to let an obviously innocent defendant meet the hangman, even when the rules in the book require it. And most practicing scientists have no interest in following a rigid sequence of instructions, denying themselves the self-polluting satisfaction of forming an opinion about which hypotheses are actually true. In a 1951 letter to W. E. Hick, Fisher wrote:
I am a little sorry that you have been worrying yourself at all with that unnecessarily portentous approach to tests of significance represented by the Neyman and Pearson critical regions, etc. In fact, I and my pupils throughout the world would never think of using them. If I am asked to give an explicit reason for this I should say they approach the problem entirely from the wrong end, i.e. not from the point of view of a research worker, with a basis of well grounded knowledge on which a very fluctuating population of conjectures and incoherent observations is continually under examination. What he needs is a confident answer to the question “Ought I to take notice of that?” This question can, of course, and for refinement of thought should, be framed as “Is this particular hypothesis overthrown, and if so at what level of significance, by this particular body of observations?” It can be put in this form unequivocally only because the genuine experimenter already has the answers to all the questions that the followers of Neyman and Pearson attempt, I think vainly, to answer by merely mathematical considerations.
But Fisher certainly understood that clearing the significance bar wasn’t the same thing as finding the truth. He envisioned a richer, more iterative approach, writing in 1926: “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”
Not “succeeds once in giving,” but “rarely fails to give.” A statistically significant finding gives you a clue, suggesting a promising place to focus your research energy. The significance test is the detective, not the judge. You know how when you read an article about a breakthrough finding that this thing causes that thing, or that thing prevents the other thing, and at the end there’s always a banal sort of quote from a senior scientist not involved in the study intoning some very minor variant of “The finding is quite interesting, and suggests that more research in this direction is needed”? And how you don’t really even read that part because you think of it as an obligatory warning without content?
Here’s the thing—the reason scientists always say that is because it’s important and it’s true! The provocative and oh-so-statistically-significant finding isn’t the conclusion of the scientific process, but the bare beginning. If a result is novel and important, other scientists in other laboratories ought to test and retest the phenomenon and its variants, trying to figure out whether the result was a one-time fluke or whether it truly meets the Fisherian standard of “rarely fails.” That’s what scientists call replication; if an effect can’t be replicated, despite repeated trials, science backs apologetically away. The replication process is supposed to be science’s immune system, swarming over newly introduced objects and killing the ones that don’t belong.
That’s the ideal, at any rate. In practice, science is a bit immunosuppressed. Some experiments, of course, are hard to repeat. If your study measures a four-year-old’s ability to delay gratification and then relates these measurements to life outcomes thirty years later, you can’t just pop out a replication.
But even studies that could be replicated often aren’t. Every journal wants to publish a breakthrough finding, but who wants to publish the paper that does the same experiment a year later and gets the same result? Even worse, what happens to papers that carry out the same experiment and don’t find a significant result? For the system to work, those experiments need to be made public. Too often, they end up in the file drawer instead.
But the culture is changing. Reformers with loud voices like Ioannidis and Simonsohn, who speak both to the scientific community and to the broader public, have generated a new sense of urgency about the danger of descent into large-scale haruspicy. In 2013, the Association for Psychological Science announced that they would start publishing a new genre of article, called Registered Replication Reports. These reports, aimed at reproducing the effects reported in widely cited studies, are treated differently from usual papers in a crucial way: the proposed experiment is accepted for publication before the study is carried out. If the outcomes support the initial finding, great news, but if not, they’re published anyway, so the whole community can know the full state of the evidence. Another consortium, the Many Labs project, revisits high-profile findings in psychology and attempts to replicate them in large multinational samples. In November 2013, psychologists were cheered when the first suite of Many Labs results came back, finding that 10 of the 13 studies addressed were successfully replicated.
In the end, of course, judgments must be made, and lines drawn. What, after all, does Fisher really mean by the “rarely” in “rarely fails”? If we assign an arbitrary numerical threshold (“an effect is real if it reaches statistical significance in more than 90% of experiments”) we may find ourselves in trouble again.
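To get a feel for why, you can play detective with a simulation. The sketch below is nothing Fisher or Neyman proposed; it’s a toy setup with made-up numbers, in which each “experiment” compares a treated group to a control group of normally distributed measurements using a standard two-sample t-test, and we simply count how often the result clears the p < 0.05 bar.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def significance_rate(effect_size, n_per_group, n_experiments=5_000, alpha=0.05):
    """Fraction of simulated experiments whose two-sample t-test comes in below alpha.
    With a real effect, this is what Fisher's 'rarely fails' asks about;
    with no effect at all, it is the false-positive rate."""
    hits = 0
    for _ in range(n_experiments):
        control = rng.normal(0.0, 1.0, n_per_group)
        treated = rng.normal(effect_size, 1.0, n_per_group)
        _, p = stats.ttest_ind(treated, control)
        hits += p < alpha
    return hits / n_experiments

print(significance_rate(effect_size=0.0, n_per_group=50))   # no effect: still ~0.05
print(significance_rate(effect_size=0.3, n_per_group=50))   # real but modest effect: roughly 0.3
print(significance_rate(effect_size=0.3, n_per_group=200))  # same effect, bigger samples: roughly 0.85
```

A nonexistent effect sneaks under the bar about one time in twenty, while a perfectly real but modest effect can fall far short of any 90% criterion simply because the samples are small. Whether a finding “rarely fails” to reach significance depends as much on how the experiments are run as on whether the effect is real.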
Fisher, at any rate, didn’t believe in a hard and fast rule that tells us what to do. He distrusted pure mathematical formalism. In 1956, near the end of his life, he wrote that “in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.”
In the next chapter we will see one way in which “the light of the evidence” might be made more specific.
TEN
ARE YOU THERE, GOD? IT’S ME, BAYESIAN INFERENCE
The age of big data is frightening to a lot of people, and it’s frightening in part because of the implicit promise that algorithms, sufficiently supplied with data, are better at inference than we are. Superhuman powers are scary: beings that can change their shape are scary, beings that rise from the dead are scary, and beings that can make inferences that we cannot are scary. It was scary when a statistical model deployed by the Guest Marketing Analytics team at Target correctly inferred that one of its customers—sorry, guests—a teenaged girl in Minnesota, was pregnant, based on an arcane formula involving elevated rates of buying unscented lotion, mineral supplements, and cotton balls. Target started sending her coupons for baby gear, much to the consternation of her father, who, with his puny human inferential power, was still in the dark. Spooky to contemplate, living in a world where Google and Facebook and your phone, and, geez, even Target, know more about you than your parents do.
But it’s possible we ought to spend less time worrying about eerily superpowered algorithms and more time worrying about crappy ones.
For one thing, crappy might be as good as it gets. Yes, the algorithms that drive the businesses of Silicon Valley get more sophisticated every year, and the data fed to them more voluminous and nutritious. There’s a vision of the future in which Google knows you; where by aggregating millions of micro-observations (“How long did he hesitate before clicking on this. . . . how long did his Google Glass linger on that. . . . ”) the central storehouse can predict your preferences, your desires, your actions, especially vis-à-vis what products you might want, or might be persuaded to want.
It might be that way! But it also might not. There are lots of mathematical problems where supplying more data improves the accuracy of the result in a fairly predictable way. If you want to predict the course of an asteroid, you need to measure its velocity and its position, as well as the gravitational effects of the objects in its astronomical neighborhood. The more measurements you can make of the asteroid and the more precise those measurements are, the better you’re going to do at pinning down its track.
But some problems are more like predicting the weather. That’s another situation where having plenty of fine-grained data, and the computational power to plow through it quickly, can really help. In 1950, it took the early computer ENIAC twenty-four hours to simulate twenty-four hours of weather, and that was an astounding feat of space-age computation. In 2008, the computation was reproduced on a Nokia 6300 mobile phone in less than a second. Forecasts aren’t just faster now; they’re longer-range and more accurate, too. In 2010, a typical five-day forecast was as accurate as a three-day forecast had been in 1986.
It’s tempting to imagine that predictions will just get better and better as our ability to gather data gets more and more powerful; won’t we eventually have the whole atmosphere simulated to a high precision in a server farm somewhere under The Weather Channel’s headquarters? Then, if you wanted to know next month’s weather, you could just let the simulation run a little bit ahead.
It’s not going to be that way. Energy in the atmosphere burbles up very quickly from the tiniest scales to the most global, with the effect that even a minuscule change at one place and time can lead to a vastly different outcome only a few days down the road. Weather is, in the technical sense of the word, chaotic. In fact, it was in the numerical study of weather that Edward Lorenz discovered the mathematical notion of chaos in the first place. He wrote, “One meteorologist remarked that if the theory were correct, one flap of a sea gull’s wing would be enough to alter the course of the weather forever. The controversy has not yet been settled, but the most recent evidence seems to favor the sea gulls.”
There’s a hard limit to how far in advance we can predict the weather, no matter how much data we collect. Lorenz thought it was about two weeks, and so far the concentrated efforts of the world’s meteorologists have given us no cause to doubt that boundary.
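You can watch that sensitivity in action without a supercomputer. The sketch below is not a weather model; it’s a crude Euler integration of the three-variable convection system from Lorenz’s 1963 paper, with his classic parameter choices, run twice from starting points that differ by one part in a hundred million.

```python
import numpy as np

def lorenz_step(state, dt=0.005, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """One (crude) Euler step of the Lorenz system."""
    x, y, z = state
    dx = sigma * (y - x)
    dy = x * (rho - z) - y
    dz = x * y - beta * z
    return state + dt * np.array([dx, dy, dz])

a = np.array([1.0, 1.0, 1.0])
b = a + np.array([1e-8, 0.0, 0.0])  # the sea gull flaps one wing

for step in range(1, 10_001):
    a, b = lorenz_step(a), lorenz_step(b)
    if step % 2_000 == 0:
        print(step, np.linalg.norm(a - b))  # gap between the two runs
```

The gap between the two runs grows roughly exponentially until, within a few dozen simulated time units, the trajectories have nothing to do with each other; a starting difference far smaller than any measurement error has swallowed the forecast.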
Is human behavior more like an asteroid or more like the weather? It surely depends on what aspect of human behavior you’re talking about. In at least one respect, human behavior ought to be even harder to predict than the weather. We have a very good mathematical model for weather, which allows us at least to get better at short-range predictions when given access to more data, even if the inherent chaos of the system inevitably wins out. For human action we have no such model and may never have one. That makes the prediction problem massively harder.
In 2006, the online entertainment company Netflix launched a $1 million competition to see if anyone in the world could write an algorithm that did a better job than Netflix’s own at recommending movies to customers. The finish line didn’t seem very far from the start; the winner would be the first program to do 10% better at recommending movies than Netflix did.
Contestants were given a huge file of anonymized ratings—about a hundred million ratings in all, covering 17,700 movies and almost half a million Netflix users. The challenge was to predict how users would rate movies they hadn’t seen. There’s data—lots of data. And it’s directly relevant to the behavior you’re trying to predict. And yet this problem is really, really hard. It ended up taking three years before anyone crossed the 10% improvement barrier, and it was only done when several teams banded together and hybridized their almost-good-enough algorithms into something just strong enough to collapse across the finish line. Netflix never even used the winning algorithm in its business; by the time the contest was over, Netflix was already transitioning from sending DVDs in the mail to streaming movies online, which makes dud recommendations less of a big deal. And if you’ve ever used Netflix (or Amazon, or Facebook, or any other site that aims to recommend you products based on the data it’s gathered about you), you know that the recommendations remain pretty comically bad. They might get a lot better as even more streams of data get integrated into your profile. But they certainly might not.
Which, from the point of view of the companies doing the gathering, is not so bad. It would be great for Target if they knew with absolute certainty whether or not you were pregnant, just from following the tracks of your loyalty card. They don’t. But it would also be great if they could be 10% more accurate in their guesses about your gravidity than they are now. Same for Google. They don’t have to know exactly what product you want; they just have to have a better idea than competing ad channels do. Businesses generally operate on thin margins. Predicting your behavior 10% more accurately isn’t actually all that spooky for you, but it can mean a lot of money for them. I asked Jim Bennett, the vice president for recommendations at Netflix at the time of the competition, why they’d offered such a big prize. He told me I should have been asking why the prize was so small. A 10% improvement in their recommendations, small as that seems, would recoup the million in less time than it takes to make another Fast and Furious movie.
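To get a feel for what the contestants were up against, here is a toy version of the kind of baseline any serious entry had to beat: predict each missing rating as the overall average, adjusted by how generous the user tends to be and how well liked the movie tends to be. The ratings matrix here is invented, and the real contest scored predictions on ratings held back from the contestants, not on ratings they had already seen.

```python
import numpy as np

# Rows are users, columns are movies; np.nan marks a movie the user hasn't rated.
# These numbers are made up for illustration; the real data set was vastly larger.
R = np.array([
    [5.0, 4.0, np.nan, 1.0],
    [4.0, np.nan, 3.0, 1.0],
    [1.0, 2.0, np.nan, 5.0],
    [np.nan, 2.0, 4.0, 4.0],
])

global_mean = np.nanmean(R)                       # average over all observed ratings
user_bias = np.nanmean(R - global_mean, axis=1)   # is this user a tough or easy grader?
movie_bias = np.nanmean(R - global_mean, axis=0)  # is this movie generally loved or hated?

# Baseline prediction for every user-movie pair, clipped to the 1-to-5-star scale.
pred = np.clip(global_mean + user_bias[:, None] + movie_bias[None, :], 1.0, 5.0)
print(np.round(pred, 2))

# Root-mean-square error on the ratings we can check; the contest used the same
# yardstick, but measured on hidden ratings, and beating Netflix's own system
# by 10% on that score took the field three years.
rmse = np.sqrt(np.nanmean((pred - R) ** 2))
print(round(rmse, 3))
```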
DOES FACEBOOK KNOW YOU’RE A TERRORIST?
So if corporations with access to big data are still pretty limited in what they “know” about you, what’s to worry about?
Try worrying about this. Suppose a team at Facebook decides to develop a method for guessing which of its users are likely to be involved in terrorism against the United States. Mathematically, it’s not so different from the problem of figuring out whether a Netflix user is likely to enjoy Ocean’s Thirteen. Facebook generally knows its users’ real names and locations, so it can use public records to generate a list of Facebook profiles belonging to people who have already been convicted of terroristic crimes or support of terrorist groups. Then the math starts. Do the terrorists tend to make more status updates per day than the general population, or fewer, or on this metric do they look basically the same? Are there words that appear more frequently in their updates? Bands or teams or products they’re unusually prone or disinclined to like? Putting all this stuff together, you can assign to each user a score,* which represents your best estimate for the probability that the user has ties, or will have ties, to terrorist groups. It’s more or less the same thing Target does when they cross-reference your lotion and vitamin purchases to estimate how likely it is that you’re pregnant.
There’s one important difference: pregnancy is very common, while terrorism is very rare. In almost all cases, the estimated probability that a given user would be a terrorist would be very small. So the result of the project wouldn’t be a Minority Report-style precrime center, where Facebook’s panoptic algorithm knows you’re going to do some crime before you do. Think of something much more modest: say, a list of a hundred thousand users about whom Facebook can say, with some degree of confidence, “People drawn from this group are about twice as likely as the typical Facebook user to be terrorists or terrorism supporters.”
What would you do if you found out a guy on your block was on that list? Would you call the FBI?
Before you take that step, draw another box.
The contents of the box are the 200 million or so Facebook users in the United States. The line between the upper and lower halves separates future terrorists, on the top, from the innocent below. Any terrorist cells in the United States are surely pretty small—let’s say, to be as paranoid as possible, that there are ten thousand people who the feds really ought to have their eye on. That’s one in twenty thousand of the total user base.
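Before answering, it helps to do the arithmetic those numbers set up. The snippet below uses only the figures already on the table: two hundred million users, a deliberately paranoid ten thousand future terrorists, and a hypothetical watch list of a hundred thousand people who are twice as likely as the typical user to belong to that group.

```python
users = 200_000_000      # rough count of U.S. Facebook users in this example
terrorists = 10_000      # the deliberately paranoid estimate above
base_rate = terrorists / users
print(base_rate)         # 5e-05: one user in twenty thousand

# The hypothetical list: 100,000 people, each at twice the typical risk.
list_size = 100_000
lift = 2
expected_true = list_size * lift * base_rate
print(expected_true)              # about 10 people on the list are future terrorists
print(list_size - expected_true)  # and roughly 99,990 are not
```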