The public, understandably, freaked out. One general practitioner found that 12% of pill users among her patients stopped taking their contraceptives as soon as they heard the government report. Presumably, many women switched to other versions of the pill not implicated in thrombosis, but any interruption makes the pill less effective. And less-effective birth control means more pregnancies. (What—you thought I was going to say there was a wave of abstinence?) After several successive years of decline, the conception rate in the United Kingdom jumped several percentage points the next year. There were 26,000 more babies conceived in 1996 in England and Wales than there had been one year previously. Since so many of the extra pregnancies were unplanned, that led to a lot more terminations, too: 13,600 more abortions than in 1995.
This might seem a small price to pay to avoid a blood clot careening through your circulatory system, wreaking potentially lethal havoc. Think about all the women who were spared from death by embolism by the CSM’s warning!
But how many women, exactly, is that? We can’t know for sure. But one scientist, a supporter of the CSM decision to issue the warning, said the total number of embolism deaths prevented was “possibly one.” The added risk posed by third-generation birth control pills, while significant in Fisher’s statistical sense, was not so significant in the sense of public health.
The way the story was framed only magnified the confusion. The CSM reported a risk ratio: third-generation pills doubled women’s risk of thrombosis. That sounds pretty bad, until you remember that thrombosis is really, really rare. Among women of childbearing age using first- and second-generation oral contraceptives, 1 in 7,000 could expect to suffer a thrombosis; users of the new pill indeed had twice as much risk, 2 in 7,000. But that’s still a very small risk, because of this certified math fact: twice a tiny number is a tiny number. How good or bad it is to double something depends on how big that something is! Playing ZYMURGY on a double word score on the Scrabble board is a triumph; hitting the same square with NOSE is a waste of a move.
Risk ratios are much easier for the brain to grasp than tiny splinters of probability like 1 in 7,000. But risk ratios applied to small probabilities can easily mislead you. A study by sociologists at CUNY found that infants cared for at in-home day cares or by nannies had a fatality rate seven times that of kids in day-care centers. But before you fire your au pair, consider for a minute that American infants hardly ever die these days, and when they do it’s almost never because a caregiver shook them to death. The annual rate of fatal accidents in home-based care was 1.6 per 100,000 babies: a lot higher, indeed, than the rate of 0.23 per 100,000 in day-care centers.* But both numbers are more or less zero. In the CUNY study, only a dozen or so babies a year died in accidents in family day cares, a tiny fraction of the 1,110 U.S. infants who died in accidents overall in 2010 (mostly by strangulation in bedclothes) or the 2,063 who died of sudden infant death syndrome. All things being equal, the results of the CUNY study provide a reason to prefer a day-care center to care in a family home; but all other things are usually not equal, and some inequalities matter more than others. What if the scrubbed and city-certified day-care center is twice as far from your house as the slightly questionable family-run in-home day care? Car accidents killed 79 infants in the U.S. in 2010; if your baby ends up spending 20% more time on the road per year thanks to the longer commute, you may have wiped out whatever safety advantage you gained by choosing the fancier day care.
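It’s worth doing the arithmetic explicitly. In both stories the relative risk sounds alarming while the absolute difference stays tiny; a few lines of Python (my illustration, using the figures quoted above) make the contrast plain:

```python
# Relative vs. absolute risk, using the figures quoted in the text.

# The pill scare: annual thrombosis risk per woman.
old_pill = 1 / 7000    # first- and second-generation pills
new_pill = 2 / 7000    # third-generation pills
print(f"relative risk: {new_pill / old_pill:.1f}x")
print(f"extra cases per 100,000 women: {(new_pill - old_pill) * 100_000:.1f}")

# The day-care comparison: fatal accidents per 100,000 infants per year.
home_care, center_care = 1.6, 0.23
print(f"relative risk: {home_care / center_care:.1f}x")
print(f"extra deaths per 100,000 infants: {home_care - center_care:.2f}")
```

A doubled risk that adds about 14 cases per 100,000 women, and a sevenfold risk that adds about 1.4 deaths per 100,000 infants: the ratios are big, the differences are small.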
A significance test is a scientific instrument, and like any other instrument, it has a certain degree of precision. If you make the test more sensitive—by increasing the size of the studied population, for example—you enable yourself to see ever-smaller effects. That’s the power of the method, but also its danger. The truth is, the null hypothesis, if we take it literally, is probably just about always false. When you drop a powerful drug into a patient’s bloodstream, it’s hard to believe the intervention has exactly zero effect on the probability that the patient will develop esophageal cancer, or thrombosis, or bad breath. Every part of the body speaks to every other, in a complex feedback loop of influence and control. Everything you do either gives you cancer or prevents it. In principle, if you carry out a powerful enough study, you can find out which it is. But those effects are usually so minuscule that they can be safely ignored. Just because we can detect them doesn’t always mean they matter.
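To make the sensitivity point concrete, here is a small sketch (mine, not the book’s) using a standard two-proportion z-test: suppose a drug nudges some risk from 1.00% up to 1.05%, a difference nobody should lose sleep over. Feed the test ever-larger samples and watch the p-value collapse:

```python
import math

def two_prop_p(p1, n1, p2, n2):
    """Two-sided p-value for a two-proportion z-test (normal approximation)."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))   # = 2 * (1 - Phi(|z|))

# A real but trivial effect: risk 1.00% untreated, 1.05% treated.
# To isolate the role of sample size, pretend the observed rates
# equal the true rates exactly.
for n in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"n = {n:>10,} per arm: p = {two_prop_p(0.0105, n, 0.0100, n):.4f}")
```

At ten thousand patients per arm the effect is invisible (p around 0.7); at ten million it is overwhelmingly “significant,” even though nothing about the effect itself has changed.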
If only we could go back in time to the dawn of statistical nomenclature and declare that a result passing Fisher’s test with a p-value of less than 0.05 was “statistically noticeable” or “statistically detectable” instead of “statistically significant”! That would be truer to the meaning of the method, which merely counsels us about the existence of an effect but is silent about its size or importance. But it’s too late for that. We have the language we have.*
THE MYTH OF THE MYTH OF THE HOT HAND
We know B. F. Skinner as a psychologist, in many ways the modern psychologist, the man who stared down the Freudians and led a competing psychology, behaviorism, concerned only with what was visible and what could be measured, requiring no hypotheses about unconscious or, for that matter, conscious motivations. For Skinner, a theory of mind just was a theory of behavior, and the interesting projects for psychologists thus did not concern thoughts or feelings at all, but rather the manipulation of behavior by means of reinforcement.
Less well known is Skinner’s history as a frustrated novelist. Skinner was an English major at Hamilton College and spent much of his time with Percy Saunders, a chemistry professor and aesthete whose house was a kind of literary salon. Skinner read Ezra Pound, and listened to Schubert, and wrote adolescently heated poems (“At night, he stops, breathless / Murmuring to his earthly consort / ‘Love exhausts me!’”) for the college literary magazine. He did not take a single psychology course. After college, Skinner attended the Bread Loaf writers’ conference, where he wrote “a one-act play about a quack who changed people’s personalities with endocrines” and succeeded in pressing several of his short stories on Robert Frost. Frost wrote Skinner a very satisfactory letter praising his stories and counseling: “All that makes a writer is the ability to write strongly and directly from some unaccountable and almost invincible personal prejudice. . . . I take it that everybody has the prejudice and spends some time feeling for it to speak and write from. But most people end as they begin by acting out the prejudices of other people.”
Thus encouraged, Skinner moved into his parents’ attic in Scranton in the summer of 1926 and set out to write. But Skinner found it was not so easy to find his own personal prejudice, or, having found it, to put it in literary form. His time in Scranton came to nothing; he managed a couple of stories and a sonnet about labor leader John Mitchell, but spent his time mostly building model ships and tuning in to distant signals from Pittsburgh and New York on the radio, then a brand-new procrastination device.
“A violent reaction against all things literary was setting in,” he later wrote of this period. “I had failed as a writer because I had nothing important to say, but I could not accept that explanation. It was literature which must be at fault.” Or, more bluntly: “Literature must be demolished.”
Skinner was a regular reader of the literary magazine The Dial; in its pages, he encountered the philosophical writings of Bertrand Russell, and via Russell was brought to John Watson, the first great advocate of the behaviorist outlook that would soon become almost synonymous with Skinner’s name. Watson held that scientists were in the business of observing the results of experiments, and only that; there was no room for hypotheses about consciousness or souls. “No one has ever touched a soul or seen one in a test-tube,” he famously wrote, by way of dismissing the notion. These uncompromising words must have thrilled Skinner, as he moved to Harvard as a graduate student in psychology, making ready to banish the vague, unruly self from the scientific study of behavior.
Skinner had been much struck by an episode of spontaneous verbal production he’d experienced in his lab; a machine in the background was making a repetitive, rhythmic sound, and Skinner found himself talking along with it, following the beat, silently repeating the phrase “You’ll never get out, you’ll never get out, you’ll never get out.” What seemed like speech, or even, in a small way, like poetry, was actually the result of a kind of autonomous verbal process, requiring nothing like a conscious author.* This provided just the idea Skinner needed to settle his score with literature. What if language, even the language of the great poets, was just another behavior, trained by exposure to stimuli, and manipulable in the lab?
In college, Skinner had written imitations of Shakespeare’s sonnets; he retrospectively described this experience, in thoroughly behaviorist fashion, as “the strange excitement of emitting whole lines ready-made, properly scanned and rhymed.” Now, as a young psychology professor in Minnesota, he recast Shakespeare himself as more emitter than writer. This approach was not as crazy then as it seems now; the dominant form of literary criticism at the time, “close reading,” bore the mark of Watson’s philosophy just as Skinner did, displaying a very behaviorist preference for the words on the page over the unobservable intentions of the author.
Shakespeare is famous as a master of the alliterative line, in which several words in close succession start with the same sound (“Full fathom five thy father lies . . .”). For Skinner, this argument by example was no kind of science. Did Shakespeare alliterate? If he did, then math could prove it so. “Proof that there is a process responsible for alliterative patterning,” he wrote, “can be obtained only through a statistical analysis of all the arrangements of initial consonants in a reasonably large sample.” And what form of statistical analysis? None other than a form of Fisher’s p-value test. Here, the null hypothesis is that Shakespeare paid no heed to the initial sounds of words at all, so that the first letter of one word of poetry has no effect on other words in the same line. The protocol was much like that of a clinical trial, but with one big difference: the biomedical researcher testing a drug hopes with all his heart to see the null hypothesis refuted, and the effectiveness of the medicine demonstrated. For Skinner, aiming to knock literary criticism off its plinth, the null hypothesis was the attractive one.
Under the null hypothesis, the frequency with which initial sounds appeared multiple times in the same line would be unchanged if the words were put in a sack, shaken up, and laid out again in random order. And this is just what Skinner found in his sample of a hundred sonnets. Shakespeare failed the significance test. Skinner writes:
“In spite of the seeming richness of alliteration in the sonnets, there is no significant evidence of a process of alliteration in the behavior of the poet to which any serious attention should be given. So far as this aspect of poetry is concerned, Shakespeare might as well have drawn his words out of a hat.”
“Seeming richness”—what chutzpah! It captures perfectly the spirit of the psychology that Skinner wanted to create. Where Freud had claimed to see what had previously been hidden, repressed, or obscured, Skinner wanted to do the opposite—to deny the existence of what seemed in plain view.
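Skinner’s protocol translates naturally into what we’d now call a permutation test. The sketch below is a reconstruction, not Skinner’s actual tabulation, and it crudely uses first letters as a stand-in for initial sounds: score each line by its surplus of repeated initials, then ask how often randomly reshuffled words do as well as the poet did.

```python
import random
from collections import Counter

def line_score(line):
    """Surplus of repeated initial letters in one line: an initial
    letter appearing k times contributes k - 1."""
    initials = [w[0].lower() for w in line.split() if w[0].isalpha()]
    return sum(k - 1 for k in Counter(initials).values())

def alliteration_stat(lines):
    return sum(line_score(line) for line in lines)

def shuffle_test(lines, trials=10_000, seed=0):
    """Fraction of random reshufflings of the same words (line lengths
    preserved) that alliterate at least as heavily as the real text."""
    rng = random.Random(seed)
    observed = alliteration_stat(lines)
    words = [w for line in lines for w in line.split()]
    lengths = [len(line.split()) for line in lines]
    hits = 0
    for _ in range(trials):
        rng.shuffle(words)
        reassembled, i = [], 0
        for n in lengths:
            reassembled.append(" ".join(words[i:i + n]))
            i += n
        if alliteration_stat(reassembled) >= observed:
            hits += 1
    return hits / trials
```

On “Full fathom five thy father lies,” line_score returns 3; Skinner’s question was whether totals like that, summed over a hundred sonnets, beat what the word-filled sack would produce. His answer: they didn’t.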
But Skinner was wrong; he hadn’t proved that Shakespeare didn’t alliterate. A significance test is an instrument, like a telescope. And some instruments are more powerful than others. If you look at Mars with a research-grade telescope, you’ll see moons; if you look with binoculars, you won’t. But the moons are still there! And Shakespeare’s alliteration is still there. As documented by literary historians, it was a standard device of the time, known to and consciously deployed by nearly everyone writing in English.
What Skinner had proved is that Shakespeare’s alliteration did not produce a surplus of repeated sounds so great as to show up on his test. But why would it? The use of alliteration in poetry is both positive and negative; in certain places you alliterate to create an effect, and in other places you intentionally avoid it, lest you create an effect you don’t want. It may be that the overall tendency is to increase the number of alliterative lines, but even if so, the increase should be small. Stuff your sonnets with one or two extra alliterations each and you become one of the stone-footed poets mocked by Shakespeare’s fellow Elizabethan George Gascoigne: “Many writers indulge in repeticion of sundrie wordes all beginning with one letter, the whiche (beyng modestly used) lendeth good grace to a verse; but they do so hunt a letter to death, that they make it Crambe, and Crambe bis positum mors est.”
The Latin phrase means “Cabbage served twice is death.” Shakespeare’s writing is rich in effect, but always restrained. He would never pack in so much cabbage that Skinner’s crude test could smell it.
A statistical study that’s not refined enough to detect a phenomenon of the expected size is called underpowered—the equivalent of looking at the planets with binoculars. Moons or no moons, you get the same result, so you might as well not have bothered. You don’t send binoculars to do a telescope’s job. The problem of low power is the flip side to the problem of the British birth control scare. A high-powered study, like the ones behind the birth control scare, may lead you to burst a vein about a small effect that isn’t actually important. An underpowered one may lead you to wrongly dismiss a small effect that your method was simply too weak to see.
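Here is what underpowered looks like in numbers, in a simulation of my own devising: a coin that truly comes up heads 52% of the time, tested for fairness with the usual normal approximation. The test’s power, its chance of detecting the bias at the 0.05 level, depends dramatically on how many flips you give it:

```python
import math
import random

def p_value(heads, n):
    """Two-sided p-value for testing whether a coin is fair
    (normal approximation to the binomial)."""
    z = (heads - n / 2) / math.sqrt(n / 4)
    return math.erfc(abs(z) / math.sqrt(2))

def power(true_p, n, trials=2_000, alpha=0.05, seed=1):
    """Fraction of simulated experiments that detect the bias."""
    rng = random.Random(seed)
    detected = 0
    for _ in range(trials):
        heads = sum(rng.random() < true_p for _ in range(n))
        if p_value(heads, n) < alpha:
            detected += 1
    return detected / trials

for n in (100, 1_000, 10_000):
    print(f"{n:>6} flips: power ~ {power(0.52, n):.2f}")
```

With a hundred flips the bias goes undetected more than nine times out of ten; with ten thousand it is caught almost always. A hundred sonnets, for an effect as restrained as Shakespeare’s, may simply have been the hundred-flip version of the experiment.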
Consider Spike Albrecht. The freshman guard for Michigan’s men’s basketball team, standing at just five foot eleven and a bench player most of the season, wasn’t expected to play a big role when the Wolverines faced Louisville in the 2013 NCAA final. But Albrecht made five straight shots, four of them three-pointers, in a ten-minute span in the first half, leading Michigan to a ten-point lead over the heavily favored Cardinals. He had what basketball fans call “the hot hand”—the apparent inability to miss a shot, no matter how great the distance or how fierce the defense.
Except there’s supposed to be no such thing. In 1985, in one of the most famous contemporary papers in cognitive psychology, Thomas Gilovich, Robert Vallone, and Amos Tversky (hereafter GVT) did to basketball fans what B. F. Skinner had done to lovers of the Bard. They obtained records of every shot taken by the 1980−81 Philadelphia 76ers in their forty-eight home games and analyzed them statistically. If players tended toward hot streaks and cold streaks, you might expect a player to be more likely to hit a shot following a basket than a shot following a miss. And when GVT surveyed NBA fans, they found this theory had broad support; nine out of ten fans agreed that a player is more likely to sink a shot when he’s just hit two or three baskets in a row.
But nothing of the kind was going on in Philadelphia. Julius Erving, the great Dr. J, was a 52% shooter overall. After three straight baskets, a situation that you’d think might indicate Erving was hot, his percentage went down to 48%. And after three straight misses, his field goal percentage didn’t drop, but rather stayed right at 52%. For other players, like Darryl “Chocolate Thunder” Dawkins, the effect was even more extreme. After a hit, his overall 62% shooting percentage dipped to 57%; after a miss, it shot up to 73%, exactly the opposite of the fan predictions. (One possible explanation: a missed shot suggests Dawkins was facing effective defenders on the perimeter, inducing him to drive to the basket for one of his trademark backboard-shattering dunks, which he gave names like “In Your Face Disgrace” and “Turbo Sexophonic Delight.”)
Does this mean there’s no such thing as the hot hand? Not just yet. The hot hand, after all, isn’t generally thought of as a universal tendency for hits to follow hits and misses to follow misses. It’s an evanescent thing, a brief possession by a superior basketball being that inhabits a player’s body for a short glorious interval on the court, giving no warning of its arrival or departure. Spike Albrecht is Ray Allen for ten minutes, mercilessly raining down threes, then he’s Spike Albrecht again. Can a statistical test see this? In principle, why not? GVT devised a clever way to check for these short intervals of unstoppability. They broke up each player’s season into sequences of four shots each; so if Dr. J’s sequence of hits and misses looked like
HMHHHMHMMHHHHMMH
the sequences would be
HMHH, HMHM, MHHH, HMMH . . .
GVT then counted how many of the sequences were “good” (3 or 4 hits), “moderate” (2 hits), or “bad” (0 or 1 hits) for each of the nine players in the study. And then, good Fisherians, they compared those counts with what would be expected under the null hypothesis—that there’s no such thing as the hot hand.
There are sixteen possible sequences of four shots: the first shot can be either H or M, and for each of these options there are two possibilities for the second shot, giving us four options in all for the first two shots (here they are: HH, HM, MH, MM); for each of these four there are two possibilities for the third shot, giving eight possible three-shot sequences; and doubling once more to account for the last shot in the sequence, we get sixteen. Here they all are, divided into the good ones, the moderate ones, and the bad ones:
Good: HHHH, MHHH, HMHH, HHMH, HHHM
Moderate: HHMM, HMHM, HMMH, MHHM, MHMH, MMHH
Bad: HMMM, MHMM, MMHM, MMMH, MMMM
For a 50% shooter like Dr. J, all 16 possible sequences should then be equally likely, because each shot is equally likely to be an H or an M. So you’d expect about 5/16, or 31.25%, of Dr. J’s four-shot sequences to be good, with 37.5% moderate and 31.25% bad.
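The same bookkeeping can be checked mechanically. A few lines of Python (a verification of the count above, nothing more) enumerate all sixteen sequences and classify them:

```python
from itertools import product

counts = {"good": 0, "moderate": 0, "bad": 0}
for seq in product("HM", repeat=4):        # all 16 four-shot sequences
    hits = seq.count("H")
    label = "good" if hits >= 3 else "moderate" if hits == 2 else "bad"
    counts[label] += 1

for label, k in counts.items():
    print(f"{label:>8}: {k}/16 = {k / 16:.2%}")   # 31.25%, 37.50%, 31.25%
```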
But if Dr. J sometimes experienced the hot hand, you might expect a higher proportion of good sequences, contributed by those games where he just can’t seem to miss. The more prone to hot and cold streaks you are, the more you’re going to see HHHH and MMMM, and the less you’re going to see HMHM.
The significance test asks us to address the following question: if the null hypothesis were correct and there were no hot hand, would we be unlikely to see the results that were actually observed? And the answer turns out to be no. The proportion of good, bad, and moderate sequences in the actual data is just about what chance would predict, with any deviation falling well short of statistical significance.
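To see what the test would have caught if there had been something to catch, here is a simulation of my own (not GVT’s analysis): an independent 50% shooter next to a deliberately streaky one whose hit probability, by assumption, rises to 60% after a hit and falls to 40% after a miss.

```python
import random

def shots(n, streaky=False, seed=2):
    """Simulate n shots; a streaky shooter's odds depend on the last shot."""
    rng = random.Random(seed)
    seq, prev = [], None
    for _ in range(n):
        if streaky and prev is not None:
            p_hit = 0.6 if prev == "H" else 0.4   # assumed streakiness
        else:
            p_hit = 0.5
        prev = "H" if rng.random() < p_hit else "M"
        seq.append(prev)
    return seq

def window_mix(seq):
    """Proportions of good / moderate / bad non-overlapping 4-shot windows,
    chunked the same way GVT chunked each season."""
    windows = [seq[i:i + 4] for i in range(0, len(seq) - 3, 4)]
    good = sum(w.count("H") >= 3 for w in windows)
    mod = sum(w.count("H") == 2 for w in windows)
    n = len(windows)
    return good / n, mod / n, (n - good - mod) / n

for label, s in [("independent", shots(100_000)),
                 ("streaky", shots(100_000, streaky=True))]:
    g, m, b = window_mix(s)
    print(f"{label:>11}: good {g:.3f}, moderate {m:.3f}, bad {b:.3f}")
```

The streaky shooter’s mix tilts visibly toward the extremes, with more good windows and more bad ones and fewer moderate ones. The 76ers’ actual mix looked like the independent shooter’s: chance, as far as this test could see.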