The Theory That Would Not Die
Stein’s Paradox, however, works for comparisons between related statistics: the egg production of different chicken breeds, the batting averages of various baseball players, or the workers’ compensation exposure of roofing companies. Traditionally, for example, farmers comparing the egg production of five chicken breeds would average the egg yields of each breed separately. But what if a traveling salesman advertised a breed of hens, each of which supposedly laid a million eggs? Because of their prior knowledge of poultry, farmers would laugh him out of town. Bayesians decided that Stein, like the farmers, had weighted his average with a sort of super- or hyperdistribution about chicken-ness, information about egg laying inherent in each breed but never before considered. And intrinsic to poultry farming is the fact that one hen never lays a million eggs.
In like manner, Stein’s system used prior information to explain the hitherto mediocre batter who begins a new season hitting a spectacular .400. Stein’s Paradox tells fans not to forget their previous knowledge of the sport and the batting averages of other players.
When Stein and Willard D. James simplified the method in 1961, they produced another surprise—the same Bayes-based formula that actuaries had used earlier in the century to price workers’ compensation insurance premiums. Whitney’s actuarial Credibility formula used x = P + z(p–P), and Stein and James used ẑ = ȳ + c(y–ȳ): identical formulas but with different symbols and names. In both cases, data about related quantities were being concentrated, made more credible or shrunk, until they clustered more tightly around the mean. With it, actuaries made more accurate predictions about the future well-being of workers in broad industrial categories. Only Arthur Bailey had recognized the formula’s Bayesian roots and realized it would be equally valid for noninsurance situations.
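The shrinkage at work here can be sketched numerically. The function below follows the ẑ = ȳ + c(y – ȳ) form, with the shrinkage factor c chosen as in the James–Stein estimator; the batting averages and the variance value are hypothetical, supplied only for illustration.

```python
import math

def james_stein(y, sigma2):
    """Shrink each estimate toward the grand mean: z = ybar + c * (y - ybar).

    y: list of k related estimates (k >= 4); sigma2: their common variance.
    The James-Stein choice of c shrinks more when the raw estimates
    cluster tightly, and less when they are widely spread.
    """
    k = len(y)
    ybar = sum(y) / k
    spread = sum((v - ybar) ** 2 for v in y)          # spread of raw estimates
    c = max(0.0, 1 - (k - 3) * sigma2 / spread)       # positive-part factor
    return [ybar + c * (v - ybar) for v in y]

# Hypothetical early-season batting averages for five players
averages = [0.400, 0.310, 0.265, 0.250, 0.210]
shrunk = james_stein(averages, sigma2=0.002)
# Each shrunk estimate lies between the raw value and the grand mean,
# pulling the spectacular .400 hitter back toward the pack.
```

Note that the grand mean itself is unchanged by the shrinkage; only the individual estimates move toward it.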
Delighted Bayesians claimed Stein was using the prior context of his numbers to shrink the range of possible answers and make for better predictions. Stein, however, continued to regard Bayes’ philosophical framework as a “negative” and subjective priors as “completely inappropriate.”6
Box, who believed Stein should have admitted his method was Bayesian, immediately thought of other relationships that work the same way. Daily egg yields produced on Monday are related to yields produced on Tuesday, Wednesday, and Thursday. In this case, the link between different items is a time series, and successive observations tend to be correlated with each other just as tomorrow tends to be rather like today. But Box discovered gleefully that analyzing time series with Bayesian methods improved predictions and that without those methods Stein’s Paradox did not work for time series. As Box explained, “If someone comes into your office with some numbers and says, ‘Analyze them,’ it’s reasonable to ask where they come from and how they’re linked. The quality that makes numbers comparable has to be taken into account. You can’t take numbers out of context.”7
Frequentists and Bayesians battled for years over Stein’s Paradox, in part because neither side seemed to be all right or all wrong. Box, however, was a convinced Bayesian, and he wrote a snappy Christmas party song to the tune of “There’s No Business like Show Business.” One verse went:
There’s no Theorem like Bayes theorem
Like no theorem we know
Everything about it is appealing
Everything about it is a wow
Let out all that a priori feeling
You’ve been concealing right up to now.
. . . . There’s no theorem like Bayes theorem
Like no theorem we know.8
As Bayesian interpretations multiplied like bunnies and cropped up in unlikely places like Stein’s Paradox, cracks formed in Fisher’s favorite theory, fiducial probability. He had introduced it as an alternative to Bayes’ rule during an argument with Karl Pearson in 1930. But in 1958 Lindley showed that when uniform priors are used, Fisher’s fiducial probability and Bayesian inference produce identical solutions.
Still another fissure occurred when Allan Birnbaum derived George A. Barnard’s likelihood principle from generally accepted frequentist principles and showed that only observed data, not information that might have arisen from the experiment but had not, needed to be taken into account. Another frequentist complained that Birnbaum was “propos[ing] to push the clock back 45 years, but at least this puts him ahead of the Bayesians, who would like to turn the clock back 150 years.”9 Jimmie Savage, however, praised Birnbaum’s work as a “historic occasion.”10
Savage also damned Fisher’s fiducial method for using parts of Bayes while avoiding the opprobrium attached to priors. Savage considered Fisher’s theory “a bold attempt to make the Bayesian omelet without breaking the Bayesian eggs.”11 Box thought his father-in-law’s fiducial probability was beginning to look like “a sneaky way of doing Bayes.”12
Yet another disagreement between Bayesians and anti-Bayesians surfaced in 1957, when Lindley, elaborating on a point made by Jeffreys, highlighted a theoretical situation where the two approaches produce diametrically opposite results. Lindley’s Paradox occurs when a precise hypothesis is tested with vast amounts of data. In 1987, Princeton University aeronautics engineering professor Robert G. Jahn conducted a large study that he concluded supported the existence of psychokinetic powers. He reported that a random event generator had produced 104,490,000 trials testing the hypothesis that someone on a couch eight feet away cannot influence its results any more than random chance would. Jahn reported that the random event generator produced 18,471 more examples (0.018%) of human influence on his sensitive microelectronic equipment than could be expected with chance alone. Even with a p-value as small as 0.00015, the frequentist would reject the chance-alone hypothesis (and so conclude in favor of psychokinetic powers), while the same evidence convinces a Bayesian that the hypothesis of no psychokinetic influence is almost certainly true.
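The clash can be checked directly from Jahn’s published figures. The sketch below is illustrative, not Lindley’s or Jahn’s exact analysis: the one-sided p-value uses a normal approximation, and the Bayesian side assumes a uniform prior on the success probability under the alternative, an assumption made here only to show the effect.

```python
import math

n = 104_490_000                  # random-event-generator trials
x = n // 2 + 18_471              # observed successes: 18,471 above chance

# Frequentist side: one-sided p-value for H0: p = 0.5 (normal approximation)
z = 18_471 / math.sqrt(n * 0.25)
p_value = 0.5 * math.erfc(z / math.sqrt(2))   # about 0.00015

# Bayesian side: Bayes factor for H0: p = 0.5 against a uniform prior on p.
# Under the uniform alternative the marginal likelihood is 1 / (n + 1);
# under H0 it is the exact binomial pmf, computed via log-gammas.
log_pmf_h0 = (math.lgamma(n + 1) - math.lgamma(x + 1)
              - math.lgamma(n - x + 1) + n * math.log(0.5))
log_bf01 = log_pmf_h0 + math.log(n + 1)
bf01 = math.exp(log_bf01)        # order of 10: the data favor "chance alone"
```

Despite the tiny p-value, the Bayes factor favors the chance-alone hypothesis by roughly an order of magnitude, which is Lindley’s Paradox in miniature: with vast samples, a point null can survive evidence that looks devastating in frequentist terms.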
Six years later, Jimmie Savage, Harold Lindman, and Ward Edwards at the University of Michigan showed that results using Bayes and the frequentist’s p-values could differ by significant amounts even with everyday-sized data samples; for instance, a Bayesian with any sensible prior and a sample of only 20 could get a posterior probability for the hypothesis ten or more times larger than the p-value.
Lindley ran afoul of Fisher’s temper when he reviewed Fisher’s third book and found “what I thought was a very basic, serious error in it: Namely, that [Fisher’s] fiducial probability doesn’t obey the rules of probability. He said it did. He was wrong; it didn’t, and I produced an example. He was furious with me.” A sympathetic colleague warned Lindley that Fisher was livid, but “it was not until the book of Fisher’s letters was published that I realized the full force of his fury. He was unreasonable; he should have admitted his mistake. Still, I was a bumptious young man and he was entitled to be a bit angry.” Lindley compounded his tactlessness by turning his discovery into a paper. The journal editor agreed that the article had to be published because it was correct, but asked Lindley if he knew what he was doing: “We would have the wrath of Fisher on our heads.”13 For the next eight months Fisher’s letters to friends included complaints about Lindley’s “rather abusive review.”14
Bayes frayed Neyman’s nerves too. When Neyman organized a symposium in Berkeley in 1960, Lindley read a paper about prior distributions that caused “the only serious, public row that I can recall having had in statistics. Neyman was furious with me in public. I was very worried, but Savage leapt to my defense and dealt with the situation, I think very well.”15
One day in the mid-1960s Box dared to tackle his irascible father-in-law about equal priors. Fisher had come to see his new granddaughter, and friends had warned Box he could get only so far with Fisher before he exploded. Walking up the hill to the University of Wisconsin in Madison, Box told the old man, “I’ll give them equal probabilities, so if I have five hypotheses, each has a probability of one-fifth.”16
Fisher responded in a rather annoyed way, as if to say, “This is what I’m going to say and then you’re going to shut up.” He declared, “Thinking that you don’t know and thinking that the probabilities of all the possibilities are equal is not the same thing.”17 That distinction, which Box later agreed with, prevented Fisher from accepting Bayes’ rule. Like Neyman, Fisher agreed that if he ever saw a prior he could believe, he would use Bayes-Laplace. And in fact he did. Because he knew the pedigrees of his laboratory animals generations back, he could confidently specify the initial probabilities of particular crosses. For those experiments Fisher did use Bayes’ rule. Later, Box said sadly that he divorced Fisher’s daughter because she had inherited a temper much like her father’s.
A compromise between Bayesian and anti-Bayesian methods began to look attractive. The idea was to estimate the initial probabilities according to their relative frequency and then proceed with the rest of Bayes’ rule. Empirical Bayes, as it was called, seemed like a breakthrough. Egon Pearson had made an early stab at it in 1925, Turing had used a variant during the Second World War, Herbert Robbins proposed it in 1955, and when Neyman recommended it, a flurry of publications appeared. Empirical Bayesians had little impact on mainstream statistical theory, however, and almost none on applications until the late 1970s.
At the same time, others tackled one of the practical drawbacks to Bayes: its computational difficulty. Laplace’s continuous form of Bayes’ rule called for integration of functions. This could be complicated, and as the number of unknowns rose, integration problems became hopelessly difficult given the computational capabilities of the time. Jeffreys, Lindley, and David Wallace were among those who worked on developing asymptotic approximations to make the calculations more manageable.
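One such device, the Laplace (asymptotic) approximation, replaces an intractable posterior integral with a Gaussian fitted at the integrand’s peak. A minimal sketch on a Beta-type integral whose exact value is known in closed form; the coin-flip counts are made up for illustration:

```python
import math

a, b = 50, 50  # hypothetical likelihood counts: a successes, b failures

# Exact normalizing constant of the posterior kernel p^a (1-p)^b,
# i.e. the Beta function B(a+1, b+1), computed via log-gammas.
log_exact = (math.lgamma(a + 1) + math.lgamma(b + 1)
             - math.lgamma(a + b + 2))

# Laplace approximation: expand the log kernel a*log(p) + b*log(1-p)
# to second order at its mode and integrate the resulting Gaussian.
p_hat = a / (a + b)                              # mode of the kernel
log_peak = a * math.log(p_hat) + b * math.log(1 - p_hat)
curvature = a / p_hat**2 + b / (1 - p_hat)**2    # minus the second derivative
log_laplace = log_peak + 0.5 * math.log(2 * math.pi / curvature)

# At this sample size the two log values agree to better than one percent.
```

The appeal, then as now, is that the Gaussian integral requires only the mode and the curvature there, quantities far easier to compute than the full integral, and the approximation sharpens as the sample grows.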
Amid this mathematical fervor, a few practical types sat down in the 1960s to build the kind of institutional support that frequentists had long enjoyed: annual seminars, journals, funding sources, and textbooks. Morris H. DeGroot wrote the first internationally known text on Bayesian decision theory, the mathematical analysis of decision making. Arnold Zellner at the University of Chicago raised money, founded a conference series, and began testing standard economics problems one by one, solving them from both Bayesian and non-Bayesian points of view. Thanks to Zellner’s influence, Savage’s subjective probability would have one of its biggest impacts in economics. The building process took decades. The International Society for Bayesian Analysis and the Bayesian section of the American Statistical Association were not formed until the early 1990s.
The excitement over Bayesian theory extended far beyond statistics and mathematics, though. During the 1960s and 1970s, physicians, National Security Agency cryptanalysts, Central Intelligence Agency analysts, and lawyers also began to consider Bayesian applications in their fields.
Physicians began talking about applying Bayes to medical diagnosis in 1959, when Robert S. Ledley of the National Bureau of Standards and Lee B. Lusted of the University of Rochester School of Medicine suggested the idea. They published their article in Science because medical journals were uninterested. After reading it, Homer Warner, a pediatric heart surgeon at the Latter-day Saints Hospital and the University of Utah in Salt Lake City, developed in 1961 the first computerized program for diagnosing disease. Basing it on 1,000 children with various congenital heart diseases, Warner showed that Bayes could identify their underlying problems quite accurately. “Old cardiologists just couldn’t believe that a computer could do something better than a human,” Warner recalled.18 A few years after Warner introduced his battery of 54 tests, Anthony Gorry and Otto Barnett showed that only seven or eight were needed, as long as they were relevant to the patient’s symptoms and given one at a time in proper sequence. Few physicians used the systems, however, and efforts to computerize diagnostics lapsed.
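The heart of such a diagnostic program is simply Bayes’ rule applied across candidate diseases, with symptoms treated as conditionally independent given the disease. The sketch below is not Warner’s system; every disease name, finding, and probability is invented for illustration.

```python
# Hypothetical priors P(disease) and likelihoods P(finding | disease)
priors = {"VSD": 0.40, "ASD": 0.35, "Tetralogy": 0.25}
likelihoods = {
    "loud_murmur": {"VSD": 0.90, "ASD": 0.30, "Tetralogy": 0.70},
    "cyanosis":    {"VSD": 0.05, "ASD": 0.05, "Tetralogy": 0.85},
}

def diagnose(findings):
    """Posterior over diseases given observed findings, assuming the
    findings are conditionally independent given the disease."""
    post = dict(priors)
    for f in findings:
        for d in post:
            post[d] *= likelihoods[f][d]       # multiply in each likelihood
    total = sum(post.values())
    return {d: v / total for d, v in post.items()}   # normalize

result = diagnose(["loud_murmur", "cyanosis"])
# "Tetralogy" dominates once cyanosis is observed, despite its lower prior.
```

This is also why Gorry and Barnett could prune the test battery: each finding either moves the posterior substantially or it does not, and a handful of well-chosen, well-ordered tests carry most of the information.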
Between 1960 and 1972 the National Security Agency educated cryptanalysts about advanced Bayesian methods with at least six articles in its in-house NSA Technical Journal. Originally classified as “Top Secret Umbra,” an overall code word for high-level intelligence, they were partially declassified at my request in 2009, although the authorship of three of the six was suppressed. (At least one of those three has the earmarks of a Jack Good article.) In another article, an agency employee named F. T. Leahy quoted the assertion in Van Nostrand’s Scientific Encyclopedia that Bayes’ theorem “has been found to be unscientific, to give rise to various inconsistencies, and to be unnecessary.” In fact, Leahy declared in 1960, Bayes is “one of the more important mathematical techniques used by cryptanalysts . . . [and] was used in almost all the successful cryptanalysis at the National Security Agency. . . . It leads to the only correct formulas for solving a large number of our cryptanalytic problems,” including those involving comparisons among a multiplicity of hypotheses.19 Still, “only a handful of mathematicians at N.S.A. know about all the ways” Bayes can be used. The articles were presumably intended to remedy the situation.
At the CIA, analysts conducted dozens of experiments with Bayes. The CIA, which must infer information from incomplete or uncertain evidence, had failed at least a dozen times to predict disastrous events during the 1960s and 1970s. Among them were North Vietnam’s intervention in South Vietnam and the petroleum price hike instituted by the Organization of Petroleum Exporting Countries (OPEC) in 1973. Agency analysts typically made one prediction and left it at that; they ignored unlikely but potentially catastrophic possibilities and failed to update their initial predictions with new evidence. Although the CIA concluded that Bayes-based analyses were more insightful, they were judged too slow. The experiments were abandoned for lack of computer power.
The legal profession reacted differently. After several suggestions that Bayes might be useful in assessing legal evidence, Professor Laurence H. Tribe of Harvard Law School published a blistering and influential article about the method in 1971. Drawing on his bachelor’s degree in mathematics, Tribe condemned Bayes and other “mathematical or pseudo-mathematical devices” because they are able to “distort—and, in some instances, to destroy—important values” by “enshrouding the [legal] process in mathematical obscurity.”20 After that, many courtroom doors slammed shut on Bayes.
The extraordinary fact about the glorious Bayesian revival of the 1950s and 1960s is how few people in any field publicly applied Bayesian theory to real-world problems. As a result, much of the speculation about Bayes’ rule was moot. Until they could prove in public that their method was superior, Bayesians were stymied.
part IV
to prove its worth
11.
business decisions
With new statistical theories cropping up almost daily during the 1960s, the paltry number of practical applications in the public arena was becoming a professional embarrassment.
Harvard’s John W. Pratt complained that Bayesians and frequentists alike were publishing “too many minor advances extracted from real or mythical problems, sanitized, rigorized, polished, and presented in pristine mathematical form and multiple forums.”1
Bayesians in particular seemed unwilling to apply their theories to real problems. Savage’s rabbit ears and 20-pound chairs were textbook showpieces, even less substantial than Egon Pearson’s chestnut foals and pipe-smoking males of 30 years before. They were “dumb,” a Bayesian at Harvard Business School protested later;2 they lacked the practical world’s ring of truth. When analyzing substantial amounts of data, even diehard Bayesians preferred frequency. Lindley, already Britain’s leading Bayesian, presented a paper about grading examinations to the Harvard Business School in 1961 without mentioning Bayes; only later did he analyze a similar problem using wine statistics and Bayesian methods.
Mathematically and philosophically, the rule was simplicity itself. “You can’t have an opinion afterwards,” said Pratt, “without having had an opinion before and then updating it using the information.”3 But making a belief quantitative and precise—there was the rub.
Until their logically appealing system could prove itself as a powerful problem solver in the workaday world of statistics, Bayesians were doomed to minority status. But who would undertake a cumbersome, complex, technically fearsome set of calculations to explore a method that was professionally almost taboo? At the dawn of the electronic age, powerful computers were few and far between, and packaged software nonexistent. Bayesian techniques for dealing with practical problems and computers barely existed. Brawn—lots of it—would have to substitute for computer time. The faint of heart need not apply.
Nevertheless, a few hardy, energetic, and supremely inventive investigators tried to put Bayes to work for business decision makers, social scientists, and newscasters. Their exploits dramatize the formidable obstacles that faced anyone trying to use Bayes’ rule.
The first to try their luck with Bayes was an unlikely duo at Harvard Business School, Robert Osher Schlaifer and Howard Raiffa. They were polar opposites. Schlaifer was the school’s statistical expert, but, in a sign of the times, he had taken only one mathematics course in his life and was an authority on ancient Greek slavery and modern airplane engines. Raiffa was a mathematical sophisticate who became the legendary Mr. Decision Tree, an advisor to presidents and a builder of East–West rapprochement. Together they tackled the problem of turning Bayes into a tool for business decision makers.
Fortunately, Schlaifer enjoyed using his laserlike mind, hyperlogic, and outsider status to bludgeon through convention and orthodoxy. Years later, when Raiffa was asked to describe his colleague, he did so in two words, “imperious and hierarchical.”
Schlaifer was an opinionated perfectionist. Plunging into a topic, he saw nothing else. When his passion turned to bicycle racks, he persuaded a Harvard dean to install his new design on campus. Because he loved old engines, an MIT physicist volunteered to come each week to maintain his Model A Ford and hi-fi to his exacting standards. And when he studied consumer behavior, his faculty colleagues pondered instant coffee as soberly as they would have nuclear fusion. Like an autocrat atop an empire, Schlaifer bestowed nicknames on his colleagues: “Uncle Howard” for Raiffa, “Great Man Pratt” for John W. Pratt, and, most memorably, “Arturchick” for his graduate student and near namesake, Arthur Schleifer Jr. All three survived to become chaired professors at Harvard.