The Theory That Would Not Die
Egon Pearson and a Polish mathematician, Jerzy Neyman, teamed up in 1933 to develop the Neyman-Pearson theory of hypothesis testing. Until then, statisticians had tested one hypothesis at a time and either accepted or rejected it without considering alternatives. Egon Pearson’s idea was that the only correct reason for rejecting a statistical hypothesis was to accept a more probable one. As he, Neyman, and Fisher developed it, the theory became one of the twentieth century’s most influential pieces of applied mathematics. Egon Pearson was afraid of contradicting his father, however. His “fears of K.P. and R.A.F.” precipitated a psychological crisis in 1925 and 1926: “I had to go through the painful stage of realizing that K.P. could be wrong . . . and I was torn between conflicting emotions: a. finding it difficult to understand R. A. F., b. hating him for his attacks on my paternal ‘god,’ c. realizing that in some things at least he was right.”36 To placate his father, Egon gave up the woman he loved; they married many years later. And he was so afraid to submit articles to his father’s Biometrika that he and Neyman published their own journal, Statistical Research Memoirs, for two years between 1936 and 1938 and ceased publication only after Karl Pearson died.
Over the years Fisher, Egon Pearson, and Neyman would develop a host of powerful statistical techniques. Fisher and Neyman became fervent anti-Bayesians who limited themselves to events that could theoretically be repeated many times; regarded samples as their only source of information; and viewed each new set of data as a separate problem, to be used if the data were powerful enough to provide statistically significant conclusions and discarded if not. As anti-Bayesians, they banned subjective priors, although they did not argue with Bayes’ theorem when the priors were known; the difficulties and controversies arose when the prior probabilities were unknown. Neyman, for example, denounced Bayes’ equal-prior shortcut as “illegitimate.”37
In addition, in a deep philosophical divide between the two methods, frequentists asked for the probability of a data set given full knowledge of its probable causes, while Bayesians could ask for better knowledge of the causes in light of the data. Bayesians could also consider the probability of a single event, like rain tomorrow; encapsulate subjective information in priors; update their hunches with new information; and include every datum possible because each one might change the answer by a small amount.
In time, however, Fisher and Neyman also split asunder, starting yet another juicy 30-year feud. Their views on testing, which could be an order of magnitude apart, formed the crux of their bitter fight. According to Neyman, though, the argument began because Fisher demanded that Neyman lecture only from Fisher’s book. When Neyman refused, Fisher promised to oppose him “in all my capacities.”
Their quarrel erupted at a meeting of the Royal Statistical Society on March 28, 1935, where a secretary took the customary word-for-word notes for verbatim publication. Neyman presented a paper arguing that the Latin square (a technique Fisher had pioneered for experimental design) was biased. Fisher immediately marched to a blackboard, drew a Latin square and, using a simple argument, showed that Neyman was wrong. But Fisher was far from polite. He complained sarcastically that “he had hoped that Dr. Neyman’s paper would be on a subject with which the author was fully acquainted, and on which he could speak with authority. . . . Dr. Neyman had been somewhat unwise in his choice of topics.” Fisher couldn’t seem to stop. He kept on: “Dr. Neyman had arrived or thought he had arrived at . . . Apart from its theoretical defects . . . the apparent inability to grasp the very simple argument . . . How had Dr. Neyman been led by his symbolism to deceive himself on such a simple question?” and on and on.38
By 1936 the feud between the Neyman camp and the Fisherites was an academic cause célèbre. The two groups occupied different floors of the same building at University College London but they never mixed. Neyman’s group met in the common room for India tea between 3:30 and 4:15 p.m. Fisher’s group sipped China tea from then on. They were fighting over scraps. The statistics building had no potable water, so little electricity that blackboards were unreadable after dark, and so little heat in winter that overcoats were worn inside.
George Box, who straddled both groups (he studied under Egon Pearson, became a Bayesian, and married one of Fisher’s daughters), said that Fisher and Neyman “could both be very nasty and very generous at times.” Because Neyman was decision-oriented and Fisher was more interested in scientific inference, their methodologies and types of applications were different. Each was doing what was best for the problems he was working on, but neither side made any attempt to understand what the other was doing. A popular in-house riddle described the situation: “What’s the collective noun for a group of statisticians?” “A quarrel.”39
Shortly before the Second World War, Neyman moved to the University of California at Berkeley and transformed it into an anti-Bayesian powerhouse. The Neyman-Pearson theory of tests became the Berkeley school’s glory and emblem. The joke, of course, was that Berkeley’s namesake, Bishop George Berkeley, had disapproved of calculus and mathematicians.
The golden age of probability theory had turned into a double-fronted attack by two camps of warring frequentists united in their abhorrence of Bayes. In the maelstrom, the lack of reasoned discourse among the leaders of statistical mathematics delayed the development of Bayes’ rule for decades. Caught in the infighting, the rule was left to find its way alone, stymied and disparaged.
Yet even as the frequentists’ assault laid it low, the first glimmerings of a revival flickered here and there, quietly and almost unnoticed. In a remarkable confluence of thinking, three men in three different countries independently came up with the same idea about Bayes: knowledge is indeed highly subjective, but we can quantify it with a bet. The amount we wager shows how much we believe in something.
In 1924 a French mathematician, Émile Borel, concluded that a person’s subjective degree of belief could be measured by the amount he was willing to bet. Borel argued that applying probability to real problems, such as insurance, biology, agriculture, and physics, was far more important than mathematical theorizing. He believed in rational behavior and lived as he taught. At the height of a scandal over Marie Curie’s possible affair with another scientist, Borel sheltered her and her daughters; in reaction, the minister of public instruction threatened to fire him from his teaching post at the École Normale Supérieure, a leading school of mathematics and science.40 Between the two world wars, Borel was a member of the French Chamber of Deputies and a minister of the navy and helped direct national policy toward research and education. Imprisoned briefly during the Second World War by the pro-Nazi Vichy government, he later received the Resistance Medal.
Two years after Borel’s suggestion, a young English mathematician and philosopher named Frank P. Ramsey made the same proposal. Before he died at the age of 26 following surgeries for jaundice, he wondered how we should make decisions in the face of uncertainty. In an informal talk to students at the Moral Sciences Club at Cambridge University in 1926, Ramsey suggested that probability was based on personal beliefs that could be quantified by a wager. Such extreme subjectivity broke radically with previous thinkers like Mill, who had denounced subjective probabilities as the abhorrent quantification of ignorance.
Ramsey, who in his brief career also contributed to economics, logic, and philosophy, believed uncertainty had to be described in terms of probability rather than by tests and procedures. By talking about a measure of belief as a basis for action and by introducing a utility function and the maximization of expected utility, he showed how to act in the face of uncertainty. Neither Bayes nor Laplace had ventured into the world of decisions and behavior. Because Ramsey worked in Cambridge, England, the history of Bayes’ rule might have been quite different had he lived longer.
At almost the same time as Borel and Ramsey, an Italian actuary and mathematics professor, Bruno de Finetti, also suggested that subjective beliefs could be quantified at the racetrack. He called it “the art of guessing.”41 De Finetti had to deliver his first important paper in Paris because the most powerful Italian statistician, Corrado Gini, regarded his ideas as unsound. (In Gini’s defense, de Finetti told colleagues he was convinced Gini had “the evil eye.”)42 De Finetti, considered the finest Italian mathematician of the twentieth century, wrote about financial economics and is credited with putting Bayes’ subjectivity on a firm mathematical foundation.
Not even probability experts, however, noticed these outbursts of subjective betting. During the 1920s and 1930s the anti-Bayesian trio of Fisher, Egon Pearson, and Neyman attracted all the attention. Ramsey, Borel, and de Finetti worked outside English-speaking statistical circles.
Another outsider carved out a safe haven for Bayes in paternity law, a small and inconspicuous corner of the American judicial system. Paternity law asks, Did this man father this child? And if so, how much should he pay in child support? In 1938 a Swedish professor of genetics and psychiatry named Erik Essen-Möller developed an index of probability that was mathematically equivalent to Bayes’ theorem. For 50 years, until DNA profiling became available, American lawyers used the Essen-Möller index without knowing its Bayesian paternity. In the U.S. Uniform Parentage Act, Bayes even became a model for state legislation. Because paternity lawyers began by assigning 50–50 odds to the man’s innocence, the index favored fathers even though Essen-Möller believed that “mothers more frequently accuse true than false fathers.”43 Bayesian paternity law was also used in immigration and inheritance cases and in cases where a child was born as a result of rape. Today, DNA evidence typically gives paternity probabilities of 0.999 or more.
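The link between a paternity index and Bayes’ theorem can be made concrete with a small sketch. It assumes the index is written in its usual textbook form, W = X / (X + Y), where X is the chance of seeing the genetic evidence if the accused man is the father and Y is the chance of seeing it if a random man is. The function names and the numbers are illustrative assumptions, not figures from any real case.

```python
def paternity_posterior(x_if_father, y_if_random_man, prior=0.5):
    """Bayes' theorem for two competing hypotheses: father vs. random man."""
    numerator = prior * x_if_father
    denominator = numerator + (1.0 - prior) * y_if_random_man
    return numerator / denominator


def paternity_index(x_if_father, y_if_random_man):
    """Index in the assumed form W = X / (X + Y); no prior appears explicitly."""
    return x_if_father / (x_if_father + y_if_random_man)


# Hypothetical evidence: ten times likelier if the man is the father.
x, y = 0.5, 0.05

print(paternity_index(x, y))                 # the index on its own
print(paternity_posterior(x, y, prior=0.5))  # Bayes with 50-50 prior odds
print(paternity_posterior(x, y, prior=0.1))  # Bayes with a more skeptical prior
```

With a prior of one-half the two calculations agree, which is the sense in which the index hides a 50–50 starting assumption; any other prior changes the answer.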
Yet another outsider, Lowell J. Reed, a medical researcher at Johns Hopkins University in Baltimore, dramatized the shortcomings of frequentism and the value of Bayes in 1936. Reed, a member of the department of biostatistics, wanted to determine the X-ray dosages that would kill cancerous tumors but leave patients unharmed. He had no precise exposure records, however, and the effects of low doses were not understood. Reed normally used frequency methods and repeated tests on fruit flies, protozoa, and bacteria; but to ascertain doses for humans he would have to use expensive mammals. With Bayes, he determined the most therapeutic doses for human cancer patients by sacrificing only 27 cats, a relatively small number. But Reed worked outside the statistical mainstream, used Bayes only occasionally, and had little influence on statistics. Even Ramsey, Borel, de Finetti, and Essen-Möller had to wait decades before the importance of their work was recognized.
It was a geophysicist, Harold Jeffreys, who almost singlehandedly kept Bayes alive during the anti-Bayesian onslaught of the 1930s and 1940s. Cambridge University students liked to joke that they had the world’s two greatest statisticians, although one was a professor of astronomy and the other a professor of genetics. Fisher was the geneticist. Jeffreys was an Earth scientist who studied earthquakes, tsunamis, and tides. He said he qualified for the astronomy department “because Earth is a planet.”44
Thanks in large part to Jeffreys’s quiet, gentlemanly personality, he and Fisher became friends even though they disagreed, irrevocably and vociferously, over Bayes. Jeffreys said he told Fisher that “on most things we should agree and when we disagreed, we would both be doubtful. After that, Fisher and I were great friends.”45 For example, Jeffreys believed that Fisher’s maximum likelihood method was basically Bayesian, and he often used it because with large samples the prior did not matter and the two techniques produced approximately the same results. They differed, however, when small amounts of data were involved. Years later others would dramatize situations where Jeffreys’s and Fisher’s significance test results could differ by an order of magnitude.
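A rough illustration of why the prior stops mattering as data accumulate: the sketch below estimates a success rate two ways, by maximum likelihood and by the mean of a Bayesian posterior. The Beta(2, 2) prior and the sample counts are assumptions chosen for illustration; they are not Jeffreys’s own prior or data.

```python
def maximum_likelihood(successes, trials):
    """Maximum likelihood estimate of a success rate: the observed fraction."""
    return successes / trials


def posterior_mean(successes, trials, prior_a=2.0, prior_b=2.0):
    """Mean of the Beta posterior obtained from a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + trials)


# Same observed fraction (75 percent), very different sample sizes.
for successes, trials in [(3, 4), (7500, 10000)]:
    ml = maximum_likelihood(successes, trials)
    pm = posterior_mean(successes, trials)
    print(f"n={trials}: maximum likelihood={ml:.4f}, posterior mean={pm:.4f}")
```

With four observations the prior pulls the Bayesian estimate noticeably toward one-half; with ten thousand the two estimates are nearly indistinguishable.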
Aside from their views on Bayes, Jeffreys and Fisher had much in common. Both were practicing scientists who manipulated statistical data; neither was a mathematician or statistician. Both were educated at Cambridge; Jeffreys, in fact, never left and was a fellow there for 75 years, longer than any other professor. Neither man was outgoing; both were appalling lecturers, their feeble voices inaudible beyond a few rows, and a student once counted Jeffreys mumbling “er” 71 times in five minutes. Both were knighted for their work.
Of the two, Jeffreys led the richer personal life. At the age of 49 he married his longtime collaborator, the mathematician Bertha Swirles; they proofread their monumental Methods of Mathematical Physics during all-night stints as air raid wardens during the Second World War. He enjoyed annotating discrepancies in detective novels, singing tenor in choral societies, botanizing, walking, traveling, and, until he was 91, bicycling to work.
Like Laplace, Jeffreys studied the formation of the Earth and planets in order to understand the origin of the solar system. He became involved in statistics because he was interested in how earthquake waves travel through the Earth. A major earthquake generates seismic waves that can be recorded thousands of miles away. By measuring their arrival times at different stations, Jeffreys could work back to determine the earthquake’s likely epicenter and the likely composition of the Earth. It was a classic problem in the inverse probability of causes. In 1926 Jeffreys inferred that Earth’s central core is liquid—probably molten iron, probably mixed with traces of nickel.
As one historian said, “Perhaps in no other field were as many remarkable inferences drawn from so ambiguous and indirect data.”46 Signals were often difficult to interpret, and seismograph machines differed greatly. Earthquakes, which often occurred far apart under very different conditions, were hardly repeatable. Jeffreys’s conclusions involved far more uncertainty than Fisher’s breeding experiments, which were designed to answer precise, repeatable questions. Like Laplace, Jeffreys spent a lifetime updating his observations with new results. He wrote, “The propositions that are in doubt . . . constitute the most interesting part of science; every scientific advance involves a transition from complete ignorance, through a stage of partial knowledge based on evidence becoming gradually more conclusive, to the stage of practical certainty.”47
Working in an office whose floor was ankle-deep in papers, Jeffreys composed The Earth: Its Origin, History, and Physical Constitution, the standard work on the planet’s structure until plate tectonics was discovered in the 1960s. (Sadly, while Jeffreys played the hero defending Bayes, he opposed the idea of continental drift as late as 1970, when he was 78, because he thought it meant the continents would have to push their way through viscous liquid.)
While analyzing earthquakes and tsunamis, Jeffreys worked out a new, objective form of Bayes for scientific applications and devised formal rules for selecting priors. As he put it, “Instead of trying to see whether there was any more satisfactory form of the prior probability, a succession of authors have said that the prior probability is nonsense and therefore that the principle of inverse probability, which cannot work without [priors], is nonsense too.”48
Jeffreys considered probability appropriate for all uncertainty, even something as apparently certain as a scientific law, whereas frequentists usually restricted probability to the uncertainties associated with theoretically repeatable data. As the statistician Dennis Lindley wrote, Jeffreys “would admit a probability for the existence of the greenhouse effect, whereas most [frequentist] statisticians would not and would confine their probabilities to the data on CO2, ozone, heights of the oceans, etc.”49
Jeffreys was particularly annoyed by Fisher’s measures of uncertainty, his “p-values” and significance levels. The p-value was a probability statement about data, given the hypothesis under consideration. Fisher had developed p-values for dealing with masses of agricultural data; he needed some way to determine which results should be trashed, which filed away, and which followed up on immediately. Comparing two hypotheses, he could reject the chaff and save the wheat.
Technically, p-values let laboratory workers state that their experimental outcome offered statistically significant evidence against a hypothesis if the outcome (or a more extreme outcome) had only a small probability (under the hypothesis) of having occurred by chance alone.
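As a minimal sketch of that definition, consider a hypothetical coin-tossing experiment: the hypothesis is that the coin is fair, and the p-value is the chance of seeing the observed number of heads or anything more extreme. The function name, the one-sided choice of “more extreme,” and the counts are assumptions made for illustration.

```python
from math import comb


def one_sided_p_value(heads, tosses, p_heads=0.5):
    """Probability, under the hypothesis, of `heads` or more heads in `tosses` tosses."""
    return sum(
        comb(tosses, k) * p_heads**k * (1 - p_heads) ** (tosses - k)
        for k in range(heads, tosses + 1)
    )


# Observed: 9 heads in 10 tosses. The sum covers 9 heads (observed)
# and 10 heads (more extreme but never observed).
p = one_sided_p_value(9, 10)
print(f"p-value = {p:.4f}")
print("significant at the 0.05 level" if p < 0.05 else "not significant at the 0.05 level")
```

Note that the sum includes an outcome, ten heads, that never happened; the calculation depends on results that might have occurred as well as the one that did.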
Jeffreys thought it very strange that a frequentist considered possible outcomes that had not occurred. He wanted to know the probability of his hypothesis about the epicenter of a particular earthquake, given his information about the arrival times of tsunamis caused by the earthquake. Why should possible outcomes that had not occurred make anyone reject a hypothesis? Few researchers repeated—or could repeat—an experiment at random many, many more times. “Imaginary repetitions,” a critic called them. Bayesians considered data as fixed evidence, not as something that can vary. Jeffreys certainly could not repeat a particular earthquake. Moreover, the p-value is a statement about data, whereas Jeffreys wanted to know about his hypothesis given his data. As a result, Jeffreys proposed using only observed data with Bayes’ rule to compute the probability that the hypothesis was true.
Newton, as Jeffreys pointed out, derived his law of gravity 100 years before Laplace proved it by discovering Jupiter’s and Saturn’s 877-year cycle: “There has not been a single date in the history of the law of gravitation when a modern significance test would not have rejected all laws [about gravitation] and left us with no law.”50
Bayes, on the other hand, “makes it possible to modify a law that has stood criticism for centuries without the need to suppose that its originator and his followers were useless blunderers.”51
Jeffreys concluded that p-values fundamentally distorted science. Frequentists, he complained, “appear to regard observations as a basis for possibly rejecting hypotheses, but in no case for supporting them.”52 But odds are that at least some of the hypotheses Fisher rejected were worth investigating or were actually true.