The Theory That Would Not Die

by Sharon Bertsch McGrayne


  Ignoring the defensive posture of many statistics departments, Smith launched an offensive in a radically new direction. Smith’s friends think of him as a lively, practical man with street smarts, people skills, and a can-do personality, the kind of person more comfortable in running shorts than academic garb. He was certainly willing to do the dirty work needed to make Bayes practical. He learned Italian to help translate de Finetti’s two-volume Theory of Probability and then got it published. For the first time de Finetti’s subjectivist approach was widely available to Anglo-American statisticians. Smith also developed filters, practical computer devices that would ease Bayesian computations enormously.

  Next, Smith and three others—Lindley, José M. Bernardo, and Morris DeGroot—organized an international conference series for Bayesians in Valencia, Spain. It has been held regularly since 1979. Smith expected “the usual criticism from non-Bayesians in reaction to whatever I say.” Sure enough, frequentists accused Bayesians of sectarian habits, meetings in remote locations, and mock cabarets featuring skits and songs with Bayesian themes. Other disciplines have done the same, of course. The conferences played a vital role in helping to build camaraderie in a small field under attack.

  In 1984 Smith issued a manifesto—and italicized it for emphasis: “Efficient numerical integration procedures are the key to more widespread use of Bayesian methods.”10 With computerized data collection and storage, hand analyses were becoming impossible. When microcomputers appeared, attached to fast networks with graphics and vast storage capabilities, data analysts could finally hope to improvise as easily as they had with pencil and paper. With characteristic practicality, Smith set his University of Nottingham students to work developing efficient, user-friendly software for Bayesian problems in spatial statistics and epidemiology.

  Intrigued with Smith’s projects, Alan Gelfand at the University of Connecticut asked Smith if he could spend his sabbatical year at Nottingham. When Gelfand arrived, Smith suggested they start something new. Gelfand recalled, “He gave me this Tanner and Wong paper, saying, ‘This is kind of an interesting paper. There must be something more to it.’” Wing Hung Wong and Martin A. Tanner, at the Universities of Chicago and Wisconsin, respectively, were interested in spatial image analysis for identifying genetic linkages and for scanning the brain using positron emission tomography (PET). Wong had been adapting for Bayesians the EM algorithm, an iterative system secretly developed by the National Security Agency during the Second World War or the early Cold War. Arthur Dempster and his student Nan Laird at Harvard discovered EM independently a generation later and published it for civilian use in 1977. Like the Gibbs sampler, the EM algorithm worked iteratively to turn a small data sample into estimates likely to be true for an entire population.

  Gelfand was studying the Wong paper when David Clayton from the University of Leicester dropped by and said, “Oh, the paper by Geman and Geman has something to do with it, I think.” Clayton had written a technical report which, although never published, concerned Gibbs sampling. The minute Gelfand saw the Gemans’ paper, the pieces came together: Bayes, Gibbs sampling, Markov chains, and iterations. A Markov chain can be composed of scores of links, and for each one a range of possible variables must be sampled and calculated one after another. Anyone studying a rare and subtle effect must run each chain over and over again to collect enough samples for the rarity to show up. The numbers involved become enormous, and the length and tediousness of the calculations turned off most researchers.

  But Gelfand and Smith saw that replacing difficult integration with sampling would be a wonderful computational tool for Bayesians. “It went back to the most basic things you learn in an introductory statistics course,” Gelfand explained. “If you want to learn about a distribution or a population, you take samples from it. But don’t sample directly.” Imaging and spatial statisticians had been looking at local models as a whole, but Gelfand and Smith realized they should build long chains, series of observations generated one or two at a time, one after another. As Gelfand explained, “The trick was to look at simple distributions one at a time but never look at the whole. The value of each one depended only on the preceding value. Break the problem into tiny pieces that are easy to solve and then do millions of iterations. So you replace one high-dimensional draw with lots of low-dimensional draws that are easy. The technology was already in place. That’s how you break the curse of high-dimensionality.”11
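  The procedure Gelfand describes is what is now called the Gibbs sampler. Below is a minimal Python sketch of the idea, run on a toy two-dimensional normal target whose one-dimensional conditional distributions are known exactly; the correlation value, chain length, and burn-in are arbitrary choices for illustration, not anything taken from Gelfand and Smith’s paper.

```python
import random

# Minimal Gibbs sampler sketch for a toy target: a two-dimensional normal with
# unit variances and correlation rho. Each full conditional is an easy
# one-dimensional normal, so one hard two-dimensional draw is replaced by many
# easy one-dimensional draws, each depending only on the latest value of the
# other coordinate. (rho, chain length, and burn-in are arbitrary choices.)

def gibbs_bivariate_normal(rho=0.8, n_iter=100_000, burn_in=1_000):
    x, y = 0.0, 0.0                     # arbitrary starting point
    cond_sd = (1.0 - rho ** 2) ** 0.5   # std. dev. of each full conditional
    samples = []
    for i in range(n_iter):
        x = random.gauss(rho * y, cond_sd)  # draw x given the current y
        y = random.gauss(rho * x, cond_sd)  # draw y given the new x
        if i >= burn_in:
            samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal()
# The empirical correlation of the chain should settle near rho = 0.8.
est = sum(x * y for x, y in samples) / len(samples)
print(f"estimated correlation ~ {est:.2f}")
```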

  Smith and Gelfand wrote their article as fast as they could. The elements of their system were already known, but their grand synthesis was new. Once others thought about it, they’d realize the importance of the method too.

  When Smith spoke at a workshop in Quebec in June 1989, he showed that Markov chain Monte Carlo could be applied to almost any statistical problem. It was a revelation. Bayesians went into “shock induced by the sheer breadth of the method.”12 By replacing integration with Markov chains, they could finally, after 250 years, calculate realistic priors and likelihood functions and do the difficult calculations needed to get posterior probabilities.

  To outsiders, one of the amazing aspects of Bayes’ history is that physicists and statisticians had known about Markov chains for decades. To illustrate this puzzling lapse, some flashbacks are required. Monte Carlo began in 1906, when Andrei Andreyevich Markov, a Russian mathematician, invented Markov chains of variables. The calculations took so long, though, that Markov himself applied his chains only to the vowels and consonants in a Pushkin poem.

  Thirty years later, with the beginning of nuclear physics in the 1930s, the Italian physicist Enrico Fermi was studying neutron collision reactions. He fought insomnia by computing Markov chains in his head to describe the paths of the neutrons. To the surprise of his colleagues the next morning, Fermi could predict their experimental results. With a small mechanical adding machine, Fermi built Markov chains to solve other problems too. Physicists called the chains statistical sampling.

  Fermi did not publish his methods, and, according to Jack Good, government censorship kept Markov chains under tight wraps during the Second World War. After the war Fermi helped John von Neumann and Stanislaw Ulam develop the technique for hydrogen-bomb developers using the new ENIAC computer at the University of Pennsylvania. To calculate the critical mass of neutrons needed to make a thermonuclear explosion, Maria Goeppert Mayer, a future Nobel Prize–winning physicist, simulated the process with Markov chains, following one neutron at a time and making decisions at various places as to whether the neutron was most likely to get absorbed, escape, die, or fission. The calculation was too complicated for existing IBM equipment, and many thought the job was also beyond computers. But Mayer reported that it did “not strain the capacity of the Eniac.”13 In 1949 she spoke at a symposium organized by the National Bureau of Standards, Oak Ridge National Laboratory, and RAND for physicists to brief mathematicians and statisticians on hitherto classified applications.
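  The bookkeeping Mayer describes can be mimicked with a toy simulation: follow one neutron at a time, draw its fate at random, and average over many simulated histories. In this sketch the fate probabilities and the number of neutrons released per fission are invented purely for illustration and have nothing to do with the classified calculation itself.

```python
import random

# Toy Monte Carlo in the spirit of the neutron calculation described above:
# follow one neutron at a time and decide at random whether it is absorbed,
# escapes, or fissions and releases fresh neutrons. The probabilities and
# the fission yield below are invented purely for illustration.

P_ABSORB, P_ESCAPE = 0.3, 0.5          # remaining 0.2 probability: fission
NEUTRONS_PER_FISSION = 2               # assumed yield per fission event
MAX_GENERATIONS = 50                   # cap to keep the toy chain finite

def follow_neutron():
    """Count the secondary neutrons produced by one starting neutron."""
    produced = 0
    live = [0]                         # generation index of each live neutron
    while live:
        gen = live.pop()
        if gen >= MAX_GENERATIONS:
            continue
        u = random.random()
        if u < P_ABSORB:
            continue                   # absorbed: this branch dies
        if u < P_ABSORB + P_ESCAPE:
            continue                   # escaped: this branch dies
        produced += NEUTRONS_PER_FISSION
        live.extend([gen + 1] * NEUTRONS_PER_FISSION)
    return produced

# Monte Carlo means averaging over many simulated histories.
histories = 10_000
avg = sum(follow_neutron() for _ in range(histories)) / histories
print(f"average secondaries per starting neutron ~ {avg:.2f}")
```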

  That same year Nicholas Metropolis, who had named the algorithm Monte Carlo for Ulam’s gambling uncle, described the method in general terms for statisticians in the prestigious Journal of the American Statistical Association. But he did not detail the algorithm’s modern form until 1953, when his article appeared in the Journal of Chemical Physics, which is generally found only in physics and chemistry libraries. Moreover, he and his coauthors—two husband-and-wife teams, Arianna and Marshall Rosenbluth and Augusta and Edward Teller—were concerned strictly with particles moving around a square. They did not generalize the method for other applications. Thus it was physicists and chemists who pioneered Monte Carlo methods. Working on early computers with between 400 and 80,000 bytes of memory, they dealt with memory losses, unreadable tapes, failed vacuum tubes, and programming in assembly language. Once it took literally months to track down a small programming error. In the 1950s RAND developed a lecture series on Monte Carlo techniques and used them in a specially built simulation laboratory to test case after case of problems too complex for mathematical formulas.

  During this difficult period, statisticians were advised several more times to use Monte Carlo either with or without computers. In 1954 two statisticians associated with the British Atomic Energy Research Establishment, John M. Hammersley and Keith W. Morton, urged readers of the Journal of the Royal Statistical Society to try their “Poor Man’s Monte Carlo” for pen-and-paper calculations; Monte Carlo, they said, was as easy as “simple knitting.” Lindley described Markov chains in 1965 in a text for college students.

  In a particularly poignant case, W. Keith Hastings, a mathematician at the University of Toronto, was approached by chemists who were studying 100 particles interacting with each other while subject to outside forces. Because of the 600 variables in the case, Hastings said he immediately realized the importance of Markov chains for mainstream statistics and devoted all his time to them: “I was excited because the underlying idea goes way back to Metropolis. As soon as I realized it, it was off to the races. I just had to work on it. I had no choice.” In 1970 he published a paper generalizing Metropolis’s algorithm in the statistical journal Biometrika. Again, Bayesians ignored the paper. Today, computers routinely use the Hastings–Metropolis algorithm to work on problems involving more than 500,000 hypotheses and thousands of parallel inference problems.
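  The algorithm Hastings generalized is easy to sketch: from the current state, propose a random move, then accept or reject it with a probability governed by how the target density changes. With a symmetric proposal, as below, this is the plain Metropolis algorithm; Hastings’ contribution was a correction factor that allows lopsided proposals too. The one-dimensional target density, step size, and chain length here are arbitrary choices for illustration.

```python
import math
import random

# Minimal sketch of the Metropolis algorithm that Hastings generalized,
# in one dimension with a symmetric random-walk proposal. The target
# density, step size, and chain length are arbitrary illustrations.

def target(x):
    """Unnormalized target density: a lopsided mixture of two normal bumps."""
    return 0.7 * math.exp(-0.5 * x * x) + 0.3 * math.exp(-0.5 * (x - 4.0) ** 2)

def metropolis(n_iter=50_000, step=1.0, x0=0.0):
    x = x0
    chain = []
    for _ in range(n_iter):
        proposal = x + random.uniform(-step, step)       # symmetric proposal
        # Accept with probability min(1, target(proposal) / target(x));
        # Hastings' generalization adds a correction for asymmetric proposals.
        if random.random() < target(proposal) / target(x):
            x = proposal
        chain.append(x)
    return chain

chain = metropolis()
print(f"estimated mean of the target ~ {sum(chain) / len(chain):.2f}")
```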

  Hastings was 20 years ahead of his time. Had he published his paper when powerful computers were widely available, his career would have been very different. As he recalled, “A lot of statisticians were not oriented toward computing. They take these theoretical courses, crank out theoretical papers, and some of them want an exact answer.”14 The Hastings–Metropolis algorithm provides estimates, not precise numbers. Hastings dropped out of research and settled at the University of Victoria in British Columbia in 1971. He learned about the importance of his work after his retirement in 1992. Why did it take statisticians so long to understand the implications of an old method? And why were Gelfand and Smith first? “The best thing I can say for us is that we just sort of stumbled on it. We got lucky,” Gelfand says today. “It was just sort of sitting there, waiting for people to put the pieces together.”

  Timing was important too. Gelfand and Smith published their synthesis just as cheap, high-speed desktop computers finally became powerful enough to house large software packages that could explore relationships between different variables. Bayes was beginning to look like a theory in want of a computer. The computations that had irritated Laplace in the 1780s and that frequentists avoided with their variable-scarce data sets seemed to be the problem—not the theory itself.

  Yet Smith and Gelfand still thought of Monte Carlo as a last resort to be used in desperation for complicated cases. They wrote diffidently, careful to use the B-word only five times in 12 pages. “There was always some concern about using the B-word, a natural defensiveness on the part of Bayesians in terms of rocking the boat,” Gelfand said. “We were always an oppressed minority, trying to get some recognition. And even if we thought we were doing things the right way, we were only a small component of the statistical community and we didn’t have much outreach into the scientific community.”15

  The Gelfand–Smith paper was an “epiphany in the world of statistics,” as Bayesians Christian P. Robert and George Casella reported. And just in case anyone missed their point, they added: “Definition: epiphany n. A spiritual event . . . a sudden flash of recognition.” Years later, they still described its impact in terms of “sparks,” “flash,” “shock,” “impact,” and “explosion.”16

  Shedding their diffidence, Gelfand and Smith wrote a second paper six months later with Susan E. Hills and Amy Racine-Poon. This time they punctuated their mathematics exuberantly with words like “surprising,” “universality,” “versatility,” and “trivially implemented.” They concluded grandly, “The potential of the methodology is enormous, rendering straightforward the analysis of a number of problems hitherto regarded as intractable from a Bayesian perspective.”17 Luke Tierney at Carnegie Mellon tied the technique to Metropolis’s method, and the entire process—which physicists had called Monte Carlo—was baptized anew as Markov chain Monte Carlo, or MCMC for short. The combination of Bayes and MCMC has been called “arguably the most powerful mechanism ever created for processing data and knowledge.”18

  When Gelfand and Smith gave an MCMC workshop at Ohio State University early in 1991, they were astonished when almost 80 scientists appeared. They weren’t statisticians, but they had been using Monte Carlo in archaeology, genetics, economics, and other subjects for years.

  The next five years raced by in a frenzy of excitement. Problems that had been nightmares cracked open as easily as eggs for an omelet. A dozen years earlier the conference title “Practical Bayesian Statistics” had been a joke. But after 1990 Bayesian statisticians could study data sets in genomics or climatology and make models far bigger than physicists could ever have imagined when they first developed Monte Carlo methods. For the first time, Bayesians did not have to fall back on oversimplified “toy” assumptions.

  Over the next decade, the most heavily cited paper in the mathematical sciences was a study of practical Bayesian applications in genetics, sports, ecology, sociology, and psychology. The number of publications using MCMC increased exponentially.

  Almost instantaneously MCMC and Gibbs sampling changed statisticians’ entire method of attacking problems. In the words of Thomas Kuhn, it was a paradigm shift.19 MCMC solved real problems, used computer algorithms instead of theorems, and led statisticians and scientists into a world where “exact” meant “simulated” and repetitive computer operations replaced mathematical equations. It was a quantum leap in statistics.

  When Smith and Gelfand published their paper, frequentists could do a vast amount more than Bayesians. But within years Bayesians could do more than frequentists. In the excitement that followed, Stanford statistician Jun S. Liu, working with biologist Charles E. Lawrence, showed genome analysts that Bayes and MCMC could reveal motifs in protein and DNA. The international project to sequence the human genome, launched in 1990, was generating enormous amounts of data. Liu showed how a few seconds on a workstation programmed for Bayes and iterative MCMC sampling could detect subtle, closely related patterns in protein and nucleic acid sequences. Then he and Lawrence could infer critical missing data pointing to common ancestors, structures, and functions. Soon genomics and computational biology were so overrun with researchers that Gelfand decided to look elsewhere for another research project.

  Between 1995 and 2000 Bayesians developed particle filters, sequential Monte Carlo generalizations of the Kalman filter, and real-time applications in finance, image analysis, signal processing, and artificial intelligence. The number of attendees at Bayesian conferences in Valencia quadrupled in 20 years. In 1993, more than two centuries after his death, Thomas Bayes finally joined his clerical relatives in the Dictionary of National Biography.

  Amid the Bayesian community’s frenzy over MCMC and Gibbs sampling, a generic software program moved Bayesian ideas out into the scientific and computer world.

  In an example of serendipity, two groups 80 miles apart worked independently during the late 1980s on different aspects of the same problem. While Smith and Gelfand were developing the theory for MCMC in Nottingham, Smith’s student, David Spiegelhalter, was working in Cambridge at the Medical Research Council’s biostatistics unit. He had a rather different point of view about using Bayes for computer simulations. Statisticians had never considered producing software for others to be part of their jobs. But Spiegelhalter, influenced by computer science and artificial intelligence, decided it was part of his. In 1989 he started developing a generic software program for anyone who wanted to use graphical models for simulations. Once again, Clayton was an important influence. Spiegelhalter unveiled his free, off-the-shelf BUGS program (short for Bayesian inference Using Gibbs Sampling) in 1991.

  BUGS caused the biggest single jump in Bayesian popularity. It is still the most popular software for Bayesian analyses, and it spread Bayesian methods around the world.

  “It wasn’t a very big project,” Spiegelhalter admits. “It was a staggeringly basic and powerful idea, relating Gibbs sampling to a graph to write generic programs.”20 Its simple code remains almost exactly the same today as it was in 1991.

  Ecologists, sociologists, and geologists quickly adopted BUGS and its variants, WinBUGS for Microsoft users, LinBUGS for Linux, and OpenBUGS. But computer science, machine learning, and artificial intelligence also joyfully swallowed up BUGS. Since then it has been applied to disease mapping, pharmacometrics, ecology, health economics, genetics, archaeology, psychometrics, coastal engineering, educational performance, behavioral studies, econometrics, automated music transcription, sports modeling, fisheries stock assessment, and actuarial science. A Bayesian visiting a marine laboratory was surprised to discover all its scientists using BUGS with nary a statistician in sight.

  Medical research and diagnostic testing were among the earliest beneficiaries of Bayes’ new popularity. Just as the MCMC frenzy appeared to be moderating, Peter Green of Bristol University showed Bayesians how to compare the elaborate hypotheses that scientists call models. Before 1996 anyone making a prediction about the risk of a stroke had to focus on one model at a time. Green showed how to jump between models without spending an infinite time on each. Previous studies had identified 10 possible factors involved in strokes. Green identified the top four: systolic blood pressure, exercise, diabetes, and daily aspirin.

  Medical testing, in particular, benefited from Bayesian analysis. Many medical tests involve imaging, and Larry Bretthorst, a student of the Bayesian physicist Ed Jaynes, improved nuclear magnetic resonance, or NMR, signal detection by several orders of magnitude in 1990. Bretthorst had studied imaging problems to improve the detection of radar signals for the Army Missile Command.

 
