The Theory That Would Not Die
After the war, he decided that The Federalist might fill the bill for his classification project. By 1955 he was far enough along to rope in David L. Wallace, a young statistician at the University of Chicago. In his disarmingly casual way, Mosteller asked Wallace, “Why don’t you come up and spend some time in New England this summer, and work on this little project I’ve sort of started?”2 The two wound up spending more time studying The Federalist papers than Hamilton and Madison did writing them—“a horrible thought,” Wallace said later.
Wallace urged Mosteller to use Bayes’ rule for the project. Wallace had earned a Ph.D. in mathematics from Princeton in 1953 and would become a professor at the University of Chicago. But in 1955, his first year at Chicago, Savage was teaching from his recently published book on Bayes’ rule. Despite the hostility of most American statisticians, Wallace had been receptive to Bayesian ideas.
Wallace thought The Federalist might be an application where Bayes could be very helpful. “If you stick to relatively simple problems,” he explained, “like the ones taught in elementary statistics books, you can do it by Bayesian or non-Bayesian methods, and the answers are not appreciably different.” Laplace had investigated such problems in the early 1800s and discovered the same thing. “I’m not really a Bayesian,” Wallace said. “I haven’t done much more than The Federalist, but . . . when you have large numbers of parameters, of unknowns to deal with, the difference between Bayes and non-Bayes grows immense.”
Mosteller was open to suggestion. Unlike Savage, Lindley, Raiffa, and Schlaifer, he was no fervent Bayesian. He was an eclectic problem solver who liked any technique that worked. He accepted the validity of both kinds of probability: probability as degrees of belief and probability as relative frequency. The problem, as he saw it, was that treating a unique event like “Hamilton wrote paper No. 52” was difficult with sampling theory. Bayes’ degrees of belief would be harder to specify but more widely applicable.
In addition, Mosteller liked grappling with critical social issues, not avoiding controversy by taking refuge in textbook examples. Reality added a certain frisson to a problem. As he put it, difficulties found “in the armchair” seldom resembled those in the field or scientific laboratory. In later years, when asked why he spent so much time on The Federalist papers, Mosteller would point to “that Bayesian hothouse” in the Harvard Business School and to the fact that Raiffa and Schlaifer did not deal with difficult problems or complex data. The discrepancy between too many Bayesian theories and too few practical applications disturbed Mosteller.
With Savage’s encouragement, Mosteller and Wallace started their quest to “apply a 200-year-old mathematical theorem to a 175-year-old historical problem.” In the process, Mosteller would organize the largest civilian application of Bayes’ rule since Laplace studied babies and Jeffreys analyzed earthquakes. Not coincidentally, Wallace and Mosteller would resort to so-called high-speed computers.
They had seemingly vast amounts of data: 94,000 words definitely written by Hamilton and 114,000 by Madison. Of these, they would ignore substantive words like “war,” “executive,” and “legislature” because their use varied by essay topic. They would keep “in,” “an,” “of,” “upon,” and other context-free articles, prepositions, and conjunctions. As work progressed, however, they became dissatisfied with their “rather catch-as-catch-can methods.” In a critical decision, they decided to turn their bagatelle of a historical problem into a serious empirical comparison between Bayesian and frequentist methods of data analysis. The Federalist papers would become a way to test Bayes’ rule as a discrimination method.
By 1960 Wallace was working full-time on developing a Bayesian analysis of The Federalist, working out the details for a mathematical model of their data. Given so many variables, Wallace and Mosteller would be mining the papers like low-grade ore, successively sifting, processing text in waves, and discarding words of no help. They would use numerical probabilities to express degrees of belief about propositions like “Hamilton wrote paper No. 52” and then use Bayes’ theorem to adjust these probabilities with more evidence.
Initially, they assumed 50–50 odds for the authorship of each paper. Then they used the frequencies of 30 words—one at a time—to improve their first probability estimate. In a two-stage analysis, they first looked at the 57 papers of known authorship and then used that information to analyze the 12 papers of unknown parentage. As their calculations became increasingly complicated, Wallace developed new algebraic methods for dealing with difficult integrals; his asymptotic approximations would form much of the project’s statistical meat.
Mosteller and Wallace adopted another important simplification. Instead of using the mathematical vocabulary of probabilities, they adopted the everyday language of odds. They were expert mathematicians, but they found odds easier computationally and intuitively.
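The machinery described above can be sketched in a few lines. This is a simplified illustration, not Mosteller and Wallace's actual model (they used more elaborate distributions and many marker words): it assumes a Poisson model for one marker word's count and hypothetical, invented per-1,000-word rates, then updates the 50–50 prior odds in the odds form they favored.

```python
import math

# Hypothetical per-1,000-word rates for a single marker word such as "upon".
# These numbers are illustrative only, not Mosteller and Wallace's estimates.
RATE_HAMILTON = 3.0
RATE_MADISON = 0.2

def posterior_odds(count, words, prior_odds=1.0):
    """Return updated odds (Madison : Hamilton) after observing `count`
    occurrences of the marker word in a paper of `words` words,
    assuming a simple Poisson model for word occurrences."""
    lam_h = RATE_HAMILTON * words / 1000.0
    lam_m = RATE_MADISON * words / 1000.0
    # Likelihood ratio P(count | Madison) / P(count | Hamilton);
    # the count! terms of the Poisson densities cancel.
    log_lr = count * math.log(lam_m / lam_h) - (lam_m - lam_h)
    return prior_odds * math.exp(log_lr)

# A 2,000-word paper that never uses the word shifts even (1:1) odds
# strongly toward Madison; frequent use shifts them toward Hamilton.
odds_absent = posterior_odds(count=0, words=2000)
odds_frequent = posterior_odds(count=6, words=2000)
```

In the odds form, evidence from each marker word enters as a multiplicative likelihood ratio, which is why Mosteller and Wallace found it computationally and intuitively easier than carrying full probabilities.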
During the decade it took to analyze The Federalist, Mosteller kept busy on a number of fronts. His back-to-back investigations of the Truman election and the Kinsey report had turned him into the person to call when something went wrong. Over the years Harvard would ask Mosteller to chair four departments: social relations (as acting chair); statistics (as its founder); and, in the university’s School of Public Health, first biostatistics and then health policy and management.
He persuaded Harvard (“VERY SLOWLY,” he wrote a friend)3 to establish the statistics department. Joining the 1950s and 1960s push to apply mathematical models to social problems, he researched game theory, gambling, and learning, where Bayes’ theorem itself was not used but served as a metaphor for thinking and accommodating new ideas. Eventually, Mosteller’s interest in those fields waned, and he moved on to other pastures. Education remained a major interest. As part of the government’s post-Sputnik push to teach students at every level about probability, Mosteller wrote two textbooks about frequentism and Bayes’ rule for high school students. In 1961 he taught probability and statistics on NBC’s early-morning Continental Classroom series; his lectures were viewed by more than a million people and taken for credit by 75,000. In medical research Mosteller pioneered meta-analysis and strongly advocated randomized clinical trials, fair tests of medical treatments, and data-based medicine. He was one of the first to conduct large-scale studies of placebo effects, evaluations of many medical centers, collaborations between physicians and statisticians, and the use of large, mainframe computers.
How did Mosteller juggle a massive Bayesian analysis on top of his other work? He looked tubby and rumpled, but he was a superb organizer and utterly unfazed by controversy. He was genial; he engaged critics with a touch of humor, and he seemed to believe they were entitled to opinions he disagreed with. He was also patience itself and, with a “Gee, golly, shucks smile,” explained things over and over again.4 Doctrinaire only about grammar and punctuation, he once wrote a student about his paper, “I am in a lonely hotel room, surrounded by whiches.”5
Mosteller was also very hard working. He once posted a sign in his office, “What have I done for statistics in the past hour?”6 For a short time he recorded what he did every 15 minutes in the day. He was also, as he himself pointed out, the beneficiary of a bygone era. His wife cared for everything in his life except his professional work, and several women at Harvard devoted their careers to his success, including his longtime secretary and Cleo Youtz, his statistical assistant for more than 50 years.
Mosteller was also said to involve in his research any student who stepped within 50 feet of his office. Persi Diaconis, a professional magician for several years after he ran away from home at the age of 14, met Mosteller on his first day as a graduate student. Mosteller interviewed him in his low-key, friendly way: “I see you’re interested in number theory. I’m interested in number theory. Could you help me do this problem?”7 They published the result together, and Diaconis went on to have a stellar career at Stanford University. It was said that Mosteller’s last collaborator on Earth was a mountaintop hermit and that Mosteller climbed the peak and persuaded him to cowrite a book. In truth, Mosteller collaborated only with people he considered worth his time, including Diaconis, John Tukey, the future U.S. senator Daniel Patrick Moynihan, economist Milton Friedman, and statisticians like Savage. Finally, Mosteller credited his success to the fact that “somewhere along the road I got this new way of doing scholarly work in little groups of people.”8 Colleagues and research assistants would divide up an interesting topic, meet every week or two, pass memos back and forth, and in four or five years publish a book. Working four or five such groups at a time, Mosteller wrote or cowrote 57 books, 36 reports, and more than 360 papers, including one with each of his children.
Four years into The Federalist project, he and Wallace had a breakthrough. A historian, Douglass Adair, tipped them off to a study from 1916 showing that Hamilton used “while” whereas Madison wrote “whilst.” Adair’s news told them that screening for words in the anonymous 12 papers might pan out.
The problem was that “while” and “whilst” were not used often enough to identify all 12 papers, and a printer’s typographical error or an editor’s revision could have tainted the evidence. Single words here and there would not suffice. Wallace and Mosteller would have to accumulate a large number of marker words like “while” and “whilst” and determine their frequency in each and every Federalist paper.
Mosteller started The Federalist project armed with a slide rule, a Royal manual typewriter, an electric 10-key adding machine, and a 10-bank electric Monroe calculator that multiplied and divided automatically. He greatly missed a device he had enjoyed using at MIT: an overhead projector. He and Wallace soon realized they would have to use computers. Harvard had no computer facilities of its own and relied on a cooperative arrangement with MIT. Mosteller and Wallace wound up using a substantial chunk of Harvard’s allocation. Today, a desktop computer would be faster. Also slowing them down was the fact that Fortran was only two years old, awkward, and hard to program for words. They were “straddling the introduction of the computer for linguistic counts and the old hand-counting methods, with the disadvantages of both.”
Substituting student brawn for computer power, Mosteller organized an army of 100 helpers—80 Harvard students plus 20 others. For several years, his soldiers punched cards for the supposedly high-speed computer.
Programming proceeded so slowly that Youtz speeded up the search for marker words by organizing a makeshift concordance by hand. Threading electric typewriters with rolls of adding machine tape, students transcribed one word to a line, cut the tape into strips with one word per strip, sorted them alphabetically, and counted the strips. Once someone, forever unnamed, exhaled deeply and blew a cloud of statistical confetti. Within a few days, however, Youtz’s typists discovered that Hamilton used “upon” twice per paper while Madison seldom used it at all.
Soon thereafter, students punching computer cards found a fourth marker word, “enough,” also used by Hamilton but never by Madison. By now, Mosteller had four words pointing to Madison as the author of the disputed papers. But here again, by editing one another’s papers, Madison and Hamilton could have melded their styles. As Mosteller and Wallace concluded, “We are not faced with a black-or-white situation, and we are not going to provide an absolutely conclusive settlement. . . . Strong confidence in conclusions is the most an investigation can be expected to offer.” They needed to extend their evidence and measure its strength.
Venturing beyond the simplest Bayesian applications, they found themselves knee-deep “in a welter of makeshifts and approximations.” Bayesian methods for data analysis were in their infancy, so they had to develop new theory, computer programs, and simple techniques, like those frequentists had developed before 1935. Discarding whatever did not work, they wound up tackling 25 difficult technical problems and publishing four meaty, parallel studies comparing Bayes, frequentism, and two simplified Bayesian approaches. The calculations became so complicated that they checked their work by hand, largely on slide rules. “Upon” was their single best discriminator by a factor of four. Other good markers were “whilst,” “there,” and “on.” “Even a motherly eye,” Mosteller wrote, could see disparities between “may” and “his.”
Their biggest surprise was that prior distributions—the bête noire of Bayes’ rule—were not of major importance. “This was awesome,” Mosteller said. “We went in thinking that everything depended on what sort of prior information you used. Our finding says that statisticians should remove some attention from the choice of prior information and pay more attention to their choice of models for their data.”9
In the end, they included prior odds only because Bayes’ theorem called for them, and they arbitrarily assigned equal odds to Madison and Hamilton. The prior odds turned out to be so unimportant that Mosteller and Wallace could have let their readers name them. In a timely analogy, Mosteller noted that a single measurement by an astronaut on the moon would be enough to overwhelm any expert’s prior opinion about the depth of the dust on the lunar surface. So why include the prior at all? Because with controversial or scanty data, enormous amounts of observational material might be needed to decide the issue.
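The insensitivity to priors described above is easy to demonstrate. In this hypothetical sketch, the accumulated word evidence is summarized as a single Bayes factor for Madison over Hamilton (240 echoes the odds for their weakest case, paper No. 55, but the swept prior values are purely illustrative):

```python
# Hypothetical strength of the accumulated word evidence, expressed as a
# Bayes factor (likelihood ratio) favoring Madison over Hamilton.
BAYES_FACTOR = 240.0

# Sweep prior odds from strongly pro-Hamilton (0.01) through even (1.0)
# to strongly pro-Madison (100.0).
posteriors = {prior: prior * BAYES_FACTOR for prior in (0.01, 1.0, 100.0)}

# Even a 10,000-fold swing in the prior odds never flips the verdict:
# every posterior in the sweep still favors Madison.
all_favor_madison = all(odds > 1 for odds in posteriors.values())
```

When the data speak this loudly, the choice of prior is swamped, which is exactly why Mosteller and Wallace could have let their readers name the prior odds themselves.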
When Mosteller and Wallace published their report in 1964 they announced happily that “we tracked the problems of Bayesian analysis to their lair and solved the problem of the disputed Federalist papers.” The odds were “satisfyingly fat” that Madison wrote all 12 of them. Their weakest case was No. 55, where the odds were 240 to 1 in Madison’s favor.
Their publication was the fourth about Bayes’ rule to appear in a three-year period. It followed works by Jeffreys, Savage, and Raiffa and Schlaifer. Of all these works, only Mosteller and Wallace’s had dared treat real issues with Bayesian statistics and modern computers. Mosteller had thought about The Federalist papers for 23 years and worked on them for ten. It would long remain the largest problem publicly attacked by Bayesian methods.
The work is still admired as a profound analysis of a difficult problem. Reviewers used words like “ideal,” “impressive,” “impeccable,” and “Herculean.” As late as 1990 it was considered the greatest single Bayesian case study.
Despite the razzle-dazzle, no one followed it up. No one—not even Mosteller and Wallace—tried to confirm its results by reanalyzing the material using non-Bayesian methods. Who else could organize committees of collaborators and armies of students to empower a 1960s computer to solve large and complex problems?
As for Mosteller himself, what was his reaction? Satisfaction, of course. But also the feeling that, as a friend said, “Hey, here’s a good application, a new technique, let’s try it, and then find others too.” A number of his books described Bayesian techniques, and to this day Diaconis regards Mosteller as a committed Bayesian who tried hard to get social scientists to accept Bayesian methods. Yet Mosteller never again devoted an entire project to Bayes. While his famous study of poverty in collaboration with Senator Moynihan influenced public policy, it did not make major use of Bayes’ rule. When a student won a prize for a Bayesian Ph.D. thesis, Mosteller wrote a congratulatory note: “I think Bayesian methods are about to take off. But then I’ve been saying that for twenty-five years.”10
13. the cold warrior
The Federalist project impressed the still small world of professional statisticians, but John Tukey, a star from the world of Cold War spying, would give Bayes’ rule the opportunity to demonstrate its prowess before 20 million American television viewers. But would the statistical community learn from Tukey’s example that Bayes had come of age? That was the question.
Bayes’ big chance at fame commenced in 1960 with the race between Senator Kennedy and Vice President Richard M. Nixon to succeed Eisenhower as president. The election was far too close to call, but the nation’s three major television networks competed fiercely to be the first to declare the victor. Winning the race would translate into prestige and advertising dollars. For the National Broadcasting Company (NBC), there was a bonus: the opportunity to show off the latest computers made by its corporate owner, Radio Corporation of America (RCA).
NBC’s Huntley–Brinkley Report, the nation’s top-rated TV news program, reached 20 million viewers each weeknight. Co-anchors Chet Huntley, broadcasting from New York, and David Brinkley, from Washington, were celebrities; more people could recognize them than Cary Grant or James Stewart. NBC’s fast-paced format and informal nightly sign-off—“Good night, Chet,” “Good night, David”—transformed TV news.
Despite the program’s popularity, memories of the polling industry’s spectacularly poor performances in the 1936 and 1948 elections as well as the extraordinarily close Nixon–Kennedy race made network executives nervous. In preparation for Election Day, NBC went looking for someone to help it predict the winner. In the first of a series of surprises, the network approached a Princeton University professor, John W. Tukey.
Today Tukey is best known for the terms “bit” and “software,” and few outside statistics and engineering recognize his name. But he was a man of staggering accomplishments in the cloak-and-dagger world of military research, especially in code breaking and high-tech weaponry. He worked two jobs 30 miles apart: at Princeton University, where he was a professor of statistics, and at AT&T’s Bell Laboratories, then widely considered the finest industrial research laboratory in the world. From these vantage points, he advised five successive U.S. presidents, the National Security Agency, and the Central Intelligence Agency.