The Theory That Would Not Die


by Sharon Bertsch McGrayne


  In 1992 Heckerman moved from Stanford to Microsoft, where he founded and manages the Machine Learning and Applied Statistics Group of Microsoft Research. The problems there are very different. Because Stanford had plenty of experts but little data, he says, he built Bayesian nets with priors based on expert opinion: “But Microsoft had lots of data and only a few experts, so we got into combining expert knowledge with data.” One of Microsoft’s first Bayesian applications let parents of sick children type in their children’s symptoms and learn the best course of action. In 1996 Bill Gates, cofounder of Microsoft, made Bayes headline news by announcing that Microsoft’s competitive advantage lay in its expertise in Bayesian networks.

  That same year Heckerman, Robert Rounthwaite, Joshua Goodman, Eric Horvitz, and others began investigating Bayesian antispam techniques. Remember vVi-@-gra, l0w mOrtg@ge rates, PARTNERSHIP INVESTMENT, and !!!! PharammcyByMAIL? Advertisements that are unwanted and often pornographic and fraudulent are sent to millions without their permission. Spam soon accounted for more than half of all mail on the Internet, and some e-mail users spent half an hour a day weeding it out.

  Bayesian methods attack spam by using words and phrases in the message to determine the probability that the message is unwanted. An e-mail’s spam score can soar near certainty, 0.9999, when it contains phrases like “our price” and “most trusted”; coded words like “genierc virgaa”; and uppercase letters and punctuation like !!!. High-scoring messages are automatically banished to junk mail files. Users refine their own filters by reading low-scoring messages and either keeping them or sending them to trash and junk files. This use of Bayesian optimal classifiers is similar to the technique used by Frederick Mosteller and David Wallace to determine who wrote certain Federalist papers.
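
  The arithmetic behind such a score can be sketched in a few lines. The toy filter below is not Microsoft’s (the word probabilities and the 0.5 prior are invented for illustration); it simply combines per-word spam and non-spam frequencies with Bayes’ rule under a naive independence assumption:

    import math

    # Illustrative estimates of P(word | spam) and P(word | legitimate),
    # as they might be learned from a corpus of labeled mail.
    p_word_given_spam = {"price": 0.20, "trusted": 0.15, "meeting": 0.01}
    p_word_given_ham  = {"price": 0.02, "trusted": 0.03, "meeting": 0.10}
    p_spam = 0.5  # assumed prior probability that a message is spam

    def spam_score(words):
        # Work in log space to avoid underflow, then convert back to a probability.
        log_spam = math.log(p_spam)
        log_ham = math.log(1 - p_spam)
        for w in words:
            if w in p_word_given_spam:
                log_spam += math.log(p_word_given_spam[w])
                log_ham += math.log(p_word_given_ham[w])
        # Bayes' rule: compare P(spam | words) with P(legitimate | words).
        return 1 / (1 + math.exp(log_ham - log_spam))

    print(spam_score(["price", "trusted"]))  # high score: banished to junk mail
    print(spam_score(["meeting"]))           # low score: kept in the inbox

  A real filter learns these word frequencies from millions of messages and from the mail each user rescues or discards, which is what lets the scores climb toward 0.9999 for the worst offenders.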

  Bayesian theory is firmly embedded in Microsoft’s Windows operating system. In addition, a variety of Bayesian techniques are involved in Microsoft’s handwriting recognition; recommender systems; the question-answering box in the upper right corner of a PC’s monitor screen; a data-mining software package for tracking business sales; a program that infers the applications that users will want and preloads them before they are requested; and software to make traffic jam predictions for drivers to check before their commute.

  Bayes was blamed—unfairly, say Heckerman and Horvitz—for Microsoft’s memorably annoying paperclip, Clippy. The cartoon character was originally programmed using Bayesian belief networks to make inferences about what a user knew and did not know about letter writing. After the writer reached a certain threshold of ignorance and frustration, Clippy popped up cheerily with the grammatically improper observation, “It looks like you’re writing a letter. Would you like help?” Before Clippy was introduced to the world, however, non-Bayesians had substituted a cruder algorithm that made Clippy pop up irritatingly often. The program was so unpopular it was retired.

  Bayes and Laplace would probably be appalled to learn that their work is heavily involved in selling products. Much online commerce relies on recommender filters, also called collaborative filters, built on the assumption that people who agreed about one product will probably agree on another. As the e-commerce refrain goes, “If you liked this book/song/movie, you’ll like that one too.” The updating used in machine learning does not necessarily follow Bayes’ theorem formally but “shares its perspective.” A $1 million contest sponsored by Netflix.com illustrates the prominent role of Bayesian concepts in modern e-commerce and learning theory. In 2006 the online film-rental company launched a search for the best recommender system to improve its own algorithm. More than 50,000 contestants from 186 countries vied over the four years of the competition. The AT&T Labs team organized around Yehuda Koren, Christopher T. Volinsky, and Robert M. Bell won the prize in September 2009.

  Interestingly, although no contestants questioned Bayes as a legitimate method, almost none wrote a formal Bayesian model. The winning group relied on empirical Bayes but estimated the initial priors according to their frequencies. The film-rental company’s data set was too big and too filled with unknowns for anyone to—almost instantaneously—create a model, assign priors, update posteriors repeatedly, and recommend films to clients. Instead, the winning algorithm had a Bayesian “perspective” and was laced with Bayesian “flavors.” However, by far the most important lesson learned from the Netflix competition originated as a Bayesian idea: sharing.
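
  One of those Bayesian flavors, priors estimated from the data’s own frequencies, can be illustrated with a rating average. In the sketch below (a toy example, not the winning team’s code), a film with only a few ratings is shrunk toward the overall mean, while a heavily rated film keeps most of its own average; the prior strength k is an assumption chosen by hand rather than estimated:

    # Empirical-Bayes-style shrinkage of a film's mean rating toward the global mean.
    # k acts like k "pseudo-ratings" placed at the global mean.
    ratings = {"film_a": [5, 5, 4],
               "film_b": [3, 4, 3, 4, 3, 4, 3, 4, 3, 4]}

    all_ratings = [r for rs in ratings.values() for r in rs]
    global_mean = sum(all_ratings) / len(all_ratings)
    k = 5  # assumed prior strength

    for film, rs in ratings.items():
        shrunk = (sum(rs) + k * global_mean) / (len(rs) + k)
        print(film, round(shrunk, 2))
    # film_a (3 ratings) is pulled strongly toward the overall mean;
    # film_b (10 ratings) barely moves.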

  Volinsky had used Bayesian model averaging for sharing and averaging complementary models while working in 1997 on his Ph.D. thesis about predicting the probability that a patient will have a stroke. But the Volinsky and Bell team did not employ the method directly for Netflix. Nevertheless, Volinsky emphasized how “due to my Bayesian Model Averaging training, it was quite intuitive for me that combining models was going to be the best way to improve predictive performance. Bayesian Model Averaging studies show that when two models that are not highly correlated are combined in a smart way, the combination often does better than either individual model.” The contest publicized Bayes’ reputation as a fertile approach to learning far beyond mere Bayesian technology.
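
  The combining step Volinsky describes can also be sketched in miniature. In Bayesian model averaging each model’s prediction is weighted by its posterior probability; the version below approximates those weights with the common exp(-BIC/2) shortcut and uses invented numbers throughout:

    import math

    # Two toy models' predictions for the same three cases, plus illustrative BIC scores.
    predictions = {"model_1": [3.2, 4.1, 2.8],
                   "model_2": [3.6, 3.9, 3.1]}
    bic = {"model_1": 120.4, "model_2": 118.9}

    # Posterior model weights, assuming equal prior probability for each model.
    raw = {m: math.exp(-b / 2) for m, b in bic.items()}
    total = sum(raw.values())
    weights = {m: w / total for m, w in raw.items()}

    averaged = [sum(weights[m] * predictions[m][i] for m in predictions)
                for i in range(3)]
    print(weights)
    print(averaged)  # a blend that often beats either model on its own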

  Web users employ several forms of Bayes to search through the billions of documents and locate what they want. Before that can happen, though, each document must be profiled or categorized, organized, and sorted, and its probable interconnectedness with other documents must be calculated. At that point, we can type into a search engine the unrelated keywords we want to appear in a document, for example, “parrots,” “madrigals,” and “Afghan language.” Bayes’ rule can winnow through billions of web pages and find two relevant ones in 0.31 seconds. “They’re inferential problems,” says Peter Hoff at the University of Washington. “Given that you find one document interesting, can you find other documents that will interest you too?”

  When Google starts projects involving large amounts of data, its giant search engines often try naïve Bayesian methods first. Naïve Bayes assumes simplistically that every variable is independent of the others; thus, a patient’s fever and elevated white blood cell counts are treated as if they had nothing to do with each other. According to Google’s research director, Peter Norvig, “There must have been dozens of times when a project started with naïve Bayes, just because it was easy to do and we expected to replace it with something more sophisticated later, but in the end the vast amount of data meant that a more complex technique was not needed.”
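
  Written out, the simplifying assumption is just a factorized likelihood. For the medical example, a naive Bayes model scores each diagnosis as

    P(\text{disease} \mid \text{fever}, \text{high white cell count}) \;\propto\; P(\text{disease})\, P(\text{fever} \mid \text{disease})\, P(\text{high white cell count} \mid \text{disease}),

  treating the two symptoms as independent given the disease even though, in a real patient, they plainly are not.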

  Google also uses Bayesian techniques to classify spam and pornography and to find related words, phrases, and documents. A very large Bayesian network finds synonyms of words and phrases. Instead of downloading dictionaries for a spelling checker, Google conducted a full-text search of the entire Internet looking for all the different ways words can be spelled. The result was a flexible system that could recognize that “shaorn” should have been “Sharon” and correct the typo.
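
  The spelling corrector can be described in Bayesian terms as well: among candidate words, pick the one that is both common on the web and close to what was typed. A stripped-down sketch (the word counts and the one-transposition error model below are stand-ins for what Google actually mines from its index) might look like this:

    # Pick the most probable intended word for a typo, Bayes-style:
    # score each candidate by P(candidate) * P(typo | candidate).
    word_counts = {"sharon": 90000, "shorn": 12000, "shaken": 40000}  # illustrative
    total = sum(word_counts.values())

    def edits_away(typo, candidate):
        # Crude stand-in for an error model: 1 if the candidate is a simple
        # transposition of the typo, 2 otherwise.
        swaps = [typo[:i] + typo[i + 1] + typo[i] + typo[i + 2:]
                 for i in range(len(typo) - 1)]
        return 1 if candidate in swaps else 2

    def correct(typo):
        def score(candidate):
            prior = word_counts[candidate] / total            # P(candidate)
            likelihood = 0.1 ** edits_away(typo, candidate)   # assumed P(typo | candidate)
            return prior * likelihood
        return max(word_counts, key=score)

    print(correct("shaorn"))  # "sharon"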

  While Bayes has helped revolutionize modern life on the web, it is also helping to finesse the Tower of Babel that has separated linguistic communities for millennia. During the Second World War, Warren Weaver of the Rockefeller Foundation was impressed with how “a multiplicity of languages impedes cultural interchange between the peoples of the earth and is a serious deterrent to international understanding.”6 Struck by the power of mechanized cryptography and by Claude Shannon’s new information theory, Weaver suggested that computerized statistical methods could treat translation as a cryptography problem. In the absence of computer power and a wealth of machine-readable text, Weaver’s idea lay fallow for decades.

  Ever since, the holy grail of translators has been a universal machine that can transform written and spoken words from one language into any other. As part of this endeavor, linguists like Noam Chomsky developed structural rules for English sentences, subjects, verbs, adjectives, and grammar but failed to produce an algorithm that could explain why one string of words makes an English sentence while another string does not.

  During the 1970s IBM had two competing teams working on a similar problem, speech recognition. One group, filled with linguists, studied the rules of grammar. The other group, led by Mercer and Brown, who later went to RenTech, was filled with mathematically inclined communications specialists, computer scientists, and engineers. They took a different tack, replaced logical grammar with Bayes’ rule, and were ignored for a decade.

  Mercer’s ambition was to make computers do intelligent things, and voice recognition seemed to be the way to make this happen. For both Mercer and Brown speech recognition was a problem about taking a signal that had passed through a noisy channel like a telephone and then determining the most probable sentence that the speaker had in mind. Ignoring grammar rules, they decided to figure out the statistical probability that words and phrases in one language would wind up as particular words or phrases in another. They did not have to know any language at all. They were simply computing the probability of a single word given all the words that preceded it in a sentence. For example, by looking at pairs of English words they realized that the word after “the” was highly unlikely to be “the” or “a,” somewhat more likely to be “cantaloupe,” and still more likely to be “tree.”
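
  The pair-counting itself is simple enough to sketch; the toy corpus below stands in for the mountains of text IBM actually used:

    from collections import Counter, defaultdict

    # Count word pairs and turn them into conditional probabilities
    # P(next word | previous word), the quantity Mercer and Brown describe.
    corpus = "the tree fell near the tree by the river".split()

    pair_counts = defaultdict(Counter)
    for prev, nxt in zip(corpus, corpus[1:]):
        pair_counts[prev][nxt] += 1

    def p_next(prev, nxt):
        total = sum(pair_counts[prev].values())
        return pair_counts[prev][nxt] / total if total else 0.0

    print(p_next("the", "tree"))  # relatively likely
    print(p_next("the", "the"))   # never seen in this corpus

  In practice the counts come from millions of words, and unseen pairs get small smoothed probabilities rather than zeros.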

  “It all hinged on Bayes’ theorem,” Mercer recalled. “We were given an acoustic output, and we’d like to find the most probable word sequence, given the acoustic sequence that we heard.” Their prior knowledge consisted of the most probable order of the words in an English sentence, which they could get by studying enormous amounts of English text.
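
  In symbols, this is Bayes’ theorem in the form the group later celebrated on its T-shirts: given the acoustic signal A, choose the word sequence W that maximizes

    \hat{W} \;=\; \arg\max_{W} P(W \mid A) \;=\; \arg\max_{W} P(A \mid W)\, P(W),

  where P(A \mid W) comes from an acoustic model and the prior P(W) comes from those enormous amounts of English text.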

  The big problem throughout the 1970s was finding enough data. They needed bodies of text focused on a fairly small topic, but nothing as adult as the New York Times. At first they worked their way through old, out-of-copyright children’s books; 1,000 words from a U.S. Patent Office experiment with laser technology; and 60 million words of Braille-readable text from the American Printing House for the Blind.

  At an international acoustics, speech, and signal processing meeting the group wore identical T-shirts printed with the words “Fundamental Equation of Speech Recognition” followed by Bayes’ theorem. They developed “a bit of swagger, I’m ashamed to say,” Mercer recalled. “We were an obnoxious bunch back in those days.”

  In a major breakthrough in the late 1980s they gained access to French and English translations of the Canadian parliament’s daily debates, about 100 million words in computer-readable form. From them, IBM extracted about three million pairs of sentences, almost 99% of which were actual translations of one another. It was a Rosetta Stone in English and French. “You had a day’s worth of English and the corresponding day’s worth of French, so things were lined up to that extent, but we didn’t know that this sentence or word went with this or that sentence or word. For example, while the English shouted, ‘Hear! Hear!,’ the French said, ‘Bravo!’ So we began working on getting a better alignment of the sentences. We were using the same methods as in speech recognition: Bayes’ theorem and hidden Markov models.” The latter are particularly useful for recognizing patterns that involve likely time sequences, for example, predicting one word in a sentence based on the previous word.

  In a landmark paper in 1990 the group applied Bayes’ theorem to full sentences. There was a small probability that the sentence, “President Lincoln was a good lawyer,” means “Le matin je me brosse les dents” but a relatively large probability that it means “Le président Lincoln était un bon avocat.” After that paper, several of the leading machine translation systems incorporated Bayes’ rule.
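
  In the same noisy-channel spirit, translating a French sentence f into English means choosing the English sentence e that maximizes

    \hat{e} \;=\; \arg\max_{e} P(e \mid f) \;=\; \arg\max_{e} P(f \mid e)\, P(e),

  with the translation model P(f \mid e) estimated from aligned sentence pairs and the prior P(e) estimated from English text alone. (The notation is the standard statement of the approach, not a quotation from the 1990 paper.)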

  In 1993, lured by lucre and the challenge, Mercer and Brown moved from IBM and machine translation to RenTech, where they became vice presidents and co–portfolio managers for technical trading. So many members of IBM’s speech recognition group joined them that critics complain they set back the field of machine translation five years.

  After the 9/11 disaster and the start of the war in Iraq, the military and the intelligence communities poured money into machine translation. DARPA, the U.S. Air Force, and the intelligence services want to ease the burden on human translators working with such little-studied languages as Uzbek, Pashto, Dari, and Nepali.

  Machine translation got still another boost when Google trawled the Internet for more Rosetta Stone texts: news stories and documents published in both English and another language. United Nations documents alone contributed 200 billion words. By this time the web was churning out enormous amounts of text, free for the asking. Combing English words on the web, Google counted all the times that, for example, a two-word sequence in English was “of the.” To determine which words in the English sentence correspond to which words in the other language, Google uses Bayes to align the sentences in the most probable fit.

  The blue ribbons Google won in 2005 in a machine translation contest sponsored by the National Institute of Standards and Technology showed that progress was coming, not from better algorithms, but from more training data. Computers don’t “understand” anything, but they do recognize patterns. By 2009 Google was providing online translations in dozens of languages, including English, Albanian, Arabic, Bulgarian, Catalan, Chinese, Croatian, Czech, Danish, Dutch, Estonian, Filipino, Finnish, and French.

  The Tower of Babel is crumbling.

  Even as Bayes’ rule was improving human communications, it was returning full circle to the fundamental question that had occupied Bayes, Price, and Laplace: How do we learn? Using Bayes’ rule, more than half a million students in the United States learn the answer to that question each year: we combine old knowledge with new. Approximately 2,600 secondary schools teach algebra and geometry with Bayesian computer programs developed at Carnegie Mellon University since the late 1980s. The software also teaches French, English as a second language, chemistry, physics, and statistics.

  The programs, called Cognitive Tutors, are based on John R. Anderson’s idea that Bayes is a surrogate for the way we naturally learn, that is to say, gradually. The ability to accumulate evidence is an optimal survival strategy, but our brains cannot assign a high priority to everything. Therefore, most students must see and work many times with a mathematical concept before they can retrieve and apply it at will. Our ability to do so depends on how frequently and recently we studied the concept.

  In addition to viewing Bayes as a continuous learning process, Cognitive Tutors depend on Bayes’ theorem for calculating each student’s “skillometer,” the probability that the individual has mastered a topic and is ready for a new challenge. Ten years after this double-edged Bayesian approach was launched, its students were learning as much or more than conventionally taught pupils—in a third the time.
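
  A skillometer of this sort can be sketched as a running Bayesian update on the probability of mastery. The sketch below follows the general style of Bayesian knowledge tracing; the slip, guess, and learning parameters are purely illustrative, not Carnegie Mellon’s actual values:

    # One practice opportunity at a time, update P(student has mastered the skill).
    # All parameter values below are invented for illustration.
    p_slip = 0.10   # P(wrong answer | skill mastered)
    p_guess = 0.20  # P(correct answer | skill not mastered)
    p_learn = 0.15  # P(acquiring the skill during one practice step)

    def update_mastery(p_know, correct):
        # Bayes' rule: revise P(mastered) in light of the observed answer...
        if correct:
            posterior = p_know * (1 - p_slip) / (
                p_know * (1 - p_slip) + (1 - p_know) * p_guess)
        else:
            posterior = p_know * p_slip / (
                p_know * p_slip + (1 - p_know) * (1 - p_guess))
        # ...then allow for learning during the practice step itself.
        return posterior + (1 - posterior) * p_learn

    p = 0.3  # prior: probably not yet mastered
    for answer in [True, True, False, True]:
        p = update_mastery(p, answer)
        print(round(p, 3))  # rises with correct answers, dips after a mistake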

  The flowering of Bayesian networks, neural nets, and artificial intelligence nets has helped neuroscientists study how the brain’s neurons process information that arrives directly and indirectly, a little at a time, in tiny, often contradictory packets. As a computational tool and learning theory, Bayes has been involved in mapping the brain and analyzing its circuitry as well as in decoding signals from neurons and harnessing them to make better prostheses and robots.

  Hundreds of megabits of sensory information bombard the waking brain every second. From that stream of data, 10 billion nerve cells extract information and correct prior understanding several times every 100 milliseconds. Discerning which sensory stimulus has caused which neuronal response is a difficult problem: the neurons fire unpredictably, scientists cannot monitor all of them at once, and the brain combines cues from multiple sources. The vision regions of our brains, for example, create three-dimensional objects and scenes. To do so, they rely on our prior knowledge about the regularities in our environment—for example, that light generally shines from above and that straight lines and 90-degree angles are apt to be man-made. But our brains refine that knowledge with new data pouring in about depth, contours, symmetry, lines of curvature, texture, shading, reflectance, foreshortening, and motion.

  In 1998 the neurostatistician Emery N. Brown of MIT and Massachusetts General Hospital realized that Bayesian methods might deal with these uncertainties. Using Kalman filters, he and the MIT neuroscientist Matthew A. Wilson described a rat’s brain as it processed information about the animal’s location in its environment. Approximately 30 so-called place neurons in the hippocampus keep a rat informed about its location. As a laboratory rat foraged in a box scattered randomly with chocolate tidbits, electrodes implanted in its brain imaged some of the place neurons as they fired. A Bayesian filter sequentially updated the rat’s position in the box. The researchers could see neither the rat nor its box, but by watching the neurons fire they could track the rat’s movements. Thanks to Bayes, Brown could reconstruct the path of the chocolate-loving rat with only a fifth or a tenth of the neurons that previous methods had required.
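
  The flavor of such a decoder can be conveyed with a deliberately simplified sketch: a one-dimensional track instead of a box, three place cells with Poisson firing, and a grid of candidate positions instead of the Kalman-type filter Brown and Wilson actually used. Every number below is invented for illustration.

    import math

    # Decode a rat's position on a 1-D track from place-cell spike counts.
    positions = [i * 0.1 for i in range(11)]   # candidate positions along the track
    preferred = [0.2, 0.5, 0.8]                # each cell's place field center

    def expected_rate(cell, x):
        # Gaussian tuning curve: the cell fires fastest near its preferred position.
        return 0.1 + 10.0 * math.exp(-((x - preferred[cell]) ** 2) / 0.02)

    def poisson_likelihood(count, rate):
        return math.exp(-rate) * rate ** count / math.factorial(count)

    belief = [1.0 / len(positions)] * len(positions)  # uniform prior over positions

    spike_counts = [0, 8, 0]  # one time bin: only the middle cell is very active
    for i, x in enumerate(positions):
        for cell, count in enumerate(spike_counts):
            belief[i] *= poisson_likelihood(count, expected_rate(cell, x))
    total = sum(belief)
    belief = [b / total for b in belief]

    print(positions[belief.index(max(belief))])  # most probable position: near 0.5

  Repeating the update bin after bin, and adding a model of how the animal moves between bins, gives the sequential filter that let Brown and Wilson track the rat from its neurons alone.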

  To explore the practicality of using the living brain to power prostheses and robots, Brown’s statistical method was replicated with a few dozen motor neurons, Bayesian algorithms, and Bayesian particle filters. The goal is to develop an artificial arm that can smoothly reach, rotate the hand, move fingers independently, and grasp and retrieve objects. Illustrating the possibilities of the approach, a rhesus monkey in Andrew B. Schwartz’s laboratory at the University of Pittsburgh gazed longingly at an enticing treat. With his arms immobilized in plastic tubes and his mouth salivating, the motor neurons in his brain fired repeatedly, activating a robotic arm. The monkey’s control was so precise it could reach out with the robotic arm, nab the treat, and fetch it back to eat. Frequentist methods can deal with simple backwards-and-forwards motion, but Bayesian neurostatisticians believe their algorithms will be powerful and flexible enough to control a robotic arm’s position, rotation, acceleration, velocity, momentum, and grasp.

  These attempts to capitalize on all the information available in neurons raise questions: What does the brain itself do? Does it maximize the information it gets from an uncertain world by performing Bayesian-like computations? In discussions of these questions, Bayes has become more than just an aid for data analysis and decision making. It has become a theoretical framework for explaining how the brain works. In fact, as such, the “Bayesian Brain” has become a metaphor for a human brain that mimics probability.

 
