But a Jeopardy project, he realized, could provide a starring role for UIMA. Blue J would be more than a single machine. His team would pull together an entire conglomeration of Q-A approaches. The machine would house dozens, even hundreds of algorithms, each with its own specialty, all of them chasing down answers at the same time. A couple of the jury-rigged algorithms that James Fan had ginned up could do their thing. They would compete with others. Those that delivered good answers for different types of questions would rise in the results—a bit like the best singers in the Handel sing-along. As each one amassed its record, it would gain stature in its specialty and be deemed clueless in others. Loser algorithms—those that failed to produce good results in even a single niche—would be ignored and eventually removed. (Each one would have to prove its worth in at least one area to justify its inclusion.) As the system learned which algorithms to pay attention to, it would grow smarter. Blue J would evolve into an ecosystem in which the key to survival, for each of the algorithms, would be to contribute to correct responses to Jeopardy clues.
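The ecosystem described above can be sketched in a few lines of code. This is a hypothetical illustration of the idea, not IBM's actual design: every name, the per-category win/loss record, and the pruning rule are invented for the sketch, assuming only that each algorithm votes for a candidate answer and gains or loses stature by category as its record accumulates.

```python
# A toy weighted-ensemble of answer-scoring algorithms. Each algorithm's
# vote is weighted by its past accuracy in the clue's category; algorithms
# that never produce a right answer anywhere can be pruned away.
from collections import defaultdict

class Ensemble:
    def __init__(self):
        # record[algo][category] = [right, total]
        self.record = defaultdict(lambda: defaultdict(lambda: [0, 0]))

    def weight(self, algo, category):
        right, total = self.record[algo][category]
        return right / total if total else 0.5  # unproven algorithms start neutral

    def answer(self, clue_category, votes):
        # votes: {algorithm_name: candidate_answer}
        scores = defaultdict(float)
        for algo, candidate in votes.items():
            scores[candidate] += self.weight(algo, clue_category)
        return max(scores, key=scores.get)

    def update(self, clue_category, votes, correct):
        # amass each algorithm's record for this category
        for algo, candidate in votes.items():
            rec = self.record[algo][clue_category]
            rec[1] += 1
            if candidate == correct:
                rec[0] += 1

    def prune(self):
        # keep only algorithms that have been right in at least one category
        return [a for a, cats in self.record.items()
                if any(right > 0 for right, _ in cats.values())]
```

As the `update` calls accumulate, reliable specialists gain weight in their niches while consistent losers fall out at the next `prune`, which is the "survival" dynamic the passage describes.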
While part of his team grappled with Blue J’s architecture, Ferrucci had several researchers trolling the Internet for Jeopardy data. If this system was going to compete with humans in the game, it would require two types of information. First, it needed Jeopardy clues, thousands of them. This would be the machine’s study guide—what those in the field of machine learning called a training set. A human player might watch a few Jeopardy shows to get a feel for the types of clues and then take some time to study country capitals or brush up on Shakespeare. The computer would do the same work statistically. Each Jeopardy clue, of course, was unique and would never be repeated, so it wasn’t a question of learning the answers. But a training set would orient the researchers. Given thousands of clues, IBM programmers could see what percentage of them dealt with geography, U.S. presidents, words in a foreign language, soap operas, and hundreds of other categories—and how much detail the computer would need for each. The clue asking which presidential candidate carried New York State in 1948, for example (“Who is Thomas Dewey?”), indicated that the computer would have to keep track of White House losers as well as winners. What were the odds of a presidential loser popping up in a clue?
Digging through the training set, researchers could also rank various categories of puzzles and word games. They could calculate the odds that a Jeopardy match would include a puzzling Before & After, asking, for example, about the “Kill Bill star who played 11 seasons behind the plate for the New York Yankees” (“Who is Uma Thurman Munson?”). A rich training set would give them a chance to scrutinize the language in Jeopardy clues, including abbreviations, slang, and foreign words. If the machine didn’t recognize AKA as “also known as” or “oops!” as a misunderstanding, if it didn’t recognize “sayonara,” “au revoir,” “auf Wiedersehen,” and hundreds of other expressions, it could kiss entire Jeopardy categories goodbye. Without a good training set, researchers might be filling the brain of their bionic student with the wrong information.
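The statistical profiling of a training set described above amounts to simple frequency counting. A minimal sketch, with invented sample clues and category labels (the real training set held thousands):

```python
# Illustrative only: profile a training set of Jeopardy clues by how often
# each category appears, the way the researchers did statistically.
from collections import Counter

training_set = [
    {"category": "U.S. PRESIDENTS", "clue": "He carried New York State in 1948"},
    {"category": "WORLD CAPITALS",  "clue": "This capital sits on the Vltava"},
    {"category": "U.S. PRESIDENTS", "clue": "First president born in a hospital"},
]

counts = Counter(c["category"] for c in training_set)
total = len(training_set)
for category, n in counts.most_common():
    print(f"{category}: {n / total:.0%} of clues")
```

With enough clues, the same counts reveal which topics deserve deep coverage and which, like presidential losers, appear just often enough that they cannot be ignored.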
Second, and nearly as important, they needed data on the performance of past Jeopardy champs. How often did they get the questions right? How long did they take to buzz in? What were their betting strategies in Double Jeopardy and Final Jeopardy? These humans were the competition, and their performance became the benchmark for Blue J.
In the end, it didn’t take a team of sleuths to track down much of this data. With a simple Internet search, they found a Web site called J! Archive, a trove of historical Jeopardy data. A labor of love by Jeopardy fans, the site detailed every game in the show’s history, with the clues, the contestants, their answers—and even the comments by Alex Trebek. Here were more than 180,000 clues, hundreds of categories, and the performance of thousands of players, from first-time losers to champions like Brad Rutter and Ken Jennings.
In these early days, the researchers focused only on Jennings. He was the gold standard. And with records of his seventy-four games—more than four times as many as any other champion—they could study his patterns, his strengths and vulnerabilities. They designed a chart, the Jennings Arc, to map his performance: the percentage of questions on which he won the buzz and his precision on those questions. Each of his games was represented by a dot, and the best ones, with high buzz and high accuracy, floated high on the chart to the extreme right. His precision averaged 92 percent and occasionally reached 100 percent. He routinely dominated the buzz, in one game answering an astounding 75 percent of the clues. For each of these games, the IBM team calculated how well a competitor would have to perform to beat him. The numbers varied, but it was clear that their machine would need to win the buzz at least half the time, get about nine of ten right—and also win its share of Daily Doubles.
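The Jennings Arc calculation can be approximated with a toy scoring model. This is an assumption-laden sketch, not the IBM team's actual formula: it treats every clue as equal value, credits a right answer with the clue's value and docks a wrong one the same amount, so a player's expected score per clue is buzz rate times (precision minus error rate).

```python
# Toy model of the Jennings Arc: each game is a (buzz_rate, precision)
# point, and expected score per clue = buzz_rate * (2 * precision - 1).
def expected_score_per_clue(buzz_rate, precision):
    return buzz_rate * (2 * precision - 1)

# Hypothetical points for two of Jennings's dominant games.
jennings_games = [(0.75, 0.92), (0.60, 1.00)]

# A machine that wins the buzz half the time at 90 percent precision:
machine = expected_score_per_clue(0.50, 0.90)
for buzz, prec in jennings_games:
    human = expected_score_per_clue(buzz, prec)
    print(f"Jennings {human:.2f}/clue vs. machine {machine:.2f}/clue")
```

Under this model, Jennings's best games still outscore a 50-percent-buzz, 90-percent-precision machine, which is consistent with the team's finding that the required numbers varied from game to game.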
In the early summer of 2007, after the bake-off, the Jeopardy team marked the performance of the Piquant system on the Jennings Arc. (Basement Baseline, which lacked a confidence gauge, did not produce enough data to be charted there.) Piquant’s performance was so far down and to the left of Ken Jennings’s dots, it appeared to be … well, exactly what it was: an alien species—and not destined for Jeopardy greatness.
When word of this performance spread around the Yorktown labs, it only fueled the concerns that Ferrucci’s team was heading for an embarrassing fall—if it ever got that far. Mark Wegman, then the head of computer science at IBM Research, described himself as someone who’s “usually wildly optimistic about technology.” But when he saw the initial numbers, he said, “I thought there was a 10 percent chance that in five years we could pull it off.”
For Ferrucci, Piquant’s failure was anything but discouraging. It gave him the impetus to march ahead on a different path, toward Blue J. “This was a chance to do something really, really big,” he said. However, he wasn’t sure his team would see it this way. So he gathered the group of twelve in a small meeting room at the Hawthorne labs. He started by describing the challenges ahead. It would be a three- to five-year project, similar in length to a military deployment. It would be intense, and it could be disastrous. But at the same time they had an opportunity to do something memorable. “We could sit here writing papers for the next five years,” he said, “or we build an entirely new type of computer.” He introduced, briefly, a nugget of realpolitik. There would be no other opportunities for them in Q-A technologies within IBM. He had effectively engineered a land grab, putting every related resource into his Jeopardy ecosystem. If they wanted to do this kind of science, he said, “this was the only place to be.”
Then he went around the room with a simple question: “Are you in or are you out?”
One by one, the researchers said yes. But their response was not encouraging. The consensus was that they could build a machine that could compete with—but probably not beat—a human champion. “We thought it could earn positive money before getting to Final Jeopardy,” said Chu-Carroll, one of the only holdovers on Ferrucci’s team from the old TREC unit. “At least we wouldn’t be kicked off the stage.”
With this less than ringing endorsement, Ferrucci sent word to Paul Horn that the Jeopardy challenge was on. He promised to have a machine, within twenty-four months, that could compete against average human players. Within thirty-six to forty-eight months, his machine, he said, would beat champions one-quarter of the time. And within five to seven years, the Jeopardy machine would be “virtually unbeatable.” He added that this final goal might not be worth pursuing. “It is more useful,” he said, “to create a system that is less than perfect but easily adapted to new areas.” A week later, Ferrucci and a small team from IBM Research flew to Culver City, to the Robert Young Building on the Sony lot. There they’d see whether Harry Friedman would agree to let the yet-to-be-built Blue J play Jeopardy on national television.
4. Educating Blue J
JENNIFER CHU-CARROLL, sitting amid a clutter of hardware and piles of paper in her first-floor office in the Hawthorne labs, wondered what in the world to teach Blue J. How much of the Bible
would it have to know? The Holy Book popped up in hundreds of Jeopardy clues. But did that mean the computer needed to know every psalm, the laws of Deuteronomy, Jonah’s thoughts and prayers while inside the whale? Would a dose of Dostoevsky help? She could feed it The Idiot, Crime and Punishment, or any of the other classics that might pop up in a Jeopardy clue. When it came to traditional book knowledge, feeding Blue J’s brain was nearly as easy as Web surfing.
This was in July 2007. Chu-Carroll’s boss, David Ferrucci, and the small IBM contingent had just flown back from Culver City, where they had been given a provisional thumbs-up from Harry Friedman. A man-machine match would take place, perhaps in late 2010 or early 2011. IBM needed the deadline to mobilize the effort within the company and to establish it as a commitment, not just a vague possibility. Jeopardy, for its part, would bend the format a bit for the machine. The games would not include audio or visual clues, where contestants have to watch a snippet of video or recognize a bar of music. And they might let the machine buzz electronically instead of hitting a physical button. The onus, according to the preliminary agreement, was on IBM to come up with a viable player in time for the match.
It was up to Chu-Carroll and a few of her colleagues to map out the machine’s reading curriculum. Chu-Carroll had black bangs down to her eyes and often wore sweatshirts and jeans. Like practically everyone else on the team, she had a doctorate in computer science, hers from the University of Delaware. She had worked for five years at Lucent Technologies’ Bell Labs, in New Jersey. There she taught machines how to participate in a dialogue and how to modulate their voices to communicate different signals. (Lucent was developing automated call centers.) When Chu-Carroll came to IBM in 2001, joining her husband, Mark, she plunged into building Q-A technologies. (Mark later left for Google.)
In mid-2007, the nascent Jeopardy system wasn’t really a machine at all. Fragments of a Jeopardy player existed as a collection of software programs, some of them hand-me-downs from the recent bake-off, all of them easy to load onto a laptop. As engineers pieced together an architecture for the new system, Chu-Carroll pondered a fundamental question: How knowledgeable did this computer really need to be? One of its forebears, Basement Baseline, had hunted down its answers on the Web. Blue J wouldn’t have that luxury. So as Chu-Carroll sat down for Blue J’s first day of school, her pupil was a tabula rasa.
She quickly turned to a promising resource. James Fan had already demonstrated the value of Wikipedia for answering a small subset of Jeopardy clues. “It related to popular culture and what people care about,” Chu-Carroll said. So she set to work extracting much of the vast corpus of Wikipedia articles from the online site and putting them into a format that Blue J could read.
But how about books? Gutenberg.org offered hundreds of classics for free, along with a ranking of the most popular downloads. Chu-Carroll could feed any or all of them to Blue J. After all, words didn’t take up much space. Moby Dick, for example, was only 1.5 megabytes. Photographs on new camera phones packed more bits than that. So one day she downloaded the Gutenberg library and gave Blue J a crash course on the Great Books.
“It wasn’t a smart move,” she later admitted. “One of the most popular books on Gutenberg was a manual for surgeons from a hundred years ago.” This meant that when faced with a clue about modern medicine, Blue J could be consulting a source unschooled in antibiotics, CAT scans, and HIV, one fixated instead on scurvy, rickets, and infections (not to mention amputations) associated with trench warfare. “I’m not sure why that book is so popular,” said Chu-Carroll. “Are people doing at-home surgery?”
Whatever their motives, most human readers knew exactly what they were getting when downloading the medical relic. If not, they quickly found out. In addition to surgical descriptions, the book contained extraordinary pictures of exotic and horrifying conditions, such as elephantiasis of the penis. Aside from these images, the book’s interest was largely historical. Humans had little trouble placing it in this context. Chu-Carroll’s pupil, by contrast, had a maddening habit endemic among its ilk: It tended to take every source at its word.
Blue J’s literal-mindedness posed the greatest challenge at every step of its education. Finding suitable data for this gullible machine was only the first task. Once Blue J had its source material, from James Joyce to archives of the Boing-Boing blog, the IBM team would have to teach it to make sense of all those words: to place names and facts into context and to come to grips with how they were related to one another. Hamlet, to pick one example, was related not only to his mother, Gertrude, but also to Shakespeare, Denmark, Elizabethan literature, a famous soliloquy, and themes ranging from mortality to self-doubt. Preparing Blue J to navigate all of these connections for virtually every entity on earth, factual or fictional, would be the machine’s true education. The process would involve creating, testing, and fine-tuning thousands of algorithms. The final challenge would be to prepare it to play the game itself. Blue J would have to come up with answers it could bet on within three to five seconds. That job was still a year or two down the road.
For now, Chu-Carroll found herself contemplating academic heresy. Like college students paging through CliffsNotes or surfing Wikipedia, she began to wonder whether Blue J should bother with books at all. Each one contained so many passages that could be misconstrued. In the lingo of her colleagues, books had a sky-high noise-to-signal ratio. The signals, the potential answers, swam in oceans of words, so-called noise.
Imagine Blue J reading Mark Twain’s Huckleberry Finn. In one section, Huck and the escaped slave, Jim, are contemplating the night sky:
We had the sky up there, all speckled with stars, and we used to lay on our backs and look up at them, and discuss about whether they was made or only just happened. Jim he allowed they was made, but I allowed they happened; I judged it would have took too long to MAKE so many. Jim said the moon could a LAID them; well, that looked kind of reasonable, so I didn’t say nothing against it, because I’ve seen a frog lay most as many, so of course it could be done.
Assuming that Blue J could slog through the idiomatic language—no easy job for a computer—it could “learn” something about the cosmos. Both characters, it appeared, agreed that the moon, like a frog, could have laid the stars. It seemed “reasonable” to them, a conclusion Blue J would be likely to respect. A human would put that passage into context, learn something about Jim and Huck, and perhaps laugh. Blue J, it was safe to say, would never laugh. It would likely take note of an utterly fallacious parent-offspring relationship between the moon and the stars and record it. No doubt its mad hunt through hundreds of sources to answer a single Jeopardy clue would bring in much more astronomical data and statistically overwhelm this passage. In time, maybe the machine would develop trusted sources for such astronomical questions and wouldn’t be so foolish as to consult Huck Finn and Jim about the cosmos. But still, most books had too many words—too much noise—for the job ahead.
This led to an early conclusion about a Jeopardy machine. It didn’t need to know books, plays, symphonies, or TV sitcoms in great depth. It only needed to know about them. Unlike literature students, the machine would not be pressed to compare and contrast the themes of family or fate in Hamlet with those in Oedipus Rex. It just had to know they were there. When it came to art, it wouldn’t be evaluating the brushwork of Velázquez and Manet. It only needed to know some basic biographical facts about them, along with a handful of their most famous paintings. Ken Jennings, Ferrucci’s team learned, didn’t prepare for Jeopardy by plowing through big books. In Brainiac, he described endless practice with flash cards. The conclusion was clear: The IBM team didn’t need a genius. They had to build the world’s most impressive dilettante.
From their statistical analysis of twenty thousand Jeopardy clues drawn randomly from the past twenty years, Chu-Carroll and her colleagues knew how often each category, from U.S. presidents to geography, was likely to pop up. Cities and countries each accounted for a bit
more than 2 percent of the clues; Shakespeare and Abraham Lincoln were regulars on the big board. The team proceeded to load Blue J with the data most likely to contain the answers. It was a diet full of lists, encyclopedia entries, dictionaries, thesauruses, newswire articles, and downloaded Web pages. Then they tried out batches of Jeopardy clues to see how it fared.
Blue J was painfully slow. Laboring on a single computer, it created a logjam of data, sending information through the equivalent of a skinny straw when it needed a fire hose. It didn’t have enough bandwidth (the rate of data transfer). And it lacked computing muscle. This led to delays, or latency. It often took an hour or two to puzzle out the meaning of the clue, dig through its data to come up with a long list of possible answers, or “candidates,” evaluate them, choose one, and decide whether it was confident enough to bet. The best way to run the tests, Chu-Carroll and her colleagues eventually realized, was to ask Blue J questions before lunch. The machine would cogitate; they’d eat. It was an efficient division of labor.
An hour or so after they returned to the office, Blue J’s list of candidate answers would arrive. Many of them were on target, but some were ridiculous. One clue, for example, read: “A 2000 ad showing this pop sensation at ages 3 & 18 was the 100th ‘got milk?’ ad.” Blue J, after exhaustive work, missed the answer (“Who is Britney Spears?”) by a mile, suggesting “What is Holy Crap?” as a possibility. The machine also volunteered that the diet of grasshoppers was “kosher” and that the Russian word for goodbye (“What is do svidaniya?”) was “cholesterol.”