Final Jeopardy

by Stephen Baker


  Crain had fallen into the Jeopardy gig months earlier through a chance encounter with David Shepler. Crain was working on a pilot documentary called EcoFreaks, telling the stories of people working at the fringes of the environmental movement. He said it involved spending one evening in New York with “freegans,” Dumpster-divers devoted to reusing trash. On the next assignment, Crain and the crew drove north to the college town of New Paltz, New York. There Shepler—with the attention to detail he later demonstrated managing the Jeopardy project—had built a three-story house that would generate as much energy as it consumed, a so-called zero net-energy structure. While showing Crain the triple-pane windows, geothermal exchange unit, and solar panel inverter, Shepler asked the young actor if he might be interested in hosting a series of Jeopardy shows. “I said ‘yes’ before he even had a chance to finish the sentence,” Crain said.

  On occasion, Crain could irritate Ferrucci by making jokes at Watson’s expense. The computer often opened itself to such jibes by mauling pronunciation, especially of foreign words. And it had the unfortunate habit of spelling out punctuation it didn’t understand. One day, in the category Hair-y Situation, Watson said, “Let’s play Hair-dash-Y Situation for two hundred.” Crain imitated this bionic voice, getting a laugh from the small entourage of technicians and scientists. Ferrucci shook his head and muttered. Later, when Crain imitated a mangled name, Ferrucci channeled his irritation into feigned outrage: “He’s making fun of him! It’s like making fun of someone with a speech impediment!” (Once, Ferrucci said, he brought his wife and two daughters to a sparring session. When one of the girls heard Crain mimicking Watson, she said, “Daddy, why is that man being so mean to Watson?”)

  As the day’s sparring session began, Crain gave the first two human contestants a chance to acquaint themselves with the buzzers. They tried several dozen old clues. Then he asked if they wanted Watson to join them. They nodded. “Okay, Burn, let him loose,” Crain said. Burn Lewis, the member of Ferrucci’s team who orchestrated the show from a tiny control room, pressed some buttons. The third competitor, an empty presence at the podium bearing the nameplate Watson, assumed its position. It might as well have been a ghost.

  In the first game, it was clear the humans were dealing with a prodigious force that behaved differently from them. While humans almost always oriented themselves in a category by starting with the easier $200 clues, Watson began with the $1,000 clues at the bottom of the board and worked its way up. There was a logic to this. While humans heard all the answers, right and wrong, and learned from them, Watson was deaf to the proceedings. If it won the buzz, answered the clue, and got to pick another one, it could assume that it had been right. But that was its only feedback. Watson was senseless to all of the signals its human competitors displayed—the smiles, the gasps, the confident tapping of fingers, the halting speech and darting eyes spelling panic. More important, it lost out on game intelligence. If a human answered a clue incorrectly, Watson was liable to buzz on what was known as the rebound and deliver the very same incorrect answer. What was worse, Watson had a far harder time orienting itself to the categories. How would it understand the Hair-y Situation category without hearing the other contestants’ correct answers? During these weeks, Ferrucci’s team was talking with the Jeopardy executives about giving Watson an electronic feed with the text of the correct answer after each clue. But for the time being, the machine was learning nothing. So why not start each category with the priciest clues? The high bets might spook the humans. What’s more, IBM’s statistical analysis indicated that Watson was likely to land more Daily Doubles in those pricier realms.
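
  Such a board-selection policy boils down to simple frequency counting. A rough sketch of the kind of analysis IBM's strategists might have run, in Python, with invented records standing in for the historical clue archive:

```python
from collections import Counter

# Invented records of past clues: (row, was_daily_double), where row 1 is
# the cheapest clue in a category and row 5 the priciest. Real numbers
# would come from an archive of past games.
history = [
    (1, False), (2, False), (2, True), (3, False), (3, True),
    (4, True), (4, True), (4, False), (5, True), (5, False),
]

dd_counts = Counter(row for row, is_dd in history if is_dd)
row_counts = Counter(row for row, _ in history)

# Daily Double rate per row; a selection policy would open each category
# at the rows where Daily Doubles historically cluster.
rates = {row: dd_counts[row] / row_counts[row] for row in sorted(row_counts)}
for row, rate in rates.items():
    print(f"row {row}: Daily Double rate {rate:.0%}")

best_rows_first = sorted(rates, key=rates.get, reverse=True)
print("pick order:", best_rows_first)
```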

  Watson started off with Capital Cities, an apparently straightforward category that seemed to promise the machine’s favorite type of answer: factoids. It jumped straight to the $1,000 clue: “The Rideau Canal separates this North American capital into upper and lower regions.” Todd Crain read the clue, and Lewis, in the control room, hit the button to turn on the light around the clue, opening it up for buzzes. Within milliseconds Watson had the clue all to itself.

  “Watson?” Crain said.

  “What is Ottawa?” Watson answered. With that, it raced through the entire category, with each correct answer reinforcing its confidence that it would know the others. Crain read each clue, the humans squeezed the button, and Watson beat them to it. It had no trouble with the South American city founded in 1535 by Pizarro (“What is Lima?”) or the capital that fell to British troops in 1917 and to U.S. troops on April 9, 2003 (“What is Baghdad?”). These were factoids, each one wrapped in the most helpful data for Watson: hard facts unencumbered by humor, slang, or the cultural references that could tie a cognitive engine into knots. No, this category delivered a steady stream of dates, distances, specific names and numbers. For a Jeopardy computer, it was comfort food.

  “Very good, Watson!” Crain said.

  But after that promising start, Watson started to falter. Certain categories were simply hard for it to figure out. One was called I’ll Take the Country from Here, Thanks. When Watson processed the $400 clue, “Nicolas Sarkozy from Jacques Chirac,” it didn’t know how to answer. In a few milliseconds it could establish that both names corresponded to presidents of France. But it did not understand the category well enough to build up confidence in an answer (“What is France?”). And it was not getting any orientation from the action of the game. Humans owned that category. Watson sat it out.

  Then, in the category Collegiate Rhyme Time, Watson showed its stuff, but not enough to win. One clue asked for the smell of the document you receive upon graduating. Watson understood the clue perfectly and searched for synonyms for “document,” then attempted to match them with words related to “smell.” The best it could come up with was “What is bill feel?” (“What is diploma aroma?”).
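
  The underlying search is easy to caricature. A toy sketch of the rhyme-time problem, with hand-coded word lists standing in for a real thesaurus and a crude suffix comparison standing in for phonetic rhyme matching:

```python
# A toy sketch of the rhyme-time problem. Hand-coded word lists stand in
# for a real thesaurus, and a crude suffix check stands in for phonetic
# rhyme matching; "feel" is included to show how a loose synonym slips in.
DOCUMENT_WORDS = ["diploma", "bill", "certificate", "sheepskin"]
SMELL_WORDS = ["aroma", "odor", "scent", "whiff", "feel"]

def naive_rhyme(a: str, b: str, suffix_len: int = 3) -> bool:
    """Crude stand-in for phonetic rhyming: compare word endings."""
    return a[-suffix_len:] == b[-suffix_len:]

candidates = [(doc, smell)
              for doc in DOCUMENT_WORDS
              for smell in SMELL_WORDS
              if naive_rhyme(doc, smell)]
print(candidates)  # [('diploma', 'aroma')] -- "bill feel" fails the rhyme test
```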

  The real problems started when Watson found itself facing Greg Lindsay, a journalist and a two-time Jeopardy champion. Lindsay, thirty-two, had spent much of his time at the University of Illinois on the Quiz Bowl circuit, where he occasionally ran into Ken Jennings. In order to spar with Watson, Lindsay had to sign David Shepler’s nondisclosure agreement. IBM wanted to keep Harry Friedman and his minions in the dark, as much as possible, about Watson’s strengths and vulnerabilities. And Friedman didn’t want the clues escaping onto the Internet before they aired on television. This meant that even if Lindsay defeated Watson, he wouldn’t be able to brag about it to the Quiz Bowl community. For his crowd, this would be the equivalent of besting Kobe Bryant in a one-on-one game of hoops, then having to pretend it hadn’t happened.

  Even so, Lindsay came with a clear strategy to defeat Watson. He quickly saw that Watson mastered factoids but struggled with humor and irony, so he steered clear of Watson-friendly categories. He figured Watson would clean up on Name that Continent, picking out the right landmasses for Estado de Matto Grosso (“What is South America?”) and the Filchner Ice Shelf (“What is Antarctica?”). The category Superheroes Names through Pictures looked much more friendly to humans. Sure enough, Watson was bewildered by clues such as “X marks the spot, man, when this guy opens his peeper” (“What is cyclops?”). Band Names also posed problems for Watson because the clues, like this one, were so murky: “The soul of a deceased person, thankful to someone for arranging his burial” (“What is the Grateful Dead?”). If the clue had included the lead guitarist Jerry Garcia or a famous song by the band, Watson could have identified it in an instant. But clues based on allusions, not facts, left it vulnerable.

  More important, since the currency they were playing with was worthless, Lindsay decided to bet the maximum on each Daily Double. If he blew it, he lost nothing. And since he wasn’t on national television, his reputation wouldn’t suffer. As he put it, “There’s no societal fear.” Yet if he won his big bets, he’d be positioned to withstand Watson’s inevitable charges through categories it understood. “I knew he would go on tears,” Lindsay said. “I had to build up big leads when I had the chance.” He aced his big bets and ended up thrashing Watson three times, once scoring an astronomical $59,999 of funny money. (The Jeopardy single-game record was $52,000 until Ken Jennings crushed it, winning $75,000 in his thirty-eighth game.)
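
  The logic behind those max bets is simple expected value. A toy calculation, with illustrative numbers: when a wrong answer carries no cost outside the game, expected winnings scale linearly with the bet, so the largest allowed bet dominates whenever the odds of answering correctly exceed one half:

```python
# Toy expected-value comparison behind the max-bet policy. With funny money,
# a wrong answer costs nothing outside the game, so there is no reason to
# hedge. Probabilities and bets are illustrative.
def expected_value(bet: int, p_correct: float) -> float:
    return p_correct * bet - (1 - p_correct) * bet  # = (2p - 1) * bet

for bet in (1_000, 5_000, 10_000):
    print(bet, expected_value(bet, p_correct=0.7))
# EV grows linearly with the bet, so whenever p > 0.5 the maximum bet wins
```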

  Fortunately for Lindsay, he got Watson on what soon appeared to be a very bad day for the bionic star. The speech defect returned. When naming “one of the two monarchies that border China,” the computer said, “What is Bhutand?” The game judge, Karen Ingraffea, consulted with David Shepler. From the observation room, Ferrucci could see them talking but could not hear a word. Shepler nodded grimly. Then he delivered the verdict to Todd Crain. Again Watson was docked, this time $1,000.

  “This is silliness!” Ferrucci said.

  His concern deepened as Watson started to strike out on questions that should have been easy. One Final Jeopardy clue, in the category 20th-Century People, looked like a cinch. It said: “The July 1, 1946, cover of Time magazine featured him with the caption, ‘All matter is speed and flame’” (“Who is Albert Einstein?”). Watson displayed its top answers on its electronic panel. They were all ridiculous, and to the machine’s credit, it had rock-bottom confidence in them. First was Time 100, a list of influential people that at one time included Einstein. But Watson should have known that the clue was asking for a “him,” not an “it.” For more than two years, Ferrucci’s language programmers had been preparing the machine to parse these clues. They had defined and mapped the twenty-five hundred things Jeopardy clues ask about. The most common of these LATs, or lexical answer types, were looking for a male person, a “he.” Determining that this clue was asking for a man’s name should not have been so hard.
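
  In code, a drastically simplified version of LAT detection might look like the sketch below. Watson's real pipeline relied on full syntactic parsing; these few invented patterns are only meant to show the idea:

```python
import re

# A drastically simplified LAT detector. Watson's real pipeline used a
# full syntactic parse; these few invented patterns just show the idea.
LAT_PATTERNS = [
    (r"\b(he|him|his)\b", "male person"),
    (r"\b(she|her|hers)\b", "female person"),
    (r"\bthis\s+(\w+)\b", None),  # fall back to the noun after "this"
]

def detect_lat(clue: str) -> str:
    text = clue.lower()
    for pattern, label in LAT_PATTERNS:
        match = re.search(pattern, text)
        if match:
            return label if label else match.group(1)
    return "unknown"

clue = ("The July 1, 1946, cover of Time magazine featured him "
        "with the caption, 'All matter is speed and flame'")
print(detect_lat(clue))  # -> "male person", so "Time 100" should be rejected
```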

  Watson’s second choice, even more absurd, was David Koresh, the founder of the apocalyptic Branch Davidian cult near Waco, Texas. Koresh appeared on the May 3, 1993, cover of Time, days after burning down his compound and immolating everyone in it, including himself. No doubt the “flame” in the clue led Watson to Koresh. But Koresh was not born until thirteen years after Einstein appeared on the Time cover. Watson’s other stabs were “stroke” and the painter Andrew Wyeth.

  At this point, Ferrucci’s frustration boiled over. He wasn’t so bothered by the wild guesses, like David Koresh. The system had come up with a few answers that were somehow connected to the clue—a common magazine cover or flame. The confidence engine had done its job. After studying them, it had found little to go on and declared them worthless. “Watson’s low-confidence answers are just garbage,” Ferrucci had told the contestants earlier.

  But why didn’t Watson find the right answer? For a computer with access to millions of documents and lists, the July 1, 1946, cover profile in the nation’s leading newsmagazine shouldn’t be a deep mystery.

  Ferrucci concluded that something was wrong with Watson, and he wanted the team in the War Room at Hawthorne to get working on it right away. Yet even in one of the world’s leading technology companies, it wasn’t clear how to send the digital record of the computer’s misadventures over the Internet. Ferrucci asked Eric Brown, then Eddie Epstein, and then Brown again: “How do I get the XML file to Hawthorne?” For Ferrucci, this failed game was brimming with vital feedback. It could point the Hawthorne team toward crucial fixes. The idea that his team could not respond immediately to whatever ailed Watson filled him with dread. Just imagine if Watson reprised this disastrous performance in its nationwide debut with Jennings and Rutter. “HOW DO I GET THIS FILE TO HAWTHORNE?” he shouted. No one had a quick answer. Ferrucci continued to thunder while, on the other side of the window, Todd Crain, Watson, and the other Jeopardy players blithely continued their game. (Watson, for one, was completely unfazed.) Finally Brown confirmed that he could plug a thumb drive into one of Watson’s boxes, download the game data, and e-mail it to the team in Hawthorne. It promised to be a long night in the War Room, as the researchers diagnosed Watson’s flops and struggled to restore its cognitive mojo.

  Cloistered in a refrigerated room on the third floor of the Hawthorne labs stood another version of Watson. It turned out that the team needed two Watsons: the game player, engineered for speed, and this slower, steadier, and more forgiving system for development. The speedy Watson, its algorithms deployed across more than 2,000 processors, was a finicky beast and nearly impossible to tinker with. This slower Watson kept running while developers rewrote certain instructions, swapped out one algorithm for another, or refined its betting strategy. It took forty minutes to run a batch of questions, but it could handle two hundred at a time. Unlike the fast machine, it created meticulous records, and it permitted researchers to experiment, section by section, with its answering process. Because the team could fiddle with the slower machine, it was always up-to-date, usually a month or two ahead of its speedy sibling. After the debacle against Lindsay, IBM could only hope that the slower, smarter Watson wouldn’t have been so confused.
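
  That workflow, running a fixed question set through two versions of the pipeline and comparing the results, is a standard regression harness. A minimal sketch, with hypothetical stand-in answer functions:

```python
# A minimal regression harness of the kind the slow Watson enabled: run a
# fixed question set through a baseline and a candidate pipeline, batch by
# batch, and compare accuracy. Pipelines here are hypothetical stand-ins.
def accuracy(pipeline, questions):
    correct = sum(1 for clue, answer in questions if pipeline(clue) == answer)
    return correct / len(questions)

def run_regression(baseline, candidate, questions, batch_size=200):
    for start in range(0, len(questions), batch_size):
        batch = questions[start:start + batch_size]
        old, new = accuracy(baseline, batch), accuracy(candidate, batch)
        verdict = "improved" if new > old else "no gain"
        print(f"batch {start // batch_size}: {old:.1%} -> {new:.1%} ({verdict})")

# Tiny demonstration with two invented clues.
questions = [("This is the capital of France", "paris"), ("2 plus 2", "4")]
baseline = lambda clue: "paris"                                # old pipeline
candidate = lambda clue: "paris" if "France" in clue else "4"  # tweaked pipeline
run_regression(baseline, candidate, questions)
```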

  Within twenty-four hours, Ferrucci’s team had run all of that day’s games on the slow machine. The news was encouraging. It performed 10 percent better on the clues. The biggest difference, according to Eric Brown, was that some of the clues were topical, and speedy Watson’s most recent data came from 2008. “We got creamed on a couple categories that required much more current information,” he said.

  Other recent adjustments in the slow Watson helped it deal with chronology. Keeping track of facts as they change over time is a chronic problem for AI systems, and Watson was no exception. In the recent sparring session, it had mistaken a mid-nineteenth-century novel for a late-twentieth-century pop duo. Yet when Ferrucci analyzed the slower Watson’s performance on the problematic Oliver Twist clue, he was relieved to see that a recent tweak had helped the machine match the clue to the right century. This fix in “temporal reasoning” pushed the Pet Shop Boys answer way down its list, from first to number 79. Watson’s latest top answer—“What is magician?”—was still wrong but not as laughable. “It still knows nothing about Oliver Twist,” Ferrucci wrote in a late-night e-mail.
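
  A toy version of such a temporal-consistency fix: attach whatever dates are known to each candidate answer, and penalize the ones that clash with the era the clue implies. The scores, dates, and candidate list below are invented for illustration:

```python
# A toy version of a temporal-consistency fix: penalize candidates whose
# known dates clash with the era the clue implies. Scores, dates, and the
# candidate list are invented for illustration.
CANDIDATE_DATES = {
    "Pet Shop Boys": 1981,       # pop duo formed in 1981
    "magician": None,            # no date attached
    "the Artful Dodger": 1838,   # Oliver Twist was serialized 1837-39
}

def temporal_penalty(candidate, clue_year, window=50):
    year = CANDIDATE_DATES.get(candidate)
    if year is None:
        return 0.0               # nothing known, no penalty
    return 0.0 if abs(year - clue_year) <= window else -0.5

def rescore(ranked, clue_year):
    rescored = [(round(score + temporal_penalty(name, clue_year), 2), name)
                for score, name in ranked]
    return sorted(rescored, reverse=True)

ranked = [(0.31, "Pet Shop Boys"), (0.22, "magician"), (0.05, "the Artful Dodger")]
print(rescore(ranked, clue_year=1838))
# The pop duo drops to the bottom; "magician" rises to the top, still wrong
```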

  While Ferrucci and a handful of team members attended every sparring match in the winter of 2010, Jennifer Chu-Carroll generally stayed away. For her, their value was in the data they produced, not the spectacle, and much less the laughs. As she saw it, the team had a long list of improvements to make before autumn. By that point, the immense collection of software running Watson would be locked down—frozen. After that, the only tinkering would be in a few peripheral applications, like game strategy. But the central operations of the computer, like those of other mission-critical systems, would go through little but testing during the months leading up to the Jeopardy showdown. Engineers didn’t dare tinker with Space Shuttle software once the vessel was headed toward the launch pad. Watson would get similar treatment.

  With each sparring session, however, the list of fixes was getting longer. For each fix, the team had to weigh the time it would take against the possible gain in performance. “It’s triage,” Chu-Carroll said. During one sparring session, for example, Watson mispronounced wienerschnitzel, neglecting to say the “W” as a “V.” Was it worth the trouble to fine-tune its German phonetics? Not unless someone could do it in a hurry.

  In one Final Jeopardy, Watson inched closer to the fix-it threshold. Asked to identify the sole character in the American Film Institute’s list of the fifty greatest heroes who was not portrayed by a human, the computer came back with “Who is Buffy the Vampire Slayer?” The audience laughed, and Todd Crain slapped his forehead, saying, “Oh, Watson, for the love of God!”

  Still, solving that clue would have been a formidable challenge. Once Watson found the list of heroes, it would have had to carry out fifty separate searches to ascertain that each of the characters, from Atticus Finch to James Bond, Indiana Jones, and Casablanca’s Rick Blaine, was human. (It wouldn’t necessarily be that easy, since most documents and databases don’t note a protagonist’s species.) During that search, presumably, it would see that thirty-ninth on the list was a collie, a breed of dog (and therefore not human), and would then display “Who is Lassie?” on its electronic screen. Would the lessons gained in learning how to spot the dog in a long list of humans pay off elsewhere? Probably not.
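
  The brute-force plan described above is easy to sketch: walk the hero list and ask, for each character, whether any evidence marks them as non-human. Here the lookup table is a hypothetical stand-in for those fifty document searches:

```python
# A sketch of the brute-force plan: walk the hero list and ask, for each
# character, whether any evidence marks them as non-human. The lookup table
# is a hypothetical stand-in for fifty separate document searches.
KNOWN_SPECIES = {"Lassie": "collie"}  # most characters would simply be absent

def species_of(character):
    return KNOWN_SPECIES.get(character, "human")  # default assumption

heroes = ["Atticus Finch", "James Bond", "Indiana Jones", "Rick Blaine", "Lassie"]
non_humans = [hero for hero in heroes if species_of(hero) != "human"]
print(non_humans)  # ['Lassie']
```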

  That raised another question for the harried team. If Watson had abysmally low confidence in a Final Jeopardy response, as was the case with the Pet Shop Boys and Buffy the Vampire Slayer, would it be better to say nothing? If it was in the company’s interest to avoid looking stupid, suppressing wild guesses might be a good move. This dilemma did not arise with the regular Jeopardy clues. There, if Watson lacked confidence in an answer, it simply refrained from buzzing. But in Daily Doubles and Final Jeopardy, contestants had to bet before seeing the clue. Humans guessed when they didn’t know the answer. This is what Watson was doing, too. But its chronic shortage of common sense made its guesses infinitely dumber. In the coming weeks, the IBM team would calculate the odds of a lucky guess for each of Watson’s confidence levels. While Jeopardy executives, eager for entertainment and high ratings, would no doubt favor the occasional outrageous guess, IBM had other priorities. “At low levels of confidence, I think we’ll just have it say it doesn’t know,” said Chu-Carroll. “Sometimes that sounds smarter.”
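
  The abstention rule Chu-Carroll describes amounts to a single threshold test. A minimal sketch, with an illustrative cutoff rather than IBM's actual value:

```python
ABSTAIN_THRESHOLD = 0.15  # illustrative cutoff, not IBM's actual number

def final_jeopardy_response(candidates):
    """candidates: list of (answer, confidence) pairs, best first."""
    answer, confidence = candidates[0]
    if confidence < ABSTAIN_THRESHOLD:
        return "I don't know"  # sometimes that sounds smarter
    return answer

print(final_jeopardy_response([("Buffy the Vampire Slayer", 0.03)]))  # abstains
print(final_jeopardy_response([("Lassie", 0.62)]))                    # answers
```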

  Mathematics was one category where the IBM machine could not afford to look dumb. The company, after all, was built on math. However, the Jeopardy training data didn’t include enough examples to educate Watson in this area. Of the more than seventy-five thousand clues Eric Brown and his team studied, only fourteen involved operations with fractions. A game strategist wouldn’t dwell on them. But for IBM, there was more at risk than winning or losing a game. To prepare Watson for math, the team might have to put aside the statistical approach and train the machine in the rules and lingo of arithmetic.
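
  The rule-based alternative is straightforward in principle: represent fractions exactly and compute, rather than match text statistically. A minimal sketch using Python's exact rational arithmetic, with an invented example clue:

```python
from fractions import Fraction

# Rule-based arithmetic of the kind a fraction clue demands: exact rational
# math instead of statistical text matching. The example clue is invented.
def add_fractions(a: str, b: str) -> Fraction:
    return Fraction(a) + Fraction(b)

# e.g., a clue asking for the sum of two thirds and one sixth
print(add_fractions("2/3", "1/6"))  # 5/6
```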

 
