Final Jeopardy


by Stephen Baker

The dumb answers didn’t matter—at least not yet. They had to do with the computer’s discernment, which was still primitive. The main concern for the Jeopardy team at this stage was whether the correct answer popped up anywhere at all on its list. This was its measure of binary recall, a metric that Blue J shared with humans. If a student in geography were asked about the capital of Romania, she might come up with Budapest and Bucharest and not remember which of the two was right. In Blue J’s world, those would be her candidate answers, and she clearly had the data in her head to answer correctly. From Chu-Carroll’s perspective, this student would have passed the binary recall test and needed no more data on capitals. (She just had to brush up on the ones she knew.) In a similar test, Blue J clearly didn’t recognize Britney Spears or do svidaniya as the correct answer. But if those words showed up somewhere on its candidate lists, then it had the wherewithal to answer them (once it got smarter). By focusing on categories where Blue J struck out most often, the researchers worked to fill in the holes in its knowledge base.
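
The binary recall measure described above can be sketched in a few lines. This is only an illustration, not the team's code: the function name `binary_recall` and the sample candidate lists are assumptions made up for the example. A clue counts as a hit if the correct answer appears anywhere on the candidate list, regardless of its rank.

```python
def binary_recall(candidate_lists, correct_answers):
    """Fraction of clues whose correct answer appears anywhere among the candidates."""
    if not correct_answers:
        raise ValueError("need at least one clue to measure recall")
    hits = 0
    for candidates, truth in zip(candidate_lists, correct_answers):
        # Rank is ignored: the answer only has to be somewhere on the list.
        if truth.lower() in {c.lower() for c in candidates}:
            hits += 1
    return hits / len(correct_answers)


# Hypothetical candidate lists for two clues (invented for illustration).
candidate_lists = [
    ["Budapest", "Bucharest"],            # right answer present, wrong one ranked first
    ["Madonna", "Christina Aguilera"],    # right answer missing entirely
]
correct_answers = ["Bucharest", "Britney Spears"]

recall = binary_recall(candidate_lists, correct_answers)
print(recall)
```

The first clue passes even though the wrong city is listed first, which is the point of the metric: it measures whether the data is there, not whether the system can yet pick the winner.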

Week by week, gigabyte by gigabyte, Blue J’s trove of data grew. But by the standards of consumer electronics, it remained a pip-squeak. The fifth-generation iPods, which were selling five miles down the road, stored 30 to 80 gigabytes. Blue J topped out at the level of a midrange iPod, at some 75 gigabytes. Within a year or two, cell phones might hold as much. But there was a reason for the coziness of Blue J’s stash. The more data it had to rummage through, the longer it took—and it was already painfully slow. What’s more, the easiest and clearest sources for Blue J to digest were lists and encyclopedia entries. They were short and to the point, and harder for their literal-minded pupil to misinterpret. And they didn’t take up much disk space.

While Chu-Carroll wrestled with texts, James Fan was part of a team grappling with a challenge every bit as important: coaxing the machine to understand the convoluted Jeopardy clues. If Blue J couldn’t figure out what it was supposed to look for, the data on its disk wouldn’t make any difference. From Blue J’s perspective, each clue was a riddle to be decoded, and the key was to figure out the precise object of its hunt. Was it a country? A person? A kind of fish? In Jeopardy, it wasn’t always clear.

The crucial task was to spot the word representing what Blue J was supposed to find. In everyday questions, finding it was simple. In “Who assassinated President Lincoln?” or “Where did Columbus land in 1492?” the “who” and “where” point to a killer and a place. But in Jeopardy, where the clues are statements and the answers questions, finding these key words, known as Lexical Answer Types (LATs), was a lot trickier. Often a clue would signal the LAT with a “this,” as in: “This title character was the crusty & tough city editor of the Los Angeles Tribune.” Blue J had no trouble with that one. It identified “title character” as its focus and returned from its hunt with the right answer (“Who is Lou Grant?”).
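
A rough sketch of the simplest case, where a "this" signals the LAT, might look like the following. The heuristic, the `guess_lat` name, and the small stop-word list are all assumptions for illustration; the text does not describe how Blue J actually implemented this step.

```python
import re

# Tiny, hand-picked stop list (an assumption): words that end the noun phrase after "this".
STOP_WORDS = {"was", "is", "were", "are", "has", "had", "of", "in", "on",
              "the", "a", "an", "it", "that", "who", "which"}


def guess_lat(clue):
    """Return the words that directly follow 'this'/'these', or None if there are none."""
    # Split into words (keeping hyphens and apostrophes) and single punctuation marks.
    tokens = re.findall(r"[A-Za-z][A-Za-z'-]*|[^\sA-Za-z]", clue)
    for i, token in enumerate(tokens):
        if token.lower() in ("this", "these"):
            words = []
            for nxt in tokens[i + 1:]:
                # Stop at punctuation or at a verb/function word.
                if not nxt[0].isalpha() or nxt.lower() in STOP_WORDS:
                    break
                words.append(nxt.lower())
            return " ".join(words) or None
    return None


easy = "This title character was the crusty & tough city editor of the Los Angeles Tribune."
hard = "In nine-ball, whenever you sink this, it's a scratch."

print(guess_lat(easy))
print(guess_lat(hard))
```

The heuristic finds "title character" in the first clue and comes back empty-handed on the second, where nothing follows "this" and the rest of the sentence would have to be analyzed.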

But others proved devilishly hard. This clue initially left Blue J befuddled: “In nine-ball, whenever you sink this, it’s a scratch.” Blue J, Fan said, immediately identified “this” as the object to look for. But what was “this?” The computer had to analyze the rest of the sentence. “This” was something that sank. But it was not related, at least in any clear way, to vessels, the most common sinking objects. To identify the LAT, Blue J would have to investigate the two other distinguishing words in the clue, “nine-ball” and “scratch.” They led the computer to the game of pool and, eventually, to the answer (“What is a cue ball?”).

Sometimes the LAT remained a complete mystery. The computer, Fan said, had all kinds of trouble figuring out what to look for in this World Leaders clue: “In 1984, his grandson succeeded his daughter to become his country’s prime minister.” Should the computer look for the grandson? The daughter? The country? Any human player would quickly understand that it was none of the above. The trick was looking for a single person whose two roles went unmentioned: a father and a grandfather. To unearth this, Blue J would have had to analyze family relationships. In the end, it failed, choosing the grandson (“Who is Rajiv Gandhi?”). In its list of answers, Blue J did include the correct name (“Who is Nehru?”), but it had less confidence in it.

Troubles with specific clues didn’t matter. Even Ken Jennings only won the buzz 62 percent of the time. Blue J could afford to pass on some. The important thing was to fix chronic mistakes and to orient the machine to succeed on as many clues as possible. In previous weeks, Fan and his colleagues had identified twenty-five hundred different LATs in Jeopardy clues and ranked them by their frequency. The easiest for Blue J were the most specific. The machine could zero in on songs, kings, criminals, or plants in a flash, but most LATs were vaguer. “He,” for example, was the most common, accounting for 2.2 percent of the clues. Over the coming months, Fan would have to teach Blue J how to explore the rest of each clue to figure out exactly what kind of “he” or “this” it should look for.

It was possible, Ferrucci thought, that someday a machine would replicate the complexity and nuance of the human mind. In fact, in IBM’s Almaden research labs, on a California hilltop high above Silicon Valley, a scientist named Dharmendra Modha was building a simulated brain boasting seven hundred million electronic neurons. Within years, he hoped to map the brain of a cat, then a monkey, and eventually a human. But mapping the human brain, with its hundred billion neurons and trillions or quadrillions of connections among them, was a long-term project. With time, it might result in a bold new architecture for computing that would lead to a new level of computer intelligence. Perhaps then machines would come up with their own ideas, wrestle with concepts, appreciate irony, and think more like humans.

But such a machine, if it was ever built, would not be ready for Ferrucci. As he saw it, his team had to produce a functional Jeopardy machine within two years. If Harry Friedman didn’t see a viable machine by 2009, he would never green-light the man-machine match for late 2010 or early 2011. This deadline compelled Ferrucci and his team to assemble their machine with existing technology—the familiar silicon-based semiconductors, servers whirring through billions of calculations and following instructions from lots of software programs that already existed. In its guts, Blue J would not be so different from the ThinkPad Ferrucci lugged from one meeting to the next. Its magic would have to come from its massive scale, inspired design, and carefully tuned algorithms. In other words, if Blue J became a great Jeopardy player, it would be less a breakthrough in cognitive science than a triumph of engineering.

Every computing technology Ferrucci had ever touched, from the first computer he saw at Iona to the Brutus machine that spit out story plots, had a clueless side to it. Such machines could follow orders and carry out surprisingly complex jobs. But they were nowhere close to humans. The same was true of expert systems and neural networks: smart in one area, dumb in every other. And it was also the case with the Jeopardy algorithms his team was piecing together in the Hawthorne labs. These sets of finely honed computer commands each had a specialty, whether it was hunting down synonyms, parsing the syntax of a clue, or counting the most common words in a document. Beyond these meticulously programmed tasks, each was helpless.

So how would Blue J concoct broader intelligence—or at least enough of it to win at Jeopardy? Ferrucci considered the human brain. “If I ask you ‘36 plus 43,’ a part of you goes, ‘Oh, I’ll send that question over to the part of my brain that deals with math,’” he said. “And if I ask you a question about literature, you don’t stay in the math part of your brain. You work on that stuff somewhere else.” Now this may be the roughest approximation of how the brain works, but for Ferrucci’s purposes, it didn’t matter. He knew that the brain had different specialties, that people instinctively skipped from one to another, and that Blue J would have to do the same thing.

Unlike a human, however, Blue J wouldn’t know where to start. So with its vast resources, it would start everywhere. Instead of reading a clue and assigning the sleuthing work to specialist algorithms, Blue J would unleash scores of them on a hunt, then see which one came up with the best answer. The algorithms inside Blue J, each following a different set of marching orders, would bring in competing results. This process, a lot less efficient than the human brain, would require an enormous complex of two thousand processors, each handling a different piece of the job.

To see how these algorithms carried out their hunt, consider one of thousands of clues the fledgling system grappled with. In the category Diplomatic Relations, it read: “Of the 4 countries the United States does not have diplomatic relations with, the one that’s farthest north.”

In the first wave of algorithms to assess the clue was a cluster that specialized in grammar. They diagrammed the sentence, much the way a grade school teacher once did, identifying the nouns, verbs, direct objects, and prepositional phrases. This analysis helped to resolve doubts about specific words. The “United States,” in this clue, referred to the country, not the army, the economy, or the Olympic basketball team. Then they pieced together interpretations of the clue. Complicated clues, like this one, might lead to different readings, one more complex, the other simpler, perhaps based solely on words in the text. This duplication was wasteful, but waste was at the heart of the Blue J strategy. Duplicating or quadrupling its effort, or multiplying it by 100, was one way it would compensate for its cognitive shortcomings—and play to its advantage in processing speed. Unlike humans, who instantly understand a question and pursue a single answer, the computer might hedge, launching simultaneous searches for a handful of different possibilities. In this way and many others, Blue J would battle the efficient human mind with spectacular, flamboyant inefficiency. “Massive redundancy” was how Ferrucci described it. Transistors were cheap and plentiful. Blue J would put them to use.

While the machine’s grammar-savvy algorithms were dissecting the clue, one of them searched for its LAT. In this clue about diplomacy, “the one” evidently referred to a country. If this was the case, the universe of Blue J’s possible answers was reduced to a mere 194, the number of countries in the world. (This, of course, assumed that “country” didn’t refer to “Marlboro Country” or “wine country” or “country music.” Blue J had to remain flexible, because these types of exception often occurred.)

Once the clue was parsed into a question the machine could understand, the hunt commenced. Each expert algorithm went burrowing through Blue J’s trove of data in search of the answer. One algorithm, following instructions developed for decoding the genome, looked to match strings of words in the clue with similar strings elsewhere, maybe in some stored Wikipedia entry or in articles about diplomacy, the United States, or northern climes. One of the linguists worked on rhyming key words in the clue or finding synonyms. Another algorithm used a Google-like approach and focused on documents that matched the greatest number of key words in the clue, giving special attention to the ones that surfaced the most often.

While they worked, software within Blue J would compare the clue to thousands of others it had encountered. What kind was it—a puzzle, a limerick, a historical factoid? Blue J was learning to recognize more than fifty types of questions, and it was constructing the statistical record of each algorithm for each type of question. This would guide it in evaluating the results when they came back. If the clue turned out to be an anagram, for example, the algorithm that rearranged the letters of words or phrases would be the most trusted source. But that same algorithm would produce gibberish for most other clues.
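
One way to picture that statistical record is a table of hits and attempts for each pairing of algorithm and clue type. The sketch below is an assumption-laden illustration: the `TrackRecord` class, the algorithm names, and the smoothing rule are invented here, not taken from the Blue J system.

```python
from collections import defaultdict


class TrackRecord:
    """Keeps a running score for each (algorithm, clue type) pair."""

    def __init__(self):
        # (algorithm, clue_type) -> [correct, attempts]
        self._stats = defaultdict(lambda: [0, 0])

    def record(self, algorithm, clue_type, was_correct):
        stats = self._stats[(algorithm, clue_type)]
        stats[1] += 1
        if was_correct:
            stats[0] += 1

    def trust(self, algorithm, clue_type):
        """Smoothed success rate; an unseen pairing starts at 0.5 (an assumed default)."""
        correct, attempts = self._stats.get((algorithm, clue_type), (0, 0))
        return (correct + 1) / (attempts + 2)


record = TrackRecord()
# Hypothetical history: the anagram solver shines on anagrams and flops elsewhere.
for outcome in [True] * 8 + [False] * 2:
    record.record("anagram_solver", "anagram", outcome)
for _ in range(10):
    record.record("anagram_solver", "historical", False)

print(record.trust("anagram_solver", "anagram"))
print(record.trust("anagram_solver", "historical"))
```

With a record like this, the same candidate answer would be weighted very differently depending on which algorithm produced it and what kind of clue was being played.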

What kind of clue was this one on Diplomatic Relations? It appeared to require two independent analyses. First, the computer had to come up with the four countries with which the United States had no diplomatic ties. Then it had to figure out which of them was the farthest north. A group of Blue J’s programmers had recently developed an algorithm focused on these so-called nested clues, in which one answer lay inside another. This may sound obscure, but humans ask this type of question all the time. If someone wonders about “cheap pizza joints close to campus,” the person answering has to carry out two mental searches, one for cheap pizza joints and another for those nearby. Blue J’s “nested decomposition” led the computer through a similar process. It broke the clues into two questions, pursued two hunts for answers, and then pieced them together. The new algorithm was proving useful in Jeopardy. One or two of these combination questions came up in nearly every game. They were especially common in the all-important Final Jeopardy, which usually featured more complex clues.

It took almost an hour for Blue J’s algorithms to churn through the data and return with their candidate answers. Most were garbage. There were failed anagrams of country names and laughable attempts to rhyme “north” and “diplomatic.” Some suggested the names of documents or titles of articles that had strings of the same words. But the nested algorithm followed the right approach. It found the four countries on the outs with the United States (Bhutan, Cuba, Iran, and North Korea), checked their geographical coordinates, and came up with the answer: “What is North Korea?”
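
The two-step hunt can be sketched as an inner question feeding an outer one. Everything here is illustrative: the function names are invented, the four-country list is simply the one given in the text, and the latitudes are rough approximations supplied for the example rather than data from Blue J.

```python
# Approximate latitudes in degrees north (rough values, for illustration only).
LATITUDE = {
    "Bhutan": 27.5,
    "Cuba": 21.5,
    "Iran": 32.0,
    "North Korea": 40.0,
    "Canada": 56.0,
    "Mexico": 23.0,
}

# The four countries named in the text as lacking diplomatic relations at the time.
NO_RELATIONS = {"Bhutan", "Cuba", "Iran", "North Korea"}


def inner_question(countries):
    """First hunt: which countries have no diplomatic relations with the United States?"""
    return [c for c in countries if c in NO_RELATIONS]


def outer_question(candidates):
    """Second hunt: which of those candidates lies farthest north?"""
    return max(candidates, key=lambda c: LATITUDE[c])


def answer_nested(countries):
    """Solve the inner question, then run the outer question on its result."""
    return outer_question(inner_question(countries))


print(answer_nested(LATITUDE.keys()))
```

Canada sits farther north than any of the four, but it never reaches the second step because the inner question filters it out. That ordering is what makes the clue nested rather than two separate lookups.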

At this point, Blue J had the right answer. It had passed the binary recall test. But it did not yet know that North Korea was correct, nor that it even merited enough confidence for a bet. For this, it needed loads of additional analysis. Since the candidate answer came from an algorithm with a strong record on nested clues, it started out with higher than average confidence in that answer. The machine proceeded to check how many of the answers matched the question type “country.” After ascertaining that North Korea appeared to be a country, confidence in “What is North Korea?” increased. For a further test, it placed “North Korea” into a simple sentence generated from the clue: “North Korea has no diplomatic relations with the United States.” Then it would see if similar sentences showed up in its data trove. If so, confidence climbed higher.
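
The way confidence builds up across these checks can be sketched as a starting value that each successful test nudges upward. The numbers, the `boost` rule, and the exact-match evidence test below are all assumptions chosen to keep the example small; the text says only that confidence rose with each check and that Blue J looked for similar, not identical, sentences.

```python
def boost(confidence, strength):
    """Move confidence a fraction `strength` of the remaining way toward 1.0."""
    return confidence + (1.0 - confidence) * strength


def score_candidate(answer, prior, known_types, expected_type, corpus, template):
    """Start from the algorithm's track record, then apply two evidence checks."""
    confidence = prior
    # Check 1: does the candidate match the expected answer type?
    if expected_type in known_types.get(answer, set()):
        confidence = boost(confidence, 0.5)  # assumed weight
    # Check 2: does a sentence built from the clue appear in the stored text?
    sentence = template.format(answer=answer).lower()
    if any(sentence in doc.lower() for doc in corpus):
        confidence = boost(confidence, 0.5)  # assumed weight
    return confidence


# Hypothetical type table and one-document corpus, invented for illustration.
known_types = {"North Korea": {"country"}, "Diplomatic Immunity": {"film"}}
corpus = ["Analysts note that North Korea has no diplomatic relations with the United States."]
template = "{answer} has no diplomatic relations with the United States"

good = score_candidate("North Korea", 0.4, known_types, "country", corpus, template)
bad = score_candidate("Diplomatic Immunity", 0.1, known_types, "country", corpus, template)
print(good, bad)
```

Starting from 0.4, the country candidate passes both checks and ends near 0.85, while the candidate that fails both stays where it began.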

In the end, it chose North Korea as the answer to bet on. In a real game, Blue J would have hit the buzzer. But being a student, it simply moved on to the next test.

The summer of 2007 turned into fall. Real estate prices edged down in hot spots like Las Vegas and San Diego, signaling the end of a housing boom. Senators Obama and Clinton seemed to campaign endlessly in Iowa. The Red Sox marched toward their second World Series crown of the decade, and Blue J grew smarter.

But Ferrucci noted a disturbing trend among his own team: It was slowing down. When technical issues came up, they often required eight or ten busy people to solve them. If a critical algorithm person or a member of the hardware team was missing, the others had to wait a day or two, or three, by which point someone else was out of pocket. Ferrucci worried. Even though the holidays were still a few months away and they had all of 2008 to keep working, his boss, a manager named Arthur Ciccolo, never tired of telling him that the clock was ticking. It was, and Ferrucci—thinking very much like a computer engineer—viewed his own team as an inefficient system, one plagued with low bandwidth and high latency. As team members booked meeting rooms and left phone messages, vital information was marooned for days at a time, even weeks, in their own heads.

Computer architects faced with bandwidth and latency issues often place their machines in tight clusters. This reduces the distance that information has to travel and speeds up computation. Ferrucci decided to take the same approach with his team. He would cluster them. He found an empty lab at Hawthorne and invited his people to work there. He called it the War Room.

At first it looked more like a closet, an increasingly cluttered one. The single oval table in the room was missing legs. So the researchers piggybacked it on smaller tables. It had a tilt and a persistent wobble, no matter how many scraps of cardboard they jammed under the legs. There weren’t enough chairs, so they brought in a few from the nearby cafeteria. Attendance in the War Room was not mandatory but an initial crew, recognizing the same bandwidth problems, took to it right away. With time, others who stayed in their offices started to feel out of the loop. They fetched chairs and started working at the same oval table. The War Room was where decisions were being made.

For high-tech professionals, it all seemed terribly old-fashioned. People were standing up, physically, when they had a problem and walking over to colleagues or, if they were close enough, rolling over on their chairs. Nonetheless, the pace of their work quickened. It was not only the good ideas that were traveling faster; bad ones were, too. This was a hidden benefit of higher bandwidth. With more information flowing, people could steer colleagues away from the dead ends and time drains they’d already encountered. Latency fell. “Before, it was like we were running in quicksand,” said David Gondek, a new Ph.D. from Brown who headed up machine learning. Like many of the others on the team, Gondek started using his old office as a place to keep stuff. It became, in effect, his closet.

It was a few weeks after Ferrucci set up the War Room that the company safety inspector dropped by. He saw monitors propped on books and ethernet cables snaking along the floor. “The table was wobbly. It was a nightmare,” Ferrucci said. The inspector told them to clear out. Ferrucci started looking for a bigger room and quickly realized his team members expected cubicles in the larger space. He told them no, he didn’t want them to have the “illusion of returning to a private office.” He found a much larger room on the third floor. Someone had left a surfboard there. Ferrucci’s team propped it at the entrance and sat a tiny toy bird, a bluebird, on top of it. It was the closest specimen they could find to a blue jay.

A war room, of course, was hardly unique to Ferrucci’s team. Financial trading floors and newsrooms at big newspapers had been using war rooms for decades. All of these operations involved piecing together networks of information. Each person, ideally, fed the others. But for IBM, the parallel between the Jeopardy team and what it was building was particularly striking. The computer had areas of expertise, some in knowledge, others in language. It had an electrical system to transmit information and a cognitive center to interpret it and to make decisions. Each member of Ferrucci’s team represented one (or more) of these specialties. In theory, each one could lay claim to a certain patch of transistors in the thinking machine. So in pushing the team into a single room, Ferrucci was optimizing the human brain that was building the electronic one.
