Final Jeopardy


by Stephen Baker


  By early 2008, Blue J’s scores were rising. On the Jennings Arc posted on the wall of the War Room, it was climbing toward the champion—but was still 30 percent behind him. If it continued at the pace of the last six months, it might reach Jennings by mid-2008 or even earlier. But that wasn’t the way things worked. Early on, Ferrucci said, the team had taught Blue J the easy lessons. “In those first months, we could put in a new algorithm and see its performance jump by two or three percent,” he said. But with the easy fixes in, the advances would be smaller, measured in tenths of a percentage point.

  The answer was to focus on Blue J’s mistakes. Each one pointed to a gap in its knowledge or a misunderstanding: something to fix. In that sense, each mistake represented an opportunity. The IBM team, working in 2007 with Eric Nyberg, a computer scientist at Carnegie Mellon, had designed Blue J’s architecture for what they called blame detection. The machine monitored each stage of its long and intricate problem-solving process. Every action generated data, lots of it. Analysts could carry out detailed studies of the pathways and performance of algorithms on each question. They could review each document the computer consulted and the conclusions it drew from it. In short, the team could zero in on each decision that led to a mistake and use that information to improve Blue J’s performance.
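
  For readers who want a concrete picture, here is a minimal, hypothetical sketch of that kind of stage-by-stage tracing. The pipeline stages, the toy clue, and the scoring heuristic below are invented for illustration; they are not IBM's architecture or data. The point is only that every stage records what it did, so a wrong final answer can be traced back to the stage that caused it.

```python
# Minimal, hypothetical sketch of stage-by-stage "blame detection".
# The stages, clue, and scoring heuristic are invented for illustration;
# they are not IBM's actual architecture or data.

def parse_clue(clue):
    # Stage 1: decide what kind of answer the clue is asking for.
    return {"answer_type": "country"} if "country" in clue.lower() \
        else {"answer_type": "unknown"}

def retrieve_passages(clue):
    # Stage 2: pretend to search a corpus and return candidate passages.
    return ["Argentina shares a 5,308 km border with Chile.",
            "Chile and Bolivia have disputed their border since the 1870s."]

def score_candidates(passages):
    # Stage 3: score candidate answers from the passages (toy heuristic).
    scores = {}
    for p in passages:
        for name in ("Argentina", "Bolivia"):
            if name in p:
                scores[name] = scores.get(name, 0) + 1
    return scores

def answer(clue):
    trace = {"clue": clue}                      # every stage gets logged
    trace["parse"] = parse_clue(clue)
    trace["passages"] = retrieve_passages(clue)
    trace["scores"] = score_candidates(trace["passages"])
    trace["final"] = max(trace["scores"], key=trace["scores"].get)
    return trace

if __name__ == "__main__":
    t = answer("This country shares the longest border with Chile.")
    # If t["final"] is wrong, the trace shows which stage to blame:
    # a bad parse, bad retrieval, or bad scoring.
    for stage, value in t.items():
        print(f"{stage}: {value}")
```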

  The researchers were swimming in examples of misunderstandings and wrong turns. Blue J, after all, was failing on half of the clues. But which ones represented larger patterns? Fixing those might enhance its analysis in an entire category. One South America clue, for example, appeared to signal a glitch on Blue J’s part in analyzing geography—an important category in Jeopardy. The clue asked for the country that shared the longest border with Chile. Blue J came back with the wrong answer: What is Bolivia? The correct response (What is Argentina?) was its second choice.

  Analyzing the clue, researchers saw that Blue J had received conflicting answers from two algorithms. The one specializing in geography had come back with the right answer, Argentina, whose 5,308-kilometer border with Chile dwarfed the 861-kilometer Chilean-Bolivian frontier. But another algorithm had counted references to these countries and their borders and found a lot more talk about the Bolivian stretch. (Chile and Bolivia have been engaged in a border dispute since the 1870s, generating a steady stream of news coverage.) Lacking any other context, this single-minded algorithm suggested Bolivia—and Blue J unwisely trusted it. “The computer was paying more attention to popularity than geography,” Ferrucci said. Researchers went on to tinker with the ratios underlying Blue J’s judgment. They instructed it to give more weight to the geography in that type of question and a bit less to popularity. Then they tested the system on a large batch of similar geography clues. Blue J’s performance improved. They then ran it on a group of random clues to find out if the adjustment affected Blue J’s performance elsewhere, perhaps turning correct answers into mistakes. That happened all too often. But this time the change helped. Blue J’s performance inched ahead another tiny fraction of a percent.
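
  A rough sketch of that kind of re-weighting, with invented scores, weights, and clue sets (nothing here comes from IBM's actual system): two evidence scores are blended with weights, the geography weight is raised, and the change is checked against both a targeted batch of clues and a random one.

```python
# Hypothetical sketch: combine two evidence scores with weights,
# then check a weight change on targeted and random clue sets.
# All numbers and clues are invented for illustration.

def combined_score(geo_score, popularity_score, w_geo, w_pop):
    # Blend evidence from a geography-aware scorer and a
    # mention-counting ("popularity") scorer.
    return w_geo * geo_score + w_pop * popularity_score

def pick_answer(candidates, w_geo, w_pop):
    # candidates: {answer: (geo_score, popularity_score)}
    return max(candidates,
               key=lambda a: combined_score(*candidates[a], w_geo, w_pop))

def accuracy(clues, w_geo, w_pop):
    correct = sum(pick_answer(c["candidates"], w_geo, w_pop) == c["answer"]
                  for c in clues)
    return correct / len(clues)

# Toy "longest border with Chile" clue: geography favors Argentina,
# raw mention counts favor Bolivia.
border_clue = {
    "answer": "Argentina",
    "candidates": {"Argentina": (0.9, 0.2), "Bolivia": (0.4, 0.9)},
}

similar_geography_clues = [border_clue]        # stand-in for a large batch
random_clues = [border_clue]                   # stand-in for a random batch

for w_geo, w_pop in [(0.5, 0.5), (0.7, 0.3)]:  # before and after the tweak
    print(f"weights geo={w_geo}, pop={w_pop}:",
          "geography batch", accuracy(similar_geography_clues, w_geo, w_pop),
          "| random batch", accuracy(random_clues, w_geo, w_pop))
```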

  The Jeopardy clues, nearly all of them from the J! Archive Web site, were the test bed for this stage of Blue J’s education. Eric Brown, Ferrucci’s top lieutenant, oversaw this cache along with Chu-Carroll. Brown was serious and circumspect. He got his doctorate at the University of Massachusetts and graduated, in 1996, at the dawn of the dot-com boom. Citing family obligations, he turned down a job offer from Infoseek, one of the early search engines. Two years later, the Walt Disney Company paid $430 million for 42 percent of Infoseek, turning many of the early employees—including the one who grabbed the job Brown had been offered—into multimillionaires. “It’s a sad story for me,” Brown said. “I ran into him a few years later at a conference. He was retired.”

  From the very beginning, Brown kept tight control of the Jeopardy data. He distributed two thousand clues at a time, which the team used to train Blue J. The risk they faced, as in any statistical analysis, was that they’d fine-tune the machine too precisely to each batch of questions. This tendency to tailor a model too closely to its training set is known as overfitting, and it’s a serious problem.

  Anyone who has ever studied a foreign language knows all about it. Students inevitably overfit to the French or Spanish or Mandarin that their teacher speaks. They adjust to her rhythms and syntax and come to associate that single voice with the language itself. A trip to Paris or Beijing often brings a rude awakening. In Blue J’s education, each training set was a single teacher. When the computer started to score well on a training set, the researchers would test it on Jeopardy clues it had never seen before. This was a blind set of data, a few thousand clues that no one but Brown had seen. Each time Blue J ventured from its comfortable clues into an unfamiliar set of data, its results would drop about 5 percent. But still, its overall scores were rising. Brown would release another training set, and the process would start over.
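
  In conventional machine-learning terms, the discipline Brown enforced is a train/blind split. The sketch below is a generic illustration with synthetic "clues" and a deliberately overfit model that simply memorizes its training data; it is not IBM's tooling, but it shows why accuracy on unseen clues is the number that matters.

```python
# Generic sketch of why a blind set matters: a model that merely
# memorizes its training clues looks perfect on them and collapses
# on clues it has never seen. Purely illustrative; no IBM code.
import random

random.seed(0)

# Fake "clues": each is (question_id, correct_answer).
all_clues = [(i, f"answer-{i % 50}") for i in range(3000)]
random.shuffle(all_clues)

training_set = all_clues[:2000]    # released in batches to the team
blind_set = all_clues[2000:]       # held back, seen only by the gatekeeper

# A deliberately overfit "model": a lookup table of training clues.
memorized = {q: a for q, a in training_set}

def predict(question_id):
    return memorized.get(question_id, "no idea")

def accuracy(clues):
    return sum(predict(q) == a for q, a in clues) / len(clues)

print("training accuracy:", accuracy(training_set))   # 1.0 -- looks great
print("blind-set accuracy:", accuracy(blind_set))     # 0.0 -- overfit
```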

  The broader question, naturally, was whether the Jeopardy challenge itself was one giant exercise in overfitting. Jeopardy, in a sense, was a single training set of 185,000 clues, including general knowledge and a mix (that Ferrucci’s team quickly quantified) of puzzles, riddles, and the like. If Blue J eventually mastered the game and even defeated Ken Jennings and Brad Rutter in its televised showdown, would its expertise be too specific, or esoteric, for the broader world of business? Would it be flummoxed once it ventured outside its familiar grid of thirty clues? After all, Jeopardy champions were hardly famous for running corporations, mastering global diplomacy, or even managing large research projects. They tended to be everyday people—real estate agents, teachers, software developers, librarians—all with one section of their mind specially adapted—or possibly overfitted—to a TV quiz show.

  David Ferrucci spent his days swimming in statistics. They defined every aspect of the Jeopardy project. Blue J’s analysis of data was statistical. Its confidence algorithms and learning programs were fed entirely by statistics. Its choice of words and its game strategy were guided by similar analysis, all statistical. Blue J’s climb up the Jennings Arc was a curve defined by statistics, and when it got into sparring sessions with humans, sometime in 2009, its record would be calculated the same way. The Final Match was the rare exception—a fact that haunted Ferrucci from the very start. Blue J’s fortunes would be defined more by chance than probability. One game, after all, was a minuscule test set, statistically meaningless. A bit of bad luck on a couple of Daily Doubles, and Blue J could lose—even if statistics demonstrated that it usually won.
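
  A toy Monte Carlo simulation makes the point; the 70 percent win rate below is an invented figure, not Watson's actual record. Even a player that dominates over thousands of games loses any single match a sobering share of the time.

```python
# Toy Monte Carlo: a player with an assumed 70% per-game win rate
# (an invented figure, not Watson's actual record) still loses any
# single televised match roughly 30% of the time.
import random

random.seed(42)
WIN_PROBABILITY = 0.70      # assumed per-game win rate, for illustration
N_SIMULATIONS = 100_000

single_game_losses = sum(random.random() >= WIN_PROBABILITY
                         for _ in range(N_SIMULATIONS))

print("long-run win rate:", 1 - single_game_losses / N_SIMULATIONS)
print("chance of losing one match:", single_game_losses / N_SIMULATIONS)
```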

  Ferrucci was constantly analyzing the statistical methodology of teaching and testing the bionic player. One day in the spring of 2008, he came up with a question no one had asked before. Was there any variation, from year to year, in the Jeopardy clues? He asked Eric Brown, the guardian of Blue J’s training set. Did Blue J fare better against the clues from some years than others?

  It was odd, looking back, that such a simple question had gone unasked for so long. It could be important. Even a change in the clue writers or new directions from the producer could usher in new styles or subject matter. By opening up the format to more popular culture in 1997, Harry Friedman had already demonstrated that Jeopardy, unlike chess, was a game that changed with the times. Did it evolve in a predictable way? If so, Blue J had to be ready.

  Brown’s team proceeded to analyze Blue J’s performance against the clues, year by year. They were stunned to see that the machine’s scores plummeted when answering clues from 2003 and remained at that lower level. It was as if the machine got dumber, by about 10 percent. As Blue J answered the newer questions, its precision stayed constant. In other words, it didn’t make more mistakes. But with a lower level of confidence, it didn’t buzz as often. Blue J was more confused.
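
  The symptom can be separated into two numbers: precision (how often the machine is right when it does answer) and attempt rate (how often its confidence clears the buzz threshold). The sketch below uses invented confidence scores and a hypothetical threshold to show precision holding steady while the attempt rate falls.

```python
# Hypothetical sketch separating precision from attempt ("buzz") rate.
# Confidence scores and the threshold are invented for illustration.

BUZZ_THRESHOLD = 0.5   # only answer when confidence clears this bar

def evaluate(attempts):
    # attempts: list of (confidence, was_correct) for every clue seen
    answered = [(c, ok) for c, ok in attempts if c >= BUZZ_THRESHOLD]
    precision = sum(ok for _, ok in answered) / len(answered)
    attempt_rate = len(answered) / len(attempts)
    return precision, attempt_rate

# Pre-2003-style clues: confident, mostly right.
older = ([(0.8, True)] * 6 + [(0.7, False)] * 2   # 8 buzzes, 6 right
         + [(0.3, False)] * 2)                    # 2 clues it skips
# Post-2003-style clues: same precision when it answers,
# but lower confidence overall, so it answers less often.
newer = ([(0.6, True)] * 3 + [(0.55, False)]      # 4 buzzes, 3 right
         + [(0.4, True)] * 2 + [(0.35, False)] * 4)  # 6 clues it skips

print("older seasons (precision, attempt rate):", evaluate(older))
print("newer seasons (precision, attempt rate):", evaluate(newer))
```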

  The IBM team called this shift “climate change.” For weeks, researchers pored over Jeopardy data, trying to figure out why in season 20, from September of 2003 to the following July, the questions suddenly became harder for Blue J. Was it more puzzles or puns? They couldn’t tell.

  That twentieth season was the one in which Ken Jennings began his remarkable run. Was Jeopardy toughening the clues for Jennings and unwittingly making the game harder for Blue J? That seemed unlikely, especially since it would be difficult to make the game harder for the omniscient Jennings without also ratcheting it up for his competitors. Ferrucci and his team asked Friedman about the change. He said he didn’t know—and added that at this point IBM certainly knew more about Jeopardy clues than he did.

  Climate change meant that as Blue J prepared for its first matches with human Jeopardy champs—so-called sparring sessions—two-thirds of its training set was too easy. It was like a student who crams for twelfth-grade finals only to see, late in the game, that he’s been consulting eleventh-grade textbooks. From Blue J’s perspective, the game had just gotten considerably harder.

  5. Watson’s Face

  IN THE FALL of 1992, a young painter named Joshua Davis moved from Colorado to New York City and enrolled at the prestigious Pratt Institute. After a year, he switched from painting to illustration, where there were better career opportunities. “I thought, ‘I’ll still paint. It’ll just be for the Man,’” Davis said. But when he sent his work to two book publishers, hoping to line up illustration contracts for children’s books, the response was essentially, as he put it, “‘Thanks but no thanks, and like, who the fuck are you?’”

  Davis didn’t take it too hard. His self-esteem was strong enough to withstand a knock or two. A bit later a friend at school steered him toward the digital world. “He said, ‘Oh, don’t worry, man, there’s this whole Internet thing now. Like books are dead.’” Davis said he was “totally naive” at that point. “I said, ‘Cool. Print’s dead. Fantastic!’” He promptly bought an old computer, but it lacked an operating system. So he went to a bookstore and bought one last artifact from the printed world: a manual for the new open-source system called Linux. A diskette he found at the back of the book contained the software. “I was like, ‘Score!’” he said.

  Davis didn’t know he was about to tackle what he calls the “world’s hardest operating system.” But as he taught himself about user interface design, programming, and video graphics, he had an epiphany. He wasn’t going to use computers simply to create designs more quickly or to reach more people. The technology itself, following his instructions, would generate the art. “At the time I thought, ‘The Internet is my new canvas,’” he said.

  His first corporate job was for Microsoft. He designed visual applications for Internet Explorer 4, which debuted in 1997. For the next few years, he became a leader in the new field of generative art, using programs to combine data into colors and patterns that could morph into countless variations. For this he harnessed movements from nature, such as wind, flowing water, and swarming birds and insects. He even turned his body into an evolving canvas. He had his entire left arm tattooed with the twenty glyphs of the Mayan calendar; the swirling designs running up his right arm depicted Japanese wind; and his back carried images of water. Fire, he said, would eventually cover his chest. He had birds tattooed on his neck, one of them dedicated to his daughter, Kelly Ann.

  Davis built a thriving studio, with offices in Manhattan and Chicago, and a long list of clients, from Volkswagen and Motorola to rap luminaries Sean “Puff Daddy” Combs and Kanye West. He eventually moved from the city to a hundred-year-old house with a barn in Mineola, on Long Island. As his success grew, he gave more thought to where his work fit in the history of art. In 2008, for a lecture series on dynamic abstraction, he focused on Jackson Pollock, the abstract artist famous for dripping paint on canvases from a stepladder. “Here’s a guy who says, ‘I’m going to paint, but I’m going to use gesture.’” Davis waved his arms to illustrate the movement. “Wherever the paint goes, the paint goes.” Not one to sell himself short, he said he felt like an extension of Pollock. “I’m creating systems where I establish the paints, the boundaries, and the colors. But where it goes is where it goes. It’s like controlled chaos.”

  As Davis learned more about Pollock, his feelings of kinship only grew. He read that the other artist had also left the city, moved to Long Island, and worked in a barn. “It was, like, sweet!” Davis said. “How did that work out?”

  It was around that time, in October 2008, that Davis got a call from an art director at Ogilvy & Mather, the international advertising agency. IBM, he learned, was building a computer to take on human champions in Jeopardy. How would he like to create the machine’s face?

  During the first year of Blue J’s development, few at IBM thought much about the computer’s physical presence or its branding. A pretty face would be irrelevant if the team couldn’t come up with a workable brain. But by late summer of 2008, Ferrucci’s team was getting close. One August day, Harry Friedman and the show’s supervising producer, a former Jeopardy champion named Rocky Schmidt, visited the Yorktown labs for their first look at the bionic player.

  As the group gathered in one of the windowless conference rooms at the Yorktown lab, Ferrucci walked them through the computer’s cognitive process, explaining how it came up with answers and why, on occasion, it flubbed them so badly. He explained that the hardware—what would become Watson’s body—wasn’t yet ready to deliver timely answers. But the team had led the computer through a game of Jeopardy, had recorded its answers, and then created a simulation of the game by loading the answers into a laptop. With that, Friedman and Schmidt watched the new contestant in action. Friedman later said that he had been “blown away” by the computer’s performance.

  The conversation, according to Noah Syken, a media manager at IBM, quickly turned to logistics and branding. If the computer required the equivalent of a roaring data center to play the game, where would all that machinery fit on the Jeopardy set? And how about all the noise and heat it would generate? One possibility might be to set up its hulking body on the Wheel of Fortune set, next door, and run the answers to the podium. But that raised a bigger question: What would viewers see at that podium? No one had a clue.

  The following month, as Lehman Brothers imploded, car companies crashed, and the world’s financial system appeared to teeter on the verge of collapse, IBM’s branding and marketing team worked to develop the personality and message of the Jeopardy-playing machine. It would need a face of some sort and a voice. And it had to have a name.

  An entire corporate identity unit at IBM specialized in naming products and services. A generation earlier, when the company still sold machines to consumers, some of the names this division dreamed up became iconic. “PC” quickly became a broad term for personal computers (at least those that weren’t made by Apple). ThinkPad was the marquee brand for top-of-the-line business laptops. And for a few decades before the PC, the Selectric, the electric typewriter with a single rotating type ball (which could “erase” typos with space-age precision), epitomized quality for anyone creating documents. With IBM’s turn toward services, the company risked losing its contact with the popular mind—and its identity as a hotbed of innovation.

  What’s more, a big switch had occurred since the 1990s. It used to be that the most advanced machinery was found at work. Children whose parents went to offices would sometimes get a chance to play with the adding machines there, along with the intercoms, fancy photocopiers, and phones with illuminated buttons for five or six different lines. But at the dawn of the new century, the office appeared to lose its grip on cool technology. Now people often had snazzier gadgets at home, and in their pockets, than at work. Companies like Apple and Google targeted consumers and infused technology with fun, zip, even desire. Tech companies that served the business market, by contrast—Oracle, Germany’s SAP, Cisco, and IBM—tended to stress the boring stuff: reliability, efficiency, and security. They were valuable qualities, to be sure, but deadening for a brand. IBM needed some sizzle. It was competing for both investors and brainpower with the likes of Google, Apple, Facebook—even the movie studio Pixar. It had to establish itself in the popular imagination as a company that took risks and was engaged in changing the world with bleeding-edge technology. The Jeopardy challenge, with this talking IBM machine on national television matching wits with game-show luminaries, was the branding opportunity of the decade. The name had to be good.

  Was THINQ the right choice, or perhaps THINQER? How about Exaqt or Ace? Working with the New York branding firm VSA Partners, IBM came up with dozens of candidates. The goals, according to a VSA summary, were to emphasize the business value of the technology, create a close tie to IBM, steer clear of names that were “too cute,” and lead the audience “to root for the machine.”

  One group of names had strong links to IBM. Deep Logic evoked Deep Blue, the computer that mastered chess. System/QA recalled the iconic mainframe System/360. Other names stressed intelligence. Qwiz, for example, blended “Q,” for question, with “wiz” to suggest that the technology had revolutionized search. The pronunciation—quiz—fit the game show theme. Another choice, nSight, referred to “n,” representing infinite possibilities. And EureQA blended “eureka” with the Q-A for question-answering. Another candidate, “Mined,” pointed to the machine’s data-mining prowess.

  On the day of the naming meeting, December 12, all of the logic behind the various choices was promptly ignored as people focused on the simplest of names in the category associated with IBM’s brand: Watson. “It just felt so right,” said Syken. “As soon as it came up, we knew we had it.” Watson invoked IBM’s founder. This was especially fitting since Thomas J. Watson had also established the research division, originally on the campus of Columbia University, in 1945. The Watson name was also a nod to the companion and chronicler of Sherlock Holmes, the brilliant fictional sleuth. In those stories, of course, Dr. Watson was clearly the lesser of the two intellects. But considering public apprehension about all-knowing machines, maybe it wasn’t such a bad idea to name a question-answering computer after an earnest and plodding assistant.

 
