The Most Human Human

by Brian Christian


  Computability theory, Ackley says, has the mandate “Produce correct answers, quickly if possible,” whereas life in practice is much more like “Produce timely answers, correctly if possible.” This is an important difference—and began to suggest to me another cornerstone for my strategy at the Turing test.

  Uh and Um

  When trying to break a model or approximation, it’s useful to know what is captured and not captured by that model. For instance, a good first start for someone trying to prove they’re playing a saxophone, and not a synthesizer made to sound like a saxophone, would be to play non-notes: breaths, key clicks, squawks. Maybe a good start for someone trying to break models of language is to use non-words: NYU philosopher of mind Ned Block, as a judge in 2005, made a point of asking questions like “What do you think of dlwkewolweo?” Any answer other than befuddlement (e.g., one bot’s “Why do you ask?”) was a dead giveaway.

  Another approach would be to use words that we use all the time, but that historically haven’t been considered words at all: for example, “um” and “uh.” In his landmark 1965 book, Aspects of the Theory of Syntax, Noam Chomsky argues, “Linguistic theory is concerned primarily with an ideal speaker-listener, in a completely homogeneous speech-community, who knows its language perfectly and is unaffected by such grammatically irrelevant conditions as memory limitations, distractions, shifts of attention and interest, and errors (random or characteristic) in applying his knowledge of the language in actual performance.” In this view words like “uh” and “um” are errors—and, say Stanford’s Herbert Clark and UC Santa Cruz’s Jean Fox Tree, “they therefore lie outside language proper.”

  Clark and Fox Tree, however, disagree. Most languages have two distinct terms, just as English does: If they are simply errors, why would there be two, and why in every language? Furthermore, the usage pattern of “uh” and “um” shows that speakers use “uh” before a pause of less than a second, and “um” before a longer pause. This information suggests two things: (1) that the words are far from interchangeable and in fact play distinct roles, and (2) that because these words are made before the pauses, speakers must be anticipating in advance how long the following pause will be. This is much more significant than mere “error” behavior, and leads Clark and Fox Tree to the conclusion “that uh and um are, indeed, English words. By words, we mean linguistic units that have conventional phonological shapes and meanings and are governed by the rules of syntax and prosody … Uh and um must be planned for, formulated, and produced as parts of utterances just as any other word is.”
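  The planning claim can be made concrete with a small sketch. It assumes, purely for illustration, that a speaker already has an estimate of the coming pause in seconds; the function name `choose_filler` is hypothetical, and the one-second cutoff is taken from the usage pattern described above (the text does not say which word goes with a pause of exactly one second, so the boundary case here is an assumption).

```python
def choose_filler(expected_pause_seconds: float) -> str:
    """Pick a filler word from the anticipated length of the coming pause.

    Illustrative only: the one-second threshold mirrors the pattern reported
    in the text; treating exactly 1.0 s as a "long" pause is an assumption.
    """
    if expected_pause_seconds < 0:
        raise ValueError("a pause cannot have negative length")
    # Short delay ahead: "uh". Longer delay ahead: "um".
    return "uh" if expected_pause_seconds < 1.0 else "um"


if __name__ == "__main__":
    for pause in (0.3, 2.5):
        print(f"{pause:.1f}s pause ahead: {choose_filler(pause)}")
```

  The point of the sketch is only that the choice has to be made before the pause begins, which is why the estimate appears as an input rather than something measured afterward.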

  In a purely grammatical view of language, the words “uh” and “um” are meaningless. Their dictionary entries would be blank. But note that the idealized form of language which Chomsky makes his object of study explicitly ignores “such grammatically irrelevant conditions as memory limitations … [and] actual performance.” In other words, Chomsky’s theory of language is the computability theory of Turing’s era, not the complexity theory that followed. Very similarly idealized, as it happens, are chatbots’ models of language. Yet it turns out—just as it did in computer science—that there’s a tremendous amount happening in the gap between the “ideal” process and the “actual performance.”

  As a human confederate, I planned to make as much of this gap as possible.

  Satisficing and Staircase Wit

  Economics, historically, has also tended to function a bit like computability theory, where “rational agents” somehow gather and synthesize infinite amounts of information in the smallest of jiffies, then immediately decide and act. Such theories say this and that about “costs” without really considering: consideration itself is a cost! You can’t trade stocks except in real time: the longer you spend analyzing the market, the more the market has meanwhile changed. The same is true of clothes shopping: the season is gradually changing, and so is fashion, literally while you shop. (Most bad fashion is simply good fashion at the wrong time.)

  The Nobel laureate, Turing Award winner, and academic polymath—economics, psychology, political science, artificial intelligence—Herbert Simon coined the word “satisficing” (satisfying + sufficing) as an alternative to objective optimization/maximization.

  By the lights of computability theory, I’d be as good a guitar player as any, because you give me any score and I can hunt around for the notes one by one and play them …

  English composer Brian Ferneyhough writes scores so outrageously complicated and difficult that they are simply unperformable as written. This is entirely the point. Ferneyhough believes that virtuosic performers frequently end up enslaved by the scores they perform, mere extensions of the composer’s intention. But because a perfect performance of his scores is impossible, the performer must satisfice, that is, cut corners, set priorities, reduce, simplify, get the gist, let certain things go and emphasize others. The performer can’t avoid interpreting the score their own way, becoming personally involved; Ferneyhough’s work asks, he says, not for “virtuosity but a sort of honesty, authenticity, the exhibition of his or her own limitations.” The New York Times calls it “music so demanding that it sets you free”—in a way that a less demanding piece wouldn’t. Another part of what this means is that all performances are site-specific; they never become fungible or commoditized. As musicologist Tim Rutherford-Johnson puts it, Ferneyhough “draws so much more into the performance of a work than simple reproduction of a composer’s instructions; it’s hard to imagine future re-re-re-recordings of the same old lazy interpretations of Ferneyhough works, a fate that too much great music is burdened with today.”

  For Bernard Reginster, authenticity resides in spontaneity. Crucially, this would seem to have a component of timing: you can’t be spontaneous except in a way that keeps up with the situation, and you can’t be sensitive to the situation if it’s changing while you’re busy making sense of it.

  Robert Medeksza, whose program Ultra Hal won the Loebner Prize in 2007, mentioned that the conversational database he brought to the competition for Ultra Hal was smaller than ’07 runner-up Cleverbot’s by a factor of 150. The smaller database limited Ultra Hal’s range of responses, but improved the speed of those responses. In Medeksza’s view, speed proved the decisive factor. “[Cleverbot’s larger database] actually seemed to be a disadvantage,” he told an interviewer after the event. “It sometimes took [Cleverbot] a bit long to answer a judge as the computer [couldn’t] handle that amount of data smoothly.”

  I think of the great French idiom l’esprit de l’escalier, “staircase wit,” the devastating verbal comeback that occurs to you as you’re walking down the stairs out of the party. Finding the mot juste a minute too late is almost like not finding it at all. You can’t go “in search of” the mot juste or the bon mot. They ripen and rot in an instant. That’s the beauty of wit.

  It’s also the beauty of life. Computability theory is staircase wit. Complexity theory—satisficing, the timely answer, as correct as possible—is dialogue.

  “Barge-In-Able Conversation Systems”

  The 2009 Loebner Prize competition in Brighton was only a small part of a much larger event happening in the Brighton Centre that week, the annual Interspeech conference for both academic and industry speech technology researchers, and so ducking out of the Loebner Prize hall during a break, I immediately found myself in the swell and crush of some several thousand engineers and programmers and theorists from all over the globe, rushing to and from various poster exhibitions and talks—everything from creepy rubber mock-ups of the human vocal tract, emitting zombie versions of human vowel sounds, to cutting-edge work in natural language AI, to practical implementation details concerning how a company might make its automated phone menu system suck less.

  One thing you notice quite quickly at events like this is how thick a patois grows around every field and discipline. It’s not easily penetrated in a few days’ mingling and note taking, even when the underlying subject matter makes sense. Fortunately, I had a guide and interpreter, in the form of my fellow confederate Olga. We wandered through the poster exhibition hall, where the subtlest of things about natural human conversation were named, scrutinized, and hypothesized about. I saw a poster that intrigued me, about the difficulty of programming “Barge-In-Able Conversational Dialogue Systems”—which humans, the researcher patiently explained to me, are. “Barge-in” refers to the act of leaping in to talk while the other person is still talking. Apparently most spoken dialogue systems, like most chatbots, have a hard time dealing with this.

  Notation and Experience

  Just as Ferneyhough is interested in the differences “between the notated score and the listening experience,” so was I in the differences between idealized theories of language and the ground truth of language in practice, the differences between logs of conversations and conversation itself.

  One of my friends, a playwright, once told me, “You can always identify the work of amateurs, because their characters speak in complete sentences. No one speaks that way in real life.” It’s true: not until you’ve had the experience of transcribing a conversation is it clear how true this is.

  But sentence fragments themselves are only the tip of the iceberg. A big part of the reason we speak in fragments has to do with the turn-taking structure of conversation. Morse code operators transmit “stop” to yield the floor; on walkie-talkies it’s “over.” In the Turing test, it’s traditionally been the carriage return, or enter key. Most scripts read this way: an inaccurate representation of turn-taking is, in fact, one of the most pervasive ways in which dialogue in art fails to mirror dialogue in life. But what happens when you remove those markers? You make room both for silences and for interrupts, as in the following, an excerpt of the famously choppy dialogue in David Mamet’s Pulitzer-winning Glengarry Glen Ross:

  LEVENE: You want to throw that away, John …? You want to throw that away?

  WILLIAMSON: It isn’t me …

  LEVENE: … it isn’t you …? Who is it? Who is this I’m talking to? I need the leads …

  WILLIAMSON: … after the thirtieth …

  LEVENE: Bullshit the thirtieth, I don’t get on the board the thirtieth, they’re going to can my ass.

  In spontaneous dialogue it’s natural and frequent for the participants to overlap each other slightly; unfortunately, this element of dialogue is extremely difficult to transcribe. In fiction, playwriting, and screenwriting, the em dash or ellipsis can signify that a line of dialogue got sharply cut off, but in real life these severances are rarely so abrupt or clean. For this reason I think even Mamet’s dialogue only gets turn-taking half right. We see characters stepping on each other’s toes and cutting in, but as soon as they do, the other character stops on a dime. We don’t see the fluidity and negotiation often present in those moments. The cuts are too sharp.

  We squabble or tussle over the floor, fade in and out, offer “yeah’s” and “mm-hmm’s” to show we’re engaged, add parentheticals to each other’s sentences without trying to stop those sentences’ flow, try to talk over an interruption only to yield a second later, and on and on, a huge spectrum of variations. There are other notations that some playwrights and screenwriters use, involving slashes to indicate where the next line starts, but these are cumbersome to write, and to read, and even they fail to capture the range of variation present in life.

  I recall going to see a jazz band when I was in college—it was on the large side, for a jazz band, with a horn section close to a dozen strong. The players were clearly proficient, and played tightly together, but their soloing—it was odd—was just a kind of rigid turn-taking, not unlike the way people queue in front of a microphone to ask a question to a lecturer at the end of a lecture: the soloist on deck waited patiently and expectantly for the current soloist’s allotted number of bars to expire, and would then play for the same number of bars him- or herself.

  There’s no doubt that playing this way avoids chaos, but there is also no doubt that it limits the music.

  It may be that enforced turn-taking is at the heart of how a language barrier affects intimacy, more so than the language gap itself. As NBC anchor and veteran interviewer John Chancellor explains in Interviewing America’s Top Interviewers:

  Simultaneous translation is good because you can follow the facial expressions of the person who’s talking to you, whereas you can’t in consecutive translation. Most reporters get consecutive translation, however, when they’re interviewing in a foreign language, because they can’t really afford to have simultaneous translation. But it’s very difficult to get to the root of things without the simultaneous translation.

  So much of live conversation differs from, say, emailing, not because the turns are shorter, but because there sometimes are not definable “turns” at all. So much of conversation is about the extremely delicate skill of knowing when to interrupt someone else’s turn and when to “pass” on your own turn, when to yield to an interruption and when to persist.

  I’m not entirely sure we humans have this skill down. If you’re like me, it’s impossible to watch much of the broadcast news in America: the screen split into four panels, where four different talking heads are shouting over each other from one commercial break to the next. Perhaps part of the reason computer software appears to know how to converse is that we sometimes appear not to.

  It’s very telling that this subtle sense of when to pause and when to yield, when to start new threads and cut old threads, is something in many cases explicitly excluded from bot conversations.

  Decision Problems

  It is this ballet and negotiation of timing that linguists and programmers alike have kept out of their models of language, and it is precisely this dimension of dialogue in which words like “uh” and “um” play a role. “Speakers can use these announcements,” linguists Clark and Fox Tree write, “to implicate, for example, that they are searching for a word, are deciding what to say next, want to keep the floor, or want to cede the floor.”

  We are told by speaking coaches, teachers, parents, and the like just to hold our tongue. The fact of the matter is, however, filling pauses in speech with sound is not simply a tic, or an error—it’s a signal that we’re about to speak. (Consider, as an analogue, your computer turning its pointer into an hourglass before freezing for a second.) A big part of the skill it takes to be a Jeopardy! contestant is the ability to buzz in before you know the answer, but as soon as you know you know the answer—that buzz means, roughly, “Oh! Uh …,” and its successful deployment is part of what separates champions from average players. (By the way, this is part of what has been giving IBM researchers such a hard time preparing their supercomputer Watson for serious competition against humans, especially for short questions that only take Alex Trebek a second or two to read.)

  In 2000, MIT researchers Nikko Ström and Stephanie Seneff presented at Interspeech a paper which found that leveraging the word “um” could make spoken-dialogue telephone menus much easier and more intuitive to use. At the 2009 Interspeech conference in Brighton, a group of four researchers from Kyoto University presented findings at the poster session to the effect that in a number of situations, the timing of human speech offers a computer system more information than the content.

  In part, computer programs’ history of not dealing well with questions of timing goes back to the original theories about what computer programs are. As formalized in the early days of computability theory, programs performed what are known as “decision problems.” The idea was that you would feed a program a given input, specifying where the input began and ended. The program would then process that input, taking however much time it happened to need, at the end of which it would output a clearly defined result.
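  A rough sketch may help make the contrast with timely answers concrete. Primality is used here only as a stand-in decision problem (it is not an example from the text), and the names `is_prime`, `is_prime_within`, and `max_steps` are hypothetical. The first function takes a complete input and runs for as long as it needs; the second answers within a fixed budget of steps and reports whether it actually finished, in the spirit of "timely answers, correctly if possible."

```python
from typing import Tuple


def is_prime(n: int) -> bool:
    """Classic decision procedure: whole input in, yes/no out, no time limit."""
    if n < 2:
        return False
    d = 2
    while d * d <= n:
        if n % d == 0:
            return False
        d += 1
    return True


def is_prime_within(n: int, max_steps: int) -> Tuple[bool, bool]:
    """Same question under a step budget.

    Returns (answer, finished). If the budget runs out before a divisor is
    found, the answer is a guess of True and finished is False.
    """
    if n < 2:
        return (False, True)
    d = 2
    steps = 0
    while d * d <= n:
        if steps >= max_steps:
            return (True, False)  # out of time: answer anyway, flag the guess
        if n % d == 0:
            return (False, True)
        d += 1
        steps += 1
    return (True, True)


if __name__ == "__main__":
    print(is_prime(91))             # takes as long as it takes
    print(is_prime_within(91, 3))   # answers on time, possibly wrongly
    print(is_prime_within(91, 10))  # enough budget to finish
```

  With a budget of three steps the second function replies on time but guesses wrong about 91 (which is 7 × 13); with ten steps it finds the divisor. Counting steps instead of reading a clock is a simplification chosen so the behavior is repeatable.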

  This model of language, this presumed one-to-one mapping of “inputs” to “outputs,” leads to all sorts of terrific opportunities for mischief making, I discovered, when it comes to conversing with chatbots. ELIZA’s pattern matching, for instance, isn’t sophisticated enough to consider whether the pattern it’s trying to match is one utterance or many. When I begin a conversation as follows—

  ME: I am Brian.

  ELIZA: Do you enjoy being Brian?

  —and reply “I, er, guess so,” I get back the noncommittal response “I see.” But if I—as many instant-message users do—break that response up across multiple carriage returns, chaos ensues:

  ME: I

  ME: er

  ME: guess so

  ELIZA: Tell me more …

  ELIZA: Tell me more …

  ELIZA: Tell me more …
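  The failure above can be imitated with a toy sketch. Nothing here is ELIZA's actual code: `toy_reply` is a made-up rule that merely reproduces the two responses in the transcripts, and `group_into_turns`, the timestamps, and the 1.5-second `max_gap` are all assumptions for illustration. The first approach answers every carriage return separately; the second uses timing to merge fragments that arrive close together into a single utterance before replying.

```python
from typing import List, Tuple


def toy_reply(utterance: str) -> str:
    """Toy stand-in for a pattern matcher (NOT ELIZA's real rules)."""
    return "I see." if len(utterance.split()) >= 3 else "Tell me more …"


def reply_per_line(lines: List[str]) -> List[str]:
    """One input line, one output line: the rigid input-output binding."""
    return [toy_reply(line) for line in lines]


def group_into_turns(events: List[Tuple[float, str]],
                     max_gap: float = 1.5) -> List[str]:
    """Merge fragments whose timestamps are at most max_gap seconds apart.

    events is a list of (seconds, text) pairs in chronological order.
    """
    turns: List[str] = []
    current: List[str] = []
    last_time = None
    for timestamp, text in events:
        # A long silence closes the current turn and starts a new one.
        if last_time is not None and timestamp - last_time > max_gap:
            turns.append(" ".join(current))
            current = []
        current.append(text)
        last_time = timestamp
    if current:
        turns.append(" ".join(current))
    return turns


if __name__ == "__main__":
    fragments = [(0.0, "I"), (0.4, "er"), (0.9, "guess so")]
    print(reply_per_line([text for _, text in fragments]))
    print(reply_per_line(group_into_turns(fragments)))
```

  Treating a silence as the end of a turn is only one possible heuristic, and a crude one; it is meant to show that timing carries information the line-by-line model throws away.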

  Surprisingly, the rigidity of the input-output binding is even stronger in many of the newer chatbots. If you talk to Cleverbot, 2005 and 2006 Loebner Prize winner, at its website, each time you hit the enter key, the input box locks and cannot be clicked or typed into until Cleverbot has finished responding:

  ME: oh hi

  CLEVERBOT: How are you?

  ME: um,

  CLEVERBOT: Is that good or bad?

  ME: well,

  CLEVERBOT: How?

  If you talk to Elbot, the 2008 Loebner Prize winner, online, each time you hit the enter key the screen refreshes, and one of Elbot’s responses is visible:

  ME: umm

  ELBOT: I’m starting to wonder: Who convinced you to come visit me?

 
