The independent roles that the cortex and limbic system play in voice have also been beautifully illustrated by neurosurgeons who, when plotting a route through delicate brain tissue to reach a buried lesion, often use an electrode to stimulate tissues under the cortex to determine their function (so that they know what not to cut or mangle for fear of causing blindness, deafness, paralysis, or death). The patients remain wide awake throughout (since the brain has no pain receptors) and are able to answer the doctor’s questions about bodily sensation or movement elicited by the electrode. When doctors get deep into the brain and stimulate the limbic system, particularly the amygdala, patients behave exactly like Hess’s cats, producing cries of raw fear or anger, laced with what one neurosurgeon called “uncontrolled swearing.”12 The instant the electrode is withdrawn, their utterances return to polite G-rated calm.
* * *
Vocal emotion is not shaped purely by a top-down policing action of the cortex, though. The “animal” limbic system has nerve axons of its own that project upward into the cortex and thus (in the parlance of neuroscience) “talk to” the higher brain regions. Damasio mapped these circuits and showed that emotions, gut feelings, instincts, and intuitions play a far greater role than previously realized in “rational” decision making. To be fully human, we need emotions, which, after all, evolved to help us stay alive and pass along our genes. But sometimes our emotions override the rational brain entirely. When this kind of mutiny occurs, the results, in terms of voice, can be calamitous, giving rise to an array of extraordinary, emotion-based vocal disorders.
One of the more dramatic is dysphonia, a clenching of the vocal cords across the windpipe that restricts phonation, producing a strangled sound.13 Acute sufferers lose the power of speech altogether. Brain scans show that dysphonic patients have highly aroused amygdalae and (as with Tourette’s patients) quieted cortical areas that ordinarily damp down amygdala response. In some cases, a single, emotionally cataclysmic event sets the syndrome off—as with the nearly four-decades-long episode that interrupted the career of British singer Shirley Collins, a leading figure in the folk music revival of the 1960s. On albums like The Power of the True Love Knot (1968), Collins conjured a delicate, unschooled soprano that she would lift into a spine-shivering falsetto. But in the late 1970s, when she was approaching forty, Collins appeared in a musical in London, performing alongside her husband, Ashley Hutchings, bassist for the folk band Fairport Convention. Early in the play’s run, Hutchings announced that he was in love with another actress in the play, and was leaving Collins. Sharing a stage, night after night, with her sexual rival, Collins suffered a trauma that centered, perhaps unsurprisingly, in a part of her that had always been preternaturally expressive of her emotions: her voice. “My voice just—my throat locked,” she later said. “Some nights I could manage a few notes, sometimes nothing came out at all when I opened my mouth.” Diagnosed with dysphonia, Collins abandoned singing and remained silent, as an artist, for nearly four decades, taking odd jobs to support herself and her two children.14
Not all emotional vocal disorders result in shutting down the voice. Puberphonia is a syndrome that befalls men in their late teens and early twenties, after they have undergone all the normal bodily changes associated with puberty—including the enlargement and thickening of the vocal cords that ordinarily deepen the voice. Puberphonic voices, however, don’t deepen; they retain their high, light, childlike pitch and timbre. In rare cases, a hormonal problem is the cause (the vocal cords, unresponsive to testosterone, fail to enlarge). But the malady is more often psychological—which is to say, emotional—in origin. The puberphonic patient unconsciously pulls the larynx high in the neck, tilting it forward, which both increases tension on the vocal cords and reduces the size of the throat’s resonating chamber, giving the voice a high, childlike pitch and timbre. In therapy, sufferers report a fear of growing up, leaving childhood and assuming adult responsibilities. Michael Jackson was believed to be a lifelong sufferer.15
Such “bottom-up,” limbic influence on the voice is not confined to dramatic disorders like Tourette’s, dysphonia, and puberphonia. Activity in the limbic system leaks into the most mundane speech, all the time, for better and for worse. Someone asking for the salt from a dinner companion will manifest entirely different emotional voicings, depending on whether the speaker is talking to a spouse of thirty years, a work rival, or an alluring first date. Indeed, keeping emotion out of the voice can be really hard. In the fall of 2008, during the global subprime mortgage meltdown, I spent a whole day in my home office, unable to work as I absorbed news of the looming apocalypse. Around 3:30, I heard my nine-year-old son arrive home from school (we couldn’t see each other; my office is down the hall and around a corner from the front door). Eager to shield him from my anxiety, I called out in a voice that I tried to make sound like my normal, cheerful, after-school greeting: “Hi! How was school?”
His response: “What’s wrong?”
As this would suggest, humans have evolved exquisitely well-tuned hearing for detecting when someone is attempting a vocal masquerade. Evolutionary biologist Richard Dawkins and zoologist John Krebs, in a now classic 1978 paper, suggest how. They point out that deceptive signaling is, itself, an evolutionary adaptation, a trait that developed in our earliest animal ancestors, to reap survival and reproductive benefits.16 (We’ve already seen how hostile mammalian and avian vocalizations are built upon size bluffing through lowered pitch and noisy growling—a “dishonest signal.”) Dawkins and Krebs said that such false signaling is found in all animal communication: the colors flashed by butterflies, the calls of crickets, the pheromones released by moths and ants, the body postures of lizards, and our acoustic signals. Nature is deceitful. Creatures will do what they can to not die—at least until they’ve succeeded in winning a mate and passing along their genes.
But at the same time, Dawkins and Krebs tell us, the receivers of deceptive signals—the would-be dupes—undergo their own coevolutionary “selection pressure” for detecting false communications. The coevolution of voice and ear (which began after the lungfish emerged onto land and began to adapt its underwater hearing apparatus for the detection of airborne sounds) initiated a biological “arms race” (in Dawkins and Krebs’s memorable phrase). The “manipulating” vocalizer evolves, over vast spans of evolutionary time, finer and finer means for dissembling, by acquiring greater neurological control over the vocal apparatus. Meanwhile, the listener, who has his own survival concerns, gets better at picking out the particular blend of pitch, rhythm, timbre and volume that marks the vocalizer as a deceiver. This compels the sender to further refine his “manipulations,” which creates further pressure on the receiver to improve his acoustic “mindreading.” What you end up with, after hundreds of millions of years of such coevolutionary one-upmanship, is an animal (us) who can detect, in just a few syllables, evidence of emotional panic in a father, betrayal in a spouse, lust in a dinner companion, or disapproval in a boss.
* * *
Scientific inquiry into precisely how the voice encodes such subtle emotions was, for all of the twentieth and much of the early twenty-first centuries, sporadic at best, nonexistent at worst. The paucity of research was mostly due to the nearly insuperable difficulties of investigating affective states in our species. Even labeling specific vocal emotions is difficult: what to one person sounds like fear can sound, to another, like anger. Is that “joy” or merely “enthusiasm”—or “excitement”? In the study of dog, bird, or squirrel monkey vocalizations, the problem is negligible, because of the ease in matching their expressive noises to the actions they take. But we humans are (thanks to our oversized cortex and the ways it interacts with our limbic system) far more complicated.
Imagine you are a voice researcher specializing in emotion. Even supposing you can faithfully recognize and label highly specific vocalizations, how do you make your human subject produce a real, true, felt emotional voice sound on cue? Say you want to study “anger.” Or “sadness.” Or “pride.” How about “jealousy”? Never mind the truly fascinating, complicated, blended emotions like “Guilt-mixed-with-a-bit-of-passive-aggression.” Even supposing that you manage somehow to elicit one perfect sample of a given vocalization (and can recognize it for what it is), how do you get your subject to reproduce that exact sound, over and over again? (Science demands repeatability, to make sure you’re not getting a one-off anomaly.) Imagine trying to determine how such sounds are produced by the complex action of the various laryngeal muscles. Not easy, when your test subject is pin-cushioned with EMG needles—electrodes pushed through the neck into the larynx muscles to monitor the tensing and relaxation that control pitch. (“Give me a vocalization that says, ‘Restful ease.’ ”)
Given such difficulties, it is perhaps not surprising that very few scientists of the last century were willing to devote their career to the study of vocal emotion. The great exception is Klaus Scherer, a German-born social psychologist who is, at age seventy-five, the indisputable world authority on voice and emotion. He earned this accolade because he, almost alone among scientists, was stubborn and obsessive enough to master the wide array of disciplines (anatomy, physiology, neuroscience, phonetics, linguistics, auditory physics, psychology, computer programming) necessary to make a dent in the subject.
Born in 1943, near Cologne, Scherer traces his fascination with the subject to his teen years when he acquired one of the first commercially available tape recorders and began producing radio plays and documentaries with his high school friends. This, along with a love of the BBC radio soap opera The Archers, impressed upon Scherer how voices, even when isolated from all other social cues (facial expression, gestures, clothing), vividly conjure specific people, their personality, character, and emotions.17
He began formal study of the voice in 1968, when he enrolled in the PhD program in social psychology at Harvard. His initial interest was in voice and personality, how traits like “charisma” and “persuasiveness” are communicated by vocal acoustics. Scherer organized test subjects into mock juries to argue cases and measured their personalities, using peer ratings, to determine who was dominant, who was passive, and how that was manifest in the vocal signal. That these traits were mostly dependent on just two predictable variables (volume and pace) was both a surprise to Scherer—and a disappointment. He was in the midst of this research (which he found increasingly “boring”) when he attended a talk by visiting lecturer Paul Ekman, a University of California, San Francisco, psychologist known for his pathbreaking study of facial expressions. In visits to a remote tribe in Papua New Guinea, Ekman documented how the tribe communicated basic emotions—happiness, anger, disgust, sadness—with expressions identical to those of people in London, Paris, New York, or anywhere else, supporting an idea first noted by Darwin in The Expression of the Emotions in Man and Animals.18 Fascinated, Scherer approached Ekman after the lecture and mentioned his research on the voice. Ekman encouraged Scherer to drop personality research and focus on vocal emotion. “I’m the face man,” Ekman told him, “but maybe you’re the voice man.”19
Scherer decided to become just that. After completing his PhD in 1970, he took a professorship at Giessen University and began designing experiments. Like Ekman, Scherer took a Darwinian approach, seeing the emotional channel of the voice as a mechanism, selected and perfected over millions of years, for promoting the survival and reproduction of our species, and thus a universal trait that should sound the “same,” in terms of emotional prosody, no matter what language someone was speaking. He found early support for this in an experiment that used test subjects from Europe, Asia, and the United States who listened to a speaker expressing an array of states (fear, anger, joy, surprise, sadness). The listener-judges, despite not knowing the language, accurately identified the specific emotions 66 percent of the time: strong evidence that vocal emotions are, like facial expressions, universal. You actually know this from watching subtitled films where a Japanese or German or French actor’s line readings “fit” exactly the emotional prosody of your native tongue—otherwise, what you see the actors doing, and hear them saying, would clash with the subtitles in a way that would make the movie unwatchable.
But when it came to quantifying specifics of the vocal signal that convey precise emotions, Scherer encountered challenges very different from those confronted by Ekman. Facial expressions can be photographed, filmed, and closely scrutinized, the muscle groups readily mapped and compared across individuals, cultures, and species. Not so the voice, whose signal results from an extraordinarily complex interaction between diverse body parts (most of them hidden inside the body) acting upon invisible air molecules. To further complicate matters, the emotion-bearing air vibrations also encode the speech signal (all those vowel formants and consonants that make up words), and other elements like the pitch and pace variations that mark noun and verb phrases—that is, the linguistic prosody that helps to convey meaning. From this richly layered, complex signal, emotions must somehow be teased out.
Even harvesting data for research was fraught with difficulties. The most “ecologically valid” data were from recordings of real-life emotional situations, like cockpit recordings of pilots in life-threatening situations, live news reports of fatal disasters (like the Hindenburg crash where the radio announcer, in a live broadcast caught on tape, has a vocal meltdown as he witnesses passengers and ground crew being immolated), harassed callers to tech support lines, and even excitable contestants on TV game shows. But such data tends to be of short duration and poor sound quality (recorded through telephones, crackling cockpit PA systems, and small TV speakers). It also requires an investigator to rely on subjective inferences about what emotion a speaker is feeling. Does the radio announcer’s cry of “Oh, the humanity!” as he watches the Hindenburg go up in flames indicate “sadness,” “horror,” “fear,” or some other emotion—or a blend of all of them? Greater experimental control, Scherer found, is offered by “induction studies,” in which stress reactions are produced in test subjects by exposing them to stimuli (like unsettling pictures or films), or asking them to complete stressful tasks, like guiding a helicopter avatar in a videogame under time pressure, while they vocalize. But induction studies yield relatively weak emotional responses (the pilot of an actual disabled airplane barreling toward a hangar makes different sounds than a gamer manipulating pixels), and, again, labeling the vocalizer’s internal state can be problematic. “In spite of using the same procedure for all participants,” Scherer wrote in a 2002 review paper, “one cannot necessarily assume that similar emotional states are produced in all individuals.”20
The best method, Scherer concluded, is one that was used for the first-ever investigation into emotion in everyday speech, a study that used actors speaking scripted content. That experiment, conducted in 1931 at the University of Iowa, was by Gladys Lynch, a PhD candidate who wanted to identify the vocal parameters that distinguish good from bad acting. She recruited twenty-five trained thespians, and an equal number of nonactors, to read aloud prose passages with widely differing emotive content: a dull technical manual; a Galsworthy play about angry, striking factory workers; and a Seán O’Casey play in which a mother laments her child’s death. Meticulous analysis of the pitch, pace, and volume of each recorded voice revealed that, for all the variations that the professional actors introduced into the lines to supply a certain theatrical liveliness and originality to their performances, all the voices followed an identical pattern of rises and falls—as if each written passage contained not only the literal meaning encoded in the words, but an emotional meaning encoded in the melody that any human voice, theatrically trained or not, will express.21 The study strongly supported the Darwinian notion of a universal emotional grammar in our species. But critics argued that actors performing scripted content yield questionable data because they exaggerate obvious emotional cues and miss more subtle ones. Scherer, however, defended the use of actor-portrayal studies after his research of the mid-1970s showed that listener-judges, across languages and cultures, could accurately identify highly specific vocal emotions produced by actors—proof that many aspects of these simulated vocal signals were encoding emotions accurately.
But Scherer added a fascinating proviso. The actors in such studies would ideally have been trained in the performance technique developed by the nineteenth-century Russian acting teacher Konstantin Stanislavsky, later called the Method, and popularized by Hollywood stars including Marlon Brando, James Dean, Robert De Niro, and Meryl Streep. Such actors eschew the practice of classically trained thespians like Laurence Olivier, whose technique consisted of closely observing, then scrupulously mimicking, outward emotional expression.22 Method actors recall (and relive) personal, emotionally salient events from their own lives and allow their internal reactions to take the spontaneous form that emerges in behavior. This can lead to breathtakingly genuine-seeming emotional expression, as is clear from Brando in A Streetcar Named Desire, De Niro in Taxi Driver, or Streep in The Deer Hunter (or any movie she’s ever been in)—performances that revolutionized acting.