Children who suffer strokes after age nine or ten have a very different prognosis. “Aphasias that develop from this age on… commonly leave some trace behind which the patient cannot overcome,” Lenneberg reported, “there will be odd hesitation pauses, searching for words, or the utterance of an inappropriate word or sound sequence.”42 The linguistic brain, he added, “behaves as if it had become set in its ways.”43 Which is why, by the time we enter high school, foreign languages have to be learned through conscious effort (“je suis, tu es, il est, nous sommes, vous êtes…”), and certain accent markers can, no matter how hard we try, be just aboot impossible to lose.
* * *
Some of the most convincing (if upsetting) evidence for critical periods in speech comes from “feral children”: those rare cases of babies raised with little or no exposure to human voice sounds. The most notorious modern example is “Genie,” the pseudonym for a grotesquely abused girl held captive, from infancy on, by her deranged father. Strapped to a child’s toilet seat, day and night, and isolated in a soundless back room of the family’s suburban Los Angeles home, she was not spoken to (except in animal-like barks and snarls) throughout all the critical stages of language development. She escaped her captivity in 1970, at age thirteen, but despite the efforts of speech therapists and psychologists, she was able to acquire only a handful of words, which she could not string into sentences. Her voice, through disuse, had also failed to develop properly: respiration, phonation, pitch control, and articulation were all badly stunted. “She had been beaten for vocalizing,” Susan Curtiss, a linguist who treated the girl, told author Russ Rymer in 1993. “So when she spoke she was very tense, very breathy and soft. She couldn’t be understood. There was a lot of sound distortion, as though she had cerebral palsy, but there was no evidence of muscle or nerve damage.”44 That Curtiss should have likened Genie’s distorted speech to that of someone with cerebral palsy (a disorder that often involves damage to the basal ganglia) was, in retrospect, prescient: the basal ganglia’s critical role both in learning language and in coordinating breathing, phonation, and articulatory movements had not yet been discovered.
Today, Genie’s permanent mutism remains the most salient, and disturbing, evidence that humans must hear human voices talking, within a particular time window, in order to master speech and language.
* * *
Because it is unethical to raise a child under twenty-four-hour scrutiny in a language lab, and feral children are (fortunately) rare, most of what we know about the precise stages of speech acquisition derives from baby diaries kept by attentive parents, usually scientists. In the early twentieth century, German psychologist William Stern and his wife, Clara, spent eighteen years obsessively documenting the lives of their three children from birth to around age ten45 and provided clear milestones for first cries (at birth), coos and gurgles (at six to eight weeks), babbling (at six to nine months), first words (at around one year), and, finally, articulate speech (at around three years). But some of the most fascinating insights we possess into language acquisition date to a half century earlier, when Charles Darwin documented the first year in the life of his son, William, born in 1839.46
Darwin noted that, by his first birthday, William had yet to utter his first real word—although he did use an invented term for food (“mum”). Especially intriguing to Darwin, however, was the prosody, the music, William used when asking for mum: “he gave to it… a most strongly marked interrogatory sound at the end.” (That is, he raised the pitch across the word.) “He also gave to ‘Ah,’ which he chiefly used at first when recognizing any person or his own image in a mirror, an exclamatory sound, such as we employ when surprised.” Darwin here recorded a previously unremarked-upon fact about how babies learn to talk: the music of speech—its expressive prosody—emerges before words. Tunes before lyrics. This insight would have strong implications for Darwin’s later theory of how language evolved in humans (as we’ll see), but for now it is enough to note that modern research has confirmed Darwin’s observation that a language’s specific prosody, its unique melody and rhythm, emerges in a baby’s babbling before those singsong articulations are molded into specific words. Indeed, an eight-month-old’s variegated babbling so closely conforms to the rhythmic stress and pitch patterns of its native tongue that it sounds uncannily like actual speech. You can see this for yourself on YouTube, which is replete with videos of parents jokily “conversing” with their nonsense-spewing infants.47 But it’s no joke: lots of crucial learning is going on during those parent-baby exchanges. And the adults are not necessarily the ones in charge, as was demonstrated in clever experiments of the mid-eighties.48
Over large closed-circuit TV screens (this was before Skype, but the participants were, effectively, Skyping), mothers exchanged vocal noises with their eight-week-old babies. Parent and child fell into the expected exchange of musical expressions of attention and affection: the baby’s coos, sighs, and gurgles triggering the mother’s singsong Motherese, which in turn triggered more coos from the baby, and more Motherese. But when the researchers secretly played a recording of the baby made minutes earlier, so that the child’s sounds were now out of sync with the mother’s vocalizations, the women’s Motherese vanished. Their voices lowered to normal pitch and their speech took on an adult pace and complexity. The researchers concluded that Motherese, however instinctual, is also part of a feedback loop driven by the baby’s vocal sounds. In short, two-month-olds conspire with parents in their own language tutoring—and, with their own baby voices, help guide it.
* * *
In those exchanges, the baby is also learning an aspect of voice crucial to our species’ unique powers of cooperation and goal sharing: conversation, a behavior far more complex than you might realize. But think about it. Back-and-forth conversation, although an unplanned activity ungoverned by any prearranged rules of turn duration, topic, or overall length, is amazingly orderly. Indeed, a team of sociologists in the early 1970s49 showed that when we talk with others, we not only overlap very rarely (only about 5 percent of the time) but also leave almost no gap between the moment one speaker falls silent and the other starts speaking. The elapsed time between turns is, on average, 200 milliseconds—too short for our ears to detect, so that conversation sounds continuous, one voice seamlessly picking up where the other left off. But that 200-millisecond gap also happens to be too short for a listener to decode what was said to her and to formulate a reply. Conversation, in other words, should be impossible.
To understand why I say this, you need to know something about how we, as listeners, turn the incoming ripple of air vibrations that emerges from a speaker’s lips into meaningful speech. Because when we hear someone say “Pass the salt, please,” we are hearing not an “airborne alphabet,” not individual auditory “letters” that “spell out” a message. We are not even hearing individual vowels and consonants—or phonemes (the technical term for speech sounds)—which are, after all, only segments of sound that, for the purposes of analysis, speech scientists arbitrarily divide into manageable units (remember when I said that speech is a ribbon of sounds with no gaps in between?). Who is to say where the l in “salt” ends and the t begins? In reality, speech is a connected flow of ever-changing, harmonically rich musical pitches, their fundamental set by the rate at which the phonating vocal cords vibrate and their complex overtone spectrum filtered by the rapidly changing length and shape of throat, mouth, and lips (to produce the vowels), all interspersed with bursts of noise: the hisses, hums, and pops that we call consonants. The effect is not dissimilar—in fact, it’s identical—to an instrumental group playing ringing harmonic piano chords and single guitar notes, punctuated by noisy cymbal crashes, snare hits, and maracas: our voices, quite literally, constitute a mini symphonic ensemble. It is our brain that turns this incoming stream of sonic air disturbances, this strange music, into something deeply meaningful.
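To give a rough sense of the filtering just described (a textbook idealization on my part, not a figure from this chapter): if the vocal tract is treated as a simple tube about 17.5 centimeters long, closed at the vocal cords and open at the lips, its resonances fall near

\[
f_n = \frac{(2n-1)\,c}{4L} \approx 500,\ 1500,\ 2500\ \text{Hz} \qquad (c \approx 343\ \text{m/s},\ L \approx 17.5\ \text{cm}),
\]

roughly the resonant frequencies of a neutral vowel. Reshaping the tube with tongue, jaw, and lips shifts those resonances, and that shifting is what turns one vowel into another.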
Researchers at Haskins Labs in New York City offered the best explanation for how our brain performs this feat when they published, in 1967, their “motor theory of speech perception,” which emerged from the conceptual breakthrough that spoken language doesn’t begin as sound;50 it begins as a series of dancelike bodily movements, gestures, built from a set of mental instructions originating in Broca’s area and another region, a few inches back in the left hemisphere, Wernicke’s area, which is where we store our mental dictionary—the words that Broca’s area assembles and inserts into sentences. This “mentalese” is only then converted into speech by muscle commands delivered to the vocal organs,51 which produce a complicated sound wave rich in linguistic information. This sound wave is perceived as meaningful speech not by the listener’s auditory centers, but by her motor centers. That is, the brain circuitry responsible for animating her vocal organs is activated by your sound wave, which triggers, in her brain, an internal, silent, neural speech that exactly mirrors the motor instructions you sent to your larynx and articulators, thereby informing her what lip shapes and tongue heights and articulatory targets you hit to produce your request. Reverse-engineering your vocal sound wave, she runs it from her motor circuits up through her Broca’s and Wernicke’s areas to retrieve the vowels and consonants, words and syntax that made up your utterance, and thus arrives at the thought formed in your frontal lobes. And she hands you the salt.
All of which suggests that every act of verbal give-and-take—no matter how hostile, accusatory, contradictory, or aggressive—is also one of extraordinary empathy and intimacy, bringing the brains of speaker and listener into a perfect synchrony, and symmetry, of neural firing. At least, on the level of language. How we communicate, perceive, and interpret the emotional channel of the voice, which is superimposed on, or interwoven with, the linguistic channel, is another story altogether, and one that I’ll save for later. For now, we must return to my earlier, seemingly nonsensical, claim that conversation should be impossible.
I said so because of the long and laborious cognitive stages your dinner companion has to grind through in order to process the sentence “Pass the salt, please.” To make her own verbal reply, she then has to decide on a response (“No—I think you use too much salt” or “Are you sure you wouldn’t prefer some pepper?”)—a snatch of mentalese she must pass through her language centers and motor cortex to produce the muscle commands that animate her vocal organs to voice her reply. All of this takes, at minimum, about 600 milliseconds:52 three times as long as the 200 milliseconds in which people routinely reply. To respond that fast means that we must be gearing up to speak well before the other person has finished talking, and that, furthermore (since we so rarely overlap), we must have a very good idea when our interlocutor is going to fall silent.
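How big a head start? A bit of simple arithmetic from the figures just cited (my own calculation, not a separately measured value) gives a lower bound:

\[
600\ \text{ms} - 200\ \text{ms} = 400\ \text{ms}.
\]

That is, a listener must begin assembling her reply at least four-tenths of a second before the current speaker stops talking, while the end of his sentence is still arriving at her ears.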
Researchers first theorized that we use subtle linguistic clues, along with telltale hand or eye movements, to figure out when a speaker’s “turn” will end and to anticipate when our own is coming. But there was a problem with this explanation: people speak with no-gap/no-overlap precision over the phone, when they can’t see each other. And linguistic clues can’t be how we orchestrate turn taking because, as we’ve noted, it takes too long to process what is said to us and then to formulate and execute our reply. Instead, research shows, seamless conversation is possible thanks to the music of speech: the changing pitch of the voice as it follows a sentence’s melodic trajectory; the duration of the vowels (which determines the rhythm of an utterance); and the changes in volume from loud to soft. In a 2015 study at the Max Planck Institute for Psycholinguistics, recounted in linguist N. J. Enfield’s book How We Talk: The Inner Workings of Conversation,53 researchers asked students at Radboud University, in the Netherlands: “So you’re a student?” and “So you’re a student at Radboud University?” In the first instance, students responded “Yes” the instant the speaker finished the word “student,” with no overlap. In the second, students were equally precise, saying “Yes” after “University,” with no gap and no overlap. They could do this only because they used the music of speech to time their reply. In the shorter question, the pitch on “student” rose sharply, tripling from around 120 cycles per second to 360 cps, signaling a query, and the last vowel was extended (studeeent?), giving the listener a head start for dropping in a “Yes” within the 200-millisecond time window. In the second instance, the pitch on “student” didn’t rise and its final syllable was clipped: rhythmic and melodic hints that more words were coming, cuing the listener to hold off on responding.
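To put that pitch rise in musical terms (a rough conversion of my own, not a number reported in the study): a jump from about 120 to 360 cycles per second works out to

\[
12 \log_2\!\left(\frac{360}{120}\right) \approx 19\ \text{semitones},
\]

roughly an octave and a fifth, an unmistakable melodic signal that a question is ending.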
* * *
The subspecialty of linguistics that studies conversation is called “discourse analysis,” and it looks at all aspects of the prosody that guides our verbal give-and-take, including the communicative power of silences. A pause lasting longer than the usual 200 milliseconds between turns suggests that the person about to speak is giving special consideration to her reply, but there’s a limit to how long a person can remain silent. Pauses that stretch beyond one second cease to seem polite and create doubt about whether a reply is coming at all. Anything longer than a two- or three-second pause is socially unendurable, creating an awkwardness that forces the original speaker to start babbling to fill the silence. Journalists like me take advantage of this impulse by deliberately failing to ask a follow-up question when a subject is clearly struggling to hold back a secret, a tactic as socially excruciating for the reporter as for the interviewee.
All of this suggests that conversation is conducted, like a piece of music, according to an agreed-upon “time signature,” a kind of internal metronome that determines the number of syllables produced per second. When one speaker deviates from the established rhythm, she signals disagreement. You “hear” this regularly on cable news channels, where one speaker’s assertion, made in a slow, calm voice, is met with a response spoken at high speed, sometimes even interrupting it. You don’t have to hear what either person is saying to know that they disagree. Vocal pitch operates the same way. Like jazz musicians taking a solo, speakers can improvise freely in terms of content (that is, they can say whatever they want, and at whatever length), but only within a predetermined and mutually agreed-upon pitch, or musical “key.” A sudden switch in register signals discord.
This tonal dimension of conversation was first described by discourse analyst David Brazil in the mid-1980s.54 Brazil showed that we converse in three broadly distinct pitches (high, mid, and low). If we disagree with something that someone has said to us, we start speaking at a pitch several semitones higher than the voice of the person who just left off. Brazil called this a “contrastive” pitch. The mid register—in which the person making a reply starts speaking on the same vocal pitch as the person who just stopped talking—signals agreement. We reply in the low register, below that of the previous speaker, when agreement is so strong that it is, as Brazil called it, “a foregone conclusion,” that is, hardly worth stating.
Conversation, in other words, is not isolated voices making separate points in an alternating series of monologues; it is a creative collaboration, orchestrated not by linguistic rules but by prosodic ones. It is a form of singing, a duet in which two brains choreograph, through variations in pitch, pace, and rhythm, the exchange of ideas. Ideally, conversation advances our species’ thinking through the statement of a thesis, the response of an antithesis, and the eventual arrival at a synthesis—a new idea that emerges from the creative, and civil, exchange of (musically orchestrated) ideas.
Anne Wennerstrom, a discourse analyst, notes that the terms we use to describe successful versus unsuccessful conversations frequently draw on musical metaphors. Good conversations are “in synch,” or “in tune,” or “harmonious”; the speakers “didn’t miss a beat.” Bad conversations are “out of synch,” “out of tune,” “discordant,” with the speakers “off their stride” and “on a different wavelength.”55 These metaphors are no coincidence. To a degree that we rarely, if ever, acknowledge, conversation is music—a music learned, and reinforced, in those earliest, seemingly meaningless, exchanges between parents and babbling babies.
* * *
Babies produce their first spoken word around age one, after which they start adding new words at breathtaking speed. In one study, two-year-olds were shown mysterious objects like an apple corer and told just once that it was a “dax.”56 Though never again told this word, they recalled it weeks later when asked to point to a picture of the “dax” on a screen. Cognitive scientists call this incredibly speedy word acquisition (and retention) “fast-mapping,” and it allows children to stock the mental dictionary in Wernicke’s area at the blinding rate of roughly one new word every two hours—a rate they keep up for fifteen years, which is how the average child, by the end of high school, has accumulated some sixty thousand words in her mental dictionary. Lest this number fail sufficiently to amaze, consider how difficult it would be to remember sixty thousand telephone numbers or internet passwords. We remember words so well because they are not random assemblages of information units, like phone numbers or passwords. Words carry meaning and we are clearly a species that craves meaning.
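The arithmetic behind that estimate is worth spelling out (my own rough check, assuming the one-word-every-two-hours rate holds around the clock):

\[
\frac{24\ \text{hours/day}}{2\ \text{hours/word}} \times 365\ \text{days/year} \times 15\ \text{years} \approx 66{,}000\ \text{words},
\]

comfortably in the neighborhood of the sixty-thousand-word figure.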
The true power of language emerges when babies start combining words to make complex meanings. They first do this around their second birthday, blurting out two-word utterances, like “All wet!” or “No bed!” or “Play iPad!” Simple as these utterances seem, they betray a remarkable syntactic sophistication, including an understanding that word order affects meaning (“Lucy hit” means something different from “Hit Lucy”). Chomsky argues that word order could not be learned from the “impoverished” input of half-heard, rapid-fire, degraded adult speech, and so must be inborn. But linguists like Snow, Ferguson, Garnica, and Fernald have shown how slow, repetitious, singsong Motherese teaches exactly the kind of telegraphic two-word utterances, heavy on word order, that toddlers end up speaking as their first grammatical constructions.