by Chip Walter
But as primates evolved, more and more neurons called “pyramidal cells” (so called because of their triangular shape) sent out long axons from the rapidly evolving cerebral cortex to connect directly and deeply with the nerve systems that manage the lungs, larynx, face, and tongue.13 This eventually gave us the more conscious control we needed to trip the switches in all of those organs we use to form the words that express our thoughts.14
When you bend over to pick up a big box, you instinctively stiffen your torso as a brace against the weight by trapping the air in your lungs beneath a couple of retractable but very strong muscles called the vocal folds (known to most of us as the vocal cords). These cap your voice box or larynx, the front end of which you see in the mirror every day in the form of your Adam’s apple (less prominently in women than in men). Exert yourself a little too much and you will grunt because the effort pushes a little of the air out of your lungs and through the folds.
We use the same system to speak, except with considerably more subtlety. Speech begins with inhaling the amount of air our brain has calculated we will need to say what we want to say. Once it is in our lungs, we release it in controlled puffs up through the windpipe (trachea) until it collides with the vocal folds at the top of the larynx. It is here that we first begin to shape the air to make specific sounds. We may, for example, want to make a buzzing noise. If we let the folds vibrate, we produce a zzzzzzz; if we hold them open, we produce an ssssss-like sound. When you sing or hum or talk, it is the air rattling past these folds in rapid bursts that becomes the foundation for what everyone recognizes as your particular voice, a sound so distinct that it cannot be truly duplicated by any other human.
The tighter the folds of muscle, or the smaller and more rigid they are, the higher the pitch of your voice. If the folds are looser or larger, your voice will be deeper. This is one reason why big men generally have lower voices than little girls, although the shapes of our throats, nasal passages, and mouths also have a lot to say about whether we speak in rich baritones, smoky whispers, or nasal twangs. If we get excited and constrict our vocal cords, our voices rise. You might notice that when you are nervous, your voice sounds a little higher than it might normally sound.
While the larynx gives our voices pitch and character, we sculpt sounds into phonemes after they have made their way through our vocal cords and up into our throats. The English language consists of 40 or so phonemes.15 They can be connected in innumerable ways to form every word in the language and plenty that don’t yet exist. (Remember how many new words Shakespeare brought to the English language, all of them pronounceable.) Other languages also use set numbers of phonemes. German has 37; Japanese, 21; and Rotokas, the language of East Papua, a mere 11. No language uses more than 141 phonemes because that represents the outer limit of the sounds we can utter.16
Whatever language we speak, our tongues, lips, and teeth fold, push, and cut the sound waves to shape them into word-making phonemes. First the vibrating air rises into the rounded chamber of our pharynx. Then we use the root, hump, and tip of our tongues to compress sounds or release them, or simply impede them before they escape from our mouths. We form the sounds for “gah” or “kay” or “tee” this way. Say almost any word (read these sentences out loud) and you can feel the rapid movements in your throat, tongue, and lips, all acting in perfect, linear synchrony.
Our lips are the last part of us to form word sounds. And we use them almost as deftly as we do our tongues. Sound out for yourself the subtle differences among “ef” and “vee” or “pee” and “bee.” Only the faintest shifts in our lips make the differences among these sounds. We have this capability because our ancestors developed unusually fat and sensitive lips, well adapted for trying out jungle fruits and their flavors. It is pure evolutionary serendipity that they also turn out to be very good for making extremely explicit sounds.
But the connection between flavor and communication might be more literal than we once suspected. Scientists who have studied mirror neurons have found that when one monkey watches another eat and smack his lips, he, too, will tend to shape his lips into a smacking shape, instinctively reflecting what he is watching, as if somehow he, too, is eating. We do a variation on this when we find ourselves pursing our lips as two lovers lean forward to kiss in a movie. We do it as well when we instinctively try to finish a sentence for someone who is struggling to find the right word.
All of these realignments were, of course, accidental in the ways that evolutionary mutations are. However, if we hadn’t developed our peculiar pharynx and traffic-cop epiglottis, we couldn’t hope to utter a single word.
This means our pharynx enabled our leap from gesture to speech, but the leap wasn’t as far from gesture as it might first appear. An increasing number of scientists are concluding that somehow the areas of the brain that once exclusively coordinated the manipulation of our hands in space expanded and evolved to also take on the duties of encoding thoughts into sounds and then manipulating the more than one hundred specialized muscles needed to hold a conversation. In other words, our predecessors’ brains learned to apply the same fine control over muscles in their throats that they had already developed to control the intricate muscles in their hands. Does that mean that in time our ancestors learned to gesture, not simply with their hands, but also with sound, using the muscles of their chests, throats, and mouths to manipulate air and make words?
There is no universal agreement on this question. Though every day we grow more adept at mapping the brain, those maps have yet to clearly reveal how language works. Part of the reason is that so many brain cells are crammed within such an exceedingly small space. Cognitive scientist Steven Pinker has pointed out that this could mean various parts of the brain dedicated to handling specific aspects of speech exist as small areas scattered all around the brain. “They might be irregularly shaped squiggles, like gerrymandered political districts. In different people, the regions might be pulled and stretched onto different bulges and folds of the brain.”
PET scanning and functional magnetic resonance imaging (fMRI) do find that as we process speech, areas throughout the brain “light up.” Of course, precisely what causes them to light up is far from clear, partly because language is itself so complex and interconnected that it’s difficult to isolate a single aspect of speech, and partly because the brain is so convoluted, twisted, and turned.
Speech requires brain cells to coordinate the movement of the lungs, throat, mouth, lips, and facial muscles. It demands that we listen, symbolize concepts from both the outside world and the chambers of our own minds, and then apply phonemes in rapid succession to place words in just the correct sequence to deliver a sentence. Saying simply “How are you?” takes immense skill and brainpower.
The Music of Language
Our vocal folds also control the intonation of our voices, the modulations, intensity, expression, and volume that give them personality and add subtle meaning to the words we utter. Linguists call this prosody, a kind of body language for speech. We express it in the growling, the snarling, the singing, or the cooing we fold around our words. In conversation we absorb it in the tones, the rhythm, and the speed of the voices we are listening to.17
Great orators and communicators use prosody to pass along the subtleties of what is on their minds with the tiniest inflection, the smallest modulation, or the sudden hammering of a word so that it rings in our ears and hearts like a bell. We all have this ability and we all use it to add new layers of power and meaning to what we say. It is another level of information that brushes our words with sympathy, doubt, confidence, anger, or sorrow. In this sense it is more closely associated with primal calls and cries, imbuing our voices with a kind of music. And the music we make is key in gaining and keeping the attention of others when we are speaking. In fact, the life we give to the words we speak, the emotions, color, weight, and speed we attach to them are central to the thing we call our personality, the cumulative impression we leave on others that distinguishes us from everyone else. It is central to who we are.
The emotional and musical side of prosody may explain why, unlike the words and syntax of language, it is processed in the right hemisphere of the brain rather than the so-called verbal left side.18 The left side specializes in the sequences of actions, but the right handles shape and space. So in some sense the right hemisphere of the brain must “see” the intonations of speech in terms of form and distance: imaginary objects that are close or far away, big or small, tall or flat, colorful or not. But even though different parts of the brain handle different aspects of speech, the content of words and the sound of words are not really separate. Together they combine to create the meaning that inextricably binds ideas to emotion.
This ability is a testament to the extremely interconnected nature of the brain. With the evolution of the neocortex, the brain needed to send out longer and longer neuronal tendrils to keep new sectors in touch with more ancient ones.
The basal ganglia, for example, comprise an ancient section of our brain, an area that in other animals is exclusively devoted to movement. In reptiles, other mammals, and humans, it influences postures that display dominance and submission or attract the attention of the opposite sex. In us it still handles movements as basic as the way we swing our arms when we walk. We don’t normally swing our arms forward and backward in unison because as far as our basal ganglia are concerned, we are still walking on all fours. It is a legacy of the way our ancestors moved before we stood up.
If a man puffs out his chest or a woman flutters her eyelids or you find yourself crossing your arms in a meeting when you hear something you don’t agree with, your basal ganglia are hard at work. These nerve fibers also affect the expressions of your face—surprise, anger, confusion, interest. In a singles bar, the basal ganglia must be near overheating from input and activity. If there’s not enough dopamine operating in your basal ganglia, you may shuffle and not swing your arms at all, an incipient warning of a disease such as Parkinson’s. On the other hand, too much dopamine has been linked to the physical tics in Tourette’s syndrome or obsessive-compulsive disorder.
The basal ganglia and other ancient and visceral parts of our brain are also wired into Broca’s area. Unlike Wernicke’s area, Broca’s doesn’t import information. Instead, it controls the expression of what is on our minds, even when we don’t always consciously realize that those things are on our minds. It surrounds and serves and affects parts of the brain that handle body language and the gestures we make when we tell a joke or the expressions on our faces when we are talking with our significant other.
But in human beings, perhaps the most important operation the basal ganglia perform is to help us do the physical work of shaping thoughts into words and sounds before we ever part with them. In Broca’s area we literally talk to ourselves, rapidly preprocessing all of the language before our basal ganglia pass the signals on to our throats, tongues, lips, and lungs to manufacture the thought-sounds we send into the world for others to hear.
It would be nice to know when and precisely how all of these capabilities fused, but we don’t. We have no Homo erectus brains to examine or scan, no early Homo sapiens we can test to learn if they spoke in full sentences or the grunts B movies have imagined they did. But we do know from the mirror neuron work that Rizzolatti and another of his colleagues, Michael Arbib, a biologist and computer scientist at the University of Southern California, have done that other primates possess areas of the brain—the F5 region, for example—that are located in roughly the same sector as Broca’s area in the human brain. In us these areas handle both speech and manual manipulation, which, unlike the uncontrolled cries and calls of wild animals, are things we do intentionally and very consciously.19
In humans these two areas sit, cerebrally speaking, cheek by jowl. In fact, in 1998, Rizzolatti and Arbib wrote, “This new use of vocalization [in speech] necessitated its skillful control, a requirement that could not be fulfilled by the ancient emotional vocalization centers. This new situation was most likely the ‘cause’ of the emergence of human Broca’s area.”
More recently Arbib has concluded that H. habilis and even more so H. erectus enjoyed the benefits of a “proto-Broca’s area” based on an F5-like precursor. This part of H. erectus’s brain might have handled communication that was partly manual and partly facial and oral. In time, Arbib suspects, this early version of Broca’s area gained primitive control of the vocal machinery as far as it had evolved. It began to play a puppeteer’s role, except in this case it made the pharynx dance, not the hands and fingers. And once that happened, language and the brain would have nudged one another’s evolution along, back and forth, as our ancestors experimented with gestures of sound rather than body, creating increasingly complex social situations that accelerated their own need to improve communication.
A crucial part of that progression would have required the development of increasingly refined control of the mechanics we use today to speak—not an inconsequential accomplishment.20 It doesn’t matter that gorillas and chimps have the mental capacity to symbolize some basic concepts (a little more than a hundred by last count). The problem for them is that no amount of coaxing or training allows them to do it with words, for the simple reason that they do not have the basic anatomical tools for it.
If we were somehow to discover William Shakespeare himself in the form of a great silverback gorilla wandering the misty forests of Rwanda, he would be entirely incapable of uttering a single line of Hamlet. But if Shakespeare had existed as Homo erectus, he might at least have managed a partial sentence complemented by several eloquent gestures.
The way Homo erectus and the early Homo sapiens who followed refined their ability to speak, as they slowly gained greater control over their lungs and voice boxes, lips, and tongues, may have resembled an evolutionary form of babbling. At birth, infants’ throats are virtually identical to our ape ancestors’. They have one direct passage to the lungs and another to the stomach. During the first three months of life this enables them to breathe and nurse without any risk of choking. But at about three months of age, the larynx descends in their throats and creates our uniquely human pharynx. That provides room enough for the tongue to move forward and backward and to form all the vowel sounds we can make as adults.
During the next several months infants begin to compare the sounds they can make with the language they hear around them. A feedback loop rolls into place, and out of this give-and-take, a vocabulary begins to emerge that is expressed within the framework of the grammar and syntax that seem to be hardwired into the human brain.21
This transformation takes place without anyone sitting down with the baby and pulling out a rule book that explains what words are or how grammar works. It is a natural process and it happens universally.
At eighteen months something especially remarkable takes place. All of the foundations for speech seem somehow to simultaneously and miraculously lock into place. After months of physical and cerebral development, the mechanical and neuronal engines are assembled, wired, and ready to roll. We understand our first words just before our first birthday and begin to say our first words shortly afterward, but it takes another six months or so before the next phase of the process gets under a full head of steam. The basics of grammar slip into place. We can say sentences that exhibit simple syntax, such as “Me want that.” And with that underlying template in hand, we begin to build our personal repertoire of symbolic noises—the things we call words—with astonishing rapidity.
During the ten years that pass between eighteen months of age and adolescence, children acquire an average of eleven new words a day, about forty thousand total, or one every two hours, a phenomenon that is repeated nowhere else in nature. This whole process is so powerful that it is as impossible to keep children from learning to speak as it is to prevent them from learning to walk. You would have to resort to extraordinarily cruel measures to rob any human of the gift of speech.
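The figures above hang together arithmetically, and a quick back-of-the-envelope sketch shows how (the ten-year span and the twenty-two-hour waking day are rounding assumptions, not claims about any particular child):

```python
# Back-of-the-envelope check of the vocabulary-growth figures quoted above.

years = 10            # roughly eighteen months of age to adolescence
words_per_day = 11    # average new words acquired per day

total_words = words_per_day * 365 * years
print(total_words)    # 40150 -- "about forty thousand total"

# Spread across a child's waking day (assume ~22 waking hours),
# that pace works out to about one new word every two hours.
waking_hours = 22
hours_per_word = waking_hours / words_per_day
print(hours_per_word)  # 2.0
```

The three numbers in the text (eleven a day, forty thousand total, one every two hours) are thus restatements of a single rate, not independent measurements.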
…
How we managed as a species to get past simply making noises to the true creation of language is one of the great mysteries of science. Arbib looks at it this way: Something as complicated as human language would have required a precursor to hang its hat on. That brings us back to manual gesture. The experiences of the deaf children in Managua provide some real-world support for this theory. The first, older generation of students brought with them the iconic gestures they had struggled with at home, pantomimes that helped them communicate basic information. But it was the younger children who conceived ways to break gestures down into bite-size, reusable symbols. That was key.
Arbib believes that ancestral hominids like Homo erectus may have used similar pantomimes that later provided a core, if inefficient, gestural vocabulary that could then be modified and broken down into a growing stock of signed words. In this way he imagines that our ancestors’ early iconic gestures would have evolved into the more efficient and complex signing that eventually enabled early humans to assign a symbolic sound gesture to an already existing symbolic hand gesture.
Or, according to Arbib, perhaps something slightly different happened. Maybe the first hominids, struggling to move from gestural language to speech, repeated vocally what the older children of Managua did with gestures: create a single word to represent a series of fairly complex actions. The sound “‘grooflook’ or ‘koomzash’ might have once encoded a complex description like ‘The alpha male has killed a big meat animal with long teeth and now the tribe has a chance to feast together. Yum, yum!’ or commands along the lines of ‘Take your spear and go around the other side of that animal and we will have a better chance together of being able to kill it.’”
If we combine Arbib’s theories with the experience of the children in Nicaragua, some interesting possibilities arise. Though that one word may say a lot, it is useful only in that single situation. It has no flexibility. So as words go, it doesn’t do its job any better than the early Nicaraguan sign for rolling downhill. After a while our ancestors would have found themselves buried in stand-alone superwords with limited use. To escape this fix they may then have broken down the superwords into conceptual chunks that could be reused and reassembled into different meanings depending on their context, precisely the way the younger Nicaraguan children did with their gestures. Arbib imagines that a tribe of protohumans might have agreed on a sound that stood for fire. Later, members of the tribe might have come up with additional sounds they agreed meant “burn” and “cook” and “meat.” Soon simple, efficient sentences could be communicated that had very different meanings depending on how they mixed up the vocabulary. “Fire cook meat,” or “Fire burn!” or “Burn (the) cook!”
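The combinatorial advantage behind Arbib’s argument can be made concrete with a toy count (the vocabulary here is invented for illustration, borrowing the book’s own examples): a stock of holistic superwords covers exactly one situation apiece, while the same number of reusable words recombines into many distinct utterances.

```python
from itertools import permutations

# Four frozen "superwords": each names one whole situation and nothing else.
superwords = ["grooflook", "koomzash", "firecookmeat", "fireburn"]
print(len(superwords))   # 4 expressible messages, one per superword

# Four reusable words: order and selection now carry meaning,
# so "fire cook meat" and "burn cook" are different utterances.
words = ["fire", "cook", "meat", "burn"]
utterances = set()
for length in (2, 3, 4):
    for combo in permutations(words, length):
        utterances.add(" ".join(combo))
print(len(utterances))   # 60 distinct two- to four-word strings
```

Four one-off symbols yield four messages; four reusable ones yield sixty word strings (not all of them sensible, of course), and every new word multiplies the total rather than merely adding to it. That multiplying payoff is what the younger Nicaraguan children stumbled onto when they broke pantomimes into reusable pieces.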