The authors describe a hyoid bone body, without horns, attributed to Homo erectus from Castel di Guido (Rome, Italy), dated to about 400,000 years BP. The hyoid bone body shows the bar-shaped morphology characteristic of Homo, in contrast to the bulla-shaped body morphology of African apes and Australopithecus. Its measurements differ from those of the only known complete specimens from another extinct human species and an early hominid (the Kebara Neanderthal and Australopithecus afarensis) and from the mean values observed in modern humans. The almost total absence of muscular impressions on the body’s ventral surface suggests a reduced capability for elevating this hyoid bone and modulating the length of the vocal tract in Homo erectus. The shield-shaped body, the probable small size of the greater horns and the radiographic image appear to be archaic characteristics; they reveal some similarities to non-human and pre-human genera, suggesting that the morphological basis for human speech did not arise in Homo erectus.4
There is no way, therefore, that erectus could have produced speech of the same kind or quality as modern humans, in terms of the ability to clearly discriminate the same range of speech sounds in perception or production. None of this means that erectus would have been incapable of language, however. Erectus had sufficient memory to retain a large number of symbols, at least in the thousands – after all, dogs can remember hundreds – and would have been able, using context and culture, to disambiguate symbols that were insufficiently distinct in their formants, given the lesser articulatory capacity of erectus. What is expected, however, is that the new dependency on language would have created a Baldwin effect, such that natural selection would have preferred Homo offspring with greater speech production and perception abilities, both in the vocal apparatus and in the various control centres of the brain. Eventually, humans went from erectus’s low-quality speech to their current high-fidelity speech.
How large an inventory of consonants, vowels, intonation patterns and gestures does a language need to ensure that it has the right ‘carrying capacity’ for all the meanings its speakers want to communicate? Language can be thought of in many ways. One way to view it, though, is as a system for matching up meanings to forms and knowledge in such a way that hearers can understand speakers.
If it were known with certainty that Homo erectus and Homo neanderthalensis were incapable of making the full range of sounds of anatomically modern humans, would this mean that they could not have had languages as rich as sapiens? It is hard to say. It is almost certain that sapiens are better at speech than erectus and other hominins that preceded sapiens. There are innumerable benefits and advantages to being the proud possessor of a modern speech apparatus. It makes speech easier to understand. But sapiens’ souped-up vocal tract is not necessary to either speech or language. It is just very, very good to have. Like having a nice travel trailer and a powerful 4x4 pickup instead of a covered wagon pulled by two mules.
In fact, computers show that a language can work just fine with only two symbols, 0 and 1. All computers communicate by means of those two symbols: current on – 1 – and current off – 0. All the novels, treatises, PhD dissertations, love letters and so on in the history of the world can – with many deficiencies, such as the absence of gestures, intonation and information about salient aspects of sentences – be translated into sequences of 0 and 1. So, if erectus could have made just a few sounds, more or less consistently, they could have been in the language game, right there with sapiens. This is why linguists recognise that language is distinct from speech. Sapiens quite possibly speak more clearly, with sounds that are easier to hear. But this only means, again, that erectus drove a Model T language. Sapiens drive the Tesla. But both the Model T and the Tesla are cars. The Model T is not a ‘protocar’.
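To make the point concrete, here is a minimal sketch (in Python, with illustrative function names of my own) of a two-symbol code at work: a sentence is rendered as a string of 0s and 1s via its UTF-8 byte values, then recovered losslessly.

```python
# A minimal sketch of the point above: any text can be carried by just
# two symbols. We encode a sentence as 0s and 1s (via its UTF-8 byte
# values) and recover it without loss.

def to_bits(text: str) -> str:
    """Render text as a string of 0s and 1s."""
    return "".join(f"{byte:08b}" for byte in text.encode("utf-8"))

def from_bits(bits: str) -> str:
    """Recover the original text from the 0/1 string."""
    data = bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    return data.decode("utf-8")

message = "All the novels in the world"
bits = to_bits(message)
print(bits[:32], "...")            # 01000001011011000110110000100000 ...
assert from_bits(bits) == message  # two symbols suffice
```

Of course, the bits carry only the written words; the gestures and intonation mentioned above are exactly what such a translation loses.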
Though hard to reconstruct from fossil records, the human vocal tract, like the human brain, also evolved dramatically from earlier hominids to modern sapiens. But in order to tell this part of the story, it is necessary to back up a bit and talk about the sounds that modern humans use in their languages. This is the end point of evolution and the starting point of any discussion of modern human speech sounds.
The evolutionary questions that lie beneath the surface of all linguistics field research are ‘How did humans come to make the range of sounds that are found in the languages of the world today?’ and, next, ‘What are these sounds?’
The sounds that the human vocal apparatus produces are all formed from the same ingredients.
The technical label for any speech sound in any language tells how that sound is articulated. The consonant [p], for instance, is known as a ‘voiceless bilabial occlusive (also called a “stop”) with egressive lung air’. This long but very helpful description means that the [p] sound in the word ‘spa’, to take one example, is pronounced by relaxing the vocal cords so that they do not vibrate. The sound is therefore ‘voiceless’. (The sound [b] is pronounced exactly like [p] except that in [b] the vocal folds – also called cords – are tensed and vibrating, rendering [b] a ‘voiced’ sound.) The phrase ‘egressive lung air’ means that the air is flowing out of the mouth or nose or both and that it originated in the lungs. This needs to be stated because not all speech sounds use lung air. The term ‘occlusive’, or ‘stop’, means that the airflow is blocked entirely, albeit momentarily. The term ‘bilabial’ refers to the action of the upper and lower lips together. In conjunction with the term ‘occlusive’, ‘bilabial’ means that the airflow is blocked entirely by the lips. If one pronounces the sounds in the hypothetical word [apa] while lightly holding an index finger against the ‘Adam’s apple’ (which is actually the larynx), the vibration of the vocal cords can be felt to cease from the first [a] to the [p] and then start again on the second [a]. But if the same procedure is followed for the hypothetical word [aba], the vocal cords will continue to vibrate for each of [a], [b] and [a], that is, for the entire duration of the word.
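For readers who like to see the scheme laid out, here is a small sketch of how such labels decompose into independent ingredients – voicing, place and manner. The feature names and the three-sound inventory are illustrative only, not a complete phonological analysis.

```python
# A sketch of how the technical labels above decompose into features.
# The inventory here is deliberately tiny and illustrative.

PHONES = {
    "p": {"voicing": "voiceless", "place": "bilabial", "manner": "occlusive (stop)"},
    "b": {"voicing": "voiced",    "place": "bilabial", "manner": "occlusive (stop)"},
    "s": {"voicing": "voiceless", "place": "alveolar", "manner": "fricative"},
}

def describe(phone: str) -> str:
    f = PHONES[phone]
    # All three example sounds use outward-flowing lung air.
    return f"[{phone}]: {f['voicing']} {f['place']} {f['manner']} with egressive lung air"

print(describe("p"))  # [p]: voiceless bilabial occlusive (stop) with egressive lung air
print(describe("b"))  # differs from [p] only in voicing
```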
Though there are hundreds of sounds in the world’s 7,000+ languages, they are all named and produced according to these procedures. And what is even more important, these few simple procedures, using parts of the body that evolved independently of language – the teeth, tongue, larynx, lungs and nasal cavity – are sufficient to say anything that can be said in any language on the planet. Very exciting stuff.
Humans can, of course, bypass speech altogether and communicate with sign languages or written languages. Human modes of communication, whether writing, sign languages or spoken languages, engage one or both of two distinct channels – the aural-oral and the manual-visual. In modern human languages, both channels are engaged from start to finish. This is essential in human language, where gestures, grammar and meaning are combined in every utterance humans make. There are other ways to manifest language, of course. Humans can communicate using coloured flags, smoke signals, Morse code, typed letters, chicken entrails and other visual means. But, funnily enough, no one expects to find a community that communicates exclusively by writing or smoke signals unless they have some sort of shared physical challenge or are all cooperating with some who do.
One question worth asking is whether there is anything special about human speech or whether it is just composed of easy-to-make noises.5 Would other noises work as well for human speech?
As Philip Lieberman has pointed out, one alternative to human speech sounds is Morse code.6 The fastest speed a code operator can achieve is about fifty words per minute, which comes to roughly 250 letters per minute. Operators working this quickly, however, need to rest frequently and can barely remember what they have transcribed. Yet a hungover college student can easily follow a lecture given at the rate of 150 words per minute. And humans can produce speech sounds at a peak rate of roughly twenty-five per second – several times the symbol rate of the best Morse operator.
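The arithmetic behind this comparison is worth making explicit. A rough sketch, using the figures cited above plus the conventional assumption of five letters (or, for speech, roughly five phones) per word:

```python
# Back-of-the-envelope comparison of the rates cited above. The
# five-letters-per-word figure is the usual telegraphy convention;
# five phones per word is a rough assumption for English.

morse_wpm = 50                # top sustained speed for an expert operator
letters_per_word = 5
morse_letters_per_sec = morse_wpm * letters_per_word / 60    # ~4.2 letters/s

speech_wpm = 150              # an ordinary lecture
phones_per_word = 5           # rough assumption
speech_phones_per_sec = speech_wpm * phones_per_word / 60    # ~12.5 phones/s

print(f"Morse: ~{morse_letters_per_sec:.1f} letters/s")
print(f"Lecture speech: ~{speech_phones_per_sec:.1f} phones/s")
# Peak speech production runs to roughly 25 phones/s -- several times
# what the fastest Morse operator can sustain, and without fatiguing
# the listener.
```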
Speech also works by structuring the sounds we make. The main structure in the speech stream is the syllable. Syllables are used to organise phonemes into groups that follow a few highly specific patterns across the world’s languages.7 The most common patterns are things like Consonant (C) + Vowel (V); C+C+V; C+V+C; and, at the outer limit, C+C+C+V+C+C+C (three consonants on either side of the vowel pushes the upper bound of the largest syllables observed in the world’s languages). English provides an example of complex syllable structure in words like strength, s-t-r-e-n-g-th, which illustrates the pattern C+C+C+V+C+C+C (with ‘th’ representing a single sound). But what I find interesting is that in the majority of languages C+V is either the only syllable type or by far the most common. With the organisational and mnemonic help of syllables, together with our neural evolution and our contingency judgements – based on significant exposure to our native language – we are able to parse speech sounds and words far faster than other sounds.
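A sketch of the idea as code: reducing a sequence of phonemes to its C/V template. The tiny vowel inventory is illustrative only, and ‘th’ and the diphthong ‘ei’ each count as a single sound, as in the text.

```python
# Reduce a phoneme sequence to its syllable template. Each list entry
# is one phoneme; 'th' and 'ei' are single sounds.

VOWELS = {"a", "e", "i", "o", "u", "ei"}

def cv_pattern(phonemes: list[str]) -> str:
    return "+".join("V" if p in VOWELS else "C" for p in phonemes)

print(cv_pattern(["s", "t", "r", "e", "n", "g", "th"]))  # C+C+C+V+C+C+C
print(cv_pattern(["m", "ei", "t"]))                      # C+V+C ('mate')
```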
Suppose you want to say, ‘Piss off, mate!’ How do you get those sounds out of your mouth on their way to someone’s ear? There are three syllables, five consonants and three vowels in these three words, based on the actual spoken words rather than the written words using the English alphabet. The sounds are, technically, [pʰ], [ɪ], [s], [ɔ], [f], [m], [eɪ] and [t]. The syllables are [pʰɪs], [ɔf] and [meɪt] and so, unusually for English, each word of this insult is also a single syllable.
Sign languages also have much to teach us about our cognitive-cerebral platform. Native users of sign languages can communicate as quickly and effectively as speakers using the vocal apparatus. So our brain development cannot be tied to speech sounds so tightly as to render all other modalities or channels of language unavailable. It seems unlikely that every human being comes equipped by evolution with separate neuronal networks, one for sign languages and another for spoken languages. It is more parsimonious to assume instead that our brains are equipped to process signals of different modalities and that our hands and mouths provide the easiest ones. Sign languages, by the way, also show evidence for syllable-like groupings of gestures, so we know that we are predisposed to such structure, in the sense that our minds quickly latch on to syllable-like groupings as a way of better processing parts of signs. Regardless of other modalities, though, the fact remains that vocal speech is the channel used exclusively by the vast majority of people. And this is interesting, because in this fact we see evidence that evolution has altered human physiology for speech.
Vocally, human infants begin life much as other primates do. The child’s vocal tract anatomy above the larynx (the supralaryngeal vocal tract, or SVT) is very much like the anatomy of the corresponding tract in chimps. When human newborns breathe, their larynx rises to lock into the passage leading to the nose (the nasopharyngeal passage). This seals off the trachea from the flow of mother’s milk or anything else in the newborn’s mouth. Thus human babies can eat and breathe without choking, just like a chimp.
Adults lose this advantage. As they mature, their vocal tract elongates. Their mouths get shorter while the pharynx (the section of the throat immediately behind the mouth and above the larynx, trachea and oesophagus) gets longer. Consequently, the adult larynx does not rise as high relative to the mouth and so is left exposed to food or drink falling on it. As noted earlier, if food or drink enters the trachea, a person can choke to death. It is necessary, therefore, to carefully coordinate the tongue, the larynx, a small flap called the epiglottis and the oesophageal sphincter (the round muscle in the food pipe) to avoid choking while eating. One thing people take care to avoid is talking with their mouths full. Talking and eating at the same time can kill or cause severe discomfort. Humans seem to have lost an advantage possessed by chimps and newborn humans.
But the news is not all bad. Although the full inventory of changes to the human vocal apparatus is too large and technical to discuss here, the final result of these developments enables us to talk more clearly than Homo erectus. This is because we can make a larger array of speech sounds, especially vowels, like the supervowels ‘i’, ‘a’ and ‘u’, which are found in all languages of the world. These are the easiest vowels to perceive. We’re the only species that can make them well. Moreover, the vowel ‘i’ is of special interest. It enables the hearer to judge the length of the speaker’s vocal tract and thus determine the relative size as well as the gender of the speaker and to ‘normalise’ the expectations for recognising that speaker’s voice.
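Why a single vowel can reveal vocal tract length is standard acoustics: idealising the tract as a uniform tube closed at the glottis, its resonances fall at odd multiples of c/4L, so a formant frequency implies a tube length. The sketch below uses that textbook approximation; it models a neutral vowel rather than [i] specifically, but it shows the principle behind ‘normalising’ a speaker’s voice.

```python
# Uniform-tube idealisation of the vocal tract, closed at the glottis:
# resonances fall at F_n = (2n - 1) * c / (4 * L). Inverting the first
# resonance gives a rough tract length.

SPEED_OF_SOUND = 350.0  # m/s in the warm, humid air of the vocal tract

def tract_length_from_f1(f1_hz: float) -> float:
    """Estimate tube length (in cm) from the first resonance."""
    return SPEED_OF_SOUND / (4 * f1_hz) * 100

print(f"{tract_length_from_f1(500):.1f} cm")  # ~17.5 cm: typical adult male tract
print(f"{tract_length_from_f1(600):.1f} cm")  # ~14.6 cm: a shorter tract
```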
This evolutionary development of the vocal apparatus gives more options in the production of speech sounds, a production that begins with the lungs. The human lungs are to the vocal apparatus as a bottle of helium is to a carnival balloon. The mouth is like the airpiece. As air is released, the pitch of the escaping air sound can be manipulated by relaxing the piece, widening or narrowing the hole by which the air hisses out, cutting the air off intermittently and even ‘jiggling’ the balloon as the air is expelled.
But if human mouths and noses are like the balloon’s airpiece, they also have more moving parts and more twists and chambers for the air to pass through than a balloon has. So people can make many more sounds than a balloon can. And since human ears and their inner workings have co-evolved with the human sound-making system, it isn’t surprising that humans have evolved to produce, and to be especially sensitive to, the relatively narrow set of sounds that are used in speech.
According to evolutionary research, the larynges of all land animals evolved from the same source – the lung valves of ancient fish, as seen in particular in Protopterus, Neoceratodus and Lepidosiren. Fish, in other words, gave us speech as we know it. The two slits in this archaic fish valve functioned to prevent water from entering the lungs of the fish. To this simple muscular mechanism, evolution added cartilage and tinkered a bit more to allow for mammalian breathing and the process of phonation. The resulting vocal cords are therefore actually a complex set of muscles. They were first called cordes by the French anatomist Antoine Ferrein, who conceived of the vocal apparatus as a musical instrument.8
What is complicated, on the other hand, is the control of this device. Humans do not play their vocal apparatuses by hand. They control each move of hundreds of muscles, from the diaphragm to the tongue to the opening of the naso-pharyngeal passage, with their brains. Just as the shape of the vocal apparatus has changed over the millennia to produce more discernible speech, matching more effectively the nuances of the language that speakers have in their heads, so the brain evolved connections to control the vocal apparatus.
Humans must be able to control their breathing effectively to produce speech. Whereas breathing involves inspiration and expiration, speech is almost exclusively expiration. This requires control of the flow of air and the regulation of air-pressure from the lungs through the vocal folds. Speech requires the ability to keep producing speech sounds even after the point of ‘quiet breathing’ (wherein air is not forced out of the lungs in exhalation by normal muscle action, but allowed to seep out of the lungs passively). This control enables people to speak in long sentences, with the attendant production not only of individual speech sounds, such as vowels and consonants, but also of the pitch and modulation of loudness and duration of segments and phrases.
It is obvious that the brain has a tight link to vocal production, because electrical stimulation of parts of the brain can produce articulatory movements and some examples of phonation (vowel sounds in particular). Other primate brains respond differently: stimulation of the regions corresponding to Brodmann Area 44 in other primates produces face, tongue and vocal cord movements, but not the phonation that similar stimulation produces in humans.
To state the obvious, chimpanzees are unable to talk. But this is not, as some claim, because of their vocal tract. A chimp’s vocal tract certainly could produce enough distinct sounds to support a language of some sort. Chimps do not talk, rather, because of their brains – they are not intelligent enough to support the kind of grammars that humans use, and they are not able to control their vocal tracts finely enough for speech production. Lieberman locates the main controllers of speech in the basal ganglia, what he and others refer to as our reptilian brain. The basal ganglia are, again, responsible for habit-like behaviours, among other things. Disruption of the neural circuits linking the basal ganglia to the cortex can result in disorders such as obsessive-compulsive disorder, schizophrenia and Parkinson’s disease. The basal ganglia are implicated in motor control, aspects of cognition, attention and several other aspects of human behaviour.
Therefore, in conjunction with the evolved form of the FOXP2 gene, which allows for better control of the vocal apparatus and for mental processing of the kind used in modern humans’ language, the evolution of connections between the basal ganglia and the larger human cerebral cortex is essential to support human speech (or sign language). Recognising these changes helps us to see that human language and speech are part of a continuum observed across several other species. It is not that there is any special gene for language, or that an unbridgeable gap appeared suddenly to provide humans with language and speech. Rather, what the evolutionary record shows is that the language gap was formed over millions of years by baby steps. At the same time, erectus is a fine example of how early the language threshold was crossed – of how changes in the brain and in human intelligence were able to offer up human language even with ape-like speech capabilities. Homo erectus is the evidence that apes could talk if they had brains large enough. Humans are those apes.
* For those interested in the history of studies of human speech: such studies go back centuries, but the modern investigation of the physiology and anatomy of human speech is perhaps best exemplified in a book by Edmund S. Crelin of the Yale University School of Medicine, entitled The Human Vocal Tract: Anatomy, Function, Development, and Evolution (New York: Vantage Press, 1987). It contains hundreds of drawings and photographs, not only of the modern human vocal apparatus but also of the relevant sections of fossils of early humans, as well as technical discussions of each.
Part Three
The Evolution of Language Form