The larynx is a small transducer that sits atop the trachea, with a top called the epiglottis that can flip closed to keep food or liquid from entering through the larynx and into the lungs, potentially causing great harm. Figure 21 shows just a glimpse of its complexity.*
One thing that every researcher into the evolution of speech agrees upon is the idea that our speech production evolved in tandem with our speech perception. Or, as Crelin puts it in his pioneering work, ‘there tends to be a precise match between the broadcast bandwidth and the tuning of perceptual acuity’. Or ‘the possession of articulate speech therefore implies that both production and perception are attuned to each other, so that parameters carrying the bulk of the speech information are optimized in both production and perception’. In other words, the ears and the mouth work well together because they have evolved together for several million years.
Figure 21: The larynx
Speech begins with air, which can create human sounds either when it flows into the mouth or when it is expelled from the mouth. The former are called ‘ingressive’ sounds and the latter ‘egressive’ sounds. English and other European languages use egressive sounds almost exclusively in normal speech. Ingressive sounds in these better-known languages are rare, usually found only in interjections, such as the sound of ‘huh’ when the air is sucked in. The place where the airflow for a sound begins is called the ‘initiator’. In all of the speech sounds of English, the lungs are the initiator; thus one says that all sounds of English are ‘pulmonic’ sounds. But there are two other major air initiators that many languages of the world use: the glottis (the opening in the larynx, for glottalic sounds) and the tongue (for lingual sounds). Neither of these is used in English.
To quote from my Language: The Cultural Tool:
[In] Tzeltal, Ch’ol and others, so-called ‘glottalised’ sounds – implosives and ejectives – are common.
When I began my linguistic career, in the mid-1970s, I went to live for several months among the Tzeltales of Chiapas, Mexico. One of my favourite phrases was c’uxc’ajc’al ‘it’s hot outside’, which contains three glottalised consonants (indicated in Tzeltal orthography by the apostrophe). To make these sounds, the glottis, the space between the two vocal cords in the larynx, must be closed, cutting off air from the lungs. If the entire larynx is then forced up at the same time that the tongue or lips cut off the flow of air out of the mouth, then pressure is created. When the tongue or lips then release the air out of the mouth, an explosive-like sound is produced. This type of sound, seen in the Tzeltal phrase above, is called an ‘ejective’. We can also produce the opposite of an ejective, known as an ‘implosive’ sound. To make an implosive, the larynx moves down instead of up, but everything else remains the same as for an ejective. This downward motion of the larynx produces the implosive – air suddenly rushes into the mouth. We do not have anything like these sounds in English.
I remember practising ejectives and implosives constantly for several days, since the Tzeltales I worked with use them both. They are interesting sounds – not only are they fun to make, but they extend the range of human speech beyond the strictly lung-produced sounds of European languages.
The glottis can be used to modify sounds in other ways. Again, from Language:
A different type of glottalised sound worth mentioning is produced by nearly, but not quite, closing the glottis and allowing lung air to barely flow out. This effect is what linguists call ‘creaky voice’. People often produce creaky voice involuntarily in the mornings after first arising, especially if their vocal cords are strained through yelling, drinking, or smoking. But in some languages, creaky voice sounds function as regular vowels.
Other sounds, in which the tongue is the initiator, are known as clicks. These are created by using the tongue to block the flow of air into or out of the mouth while pressure builds up behind the closure. As with sounds initiated by the lungs or the glottis, these lingual sounds can be egressive or ingressive, produced by closing off the airflow with the tip of the tongue while building pressure inward or outward with the back of the tongue. Clicks are found in a very small number of languages, all in Africa and almost all in languages of the Bantu family. I remember first hearing clicks in Miriam Makeba’s ‘click song’. Makeba’s native language was Xhosa, a Bantu language.
Figure 22: The International Phonetic Alphabet
A list of all the consonants that are produced with lung air is given in the portion of the International Phonetic Alphabet shown in Figure 22.
Consonants are different from vowels in several ways. Unlike vowels, consonants impede (rather than merely shape) the flow of air as it comes out of the mouth. The International Phonetic Alphabet (IPA) chart is recognised by scientists as the accepted way of representing human speech sounds. The rows of the chart are ‘modes’ of pronunciation. These modes include allowing the air to flow out of the nose, which produces nasal sounds like [m], [n] and [ŋ]. Other modes are ‘occlusives’ or ‘stops’ (air is completely blocked as it flows through the mouth), sounds such as [d], [t], [k] or [g]. And there are ‘fricatives’, where the airflow is not completely stopped but is impeded enough to cause friction – turbulent or sibilant sounds such as [s], [f] and [h].
The columns in the IPA chart are places of articulation. The chart starts on the left with sounds produced near the front of the mouth and moves further back towards the throat. The sounds [m] and [b] are ‘bilabials’. They are produced by blocking the flow of air at the lips, the upper and lower lips coming together to block the airflow completely. The sound [f] is a bit further back. It is produced by the lower lip touching the upper teeth, only partially impeding rather than completely blocking the flow of air. Then we get to sounds like [n], [t] and [d], where the tongue blocks the flow of air either just behind the teeth (as in Spanish) or at the small ridge (the alveolar ridge) on the hard palate (the roof of the mouth), not far behind the teeth (as in English).
We eventually get to the back of the mouth, where sounds like [k] and [g] are produced by the back of the tongue rising to close off air at the soft palate. In other languages, the sounds go back further. Arabic languages are known for their pharyngeal sounds, made by constricting the epiglottis or by retracting the tongue into the pharynx. The epiglottis is a piece of stretchy cartilage that comes down to cover the hole at the top of the larynx just in case food or liquid tries to get in. One should not talk with a full mouth; if the epiglottis is not at the ready this could be fatal. Humans, except human infants, are the only creatures that cannot eat and vocalise at the same time.
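Because the chart is just a two-dimensional classification, the idea is easy to capture outside the figure. The short sketch below (in Python, and not from the book) pairs a few of the consonants mentioned above with a place of articulation and a mode of pronunciation, then prints them back out row by row; the symbol set is deliberately tiny and merely illustrative.

```python
# A minimal sketch (not from the book) of the idea behind the consonant chart:
# each symbol is classified by a place of articulation (the chart's columns)
# and a mode of pronunciation (its rows). Only a handful of the sounds
# mentioned in the text are included; this is not the full IPA.
consonants = {
    "m": ("bilabial", "nasal"),
    "b": ("bilabial", "stop"),
    "f": ("labiodental", "fricative"),
    "n": ("alveolar", "nasal"),
    "t": ("alveolar", "stop"),
    "d": ("alveolar", "stop"),
    "s": ("alveolar", "fricative"),
    "k": ("velar", "stop"),
    "g": ("velar", "stop"),
    "ŋ": ("velar", "nasal"),
}

# Reading the "chart" back out: group symbols by mode, with places ordered
# front-to-back, just as the printed chart runs left to right.
places = ["bilabial", "labiodental", "alveolar", "velar"]
for mode in ("nasal", "stop", "fricative"):
    row = [sym for place in places
           for sym, (p, m) in consonants.items() if p == place and m == mode]
    print(mode, row)
```

A fuller treatment would of course also record voicing and the many places and modes not yet introduced here.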
What is crucial in the IPA charts is that the segments they list almost completely exhaust all the sounds that are used for human languages anywhere in the world. The phonetic elements therein are all easy (at least with a bit of practice) for humans to make. But the basal ganglia do favour habits, and so once we have mastered our native set of phonemes, it can be hard to get the ganglia out of their rut to learn the articulatory habits necessary for the speech of other languages.
But consonants do not speech make. Humans also need vowels. To take one example, the vowels of my dialect of English, Southern California, are as shown in Figure 23.
Just like the consonant chart, the vowel chart in Figure 23 is ‘iconic’. Its columns represent the front of the mouth to the back. The rows of the vowel chart indicate the relative height of the tongue when producing the vowel. The trapezoidal shape of the chart is intended to indicate, again iconically, that as the tongue lowers, the space in the mouth between the vowels shrinks.
California vowels, like all vowels, are target areas where the tongue raises or lowers to a specific region in the mouth. At the same time, as the tongue moves to the target area to raise or lower, the tongue muscles are either tense or relaxed (‘lax’). The lips can be rounded or flat. The tense vowel [i] is the vowel of the word ‘beet’. The lax vowel [ɪ], on the other hand, is the vowel heard in the word ‘bit’. In other words, ‘bit’ and ‘beet’ are identical except that the tongue muscles are tense in ‘beet’ and relaxed in ‘bit’. Another way of talking about ‘lax’ vs ‘tense’ vowels, one preferred by many linguists, is to refer to them as ‘Advanced Tongue Root’ (the tongue is tensed by being pushed forward in the mouth and flexing) or ‘Not Advanced Tongue Root’ (the tongue is relaxed, its root further back in the mouth), usually written in the linguistics literature as [+ATR] or [−ATR].
The funny-looking vowel character [æ] is the vowel in my dialect for ‘cat’. It is low, front and unrounded. Moving up the chart and towards the back of the mouth, we reach the sound [u], the vowel of the word ‘boot’. This is a back, rounded vowel. ‘Back’ in this sense means that the back portion of the tongue is raised, rather than the front portion (the blade or tip) as in the unrounded vowel [i]. The lips form a round ‘o’ shape when producing [u]. Any vowel can be rounded. Thus, to make the French vowel [y], make the English vowel [i] while rounding the lips.
Figure 23: Southern California English vowels
The point is that the various speech sounds available to all human languages are conceptually easy to understand. What is hard about them is not how to classify or even analyse them, but how to produce them. Humans can learn all the sounds they want when they are young and their basal ganglia are not yet in a rut. But as we get older, the ganglia are challenged to make new connections.
When I was enrolling for my first course in articulatory phonetics (in order to learn how to make all the sounds of speech used in all the languages of the world) at the University of Oklahoma in 1976, the teaching assistants for the course gave each student an individual interview in order to place them in sections according to phonetic ‘talent’ (or perceptual ability). I walked into the classroom used for this purpose and the first thing I was asked to do was to say the word ‘hello’ – but by sucking air into my lungs rather than expelling air outwards. ‘Weird,’ I thought. But I did it. Then I was asked to imitate a few words from Mayan languages with glottal ejectives. These are ‘popping’ sounds in which the air comes out of the mouth but originates above the lungs, with pressure built up by bringing the vocal folds together and then letting the air behind them ‘eject’ out of the mouth. And I tried imitating African click sounds. This course was going to be valuable to me, I knew, because I was preparing to go to the Amazon to conduct field research on a language that was still largely unknown to the outside world, Pirahã.
Again, every language in the world, from Armenian to Zapotec, uses the same inventory of articulatory movements and sounds. The reason for this is that the human auditory system co-evolved with the human articulatory system – that is, humans learned to hear best the sounds they are able to make. There are always outliers, of course, and there are still unexpected novelties to be discovered. In fact, I have personally discovered two sounds in the Amazon over the years (one in the Chapakuran language family, the other in Pirahã) not found in any other language of the world.
The field linguist needs to learn what sounds the human body can make and use for speech because she or he must be prepared to begin work from the very first minute they arrive at their field destination. They have to know what they are hearing in order to begin an analysis of the speech and language of the people they have gone to live with.
This brief introduction covers only part of one of the three branches of the science of phonetics: articulatory phonetics. But what happens to speech sounds once they exit the mouth? How are people able to distinguish them? Hearers are not usually able to look into the mouth of the person speaking to them, so how can they tell whether she or he is making a [p] or a [t], an [i] or an [a]?
This is the domain of acoustic phonetics. An immediate question about sound perception is this: if air is coming out of the mouth when one is talking, why are only the consonants and vowels heard, rather than the sound of air rushing out of the mouth? First, the larynx energises the air by the vibration of the vocal cords, or the oscillation of other parts of the larynx. This changes the frequency of the sound and brings it within the perceptible range for humans, because evolution has matched these frequencies to human ears. Second, the sounds of the air rushing out of the mouth have been tuned out by evolution, falling below the normal frequency ranges that the human auditory system can easily detect. That is a good thing. Otherwise people would sound like they’re wheezing instead of talking.
The energising of the flow of air in speech by the larynx is known as phonation, which produces for each sound what is known as the ‘fundamental frequency’ of the sound. The fundamental frequency is the rate of vibration of the vocal folds during phonation, and this varies according to the size, shape and bulk (fat) of the larynx. Small people will generally have higher voices, that is, higher fundamental frequencies, than larger people. Adults have deeper voices (lower fundamental frequencies) than children, men have lower voices than women, and taller people often have deeper voices than shorter people.
The fundamental frequency, usually written as F0, is one of the ways that people can recognise who is talking to them. We grow accustomed to others’ frequency range. The varying frequency of the vibration of the vocal cords is also how people sing and how they control the relative pitches over syllables in tone languages, such as Mandarin or Pirahã, among hundreds of others where the tone on the syllable is as important to the meaning of the word as the consonants and vowels. This ability to control frequency is also vital in producing and perceiving the relative pitches of entire phrases and sentences, referred to as intonation. F0 is also how some languages are whistled, using either the relative pitches on syllables or the inherent frequencies of individual speech sounds.
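For readers who want to see the fundamental frequency as a number rather than a metaphor, here is a minimal sketch, not from the book, of one common way to estimate F0 from a recording: autocorrelation. It assumes a mono WAV file of a sustained vowel; the file name and the 60–400 Hz search range are illustrative assumptions, not anything specified in the text.

```python
# A minimal sketch (not from the book) of estimating the fundamental frequency
# (F0) of a short, voiced stretch of speech by autocorrelation.
# Assumes a mono WAV file; "vowel.wav" is a hypothetical recording.
import numpy as np
from scipy.io import wavfile

rate, samples = wavfile.read("vowel.wav")      # hypothetical sustained-vowel recording
frame = samples[:2048].astype(float)           # one short analysis frame
frame -= frame.mean()                          # remove any DC offset

# Autocorrelation: the strongest peak after lag 0 marks one glottal period.
ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
min_lag = int(rate / 400)                      # ignore lags above ~400 Hz
max_lag = int(rate / 60)                       # ...and below ~60 Hz (typical speech range)
best_lag = min_lag + np.argmax(ac[min_lag:max_lag])
print(f"Estimated F0: {rate / best_lag:.1f} Hz")
```

Autocorrelation is only one of several standard pitch-tracking methods, but it makes the point: F0 is simply the rate at which the vocal folds repeat their cycle, and it can be read off a waveform.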
It may surprise no one to learn, however, that F0 is not all there is. In addition to the fundamental frequency, as each speech sound is produced, harmonic frequencies, or formants, are produced that are uniquely associated with that particular sound. These formants enable us to distinguish the different consonants and vowels of our native language. One does not directly hear the syllable [dad], for example. What is heard are the formants, and their changes, associated with these sounds.
A formant can be visualised by hitting a tuning fork that produces the note ‘E’ and placing it on the face of an acoustic guitar near the sound hole. If the guitar is tuned properly, the ‘E’ string of the same octave as the tuning fork will vibrate or resonate with the fork’s vibrations. This resonance is responsible for the different harmonics or formants of each speech sound. These formants can be seen in a spectrogram, with each formant at a particular multiple of the fundamental frequency of the sound (Figure 24).
Figure 24: Vowel spectrogram
In this spectrogram of four vowels the fundamental frequency is visible at the bottom and dark bands are visible going up the columns. Each band, associated with a frequency on the left side of the spectrogram, is a harmonic resonance or formant of the relevant vowel. Going left to right across the bottom, the time elapsed in the production of the sounds is measured. The darkness of the bands indicates the relative loudness of the sound produced. It is the formants that are the ‘fingerprints’ of all speech sounds. Human ears have evolved to hear just these sounds, picking out the formants which reflect the physical composition of our vocal tract. The formants, from low to high frequency, are simply referred to as F1, F2, F3 and so on. They are caused by effects of resonators such as the shape of the tongue, rounding of the lips and other aspects of the sound’s articulation.
The formant frequencies of the vowels are seen in the spectrogram (given in hertz (Hz)). What is amazing is not only that we hear these frequency distinctions between speech sounds, but that we do so without knowing what we are doing, even though we produce and perceive these formants so unerringly. It is the kind of tacit knowledge that often leads linguists to suppose that these abilities are congenital rather than learned. And certainly some aspects of this are inborn. Human mouths and ears are a matched set, thanks to natural selection.
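A spectrogram like the one in Figure 24 is straightforward to produce with standard signal-processing tools. The sketch below, not from the book, computes and plots one; the recording name is hypothetical, and the 4 kHz display ceiling is simply a convenient choice for viewing vowel formants.

```python
# A minimal sketch (not from the book) of computing a spectrogram like Figure 24,
# so the dark horizontal bands (formants) can be inspected numerically.
# Assumes a mono WAV recording; "four_vowels.wav" is a hypothetical file name.
import numpy as np
import matplotlib.pyplot as plt
from scipy.io import wavfile
from scipy.signal import spectrogram

rate, samples = wavfile.read("four_vowels.wav")   # hypothetical recording of four vowels
freqs, times, power = spectrogram(samples.astype(float), fs=rate,
                                  nperseg=512, noverlap=384)

# Plot time left-to-right and frequency bottom-to-top; darker means louder,
# mirroring how the bands are read off the printed spectrogram.
plt.pcolormesh(times, freqs, 10 * np.log10(power + 1e-12), cmap="Greys")
plt.ylim(0, 4000)                                 # vowel formants lie mostly below ~4 kHz
plt.xlabel("Time (s)")
plt.ylabel("Frequency (Hz)")
plt.show()
```

The dark horizontal bands in the resulting plot are the formants F1, F2, F3 and so on – exactly the ‘fingerprints’ described above.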
There is too little understood about how sounds are interpreted physiologically by our ears and brains to support a detailed discussion of auditory phonetics, the physiology of hearing. But the acoustics and articulation of sounds are quite enough to prime a discussion of how these abilities evolved.
If it is correct to say that language preceded speech, then it would be expected that Homo erectus, having invented symbols and come up with a G1 language, would still not have possessed top-of-the-line human speech capabilities. And they did not. Their larynges were more ape-like than human-like. In fact, although neanderthalensis had relatively modern larynges, erectus lagged far behind.
The main differences between the erectus vocal apparatus and the sapiens apparatus were in the hyoid bone and pre-Homo vestiges, such as air sacs in the centre of the larynx. Tecumseh Fitch was one of the first biologists to point out the relevance of air sacs to human vocalisation. Their effect would have been to render many of the sounds emitted less clear than they are in sapiens. The evidence that they had air sacs is based on luck in finding fossils of erectus hyoid bones. The hyoid bone sits above the larynx and anchors it via tissue and muscle connections. By contracting and relaxing the muscles connecting the larynx to the hyoid bone, humans are able to raise and lower the larynx, altering the F0 and other aspects of speech. In the hyoid bones of erectus, on the other hand, though not in any fossil Homo more recent than erectus, there are no places of attachment to anchor the larynx. And these are not the only differences. So different are the vocal apparatuses of erectus and sapiens that Crelin concludes: ‘I judge that the vocal tract was basically apelike.’ Or, as others say: