When the subjects were rewarded for discriminating coos by peak position, Japanese macaques performed better than control species (other macaques and vervets). But when the tests involved discriminating initial frequency, Japanese macaques were slower than other species. Moreover, in all these tests, Japanese macaques showed a right ear advantage, just as humans do in language tasks4.
In addition, like vervet grunts, coos all sound similar to human observers. As Cheney and Seyfarth point out, “the size of vocal repertoires—in primates or any other animal—cannot be assessed by the human ear alone.”
Some ethologists have concluded that the more intelligent primates can understand some human concepts and even learn a language based on gesture (though many linguists remain skeptical), even though they can't learn articulate speech. Although this is often blamed on the inadequacies of the primate vocal tract, we should keep in mind that parrots and mynahs, with their very different vocal apparatus, can produce an accurate imitation of human speech. Instead, the inability of apes to learn a human language seems to reside in the nature of the neural programming that underlies both primate call systems and human speech.
* * * *
The Perception of Speech
Human speech is not simple. Instructions to produce a word originate in the brain in the form of neural currents. These in turn produce movements of the vocal tract. The muscular movements generate sound waves, which travel to the ear of the listener, where the whole process is reversed—sound wave to movements of the eardrum to neural impulses.
Two factors complicate this process. First, the various events take place at vastly differing speeds. A nerve impulse moves at about 150 to 300 ft/sec; a sound wave at 1,100 ft/sec. The vocal tract, made up of flesh and bone, cannot move with unlimited speed and precision.
Secondly, the movements of the vocal tract overlap. In a word like “gab,” for instance, the gesture of the lips that forms “b” takes time to happen. As the lips close and the shape of the vocal tract changes, the sound of the vowel likewise changes.
The vocal tract resembles a woodwind such as a bassoon or saxophone. It is an inverted L-shaped tube (Figure 1) with a pump (the lungs) at the lower end to produce a moving current of air; the upper end acts as a filter. The vocal cords function like the reed in a wind instrument and we can further modify the sounds by changing the shape of the tube.
In Figure 1, A represents the oral cavity, B the nasal passages, and C the pharynx, the back wall of the throat. In all languages the lungs are the primary initiator of the necessary current of air, and the larynx and vocal cords act both as a sound source and a valve. Most speech activity takes place in the oral and pharyngeal cavities, while the nasal passages, when the velum (v) is open, act as a resonating chamber, imparting to sounds the quality known as nasalization. The right-angle bend in the vocal tract (quite different from the shallower bend in newborns and apes) fosters stable, highly perceptible vowel sounds. The movements of the tongue (t) and lips (which linguists refer to as gestures5) impede and direct the vibrating stream of air in various ways, which in turn modulate the sound produced.
* * * *
Figure 1. Vocal tract. Suggested by diagrams in Ladefoged, 1975, and elsewhere.
* * * *
Several types of modulation are possible. The air current can be blocked completely as happens when the lips are closed to produce a p or b-like sound. Or the channel can be narrowed to generate audible friction like that at the end of “hiss” or “hush” or in a German ach-sound. The tongue or lips can vibrate to produce a trill, or the tongue can be closed on only one side, allowing air to escape round the other as in an l-sound.
If we look at the speech act in more detail, it seems to take place in a series of stages that translate a sequence of distinct abstract units, somehow represented in the brain and nervous system of the speaker, into a continuous flow of speech, as in Figure 2. And speech can flow very fast indeed, an estimated fifteen phonemes per second or more. How do human beings achieve what Liberman calls “high-speed performance with low-speed machinery”?
* * * *
The optimum vocalization produced by the speech apparatus, the type which is most common in the world's languages, can be represented as CV(C), that is, an initial burst of noise (consonant), followed by a musical tone (vowel), which in some languages may be followed by a second burst of noise. In all languages, words are made up of sequences of one to five or more of these units.
To take examples from English: “cap,” “sap,” “map,” “apt,” “pack.” All of these but “pack” are written with three letters, and the linguist would agree with the layman's opinion that they all consist of three distinct vocal gestures. Nevertheless, the initial gestures of these words are not all perceived in the same way. The initial m- and s- of “map” and “sap,” a hum and a hiss, are clearly identifiable even without any vowel attached. But the p and ck of “cap” and “pack” are different.
* * * *
Figure 2.
* * * *
The gesture of the lips that produces the -p at the end of “cap” is visible to the listener, the -p is written with a character of its own, and we can hear the difference between “cap” and “cat,” so we naively assume that there is a discrete sound -p of the same order as the final consonants of “cab,” “can,” “cash,” “car.” But if we think about it, we will realize that the sound made by simply bringing the lips together is not likely to be audible to normal hearing. What then are we perceiving as -p? It is only within the last sixty years that linguists have been able to answer this question.
So far we have been talking about the way in which speech sounds are produced. But communication implies a second participant, the listener. Once the speech sounds have been generated, how does the listener read what the American linguist Charles Hockett6 calls the “continuous muddy signal” of speech and transform it back into information?
Late in the 1940s, the study of the physical aspect of speech improved dramatically with the invention of the acoustic spectrograph, which produces a physical representation of a speech sound. Figure 3 shows acoustic spectrograms of the words “bab” and “gag” pronounced at normal speed (fast). The vertical axis represents frequency in Hertz and the horizontal axis duration in milliseconds. The vibrations of sound waves show up on the spectrograph as vertical striations. At frequencies where energy is concentrated, the striations form dark bands called the formants of the vowel, analogous in some ways to the harmonics of a musical tone. In the figure, there is a formant roughly every thousand Hertz. The formants are not stationary, but change in ways that depend on the adjacent consonant.
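Today a spectrograph can be improvised in a few lines of software. The sketch below is only a rough illustration of the idea, not a reconstruction of Figure 3: the filename “bab.wav,” the 25-millisecond analysis window, and the 4000 Hz display ceiling are my own assumptions. It plots energy by time and frequency; the dark horizontal bands that emerge are the formants.

```python
# A minimal software spectrograph (illustrative sketch, assumed parameters).
import numpy as np
from scipy.io import wavfile
from scipy.signal import spectrogram
import matplotlib.pyplot as plt

rate, samples = wavfile.read("bab.wav")       # hypothetical recording of "bab"
if samples.ndim > 1:                          # mix stereo down to mono
    samples = samples.mean(axis=1)

# Short overlapping windows trade time resolution against frequency
# resolution, much as the filter bandwidth did in the analog instrument.
freqs, times, power = spectrogram(samples, fs=rate,
                                  window="hann",
                                  nperseg=int(0.025 * rate),   # 25 ms frames
                                  noverlap=int(0.020 * rate))  # 20 ms overlap

plt.pcolormesh(times * 1000, freqs, 10 * np.log10(power + 1e-12), shading="auto")
plt.ylim(0, 4000)             # the first few formants live below 4000 Hz
plt.xlabel("Duration (ms)")
plt.ylabel("Frequency (Hz)")
plt.show()
```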
The most important formants for perceiving speech are the first, second, and third. You can hear the second formant: if you whisper the vowels of the words “keyed,” “kid,” “Ked,” “cad,” “cod,” “cawed,” “could,” “cooed” in sequence, you will perceive a steady drop in pitch. The young Isaac Newton noticed this over three hundred years ago when he wrote in his 1665 notebook “The filling of a very deepe flaggon with a constant streame of beere or water sounds the vowels in this order w, u, v, o, a, e, i, y.” In Newton's example, of course, the vowels are in the opposite order and the pitch of the second formant is steadily rising.
* * * *
Figure 3. Acoustic spectrograms provided by Amanda Miller-Ockhuizen, used by permission.
* * * *
Some speech sounds are perceived in the same way as non-speech sounds: steady-state vowels, friction sounds like s and f. But vowels are seldom steady state at conversational rates of speed. Instead, the formants change rapidly as the vocal organs move from sound to sound. And surprisingly, these formant transitions are not particularly consistent: our auditory impression depends on the vowel, so that the same acoustic signal may be heard as p before the vowel i and k before the vowel a.
Vowels are resonant; other sounds such as fricatives may have characteristic bursts of noise. But many consonants are perceived primarily by their effect on a neighboring vowel. Look, for example, at the spectrogram of “bab” in Figure 3. As the lips open and close, the formants show a rise and fall in frequency, and in the figure you can clearly see the curve upwards and downwards in the second and third formants. On the other hand, in “gag,” the g's cause a narrowing of the distance between the second and third formants.
What this means is that we do not perceive speech sounds as a linear series of discrete symbols like Morse code. Instead, each sound contains information about other sounds in the syllable. Even though we can represent the abstract structure of a word as a string of phonemes, in the real acoustic signal these blur together into a single gestalt. Like Green's macaques, humans are predisposed to pay attention to certain features of the speech signal while ignoring others and perceive the vocalizations of their own species in a different way from other sounds.
If we record a message in Morse code, we can cut the tape between signals. If a word such as “gag” were a series of distinct sounds, as it appears to be from the spelling, we should be able to do the same thing. Suppose we record “gag” on audiotape, cut and splice the tape at some point, and replay it. We will discover that no matter how close we cut to the beginning of the word, we won't be able to isolate the initial consonant. Even if we erase from both beginning and end, subjects will still hear the word “gag” when the tape is played back, until so much is erased that only noise remains.
Conversely, if we hand-paint a spectrogram and play it back, we can produce an artificial vowel sound. Depending on how we curve the formants, subjects may report hearing a consonant before the vowel. A low beginning for the second formant suggests b, while a very high beginning will sound like g (what is meant here is the “hard” sound of g in “gab,” “goose,” “gull”).
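In the same spirit, the formant tracks can be “painted” in software rather than on film. The sketch below is only a crude additive-synthesis stand-in for the classic pattern-playback experiments; every frequency in it is an illustrative assumption, not a measured value. It writes two syllable-like sounds whose second formant starts low in one case and high in the other.

```python
# Hand-painted formant tracks rendered as sound (rough sketch, assumed values).
import numpy as np
from scipy.io import wavfile

rate = 16000
t = np.linspace(0, 0.3, int(rate * 0.3), endpoint=False)   # 300 ms syllable

def formant_track(start_hz, steady_hz, transition_s=0.05):
    """Frequency contour: a quick glide from start_hz to the steady vowel value."""
    track = np.full_like(t, steady_hz)
    glide = t < transition_s
    track[glide] = start_hz + (steady_hz - start_hz) * (t[glide] / transition_s)
    return track

def paint(f1_start, f2_start):
    """Additive synthesis from two painted formant tracks (F1 and F2)."""
    f1 = formant_track(f1_start, 700.0)     # steady F1 for an a-like vowel
    f2 = formant_track(f2_start, 1200.0)    # steady F2 for an a-like vowel
    phase1 = 2 * np.pi * np.cumsum(f1) / rate
    phase2 = 2 * np.pi * np.cumsum(f2) / rate
    return 0.6 * np.sin(phase1) + 0.4 * np.sin(phase2)

# Low F2 onset (rising transition) vs. very high F2 onset (falling transition).
wavfile.write("ba_like.wav", rate, (paint(200.0, 900.0) * 32767).astype(np.int16))
wavfile.write("ga_like.wav", rate, (paint(450.0, 1800.0) * 32767).astype(np.int16))
```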
As the experimental data makes clear, information about the sounds in a syllable is not localized, but is scattered throughout the syllable. We perceive a given vowel by the frequencies of its formants and perceive many neighboring consonants by means of changes in those formants. Would alien languages have this property?
I would guess that languages using the vocal channel will have five properties: audibility, salience, redundancy, high modulation, and coding.
To communicate, we must be heard. All vocal languages must have some sounds that function like vowels or resonants (sounds like m, n, l, etc.) and carry a great deal of acoustic energy. Sounds must also be salient; that is, stand out from their neighbors. Hissing and rasping sounds like s, the German ach-sound, and the clicks used in Khoisan languages are highly salient.
In a writing class, you lose points for being redundant, so it may seem strange that redundancy is a necessary design feature of language. In fact, SF writers are fond of inventing languages like Heinlein's Speedtalk (in “Gulf,” Astounding, Nov-Dec 1949, reprinted in Assignment in Eternity, Baen, 2000) that lack redundancy. But we live in a noisy universe and all natural languages have high redundancy (up to 50%) to ensure that at least part of the message gets through the noise.
For example, in English the expression of number is redundant. In “three dogs,” the noun “dog” is marked as plural even though that is already clear from the “three.” But in Hungarian, the plural suffix isn't used with nouns modified by a number greater than one (lány “girl,” lányok “girls,” két lány “two girls”). In English, the phrase “the dog eats” is marked as singular twice: by the absence of -s in the noun and the presence of -s in the verb.
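A toy demonstration of why this matters in a noisy universe (the 30 percent character loss and the doubled words are purely illustrative assumptions, not a model of any real language or channel): send the same message once as-is and once with built-in repetition, and see which survives.

```python
# Redundancy versus channel noise (toy illustration, assumed loss rate).
import random

random.seed(1)

def noisy(text, loss=0.3):
    """Crude channel noise: drop roughly `loss` of the characters at random."""
    return "".join(ch for ch in text if random.random() > loss)

message = "three dogs are at the door"
redundant = " ".join(word + " " + word for word in message.split())  # ~50% redundancy

print(noisy(message))    # fragments; often hard to reconstruct
print(noisy(redundant))  # each word gets two chances to arrive intact
```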
Redundancy is now generally called enhancement in the phonetic literature. For an example of phonetic redundancy in English, say the words “see” and “she” while looking in a mirror. You will see your lips purse slightly for “she” but not for “see.” This lip rounding isn't a necessary design feature of sh. Some languages have sh-sounds with spread lips. But it alters the sound slightly (as you will perceive if you try making a spread-lip sh) and helps English speakers to distinguish the two sounds.
Nor should we ignore the importance of the visual cue here. When watching a silent film, you've probably noticed that you can often lip-read even without special training, and visible speech gestures often supplement the auditory cues.
These three properties are fairly obvious. For the fourth and fifth properties, I can offer only indirect evidence. High modulation is a consequence of the fact that an unlimited amount of information (the open class of “meanings”) is being encoded in a very small set of symbols (the closed class of “sounds,” closed in the sense that new sounds can't be freely added to a language, although new meanings can be assigned to old words or newly invented words). The term “coding” is taken from the paper by Liberman et al. in the bibliography, where the authors argue that human sound systems are not ciphers like Morse code, but a more complex type of Gestalt encoding.
Although primate call systems are not languages in the human sense, it is suggestive and interesting that they seem to possess something like coding. That is, primates interpret their calls with the help of internal programming that is species-specific. This type of processing is not unique to language. For example, memories are not stored in one part of the brain, but scattered throughout in the manner of a hologram. It is also interesting that all early human writing systems used logograms7, which have a unique sign for each word, and that the more abstract alphabet was apparently only invented once in human history.
* * * *
Talking to the Stars
Much of the speculation on communication with aliens has seriously underestimated the difficulties. Programs like SETI search for radio signals as evidence of life. Based on an exchange of signals, we could begin with simple universal ideas and gradually learn to communicate complex concepts, or so it is said. Maybe so. Fred Hoyle's The Black Cloud assumes that a super-intelligent being could easily analyze our languages and learn to communicate with us. Again, maybe so. But the evidence we have doesn't bear this out.
Although apiculture has been practiced for thousands of years, it was only in the twentieth century that von Frisch realized that bee dances were a form of language. Primate call systems are larger and more complex than we used to think. Dolphins and whales exchange elaborate vocal signals, but no one is sure how much of this is communication or what it's about. In all these cases, we humans are studying species much less intelligent than ourselves. All attempts to understand animal communication systems have, in fact, been attended by great difficulty.
Am I saying that communication with aliens is impossible? By no means. Intelligence and technology will count for much. But I suspect we won't be able to sit down as we might with a human informant.
Like exobiology, extraterrestrial linguistics is a discipline without a subject and by necessity is 90% rank speculation. Let us try to temper the rankness as much as possible and very cautiously speculate about what we might actually encounter in a First Contact situation.
In this article, I am discussing only communication systems using sound as a channel, audible to human beings. Obviously, other channels are possible: witness gesture languages like ASL (American Sign Language). Possibilities that have been explored in science fiction include odors, radio waves, and electric signals. Even within the vocal channel, the ranges can be different. Cats and dogs can hear higher sounds than humans, and whales can perceive subsonic signals.
If the message is a modulated sound, it implies a signal and modulators, which in turn implies that the physical organs must include an acoustic generator, a resonant space, and some way of modifying the resulting sounds.
Human language is produced by organs which over thousands of millennia of evolution have been adapted from pre-existing structures. This implies two corollaries: first, that the association of human speech with the respiratory and alimentary systems is arbitrary, and second, that these pre-existing structures provide certain fixed parameters that may shape the evolving system. Evolution by modification of what is already there is common enough. The lungs and the fish's swim bladder are modifications of the same ancestral organ; some snakes have transmuted their saliva into poison; wings are modified arms or fingers.
This point becomes all the clearer when we consider the audio system of a radio or TV; without lungs, vocal cords, lips, tongue, or teeth, the vibrating diaphragm in a speaker can reproduce a wider range of sounds than the human vocal apparatus. In other words, the vocal apparatus of another intelligent species need not closely resemble that of human beings, need not possess close analogues of mouth, lips, teeth, tongue, need only have the basic components mentioned above: acoustic generator, one or more modulators, and perhaps a resonant space.
Whatever body parts make up the communication system of an intelligent animal, however, their physical structure will impose certain limitations, which in turn limit the type and range of sounds that can be produced. Take, for example, the human tongue. The tongue is a flexible muscle rooted to the floor of the mouth and back of the throat. The tongue functions as a sensory organ, as a food mixer, and as part of the speech mechanism. In the latter function, the tongue can be pushed forward or bunched up; these movements affect the quality of vowel sounds. The tongue can also be brought into contact with other parts of the mouth to produce various consonant sounds. The most common are t-like sounds made by touching the tip of the tongue to the teeth or roof of the mouth, and k-like sounds made by touching the back part of the tongue to the soft palate or even further back. It is easy enough to touch the lips with the tip of the tongue. Such linguo-labial sounds do not occur in European or Asian languages, but they are found in languages spoken in Vanuatu, such as Tangoa, where mata, “snake,” contrasts with mnata, “eye” (I'm using mn to represent a single sound, an m formed with the tip of the tongue on the upper lip).