The beauty of this field is that you can actually do experiments: you can give such vocabularies to real-life subjects—not just simulated cyber-agents, but flesh and blood people—and get them to perform varying types of communicative tasks with these limited resources. Of course you can’t actually replicate Stone Age languages—we’re humans and they weren’t—but you’ll at least establish upper limits on what could have been produced, and I’m willing to bet that in the process, facts we never knew about language will emerge.
The only person I know of who’s doing this kind of research is Jill Bowie, who’s just finished her doctoral dissertation at the University of Reading, England. So far, she has presented people with a fifty-word vocabulary and a Survivor-type scenario, and required them to communicate using only that vocabulary. Here’s just one example: one of her subjects, needing to express “In the forest, Fred was bitten by an enormous snake,” produced “Many many tree (CIRCLES HANDS IN AIR TWICE)/Fred/snake/big snake/big big snake/(TAPS LEG).” The mix of modalities is particularly impressive, rendering moot all those arguments about whether language was originally signed or spoken or mimed. Answer: all of the above.
More such experiments should be carried out, varying both the content of the vocabulary and its size. Which words, and how many of them, would be needed for significant communication in the different areas of human activity—child care, social intrigue, toolmaking, exchanging gossip? It would be possible to test empirically what I’ve claimed here, that none of these activities could have triggered language, either because words required by them are too abstract to be plausible as early inventions, or because too many words would have had to be invented before any useful or interesting messages could be exchanged. Since those same activities, though highly implausible as prime causes, could still have contributed to the expansion of protolanguage once it was up and running, the likely kind and extent of their contribution might also be assessed.
Possible pitfalls abound, of course. One species, no matter how smart, can never think its way back into the skin of another. Facile interpretations, slick just-so stories—these and more will bedevil the inquiry. The belief that sheer contingency rules wherever humans are concerned, making reconstruction futile, threatens the work from an opposite direction, and must also be resisted. That may be a valid belief, but we’ll never know if we haven’t tested it. And if we can unravel what happened in the first few seconds of the universe, discovering what happened at the beginning of protolanguage should not lie altogether beyond human powers.
CONNECTION AGAIN
But what the foregoing assumes is that from a very early stage words could be connected with one another to form more complex messages. Is that a reasonable assumption? Granted that apes and young children learn how to do this without explicit instruction, we must still bear in mind that both have numberless models—parents and trainers—to show them that it’s possible. And as we have seen, no ACS can properly combine any of its units. Aren’t we assuming too much here?
I don’t think so. The reason ACS units can’t combine does not arise from any mysterious constraint on combinability. It arises from two simple facts: ACS signs are complete in themselves, and combining them would make no kind of sense.
For words, these conditions are reversed. Words are seldom complete in themselves, and in isolation may mean a variety of things. A man driving on a country road met a female driver coming in the opposite direction, and she shouted something at him, of which he heard only the word “Pig!” Not unnaturally, he thought he was dealing with some rabid feminist obsessed by male chauvinism, and was forced to brake violently when, around the next bend, he saw an enormous porker lying in the middle of the road. While imperatives like “Stop!” or “Run!” may be unambiguous, most words acquire precise meaning only when coupled with other words.
Animals can combine actions into a series, so why not words, once they can acquire them? Since the simplest way to combine things is to string them together (a process Chomsky dismissed as logically impossible, remember?) we can safely assume that that was how protolanguage did it. It’s almost certainly the way apes and pidgin speakers do it. And if I’m right in what I conjectured in chapter 10—that concepts not only have to be established but must be neurally linked to one another before serious thought or language can appear—it’s necessarily the case that apes and hominids should do it that way. (Pidgin speakers, of course, resort to beads-on-a-string linkage not for this reason but because pidgin, unlike their native languages, has no automated system for creating hierarchical structures.) So protolanguage, after perhaps a million years or more, would have looked something like a pidgin, with words not significantly different in meaning from modern words.
Of course they might not have sounded at all like modern words. Modern words form large vocabularies and have complex sound structures. These facts are interconnected. The more words you have, the harder it becomes to distinguish one from another. Consequently, a trade-off is involved between speech sounds and word length: the fewer the sounds your language uses, the longer the words must be if they’re to be distinguished from one another.
Modern languages make this trade-off in different ways. They have a range of distinctive speech sounds that stretches from 11 (Rotokas, spoken in Papua New Guinea) to 112 (!xoo, spoken in Botswana). But note that what actually happens in given languages provides inadequate clues to the full biological capacity of humans. A Rotokan baby, placed in infancy among the !xoo, would surely grow up speaking fluent !xoo. So the ability to make and distinguish a wide range of speech sounds now forms part of human biology—we could all of us have potentially made and distinguished all of the 112 sounds of !xoo, even if at our present age that’s become an impossible feat.
In other words, any increase in vocabulary would have selected strongly for an increase in phonological complexity. And this, in the later years of protolanguage, would have begun one of the processes that would eventually distinguish true language: the double layering of sounds and words.
I’m imagining that the earliest protolanguage words—as distinct from the manual signs and other signals of meaning—would have been indivisible chunks of sound, sharing no features with other words. If this condition held, there couldn’t be very many words. Beyond a certain very vague limit, you’d have to go to the system modern languages use, forming words from a selected handful of meaningless sounds, sounds that formed a finite set but that could be combined in, for all practical purposes, an infinitude of ways.
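For readers who like to see the arithmetic behind this trade-off, here is a toy sketch (an editorial illustration, not a model of any real phonology; the 5,000-word vocabulary is a made-up figure, and the sound inventories are the two extremes cited above):

```python
import math

def min_word_length(vocab_size: int, num_sounds: int) -> int:
    """Smallest word length L with num_sounds**L >= vocab_size,
    i.e. the shortest words that could keep vocab_size items distinct."""
    return math.ceil(math.log(vocab_size) / math.log(num_sounds))

# Hypothetical vocabulary of 5,000 words; sound inventories of 11
# (Rotokas) and 112 (!xoo).
for sounds in (11, 112):
    print(sounds, "sounds ->", min_word_length(5000, sounds), "sounds per word")
```

With only 11 sounds, 5,000 words need at least four sounds apiece; with 112, two would suffice. Real languages keep words far longer than these bare minimums, but the direction of the pressure is the same: fewer sounds force longer words, and a growing vocabulary forces more sounds.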
But the point I really want to focus on is that language, like niche construction, is an autocatalytic process. Once it’s started, it drives itself; it creates and fulfills its own demands. The more you do, the more you can do, and indeed the more you have to do. This may sound magical, but it really isn’t. The degree of flexibility inherent in gene expression, far from limitless yet by no means negligible, interacts with the experiences of members of a species to generate new and more focused behaviors. That’s how evolution works.
It’s likely too that protolanguage in its later stages acquired what many might term syntax, though it really isn’t. The simple fact of predication—saying something, then saying something about that thing—gives a fairly fixed serial order to utterances. In many, maybe most cases, utterances still go from the known to the unknown—they tell you a class or a name that you already know, then add some (hopefully) new information about it. So (as in the pidgin version of the barbed-weapon scenario, above) there would probably have been a statistical preponderance of what, in a true language, you’d have to call “subject-first” sentences.
But that’s something you could do with beads-on-a-string chaining. To get to the next level, you had to put things together in a different way.
12
THE SAPLING BECOMES AN OAK
ONLY CONNECT, BUT GET IT RIGHT THIS TIME
We saw in chapter 9 that there are two ways in which words can be put together—like beads on a string or in hierarchical structures, forming A and B into [AB] and then adding C, not to A, not to B, but to the new unit [AB]. And so on, ad infinitum. Originally beads on a string was all there was. Later, much later, this way was relegated to units larger than the sentence, or to those of us who need to speak in a language barely known to us, or who must initiate some means of linguistic contact from scratch (the fate of pidgin speakers worldwide). For sentences and all smaller units, the hierarchical way became universal.
When this happened, we don’t know. My best guess is at the very earliest a couple of hundred thousand years ago. That’s the earliest date so far suggested for the origin of our species. And it’s around then that the first signs of really human behavior become manifest. Tools start to shape up a little, but it’s not that. People are beginning to use ochre and other pigments to decorate their bodies (or so we assume—they were using the stuff for something, that’s for sure). Types of stone used for tool manufacture are found hundreds of miles from their sources, which suggests that some form of trade had started up. That meant contact between groups that probably didn’t even speak the same protolanguage.
Recall that in protolanguage, the speaker thought of a word and then transmitted it directly to the organs of speech, then the next, and the next, without linking them in the brain prior to utterance. In language today, words up to at least the phrase level are assembled within the brain and a much more complex message is sent to the vocal organs. Before that could be accomplished, at least two conditions had to be satisfied.
One we have glimpsed already, and it’s essential—without it, even the simplest hierarchical structures are impossible. That’s the establishment of neural links between representations of different words, representations that are widely distributed in the neocortex. (What the combined message looks like, whether it’s the mere sum of two or more messages, or whether these undergo mergers and/or changes, and if so, of what kind, are mysteries that as yet we don’t seem even near solving.) But there’s a further condition that had to be met before hierarchical structuring could serve as a viable alternative to good old-fashioned beads-on-a-string processing.
Sending any kind of message through the brain takes up a measurable amount of time, even if it’s measurable only in milliseconds. When each word goes directly to utterance on its own, that time is very short, so some serious constraints on neural messaging have little or no effect here.
But those constraints do affect longer messages. They are, one, the fact that nerves are leaky, hence the quality of any message will degrade over time, and two, the fact that the brain is a very noisy place, with all sorts of other activities going on constantly—a factor that also degrades message quality.
William Calvin of the University of Washington, the author of several popular books on human evolution, pointed out that what happens in the brain resembles what happens in the singing of choirs. If only five or six people sing together, you can tell very quickly if one of them is out of tune; if a choir of a hundred or more voices is singing, half a dozen could be out of tune and you’d never know. The variation between voices averages out, so to speak, so you hear only a single, seemingly seamless flow of sound.
In just the same way, Calvin argues, you need a large cohort of nerve cells synchronously sending out the same message if you’re to override the distortion and degradation that inevitably afflict some individual cells. Until you have plenty of spare cells positioned where they can be recruited to support the message, so that big choirs of neurons are all singing the same song, merely building hierarchical structures can’t guarantee that words or structures won’t come out garbled. Until that time, it’s safer, more reliable, to stick to the old beads-on-a-string routine.
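Calvin’s choir is, at bottom, just the statistics of averaging, and a toy simulation makes the point (my sketch, with made-up noise figures, not Calvin’s model):

```python
import random

def averaged_error(signal: float, cohort: int, noise: float,
                   trials: int = 2000) -> float:
    """Mean absolute error when `cohort` noisy cells send the same
    signal and the receiver hears only their average."""
    total = 0.0
    for _ in range(trials):
        heard = sum(signal + random.gauss(0, noise)
                    for _ in range(cohort)) / cohort
        total += abs(heard - signal)
    return total / trials

random.seed(1)
for cohort in (1, 6, 100):  # a soloist, a sextet, a hundred-voice choir
    print(f"{cohort:3d} cells: mean error {averaged_error(1.0, cohort, 0.5):.3f}")
```

The error shrinks roughly with the square root of the cohort size: a single noisy cell garbles the message badly, while a hundred singing together deliver it almost clean.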
Why not stick to that method anyway? Why switch to hierarchical structuring?
Language evolution, as I’ve said, is an autocatalytic process. It drives itself, selecting for things that will make it more effective. One of these things is sheer speed. Outside of life-or-death warning calls, speed doesn’t directly affect fitness, but any organism that can get its message across sooner and get on with its life has an advantage, certainly a social advantage, over one that’s slower. And haven’t you ever been driven to a fury by one of those speakers who doles out his words as if he were dispensing hard-earned cash, one or two word-coins at a time? Hierarchically structured speech, as I found when I compared pidgin and creole speakers in Hawaii, is up to three times faster than beads-on-a-string speech. The first was doomed to oust the second, as soon as it became fully viable.
LINGUISTIC VERSUS PROTOLINGUISTIC MODES
There is a story, doubtless apocryphal, about a West African state that for some reason changed from driving on the right to driving on the left (or vice versa). But drivers shouldn’t worry, a government spokesman announced, because “the change will take place gradually.”
Well, I’m sure the change from protolanguage took place that way (though without fatalities). However, it was a very similar kind of situation. You either drive on the left or drive on the right—there’s no intermediate stage (driving down the center stripes doesn’t count). In just the same way, you use either protolanguage—beads on a string—or real language—Merge with hierarchical structure. There could not have been, as some seem to suppose, a series of changes in protolanguage that brought it gradually closer to real language; either an utterance is hierarchically structured or it isn’t. It was simply that more and more proto-people would use language, and those who used it would do so more and more of the time.
The situation is complicated by the fact that you can’t necessarily tell whether a given utterance was produced linguistically. Take a simple sentence such as “I like chocolate.” It could be structured as in (A), a flat beads-on-a-string sequence (I, then like, then chocolate), or as in (B), a hierarchical structure, [I [like chocolate]].
Perhaps the most important point to grasp here is that to make language the brain doesn’t have to put things together in the same way as it does to make protolanguage.
If you’re using protolanguage, sending each word to the speech organs the moment it bobs up, words have to be put together in the sequence in which they are uttered. There’s no avoiding this. It’s a logical necessity.
If you’re using language, forming phrases and short clauses in the brain before uttering them, you don’t have to put words together in the sequence in which they’ll be uttered. In principle, you could assemble them any which way, so long as the completed phrase comes out right. In practice, it’s most probable that the brain assembles sentences from the bottom up, by the simple process of first combining the words that are closest to one another.
In “I like chocolate,” which are closest—“I” and “like,” or “like” and “chocolate”? Well, you can put things between “I” and “like” that you can’t between “like” and “chocolate”—“I sometimes like chocolate” but not “I like sometimes chocolate”—so “like” and “chocolate” are combined first, and only then is “I” combined with “like chocolate.”
Note that this process—what Chomsky’s minimalist program calls Merge—gives you hierarchical structure for free, so to speak. As for the linear order of spoken sentences, unavoidable if everything has to come out of one mouth, you can get this just by reading off the words at the end of each branch of the tree, from left to right.
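Here is a minimal sketch of that process in code (an illustration of Merge as described here, not a claim about real neural machinery):

```python
def merge(a, b):
    """Combine two units (words or already-merged units) into a new unit."""
    return (a, b)

def linearize(unit):
    """Read off the words at the tips of the tree, left to right."""
    if isinstance(unit, str):            # a bare word
        return [unit]
    left, right = unit
    return linearize(left) + linearize(right)

# "like" and "chocolate" merge first (nothing can come between them),
# and only then does "I" merge with the new unit [like chocolate].
vp = merge("like", "chocolate")          # [like chocolate]
sentence = merge("I", vp)                # [I [like chocolate]]
print(linearize(sentence))               # ['I', 'like', 'chocolate']
```

Notice that nothing in merge itself mentions order; the spoken sequence simply falls out when you read the leaves of the tree.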
Beads-on-a-string won’t support long and complex sentences. There are several reasons why this is so. First, longer messages sent this way would take up so much time that the receiver (and maybe even the sender!) would have forgotten the beginning before the end was reached. Second, the speaker would somehow have to keep all the parts of the sentence together without the support of any brain-internal processing. Third, even assuming these two obstacles could be overcome, structural ambiguities—ambiguities due to the absence of syntax, leaving you uncertain what went with what, and where phrases and clauses began and ended—would rapidly accumulate until the processing load on the receiver became too heavy.
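Just how fast do those ambiguities pile up? A back-of-the-envelope count (my gloss, using standard combinatorics rather than anything in the archaeological record): a flat string of n words can be grouped into binary bracketings in Catalan-number many ways, and that number explodes as the string lengthens.

```python
import math

def bracketings(n_words: int) -> int:
    """Number of distinct binary groupings of n words in a row:
    the (n_words - 1)th Catalan number."""
    n = n_words - 1
    return math.comb(2 * n, n) // (n + 1)

for n in (3, 5, 8, 10):
    print(f"{n:2d} words: {bracketings(n):5d} possible groupings")
```

Fourteen ways of grouping a five-word string might be survivable; nearly five thousand ways of grouping a ten-word string is not.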
I grant you that in a few seconds, context and common sense between them would usually tell you what was meant. But you don’t have those seconds. By the time you’ve used them up, the conversation’s already far ahead, and you’ll fall further behind with every fresh ambiguity you have to resolve. For any attempt to produce utterances longer and more complex than four- or five-word strings simply piles up the ambiguities. In language, on the other hand, structure is thoroughly predictable, and there are plenty of signals as to what that structure is. Intonation, for example: an intonation contour, a curving rise or fall or rise-fall of the voice, follows the structure of the syntax and almost always marks where boundaries between clauses lie. But you can’t have a sustained intonation contour where words are popping out one at a time.
Those who clung to the protolinguistic mode would eventually have become social cripples. People already fully linguistic would have been turned off by the slowness and clumsiness of their speech, would have treated them as dummies. The big advantage hierarchical processing has over beads-on-a-string is that it’s faster and also fully automatic. You don’t need context, you don’t need common sense; you just process the words and get the meaning instantly. The occasional ambiguities and so-called slips of the tongue (which have nothing to do with the tongue, of course, but everything to do with the degradation of neural signals described a few paragraphs ago) are a small price to pay for the immense saving in time and effort, and the ability to produce sentences of far greater subtlety and complexity than beads-on-a-string could ever achieve.
But how could hierarchical processing, in and of itself, secure such rapid and accurate comprehension?
The answer is that, in and of itself, it couldn’t. It needs to be supplemented by some system of templates, something that predicts with a high degree of accuracy the kinds of thing that hierarchical structures will produce.