The title divulges the novel’s distinction: Gadsby, A Story of Over 50,000 Words Without Using the Letter E. It is an amazing tour de force. Let the skeptical reader see how long it takes to compose even a sentence without an e. The author of Gadsby, a persevering, dauntless, white-haired old gentleman named Ernest Vincent Wright, enumerated some of the problems of his self-imposed task. He had to avoid most verbs in the past tense because they end in -ed. He could never use the definite article the or the pronouns he, she, they, we, me, and them. Gadsby had to omit such seemingly indispensable verbs as are, have, were, be, and been and such basic words as there, these, those, when, then, more, after, and very. A purist, Wright refused to use numbers between 6 and 30, even as digits, because an e was implied when they were spelled out. (“When introducing young ladies into the story, this is a real barrier,” Wright complained, “for what young woman wants to have it known that she is over thirty?”) Similarly, he banned Mr. and Mrs. because of the e in their unabbreviated forms. One of the most annoying problems would arise when, near the end of a long paragraph, he could find no e-less word with which to complete the thought and had to go back and rewrite the entire paragraph. So frequently did Wright find himself wanting to use a word containing an e that he had to tie down the e typebar of his typewriter to make it impossible for one to slip in.
“And many did try to do so,” he says in his preface. “As I wrote along, in long-hand at first, a whole army of little e’s gathered around my desk, all eagerly expecting to be called upon. But gradually as they saw me writing on and on, without even noticing them, they grew uneasy; and, with excited whisperings amongst themselves, began hopping up and riding on my pen, looking down constantly for a chance to drop off into some word; for all the world like seabirds perched, watching for a passing fish! But when they saw that I had covered 138 pages of typewriter-size paper, they slid off onto the floor, walking sadly away, arm in arm; but shouting back: ‘You certainly must have a hodge-podge of a yarn there without Us! Why, man! We are in every story ever written, hundreds of thousands of times! This is the first time we ever were shut out!’ ” The story required, Wright declared, “five and a half months of concentrated endeavor, with so many erasures and retrenchments that I tremble as I think of them.”
Wright’s trembling dramatizes how tenaciously and pervasively a single letter asserts itself in English. Other letters are equally tenacious, and other writers have, as literary curiosities, produced lipograms—writings in which one or more letters are deliberately omitted. A classical Greek author named Tryphiodorus reportedly composed an Odyssey whose first book excluded alpha, the second book beta, and so on through all 24 books. But despite the inflexibility of letter frequency, and despite the wide variation among the frequencies of individual letters in all languages, the phenomenon is so inconspicuous that many people never even suspect its presence.
One such was Christopher Latham Sholes, the inventor of the typewriter and, apparently, the perpetrator of its atrocious keyboard. The keyboard arrangement first appeared in a preproduction model built in 1872. Vestiges of an alphabetical order appear in the dfghjkl of the second row, and it is rumored but not substantiated that the top row included the letters of the word “typewriter” so that salesmen could find them easily in demonstrations. The inefficiency of the qwertyuiop keyboard costs businessmen time and money. In a right-handed world, it gives the left hand 56 per cent of all strokes. Of all motions for successive letters, 48 per cent use only one hand instead of two. Thus words like federated and addressed force the left hand to leap frantically among the keys while the right languishes in unemployed torpor. Much more efficient is the even rhythm of the two-handed thicken. As if to emphasize the problem, touch-typing places the two most agile fingers of the right hand directly on keys for two of the least frequent letters of the alphabet, j and k.
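The bookkeeping behind such figures is simple enough to sketch in a few lines of Python. This is a minimal illustration, not a reconstruction of the original motion studies; the hand assignment below is the usual touch-typing split of the qwertyuiop keyboard, and the sample words are those mentioned above.

```python
# Tally what share of strokes falls to the left hand, and how often
# successive letters stay on one hand, under the usual QWERTY split.
LEFT = set("qwertasdfgzxcvb")


def stroke_stats(text):
    letters = [c for c in text.lower() if c.isalpha()]
    left = sum(1 for c in letters if c in LEFT)
    same_hand = sum(1 for a, b in zip(letters, letters[1:])
                    if (a in LEFT) == (b in LEFT))
    return (100 * left / len(letters),
            100 * same_hand / max(len(letters) - 1, 1))


print(stroke_stats("federated addressed"))  # every stroke on the left hand
print(stroke_stats("thicken"))              # the hands mostly alternate
```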
These glaring faults have spurred the design of numerous keyboards. The Minimotion keyboard, developed after an exhaustive statistical analysis by engineer Roy T. Griffith, raises the percentage of right-handed strokes to 52, of two-handed motions to 67, and of strokes in the touch-typist’s home row to 71, against the qwertyuiop keyboard’s 32. Tests in a Chicago elementary school showed that pupils learned to type twice as fast on another simplified keyboard, the Dvorak-Dealey, as on the standard. Experiments by a New York management consultant firm conclusively demonstrated the superiority of a keyboard that fits instead of fights the principles of frequency. But all reform has been blocked by the inertia of typists who do not want to learn a new touch system all over again and by business firms that do not want to pay for the conversion of standard-keyboard typewriters.
Where men take advantage of the facts of letter frequency, they may reap extra profits. Samuel F. B. Morse is probably the best example. When he decided about 1838 to use an alphabetical system of signals for his newly invented electromagnetic telegraph, he counted the letters in a Philadelphia newspaper’s typecase so he could assign the shorter dot-and-dash symbols to the more common letters. He found 12,000 e’s, 9,000 t’s, 8,000 each of a, o, n, i, and s, 6,400 h’s, and so on. With few exceptions, he followed this list in his original code, assigning the shortest symbol, a dot, to the most common letter, e, another short symbol, a dash, to t, and so forth. With the modern International Morse Code, slightly different from his original American Morse, transmission of an English message of 100 letters requires about 940 dot-units. (The duration of a dot equals one dot-unit, a dash equals three dot-units, space between dots and dashes of a letter equals one dot-unit, space between letters equals three dot-units.) If the symbols had been assigned at random, the same message would run about 1,160 dot-units—or about 23 per cent longer. Morse’s perspicacity may have rewarded his successors financially by making it possible to handle almost one quarter more messages on a telegraph line within a rush period than if he had made up his code haphazardly.
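This arithmetic can be verified directly from the timing rules just given. The sketch below uses the standard International Morse alphabet; the letter frequencies are approximate modern English percentages rather than Morse’s typecase count, so the totals land near, not exactly on, the figures in the text.

```python
# Expected dot-units for a 100-letter English message versus a random
# assignment of the same symbols. Dot = 1 unit, dash = 3, a 1-unit gap
# inside a letter, a 3-unit gap between letters.
MORSE = {
    'a': '.-',   'b': '-...', 'c': '-.-.', 'd': '-..',  'e': '.',
    'f': '..-.', 'g': '--.',  'h': '....', 'i': '..',   'j': '.---',
    'k': '-.-',  'l': '.-..', 'm': '--',   'n': '-.',   'o': '---',
    'p': '.--.', 'q': '--.-', 'r': '.-.',  's': '...',  't': '-',
    'u': '..-',  'v': '...-', 'w': '.--',  'x': '-..-', 'y': '-.--',
    'z': '--..',
}
FREQ = {'e': 12.7, 't': 9.1, 'a': 8.2, 'o': 7.5, 'i': 7.0, 'n': 6.7,
        's': 6.3, 'h': 6.1, 'r': 6.0, 'd': 4.3, 'l': 4.0, 'c': 2.8,
        'u': 2.8, 'm': 2.4, 'w': 2.4, 'f': 2.2, 'g': 2.0, 'y': 2.0,
        'p': 1.9, 'b': 1.5, 'v': 1.0, 'k': 0.8, 'j': 0.2, 'x': 0.2,
        'q': 0.1, 'z': 0.1}


def cost(symbol):
    """Dot-units to send one letter, including the gaps inside it."""
    return sum(1 if s == '.' else 3 for s in symbol) + len(symbol) - 1


total = sum(FREQ.values())
english = 100 * sum(FREQ[c] / total * cost(m)
                    for c, m in MORSE.items()) + 99 * 3
uniform = 100 * sum(cost(m) for m in MORSE.values()) / 26 + 99 * 3
print(round(english), round(uniform))  # roughly 900 and 1,120 dot-units
```

The weighted total comes out somewhat under the 940 cited above because the frequency table assumed here differs from Morse’s typecase count; the point is the gap of roughly 20 per cent between the two assignments.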
Long before Morse, typefounders realized that it was to their advantage not to cast as many q’s or z’s as e’s in a font, though they had to add extras in the rare letters to allow for occasional odd combinations, such as Hamlet’s “Buzz, buzz!” The practice is still current: the font of 12-point Bodoni Book (a standard body type) sold by American Type Founders contains 53 lowercase e’s and only 6 z’s. Similarly, Ottmar Mergenthaler decided that the letter matrices in his Linotype should be arranged in order of the demand for each letter, perhaps to speed composition by having the more frequent letters traverse a shorter distance. This put lower-case e at the extreme left, followed by t, a, o, and so on. Since the key that controls each letter must be situated under its matrices’ channel, the keyboard as assembled reflects the frequency of letters in English:
[Illustration: A font of type, showing the greater quantities of high-frequency letters]
This accounts for the etaoin shrdlu sometimes seen in newspapers: linotypists just run a finger down the first two vertical rows of keys to fill out a line containing an error.
Even more scientifically designed is the panel of the Mergenthaler company’s Linofilm system. This system passes light through pictures of letters onto a film, where the successive images form text usable in offset photolithography. The pictures are mounted on the panel in an arrangement that exploits not only monographic but digraphic frequencies to minimize the shifting of the panel during composition. For example, t and h lie next to one another.
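The digraphic counts such a layout exploits are easy to produce. Here is a minimal sketch; the sample phrase is purely illustrative, and a production count would of course draw on a large body of text.

```python
# Count digraphs (adjacent letter pairs) of the kind the Linofilm panel
# arrangement exploits; in English, th reliably comes out on top.
from collections import Counter


def digraph_counts(text):
    letters = [c for c in text.lower() if c.isalpha()]
    return Counter(a + b for a, b in zip(letters, letters[1:]))


sample = "the theory that the thinker thought through"
print(digraph_counts(sample).most_common(3))  # th leads by a wide margin
```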
These examples imply that the frequencies of letters do remain fairly constant. Actual frequency counts back this up. A number of cryptologists have counted the numbers of e’s in German texts of about 1,000 letters, and their percentages vary only slightly: Kasiski, 18.4; Valério, 18.3; Carmona, 18.5; Hitt, 16; Givierge, 18; Lange and Soudart, 18.8; Baudouin, 19.2; Pratt, 16.7. These may be compared with a frequency count that is as close to completeness as anyone is likely to get—a tabulation of no fewer than 59,298,274 letters, derived from a count of 20,000,000 German syllables made for linguistic purposes in 1898 by the philologist F. W. Kaeding, who was nothing if not indefatigable. Kaeding found 10,598,015 e’s, or 17.9 per cent.
What is perhaps most striking is that the eight shorter counts average to 18.0 per cent—a difference of only one e per thousand letters from the Kaeding standard. Thus does language cleave to its statistical norms!
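Repeating such a count by machine takes only a few lines. The sketch below tallies letter percentages in the manner of the hand counts cited above; the German sentence is merely a stand-in for the thousand-letter texts the cryptologists actually used.

```python
# Tally letter frequencies, as Kasiski, Hitt, and the others did by hand.
from collections import Counter


def letter_percentages(text):
    letters = [c for c in text.lower() if c.isalpha()]
    counts = Counter(letters)
    return {c: 100 * n / len(letters) for c, n in counts.most_common()}


sample = "Gegen Ende des Jahres begann der lange Winter in den Bergen."
print(round(letter_percentages(sample)["e"], 1))  # e dominates even here
```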
Why? The answer may be found within the theory formulated after World War II that not only explains cryptanalysis but also extends far beyond. It is called “information theory” or, sometimes, a “mathematical theory of communication.” It deals in general with the mathematical laws that govern systems designed to communicate information. Originating in transmission problems of telephony and telegraphy, it has grown to embrace virtually all information-processing devices, from standard communications systems to electronic computers and servomechanisms, and even the nerve networks of animals and men. Its ideas have proved so suggestive that they have been adapted to such fields as psychology, linguistics, molecular genetics, history, statistics, and neurophysiology. Because of this fertility, and because of its potential in helping to manage the information explosion of the 20th century, information theory may eventually rank, Fortune magazine has speculated, among the “enduring great” theories of man. The brilliant mind that fathered it also sired its cryptologic applications.
Claude Elwood Shannon was born in Petoskey, Michigan, on April 30, 1916, and was raised in nearby Gaylord, a small town in the north-central portion of Michigan’s southern peninsula. He majored in electrical engineering and mathematics at the University of Michigan and there developed an interest in communications and cryptology. At the Massachusetts Institute of Technology, where in 1940 he was awarded a Ph.D. in mathematics, he wrote a master’s thesis of such originality that it had an immediate impact on the designing of telephone systems. After a year at the Institute for Advanced Study in Princeton, he joined the staff of the Bell Telephone Laboratories.
There he built a maze-solving mouse, used to study circuitry for logic machines, and worked on a chess-playing machine, which may be regarded as the first step in the construction of computers for evaluating military situations and deciding the best move. At one time, he was an expert tightrope walker and unicycle rider and could be seen riding his one-wheeler up and down the halls of the Bell Laboratories. These proficiencies resulted from his attempts to design a form of Pogo stick that would bounce around by itself; this never materialized, but he did succeed in producing a bicycle that maintained its own balance. He has been teaching at M.I.T. since 1956. A thin (135 pounds on a five-foot, ten-inch frame), shy man, he likes science fiction, jazz, chess, and mathematics and admits to changing hobbies very rapidly. He lives with his wife and co-worker, Betty, and their three children in a house full of awards and honors in Winchester, Massachusetts.
“During World War II,” he has said, “Bell Labs were working on secrecy systems. I’d worked on communication systems and I was appointed to some of the committees studying cryptanalytic techniques. The work on both the mathematical theory of communications and the cryptology went forward concurrently from about 1941. I worked on both of them together and I had some of the ideas of one while working on the other. I wouldn’t say one came before the other—they were so close together you couldn’t separate them.” Though the work on both was substantially complete by about 1944, he continued polishing them until their publication as separate papers in the abstruse Bell System Technical Journal in 1948 and 1949.
Both articles—“A Mathematical Theory of Communication” and “Communication Theory of Secrecy Systems”—present their ideas in densely mathematical form, pocked with phrases like “this inverse must exist uniquely” and expressions like $T_iR_j(T_kR_l)^{-1}T_mR_n$. But Shannon’s terse and incisive style breathes life into them. The first paper gave birth to information theory; the second dealt with cryptology in information-theory terms.
Chief among their new concepts is that of redundancy. Redundancy retains, in information theory, the essence of its lay meaning of needless excess, but it is refined and extended. Roughly, redundancy means that more symbols are transmitted in a message than are actually needed to bear the information. To take Shannon’s own elementary example, the u of qu is redundant because q is always followed by u in English words. Many of the the’s of ordinary language are redundant: persons sending telegrams get along without them. Just how great the excess of symbols is in English words is vividly demonstrated by some of those Army or Air Force adjutant-general communications that take off into a wild blue yonder of abbreviated words and phrases like “Off pres on AD for an indef per.” The initiated usually have no trouble in understanding this as what would normally be written, “Officer present on active duty for an indefinite period.”
Redundancy arises from the excess of rules with which languages burden themselves. These rules are mostly prohibitions—“Thou shalt not say ‘dese’ or ‘dose’ for ‘these’ or ‘those’ ”; “Thou shalt not spell ‘separate’ as ‘seperate’ ”; “Thou shalt not say ‘is’ after ‘I.’ ” All such limitations exclude perfectly usable combinations of letters. If a language permitted any sequence of four letters to serve as a word—“ngwv,” say—then 456,976 words would exist. This is approximately the number of entries in an unabridged English dictionary. Such a language could, therefore, express the same amount of information as English. But because English prohibits such combinations as “ngwv,” it must go beyond the four-letter limit to express its ideas. Thus English is more wasteful, more redundant, than this hypothetical four-letter language.
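The count is simple combinatorics: four positions with twenty-six choices at each,

$$26^4 = 26 \times 26 \times 26 \times 26 = 456{,}976,$$

which is indeed on the order of the entry count of an unabridged dictionary.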
The rules that lead to redundancy come from grammar (“I am,” not “I is”), phonetics (no word in English may begin with ng), idiom (“believe” alone may not be followed by an infinitive, only by a clause beginning with “that”). Others come from etymology, in which the derivation of a word has left many now-silent letters, as in “through” or “knight.” Still others come from limitations on vocabulary. A teen-ager who uses “swell” to mean what an adult might designate by a dozen different terms of approbation utters speech that is much more redundant, more restricted, less variable, less flexible than the adult’s. As Shannon wrote, “Two extremes of redundancy in English prose are represented by Basic English and by James Joyce’s book Finnegans Wake. The Basic English vocabulary is limited to 850 words and the redundancy is very high. This is reflected in the expansion that occurs when a passage is translated into Basic English. Joyce on the other hand enlarges the vocabulary and is alleged to achieve a compression of semantic content.”
Two other sources of redundancy are of particular importance for their role in determining the frequency table. One derives from the relationships to which human beings refer so often and which language necessarily reflects. These are the relations of one person to another (“the son of John”), of one object to another (“the book on the table”), of an object to an action (“put it down”). English expresses many of these relationships by separate words, called “function words.” Pronouns, prepositions, articles, conjunctions are all function words. Some stand for purely grammatical relationships that serve as a kind of linguistic shorthand—saying “I” instead of repeating one’s name all the time. Function words mean nothing standing alone. Yet they are among the most common words in English because the relationships they express are so common. In English, only ten of these words constitute more than one quarter of any text: the, of, and, to, a, in, that, it, is, and I totaled 26,677 of 100,000 words in a count made by Godfrey Dewey. Inevitably this preponderance will affect the frequency table. H, for example, owes most of its occurrences to the.
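Dewey’s tally is easily replicated in miniature. A minimal sketch follows, using the ten words just listed; the sample sentence is only illustrative, and a fair measurement would run over many thousands of words.

```python
# What share of a text do the ten commonest function words claim?
FUNCTION_WORDS = {"the", "of", "and", "to", "a", "in", "that", "it", "is", "i"}


def function_word_share(text):
    words = [w.strip(".,;:!?").lower() for w in text.split()]
    hits = sum(1 for w in words if w in FUNCTION_WORDS)
    return 100 * hits / len(words)


sample = "It is the business of the writer to put the right word in its place."
print(round(function_word_share(sample)))  # well over one quarter
```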
The second source of redundancy stems from the human laziness that favors sounds easier to pronounce and identify. The voiceless stops /ptk/ require less energy to articulate than the corresponding voiced stops /bdg/, and they average twice the frequency of the voiced stops in sixteen widely varying languages surveyed by George K. Zipf. Similarly, short vowels are markedly more frequent than long vowels or diphthongs. In the same way, auditors of English, at least, seem to prefer sounds that are easier to identify. Tests made with nonsense syllables show that listeners seldom confuse consonants produced with the vocal organs held in the same position but used in a different manner (such as /ntrsdlz/), but usually fail to distinguish consonants produced with the vocal organs used in the same manner but held in different positions (such as /ptk/). In the first group (the alveolar consonants), the tongue stays at the upper gum ridge but molds or interrupts the breath stream in different ways. In the second group (the voiceless stops), all the consonants block the breath stream and explosively release it, but at different positions of the lips and tongue. It is interesting to note that the easy-to-identify alveolar consonants comprise seven of the eight more-frequent consonants in English, while the two stops that are not alveolar (/pk/) lie well down in the frequency table. Incidentally, this preference for easily distinguishable consonants is one of the few explanations for the arrangement of even a few of the letters in the English frequency table.*
All these prohibitions and rules and tendencies help create redundancy. English is about 75 per cent redundant.† In other words, about three quarters of English text is “unnecessary.” English could theoretically express the same things with one quarter its present letters if it were wholly nonredundant. A literary curiosity demonstrates graphically how a few letters carry most of the information of a text while the others are redundant. The curiosity is entitled “Death and Life”:
In this, 65 per cent of the letters are in the central row (reading it twice), and serve both contradictory meanings equally well. Thus they add nothing to the information of the passage, all of which is carried in the remaining 35 per cent.
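The 75 per cent figure rests on entropy measurements of the sort Shannon made. A crude approximation from single-letter frequencies alone can be sketched as below; because it ignores the constraints between letters, it necessarily comes out far lower than the full figure.

```python
# Estimate redundancy from single-letter frequencies: R = 1 - H / H_max,
# where H is the observed entropy and H_max = log2(26) is the entropy of
# a wholly nonredundant 26-letter alphabet. Digram, trigram, and word
# constraints, ignored here, push the true figure toward 75 per cent.
import math
from collections import Counter


def single_letter_redundancy(text):
    letters = [c for c in text.lower() if c.isalpha()]
    n = len(letters)
    h = -sum((k / n) * math.log2(k / n) for k in Counter(letters).values())
    return 1 - h / math.log2(26)


sample = ("anyone who knows english will know the rules of spelling and "
          "grammar and pronunciation that help engender its redundancy")
print(round(single_letter_redundancy(sample), 2))  # roughly 0.1 for prose
```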
Anyone who knows English will know the rules of spelling and grammar and pronunciation that help engender its redundancy, and he will know these rules prior to the receipt of any new text in the language. This is almost tautological: it is only the existence of such rules that makes communication possible. If a hearer interprets “to” to mean “from,” he will not understand very much. If he pronounces a written m as /v/, a t as /s/, and so on, he will not get through to his listeners. These redundant elements, these rules, may be considered the invariant portion of language. They may not be changed without loss of comprehension. But one may say what he wishes as long as he follows them. They are the preexistent mold into which the free-will portion of a communication is poured. Hence the enormous range of texts, from laws to poems, in the same language—which is to say, following the same rules.