THE CODEBREAKERS
Fifteen years! For what Ibn ad-Duraihim would have solved in a few hours! Yet that has always been the story of civilizations.
Analyzing the frequency and contacts of letters is the most universal, most basic of cryptanalytic procedures. A knowledge of it is requisite to an understanding of all subsequent techniques of substitution cryptanalysis. Hence it seems worthwhile to give in some detail, with an English plaintext, an example of such a solution, much as Qalqashandi did in Arabic.
Cryptanalysis rests upon the fact that the letters of language have “personalities” of their own. To the casual observer, they may look as alike as troops lined up for inspection, but just as the sergeant knows his men as “the goldbrick,” “the kid,” “the reliable soldier,” so the cryptanalyst knows the letters of the alphabet. Though in a cryptogram they wear disguises, the cryptanalyst observes their actions and idiosyncrasies, and infers their identity from these traits. In ordinary monalphabetic substitution, his task is fairly simple because each letter’s camouflage differs from every other letter’s and the camouflage remains the same throughout the cryptogram.
How would he go about doing this for the following cryptogram?
The cryptanalyst would begin by counting each letter’s frequency (how often it occurs in a text) and its contacts (which letters it touches, and how many different ones). The frequency count of this cryptogram is this:
A widely used frequency table of 200 letters of normal English is this:
But it is not possible to simply list the letters in the cryptogram in the order of frequency, and then, lining that list up with one giving the letters of normal text in their order of frequency, mechanically replace the cipher with the “plain.” In this case, the two lists are:
Brute substitution of letters of the upper row for those of the lower at the beginning of the cryptogram would give this “plaintext” oluueooanceihanpjatd … Obviously, the two frequency counts do not match. Which is not surprising, since they are based on different texts, using different words with different letters in them. But whereas the relative frequencies may shift slightly, making, say, i more frequent than a in a particular case, the letters generally do not stray very far from their home areas in the frequency table. Thus, e, t, a, o, n, i, r, s, and h will normally be found in the high-frequency group; d, l, u, c, and m in the medium-frequency group; p, f, y, w, g, b, and v in the lows, and j, k, q, x, and z in the rare group. Furthermore, a sharp break in frequency usually sets off the highs from the mediums; the lowest of the highs, h, is normally 6 per cent, while the highest of the mediums, d, is only 4 per cent. This step-down is quite visible in the cryptogram’s frequency count:
It is the drop between F and C. Though one of the usual nine highs has slipped out of its category, the remaining eight letters above the division are almost certainly all high-frequency letters. N probably represents e, which is outstandingly the most common letter (about one in every eight of normal text). Frequency alone cannot tell much more than this.
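The counting step, and the search for the step-down between highs and mediums, can be sketched in a few lines of Python. This is only an illustration: the toy string is hypothetical (it is not the book's cryptogram), the function names are invented here, and picking the single largest fall between neighbouring counts is a simplifying assumption standing in for the judgment the cryptanalyst makes by eye.

```python
from collections import Counter

def frequency_table(ciphertext):
    """Return (letter, count) pairs, most frequent first."""
    letters = [ch for ch in ciphertext.upper() if ch.isalpha()]
    return Counter(letters).most_common()

def sharpest_drop(table):
    """Return the two adjacent letters separated by the largest fall in count."""
    best = max(range(len(table) - 1),
               key=lambda i: table[i][1] - table[i + 1][1])
    return table[best][0], table[best + 1][0]

# Hypothetical toy text, built so that the break falls between F and C.
sample = "NNNNNNHHHHHOOOOOFFFFCCDB"
table = frequency_table(sample)
print(table)
print(sharpest_drop(table))  # ('F', 'C')
```

On a real cryptogram the break is seldom this clean, so the result is best treated as a hint to be checked against the contacts.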
But contact can. Every letter has a cluster of preferred associations that constitute its most distinguishing characteristic. The cryptanalyst can spot these almost by eye if he sets up a contact chart for the high-frequency cipher letters like the one on the following page. In the chart, the letter being counted stands at the left, with the other letters strung out in order of frequency in a line to the right. Each tally above a letter in the line means that the letter in that line has preceded the subject letter in one instance, while each tally below means that it has followed the subject letter.
The chart shows that H has preceded N three times—in other words, that the digraph HN has occurred three times—and has followed it, to make NH, just once.
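A contact chart is, in effect, a count of digraphs kept from the point of view of each letter. The sketch below is one possible way to tabulate it; the function name and the toy string are assumptions made for illustration, and the string was constructed only so that HN occurs three times and NH once, as in the example just given.

```python
from collections import Counter, defaultdict

def contact_chart(ciphertext):
    """Tally, for every letter, which letters precede it and which follow it."""
    letters = [ch for ch in ciphertext.upper() if ch.isalpha()]
    before = defaultdict(Counter)  # before[X][Y]: times Y came just before X
    after = defaultdict(Counter)   # after[X][Y]: times Y came just after X
    for left, right in zip(letters, letters[1:]):
        before[right][left] += 1
        after[left][right] += 1
    return before, after

# Hypothetical toy text, not the book's cryptogram.
sample = "HNXHNYHNZNH"
before, after = contact_chart(sample)
print(before["N"]["H"])  # digraph HN: 3
print(after["N"]["H"])   # digraph NH: 1
# Number of different letters that N touches on either side
print(len(set(before["N"]) | set(after["N"])))
```

The last line gives the variety of contacts, the second of the two things the cryptanalyst counts.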
In a chart like this, plaintext e is about as hard to recognize under its cipher masquerade as a six-and-a-half-foot-tall man at a costume party. It is president of this republic of letters because it leads all the rest in frequency, yet it is democratic enough to contact more different letters more often than any other letter, including a goodly number in the low-frequency bracket. Indubitably, N here is President e.
Next most distinctive are the three high-frequency vowels, a, i, and o. Like rival dowagers at a society ball, they avoid one another as much as possible. A glance at the contact chart shows that ciphertext O, U, and A are the most mutually exclusive. (H, which rarely associates with U and A, is ruled out as a vowel possibility because it contacts O so often.) Thus, these three letters probably represent the three high-frequency plaintext vowels. Which is which can often be ascertained by the fact that the plaintext digraph io is fairly frequent while the other five combinations (oi, ia, ai, oa, ao) are fairly rare. The contact chart shows these frequencies: OA, 2; OU, 1; and UO, UA, AO, and AU, all zero. If OA = io, then U would be a, and OU would be ia, which happens to be the most common of the other five digraphs. Better still, NU, which appears five times, would stand for ea, the most frequent of all the digraphs involving vowels, while UN, which does not exist in this message, would stand for ae, the rarest. This is a nice corroboration for the vowel identification. Even if identification of the individual vowels is not possible, it is nearly always wise to begin the analysis by determining which letters are the four high-frequency vowels.
What about consonants? The easiest to spot is plaintext n because four fifths of the letters that precede it are vowels. The contact chart shows that ciphertext T is preceded by ciphertext N, O, U, and A 17 times out of 23. It is a good bet for n.
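The test for n amounts to a simple proportion: of all the letters standing immediately before the candidate, how many are already believed to be vowels? A minimal sketch follows, with a hypothetical function name and a hypothetical toy string chosen so that the arithmetic can be checked by hand.

```python
def vowel_predecessor_share(text, letter, vowels):
    """Fraction of the occurrences of `letter` that are preceded by a vowel."""
    letters = [ch for ch in text.upper() if ch.isalpha()]
    preceding = [a for a, b in zip(letters, letters[1:]) if b == letter]
    if not preceding:
        return 0.0
    return sum(1 for ch in preceding if ch in vowels) / len(preceding)

# Hypothetical toy text: N is preceded by A, E, X and I in turn.
toy = "ANENXNIN"
share = vowel_predecessor_share(toy, "N", set("AEIO"))
print(share)  # 0.75
```

In the cryptogram the same function would be called with the candidate cipher letter and the cipher letters already taken for vowels, for instance `vowel_predecessor_share(cryptogram, "T", set("NOUA"))`, where `cryptogram` stands for the message text.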
The behavior of Y in the chart is striking. It runs before N (= e) like a herald and never follows it, while on the other hand it invariably tags along behind H and never precedes it. It behaves, in fact, just like plaintext h. The digraph he is one of the most common in English, while eh is very rare; th is the most common of all, and ht is also fairly rare. If Y = h, then ciphertext H must be t, an assumption that fits in well with its frequency. In telegraphic texts where the word “the” is deleted, plaintext h can usually be spotted because, just the opposite of n, it precedes vowels about ten times as often as it follows them.
The only two high-frequency letters remaining to be identified are r and s. The basic difference between them is that r, rather like a social climber, associates much more with the vowels (dowagers a, i, and o as well as President e) than does s, while s, a proletarian at heart, mingles with the consonants, the blue-collar laborers of the alphabet. These differences in their contacts hold both absolutely and relatively. In the chart, however, inspection of the contact bars for G and F, the only two high-frequency letters left, yields contradictory evidence: F contacts the identified vowels more often than G (21 times to G’s 17), but it also contacts the three high-frequency consonants (t, n, h) more (4 times to G’s 3), even though its frequency is lower.
It is not necessary to force the decision, for even without these identifications, 160 out of the 280 letters in the message have been given tentative plaintext equivalents. The acid test as to whether they are right, of course, consists in substituting them into the cryptogram and seeing whether they make sense. In doing so, many cryptanalysts use pencils of different colors for the plain- and the ciphertext to make them easier to distinguish. They also leave a lot of space between the lines of the ciphertext to allow for multiple hypotheses, erasures, underlining of repetitions, and so on.
Just this portion of the message will suffice for its solution. The cryptanalyst uses these tentative identifications to root out the meanings of other cipher letters. He does this by guessing at what the missing letters should be to make up intelligible text. For example, near the beginning the plaintext sequence -ith- appears. This could be a portion of the word with.
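Substituting the tentative values is mechanical and easy to sketch. In the example below the seven equivalences are the ones proposed so far in the text; the function name is invented, and the opening cipher letters are an assumption, re-enciphered here from the plaintext and equivalences given later in the chapter rather than copied from the printed cryptogram.

```python
def apply_partial_key(ciphertext, key, unknown="-"):
    """Replace identified cipher letters with plaintext; mark the rest."""
    return "".join(key.get(ch, unknown)
                   for ch in ciphertext.upper() if ch.isalpha())

# Tentative identifications made from frequency and contact.
tentative = {"N": "e", "O": "i", "U": "a", "A": "o",
             "T": "n", "Y": "h", "H": "t"}

# Assumed opening of the cryptogram (reconstructed, see the note above).
opening = "GJXXNGGOTZNUCOTWMOHYJTKTAMT"

first_pass = apply_partial_key(opening, tentative)
print(first_pass)   # ----e--in-ea-in--ith-n-no-n

# Trying the guess M = w fills in every M at once.
second_pass = apply_partial_key(opening, {**tentative, "M": "w"})
print(second_pass)  # ----e--in-ea-in-with-n-nown
```

Keeping the key in a dictionary makes it cheap to try a guess, look at the result, and discard the guess if it produces nonsense.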
No cryptanalyst, if asked, could at this point give any proof that his assumption is correct. All it is now is a kind of guess, guided only by the porous laws of probability. Successive guesses will either increasingly confirm it or contradict it, causing the cryptanalyst to discard it. But each successive assumption is put forth at first upon the same slim basis as this. Eventually, the internal consistency of the final result piles up such an immense weight of probability that the validity of the solution becomes a virtual certainty. But the cryptanalyst who seeks proof absolute for each assumption as he makes it will never find it, and he will never solve the cipher.
Here, however, with seems likely. This assumption means that M = w, and this equivalence can be filled in wherever M appears in the cryptogram, to see whether it suggests any more new words. Just ten letters down the line, it forms the sequence with-n-nown-i-…, which suggests the phrase with unknown. The long plaintext sequence -int-ition- provides a check: the J = u identity fits right in to form the word intuition. The new plaintext letters are inserted and used to provide clues to still more letters. This process of reconstructing the plaintext—perhaps the easiest and the most fun in cryptanalysis—is called “anagramming.”*
It can be greatly speeded by a parallel reconstruction: that of the key alphabet. If the ciphertext letters are written under a normal alphabet that serves as the plaintext alphabet, their mere arrangement will often donate additional equivalences. The ciphertext listings thus far recovered are these:
Because it is difficult to remember an incoherent string of 26 letters that constitutes the set of cipher equivalents, cipher alphabets are often based on a single word that is easy to memorize. Various derivations are possible, but the simplest is just to write out the keyword, omitting repeated letters, then to follow it with the remaining letters of the alphabet. Thus the cipher alphabet springing from the keyword CHIMPANZEE would be:
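The derivation just described (keyword first, repeated letters dropped, remaining letters after it) can be sketched as follows. The function name is hypothetical, and the sketch assumes the simplest arrangement, in which the first keyword letter stands under plaintext a with no shift.

```python
import string

def keyword_alphabet(keyword):
    """Keyword without repeated letters, followed by the unused letters."""
    sequence = []
    for ch in keyword.upper() + string.ascii_uppercase:
        if ch.isalpha() and ch not in sequence:
            sequence.append(ch)
    return "".join(sequence)

cipher_row = keyword_alphabet("CHIMPANZEE")
print(string.ascii_lowercase)  # plaintext alphabet
print(cipher_row)              # CHIMPANZEBDFGJKLOQRSTUVWXY
```

Everything after the ninth letter runs in alphabetical order, which is the weakness the cryptanalyst exploits.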
The portion of the ciphertext alphabet following the keyword contains long alphabetical sequences. Often the cryptanalyst can complete segments that have been partially filled in, and thus recover more equivalencies. For example, if he sees QR-TU, he needs no great wit to realize that the missing letter must be S.
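Filling such a gap can be sketched as a search for the letters that keep the run in alphabetical order. The function name and the second example segment are hypothetical, and the sketch assumes a segment with exactly one gap and letters on both sides of it or at the edge.

```python
import string

def gap_candidates(segment, used=(), gap="-"):
    """Letters that could fill the single gap while keeping alphabetical order."""
    i = segment.index(gap)
    low = segment[i - 1] if i > 0 else "@"                  # "@" sorts just before "A"
    high = segment[i + 1] if i + 1 < len(segment) else "["  # "[" sorts just after "Z"
    return [ch for ch in string.ascii_uppercase
            if low < ch < high and ch not in used and ch not in segment]

print(gap_candidates("QR-TU"))              # ['S']
print(gap_candidates("AC-F"))               # ['D', 'E']
# A letter already assigned elsewhere in the key is ruled out.
print(gap_candidates("AC-F", used={"D"}))   # ['E']
```

When more than one candidate survives, letters already placed elsewhere in the key alphabet narrow the choice, which is what the `used` argument stands for.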
One such segment leaps to the eye in the partial alphabet recovered from the cryptogram: HJ-M. Only K or L can fit there, but since ciphertext K has already been assigned to plaintext k (from unknown), L must slide in to represent v, thereby giving the cryptanalyst a free identification. The technique can help in another way: to decide between F and G for r and s. If F = s and G = r, the sequence in the key alphabet under r and s would run backwards:
…rs…
…GF…
This is unlikely, so F = r and G = s. The cipher alphabet also gives ideas for plaintext equivalencies. For example, U = a in the alphabet, so if the cryptanalyst sees a V in the cryptogram, he may try b as one possibility for its plaintext to complete the UV segment under ab. In this case, it happens to work out right. With these new values inserted in the top two lines, the solution is virtually finished:
The two X’s must be two c’s to make success; then B must be p to form ciphers; E must be f, for four; W, g for things and -ing; and so forth. At this point, hypotheses pour in literally faster than they can be written down. The plaintext (with punctuation supplied) reads: “Success in dealing with unknown ciphers is measured by these four things in the order named: perseverance, careful methods of analysis, intuition, luck. The ability at least to read the language of the original text is very desirable but not essential.” Such is the opening sentence of Parker Hitt’s Manual for the Solution of Military Ciphers.
The full key alphabet, including equivalents for plaintext j, q, and z, which did not appear, is based on the keyphrase NEW YORK CITY:
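The key can be rebuilt from the keyphrase in a few lines. The starting position is an inference, not something the text states outright: the equivalences recovered above (N = e, U = a, V = b, X = c, and so on) are all satisfied if the keyphrase sequence begins under plaintext e and wraps around the end of the alphabet. The function name is hypothetical.

```python
import string

def shifted_key(keyphrase, plain_start):
    """Map plaintext letters to a keyphrase alphabet that begins under plain_start."""
    sequence = []
    for ch in keyphrase.upper() + string.ascii_uppercase:
        if ch.isalpha() and ch not in sequence:
            sequence.append(ch)
    offset = string.ascii_lowercase.index(plain_start)
    return {string.ascii_lowercase[(offset + i) % 26]: c
            for i, c in enumerate(sequence)}

key = shifted_key("NEW YORK CITY", "e")  # start under e: inferred, see above
cipher_row = "".join(key[p] for p in string.ascii_lowercase)
print(string.ascii_lowercase)
print(cipher_row)  # UVXZNEWYORKCITABDFGHJLMPQS

encipher = lambda text: "".join(key[ch] for ch in text.lower() if ch in key)
print(encipher("success"))  # GJXXNGG
```

Under this reconstruction the three plaintext letters that did not occur in the message, j, q, and z, receive R, D, and S.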
The careful examination of the propensities of the various plaintext letters may seem unnecessary. In the case of monalphabetic substitutions with word divisions, solution may often be obtained by taking a stab at common words (the, and), guessing at pattern words whose repeated letters form a distinctive configuration (WXYZY might be there), or comparing short words (HX, XH, HL, PL, and PX might be on, no, of, if, and in). But a knowledge of the characteristics of plaintext lies at the heart of the solution of more complex ciphers, where that plaintext is concealed more effectively. Naturally, in shorter cryptograms, solutions do not run quite as smoothly as the longer ones that allow the statistics of language enough play to become reliable. For these more difficult problems, expert solvers offer novices two tips: (1) make contact charts: the drudgery usually pays off in faster and more accurate identifications; (2) when stumped, and no likely plaintext values are visible, try something and see where it leads; even if it proves wrong, it has narrowed down the possibilities. No cryptogram was ever solved by staring at it. Finally, it should be noted that monalphabetic substitutions that use numbers or symbols as their ciphertext equivalents are solvable in the same way as those using letters. The difference in the camouflage does not alter the features of the underlying language.
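The pattern-word shortcut mentioned above lends itself to a small sketch: reduce each word to the order in which its letters first appear, then compare. The function names and the short word list are hypothetical; a real solver would draw on a full dictionary.

```python
def pattern(word):
    """Number each letter by the order of its first appearance in the word."""
    first_seen = {}
    return tuple(first_seen.setdefault(ch, len(first_seen)) for ch in word.upper())

def candidates(cipher_word, wordlist):
    """Words whose repeated-letter configuration matches the cipher word."""
    target = pattern(cipher_word)
    return [w for w in wordlist if pattern(w) == target]

words = ["there", "where", "these", "which", "every"]  # hypothetical word list
print(pattern("WXYZY"))            # (0, 1, 2, 3, 2)
print(candidates("WXYZY", words))  # ['there', 'where', 'these']
```

Several words usually share a pattern, so the match only narrows the field; the surrounding text decides among the survivors.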
* Unless otherwise noted, all dates are A.D.
* This usage of the term seldom conflicts with its traditional sense of rearranging letters of one text to spell another, like night to thing.
3
THE RISE OF THE WEST
WESTERN CIVILIZATION began the use of political cryptology that it has continued uninterrupted to the present as it emerged from the feudalism of the Middle Ages. The secret writing of that time was as embryonic as other elements of what was to become the world’s dominant civilization. Its use was at first infrequent and irregular; the systems were rudimentary, even in the church, still the greatest and most wide-ranging power of its day. But there were no longer any regressions, no thousand-year hiatuses. Cryptology only progressed. And from the earliest days there existed the two basic modern forms: codes and ciphers.
The substitutions of code stemmed in part from abbreviations, in part from obscure epithets and imagery used in oracles and magic half to reveal, half to conceal meanings. The oldest cryptographic document in the Vatican archives includes substitutions of both origins. This is a little list of name-equivalents compiled in 1326 or 1327 for use in the struggle between the pro-pope Guelphs and the pro-Holy Roman Emperor Ghibellines in central Italy. It replaced the title official (evidently representing anyone of authority) by the single letter o. The Ghibellines became EGYPTIANS and the Guelphs the CHILDREN OF ISRAEL. A decade later, another list moved away from the jargon and introduced some secrecy to its abbreviations when it gave LORD A as the equivalent for our lord. Finally, on an undated slip of paper, perhaps a little later than the second list, appears the first modern code. It is very small but it manifests undiluted the essential attribute: the paramountcy of secrecy in its substitutions (though they secondarily enjoy the advantages of abbreviation): A = king, D = the Pope, S = Marescallus, and so on.
Ciphers, of course, had been used by monks all through the Middle Ages for scribal amusement, and the Renaissance knew from its study of such classic texts as Suetonius that the ancient world had used ciphers for political purposes. Hence the basic concept was already known. As early as 1226, a faint political cryptography appeared in the archives of Venice, where dots or crosses replaced the vowels in a few scattered words. A century and a half later, in 1363, the Archbishop of Naples, Pietro di Grazie, enciphered vowels fairly regularly in his correspondence with the papal curia and with cardinals. In 1379, the antipope Clement VII, who had fled to Avignon the previous year to begin the Great Schism of the Roman Catholic Church, in which two popes claimed to reign, saw the need of new ciphers for his new establishment. One of his secretaries, Gabrieli di Lavinde, a man from Parma who had perhaps worked in one of the chancelleries of the northern Italian city-states, compiled a set of individual keys for 24 correspondents of Clement, among them Niccolò of Naples, the Duke of Montevirdi, and the Bishop of Venice.
Lavinde’s collection of keys, the oldest extant in modern Western civilization, includes several that combine elements of both code and cipher. In addition to a monalphabetic substitution alphabet, often with nulls, nearly every key comprised a small repertory of a dozen or more common words or names with two-letter code equivalents. They constitute the earliest examples of a cryptographic system that was to hold sway over all Europe and America for the next 450 years: the nomenclator. The nomenclator united the cipher substitution alphabet of letters and the code list of word, syllable, and name equivalents; it is a cross between the two basic systems. Code and cipher were separated in the early nomenclators but were merged in the later. The nomenclator eventually expanded its word-substitution lists from the few dozen names of Lavinde to the 2,000 or 3,000 syllables and words of those of Czarist Russia in the 1700s.
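The principle of the nomenclator, a code list consulted first and a cipher alphabet for everything else, can be sketched as below. Every entry is hypothetical: neither the two-letter code groups nor the cipher alphabet comes from Lavinde's keys or from any historical nomenclator, and nulls are left out.

```python
import string

# Hypothetical code list: common words and names get two-letter equivalents.
CODE_LIST = {"pope": "AB", "king": "CD", "venice": "EF"}

# Hypothetical monalphabetic cipher alphabet for all other words.
CIPHER = dict(zip(string.ascii_lowercase, "QWERTYUIOPASDFGHJKLZXCVBNM"))

def nomenclator_encode(message):
    """Use the code list where a word is listed, the cipher alphabet otherwise."""
    groups = []
    for word in message.lower().split():
        if word in CODE_LIST:
            groups.append(CODE_LIST[word])
        else:
            groups.append("".join(CIPHER[ch] for ch in word if ch in CIPHER))
    return " ".join(groups)

encoded = nomenclator_encode("the king writes")
print(encoded)  # ZIT CD VKOZTL
```

In this toy form a code group cannot be told apart from two enciphered letters; the sketch keeps word spacing only to make the two layers visible.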
The West’s earliest known homophonic substitution cipher, used at Mantua with Simeone de Crema in 1401
The first substitution alphabets provided only a single substitute for each plaintext letter. Later ones supplied multiple substitutes. The first known Western instance of multiple cipher-representations occurs in a cipher that the Duchy of Mantua prepared in 1401 for correspondence with one Simeone de Crema. Each of the plaintext vowels has several possible equivalents. This testifies silently that, by this time, the West knew cryptanalysis. There can be no other explanation for the appearance of these multiple substitutes, or homophones. The cipher secretary of Mantua introduced them to hinder anyone who might try to solve an intercepted dispatch, for each extra cipher symbol means that much more work, that much more that has to be dug out by the cryptanalyst. That the homophones were applied to vowels, and not just indiscriminately, indicates a knowledge of at least the outlines of frequency analysis.
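Why homophones hinder frequency analysis can be seen in a small sketch. The table below is hypothetical and has nothing to do with the actual symbols of the Mantuan cipher: each vowel receives several substitutes and each consonant one. Taking the substitutes in strict rotation is a simplifying assumption made to keep the example deterministic; a cipher clerk would choose among them freely.

```python
from collections import Counter
from itertools import cycle

# Hypothetical table: consonants keep one substitute, vowels get several.
TABLE = {ch: [ch.upper()] for ch in "bcdfghjklmnpqrstvwxyz"}
TABLE.update({"a": ["1", "2", "3"], "e": ["4", "5", "6"],
              "i": ["7", "8"], "o": ["9", "0"], "u": ["+", "&"]})

def encipher(plaintext, table=TABLE):
    """Rotate through each letter's homophones in turn."""
    wheels = {p: cycle(subs) for p, subs in table.items()}
    return "".join(next(wheels[ch]) for ch in plaintext.lower() if ch in wheels)

def decipher(ciphertext, table=TABLE):
    """Every homophone leads back to a single plaintext letter."""
    inverse = {s: p for p, subs in table.items() for s in subs}
    return "".join(inverse[s] for s in ciphertext)

message = "seekthesecretletter"
secret = encipher(message)
print(secret)
print(Counter(message).most_common(1))  # e stands out in the plaintext
print(Counter(secret).most_common(1))   # no substitute for e stands out
```

The seven e's of the plaintext are spread over three symbols, so the most frequent cipher symbol is no longer a vowel at all.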
Where did that knowledge come from? It probably developed indigenously. Though it is true that contact with the Moslem and other civilizations during the Crusades triggered the cultural explosion of the Renaissance, and that Arabic works of science, mathematics, and philosophy poured into Europe from Moorish centers of scholarship in Spain, it seems unlikely that cryptanalysis emigrated from there. It was considered more a branch of grammar than of science or mathematics; it was linked too closely in Arabic tradition to the language of the Koran; its importance was much less than that of medicine or algebra or alchemy; in any case, neither Ibn ad-Duraihim’s nor Qalqashandi’s works, the only ones known to give a full explanation of the technique, were translated. It is possible but improbable that a diplomat to one of the Arabic lands may have brought back a knowledge of cryptanalysis. But cultural diffusion such as this would probably leave some written records, and none exist for any transfer of cryptanalysis from Islam to Christendom. It is dangerous to infer something from nothing, but given two possibilities, the nothing may imply one possibility more than the other: and it would rather be expected that no written records would be created if cryptanalysis developed spontaneously. The bright chancellery official who succeeded in puzzling out the meaning of the enciphered words in a captured dispatch would be hardly likely to give away, either orally or in writing, the knowledge that could bring him extra money and prestige.