How Not to Be Wrong: The Power of Mathematical Thinking

by Jordan Ellenberg

  11 11 11 00 11 00 11 . . .

  Now, when the cosmic ray strikes the second bit of the message, the satellite sees

  10 11 11 00 11 00 11 . . .

  The satellite knows that each two-bit segment is supposed to be either 00 or 11, so that initial “10” is a red flag; something has gone wrong. But what? That’s tough for the satellite to figure out: since it doesn’t know exactly where the noise corrupted the signal, there’s no way to tell whether the original message started with 00 or 11.
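  A few lines of Python make the receiver's predicament concrete (the function name here is mine, a sketch rather than any official protocol): the check can flag a corrupted pair, but it can't say what the pair used to be.

```python
def check_twice(received):
    """Return the indices of two-bit segments that aren't 00 or 11."""
    pairs = [received[i:i + 2] for i in range(0, len(received), 2)]
    return [i for i, pair in enumerate(pairs) if pair not in ("00", "11")]

# The corrupted transmission above, with the spaces removed:
print(check_twice("10111100110011"))  # [0]: segment 0 is bad, but was it 00 or 11?
```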

  This problem, too, can be fixed. Just repeat three times instead of twice:

  111 111 111 000 111 000 111 . . .

  The message comes through corrupted, like this:

  101 111 111 000 111 000 111 . . .

  But now the satellite is in good shape. That first three-bit segment, it knows, is supposed to be 000 or 111, so the presence of 101 means something’s gone awry. But if the original message was 000, two bits must have been corrupted in very close proximity, an unlikely event as long as the frequency of message-zapping cosmic rays is pretty small. So the satellite has good reason to let majority rule: if two out of the three bits are 1, the odds are very good that the original message was 111.
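  Here is the same majority-rule decoding as a short sketch (again, names mine): each three-bit block votes, and two out of three wins.

```python
def decode_triple(received):
    """Decode a repeat-three transmission by letting majority rule in each block."""
    message = []
    for i in range(0, len(received), 3):
        block = received[i:i + 3]
        # Two or more 1s: the original block was almost surely 111.
        message.append("1" if block.count("1") >= 2 else "0")
    return "".join(message)

# The corrupted transmission above, spaces removed:
print(decode_triple("101111111000111000111"))  # '1110101', the intended message
```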

  What you’ve just witnessed is an example of an error-correcting code, a communications protocol that allows the receiver to cancel out the errors in a noisy signal.* The idea, like basically everything else in information theory, comes from Claude Shannon’s monumental 1948 paper, “A Mathematical Theory of Communication.”

  A mathematical theory of communication! Doesn’t that sound a little grandiose? Isn’t communication a fundamentally human activity that can’t be reduced to cold numbers and formulas?

  Understand this: I warmly endorse, in fact highly recommend, a bristly skepticism in the face of all claims that such-and-such an entity can be explained, or tamed, or fully understood, by mathematical means.

  And yet the history of mathematics is a history of aggressive territorial expansion, as mathematical techniques get broader and richer, and mathematicians find ways to address questions previously thought of as outside their domain. “A mathematical theory of probability” sounds unexceptional now, but once it would have seemed a massive overreach; math was about the certain and the true, not the random and the maybe-so! All that changed when Pascal, Bernoulli, and others found mathematical laws that governed the workings of chance.* A mathematical theory of infinity? Before the work of Georg Cantor in the nineteenth century, the study of the infinite was as much theology as science; now, we understand Cantor’s theory of multiple infinities, each one infinitely larger than the last, well enough to teach it to first-year math majors. (To be fair, it does kind of blow their minds.)

  These mathematical formalisms don’t capture every detail of the phenomena they describe, and aren’t intended to. There are questions about randomness, for instance, about which probability theory is silent. To some people, the questions that stay outside the reach of math are the most interesting ones. But to think carefully about chance, nowadays, without having probability theory somewhere in mind is a mistake. If you don’t believe me, ask James Harvey. Or, better yet, ask the people whose money he won.

  Will there be a mathematical theory of consciousness? Of society? Of aesthetics? People are trying, that’s for sure, with only limited success so far. You should distrust all such claims on instinct. But you should also keep in mind that they might end up getting some important things right.

  The error-correcting code does not at first seem like revolutionary mathematics. If you’re at a noisy party, repeat yourself—problem solved! But that solution has a cost. If you repeat every bit of your message three times, your message takes three times as long to transmit. That might not be a problem at a party, but it could be a problem if you need the satellite to turn on its right thruster right this second. Shannon, in the paper that launched the theory of information, identified the basic tradeoff that engineers still grapple with today: the more resistant to noise you want your signal to be, the slower your bits are transmitted. The presence of noise places a cap on the length of a message your channel can reliably convey in a given amount of time; this limit was what Shannon called the capacity of the channel. Just as a pipe can only handle so much water, a channel can only handle so much information.

  But fixing errors doesn’t require you to make your channel three times skinnier, as the “repeat three times” protocol does. You can do better—and Shannon knew this perfectly well, because one of his colleagues at Bell Labs, Richard Hamming, had already figured out how.

  Hamming, a young veteran of the Manhattan Project, had low-priority access to Bell’s ten-ton Model V mechanical relay computer; he was only allowed to run his programs on the weekends. The problem was that any mechanical error could halt his computation, with no one available to start the machine running again until Monday morning. This was annoying. And annoyance, as we know, is one of the great spurs to technical progress. Wouldn’t it be better, Hamming thought, if the machine could correct its own errors and keep on plugging? And so he developed a plan. The input to the Model V can be thought of as a string of 0s and 1s, just like the transmission to the satellite—the math doesn’t care whether those digits are bits in a digital stream, the states of an electrical relay, or holes in a strip of tape (at the time, a state-of-the-art data interface).

  Hamming’s first step was to break the message up into blocks of three symbols:

  111 010 101 . . .

  The Hamming code* is a rule that transforms each of these three-digit blocks into a seven-digit string. Here’s the codebook:

  000 -> 0000000

  001 -> 0010111

  010 -> 0101011

  011 -> 0111100

  100 -> 1001101

  101 -> 1011010

  110 -> 1100110

  111 -> 1110001

  So the encoded message would look like

  1110001 0101011 1011010 . . .

  Those seven-bit blocks are called code words. The eight code words are the only blocks the code allows; if the receiver sees anything else coming over the wire, something has gone wrong for sure. Say you receive 1010001. You know this can’t be right, because 1010001 isn’t a code word. What’s more, the message you received differs in only one place from the code word 1110001. And there’s no other code word that’s so close to the messed-up transmission you actually saw. So you can feel pretty safe in guessing that the code word your correspondent meant to send was 1110001, which means that the corresponding three-digit block in the original message was 111.
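  In Python, this nearest-code-word decoding is a one-line minimum over the codebook. To be clear, this is a sketch of the guessing rule just described, not Hamming's own procedure (which uses parity checks rather than a table lookup), but for eight code words a table is fine:

```python
# The codebook from the text.
CODEBOOK = {
    "000": "0000000", "001": "0010111", "010": "0101011", "011": "0111100",
    "100": "1001101", "101": "1011010", "110": "1100110", "111": "1110001",
}

def distance(a, b):
    """Number of places where two blocks of bits differ."""
    return sum(x != y for x, y in zip(a, b))

def decode_block(received):
    """Guess the code word closest to the received seven-bit block."""
    nearest = min(CODEBOOK.values(), key=lambda word: distance(word, received))
    return {v: k for k, v in CODEBOOK.items()}[nearest]  # invert the codebook

print(decode_block("1010001"))  # '111': the nearest code word is 1110001
```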

  You might think we just got lucky. What if the mystery transmission had been close to two different code words? Then we wouldn’t be able to make a confident judgment. But that can’t happen, and here’s why. Look again at the lines in the Fano plane:

  124

  135

  167

  257

  347

  236

  456

  How would you describe this geometry to a computer? Computers like to be talked to in 0s and 1s, so write each line as a string of 0s and 1s, where a 0 in place n stands for “point n is on the line” and a 1 in place n means “point n is not on the line.” So that first line, 124, gets represented as

  0010111

  and the second line, 135, is

  0101011

  You’ll notice that both strings are code words in the Hamming code. In fact, the seven nonzero code words in the Hamming code match up exactly to the seven lines in the Fano plane. The Hamming code and the Fano plane (and, for that matter, the optimal ticket combo for the Transylvanian lottery) are exactly the same mathematical object in two different outfits!
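  You can check the whole correspondence mechanically; here's a sketch using the convention just described, a 0 in place n when point n is on the line:

```python
FANO_LINES = ["124", "135", "167", "257", "347", "236", "456"]

def line_to_code_word(line):
    """0 in place n if point n is on the line, 1 if it is not."""
    return "".join("0" if str(n) in line else "1" for n in range(1, 8))

for line in FANO_LINES:
    print(line, "->", line_to_code_word(line))
# Prints exactly the seven nonzero code words, e.g. 124 -> 0010111.
```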

  This is the secret geometry of the Hamming code. A code word is a set of three points in the Fano plane that form a line. Flipping a bit in the string is the same thing as adding or deleting a point, so as long as the original code word wasn’t 0000000, the bollixed transmission you get corresponds to a set with either two or four points.* If you receive a two-point set, you know how to figure out the missing point; it’s just the third point on the unique line that joins the two points you received. What if you receive a four-point set of the form “line plus one extra point”? Then you can infer that the correct message consists of those three points in your set that form a line.

  A subtlety presents itself: how do you know there’s only one way of choosing such a set of three points? It helps if we give our points names: call them A, B, C, and D. If A, B, and C all lie on a line, then A, B, and C must be the set of points your correspondent meant to send. But what if A, C, and D also lie along a line? No worries: this is impossible, because the line containing A, B, and C and the line containing A, C, and D would then have the two points A and C in common. But two lines can only intersect in one point; that’s the rule.* In other words, thanks to the axioms of geometry, the Hamming code has the same magical error-correcting property as “repeat three times”: if a message gets modified by a single bit en route, the receiver can always figure out what message the transmitter meant to send. But instead of multiplying your transmission time by three, your new improved code sends just seven bits for every three bits of your original message, a more efficient ratio of 2.33.
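  The geometric decoding procedure, too, fits in a few lines. A sketch, with the lines of the Fano plane stored as sets of points:

```python
FANO_LINES = [{1, 2, 4}, {1, 3, 5}, {1, 6, 7}, {2, 5, 7},
              {3, 4, 7}, {2, 3, 6}, {4, 5, 6}]

def correct(points):
    """Recover the intended line from the point set of a once-corrupted code word."""
    points = set(points)
    if len(points) == 2:
        # A deleted point: the unique line through the two surviving points.
        return next(line for line in FANO_LINES if points <= line)
    if len(points) == 4:
        # An added point: the unique line contained inside the four-point set.
        return next(line for line in FANO_LINES if line <= points)
    return points  # three points forming a line: no error to fix

print(correct({1, 2}))        # {1, 2, 4}
print(correct({1, 2, 4, 6}))  # {1, 2, 4}
```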

  The discovery of error-correcting codes, both Hamming’s first codes and the more powerful ones that came later, transformed the engineering of information. The goal no longer had to be building systems so heavily shielded and double-checked that no error could ever arise. After Hamming and Shannon, it sufficed to make errors rare enough that the flexibility of the error-correcting code could counteract whatever noise got through. Error-correcting codes are now found wherever data needs to be communicated quickly and reliably. The Mars orbiter Mariner 9 sent pictures of the Martian surface back to Earth using one such code, the Hadamard code. Compact discs are encoded with the Reed-Solomon code, which is why you can scratch them and they still sound perfect. (Readers born after, say, 1990 who are unfamiliar with compact discs can just think of flash drives, which use among other things the similar Bose-Chaudhuri-Hocquenghem codes to avoid data corruption.) Your bank’s routing number is encoded using a simple code called a checksum. This one is not quite an error-correcting code, but merely an error-detecting code like the “repeat each bit twice” protocol; if you type one digit wrong, the computer executing the transfer may not be able to puzzle out what number you actually meant, but it can at least figure out something’s wrong and avoid sending your money to the wrong bank.
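  The text doesn't spell out that checksum, but here is one concrete instance: U.S. bank routing numbers use a weighted digit sum, with weights 3, 7, 1 repeating, that must come out divisible by 10. Since each weight is coprime to 10, any single-digit typo shifts the sum off a multiple of 10 and gets caught. A sketch:

```python
def routing_checksum_ok(digits):
    """ABA routing-number check: the weighted digit sum must be divisible by 10."""
    weights = [3, 7, 1, 3, 7, 1, 3, 7, 1]
    return sum(w * int(d) for w, d in zip(weights, digits)) % 10 == 0

print(routing_checksum_ok("011000015"))  # True: the weighted sum is 20
print(routing_checksum_ok("011000016"))  # False: a single-digit typo is caught
```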

  It’s not clear whether Hamming understood the full range of applications of his new technique, but his bosses at Bell certainly had some idea, as Hamming found out when he tried to publish his work:

  The Patent Department would not release the thing until they had patent coverage. . . . I didn’t believe that they could patent a bunch of mathematical formulas. I said they couldn’t. They said, “Watch us.” They were right. And since then I have known that I have a very weak understanding of patent laws because, regularly, things that you shouldn’t be able to patent—it’s outrageous—you can patent.

  Math moves faster than the patent office: The Swiss mathematician and physicist Marcel Golay learned about Hamming’s ideas from Shannon, and developed many new codes of his own, not knowing that Hamming himself had worked out many of the same codes behind the patent curtain. Golay published first, leading to a confusion about credit that persists to the present day. As for the patent, Bell got it, but lost the right to charge for the license as part of an antitrust settlement in 1956.

  What makes the Hamming code work? To understand this, you have to come at it from the other direction, asking: What would make it fail?

  Remember, the bête noire of an error-correcting code is a block of digits that’s simultaneously close to two different code words. A recipient presented with the offending string of bits would be flummoxed, having no principled way to determine which of the near-miss code words appeared in the original transmission.

  It sounds like we’re using a metaphor here: Blocks of binary digits don’t have locations, so what can we mean by saying one is “close” to another? One of Hamming’s great conceptual contributions was to insist that this wasn’t merely a metaphor, or didn’t have to be. He introduced a new notion of distance, now called the Hamming distance, which was adapted to the new mathematics of information just as the distance Euclid and Pythagoras understood was adapted to the geometry of the plane. Hamming’s definition was simple: the distance between two blocks is the number of bits you need to alter in order to change one block into the other. So the distance between the code words 0010111 and 0101011 is 4; in order to get from the former code word to the latter, you have to change the bits in the second, third, fourth, and fifth places.
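  The definition is so simple that the code is nearly shorter than the sentence; a sketch:

```python
def hamming_distance(a, b):
    """The number of bits you must alter to change one block into the other."""
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("0010111", "0101011"))  # 4
```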

  Hamming’s eight code words are a good code because no block of seven bits is within Hamming distance 1 of two different code words. If it were, the two code words would be within Hamming distance 2 of each other.* But you can check for yourself and see that no two of those code words differ in just two places; in fact, any two different code words are at a Hamming distance of at least 4 from each other. You can think of the code words as something like electrons in a box, or antisocial people in an elevator. They’ve got a confined space to live in, and within those constraints, they try to make as much mutual distance from each other as they possibly can.
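  That claim is easy to verify exhaustively, since there are only twenty-eight pairs to try; a sketch:

```python
from itertools import combinations

CODE_WORDS = ["0000000", "0010111", "0101011", "0111100",
              "1001101", "1011010", "1100110", "1110001"]

def hamming_distance(a, b):
    return sum(x != y for x, y in zip(a, b))

pairwise = [hamming_distance(a, b) for a, b in combinations(CODE_WORDS, 2)]
print(min(pairwise))  # 4: every pair of distinct code words is at least 4 apart
```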

  This same principle underlies all manner of communications that are robust to noise. Natural language works this way: if I write lanvuage instead of language, you can figure out what it was I meant to say, because there’s no other word in English that’s one letter substitution away from lanvuage. This breaks down, of course, when you start looking at shorter words: a dog, a cog, a bog, and a log are all perfectly good things to which one might refer in English, and a burst of noise wiping out the first phoneme makes it impossible to tell which one was meant. Even in this case, though, you can use the semantic distance between those words to help you correct errors. If it bit you, it was probably a dog; if you fell off it, it was probably a log. And so on.
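  The same distance-1 lookup works on words, at least for pure letter substitutions; a toy sketch, with a tiny word list of my own choosing:

```python
WORDS = ["language", "dog", "cog", "bog", "log"]

def substitutions(a, b):
    """Letter substitutions turning one word into another (equal lengths only)."""
    return sum(x != y for x, y in zip(a, b)) if len(a) == len(b) else len(a) + len(b)

def candidates(garbled):
    return [w for w in WORDS if substitutions(garbled, w) <= 1]

print(candidates("lanvuage"))  # ['language']: a unique correction
print(candidates("_og"))       # ['dog', 'cog', 'bog', 'log']: hopelessly ambiguous
```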

  You can make language more efficient—but when you do, you hit the same hard tradeoff Shannon discovered. Many people of a nerdy and/or mathematical persuasion* have labored to create languages that would convey information compactly and precisely, without any of the redundancy, synonymy, and ambiguity that languages like English indulge themselves in. Ro was an artificial language created in 1906 by the Reverend Edward Powell Foster, who aimed to replace the thicket of English vocabulary with a lexicon in which the meaning of each word could be derived logically from its sound. It’s perhaps no surprise that among the Ro enthusiasts was Melvil Dewey, whose Dewey Decimal System imposed on the stacks of the public library a similarly rigid organization. Ro is indeed admirably compact; lots of long English words, like ingredient, come out much shorter in Ro, where you just say cegab. But the compactness comes at a cost; you lose the error correction English offers as a built-in feature. Small elevator, very crowded, the passengers don’t have much personal space; which is to say, each word in Ro is very close to lots of others, creating opportunities for confusion. The word for “color” in Ro is bofab. But if you change one letter, to make it bogab, you have the word for “sound.” Bokab means “electricity” and bolab means “flavor.” Worse still, the logical structure of Ro leads similar-sounding words to have similar meanings too, making it impossible to figure out what’s going on from context. Bofoc, bofof, bofog, and bofol mean “red,” “yellow,” “green,” and “blue” respectively. It makes a certain kind of sense to have conceptual similarity represented in sound; but it also makes it very difficult to talk about color in Ro at a crowded party. “I’m sorry, was that ‘bofoc’ or ‘bofog’?”*

  Some modern constructed languages go the other way, making explicit use of the principles Hamming and Shannon laid out; Lojban, one of the most successful contemporary examples,* has a strict rule that no two of the basic roots, or gismu, are allowed to be too phonetically close.

  Hamming’s notion of “distance” follows Fano’s philosophy—a quantity that quacks like distance has the right to behave like distance. But why stop there? The set of points at distance less than or equal to 1 from a given central point has a name in Euclidean geometry; it is called a circle, or, if we are in higher dimensions, a sphere.* So we’re compelled to call the set of strings at Hamming distance at most 1* from a code word a “Hamming sphere,” with the code word at the center. For a code to be an error-correcting code, no string—no point, if we’re to take this geometric analogy seriously—can be within distance 1 of two different code words; in other words, we ask that no two of the Hamming spheres centered at the code words share any points.
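  In this picture, “no two spheres share a point” becomes a finite check. Here's a sketch that counts the strings covered by the eight spheres of radius 1:

```python
CODE_WORDS = ["0000000", "0010111", "0101011", "0111100",
              "1001101", "1011010", "1100110", "1110001"]

def sphere(word):
    """A code word plus its seven one-bit-flip neighbors: 8 strings in all."""
    flips = {word[:i] + ("1" if word[i] == "0" else "0") + word[i + 1:]
             for i in range(len(word))}
    return {word} | flips

covered = set().union(*(sphere(w) for w in CODE_WORDS))
# 8 spheres x 8 strings = 64 distinct strings, so no string lies in two spheres.
print(len(covered))  # 64 (out of the 128 possible seven-bit strings)
```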

  So the problem of constructing error-correcting codes has the same structure as a classical geometric problem, that of sphere packing: how do we fit a bunch of equal-sized spheres as tightly as possible into a small space, in such a way that no two spheres overlap? More succinctly, how many oranges can you stuff into a box?

  The sphere-packing problem is a lot older than error-correcting codes; it goes back to the astronomer Johannes Kepler, who wrote a short booklet in 1611 called Strena Seu De Nive Sexangula, or “The Six-Cornered Snowflake.” Despite the rather specific title, Kepler’s book contemplates the general question of the origin of natural form. Why do snowflakes and the chambers of honeycombs form hexagons, while the seed chambers of an apple are more apt to come in fives? Most relevantly for us right now: why do the seeds of pomegranates tend to have twelve flat sides?

  Here’s Kepler’s explanation. The pomegranate wants to fit as many seeds as possible inside its skin; in other words, it is carrying out a sphere-packing problem. If we believe nature does as good a job as can be done, then these spheres ought to be arranged in the densest possible fashion. Kepler argued that the tightest possible packing was obtained as follows. Start with a flat layer of seeds, arranged in a regular pattern like so:

 
