To study this phenomenon, Galton made another scatterplot, this one of height versus “cubit,” the distance from the elbow to the tip of the middle finger. To his astonishment, he saw the same elliptical pattern that had emerged from the heights of fathers and sons. Once again, he had graphically demonstrated that the two variables, height and cubit, were correlated, even though one didn’t strictly determine the other. If two measurements are highly correlated (like the length of the left foot and the length of the right), there’s little point in taking the time to record both numbers. The best measurements to take are the ones that are uncorrelated with each of the others. And the relevant correlations could be computed from the vast array of anthropometric data Galton had already gathered.
As it happens, Galton’s invention of correlation didn’t lead to the institution of a vastly improved Bertillon system. That was largely thanks to Galton himself, who championed a competing system, dactyloscopy—what we now call fingerprinting. Like Bertillon’s system, fingerprinting reduced a suspect to a list of numbers or symbols that could be marked on a card, sorted, and filed. But fingerprinting enjoyed certain obvious advantages, most notably that a criminal’s fingerprints were often available for measurement in circumstances where the criminal himself was not. This point was made vividly by the case of Vincenzo Peruggia, who stole the Mona Lisa from the Louvre in a daring daylight theft in 1911. Peruggia had been arrested in Paris before, but his dutifully recorded Bertillon card, filed in its cabinet according to the lengths and widths of his various physical features, was not of much use. Had the cards contained dactyloscopic information, the fingerprint Peruggia left on the Mona Lisa’s discarded frame would have identified him at once.*
ASIDE: CORRELATION, INFORMATION, COMPRESSION, BEETHOVEN
I lied a little about the Bertillon system. In fact, Bertillon didn’t record the exact numerical value of each physical characteristic, but only whether it was small, medium, or large. When you measure the length of the finger, you divide the criminals into three groups: small-fingered, medium-fingered, and large-fingered. Then, when you measure the cubit, you divide each of these three groups into three subgroups, so that the criminals are divided ninefold in all. Making all five measurements in the basic Bertillon system divides the criminals into
3 × 3 × 3 × 3 × 3 = 3⁵ = 243
groups; and for each of these 243, there are seven options for eye and hair color. So, in the end, Bertillon classified suspects into 3⁵ × 7 = 1701 tiny categories. Once you’ve arrested more than 1701 people, some categories will inevitably contain more than one suspect; but the number of people in any one category is likely to be rather small, small enough that a gendarme can easily flip through the cards to find a photograph matching the man in chains before him. And if you cared to add more measurements, tripling the number of categories each time you did so, you could easily make categories so small that no two criminals—for that matter, no two Frenchmen of any kind—would share the same Bertillon code.
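If you like, the whole scheme fits in a few lines of code. Here is a minimal sketch in Python of a Bertillon-style card; the choice of measurements, their order, and the small/medium/large cutoffs are all invented for illustration, but the count of possible cards, 3⁵ × 7 = 1701, is the one computed above.

```python
def bertillon_card(measurements_cm, eye_hair_class):
    """Bucket five body measurements as small (S), medium (M), or large (L)
    and append an eye/hair class from 0-6, giving one of
    3**5 * 7 = 1701 possible cards."""
    # Hypothetical (low, high) cutoffs for each measurement, in centimeters
    cutoffs = [(16, 19), (42, 47), (170, 180), (25, 27), (18, 20)]

    def bucket(value, lo, hi):
        return "S" if value < lo else "L" if value > hi else "M"

    code = "".join(bucket(v, lo, hi)
                   for v, (lo, hi) in zip(measurements_cm, cutoffs))
    return code + str(eye_hair_class)

print(bertillon_card([17.5, 44.0, 183.0, 24.1, 19.2], 3))  # prints MMLSM3
```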
It’s a neat trick, keeping track of something complicated like the shape of a human being with a short string of symbols. And the trick isn’t limited to human physiognomy. A similar system, called the Parsons code,* is used to classify musical melodies. Here’s how it goes. Take a melody—one we all know, like Beethoven’s “Ode to Joy,” the glorious finale of the Ninth Symphony. You mark the first note with a *. For each note thereafter, you mark down one of three symbols: u if the note at hand goes up from the previous note, d if it goes down, or r if it repeats the note that came before. The first two notes of “Ode to Joy” are the same, so you start out with *r. Then comes a higher note followed by a still higher one: *ruu. Next you repeat the top note, then follow with a string of four descents: so the code for the whole opening segment is *ruurdddd.
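If you want to see the rule in action, here is a minimal sketch of the Parsons encoding in Python. Representing the notes as MIDI pitch numbers is a convenience chosen for the example, not part of Parsons’s scheme.

```python
def parsons_code(pitches):
    """Encode a melody as a Parsons code: '*' for the first note, then
    'u', 'd', or 'r' as each note goes up, down, or repeats."""
    code = "*"
    for prev, cur in zip(pitches, pitches[1:]):
        code += "u" if cur > prev else "d" if cur < prev else "r"
    return code

# The first nine notes of the "Ode to Joy" theme (E E F G G F E D C),
# written as MIDI pitch numbers:
ode_to_joy = [64, 64, 65, 67, 67, 65, 64, 62, 60]
print(parsons_code(ode_to_joy))  # prints *ruurdddd
```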
You can’t reproduce the sound of Beethoven’s masterpiece from the Parsons code, any more than you can sketch a picture of a bank robber from his Bertillon measurements. But if you have a cabinet full of music categorized by Parsons code, the string of symbols does a pretty good job of identifying any given tune. If, for instance, you have the “Ode to Joy” in your head but can’t remember what it’s called, you can go to a website like Musipedia and type in *ruurdddd. That short string is enough to cut the possibilities down to “Ode to Joy” or Mozart’s Piano Concerto No. 12. If you whistle to yourself a mere seventeen notes, there are
3¹⁶ = 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 × 3 = 43,046,721
different Parsons codes; that’s surely greater than the number of melodies ever recorded, and makes it pretty rare for two songs to have the same code. Each time you add a new symbol, you’re multiplying the number of codes by three; and thanks to the miracle of exponential growth, a very short code gives you an astonishingly high capacity for discriminating between two songs.
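The bookkeeping is easy to verify directly; nothing below is assumed beyond the counting rule in the text.

```python
# An n-note melody has 3**(n - 1) possible Parsons codes:
# one symbol per note after the initial '*'.
for n in (5, 9, 17):
    print(n, "notes:", 3 ** (n - 1), "codes")
# 17 notes: 43046721 codes
```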
But there’s a problem. Back to Bertillon: What if we found that the men who came into the police station always had cubits in the same size category as their fingers? Then what look like nine choices for the first two measurements are really only three: small finger/small cubit, medium finger/medium cubit, and long finger/long cubit; two-thirds of the drawers in our Bertillon cabinet sit empty. The total number of categories is not really 1701, but a mere 567, with a corresponding diminution of our ability to distinguish one criminal from another. Another way to think of this: we thought that we were taking five measurements, but given that the cubit conveys exactly the same information as the finger, we were effectively taking only four. That’s why the number of possible cards is cut down from 7 × 3⁵ = 1701 to 7 × 3⁴ = 567. (The 7 is counting the possibilities for eye and hair color.) More relationships between the measurements would make the effective number of categories still smaller and the Bertillon system still less powerful.
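A quick simulation shows the collapse directly. In the sketch below (the sampling scheme is invented for illustration), we draw a large number of random suspects two ways, once with all five measurements independent and once with the cubit forced into the same size category as the finger, and count how many distinct cards ever show up.

```python
import random

random.seed(0)
SIZES = "SML"

def random_card(cubit_tracks_finger):
    finger = random.choice(SIZES)
    cubit = finger if cubit_tracks_finger else random.choice(SIZES)
    rest = tuple(random.choice(SIZES) for _ in range(3))  # three more measurements
    eye_hair = random.randrange(7)
    return (finger, cubit) + rest + (eye_hair,)

for tracks in (False, True):
    distinct = {random_card(tracks) for _ in range(200_000)}
    print("cubit tracks finger:", tracks, "->", len(distinct), "distinct cards")
# With this many draws every possible card appears: 1701 when the
# measurements are independent, only 567 when the cubit copies the finger.
```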
Galton’s great insight was that the same thing applies even if finger length and cubit length aren’t identical, but only correlated. Correlations between the measurements make the Bertillon code less informative. Here, too, Galton was ahead of his time: what he’d captured was, in embryonic form, a way of thinking that would become fully formalized only a half-century later, by Claude Shannon in his theory of information. As we saw in chapter 13, Shannon’s formal measure of information was able to provide bounds on how quickly bits could flow through a noisy channel; in much the same way, Shannon’s theory provides a way of capturing the extent to which correlation between variables reduces the informativeness of a card. In modern terms, we would say that the more strongly correlated the measurements, the less information, in Shannon’s precise sense, a Bertillon card conveys.
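To put numbers on that last sentence, here is a minimal sketch that computes the Shannon entropy, in bits, of the finger/cubit pair alone: under independence, under perfect agreement, and under a hypothetical 70 percent rate of agreement made up for illustration.

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Finger and cubit each come in three sizes, so the pair has nine outcomes.
independent = [1 / 9] * 9                        # no correlation
identical = [1 / 3 if i == j else 0              # cubit always matches finger
             for i in range(3) for j in range(3)]
partial = [0.7 / 3 if i == j else 0.3 / 6        # they agree 70% of the time
           for i in range(3) for j in range(3)]

for name, dist in [("independent", independent),
                   ("identical", identical),
                   ("partially correlated", partial)]:
    print(name, round(entropy(dist), 2), "bits")
# independent ~3.17 bits, identical ~1.58, partially correlated in between
```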
Nowadays, though Bertillonage is gone, the idea that the best way to keep track of identity is by a sequence of numbers has achieved total dominance; we live in a world of digital information. And the insight that correlation reduces the effective amount of information has emerged as a central organizing principle. A photograph, which used to be a pattern of pigment on a sheet of chemically coated paper, is now a string of numbers, each one representing the brightness and color of a pixel. An image captured on a 4-megapixel camera is a list of 4 million numbers—no small commitment of memory for the device shooting the picture. But these numbers are highly correlated with each other. If one pixel is bright green, the next one over is likely to be as well. The actual information contained in the image is much less than 4 million numbers’ worth—and it’s precisely this fact that makes it possible* to have compression, the critical mathematical technology that allows images, videos, music, and text to be stored in much smaller spaces than you’d think. The presence of correlation makes compression possible; actually doing it involves much more modern ideas, like the theory of wavelets developed in the 1970s and ’80s by Jean Morlet, Stéphane Mallat, Yves Meyer, Ingrid Daubechies, and others, and the rapidly developing area of compressed sensing, which started with a 2005 paper by Emmanuel Candès, Justin Romberg, and Terry Tao, and has quickly become its own active subfield of applied math.
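Real compression is far beyond a few lines, but the core point, that correlated numbers carry fewer bits than their raw size suggests, fits in a toy experiment. In this made-up illustration (no wavelets or compressed sensing involved), we build a smooth signal whose neighboring samples are highly correlated, then delta-encode it so each byte records only the change from the previous one; a general-purpose compressor does far better on the deltas.

```python
import random
import zlib

random.seed(1)
n = 10_000

# A smooth signal: each sample moves at most one step from its neighbor,
# so adjacent values are highly correlated, like neighboring pixels.
signal = [128]
for _ in range(n - 1):
    signal.append(max(0, min(255, signal[-1] + random.choice((-1, 0, 1)))))

raw = bytes(signal)
# Delta-encode: each byte records only the change from the previous sample.
deltas = bytes((signal[i] - signal[i - 1]) % 256 for i in range(1, n))

print("raw:   ", len(zlib.compress(raw)), "bytes")
print("deltas:", len(zlib.compress(deltas)), "bytes")  # far smaller
```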
THE TRIUMPH OF MEDIOCRITY IN WEATHER
There’s one thread we still need to tie off. We’ve seen how regression to the mean explains the “triumph of mediocrity” that Secrist discovered. But what about the triumph of mediocrity that Secrist didn’t observe? When he tracked the temperatures of U.S. cities, he found that the hottest ones in 1922 were still hottest in 1931. This observation is crucial to his argument that the regression of business enterprises was something specific to human endeavor. If regression to the mean is a universal phenomenon, why don’t temperatures do it too?
The answer is simple: they do.
The table below shows the average January temperatures in degrees Fahrenheit at thirteen weather stations in southern Wisconsin, no two of which are farther than an hour’s drive apart:
Station              Jan 2011   Jan 2012
Clinton                15.9       23.5
Cottage Grove          15.2       24.8
Fort Atkinson          16.5       24.2
Jefferson              16.5       23.4
Lake Mills             16.7       24.4
Lodi                   15.3       23.3
Madison airport        16.8       25.5
Madison arboretum      16.6       24.7
Madison, Charmany      17.0       23.8
Mazomanie              16.6       25.3
Portage                15.7       23.8
Richland Center        16.0       22.5
Stoughton              16.9       23.9
When you make a Galton-style scatterplot of these temperatures you see that, in general, the cities that were warmer in 2011 tended to be warmer in 2012.
But the three warmest stations in 2011 (Madison airport, Stoughton, and Charmany) ended up the warmest, seventh warmest, and eighth warmest in 2012. Meanwhile, the three coldest 2011 stations (Cottage Grove, Lodi, and Portage) got relatively warmer: Portage was tied for fifth coldest, Lodi was second coldest, and Cottage Grove was actually warmer in 2012 than most of the other cities. In other words, both the hottest and the coldest groups moved toward the middle of the rankings, just as with Secrist’s hardware stores.
Why didn’t Secrist see this effect? Because he chose his weather stations in a different way. His cities weren’t restricted to a small chunk of the upper Midwest, but were spread out much more widely. Suppose we look at the January temperatures as you range around California instead of Wisconsin:
City               Jan 2011   Jan 2012
Eureka               48.5       46.6
Fresno               46.6       49.3
Los Angeles          59.2       59.4
Riverside            57.8       58.9
San Diego            60.1       58.2
San Francisco        51.7       51.6
San Jose             51.2       51.4
San Luis Obispo      54.5       54.4
Stockton             45.2       46.7
Truckee              27.1       30.2
No regression to be seen. The cold places, like Truckee up in the Sierra Nevada, stay cold, and the hot places, like San Diego and LA, stay hot. Plotting these temperatures gives you a very different-looking picture.
The Galtonian ellipse around these ten points would be very narrow indeed. The differences you see in the temperatures in the table reflect the fact that some places in California are just plain colder than others, and the underlying differences between the cities swamp the chance fluctuation from year to year. In Shannon’s language, we’d say there’s lots of signal and not so much noise. For the cities in south-central Wisconsin, it’s just the opposite. Climatically speaking, Mazomanie and Fort Atkinson are not very different. In any given year, the ranking of these cities by temperature is going to have a lot to do with chance. There’s lots of noise, not so much signal.
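You can put a number on “lots of noise, not so much signal.” The sketch below computes the correlation coefficient between the 2011 and 2012 columns of the two tables above; nothing here is special to this book, it is just the standard Pearson correlation formula.

```python
from statistics import mean, pstdev

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)
    return cov / (pstdev(xs) * pstdev(ys))

wisconsin_2011 = [15.9, 15.2, 16.5, 16.5, 16.7, 15.3, 16.8,
                  16.6, 17.0, 16.6, 15.7, 16.0, 16.9]
wisconsin_2012 = [23.5, 24.8, 24.2, 23.4, 24.4, 23.3, 25.5,
                  24.7, 23.8, 25.3, 23.8, 22.5, 23.9]
california_2011 = [48.5, 46.6, 59.2, 57.8, 60.1, 51.7, 51.2, 54.5, 45.2, 27.1]
california_2012 = [46.6, 49.3, 59.4, 58.9, 58.2, 51.6, 51.4, 54.4, 46.7, 30.2]

print(correlation(wisconsin_2011, wisconsin_2012))    # roughly 0.3: mostly noise
print(correlation(california_2011, california_2012))  # about 0.99: mostly signal
```

The fat Galtonian ellipse in Wisconsin and the skinny one in California are exactly these two numbers in geometric form.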
Secrist thought the regression he painstakingly documented was a new law of business physics, something that would bring more certainty and rigor to the scientific study of commerce. But it was just the opposite. If businesses were like cities in California—some really hot, some really not, reflecting inherent differences in business practice—you’d see correspondingly less regression to the mean. What Secrist’s findings really show is that businesses are much more like the cities in Wisconsin. Superior management and business insight play a role, but so does plain luck, in roughly equal measure.
EUGENICS, ORIGINAL SIN, AND THIS BOOK’S MISLEADING TITLE
In a book called How Not to Be Wrong it’s a bit strange to write about Galton without saying much about his greatest fame among non-mathematicians: the theory of eugenics, of which he’s usually called the father. If, as I claim, an attention to the mathematical side of life is helpful in avoiding mistakes, how could a scientist like Galton, so clear-eyed with regard to mathematical questions, be so wrong about the merits of breeding human beings for desirable properties? Galton saw his own opinions on this subject as modest and sensible, but they shock the contemporary ear:
As in most other cases of novel views, the wrong-headedness of objectors to Eugenics has been curious. The most common misrepresentations now are that its methods must be altogether those of compulsory unions, as in breeding animals. It is not so. I think that stern compulsion ought to be exerted to prevent the free propagation of the stock of those who are seriously afflicted by lunacy, feeble-mindedness, habitual criminality, and pauperism, but that is quite different from compulsory marriage. How to restrain ill-omened marriages is a question by itself, whether it should be effected by seclusion, or in other ways yet to be devised that are consistent with a humane and well-informed public opinion. I cannot doubt that our democracy will ultimately refuse consent to that liberty of propagating children which is now allowed to the undesirable classes, but the populace has yet to be taught the true state of these things. A democracy cannot endure unless it be composed of able citizens; therefore it must in self-defence withstand the free introduction of degenerate stock.
What can I say? Mathematics is a way not to be wrong, but it isn’t a way not to be wrong about everything. (Sorry, no refunds!) Wrongness is like original sin; we are born to it and it remains always with us, and constant vigilance is necessary if we mean to restrict its sphere of influence over our actions. There is real danger that, by strengthening our abilities to analyze some questions mathematically, we acquire a general confidence in our beliefs, which extends unjustifiably to those things we’re still wrong about. We become like those pious people who, over time, accumulate a sense of their own virtuousness so powerful as to make them believe the bad things they do are virtuous too.
I’ll do my best to resist that temptation. But watch me carefully.
THE ADVENTURES OF KARL PEARSON ACROSS THE TENTH DIMENSION
It is difficult to overstate the impact of Galton’s creation of correlation on the conceptual world we now inhabit—not only in statistics, but in every precinct of the scientific enterprise. If you know one thing about the word correlation, it’s that “correlation does not imply causation”—two phenomena can be correlated, in Galton’s sense, even if one doesn’t cause the other. This, by itself, was not news. People certainly understood that siblings are more likely than other pairs of people to share physical characteristics, and that this isn’t because tall brothers cause their younger sisters to be tall. But there’s still a causal relationship lurking in the background: the tall parents whose genetic contribution aids in causing both children to be tall. In the post-Galton world, you could talk about an association between two variables while remaining completely agnostic about the existence of any particular causal relationship, direct or indirect. In its way, the conceptual revolution Galton engendered has something in common with the insight of his more famous cousin, Charles Darwin. Darwin showed that one could meaningfully talk about progress without any need to invoke purpose. Galton showed that one could meaningfully talk about association without any need to invoke underlying cause.