How Not to Be Wrong : The Power of Mathematical Thinking (9780698163843)
Except it didn’t work. The juvenile offenders are like Secrist’s low-performing stores: selected, not at random, but by virtue of being the worst of their kind. Regression tells you that the very worst-behaved kids this year will likely still be behavior problems next year; but not as much so. The decline in arrest rate is just what you’d expect even if Scared Straight had no effect.
Which isn’t to say Scared Straight was completely ineffective. When the program was put through randomized trials, where a randomly selected subgroup of juvenile offenders were put through Scared Straight and then compared to the remaining kids, who didn’t participate, researchers found that the program actually increased antisocial behavior. Maybe it should have been called Scared Stupid.
FIFTEEN
GALTON’S ELLIPSE
Galton had shown that regression to the mean was in effect whenever the phenomenon being studied was influenced by the play of chance forces. But how strong were those forces, by comparison with the effect of heredity?
In order to hear what the data was telling him, Galton had to put it in a form more graphically revealing than a column of numbers. He later recalled, “I began with a sheet of paper, ruled crossways, with a scale across the top to refer to the statures of the sons, and another down the side for the statures of their fathers, and there also I had put a pencil mark at the spot appropriate to the stature of each son and to that of his father.”
This method of visualizing the data is the spiritual descendant of René Descartes’s analytic geometry, which asks us to think about points in the plane as pairs of numbers, an x-coordinate and a y-coordinate, joining algebra and geometry in a tight clasp they’ve been locked in ever since.
Each father-son pair has an associated pair of numbers: namely, the height of the father followed by the height of the son. My father is six foot one and so am I—seventy-three inches each—so if we’d been in Galton’s data set we would have been recorded as (73,73). And Galton would have recorded our existence by making a mark on his sheet of paper with x-coordinate 73 and y-coordinate 73. Each parent and child in Galton’s voluminous records required another mark on the paper, until in the end his sheet bore a vast spray of dots, representing the whole range of variation in stature. Galton had invented the type of graph we now call a scatterplot.*
Scatterplots are spectacularly good at revealing the relationship between two variables; look in just about any contemporary scientific journal and you’ll see a raft of them. The late nineteenth century was a kind of golden age of data visualization. In 1869 Charles Minard made his famous chart showing the dwindling of Napoleon’s army on its path into Russia and its subsequent retreat, often called the greatest data graphic ever made; this, in turn, was a descendant of Florence Nightingale’s coxcomb graph* showing in stark visual terms that most of the British soldiers lost in the Crimean War had been killed by infections, not Russians.
The coxcomb and scatterplot play to our cognitive strengths: our brains are sort of bad at looking at columns of numbers, but absolutely ace at locating patterns and information in a two-dimensional field of vision.
In some cases, that’s easy. For instance, suppose that every son and father had equal height, the way my dad and I do. This represents a situation where chance plays no role at all, and your stature is completely determined by your patrimony. Then all the points in our scatterplot would have x and y coordinates equal; in other words, they’d be stuck to the diagonal line whose equation is x = y:
Note that the density of the dots is greater near the middle and less near the extremes; more men are five feet nine inches tall (sixty-nine inches) than are six foot one or five foot four.
Now what about the opposite extreme, where the heights of fathers and sons are totally independent? In that case, the scatterplot would look something like this:
This picture, unlike the first one, shows no bias toward the diagonal. If you restrict your attention to sons whose fathers were six foot one (seventy-three inches), corresponding to a vertical slice in the right half of the scatterplot, the dots measuring the height of the sons are still centered on five foot nine. We say that the conditional expectation of the son’s height (that is, how tall the son will be on average given that his father stands six foot one) is the same as the unconditional expectation (the average height of sons computed without any restriction on the father). This is what Galton’s sheet of paper would have looked like if there were no heritable differences at all affecting height. It’s regression to the mean in its most intense form; the sons of tall fathers regress all the way to the mean, ending up no taller than the sons of shorties.
But Galton’s scatterplot didn’t look like either of those two extreme cases. Instead, it was intermediate between them:
What does the average son of a six-foot-one-inch father look like in this plot? I’ve drawn a vertical slice to show you which points on the scatterplot correspond to those father-son pairs.
You can see that the dots near the “six-foot-one-inch father” slice are more heavily concentrated below the diagonal than above, so that the sons are on average shorter than the father. On the other hand, they are plainly biased to lie mostly above sixty-nine inches, the height of the average man. In the data set I plotted, the average height of those sons turns out to be just under six feet: taller than average, but not as tall as Dad. You are looking at a picture of regression to the mean.
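The three scatterplots described above can be imitated with a small simulation. This is only a sketch under stated assumptions: heights are drawn from a normal distribution with the 69-inch average quoted in the text, the 2.5-inch spread is an assumed figure rather than one from Galton's data, and the names `simulate_pairs` and `mean_son_height` are made up for illustration. The parameter `r` sets how much of the son's height is inherited and how much is chance.

```python
import random
import statistics

MEAN_HEIGHT = 69.0  # inches; the average height quoted in the text
SD_HEIGHT = 2.5     # inches; an assumed spread, not a figure from Galton's data


def simulate_pairs(r, n=50_000, seed=0):
    """Return n simulated (father, son) height pairs with correlation about r."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n):
        z_father = rng.gauss(0.0, 1.0)
        chance = rng.gauss(0.0, 1.0)
        # Son's standardized height: a share r inherited, the rest chance.
        z_son = r * z_father + (1.0 - r * r) ** 0.5 * chance
        pairs.append((MEAN_HEIGHT + SD_HEIGHT * z_father,
                      MEAN_HEIGHT + SD_HEIGHT * z_son))
    return pairs


def mean_son_height(pairs, father_height=73.0, half_width=0.5):
    """Average son height inside a vertical slice around father_height."""
    sons = [son for father, son in pairs
            if abs(father - father_height) <= half_width]
    return statistics.fmean(sons)


# r = 1: pure heredity; r = 0: pure chance; r = 0.5: something in between.
for r in (1.0, 0.5, 0.0):
    average = mean_son_height(simulate_pairs(r))
    print(f"r = {r}: sons of 73-inch fathers average {average:.1f} inches")
```

With `r` strictly between 0 and 1, the slice average should land between 69 and 73 inches: taller than the average man, but not as tall as Dad.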
Galton noticed very quickly that his scatterplots, generated by the interplay of heredity and chance, had a geometric structure that was anything but random. They seemed to be enclosed, more or less, by an ellipse, centered on the point where both parents and child were of exactly average height.
The tilted elliptical shape of the data is quite clear even in the raw data in the table reproduced on page 316, from Galton’s 1886 paper “Regression Towards Mediocrity in Hereditary Stature”; look at the figure formed by the nonzero entries in the table. The table also makes clear that I haven’t told the whole story of Galton’s data set; for instance, his y-coordinate is not “height of the father,” but “average of the father’s height with 1.08 times the mother’s height,”* what Galton calls the “mid-parent.”
In fact, Galton did more: he carefully drew curves on his scatterplot along which the density of points was roughly constant. Curves of this kind are called isopleths, and they’re very familiar to you, if not under that tongue-twisting name. If you start from a map of the United States and draw a curve through all the cities where today’s high temperature is exactly 75 degrees, 50 degrees, or any other fixed value, you get the familiar curves of the weather map; these are called isotherms. A really hardcore weather map might also include isobars, joining areas of equal barometric pressure, or isonephs, areas of equal cloud cover. If we measure elevation instead of temperature, the isopleths are the contour lines, sometimes called isohypses, you find on topographic maps. This isopleth map shows the average number of snowstorms per year across the continental U.S.:
The isopleth wasn’t Galton’s invention; the first published isoplethic map was produced in 1701 by Edmond Halley, the British Astronomer Royal we last saw explaining to the king how to price annuities correctly.* Navigators already knew that magnetic north and true north didn’t always agree; understanding exactly how and where the disagreement appeared was obviously critical for successful ocean travel. The curves on Halley’s map were isogons, showing sailors where the discrepancy between magnetic north and true north was constant. The data was based on measurements Halley made aboard the Paramore, which crossed the Atlantic several times with Halley himself at the helm. (This guy really knew how to keep busy between comets.)
Galton found an amazing regularity: his isopleths were all ellipses, one contained within the next, each one with the same center. It was like the contour map of a perfectly elliptical mountain, with its peak at the pair of heights most frequently observed in Galton’s sample: average height for both parents and children. The mountain is none other than the three-dimensional version of the gendarme’s hat that de Moivre had studied; in modern language we call it the bivariate normal distribution.
When the son’s height is completely unrelated to those of the parents, as in the second scatterplot above, Galton’s ellipses are all circles, and the scatterplot looks roughly round. When the son’s height is completely determined by heredity, with no chance element involved, as in the first scatterplot, the data lies along a straight line, which one might think of as an ellipse that has gotten as elliptical as it possibly can. In between, we have ellipses of various levels of skinniness. That skinniness, which the classical geometers called the eccentricity of the ellipse, is a measure of the extent to which the height of the father determines that of the son. High eccentricity means that heredity is powerful and regression to the mean is weak; low eccentricity means the opposite, that regression to the mean holds sway. Galton called his measure correlation, the term we still use today. If Galton’s ellipse is almost round, the correlation is near 0; when the ellipse is skinny, lined up along the northeast-southwest axis, the correlation comes close to 1. By means of the eccentricity—a geometric quantity at least as old as the work of Apollonius of Perga in the third century BCE—Galton had found a way to measure the association between two variables, and in so doing had solved a problem at the cutting edge of nineteenth-century biology: the quantification of heredity.
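The link between skinniness and correlation can be made explicit in one special case. As a rough sketch, assume both variables have the same spread and follow a bivariate normal distribution; then the contour ellipses have eccentricity sqrt(2|r| / (1 + |r|)), where r is the correlation coefficient as it is usually computed today (Pearson's formula, which is not literally the eccentricity). The function names below are illustrative only.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def contour_eccentricity(r):
    """Eccentricity of the density contours of a bivariate normal.

    Assumes equal variances on both axes and correlation r; under that
    assumption the squared semi-axes are proportional to 1 + |r| and 1 - |r|.
    """
    r = abs(r)
    return math.sqrt(2.0 * r / (1.0 + r))


# r = 0 gives a circle (eccentricity 0); r = 1 gives a line (eccentricity 1).
for r in (0.0, 0.25, 0.5, 0.9, 1.0):
    print(f"r = {r}: eccentricity = {contour_eccentricity(r):.3f}")
```

Under these assumptions the eccentricity rises steadily from 0 to 1 as the correlation does, which is the sense in which a skinnier ellipse signals a stronger association.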
A proper skeptical attitude now requires you to ask: What if your scatterplot doesn’t look like an ellipse? What then? There’s a pragmatic answer: in practice, the scatterplots of real-life data sets often do array themselves in rough ellipses. Not always, but often enough to make the technique widely applicable. Here’s what it looks like when you plot the share of voters who voted for John Kerry in 2004 against the share Barack Obama got in 2008. Each dot represents a single House district:
The ellipse is plain to see; and it’s very skinny; vote share for Kerry is highly correlated with vote share for Obama. The plot floats noticeably above the diagonal, reflecting the fact that Obama generally did better than Kerry.
Here’s a plot of several years of daily stock price changes for Google and GE:
Here’s a picture we’ve already seen, average SAT score plotted against tuition for a group of North Carolina colleges:
And here are the 50 U.S. states arranged in a scatterplot by average income and George W. Bush’s share of the 2004 presidential vote, with wealthy liberal states like Connecticut down in the lower right and Republican states of more modest means, like Idaho, in the upper left.
These data sets come from very different sources, but all four scatterplots arrange themselves in the same vaguely elliptical shape that the heights of parents and children did. In the first three cases, the correlation is positive; an increase in one variable is associated with an increase in the other, and the ellipse points northeast to southwest. In the last picture, the correlation is negative: In general, the richer states tend to skew more Democratic, and the ellipse points northwest to southeast.
THE UNREASONABLE EFFECTIVENESS OF CLASSICAL GEOMETRY
For Apollonius and the Greek geometers, ellipses were conic sections: curves obtained by slicing a cone along a plane. Kepler showed (although it took the astronomical community some decades to catch on) that the planets traveled in elliptical orbits, not circular ones as had been previously thought. Now, the very same curve arises as the natural shape enclosing heights of parents and children. Why? It’s not because there’s some hidden cone governing heredity which, when lopped off at just the right angle, gives Galton’s ellipses. Nor is it that some form of genetic gravity enforces the elliptical form of Galton’s charts via Newtonian laws of mechanics.
The answer lies in a fundamental property of mathematics—in a sense, the very property that has made mathematics so magnificently useful to scientists. In math there are many, many complicated objects, but only a few simple ones. So if you have a problem whose solution admits a simple mathematical description, there are only a few possibilities for the solution. The simplest mathematical entities are thus ubiquitous, forced into multiple duty as solutions to all kinds of scientific problems.
The simplest curves are lines. And it’s clear that lines are everywhere in nature, from the edges of crystals to the paths of moving bodies in the absence of force. The next simplest curves are those cut out by quadratic equations,* in which no more than two variables are ever multiplied together. So squaring a variable, or multiplying two different variables, is allowed, but cubing a variable, or multiplying one variable by the square of another, is strictly forbidden. Curves in this class, including ellipses, are still called conic sections out of deference to history; but more forward-looking algebraic geometers call them quadrics.* Now there are lots of quadratic equations: any such is of the form
A x^2 + B xy + C y^2 + D x + E y + F = 0
for some values of the six constants A, B, C, D, E, and F. (The reader who feels so inclined can check that no other type of algebraic expression is allowed, subject to our requirement that we are only allowed to multiply two variables together, never three.) That seems like a lot of choices—infinitely many, in fact! But these quadrics turn out to fall into three main classes: ellipses, parabolas, and hyperbolas.* Here’s what they look like:
We encounter these three curves again and again as the solution to scientific problems; not only the orbits of planets, but the optimal designs of curved mirrors, the arcs of projectiles, and the shapes of rainbows.
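The three classes can be told apart from the coefficients alone, using the standard discriminant test on B^2 - 4AC. The sketch below assumes the curve is non-degenerate (it ignores special cases such as a single point, a pair of lines, or no real points at all) and that the coefficients are integers or otherwise exact, so the comparison with zero is meaningful. The function name `classify_conic` is illustrative.

```python
def classify_conic(A, B, C):
    """Classify A x^2 + B xy + C y^2 + D x + E y + F = 0 by B^2 - 4AC.

    Only the quadratic coefficients matter for the class. Degenerate
    curves are not detected; this is a sketch, not a full classifier.
    """
    if A == 0 and B == 0 and C == 0:
        raise ValueError("no quadratic terms: the equation describes a line")
    discriminant = B * B - 4 * A * C
    if discriminant < 0:
        return "ellipse"
    if discriminant == 0:
        return "parabola"
    return "hyperbola"


print(classify_conic(1, 0, 1))   # x^2 + y^2 - 1 = 0, a circle
print(classify_conic(-1, 0, 0))  # y - x^2 = 0
print(classify_conic(0, 1, 0))   # xy - 1 = 0
```

A circle counts as an ellipse here, which matches the three-way split described above.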
Or even beyond science. My colleague Michael Harris, a distinguished number theorist at the Institut de Mathématiques de Jussieu in Paris, has a theory that three of Thomas Pynchon’s major novels are governed by the three conic sections: Gravity’s Rainbow is about parabolas (all those rockets, launching and dropping!), Mason & Dixon about ellipses, and Against the Day about hyperbolas. This seems as good to me as any other organizing theory of these novels I’ve encountered; certainly Pynchon, a former physics major who likes to drop references to Möbius strips and the quaternions in his novels, knows very well what the conic sections are.
Galton observed that the curves he drew by hand looked like ellipses, but was not quite geometer enough to be sure that this precise curve, and not some other more or less ovoid figure, was actually in charge. Was he letting his desire for an elegant and universal theory affect his perception of the data he’d collected? He wouldn’t be the first or last scientist to make that mistake. Galton, careful as always, sought the advice of J. D. Hamilton Dickson, a mathematician at Cambridge. He even went so far as to conceal the origin of his data, presenting it as a problem arising from physics, to avoid prejudicing Dickson toward any particular conclusion. To Galton’s delight, Dickson quickly confirmed that the ellipse was not only the curve that the data suggested, but the curve that theory demanded.
“The problem may not be difficult to an accomplished mathematician,” Galton wrote, “but I certainly never felt such a glow of loyalty and respect towards the sovereignty and wide sway of mathematical analysis as when his answer arrived, confirming, by purely mathematical reasoning, my various and laborious statistical conclusions with far more minuteness than I had dared to hope, because the data ran somewhat roughly, and I had to smooth them with tender caution.”
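In modern notation, the reason theory demands an ellipse can be sketched as follows. The symbols here (means \mu_x and \mu_y, standard deviations \sigma_x and \sigma_y, correlation \rho) are standard textbook notation for the bivariate normal distribution, not Galton's or Dickson's own.

```latex
% Bivariate normal density
f(x,y) = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}
         \exp\!\left(-\frac{Q(x,y)}{2(1-\rho^2)}\right),
\qquad
Q(x,y) = \frac{(x-\mu_x)^2}{\sigma_x^2}
       - \frac{2\rho\,(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}
       + \frac{(y-\mu_y)^2}{\sigma_y^2}.

% A curve of constant density is a curve Q(x,y) = c, a quadratic equation with
A = \frac{1}{\sigma_x^2}, \qquad
B = -\frac{2\rho}{\sigma_x\sigma_y}, \qquad
C = \frac{1}{\sigma_y^2},

% and its discriminant is negative whenever |\rho| < 1:
B^2 - 4AC = \frac{4(\rho^2 - 1)}{\sigma_x^2\sigma_y^2} < 0 .
```

A quadratic curve with negative discriminant B^2 - 4AC is an ellipse, so every curve of constant density is an ellipse centered on the pair of averages (\mu_x, \mu_y).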
BERTILLONAGE
Galton understood quickly that the idea of correlation wasn’t limited to the study of heredity; it applied to any pair of qualities that might bear some relation to one another.
As it happened, Galton was in possession of a massive database of anatomical measurements, of the sort that were enjoying a vogue in the late nineteenth century, thanks to the work of Alphonse Bertillon. Bertillon was a French criminologist with a spirit very much like Galton’s; he was devoted to a rigorously quantitative view of human life and confident about the benefits such an approach would bring.* In particular, Bertillon was appalled by the unsystematic and haphazard way in which French police identified criminal suspects. How much better and more modern it would be, Bertillon reasoned, to attach to each miscreant Frenchman a series of numerical measurements: the length and breadth of the head, the length of fingers and feet, and so on. In Bertillon’s system, each arrested suspect was measured and his data filed on cards and stored away for future use. Now, if the same man were nabbed again, identifying him was a simple matter of getting out the calipers, taking his numbers, and comparing them with the cards on file. “Aha, Mr. 15-6-56-42, thought you’d get away, didn’t you?” You can replace your name by an alias, but there’s no alias for the shape of your head.
Bertillon’s system, so in keeping with the analytic spirit of the time, was adopted by the Paris Prefecture of Police in 1883, and quickly spread throughout the world. At its height, bertillonage held sway in police departments from Bucharest to Buenos Aires. “The Bertillon cabinet,” Raymond Fosdick wrote in 1915, “became the distinguishing mark of the modern police organization.” In its time, the practice was so common and uncontroversial in the United States that Justice Anthony Kennedy brought it up in his majority opinion in the 2013 case Maryland v. King, allowing states to take DNA samples from felony arrestees: in Kennedy’s view, a DNA sequence was just another sequence of data points attached to a suspect, a sort of twenty-first-century Bertillon card.
Galton asked himself: Was Bertillon’s choice of measurements the best possible? Or could you identify suspects more accurately if you took even more measurements? The problem, Galton realized, is that bodily measurements aren’t entirely independent. If you’ve already measured a suspect’s hands, do you really need to measure his feet, too? You know what they say about men with big hands: their feet are, statistically speaking, also likely to be of greater than average size. So the addition of the foot length doesn’t add as much information to the Bertillon card as one might initially hope. Adding more and more measurements—if they are poorly chosen—may provide steadily diminishing returns.
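The diminishing returns can be put in rough numbers. If two measurements have correlation r, the best straight-line prediction of one from the other leaves a fraction 1 - r^2 of its variance unexplained; that leftover is, loosely, the new information the second measurement adds. The correlation values below are hypothetical, chosen for illustration and not taken from Galton's or Bertillon's data, and the function name is made up.

```python
def unexplained_variance_fraction(r):
    """Share of one variable's variance left over after predicting it
    linearly from another variable with correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    return 1.0 - r * r


# Hypothetical hand-foot correlations, for illustration only.
for r in (0.0, 0.5, 0.8, 0.95):
    share = unexplained_variance_fraction(r)
    print(f"r = {r}: foot length still carries {share:.0%} of its variance")
```

The more tightly hand and foot lengths track each other, the less a second line on the Bertillon card has left to say.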