Dataclysm: Who We Are (When We Think No One's Looking)
Among the many maps and charts and tables in Tufte’s books, there’s a two-page examination of the Vietnam Memorial, not as stonework or as history, but as an artifact of data design. I wish I could reprint the full discussion here, but the kernel is this:
From a distance the entire collection of names of 58,000 dead soldiers arrayed on the black granite yields a visual measure of what 58,000 means, as the letters of each name blur into a gray shape, cumulating to the final toll.
To find meaning in that gray blur is what every data scientist hopes for, and I’ve sought that distance and that blur repeatedly in these pages, drawing from the biggest data sets, looking at the widest stories, all to better my chances at truth.
The memorial was digitized in 2008. Every square inch was photographed and collated with military records, and the online version allows visitors to attach photos and text to each name. The web archive confronts the visitor with an empty box, demanding, “Search the Wall.” After a pause, I started to type my dad’s name, because when I think of Vietnam I think of him almost as a reflex. But then I remembered, gratefully, David Patton Rudder isn’t on this list. So I entered someone’s name, just a guess—“John” of course and then because Smith seemed too bland and Doe too hokey, “Wilson.” The page churned for a half second, and at the top I saw:
Lorne John Wilson
Tour Start Date 1969-03-17
Tour End Date 1969-03-28
Death Date 1969-03-28
Age 20
Two pictures had been added to his entry, one his portrait in dress blues, the other a snapshot, perhaps taken one of those eleven days PFC Wilson was in-country and alive. It shows four young men around a jeep, one’s standing in the back; they’re just talking in the afternoon. Grainy and undersaturated, but for the fatigues it could’ve come from Instagram. Whoever uploaded it had held on to the picture, and his friends, for decades.
A web page can’t replace granite. It can’t replace friendship or love or family, either. But what it can do—as a conduit for our shared experience—is help us understand ourselves and our lives. The era of data is here; we are now recorded. That, like all change, is frightening, but between the gunmetal gray of the government and the hot pink of product offers we just can’t refuse, there is an open and ungarish way. To use data to know yet not manipulate, to explore but not to pry, to protect but not to smother, to see yet never expose, and, above all, to repay that priceless gift we bequeath to the world when we share our lives so that other lives might be better—and to fulfill for everyone that oldest of human hopes, from Gilgamesh to Ramses to today: that our names be remembered, not only in stone but as part of memory itself.
A Note on the Data
Numbers are tricky. Even without context, they give the appearance of fact, and their specificity forbids argument: 20,679 Physicians say “LUCKIES are less irritating.” What else is there to know about smoking, right? The illusion is even stronger when the numbers are dressed up as statistics. I won’t rehash the old wisdom there. But behind every number there’s a person making decisions: what to analyze, what to exclude, what frame to set around whatever pictures the numbers paint. To make a statement, even to just make a simple graph, is to make choices, and in those choices human imperfection inevitably comes through. As far as I know, I’ve made no motivated decision that has bent the outcome of my work—the data of people acting out their lives is interesting enough without me needing to lead it one way or another. But I have made choices, and those choices have affected the book. I’d like to walk you through a few of them.
My first choice was probably my most difficult: the decision to focus on male-female relationships when I talk about attraction and sex. Space, of course, was a factor—to include same-sex relationships would’ve meant repeating each graph or table in triplicate. But more than that was the discovery that same-sex relationships aren’t exceptional—they follow all the same trends. Gay men, for example, prefer younger partners just like straight men do. For issues that have to do with sex only indirectly, such as ratings from one race to another, gays and straights also show similar patterns. Male-female relationships allowed for the least repetition and widest resonance per unit of space, so I made the choice to focus on them.
My second decision, to leave out statistical esoterica, was made with much less regret. I don’t mention confidence intervals, sample sizes, p values, and similar devices in Dataclysm because the book is above all a popularization of data and data science. Mathematical wonkiness wasn’t what I wanted to get across. But like the spars and crossbeams of a house, the rigor is no less present for being unseen. Many of the findings in the book are drawn from academic, peer-reviewed sources. I applied the same standards to the research I did myself, including a version of peer-review: much of the OkCupid analysis was performed first by me and then verified independently by an employee of the company. Also, I separated the analysis from the selection and organization of the data to make sure the former didn’t motivate the latter. One person would extract the information, another would try to figure out what it meant.
Sometimes, I present a trend and attribute a cause to it. Often that cause is my best guess, given my understanding of all the forces in play. To interpret results—a necessity in any book that isn’t just reams of numbers—I had to choose one explanation from a variety of possibilities. Is there some force besides age behind what I call Wooderson’s law (the fact that straight men of all ages are most interested in twenty-year-old women)? Perhaps. But I think it is very unlikely. “Correlation does not imply causation” is a good thing for everyone to keep in mind—and an excellent check on narrative overreach. But a snappy phrase doesn’t mean that the question of causation isn’t itself interesting, and I’ve tried to attribute causes only where they are most justified.
For almost all the parts of Dataclysm that overlap with posts on OkCupid’s blog, I chose to redo the work from scratch, on the most recent data, rather than quote my own previous findings. I did so because, frankly, I wanted to double-check what I’d done. The research published there from 2009 through 2011 was put together piecemeal. Many different people—I can count at least five—had pulled male-female message-reply rates for me over those three years, just to name one frequently used data point, and going back through my records of this data, there was no way to be sure what data set had generated the results. Doing it again myself, I could be sure. I could also enforce a uniform standard across all my research (for example, restricting analysis to only people ages twenty to fifty—a choice I made because those are the ages where I knew I had representative data).
Because the research is new, the numbers printed in Dataclysm are different from the numbers on the blog. Curves bend in slightly new ways. Graphs are a bit thicker or perhaps a bit thinner in places. The findings in the book and on the blog are nonetheless consistent. Ironically, with research like this, precision is often less appropriate than a generalization. That’s why I often round findings to the nearest 5 or 10, and the words “roughly” and “approximately” and “about” appear frequently in these pages. When you see in some article that “89.6 percent” of people do x, the real finding is that “many” or “nearly all” or “roughly 90 percent” of them do it; it’s just that the writer probably thought the decimals sounded cooler and more authoritative. The next time a scientist runs the numbers, perhaps the outcome will be 85.2 percent. The next time, maybe it’s 93.4. Look out at the churning ocean and ask yourself exactly which whitecap is “sea level.” It’s a pointless exercise at best. At worst, it’s a misleading one.
If you trace the findings in Dataclysm back to the original sources, the OkCupid data isn’t the only place you’ll see discrepancies. This data of our lives, being itself practically a living thing, is always changing. For example, my Klout score, which is holding steady at 34 as I write these words, will have no doubt gone up by the time you read them, since part of my obligation to Crown will be to tweet about this book. User engagement, ho!
Sometimes the numbers shift for no obvious reason. My copy editor and I had a mess of a time pinning down the Google autocompletes for prompts like “Why do women …” Google had given each of us slightly different results (“… wear thongs?” was my third result to the above, presumably because that’s a typically male question [?]. Hers was “… wear bras?”). Then when I checked a few weeks later, I myself saw something different: “… wear high heels?” Since it was the most recent result, that’s what ended up in the book.
As interesting a tool as it is, the black box of Google’s autocomplete (and Google Trends, for that matter) is an example of one of the worst things about today’s data science—its opaqueness. Corroboration, so important to the scientific method, is difficult, because so much information is proprietary (and here OkCupid is as guilty as anyone). Even as most social media companies trumpet the hugeness and potential of their data, the bulk of it has stayed off-limits to the larger world. Data sets currently move through the research community like yeti—I have a bunch of interesting stuff but I can’t say from where; I heard someone at Temple has tons of Amazon reviews; I think L has a scrape of Facebook. That last is something I was told by three unrelated academics; they referred to another scientist by name, which I’ve here obscured. L does in fact have that rogue Facebook scrape—I met him and confirmed—but he can’t show it to anyone. He’s really not supposed to have it at all. Data is money, which means companies treat it as such—and though some digital data sits out in the open, it’s secured behind legal walls as thick as any vault’s. If you look at your friend Lisa’s Facebook page, observe that her name is Lisa, and publish that fact (anywhere!)—you have technically stolen Facebook’s data. If you’ve ever signed up for a website and given a fake zip code or a fake birthday, you have violated the Computer Fraud and Abuse Act. Any child under thirteen who visits newyorktimes.com violates their Terms of Service and is a criminal—not just in theory, but according to the working doctrine of the Department of Justice.1 The examples I’ve laid out are extreme, sure, but the laws involved are so broadly written as to ensure that, essentially, every Internet-using American is a tort-feasing felon on a lifelong spree of depraved web browsing. Whether anyone penalizes you for your “crime” is another matter, but, legally, you are prostrate, a boot on your neck. A company’s general counsel, or a district attorney looking to please an important corporate donor, can destroy your life simply by deciding to press. When it suits, they do. So social scientists are very cagey with data sets; actually, more than yeti, they treat them like big bags of weed—possessive, slightly paranoid, always curious who else is holding and how dank that shit is.
Increasingly the preferred practice is to bring researchers in-house rather than release information outside.2 And that approach has yielded, among many fruits, the novel research by Facebook’s data team and Seth Stephens-Davidowitz’s fine work at Google, both of which I’ve drawn on here. I hope more companies follow this model, and that eventually we, the owners of the sites, will find a way to release our data for the public good without jeopardizing our users’ privacy in the act.
It’s old hat now, but the app Shazam was, to me, one of the first great wonders of the iPhone. It’s a little program for identifying music—if some song is playing, and you want to know what it is, you just turn on the app and hold up your phone. Shazam listens through the microphone, and, like, two seconds later, it tells you what you’re listening to. The first time someone did it in front of me, I was just blown away, not only at how little the software needed to get the song right (it can often work through walls or above the din of a bar), but at how fast it worked. It was the closest thing I’d seen to magic, at least until I came to know a certain able necromancer who, at a whim, could summon fees and add them to my goddamn kitchen renovation. But anyway, as I later found out, Shazam relies on an incredible principle: that almost any piece of music can be identified by the up/down pattern in the melody—you can ignore everything else: key, rhythm, lyrics, arrangement … To know the song, you just need a map of the notes’ rise and fall. This melodic contour is called the song’s Parsons code, named after the musicologist who developed it in the 1970s. The code for the first two lines of “Happy Birthday” is •RUDUDDRUDUD, with U meaning “melody up,” D meaning “melody down,” and R for “repeated note.” The dot • just marks the beginning of the tune, which of course isn’t up or down from anything. Hum it to yourself to check.
As crazy as it seems, the code for “Happy Birthday” is practically unique across the entire catalog of recorded music, as is the code for almost all songs. And it’s because these few letters are such a concise description that Shazam is so fast: instead of a guitar, Paul McCartney, and just the right amount of reverb, “Yesterday” starts with •DRUUUUUUDDR. That’s a lot easier to understand.
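For readers who want to play with the idea, here is a minimal sketch in Python of how a Parsons code can be computed from a sequence of pitches. The MIDI transcription of “Happy Birthday” is my own; Shazam’s actual matching pipeline is, of course, far more elaborate than this.

```python
def parsons_code(pitches):
    """Return the Parsons code for a sequence of pitches.

    Pitches can be MIDI numbers or anything comparable; the code keeps
    only whether each note moves up (U), down (D), or repeats (R)
    relative to the note before it.
    """
    code = ["•"]  # the dot marks the first note, which moves from nothing
    for prev, curr in zip(pitches, pitches[1:]):
        if curr > prev:
            code.append("U")
        elif curr < prev:
            code.append("D")
        else:
            code.append("R")
    return "".join(code)

# "Happy Birthday," first two lines, as MIDI note numbers:
# G G A G C B | G G A G D C
print(parsons_code([67, 67, 69, 67, 72, 71, 67, 67, 69, 67, 74, 72]))
# -> •RUDUDDRUDUD
```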
Like an app straining for a song, data science is about finding patterns. Time after time, I—and the many other people doing work like me—have had to devise methods, structures, even shortcuts to find the signal amidst the noise. We’re all looking for our own Parsons code. Something so simple and yet so powerful is a once-in-a-lifetime discovery, but luckily there are a lot of lifetimes out there. And for any problem that data science might face, this book has been my way to say: I like our odds.
1 For more on the Kafkaesque implications of the CFAA, please see “Until Today, If You Were 17, It Could Have Been Illegal to Read Seventeen.com Under the CFAA” and “Are You a Teenager Who Reads News Online? According to the Justice Department, You May Be a Criminal,” both published by the Electronic Frontier Foundation.
2 I wish this were called hotboxing, but sadly, no.
Notes
We no longer live in a world where a reader depends on endnotes for “more information” or to seek proof of facts or claims. For example, I imagine any reader interested in Sullivan Ballou will have Googled him long before she consults these notes and transcribes into her browser the links I’ve provided. So I have used this section to focus on the many sources that have contributed not only facts but ideas to this book. I’ve also used it to substantiate or explain claims about my own proprietary data.
Since the subject of Dataclysm is changing almost daily, I’ve decided to enhance this section online at dataclysm.org/endnotes, where you will find additional source material and findings from emerging research.
Introduction
10 million people will use the site For this number, I counted every person who logged into OkCupid in the twelve months trailing April 2014: 10,922,722.
Tonight, some thirty thousand couples It’s the great unknowable of running an online dating site: How many of the users actually meet in person? And what happens next? This passage represents my best guesses at some basic in-person metrics. I used two separate methods:
1. I assumed someone who’s actively using OkCupid goes on one date every other month. I think this is conservative. At roughly 4,000,000 active users each month, that means roughly 65,000 people go on dates each day, meaning roughly 30,000 couples.
2. Every day 300 couples wind their way through our “account disable” interface to let us know that they no longer need OkCupid specifically because they have found a steady relationship on OkCupid. These are couples who (a) are dating seriously enough to shut down their OkCupid accounts, and who (b) are willing to go through the trouble of filling out a bunch of forms to let us know their new relationship status. I estimate that Group B represents only 1 in 10 of the long-term couples actually created by the site. And I estimate that Group A represents the outcome of only 1 in 10 first dates. Therefore, there must be 3,000 long-term couples, from 30,000 first dates each day. Of every 3,000 long-term couples, I believe something less than 1 in 10 go on to get married. One way to look at this: How many serious relationships did you have before you found the person you settled down with? I imagine the average number is roughly 10.
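To make the arithmetic of both methods easy to check, here it is as a short Python sketch. All the inputs are the rough figures quoted above—stated guesses, not exact measurements.

```python
# Method 1: dates implied by active usage.
active_users = 4_000_000            # active users in a given month
dates_per_user_per_day = 1 / 60     # one date every other month
people_dating_daily = active_users * dates_per_user_per_day
couples_daily = people_dating_daily / 2
print(round(people_dating_daily), round(couples_daily))
# -> roughly 65,000 people, roughly 30,000 couples

# Method 2: working backward from account disables.
disables_daily = 300                     # couples who tell us they met here
long_term_couples = disables_daily * 10  # assume only 1 in 10 bother to report
first_dates = long_term_couples * 10     # assume 1 in 10 first dates go long-term
print(long_term_couples, first_dates)    # 3,000 and 30,000
```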
These appraisals together are mutually supporting, at least of the “first dates” number, and even if it’s approximate, I think the deeper metrics follow plausibly.
ratings of pizza joints on Foursquare Ratings from a random sample of 305 New York City pizza places accessed through Foursquare’s public API.
the recent approval ratings for Congress These were collected from the 529 polls measuring “congressional job approvals” listed on the site realclearpolitics.com from January 26, 2009, through September 14, 2013. See realclearpolitics.com/epolls/other/congressional_job_approval-903.html#polls.
NBA players by how often The chart shows percent of games started for each of the players listed on a team roster for the 2012–2013 season on espn.com. Yes, I’m counting the 76ers as an NBA team.
6 percent This number comes from taking the geometric mean of the distances between each of the 21 discrete data points along the curves. So, for curves a and b, I calculated:

(|a₁ − b₁| × |a₂ − b₂| × ⋯ × |a₂₁ − b₂₁|)^(1/21)

which equals 0.056.
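As a concrete sketch of that calculation—with placeholder curves, since the underlying attractiveness data is OkCupid’s:

```python
import math

def curve_gap(a, b):
    """Geometric mean of the point-by-point distances between two
    curves sampled at the same 21 x-values."""
    gaps = [abs(x - y) for x, y in zip(a, b)]
    return math.prod(gaps) ** (1 / len(gaps))

# Two made-up curves, 21 points each, a constant 0.05 apart; with the
# real male/female curves the result comes out to roughly 0.056.
a = [i / 20 for i in range(21)]
b = [i / 20 + 0.05 for i in range(21)]
print(curve_gap(a, b))  # -> 0.05 for these placeholder curves
```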
58 percent of men The male attractiveness curve is centered more than a whole standard deviation below the female. Translating the same disparity to IQ means that the median male IQ would be slightly lower than 85, which is the threshold for “borderline intellectual functioning.” For example, the US Army doesn’t accept applicants with IQs below 85. I say “brain damaged” as a bit of hyperbole meant to capture this shift. Strictly speaking, I mean that 58 percent of men would have IQs lower than 85.