Dataclysm: Who We Are (When We Think No One's Looking)
Copyright © 2014 by Christian Rudder
All rights reserved.
Published in the United States by Crown Publishers,
an imprint of the Crown Publishing Group,
a division of Random House LLC,
a Penguin Random House Company, New York.
www.crownpublishing.com
CROWN and the Crown colophon are registered trademarks of Random House LLC.
Grateful acknowledgment is made to Psychology Today Magazine for permission to reprint an excerpt from “Final Analysis: Missed Connections” by Dorothy Gambrell (January/February 2013), copyright © 2013 by Sussex Publishers, LLC. Reprinted by permission of Psychology Today Magazine.
Image on this page: Film still from Dazed and Confused, copyright © 1993 by Polygram Filmed Entertainment. Reprinted by permission of Universal Studios Licensing LLC.
Table on this page: “Zipf’s Law and Vocabulary” by C. J. Sorell from The Encyclopedia of Applied Linguistics, edited by C. A. Chapelle (Oxford: Wiley-Blackwell, 2012). Reprinted by permission of the author.
Table on this page: Traits predicted by a Facebook user’s “likes” adapted from Figure 2, “Prediction accuracy of classification of dichotomous/dichotomized attributes expressed by the AUC” in “Private Traits and Attributes Are Predictable from Digital Records of Human Behavior” by Michal Kosinski, David Stillwell, and Thore Graepel (Washington, DC: PNAS, 2013). Reprinted by permission of the Proceedings of the National Academy of Sciences of the United States of America.
Library of Congress Cataloging-in-Publication Data
Rudder, Christian.
Dataclysm : who we are (when we think no one’s looking) / Christian Rudder.—First Edition.
pages cm
1. Behavioral assessment. 2. Human behavior. 3. Social media. 4. Big data. I. Title. BF176.5.R83 2014
155.2′8—dc23 2014007364
ISBN 978-0-385-34737-2
eBook ISBN 978-0-385-34738-9
Jacket design by Christopher Brand
v3.1
CONTENTS
Cover
Title Page
Copyright
Introduction
Part 1.
What Brings Us Together
1. Wooderson’s Law
2. Death by a Thousand Mehs
3. Writing on the Wall
4. You Gotta Be the Glue
5. There’s No Success Like Failure
Part 2.
What Pulls Us Apart
6. The Confounding Factor
7. The Beauty Myth in Apotheosis
8. It’s What’s Inside That Counts
9. Days of Rage
Part 3.
What Makes Us Who We Are
10. Tall for an Asian
11. Ever Fallen in Love?
12. Know Your Place
13. Our Brand Could Be Your Life
14. Breadcrumbs
Coda
A Note on the Data
Notes
Acknowledgments
Index
Introduction
You have by now heard a lot about Big Data: the vast potential, the ominous consequences, the paradigm-destroying new paradigm it portends for mankind and his ever-loving websites. The mind reels, as if struck by a very dull object. So I don’t come here with more hype or reportage on the data phenomenon. I come with the thing itself: the data, phenomenon stripped away. I come with a large store of the actual information that’s being collected, which luck, work, wheedling, and more luck have put me in the unique position to possess and analyze.
I was one of the founders of OkCupid, a dating website that, over a very un-bubbly long haul of ten years, has become one of the largest in the world. I started it with three friends. We were all mathematically minded, and the site succeeded in large part because we applied that mind-set to dating; we brought some analysis and rigor to what had historically been the domain of love “experts” and grinning warlocks like Dr. Phil. How the site works isn’t all that sophisticated—it turns out the only math you need to model the process of two people getting to know each other is some sober arithmetic—but for whatever reason, our approach resonated, and this year alone 10 million people will use the site to find someone.
As I know too well, websites (and founders of websites) love to throw out big numbers, and most thinking people have no doubt learned to ignore them; you hear millions of this and billions of that and know it’s basically “Hooray for me,” said with trailing zeros. Unlike Google, Facebook, Twitter, and the other sources whose data will figure prominently in this book, OkCupid is far from a household name—if you and your friends have all been happily married for years, you’ve probably never heard of us. So I’ve thought a lot about how to describe the reach of the site to someone who’s never used it and who rightly doesn’t care about the user-engagement metrics of some guy’s startup. I’ll put it in personal terms instead. Tonight, some thirty thousand couples will have their first date because of OkCupid. Roughly three thousand of them will end up together long-term. Two hundred of those will get married, and many of them, of course, will have kids. There are children alive and pouting today, grouchy little humans refusing to put their shoes on right now, who would never have existed but for the whims of our HTML.
I have no smug idea that we’ve perfected anything, and it’s worth saying here that while I’m proud of the site my friends and I started, I honestly don’t care if you’re a member or go create an account or what. I’ve never been on an online date in my life and neither have any of the other founders, and if it’s not for you, believe me, I get that. Tech evangelism is one of my least favorite things, and I’m not here to trade my blinking digital beads for anyone’s precious island. I still subscribe to magazines. I get the Times on the weekend. Tweeting embarrasses me. I can’t convince you to use, respect, or “believe in” the Internet or social media any more than you already do—or don’t. By all means, keep right on thinking what you’ve been thinking about the online universe. But if there’s one thing I sincerely hope this book might get you to reconsider, it’s what you think about yourself. Because that’s what this book is really about. OkCupid is just how I arrived at the story.
I have led OkCupid’s analytics team since 2009, and my job is to make sense of the data our users create. While my three founding partners have done almost all the hard work of actually building the site, I’ve spent years just playing with the numbers. Some of what I work on helps us run the business: for example, understanding how men and women view sex and beauty differently is essential for a dating site. But a lot of my results aren’t directly useful—just interesting. There’s not much you can do with the fact that, statistically, the least black band on Earth is Belle & Sebastian, or that the flash in a snapshot makes a person look seven years older, except to say huh, and maybe repeat it at a dinner party. That’s basically all we did with this stuff for a while; the insights we gleaned went no further than an occasional lame press release. But eventually we were analyzing enough information that larger trends became apparent, big patterns in the small ones, and, even better, I realized I could use the data to examine taboos like race by direct inspection. That is, instead of asking people survey questions or contriving small-scale experiments, which was how social science was often done in the past, I could go and look at what actually happens when, say, 100,000 white men and 100,000 black women interact in private. The data was sitting right there on our servers. It was an irresistible sociological opportunity.
I dug in, and as discoveries built up, like anyone with more ideas than audience, I started a blog to share them with the world. That blog then became this book, after one important improvement. For Dataclysm, I’ve gone far beyond OkCupid. In fact, I’ve probably put together a data set of person-to-person interaction that’s deeper and more varied than anything held by any other private individual—spanning most, if not all, of the significant online data sources of our time. In these pages I’ll use my data to speak not just to the habits of one site’s users but also to a set of universals.
The public discussion of data has focused primarily on two things: government spying and commercial opportunity. About the first, I doubt I know any more than you—only what I’ve read. To my knowledge, the national security apparatus has never approached any dating site for access, and unless they plan to criminalize the faceless display of utterly ripped abs or young women from Brooklyn going on and on about how much they like scotch, when, come on, you know they really don’t, I can’t imagine they’d find much of interest. About the second story, data-as-dollars, I know better. As I was beginning this book, the tech press was slick with drool over the Facebook IPO; they’d collected everyone’s personal data and had been turning it into all this money, and now they were about to turn that money into even more money in the public markets. A Times headline from three days before the offering says it all: “Facebook Must Spin Data into Gold.” You half expected Rumpelstiltskin to show up on the OpEd page and be like, “Yes, America, this is a solid buy.”
As a founder of an ad-supported site, I can confirm that data is useful for selling. Each page of a website can absorb a user’s entire experience—everything he clicks, whatever he types, even how long he lingers—and from this it’s not hard to form a clear picture of his appetites and how to sate them. But awesome though the power may be, I’m not here to go over our nation’s occult mission to sell body spray to people who update their friends about body spray. Given the same access to the data, I am going to put that user experience—the clicks, keystrokes, and milliseconds—to another end. If Big Data’s two running stories have been surveillance and money, for the last three years I’ve been working on a third: the human story.
Facebook might know that you’re one of M&M’s many fans and send you offers accordingly. They also know when you break up with your boyfriend, move to Texas, begin appearing in lots of pictures with your ex, and start dating him again. Google knows when you’re looking for a new car and can show the make and model preselected for just your psychographic. A thrill-seeking socially conscious Type B, M, 25–34? Here’s your Subaru. At the same time, Google also knows if you’re gay or angry or lonely or racist or worried that your mom has cancer. Twitter, Reddit, Tumblr, Instagram, all these companies are businesses first, but, as a close second, they’re demographers of unprecedented reach, thoroughness, and importance. Practically as an accident, digital data can now show us how we fight, how we love, how we age, who we are, and how we’re changing. All we have to do is look: from just a very slight remove, the data reveals how people behave when they think no one is watching. Here I will show you what I’ve seen. Also, fuck body spray.
If you read a lot of popular nonfiction, there are a couple things in Dataclysm that you might find unusual. The first is the color red. The second is that the book deals in aggregates and big numbers, and that makes for a curious absence in a story supposedly about people: there are very few individuals here. Graphs and charts and tables appear in abundance, but there are almost no names. It’s become a cliché of pop science to use something small and quirky as a lens for big events—to tell the history of the world via a turnip, to trace a war back to a fish, to shine a penlight through a prism just so and cast the whole pretty rainbow on your bedroom wall. I’m going in the opposite direction. I’m taking something big—an enormous set of what people are doing and thinking and saying, terabytes of data—and filtering from it many small things: what your network of friends says about the stability of your marriage, how Asians (and whites and blacks and Latinos) are least likely to describe themselves, where and why gay people stay in the closet, how writing has changed in the last ten years, and how anger hasn’t. The idea is to move our understanding of ourselves away from narratives and toward numbers, or, rather, to think in such a way that numbers are the narrative.
This approach evolved from long toil in the statistical slag pits. Dataclysm is an extension of what my coworkers and I have been doing for years. A dating site brings people together, and to do that credibly it has to get at their desires, habits, and revulsions. So you collect a lot of detailed data and work very hard to translate it all into general theories of human behavior. What a person develops working amidst all this information, as opposed to, say, working for the wedding section of the Sunday paper, is a special kinship with the shambling whole of humanity rather than with any two individuals. You grow to understand people much as a chemist might understand, and through understanding come to love, the swirling molecules of his tincture.
That said, all websites, and indeed all data scientists, objectify. Algorithms don’t work well with things that aren’t numbers, so when you want a computer to understand an idea, you have to convert as much of it as you can into digits. The challenge facing sites and apps is thus to chop and jam the continuum of human experience into little buckets 1, 2, 3, without anyone noticing: to divide some vast, ineffable process—for Facebook, friendship, for Reddit, community, for dating sites, love—into pieces a server can handle. At the same time you have to retain as much of the je ne sais quoi of the thing as you can, so the users believe what you’re offering represents real life. It’s a delicate illusion, the Internet; imagine a carrot sliced so cleanly that the pieces stay there in place on the cutting board, still in the shape of a carrot. And while this tension—between the continuity of the human condition and the fracture of the database—can make running a website complicated, it’s also what makes my story go. The approximations technology has devised for things like lust and friendship offer a truly novel opportunity: to put hard numbers to some timeless mysteries; to take experiences that we’ve been content to put aside as “unquantifiable” and instead gain some understanding. As the approximations have gotten better and better, and as people have allowed them further into their lives, that understanding has improved with startling speed. I’m going to give you a quick example, but I first want to say that “Making the Ineffable Totally Effable” really should’ve been OkCupid’s tagline. Alas.
Ratings are everywhere on the Internet. Whether it’s Reddit’s up/down votes, Amazon’s customer reviews, or even Facebook’s “like” button, websites ask you to vote because that vote turns something fluid and idiosyncratic—your opinion—into something they can understand and use. Dating sites ask people to rate one another because it lets them transform first impressions such as:
He’s got beautiful eyes
Hmmm, he’s cute, but I don’t like redheads
Ugh, gross
… into simple numbers, say, 5, 3, 1 on a five-star scale. Sites have collected billions of these microjudgments, one person’s snap opinion of someone else. Together, all those tiny thoughts form a source of vast insight into how people arrive at opinions of one another.
The most basic thing you can do with person-to-person ratings like this is count them up. Take a census of how many people averaged one star, two stars, and so on, and then compare the tallies. Below, I’ve done just that with the average votes given to straight women by straight men. This is the shape of the curve:
Fifty-one million preferences boil down to this simple stand of rectangles. It is, in essence, the collected male opinion of female beauty on OkCupid. It folds all the tiny stories (what a man thinks of a woman, millions of times over) and all the anecdotes (any one of which we could’ve expanded upon, were this a different kind of book) into an intelligible whole. Looking at people like this is like looking at Earth from space; you lose the detail, but you get to see something familiar in a totally new way.
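The census described above is simple enough to sketch in a few lines of Python. This is only an illustration, not OkCupid's actual code: the function name, the (person, stars) vote format, the rounding of each average to the nearest whole star, and the sample votes are all assumptions made up for the example.

```python
from collections import Counter, defaultdict


def tally_average_ratings(votes):
    """Count how many rated people average one star, two stars, and so on.

    `votes` is an iterable of (rated_person_id, stars) pairs, stars in 1..5.
    The input format is hypothetical, chosen only for this sketch.
    """
    # person -> [sum of stars received, number of votes received]
    totals = defaultdict(lambda: [0, 0])
    for person, stars in votes:
        totals[person][0] += stars
        totals[person][1] += 1

    # Round each person's average to the nearest whole star (halves round up)
    # and count how many people land in each bucket.
    buckets = Counter()
    for star_sum, count in totals.values():
        buckets[int(star_sum / count + 0.5)] += 1

    # Report every bucket from 1 to 5, including the empty ones.
    return {star: buckets.get(star, 0) for star in range(1, 6)}


# Made-up sample: four people, two votes each.
votes = [
    ("a", 5), ("a", 4),
    ("b", 3), ("b", 3),
    ("c", 1), ("c", 2),
    ("d", 3), ("d", 2),
]
tally = tally_average_ratings(votes)
print(tally)
```

Each value in the result is the height of one rectangle in a chart like the one above: the number of people whose average score fell in that star bucket. Finer buckets (half stars, say) would give a smoother outline at the cost of fewer people per bar.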
So what is this curve telling us? It’s easy to take this basic shape—a bell curve—for granted, because examples in textbooks have probably led you to expect it, but the scores could easily have gone hard to one side or the other. When personal preference is involved, they often do. Take ratings of pizza joints on Foursquare, which tend to be very positive:
Or take the recent approval ratings for Congress, which, because politicians are the moral opposite of pizza, skew the other way:
Also, our male-to-female ratings curve is unimodal, meaning that the women’s scores tend to cluster around a single value. This again is easy to shrug at, but many situations have multiple modes, or “typical” values. If you plot NBA players by how often they were in the starting lineup in the 2012–13 season, you get a bunch of athletes clustered at either end, and almost no one in the middle:
That’s the data telling us that coaches think a given player is either good enough to start, or he isn’t, and the guy’s in or out of the lineup accordingly. There’s a clear binary system. Similarly, in our ratings data, men as a group might’ve seen women as “gorgeous” or “ugly” and left it at that; like top-line basketball talent, beauty could’ve been a you-have-it-or-you-don’t kind of thing. But the curve we started with says something else. Looking for understanding in data is often a matter of considering your results against these kinds of counterfactuals. Sometimes, in the face of an infinity of alternatives, a straightforward result is all the more remarkable for being so. In fact, our graph is quite close to what’s called a symmetric beta distribution—a curve often deployed to model basic unbiased decisions—which I’ll overlay here:
Our real-world data diverges only slightly (6 percent) from this formulaic ideal, meaning this graph of male desire is more or less what we could’ve guessed in a vacuum: it is, in fact, one of those textbook examples I was making light of. So the curve is predictable, centered—maybe even boring. So what? Well, this is a rare context where boringness is something special: it implies that the individual men who did the scoring are likewise predictable, centered, and, above all, unbiased. And when you consider the supermodels, the porn, the cover girls, the Lara Croft–style fembots, the Bud Light ads, and, most devious of all, the Photoshop jobs that surely these men see every day, the fact that male opinion of female attractiveness is still where it’s supposed to be is, by my lights, a small miracle. It’s practically common sense that men should have unrealistic expectations of women’s looks, and yet here we see it’s just not true. In any event, they’re far more generous than the women, whose votes go like this: