Dataclysm: Who We Are (When We Think No One's Looking)
OkCupid’s user-submitted profile essays are as close to personal self-summaries as you’ll find. The prompts are open-ended:
“My self-summary …”
“I’m really good at …”
“The first things people usually notice about me are …”
“I spend a lot of time thinking about …”
And insofar as people try to put their best foot forward, they’re not at all unlike college essays. I imagine many people approach them with the same sort of dread. There are no length restrictions, no guidelines but for the prompts. Altogether, people have given the site 3.2 billion words of self-description. Moreover, unlike other big hunks of text—say, what Google Books has collected—there are demographics behind every word: the age of the author, where she lives, her race, and so on. But deriving a group identity for, say, Asian women from the text isn’t quite as easy as counting up who types what the most, which for the most part is how we’ve looked at text so far in this book. Counting words just gets us this:
1. the
2. of
3. and
4. …
and so on down the line—basically that top 100 from the Oxford English Corpus we saw before. Asian women, white men, and all English speakers use the same pronouns and articles and prepositions to talk about themselves. To find out what’s actually special to a particular group, and to them alone, we have to sort the text a little differently.
I’ll use white men as my walk-through example, because I understand them the best. The first step is to separate those white guys’ essays from everyone else’s. Then, in the two sets of self-descriptions—white-guy and not—we order all the words and phrases in the texts by how frequently they appear. We put them into two lists, from most popular to least, and that gives us something like the chart below. I’ve pulled out three examples and put them in their correct places in the line; the full lists have about 360,000 phrases each:
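As a rough sketch of that split-and-rank step, the code below separates one group's essays from everyone else's, counts every phrase, and orders each list from most to least frequent. The tokenizer (lowercase, split on whitespace), the three-word phrase limit, and all function names are assumptions made for illustration; the text does not say how phrases were extracted, or whether a phrase is counted once per occurrence or once per profile.

```python
from collections import Counter

def count_phrases(essays, max_words=3):
    """Count every run of 1..max_words consecutive words across the essays."""
    counts = Counter()
    for essay in essays:
        words = essay.lower().split()  # naive tokenizer (assumption)
        for n in range(1, max_words + 1):
            for i in range(len(words) - n + 1):
                counts[" ".join(words[i:i + n])] += 1
    return counts

def ranked(counts):
    """Order phrases from most to least frequent; ties break alphabetically."""
    return [p for p, _ in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))]

def split_and_rank(profiles, group, max_words=3):
    """profiles: iterable of (group_label, essay_text) pairs.

    Returns two ranked lists: one for the group, one for everyone else.
    """
    profiles = list(profiles)
    inside = [text for label, text in profiles if label == group]
    outside = [text for label, text in profiles if label != group]
    return (ranked(count_phrases(inside, max_words)),
            ranked(count_phrases(outside, max_words)))
```

Calling `split_and_rank(profiles, "white men")` on a list of labeled essays would return the two ordered lists that the rest of the method builds on.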
Already we’re getting somewhere, but before we move on, there’s something a little misleading about these plots that I want to address while the list is still simple. No, it’s got nothing to do with Phish, though lord knows they’ve misled many. It’s that “pizza” and “the” appear to be mentioned almost the same number of times. Granted, pizza is the king of foods, but “the” is the absolute most popular word in the English language. And in our data, while “the” is in its rightful place at the top, “pizza” is seemingly right there with it, at the 98th percentile. This makes it feel like something is wrong either with my data or with my method, but the rankings of the words are correct. It’s just that humans use language in an odd way: we are always repeating ourselves. So a very few top-ranked words take up most of our writing. And, conversely, the frequency of a word falls off very quickly as you go even a small distance from “most popular.”
This counterintuitive relationship between the popularity of a word (its rank in a given vocabulary) and the number of times it appears is described by something called Zipf’s law, an observed statistical property of language that, like so much of the best math, lies somewhere between miracle and coincidence.1 It states that in any large body of text, a word’s popularity (its place in the lexicon, with 1 being the highest ranking) multiplied by the number of times it shows up, is the same for every word in the text. Or, very elegantly:
rank × number = constant
This law holds for the Bible, the collected lyrics of ’60s pop songs, the canonical corpus of English literature (the Oxford English Corpus), and it certainly holds for profile text. To see how well it works in practice even on a highly idiosyncratic body of writing, here’s the law applied to James Joyce’s Ulysses:2
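| word | rank | number of times it appears | rank × number |
| --- | --- | --- | --- |
| ’s | 10 | 2,826 | 28,260 |
| is | 20 | 1,435 | 28,700 |
| what | 30 | 975 | 29,250 |
| has | 100 | 289 | 28,900 |
| wife | 200 | 140 | 28,000 |
| Ireland | 300 | 90 | 27,000 |
| college | 1,000 | 26 | 26,000 |
| morn | 5,000 | 5 | 25,000 |
| builder | 10,000 | 2 | 20,000 |
| Zurich | 29,055 | 1 | 29,055 |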
The steady relationship between rank and number seems to be a property of the mind as much as of language—as you can see above, it accommodates arbitrary proper names, like “Ireland” and “Zurich,” and even words transcribed from dialect, like “ ’s.”
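To make the check concrete, here is a small sketch that recomputes rank × number from the Ulysses rows quoted in the table and adds a helper for producing the same kind of rows from any text. The function names and the regular-expression tokenizer are assumptions for illustration, not the procedure behind the original table.

```python
from collections import Counter
import re

def rank_counts(text):
    """Return [(word, rank, count)], with rank 1 for the most frequent word."""
    words = re.findall(r"[a-z']+", text.lower())  # crude tokenizer (assumption)
    return [(w, i, c)
            for i, (w, c) in enumerate(Counter(words).most_common(), start=1)]

def zipf_products(rows):
    """Map each word to rank * count; Zipf's law predicts similar values."""
    return {word: rank * count for word, rank, count in rows}

# Rows copied from the Ulysses table: (word, rank, count).
ULYSSES = [("'s", 10, 2826), ("is", 20, 1435), ("what", 30, 975),
           ("has", 100, 289), ("wife", 200, 140), ("Ireland", 300, 90),
           ("college", 1000, 26), ("morn", 5000, 5), ("builder", 10000, 2),
           ("Zurich", 29055, 1)]

products = zipf_products(ULYSSES)
print(min(products.values()), max(products.values()))  # 20000 29250
```

Across three orders of magnitude in rank, the product stays between 20,000 and 29,250, which is the sense in which it is "the same" for every word.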
And as further evidence of its deep connection with the human experience, Zipf’s law also describes a wide variety of our social constructs: the sizes of cities, for example, and income distribution across a population. What it means for our purpose here is that because most of language is just a small body of repeated patterns, the use of a word drops off rapidly. “The” appears on nearly every profile. “Pizza” appears on about 1 in 14. “Phish,” even for white guys, for whom it ranks way up at the 80th percentile, appears in less than 1 in 200 profiles. Now that we understand how rankings and usage frequency compare, the next step is to use those rankings to our advantage.
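A short sketch can show why a high percentile still implies a small count. It converts a percentile into a rank on a list of about 360,000 phrases (the list length quoted earlier) and then applies an idealized Zipf curve in which rank × count is exactly constant. The percentile convention and both function names are assumptions; real counts only approximate this curve.

```python
def percentile_to_rank(percentile, vocab_size):
    """Convert a popularity percentile (0-100) into a rank, where 1 is the top."""
    return max(1, round(vocab_size * (1 - percentile / 100)))

def ideal_zipf_share(rank):
    """Count at this rank relative to the top item, if rank * count is constant."""
    return 1 / rank

VOCAB_SIZE = 360_000  # approximate length of each phrase list, as quoted in the text

for percentile in (98, 80):
    rank = percentile_to_rank(percentile, VOCAB_SIZE)
    print(f"{percentile}th percentile -> rank {rank:,}, "
          f"about 1/{rank:,} of the top item's count")
```

The profile shares given in the text (1 in 14, less than 1 in 200) count profiles rather than total occurrences, so they should not be expected to match this idealized ratio; the sketch only shows how far down the list a 98th-percentile phrase already sits.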
Below, I’ve put the two lists at right angles, forming a square, and I have plotted the words inside it using their popularity rankings on the two lists as coordinates. I added some arrows around “Phish” to make it clear what I mean:
A word’s position here has dual meaning. The closer to the top it appears, the more popular it is with white guys. The farther toward the right, the more popular it is with everyone else. Adding a few more words to the chart will give you a sense of how the geometry translates before I zoom out to the full corpus:
I’ve added a diagonal, yet again, to show parity in the data. The words near the line are important to everyone equally. And the farther up and to the right the words go, the more universally important they are. But remember, we’re not looking for universals. We’re looking for particulars. We want to know what is special to the people we’re considering: here, white guys. For that we need to look to the upper left: the farther in that direction a word appears, the more often white men use it, and the less often everyone else does. In fact, the closer a word is to that remotest reach of white maleness, the top-left vertex of the square, the more it typifies them and only them. Imagine a dot all the way in the corner: to be there, the word would have to appear on every single white male profile and at the same time never appear anywhere else. At least as far as words in a self-summary go, that’s the platonic ideal of identity. This system, and that metric—distance from the upper-left corner—gives the data a way to speak to us, to help us understand how people are talking about themselves.
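The corner metric can be sketched in a few lines. Here each phrase gets a popularity score between 0 and 1 on each list (1 being most popular), the "everyone else" score is the horizontal coordinate, the group score is the vertical one, and the upper-left corner sits at (0, 1). Scaling ranks to percentiles this way, and placing phrases missing from a list at 0, are assumptions made for illustration; the text does not spell out the exact scaling.

```python
import math

def percentile_scores(ranked_phrases):
    """Map each phrase to a score in [0, 1]; 1.0 is the most popular phrase."""
    n = len(ranked_phrases)
    if n == 1:
        return {ranked_phrases[0]: 1.0}
    return {p: 1 - i / (n - 1) for i, p in enumerate(ranked_phrases)}

def corner_distance(phrase, group_scores, other_scores):
    """Euclidean distance from the upper-left corner (x=0 others, y=1 group)."""
    x = other_scores.get(phrase, 0.0)  # unseen elsewhere: left edge (assumption)
    y = group_scores.get(phrase, 0.0)  # unseen in the group: bottom edge
    return math.hypot(x, 1 - y)
```

A smaller distance means a phrase is more typical of the group and less typical of everyone else; a phrase at distance 0 would be the "platonic ideal" described above.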
Because every data set has its quirks, researchers must often build tools from scratch, as we have here. Whenever you do this, it’s good to check your method against some familiar outcomes. Imagine a shipwright with a new boat: who knows what’ll happen once it’s out on the open ocean—so best to check for holes close to shore. Here, if we’d found “Kpop” (Korean pop) or “dreads” in the upper left, in my supposed corner of white-manhood, it would be a strong sign that either my data or my method was garbage. But as you can see, it’s working perfectly.
So, finally, here’s what the whole corpus of words and phrases looks like:
I’ve circled the dot closest to that upper-left corner: that’s the white-male-est thing a person can write about himself: my blue eyes. And getting a longer list of the things that uniquely define white men is just a matter of walking out from that vertex—for example, the thirty closest dots are the thirty things that are most typical. The geometry finds the clichés for us.
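Walking out from the vertex amounts to a nearest-neighbor selection. The sketch below assumes each phrase already has coordinates in the unit square (horizontal: popularity with everyone else; vertical: popularity with the group) and returns the k phrases closest to the upper-left corner. The function name and the coordinate values in the usage note are invented for illustration.

```python
import heapq
import math

def most_typical(coords, k=30):
    """coords: {phrase: (x_others, y_group)}, both values in [0, 1].

    Returns the k phrases nearest to the upper-left corner (0, 1),
    closest first.
    """
    def distance(phrase):
        x, y = coords[phrase]
        return math.hypot(x, 1 - y)
    return heapq.nsmallest(k, coords, key=distance)
```

With made-up coordinates such as `{"the": (1.0, 1.0), "pizza": (0.98, 0.98), "phish": (0.3, 0.8)}`, calling `most_typical(coords, k=1)` picks out "phish", the only one of the three that sits well above the diagonal.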
I’ve made plots like this for everyone in my data set, not just white guys, and using this same math I’ve gotten lists of their unique words and phrases, too. But before I move to listing all this, I want to make one important point. Walking through each combination of sex × ethnicity × orientation gives you 2 × 4 × 3 = 24 charts like the one above, and in all of them the mass of dots has this same tapered shape from bottom left to top right. That is, the farther a phrase goes into that upper-right corner, the closer to the diagonal it gets. What that means is that we tend to agree on the things that are most important. As for the things we don’t agree on, I’ve listed them in detail below. I’ll start with the men:3
most typical words for …

| white men | black men | Latinos | Asian men |
| --- | --- | --- | --- |
| my blue eyes | dreads | colombian | tall for an asian |
| blonde hair | jill scott | salsa merengue | asians |
| ween | haitian | cumbia | taiwanese |
| brown hair | soca | una | taiwan |
| hunting and fishing | neo soul | merengue bachata | cantonese |
| allman brothers | jamie foxx | mana | infernal affairs |
| woodworking | zane | banda | seoul |
| campfire | paid in full | puertorican | infernal |
| redneck | nigga | colombia | shanghai |
| dropkick murphys | luther vandross | gusta | boba |
| they might be giants | coldest winter | puerto rican | kbbq |
| brewing beer | tyler perry | tejano | kpop |
| robert heinlein | swagg | corridos | badminton |
| tom robbins | jerome | bachata merengue | kimchi |
| townes | dreadlocks | hector | chungking express |
| old crow medicine show | spike lee | espa | chou |
| mystery science theater | holla at me | por | viet |
| skis | menace to society | salsa bachata | jiro |
| sailboat | brotha | aventura | dash berlin |
| around a fire | shottas | english and spanish | ucsd |
| caddyshack | boomerang | musica | beijing |
| blond hair | nigerian | espa ol | hk |
| bill bryson | heartbeats | como | norwegian wood |
| wheelers | anthony hamilton | fiu | jiro dreams of sushi |
| pogues | gud | pero | lin |
| barenaked ladies | wayans | soledad | philippines |
| mst3k | dickey | espanol | noodle soup |
| truckers | isley | amor | malaysian |
| jethro tull | interracial | muy | for my next meal |
| canoe | nigeria | reggaeton | gangnam style |
Phish might’ve already given it away, but inside the white man rages a music festival for lumberjacks.
As for the other three lists, I had never heard of Zane or Anthony Hamilton or The Coldest Winter Ever or Chungking Express or Dash Berlin or a lot of the above before my scripts coughed them up, and I’m not going to pretend that a few minutes with Wikipedia can stand in for an understanding of a culture. These are users speaking in their own voice, and I’m going to let them do just that, but I will point out a few broad trends: white people differentiate themselves mostly by their hair and eyes, Asians by their country of origin, Latinos by their music. But because of the way the math is set up, the three non-white lists are evidence of cultures that I, as a white man, am not supposed to know. Of course, we’re all familiar with Spike Lee and Beijing and Shanghai, but these lists give us the “insiders’ ” view of a culture. It’s stuff an outsider can’t get from autocomplete, or in any other top-down way, because you can’t wonder at what you don’t realize is out there. “Why do Asian people like Norwegian Wood?” isn’t a stereotype because not enough non-Asians are familiar with the book (by Haruki Murakami) and movie. I thought it was just a Beatles song, and if before this chapter someone had asked me if I’d seen Norwegian Wood, I’d have said, “I don’t think they made videos back then.” The lists above are our shibboleths. As such, they are something no one could generate a priori, by typing things into Google Trends or by searching millions of hashtags. Sometimes, it takes a blind algorithm to really see the data.
Here are the lists for women. As you can see, they’re very similar in spirit to the men’s. Maybe a few more ballads.
most typical words for …

| white women | black women | Asian women | Latinas |
| --- | --- | --- | --- |
| my blue eyes | soca | taiwan | latina |
| red hair and | eric jerome dickey | tall for an asian | colombian |
| blonde hair and | haitian | philippines | una |
| love to be outside | imitation of life | taiwanese | cumbia |
| mudding | zane | beijing | banda |
| campfire | coldest winter ever | coz | tejano |
| four wheeling | nigerian | boba | merengue bachata |
| phish | interracial | filipina | gusta |
| hunting fishing | rb and gospel | cantonese | puertorican |
| campfires | five heartbeats | asians | colombia |
| green eyes and | anita baker | wong kar wai | mana |
| redneck | crooklyn | shanghai | vida |
| auburn | neosoul | seoul | bachata merengue |
| ride horses | octavia butler | macarons | amor |
| old crow medicine show | housewives of atlanta | viet | musica |
| grateful dead | luther vandross | kimchi | english and spanish |
| mountain goats | zora | for my next meal | espanol |
| love country music but | waiting to exhale | singapore | salsa merengue |
| gillian welch | anthony hamilton | malaysian | todo |
| country girl | chrisette | hk | por |
| christmas vacation | locs | malaysia | mariachi |
| bill bryson | outside my race | noodle soup | marc anthony |
| riding horses | kem | cambodian | espa ol |
| eric church | octavia | norwegian wood | novelas |
| barn | real housewives of atlanta | hong kong | como |
| allman | calypso | chungking express | pero |
| willie nelson | know why the caged | rachmaninoff | venezuela |
| harley | did i get married | southeast asia | soledad |
| brunette | spike lee | vienna | mas |
| flogging molly | braxton | mandarin | tacuba |
I discovered in the course of working with it that the algorithm we used to make these lists is flexible. You can just as easily run the math in reverse. This gives you the antitheses of a group—the stuff they especially don’t talk about—which can be as illuminating as what they especially do. Here are the lists for the men; they are printed on a darker background to visually emphasize that these lists are the opposite of the previous ones. They are the words least used by these groups yet most used by everyone else, the negative space in our verbal Rorschach. The lists are worth reading all the way through:
most antithetical words for …

| white men | black men | Asian men | Latinos |
| --- | --- | --- | --- |
| slow jams | borges | sence | southern accent |
| trey songz | social distortion | layed | from the midwest |
| robin thicke | tallest man on earth | layed back | ann arbor |
| smh | gaslight anthem | sence of humor | midwestern |
| musiq | snorkeling | truck driver | gumbo |
| merengue | belle and sebastian | 6′4 | freakanomics |
| laker | xkcd | realy | equity |
| ig | diet coke | anything else you wanna | discworld |
| kevin hart | surfboard | like what u see | shanghai |
| raised in nyc | totoro | and my son | scallops |
| hip hop rap rb | magnetic fields | u like what u | slopes |
| kpop | gogol bordello | care of my kids | university of michigan |
| george lopez | dropkick murphys | makeing | assessment |
| neo soul | rebelution | welder | parentheses |
| rb and hip hop | peru | hunting fishing | snowboarder |
| neyo | horrible’s sing along blog | care of my son | nyt |
| knw | wakeboarding | wanna know anything else | dominion |
| gud | herzog | else you wanna know | msu |
| follow me | my blue eyes | raising my son | ellipses |
| jordans | guitar and sing | ask and ill | maple |
| handball | dr horrible’s sing along | comedys | nigerian |
| soulchild | coachella | dnt | kenya |
| ne yo | dr horrible’s sing | woman who wants | john irving |
| bachata | yo la tengo | i’m a single father | over a decade |
| basketball | airborne toxic event | somthing | cheesesteaks |
| paid in full | yosemite | careing | wall street journal |
| mos def talib | feynman | writting | alternatively |
| mangas | coppola | and my daughter | mistborn |
| abt | wind up bird | haveing | weber |
| utada | kar | brown hair | gravitate toward |
The opposite-of-Latino list I found most surprising. Hispanic and white identities are often conflated by demographers; for example, the US Census has struggled for years to separate one from the other. But they can only use checkboxes on paper. Latinos’ “most typical” list above and their “opposite” one here define the extremes. The first gives you the furthest reaches of Latin culture (music and language) and the second gives the “corn-fed” Midwestern white stereotype, which is one of the few white subcultures with no Latin influence. Also, please notice that the “least Asian” things are all misspellings, working-class occupations, and other underachievements, like single fatherhood. And of course there’s “6′4.”
The women’s lists are equally rich, and I again suggest you take in every word. There’s the awesome my name is Ashley in the Asian antitheses. And I have to say, as a point of professional pride—when you ask an algorithm “What aren’t black women talking about” and it tells you “tanning,” you know you did something right.
most antithetical words for …

| white women | black women | Asian women | Latinas |
| --- | --- | --- | --- |
| filipino | belle and sebastian | bbw | midwestern |
| neo soul | tanning | god my children | cincinnati |
| musiq | bruins | single mother of two | classically |
| slow jams | tahoe | grandson | kenya |
| rich dad poor dad | simon and garfunkel | god my daughter | neal |
| corinne bailey rae | magnetic fields | mother of three | shanghai |
| bailey rae | sf giants | human services | financial services |
| salsa bachata | flogging molly | degree in criminal justice | classically trained |
| aaliyah | head and the heart | single mom of two | southern belle |
| jpop | dodgers | notice my eyes and | cutting for stone |
| smh | wavy | wanna know just ask | in new england |
| salsa merengue | naked and famous | mexican and chinese | antarctica |
| nujabes | social distortion | they are my world | kavalier |
| 48 laws of power | mountain biking | being the best mom | full disclosure |