by Adam Tanner
“The dirty little secret . . . whenever people talk about privacy rights, it always devolves to health data, and most lawyers and most of the public cannot believe that we have no control over our health data. But we don’t! We fucking don’t.”
Good Data Intentions Gone Bad: Netflix/AOL Data Releases
In recent years, the ability to identify people thought to be anonymous has embarrassed well-known companies and institutions that have released data. One well-known incident occurred in 2006 after Netflix announced a dramatic contest: the company offered to pay $1 million to anyone who could improve its movie rating system by 10 percent. In theory, the plan would benefit everyone: predicting what movies would most appeal to individual subscribers could boost business and create a better customer experience. By outsourcing the effort to the public, Netflix could lure some of the best minds in data science and fill the wallet of a computer expert or team of researchers. “We’re quite curious, really. To the tune of one million dollars,” the company advertised.22 The contest attracted wide attention. Hoping to snag the seven-figure payout, 51,051 people on 41,305 teams entered. Netflix released recommendations from nearly half a million subscribers, replacing the names of its customers with internal ID numbers.
The contest excited Arvind Narayanan, but for a different reason than other researchers. A University of Texas PhD student at the time, Narayanan did not aspire to win the million-dollar prize. Rather, he invented his own contest to reidentify some of the people whose names Netflix had removed when releasing the recommendations. He rushed to see his faculty advisor, Vitaly Shmatikov, a computer scientist with special interest in computer security and privacy.
Narayanan was convinced that Netflix was wrong in saying it could maintain the anonymity of customers in the released data. He felt confident that there had to be a way to identify some of them. If Narayanan and Shmatikov could succeed, they would demonstrate a major flaw in how companies approached protecting privacy as crowd-sourcing became increasingly popular.
Both foreign-born, the student and professor had grown up in very different privacy environments. Narayanan hailed from Chennai, earlier known as Madras, a city of more than four million in southern India. As in other large Indian cities, surges of people crowded the streets and public transport, leaving little possibility for personal space. “I like to joke that it is not even feasible in India because if you insisted that everybody stay three feet apart from each other, you’d run out of space,” he says. “When an Indian person applies for a job they put their date of birth and a bunch of other personal details on their CV, which is very jarring in terms of the kind of contextual boundaries that we have here.”
Shmatikov, who came to the United States in 1992, grew up in Moscow during Soviet Communism’s final years. The KGB and other state organs could monitor citizens of interest, but most Muscovites shuttled about the gray city anonymously, minding their own business among the masses. Muscovites could escape notice riding in the crowded Metro system or walking in Gorky Park. In some sense it was easier to be anonymous then because there was no massive data collection, and people weren’t leaving digital traces all over the place.
“Of course, if they really wanted to track someone, they had no lack of manpower, they could always assign a man to follow you,” Shmatikov says. “But you could do it for one person, for ten people, for a hundred people, you cannot do it for ten million people. So in that sense the vast majority of the population could be as anonymous as they wanted to be. Now it is all very different because now there really is technical capability to track anything anyone is doing anywhere.”
At first glance, it might appear unlikely that two researchers could identify people who posted anonymous movie reviews on Netflix. Many people watch the same popular movies. Yet some people watch and review those popular movies in combination with obscure ones, creating distinct profiles. An analogy: any two random humans share, on average, 99.9 percent of their DNA; all human variation (and identifiability) is attributable to the remaining 0.1 percent. Such distinctive combinations of movies provide clues that can help unmask a person’s identity, much as a contestant on the television game show Wheel of Fortune pieces together an entire message from partially revealed letters in a sentence. Some people in the Netflix prize dataset had watched and rated more than a thousand movies. Some had even rated more than ten thousand of the seventeen thousand movies then in the collection.
Narayanan came to learn that some cinephiles watch multiple movies a day and freely share their opinions on different sites, including imdb.com, where people often give their names when reviewing movies. Over the first eight days, Narayanan and Shmatikov worked feverishly into the night. By piecing together the names from IMDb with the same sets of movies in the Netflix prize dataset, they identified two Netflix subscribers by name, showing that they could solve the puzzle. They felt no need to go further: they had shown they could reidentify the movie lovers. “We were confident because there were no other matches that were even close,” Narayanan said. “Out of all the other records in the 500,000 dataset there was one good match” for each reidentification. As a further check, they reidentified two colleagues who had shared their Netflix viewing data with them, so in those two cases they knew for sure that their method worked and that they had found the right people.
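The matching idea described above can be illustrated with a small sketch. This is not the researchers' actual algorithm, only a simplified illustration under stated assumptions: the record IDs, movie titles, weighting formula, and `min_gap` threshold are all hypothetical. Each anonymous record is scored by the movies it shares with a named public reviewer, obscure titles count for more than popular ones, and a match is accepted only when no other record comes close.

```python
import math
from collections import Counter


def rarity_weights(anonymous_records):
    """Weight each title so that rarely rated movies count for more."""
    counts = Counter(t for titles in anonymous_records.values() for t in titles)
    # Hypothetical weighting: the fewer records contain a title, the higher its weight.
    return {title: 1.0 / math.log(1 + n) for title, n in counts.items()}


def best_match(public_titles, anonymous_records, min_gap=1.0):
    """Return the anonymous record ID that clearly matches, or None.

    Assumes at least two anonymous records. A match is accepted only if the
    top score beats the runner-up by at least min_gap (an arbitrary value).
    """
    weights = rarity_weights(anonymous_records)
    scores = {
        record_id: sum(weights[t] for t in titles & public_titles)
        for record_id, titles in anonymous_records.items()
    }
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    (top_id, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    return top_id if top_score - runner_up_score >= min_gap else None


# Toy data, invented for illustration only.
anonymous = {
    "id_001": {"Popular A", "Popular B", "Obscure X", "Obscure Y"},
    "id_002": {"Popular A", "Popular B"},
    "id_003": {"Popular A", "Popular B", "Obscure Z"},
}
named_reviewer = {"Popular A", "Obscure X", "Obscure Y"}

print(best_match(named_reviewer, anonymous))
print(best_match({"Popular A", "Popular B"}, anonymous))
```

In this toy example the reviewer who rated two obscure titles stands out, while a reviewer who rated only popular movies ties with every record and yields no match.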
The findings illustrated the privacy dangers that massive amounts of personal data pose, even if stripped of names. Yet academics initially shunned their findings. Narayanan and Shmatikov offered a paper on their research to an academic conference and received a thumbs down. “It is well known that logs can leak lots of private data,” one reviewer said in a rejection note. “It’s not clear whether there is much real novelty/research in this paper.”23 A second conference also said no. Finally, a year later, the same conference that first rejected the paper accepted a revised version. This time, the study received wide public attention—although that did not lead to riches. The million-dollar prize went to a team of data scientists who had come up with a 10.06 percent improvement on the Netflix movie recommendation system—three years after the company first revealed the subscriber recommendations.24
The public attention to the privacy implications of the Netflix data eventually led to a class-action lawsuit. In that complaint, a lesbian said she did not want to reveal her sexual orientation or interest in gay-themed films. “On October 2, 2006, Netflix perpetrated the largest voluntary privacy breach to date, disclosing sensitive and personal identifying consumer information,” the lawsuit said. “The information was not compromised by malicious intruders. Rather, it was given away to the world freely, and with fanfare.”25 Netflix eventually settled the case out of court. In 2010 it canceled plans for a second contest. Today the company would rather forget the whole episode.26
* * *
A few months before Netflix released its movie recommendation data in 2006, email and Internet pioneer AOL published the search histories of 650,000 users over three months—a total of twenty million searches. The company removed the IP addresses of the computers making the searches and instead assigned a unique ID number to each user so that researchers could follow the search patterns. Since users often look for information related to where they live and give clues about their identity over a period of time, two New York Times reporters succeeded in puzzling out the names of some of them.27
After a public outcry, AOL fired the official who released the data; the company’s chief technology officer resigned. AOL quickly tried to remove the data, yet the company suffered a damaging blow to its reputation—and a costly lawsuit. Only in 2013 did a federal judge approve the class-action settlement, which cost AOL up to $5 million, plus $930,000 to cover plaintiffs’ attorney fees. People whose search data were released received $50 to $100.
Even today, one can still download the dataset on the Internet, again showing that once released, information can never be put back into the bottle. “It was a big reminder of the beginnings of what people now refer to as big data,” AOL cofounder Steve Case reflected. “Data that was supposed to be helpful some people were able to use in a way that was not helpful. So it was a wake-up call to our business.28
“These issues are not new issues. What is new is that far more people are online, they are online far more habitually, far more networked, far more places, so therefore there is more tracking of data and more ability to kind of analyze it in ways that can be helpful and also ways that may not be helpful.”
Unmasking sexual orientation or reidentifying people based on clues is obviously far from the business of running a casino, or luring guests into a department store, or any other business. The larger point is that many of the services we enjoy today in different areas of our lives collect data about us. Watching cable television, carrying a cell phone, using social media sites, or visiting a doctor all generate data that are shared widely, even if not with the person generating the data. Much of the information you generate is fairly innocuous. Your hobbies. Your favorite music. Your photos. Any one piece of data would not reveal very much. But continued advances in data mining have made small bits of personal data ever more revealing when combined—and ever more valuable to companies.
Sometimes, these clues lead all the way to the naked truth.
10
The Hunt for a Mystery Woman
Scanty Clues
A Yelp page reviewing Instant Checkmate, in a section called “about the business,” showed the image of a smiling woman.1 It described her as Kristen B., manager of Instant Checkmate, followed by: “I’m Kristen, customer relations director at Instant Checkmate. When I am not responding to facebook [sic] messages, tweets, linked in requests and such, I can be found blogging on various sites. I love my job at Instant Checkmate and I am proud of help [sic] our customers!”
In Las Vegas–datelined press releases and company blogs, the company seemed to leave the talking to Kristen Bright, described as a PR manager, public relations specialist, social media consultant, or spokeswoman.2 Yet Kristen Bright did not respond to any attempts to contact her, either by phone or through email. Operators at the company’s Las Vegas call center said they had never seen her. So I wondered: could a minuscule blurry photograph provide enough information to identify and locate the real woman behind the image?
I copied the photo and loaded it into Google’s image search page. The results led to a photo of a woman on a boat with a bikini top stretched over significant cleavage. Someone had cropped the face from this image and put it on the Yelp page. Running a new search on the full photograph led to different pages with the same woman, photographed in a bikini or sexy underwear. On occasion she wore no top at all.
One view of the mystery woman I was trying to find. Source: Ann, surname withheld at her request.
The homemade, snapshot quality of the photos and a winning smile suggested a certain wholesomeness, even when she posed partially naked. Some tame family photos showed her with a boy, perhaps her son. A few bloggers had created pages in her honor, and admirers wrote in to compliment her beauty and curves. Some wondered where one might be able to find more images. Some blog comments referred to unseen explicit videos. The hunt continued.
The initial Google searches offered several names for the woman, with at least two surnames. Those names helped find other saucy images but no contact details, suggesting she used a stage moniker. The other images did not mention the name Kristen Bright. Searching through the new racy images did lead to a 2010 blog post showing her with a man who described himself as her husband, Tom. He said they were both thirty-eight years old and heading to a Jamaican resort for their twentieth wedding anniversary. For all these clues, the woman’s real name and contact details remained elusive.
Then one day I conducted a search through a background data broker site used by lawyers, insurance companies, law enforcement agencies, and others. I found her stage surname embedded in a man’s email address. That man, Tom, was then forty-one, about right for the husband if he had listed his true age in the Jamaica vacation posting a few years earlier. Tom had also filed a relatively recent Chapter 13 joint bankruptcy petition with his wife, whose name was Ann. Among the debts listed on the court documents: $131 owed to Victoria’s Secret. Might the lingerie chain be the source of some of the skimpy garb modeled in the online photos?
Those documents led to an address and phone number in California. Still, additional proof was needed before calling. After all, lots of couples named Tom and Ann live in the United States. A call to the wrong couple asking about naked photographs might provoke a justifiably angry response. A search of Ann’s real surname linked her to a Los Angeles–area high school where she had worked as a secretary. Deep within the school website lurked an old school newsletter with a photo of the support staff. Standing a bit shyly to the back of the group was a woman wearing a black vest over a white T-shirt. The face looked the same and the top-heavy body dimensions suggested she was indeed Ann.
I dialed a phone number I located for the couple and left a message, saying I was a fellow at Harvard University researching a book. An astonished Ann and Tom called back a few minutes later, wondering how they had been found. I told them about the Yelp listing for Instant Checkmate using her image. Had the website contacted her to gain permission to use her image as the face of the company? It had taken me a long time to find her. If she indeed worked for the data broker, the search would only have shown that she used a different name on the job. But she said not only did she not work for Instant Checkmate, she had never even heard of it or the name Kristen Bright. “Honestly, it’s a little sickening,” she said. Then she joked: “Geez, if they would have asked, I could have sold a better photo!”3
Ann’s story shows that in the Internet era it is possible to piece together clues about one’s true identity with just a little information. Her story turned out to be particularly saucy. Ann and Tom had tried to live a secret alternative erotic life on the Internet, a folly that ended up causing her great embarrassment.
* * *
One spring afternoon in 2011, a California high school principal called a secretary into his office. As Ann entered, she saw the school policeman, complete with gun in holster, seated with the principal around a large conference table. “This is kind of weird,” the principal said. “I wanted the deputy here because I wanted someone else with me so that we wouldn’t be alone in my office when I told you this.”
The head of the school told her they had found a compromising video in which she appeared. “Well, you’re not in trouble or anything. We are mostly concerned about how your son will react,” he said.
The police deputy had a copy of an explicit video on his cell phone.
“Would you like to see what video we are talking about?”
Ann looked around the room. She glanced at a picture of the principal shaking hands with Ronald Reagan and banners showing the school mascot. With horror her eyes fixed upon the wide flat-screen television mounted on the wall. She certainly did not want them replaying her exploits writ large. Her face flushed red, her embarrassment complete.
“No, my husband just called me and told me about it, so I know which video you are talking about.”
It had all started out so innocently. In the early 2000s, Ann and Tom started posting family snapshots on the Internet for family and friends to see. They were slightly ahead of their time, posting on a public site a few years before Mark Zuckerberg talked to his Harvard professor Harry Lewis about building a prototype website for what became Facebook.
At first, the images portrayed tame everyday activities, some including their two children: the chili cook-off, an outing to a lake, a visit to Universal Studios, Father’s Day. But in 2003, the images began attracting compliments from strangers on their online guestbook, especially those showing Ann wearing a revealing top or a bikini.
She and her husband decided to post more revealing images on a password-protected site, inviting their fans to take a peek—without family members stumbling upon the photos. Tom got a buzz from the attention Ann was getting (“I have a HOT wife,” he boasted in one post), and Ann enjoyed the compliments. “We’d both get a little turned on or whatever by what people said. So maybe if we did put racier pictures up it would be hot or it would be sexy,” Ann says.
Ann posing for her husband. Source: Ann, surname withheld at her request.
It excited her that unknown people far away found her sexy and attractive, and this unseen, remote enthusiasm sparked up their love life. “Sometimes we would read them together on the weekends. We’d sit down and go through them and, of course, it was sexually charging to do that, to see what people were saying,” she says.
Fans would request specific images, and they would comply: “We’d get all excited and if the kids were not home, we would take more pictures here and go somewhere on the weekends and post them and wait to see what people thought, hear their comments.”
Eventually family finances tightened, so Tom and Ann went a step further. Their best friends at the time, another couple, suggested the plan, but Tom and Ann became enthusiastic accomplices. They set up their own website at midnight, January 1, 2006, and, for the first time, posted self-filmed sex videos of themselves. The other married couple also posted explicit film clips of themselves. They invited fans to join the site for $24.95 per month, or $59.95 for three months. Within hours, $5,000 in subscription payments had poured in. They split the revenue with the other couple and celebrated their overnight financial success.
Eventually shooting video for the site made marital intimacy feel like a chore. They faced constant interruptions to adjust the lighting, make sure they exposed good angles, change positions, or check the camera. On top of that, they realized they could not prevent other people from reposting their images. Ann’s popularity on the Internet was spreading too far, too fast. Without asking her permission, a dating site for those over forty used her photo, which annoyed her because she was then in her mid-thirties. One blog posted one of her images, apparently drawing the notice of a relative. “That would be my Aunt . . . kind of weird seeing stuff like this on the net,” her niece wrote.