Data and Goliath

by Bruce Schneier


  There’s more hidden surveillance going on in the streets. Those contactless RFID chip cards in your wallet can be used to track people. Many retail stores are surreptitiously tracking people by the MAC addresses and Bluetooth IDs—which are basically identification numbers—broadcast by their smartphones. The goal is to record which aisles they walk down, which products they stop to look at, and so on. People can be tracked at public events by means of both these approaches.
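
  What stores do with those broadcast identifiers is mostly an exercise in aggregation. Here is a minimal sketch, in Python, of how sightings from in-store sensors could be turned into per-shopper paths and dwell times; the device IDs, zones, and timestamps are invented for illustration:

```python
from collections import defaultdict

# Invented input: passive sightings logged by in-store sensors.  Each record
# is (device_id, zone, timestamp): a broadcast identifier such as a hashed
# MAC address, the aisle or department where the sensor sits, and when the
# device was seen.
observations = [
    ("aa:bb:cc:01", "entrance", 1000),
    ("aa:bb:cc:01", "aisle-7",  1060),
    ("aa:bb:cc:01", "aisle-7",  1150),
    ("aa:bb:cc:01", "checkout", 1300),
    ("aa:bb:cc:02", "entrance", 1020),
    ("aa:bb:cc:02", "aisle-2",  1080),
]

def paths_by_device(observations):
    """Group sightings by device and order them in time, yielding each
    shopper's path through the store."""
    paths = defaultdict(list)
    for device_id, zone, ts in sorted(observations, key=lambda r: r[2]):
        paths[device_id].append((ts, zone))
    return paths

def dwell_times(path):
    """Rough dwell time per zone: the gap until the next sighting."""
    dwell = defaultdict(int)
    for (ts, zone), (next_ts, _) in zip(path, path[1:]):
        dwell[zone] += next_ts - ts
    return dict(dwell)

for device, path in paths_by_device(observations).items():
    print(device, [zone for _, zone in path], dwell_times(path))
```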

  In 2014, a senior executive from the Ford Motor Company told an audience at the Consumer Electronics Show, “We know everyone who breaks the law, we know when you’re doing it. We have GPS in your car, so we know what you’re doing.” This came as a shock and surprise, since no one knew Ford had its car owners under constant surveillance. The company quickly retracted the remarks, but the comments left a lot of wiggle room for Ford to collect data on its car owners. We know from a Government Accountability Office report that both automobile companies and navigational aid companies collect a lot of location data from their users.

  Radar in the terahertz range can detect concealed weapons on people, and objects through eight inches of concrete wall. Cameras can “listen” to phone conversations by focusing on nearby objects like potato chip bags and measuring their vibrations. The NSA, and presumably others, can turn your cell phone’s microphone on remotely, and listen to what’s going on around it.

  There are body odor recognition systems under development, too. On the Internet, one company is working on identifying people by their typing style. There’s research into identifying people by their writing style. Both corporations and governments are harvesting tens of millions of voiceprints—yet another way to identify you in real time.

  This is the future. Store clerks will know your name, address, and income level as soon as you walk through the door. Billboards will know who you are, and record how you respond to them. Grocery store shelves will know what you usually buy, and exactly how to entice you to buy more of it. Your car will know who is in it, who is driving, and what traffic laws that driver is following or ignoring. Even now, it feels a lot like science fiction.

  As surveillance fades into the background, it becomes easier to ignore. And the more intrusive a surveillance system is, the more likely it is to be hidden. Many of us would refuse a drug test before being hired for an office job, but many companies perform invasive background checks on all potential employees. Likewise, being tracked by hundreds of companies on the Internet—companies you’ve never interacted with or even heard of—feels much less intrusive than a hundred market researchers following us around taking notes.

  In a sense, we’re living in a unique time in history; many of our surveillance systems are still visible to us. Identity checks are common, but they still require us to show our ID. Cameras are everywhere, but we can still see them. In the near future, because these systems will be hidden, we may unknowingly acquiesce to even more surveillance.

  AUTOMATIC SURVEILLANCE

  A surprising amount of surveillance happens to us automatically, even if we do our best to opt out. It happens because we interact with others, and they’re being monitored.

  Even though I never post or friend anyone on Facebook—I have a professional page, but not a personal account—Facebook tracks me. It maintains a profile of non-Facebook users in its database. It tracks me whenever I visit a page with a Facebook “Like” button. It can probably make good guesses about who my friends are based on tagged photos, and it may well have the profile linked to other information it has purchased from various data brokers. My friends, and those sites with the Like buttons, allow Facebook to surveil me through them.
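
  The mechanics are worth sketching. When a page embeds a third-party widget, your browser fetches that widget from the third party’s servers, sending along the third party’s cookie and the address of the page that embedded it. A minimal Python simulation of that linking step, using made-up cookies and URLs:

```python
from collections import defaultdict

# Invented log of the requests a browser makes for an embedded widget
# (a "Like" button, an analytics beacon, and so on).  Each request carries
# the tracker's own cookie plus the URL of the page that embedded the widget
# (the Referer header), even if the visitor has no account with the tracker.
embed_requests = [
    {"tracker_cookie": "anon-4f2a", "referer": "https://news.example/story-123"},
    {"tracker_cookie": "anon-4f2a", "referer": "https://shop.example/running-shoes"},
    {"tracker_cookie": "anon-4f2a", "referer": "https://health.example/symptom-checker"},
    {"tracker_cookie": "anon-9c1b", "referer": "https://news.example/story-456"},
]

def build_profiles(requests):
    """Link visits across unrelated sites by the tracker's cookie, producing
    a browsing history per visitor -- account holder or not."""
    profiles = defaultdict(list)
    for req in requests:
        profiles[req["tracker_cookie"]].append(req["referer"])
    return dict(profiles)

print(build_profiles(embed_requests))
```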

  I try not to use Google search. But Google still collects a lot of information about the websites I visit, because so many of them use Google Analytics to track their visitors. Again, those sites let Google track me through them. I use various blockers in my browser so Google can’t track me very well, but it’s working on technologies that will circumvent my privacy practices.

  I also don’t use Gmail. Instead, I use a local ISP and store all of my e-mail on my computer. Even so, Google has about a third of my messages, because many of the people I correspond with use Gmail. It’s not just Gmail.com addresses; Google hosts a lot of organizations’ e-mail, even though those organizations keep their domain name addresses. There are other examples. Apple has a worldwide database of Wi-Fi passwords, including my home network’s, from people backing up their iPhones. Many companies have my contact information because my friends and colleagues back up their address books in the cloud. If my sister publishes her genetic information, then half of mine becomes public as well.

  Sometimes data we only intend to share with a few becomes surveillance data for the world. Someone might take a picture of a friend at a party and post it on Facebook so her other friends can see it. Unless she specifies otherwise, that picture is public. It’s still hard to find, of course—until it’s tagged by an automatic face recognition system and indexed by a search engine. Now that photo can be easily found with an image search.

  I am constantly appearing on other people’s surveillance cameras. In cities like London, Chicago, Mexico City, and Beijing, the police forces have installed surveillance cameras all over the place. In other cities, like New York, the cameras are mostly privately owned. We saw the difference in two recent terrorism cases. The London subway bombers were identified by government cameras, and the Boston Marathon bombers by private cameras attached to businesses.

  That data is almost certainly digital. Often it’s just stored on the camera, on an endless loop that erases old data as it records new data. But increasingly, that surveillance video is available on the Internet and being saved indefinitely—and a lot of it is publicly searchable.

  Unless we take steps to prevent it, being captured on camera will get even less avoidable as life recorders become more prevalent. Once enough people regularly record video of what they are seeing, you’ll be in enough of their video footage that it’ll no longer matter whether or not you’re wearing one. It’s kind of like herd immunity, but in reverse.

  UBIQUITOUS SURVEILLANCE

  Philosopher Jeremy Bentham conceived of his “panopticon” in the late 1700s as a way to build cheaper prisons. His idea was a prison where every inmate could be surveilled at any time, unawares. The inmate would have no choice but to assume that he was always being watched, and would therefore conform. This idea has been used as a metaphor for mass personal data collection, both on the Internet and off.

  On the Internet, surveillance is ubiquitous. All of us are being watched, all the time, and that data is being stored forever. This is what an information-age surveillance state looks like, and it’s efficient beyond Bentham’s wildest dreams.

  3

  Analyzing Our Data

  In 2012, the New York Times published a story on how corporations analyze our data for advertising advantages. The article revealed that Target Corporation could determine from a woman’s buying patterns that she was pregnant, and would use that information to send the woman ads and coupons for baby-related items. The story included an anecdote about a Minneapolis man who’d complained to a Target store that had sent baby-related coupons to his teenage daughter, only to find out later that Target was right.

  The general practice of amassing and saving all kinds of data is called “big data,” and the science and engineering of extracting useful information from it is called “data mining.” Companies like Target mine data to focus their advertising. Barack Obama mined data extensively in his 2008 and 2012 presidential campaigns for the same purpose. Auto companies mine the data from your car to design better cars; municipalities mine data from roadside sensors to understand driving conditions. Our genetic data is mined for all sorts of medical research. Companies like Facebook and Twitter mine our data for advertising purposes, and have allowed academics to mine their data for social research.

  Most of these are secondary uses of the data. That is, they are not the reason the data was collected in the first place. In fact, that’s the basic promise of big data: save everything you can, and someday you’ll be able to figure out some use for it all.

  Big data sets derive value, in part, from the inferences that can be made from them. Some of these are obvious. If you have someone’s detailed location data over the course of a year, you can infer what his favorite restaurants are. If you have the list of people he calls and e-mails, you can infer who his friends are. If you have the list of Internet sites he visits—or maybe a list of books he’s purchased—you can infer his interests.
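
  Those obvious inferences are often little more than frequency counts. A minimal sketch, using invented location records, of pulling someone’s favorite restaurants out of a year of sightings:

```python
from collections import Counter

# Invented location records: (timestamp, venue) pairs, after raw coordinates
# have been matched to named places.
visits = [
    ("2014-01-03 19:10", "Taqueria Cancun"),
    ("2014-01-09 12:40", "Blue Bottle Coffee"),
    ("2014-01-17 19:05", "Taqueria Cancun"),
    ("2014-02-02 20:15", "Taqueria Cancun"),
    ("2014-02-11 12:35", "Blue Bottle Coffee"),
]

def favorite_places(visits, top=3):
    """The obvious inference: the places this person visits most often."""
    return Counter(venue for _, venue in visits).most_common(top)

print(favorite_places(visits))
# [('Taqueria Cancun', 3), ('Blue Bottle Coffee', 2)]
```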

  Some inferences are more subtle. A list of someone’s grocery purchases might imply her ethnicity. Or her age and gender, and possibly religion. Or her medical history and drinking habits. Marketers are constantly looking for patterns that indicate someone is about to do something expensive, like get married, go on vacation, buy a home, have a child, and so on. Police in various countries use these patterns as evidence, either in a court or in secret. Facebook can predict race, personality, sexual orientation, political ideology, relationship status, and drug use on the basis of Like clicks alone. The company knows you’re engaged before you announce it, and gay before you come out—and its postings may reveal that to other people without your knowledge or permission. Depending on the country you live in, that could merely be a major personal embarrassment—or it could get you killed.

  There are a lot of errors in these inferences, as all of us who’ve seen Internet ads that are only vaguely interesting can attest. But when the ads are on track, they can be eerily creepy—and we often don’t like it. It’s one thing to see television ads for hemorrhoid suppositories or services to help you find a girlfriend, where we know they’re being seen by everyone. But when we know they’re targeted at us specifically, based on what we’ve posted or liked on the Internet, it can feel much more invasive. This makes for an interesting tension: data we’re willing to share can imply conclusions that we don’t want to share. Many of us are happy to tell Target our buying patterns for discounts and notifications of new products we might like to buy, but most of us don’t want Target to figure out that we’re pregnant. We also don’t want the large data thefts and fraud that inevitably accompany these large databases.

  When we think of computers using all of our data to make inferences, we have a very human way of thinking about it. We imagine how we would make sense of the data, and project that process onto computers. But that’s not right. Computers and people have different strengths, weaknesses, and limits. Computers can’t abstractly reason nearly as well as people, but they can process enormous amounts of data ever more quickly. (If you think about it, this means that computers are better at working with metadata than they are at handling conversational data.) And they’re constantly improving; computing power is still doubling every eighteen months, while our species’ brain size has remained constant. Computers are already far better than people at processing quantitative data, and they will continue to improve.

  Right now, data mining is a hot technology, and there’s a lot of hype and opportunism around it. It’s not yet entirely clear what kinds of research will be possible, or what the true potential of the field is. But what is clear is that data-mining technology is becoming increasingly powerful and is enabling observers to draw ever more startling conclusions from big data sets.

  SURVEILLING BACKWARDS IN TIME

  One new thing you can do by applying data-mining technology to mass-surveillance data is go backwards in time. Traditional surveillance can only learn about the present and future: “Follow him and find out where he’s going next.” But if you have a database of historical surveillance information on everyone, you can do something new: “Look up that person’s location information, and find out where he’s been.” Or: “Listen to his phone calls from last week.”

  Some of this has always been possible. Historically, governments have collected all sorts of data about the past. In the McCarthy era, for example, the government used political party registrations, subscriptions to magazines, and testimonies from friends, neighbors, family, and colleagues to gather data on people. The difference now is that the capability is more like a Wayback Machine: the data is more complete and far cheaper to get, and the technology has evolved to enable sophisticated historical analysis.

  For example, in recent years Credit Suisse, Standard Chartered Bank, and BNP Paribas all admitted to violating laws prohibiting money transfers to sanctioned groups. They deliberately altered transactions to evade algorithmic surveillance and detection by “OFAC filters”—that’s the Office of Foreign Assets Control within the Department of the Treasury. Untangling this sort of wrongdoing involved a massive historical analysis of banking transactions and employee communications.

  Similarly, someone could go through old data with new analytical tools. Think about genetic data. There’s not yet a lot we can learn from someone’s genetic data, but ten years from now—who knows? We saw something similar happen during the Tour de France doping scandals; blood taken from riders years earlier was tested with new technologies, and widespread doping was detected.

  The NSA stores a lot of historical data, which I’ll talk about more in Chapter 5. We know that in 2008 a database called XKEYSCORE routinely held voice and e-mail content for just three days, but it held metadata for a month. One called MARINA holds a year’s worth of people’s browsing history. Another NSA database, MYSTIC, was able to store recordings of all the phone conversations in the Bahamas. The NSA stores telephone metadata for five years.

  These storage limits pertain to the raw trove of all data gathered. If an NSA analyst touches something in the database, the agency saves it for much longer. If your data is the result of a query into these databases, your data is saved indefinitely. If you use encryption, your data is saved indefinitely. If you use certain keywords, your data is saved indefinitely.

  How long the NSA stores data is more a matter of storage capacity than of respect for privacy. We know the NSA needed to increase its storage capacity to hold all the cell phone location data it was collecting. As data storage gets cheaper, assume that more of this data will be stored longer. This is the point of the NSA’s Utah Data Center.

  The FBI stores our data, too. During the course of a legitimate investigation in 2013, the FBI obtained a copy of all the data on a site called Freedom Hosting, including stored e-mails. Almost all the data was unrelated to the investigation, but the FBI kept a copy of the entire site and has been accessing it for unrelated investigations ever since. The state of New York retains license plate scanning data for at least five years and possibly indefinitely.

  Any data—Facebook history, tweets, license plate scanner data—can basically be retained forever, or until the company or government agency decides to delete it. In 2010, different cell phone companies held text messages for durations ranging from 90 days to 18 months. AT&T beat them all, hanging on to the data for seven years.

  MAPPING RELATIONSHIPS

  Mass-surveillance data permits mapping of interpersonal relationships. In 2013, when we first learned that the NSA was collecting telephone calling metadata on every American, there was much ado about so-called hop searches and what they mean. They’re a new type of search, theoretically possible before computers but only really practical in a world of mass surveillance. Imagine that the NSA is interested in Alice. It will collect data on her, and then data on everyone she communicates with, and then data on everyone they communicate with, and then data on everyone they communicate with. That’s three hops away from Alice, which is the maximum the NSA worked with.

  The intent of hop searches is to map relationships and find conspiracies. Making sense of the data requires being able to cull out the overwhelming majority of innocent people who are caught in this dragnet, and the phone numbers common to unrelated people: voice mail services, pizza restaurants, taxi companies, and so on.
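
  Mechanically, a hop search is a breadth-first expansion of a communications graph, with the busiest “hub” numbers pruned out. A toy Python version, using an invented call graph and an arbitrary hub threshold rather than whatever criteria the NSA actually uses:

```python
from collections import deque

# Invented call graph: each phone maps to the set of phones it has
# communicated with.
call_graph = {
    "alice":  {"bob", "carol", "pizza-line"},
    "bob":    {"alice", "dave", "pizza-line"},
    "carol":  {"alice", "erin"},
    "dave":   {"bob", "frank"},
    "erin":   {"carol"},
    "frank":  {"dave"},
    "pizza-line": {"alice", "bob", "erin", "frank"},  # a number everyone calls
}

def hop_search(graph, seed, max_hops=3, hub_threshold=3):
    """Breadth-first expansion from a seed identifier, up to max_hops away.
    Numbers contacted by many unrelated people (voice mail, pizza delivery,
    taxi dispatch) are treated as hubs and not expanded further -- a crude
    version of the culling described above."""
    hubs = {n for n, contacts in graph.items() if len(contacts) > hub_threshold}
    seen = {seed: 0}
    queue = deque([seed])
    while queue:
        node = queue.popleft()
        if seen[node] == max_hops or node in hubs:
            continue
        for contact in graph.get(node, ()):
            if contact not in seen:
                seen[contact] = seen[node] + 1
                queue.append(contact)
    return seen  # identifier -> hop distance from the seed

print(hop_search(call_graph, "alice"))
```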

  NSA documents note that the agency had 117,675 “active surveillance targets” on one day in 2013. Even using conservative estimates of how many conversants each person has and how much they overlap, the total number of people being surveilled by this system easily exceeded 20 million. It’s the classic “six degrees of separation” problem; most of us are only a few hops away from everyone else. In 2014, President Obama directed the NSA to conduct two-hop analysis only on telephone metadata collected under one particular program, but he didn’t place any restrictions on NSA hops for all the other data it collects.
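
  The arithmetic behind that figure is easy to reproduce with deliberately conservative, invented assumptions:

```python
# A back-of-envelope check on the "easily exceeded 20 million" claim, using
# invented, deliberately conservative numbers (not figures from the NSA
# documents).
targets = 117_675

# Assume each person in the chain adds only 6 contacts that nobody else in
# that target's chain shares.  Three hops out, one target then pulls in:
unique_per_target = 6 + 6**2 + 6**3   # 258 people

print(targets * unique_per_target)    # 30,360,150 -- already past 20 million
```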

  Metadata from various sources is great for mapping relationships. Most of us use the Internet for social interaction, and our relationships show up in that. This is what both the NSA and Facebook do, and it’s why the latter is so unnervingly accurate when it suggests people you might know whom you’re not already Facebook friends with. One of Facebook’s most successful advertising programs involves showing ads not just to people who Like a particular page or product, but to their friends and to friends of their friends.

  FINDING US BY WHAT WE DO

  Once you have collected data on everyone, you can search for individuals based on their behavior. Maybe you want to find everyone who frequents a certain gay bar, or reads about a particular topic, or has a particular political belief. Corporations do this regularly, using mass-surveillance data to find potential customers with particular characteristics, or to find potential hires among people who have published on a particular topic.

  One can search for things other than names and other personal identifiers like identification numbers, phone numbers, and so on. Google, for example, searches all of your Gmail and uses keywords it finds to more intimately understand you, for advertising purposes. The NSA does something similar: what it calls “about” searches. Basically, it searches the contents of everyone’s communications for a particular name or word—or maybe a phrase. So in addition to examining Alice’s data and the data of everyone within two or three hops of her, it can search everyone else—the entire database of communications—for mentions of her name. Or if it doesn’t know a name, but knows the name of a particular location or project, or a code name that someone has used, it can search on that. For example, the NSA targets people who search for information on popular Internet privacy and anonymity tools.
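
  The difference between a targeted search and an “about” search is easy to show on a toy message store; the messages and field names below are invented, not drawn from any NSA system:

```python
# Invented store of intercepted messages -- everyone's, not just a target's.
messages = [
    {"from": "alice", "to": "bob",   "body": "Lunch tomorrow?"},
    {"from": "carol", "to": "dave",  "body": "Did you hear what Alice said about the project?"},
    {"from": "erin",  "to": "frank", "body": "You should try Tor if you want some privacy."},
]

def targeted_search(messages, selector):
    """Conventional search: messages a named target sent or received."""
    return [m for m in messages if selector in (m["from"], m["to"])]

def about_search(messages, keyword):
    """An 'about' search: scan the contents of everyone's communications for
    a name, code name, or keyword, regardless of sender or recipient."""
    return [m for m in messages if keyword.lower() in m["body"].lower()]

print(targeted_search(messages, "alice"))   # only the message she sent
print(about_search(messages, "Alice"))      # the third-party message that mentions her
print(about_search(messages, "Tor"))        # anyone discussing a privacy tool
```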

 
