by Hannah Fry
The Wild West
Palantir Technologies is one of the most successful Silicon Valley start-ups of all time. It was founded in 2003 by Peter Thiel (of PayPal fame), and at the last count was estimated to be worth a staggering $20 billion.7 That’s about the same market value as Twitter, although chances are you’ve never heard of it. And yet – trust me when I tell you – Palantir has most certainly heard of you.
Palantir is just one example of a new breed of companies known as data brokers, who buy and collect people’s personal information and then resell it or share it for profit. There are plenty of others: Acxiom, Corelogic, Datalogix, eBureau – a swathe of huge companies you’ve probably never directly interacted with, that are none the less continually monitoring and analysing your behaviour.8
Every time you shop online, every time you sign up for a newsletter, or register on a website, or enquire about a new car, or fill out a warranty card, or buy a new home, or register to vote – every time you hand over any data at all – your information is being collected and sold to a data broker. Remember when you told an estate agent what kind of property you were looking for? Sold to a data broker. Or those details you once typed into an insurance comparison website? Sold to a data broker. In some cases, even your entire browser history can be bundled up and sold on.9
It’s the broker’s job to combine all of that data, cross-referencing the different pieces of information they’ve bought and acquired, and then create a single detailed file on you: a data profile of your digital shadow. In the most literal sense, within some of these brokers’ databases, you could open up a digital file with your ID number on it (an ID you’ll never be told) that contains traces of everything you’ve ever done. Your name, your date of birth, your religious affiliation, your vacation habits, your credit-card usage, your net worth, your weight, your height, your political affiliation, your gambling habits, your disabilities, the medication you use, whether you’ve had an abortion, whether your parents are divorced, whether you’re easily addictable, whether you are a rape victim, your opinions on gun control, your projected sexual orientation, your real sexual orientation, and your gullibility. There are thousands and thousands of details within thousands and thousands of categories and files stored on hidden servers somewhere, for virtually every single one of us.10
Like Target’s pregnancy predictions, much of this data is inferred. A subscription to Wired magazine might imply that you’re interested in technology; a firearms licence might imply that you’re interested in hunting. All along the way, the brokers are using clever, but simple, algorithms to enrich their data. It’s exactly what the supermarkets were doing, but on a massive scale.
And there are plenty of benefits to be had. Data brokers use their understanding of who we are to prevent fraudsters from impersonating unsuspecting consumers. Likewise, knowing our likes and dislikes means that the adverts we’re served as we wander around the internet are as relevant to our interests and needs as possible. That almost certainly makes for a more pleasant experience than being hit with mass market adverts for injury lawyers or PPI claims day after day. Plus, because the messages can be directly targeted on the right consumers, it means advertising is cheaper overall, so small businesses with great products can reach new audiences, something that’s good for everyone.
But, as I’m sure you’re already thinking, there’s also an array of problems that arise once you start distilling who we are as people down into a series of categories. I’ll get on to that in a moment, but first I think it’s worth briefly explaining the invisible process behind how an online advert reaches you when you’re clicking around on the internet, and the role that a data broker plays in the process.
So, let’s imagine I own a luxury travel company, imaginatively called Fry’s. Over the years, I have been getting people to register their interest on my website and now have a list of their email addresses. If I wanted to find out more about my users – like what kind of holidays they were interested in – I could send off my list of users’ emails to a data broker, who would look up the names in their system, and return my list with the relevant data attached. Sort of like adding an extra column on to a spreadsheet. Now when you visit my Fry’s website, I can see that you have a particular penchant for tropical islands and so serve you up an advert for a Hawaii getaway.
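To make the mechanics of that "extra column" concrete, here is a toy sketch of a broker enrichment job in Python. Everything in it is invented for illustration: the email addresses, the attributes and the idea that the broker exposes a simple lookup like this are all assumptions, not a real broker's API.

```python
# Hypothetical sketch of a broker "data append": Fry's sends over a list of
# customer emails, and the broker returns each one enriched with whatever
# inferred attributes it holds on file. All names and data are invented.

broker_database = {
    "alice@example.com": {"holiday_interest": "tropical islands", "age_band": "35-44"},
    "bob@example.com":   {"holiday_interest": "ski resorts",      "age_band": "55-64"},
}

def enrich(customer_emails):
    """Attach the broker's profile to each email - like adding spreadsheet columns."""
    return [
        {"email": email, **broker_database.get(email, {})}
        for email in customer_emails
    ]

enriched = enrich(["alice@example.com", "bob@example.com", "carol@example.com"])
for row in enriched:
    print(row)
```

Note that anyone the broker has never seen (here, carol@example.com) simply comes back with no extra columns, which is why brokers compete on the sheer coverage of their databases.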
That’s option one. In option two, let’s imagine that Fry’s has a little extra space on its website that we’re willing to sell to other advertisers. Again, I contact a data broker and give them the information I have on my users. The broker looks for other companies who want to place adverts. And, for the sake of the story, let’s imagine that a company selling sun cream is keen. To persuade them that Fry’s has the audience the sun-cream seller would want to target, the broker could show them some inferred characteristics of Fry’s users: perhaps the percentage of people with red hair, that kind of thing. Or the sun-cream seller could hand over a list of its own users’ email addresses and the broker could work out exactly how much crossover there was between the audiences. If the sun-cream seller agrees, the advert appears on Fry’s website – and the broker and I both get paid.
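The overlap calculation at the heart of option two is, at its simplest, a set intersection. The sketch below uses invented email lists to show the idea; in practice the lists are usually exchanged as cryptographic hashes of the addresses rather than the raw emails, so neither side sees the other's customers directly.

```python
# Hypothetical audience-overlap check: the broker compares Fry's email list
# with the sun-cream seller's and reports how much the audiences coincide.
# All addresses are invented for illustration.

frys_audience = {"alice@example.com", "bob@example.com", "carol@example.com"}
suncream_audience = {"bob@example.com", "carol@example.com", "dave@example.com"}

overlap = frys_audience & suncream_audience          # shared users
overlap_share = len(overlap) / len(frys_audience)    # as a fraction of Fry's list

print(f"{len(overlap)} shared users, {overlap_share:.0%} of Fry's audience")
```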
So far, these methods don’t go much beyond the techniques that marketers have always used to target customers. But it’s option three where, for me, things start to get a little bit creepy. This time, Fry’s is looking for some new customers. I want to target men and women over 65 who like tropical islands and have large disposable incomes, in the hope that they’ll want to go on one of our luxurious new Caribbean cruises. Off I go to a data broker who will look through their database and find me a list of people who match my description.
So, let’s imagine you are on that list. The broker will never share your name with Fry’s. But they will work out which other websites you regularly use. Chances are, the broker will also have a relationship with one of your favourites. Maybe a social media site, or a news website, something along those lines. As soon as you unsuspectingly log into your favourite website, the broker will get a ping to alert them to the fact that you’re there. Virtually instantaneously, the broker will respond by placing a little flag – known as a cookie – on your computer. This cookiefn1 acts like a signal to all kinds of other websites around the internet, saying that you are someone who should be served up an advert for Fry’s Caribbean cruises. Whether you want them or not, wherever you go on the internet, those adverts will follow you.
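A bare-bones simulation of that cookie flow might look like the following. This is a deliberately simplified model, not the real HTTP mechanics: the segment name, the functions and the dictionary standing in for your browser's cookie jar are all invented for illustration.

```python
# Toy simulation of retargeting: the broker tags your browser with a segment
# flag when you visit a partner site, and any other site working with the
# same broker reads that flag and serves the matching advert.

def broker_sets_cookie(browser_cookies, segment):
    """The broker's code on a partner site tags the visitor's browser."""
    browser_cookies.setdefault("broker_segments", set()).add(segment)

def choose_advert(browser_cookies):
    """Any site carrying the broker's ad slot checks for the flag."""
    segments = browser_cookies.get("broker_segments", set())
    if "frys_caribbean_cruises" in segments:
        return "Fry's Caribbean cruise advert"
    return "generic advert"

cookies = {}                                          # your browser's cookie jar
broker_sets_cookie(cookies, "frys_caribbean_cruises") # you log in to a partner site
print(choose_advert(cookies))                         # the advert now follows you
```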
And here we stumble on the first problem. What if you don’t want to see the advert? Sure, being bombarded with images of Caribbean cruises might be little more than a minor inconvenience, but there are other adverts which can have a much more profound impact on a person.
When Heidi Waterhouse lost a much-wanted pregnancy,11 she unsubscribed from all the weekly emails updating her on her baby’s growth, telling her which fruit the foetus now matched in size. She unsubscribed from all the mailing lists and wish lists she had signed up to in eager anticipation of the birth. But, as she told an audience of developers at a conference in 2018, there was no power on earth that could unsubscribe her from the pregnancy adverts that followed her around the internet. This digital shadow of a pregnancy continued to circulate alone, without the mother or the baby. ‘Nobody who built that system thought of that consequence,’ she explained.
It’s a system which, thanks to either thoughtless omission or deliberate design, has the potential to be exploitative. Payday lenders can use it to directly target people with bad credit ratings; betting adverts can be directed to people who frequent gambling websites. And there are concerns about this kind of data profiling being used against people, too: motorbike enthusiasts being deemed to have a risky hobby, or people who eat sugar-free sweets being flagged as diabetic and turned down for insurance as a result. A study from 2015 demonstrated that Google was serving far fewer ads for high-paying executive jobs to women who were surfing the web than to men.12 And, after one African American Harvard professor learned that Googling her own name returned adverts targeted on people with a criminal record (and as a result was forced to prove to a potential employer that she’d never been in trouble with the police), she began researching the adverts delivered to different ethnic groups. She discovered that searches for ‘black-sounding names’ were disproportionately more likely to be linked to adverts containing the word ‘arrest’ (e.g. ‘Have you been arrested?’) than searches for ‘white-sounding names’.13
These methods aren’t confined to data brokers. There’s very little difference between how they work and how Google, Facebook, Instagram and Twitter operate. These internet giants don’t make money simply by having users; their business models are built on micro-targeting. They are gigantic engines for delivering adverts, making money by keeping their millions of users actively engaged on their websites, clicking around, reading sponsored posts, watching sponsored videos, looking at sponsored photos. Whatever corner of the internet you use, these algorithms are hiding in the background, trading on information you didn’t know they had and never willingly offered. They have made your most personal, private secrets into a commodity.
Unfortunately, in many countries, the law doesn’t do much to protect you. Data brokers are largely unregulated and – particularly in America – opportunities to curb their power have repeatedly been passed over by government. In March 2017, for instance, the US Senate voted to eliminate rules that would have prevented data brokers from selling your internet browser history without your explicit consent. Those rules had previously been approved in October 2016 by the Federal Communications Commission; but, after the change in government at the end of that year, they were opposed by the FCC’s new Republican majority and Republicans in Congress.14
So what does all this mean for your privacy? Well, let me tell you about an investigation led by German journalist Svea Eckert and data scientist Andreas Dewes that should give you a clear idea.15
Eckert and her team set up a fake data broker and used it to buy the anonymous browsing data of 3 million German citizens. (Getting hold of people’s internet histories was easy. Plenty of companies had an abundance of that kind of data for sale on British or US customers – the only challenge was finding data focused on Germany.) The data itself had been gathered by a Google Chrome plugin that users had willingly downloaded, completely unaware that it was spying on them in the process.fn2
In total, it amounted to a gigantic list of URLs. A record of everything those people had looked at online over the course of a month. Every search, every page, every click. All legally put up for sale.
For Eckert and her colleagues, the only problem was that the browser data was anonymous. Good news for all the people whose histories had been sold. Right? Should save their blushes. Wrong. As the team explained in a presentation at DEFCON in 2017, de-anonymizing huge databases of browser history was spectacularly easy.
Here’s how it worked. Sometimes there were direct clues to the person’s identity in the URLs themselves. Like anyone who visited Xing.com, the German equivalent of LinkedIn. If you click on your profile picture on the Xing website, you are sent through to a page with an address that will be something like the following:
www.xing.com/profile/Hannah_Fry?sc_omxb_p
Instantly, the name gives you away, while the text after the username signifies that the user is logged in and viewing their own profile, so the team could be certain that the individual was looking at their own page. It was a similar story with Twitter. Anyone checking their own Twitter analytics page was revealing themselves to the team in the process. For those without an instant identifier in their data, the team had another trick up their sleeve. Anyone who posted a link online – perhaps by tweeting about a website, or sharing their public playlist on YouTube – essentially, anyone who left a public trace of their data shadow attached to their real name, was inadvertently unmasking themselves in the process. The team used a simple algorithm to cross-reference the public and anonymized personas,16 filtering their list of URLs to find someone in the dataset who had visited the same websites at the same times and dates that the links were posted online. Eventually, they had the full names of virtually everyone in the dataset, and full access to a month’s worth of complete browsing history for millions of Germans as a result.
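The cross-referencing step can be sketched in a few lines of Python. This is a simplified illustration of the idea Eckert and Dewes described, not their actual code: the records, names and URLs below are all invented, and a real attack would match on approximate times and multiple posts rather than exact timestamps.

```python
# Toy de-anonymization: anyone who publicly shared a link (real name, URL,
# timestamp) can be matched against the anonymous user whose browsing history
# shows a visit to that same URL at that same time. All data is invented.

from collections import defaultdict

anonymous_history = [
    ("user_17", "https://example.com/article", "2016-08-01 18:22"),
    ("user_17", "https://news.example/story",  "2016-08-01 18:25"),
    ("user_42", "https://example.com/article", "2016-08-02 09:10"),
]

public_posts = [
    ("Hannah Fry", "https://example.com/article", "2016-08-01 18:22"),
]

def deanonymize(history, posts):
    """Map anonymous IDs to real names wherever a public post lines up with a visit."""
    visits = defaultdict(set)
    for user_id, url, timestamp in history:
        visits[(url, timestamp)].add(user_id)
    identified = {}
    for name, url, timestamp in posts:
        candidates = visits[(url, timestamp)]
        if len(candidates) == 1:  # a unique match unmasks the anonymous user
            identified[next(iter(candidates))] = name
    return identified

result = deanonymize(anonymous_history, public_posts)
print(result)  # → {'user_17': 'Hannah Fry'}
```

Once a single public trace pins an anonymous ID to a name, the attacker gets the person's entire browsing history for free, which is what made the technique so devastating at scale.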
Among those 3 million people were several high-profile individuals. They included a politician who had been searching for medication online. A police officer who had copied and pasted a sensitive case document into Google Translate, all the details of which then appeared in the URL and were visible to the researchers. And a judge, whose browsing history showed a daily visit to one rather specific area of the internet. Here is a small selection of the websites he visited during one eight-minute period in August 2016:
18.22: http://www.tubegalore.com/video/amature-pov–ex-wife-in-leather-pants-gets-creampie42945.html
18.23: http://www.xxkingtube.com/video/pov_wifey_on_sex_stool_with_beaded_thong_gets_creampie_4814.html
18.24: http://de.xhamster.com/movies/924590/office_lady_in_pants_rubbing_riding_best_of_anlife.html
18.27: http://www.tubegalore.com/young_tube/5762–1/page0
18.30: http://www.keezmovies.com/video/sexy-dominatrix-milks-him-dry-1007114?utm_sources
In among these daily browsing sessions, the judge was also regularly searching for baby names, strollers and maternity hospitals online. The team concluded that his partner was expecting a baby at the time.
Now, let’s be clear here: this judge wasn’t doing anything illegal. Many – myself included – would argue that he wasn’t doing anything wrong at all. But this material would none the less be useful in the hands of someone who wanted to blackmail him or embarrass his family.
And that is where we start to stray very far over the creepy line. When private, sensitive information about you, gathered without your knowledge, is then used to manipulate you. Which, of course, is precisely what happened with the British political consulting firm Cambridge Analytica.
Cambridge Analytica
You probably know most of the story by now.
Since the 1980s, psychologists have been using a system of five characteristics to quantify an individual’s personality. You get a score on each of the following traits: openness to experience, conscientiousness, extraversion, agreeableness and neuroticism. Collectively, they offer a standard and useful way to describe what kind of a person you are.
Back in 2012, a year before Cambridge Analytica came on the scene, a group of scientists from the University of Cambridge and Stanford University began looking for a link between the five personality traits and the pages people ‘liked’ on Facebook.17 They built a Facebook quiz with this purpose in mind, allowing users to take real psychometric tests, while hoping to find a connection between a person’s true character and their online persona. People who downloaded their quiz knowingly handed over data on both: the history of their Likes on Facebook and, through a series of questions, their true personality scores.
It’s easy to imagine how Likes and personality might be related. As the team pointed out in the paper they published the following year,18 people who like Salvador Dalí, meditation or TED talks are almost certainly going to score highly on openness to experience. Meanwhile, people who like partying, dancing and Snooki from the TV series Jersey Shore tend to be a bit more extraverted. The research was a success. With a connection established, the team built an algorithm that could infer someone’s personality from their Facebook Likes alone.
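One way to picture how such an algorithm works is as a weighted score: each Like carries small learned weights on the five traits, and a profile's predicted personality is just the average of the weights of its Likes. The sketch below is a toy illustration under that assumption; the weights are invented, whereas the real model was fitted to data from tens of thousands of quiz-takers.

```python
# Toy Likes-to-personality predictor. Each Like contributes invented weights
# on the Big Five traits; a profile's score per trait is the average over its
# Likes. This illustrates the idea only, not the published model.

LIKE_WEIGHTS = {
    "Salvador Dali": {"openness": 0.8, "extraversion": 0.1},
    "TED talks":     {"openness": 0.7, "extraversion": 0.2},
    "partying":      {"openness": 0.1, "extraversion": 0.9},
}

TRAITS = ["openness", "conscientiousness", "extraversion",
          "agreeableness", "neuroticism"]

def predict_personality(likes):
    """Average the per-Like weights to get a crude score for each trait."""
    known = [LIKE_WEIGHTS[like] for like in likes if like in LIKE_WEIGHTS]
    if not known:
        return {trait: 0.0 for trait in TRAITS}
    return {trait: sum(w.get(trait, 0.0) for w in known) / len(known)
            for trait in TRAITS}

scores = predict_personality(["Salvador Dali", "TED talks"])
print(scores["openness"])  # a Dalí-and-TED profile scores high on openness
```

The striking research finding was simply that, with enough Likes and enough training data, averages like these become more accurate than the judgements of the people who know you best.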
By the time their second study appeared in 2014,19 the research team were claiming that if you could collect 300 Likes from someone’s Facebook profile, the algorithm would be able to judge their character more accurately than their spouse could.
Fast-forward to today, and the academic research group – the Psychometrics Centre at Cambridge University – have extended their algorithm to make personality predictions from your Twitter feed too. They have a website, open to anyone, where you can try it for yourself. Since my Twitter profile is open to the public anyway, I thought I’d try out the researchers’ predictions myself, so uploaded my Twitter history and filled out a traditional questionnaire-based personality study to compare. The algorithm managed to assess me accurately on three of the five traits. Although, as it turns out, according to the traditional personality study I am much more extraverted and neurotic than my Twitter profile makes it seem.fn3
All this work was motivated by how it could be used in advertising. So, by 2017,20 the same team of academics had moved on to experimenting with sending out adverts tailored to an individual’s personality traits. Using the Facebook platform, the team served up adverts for a beauty product to extraverts using the slogan ‘Dance like no one’s watching (but they totally are)’, while introverts saw an image of a girl smiling and standing in front of the mirror with the phrase ‘Beauty doesn’t have to shout.’
In a parallel experiment, targets high in openness-to-experience were shown adverts for crossword puzzles using an image with the text: ‘Aristoteles? The Seychelles? Unleash your creativity and challenge your imagination with an unlimited number of crossword puzzles!’ The same puzzles were advertised to people low in openness, but using instead the wording: ‘Settle in with an all-time favorite! The crossword puzzle that has challenged players for generations.’ Overall, the team claimed that matching adverts to a person’s character led to 40 per cent more clicks and up to 50 per cent more purchases than using generic, unpersonalized ads. For an advertiser, that’s pretty impressive.
All the while, as the academics were publishing their work, others were implementing their methods. Among them, so it is alleged, was Cambridge Analytica during their work for Trump’s election campaign.
Now, let’s backtrack slightly. There is little doubt that Cambridge Analytica were using the same techniques as my imaginary luxury travel agency Fry’s. Their approach was to identify small groups of people who they believed to be persuadable and target them directly, rather than send out blanket advertising. As an example, they discovered that there was a large degree of overlap between people who bought good, American-made Ford motor cars and people who were registered as Republican party supporters. So they then set about finding people who had a preference for Ford, but weren’t known Republican voters, to see if they could sway their opinions using all-American adverts that tapped into that patriotic emotion. In some sense, this is no different from a candidate identifying a particular neighbourhood of swing voters and going door-to-door to persuade them one by one. And online, it’s no different from what Obama and Clinton were doing during their campaigns. Every major political party in the Western world uses extensive analysis and micro-targeting of voters.