The Numerati


by Stephen Baker


  But one big problem reared its head that spring. As the snow began to recede from Flatiron, Nicolov and others began to see a new and hazardous specimen showing up in their results. Spam blogs, or splogs, they called them.

  The purpose of splogs was to use the immense power of Google to cash in on the fast-growing field of blog advertising. Google offered a service called Adsense. If you signed up for it, Google would automatically place relevant advertisements onto your blog or Web page. If you wrote about weddings, the system would detect this and drop in ad banners, say, for flowers, gowns, and tuxedos. If a reader clicked the banner, the advertiser would pay Google a few cents, and Google would share the take with the blogger. For bloggers, it looked like a great way to bring in advertising revenue with absolutely no sales staff. Just click the box, blog energetically, and wait for the check from Google. But when I surveyed bloggers that spring and asked how they were faring, most of them complained. The checks usually weren’t enough to keep them caffeinated, much less housed and fed.

  Robots, it turned out, were running away with much of the money. These software programs spawned blogs by the hundreds of thousands (many of them on Google’s free blog service) and engineered them to attract Google ads. These splogs then circulated with all the human blogs and elbowed aside millions of them to harvest valuable clicks. Here’s how. Picture a future bride looking for a wedding blog. She types “wedding” into the query box at a blog search engine. The most recent post with that word in it appears at the top of the blog results. She clicks it. Chances are, she’s disappointed to see a dog’s breakfast of sentence fragments featuring the word wedding. The blog is utter trash. It’s engineered by an automatic program not to be read, but simply to entice Google’s robots to drop advertisements onto the page. Maybe the bride retreats from the splog back to the search engine and looks for a legitimate blog. Then again, maybe she clicks on one of the advertisements. Ka-ching! The splogger gets fifty cents, a dollar, maybe two dollars. The bride may not realize as she clicks the ad that she’s the only human in a drama dominated by robots.

  This is bound to happen more and more. Because our information travels by itself, untethered from our bodies, machines can forge and plagiarize human communication on a massive scale. This poses a never-ending challenge in the world of the Numerati: the better automatic systems understand us, the better they can pretend to be us.

  The phenomenal growth of splogs in 2005 threatened Umbria’s entire business. Suddenly, the company’s market research was reflecting the views, concerns, and consumer habits of . . . androids. Who would pay for that? “If you don’t take care of spam,” Nicolov says, “it makes your analysis bogus.” Initially, the Umbria team tried to weed out the splogs manually. But as the plague grew, they saw they needed to devote much of their research effort to spam fighting.

  For a few frantic months in 2005, Umbria scientists struggled to teach their machines how to distinguish between the work of other machines and that of humans. For this, Umbria’s scientists looked to geometry. This may come as a surprise to those of us who associate geometry with the pointy compasses and plastic protractors we carried around in middle school. But advanced geometry is a growing force in the expanding universe of the Numerati. From the king-sized laboratories of Google to small shops like Umbria, scientists often describe the world of data as a domain of sharp angles, colliding planes, and vectors shooting along endless paths.

  Imagine a vast multidimensional space, Nicolov instructs me. Remember that each document Umbria studies has dozens of markers—the strange spellings, fonts, word choices, themes, colors, and grammar that set it apart from others. In this enormous space I’m supposed to imagine, each marker occupies its own patch of real estate. This is a universe that spans the quirks, the table of contents, even the punctuation of the blogosphere. Picture the theme “iPod” somewhere near Pluto, and the emoticon : ( in the vicinity of the North Star. Thousands of these markers are scattered about. And each document—blog or splog—is given an assignment: it must produce a line—or vector—that intersects with each and every one of its own markers in the entire universe. It’s a little like those grade-school exercises where a child follows a series of numbers or letters with her pencil and ends up with a picture of a puppy or a Christmas tree.

  But Umbria’s vectors aren’t nearly so simple. Nicolov tries to draw a diagram on the whiteboard. But he gives up in short order. It’s impossible, because in a world of two dimensions, or even three, each of the vectors would have to squiggle madly and perform ridiculous U-turns to meet up with each of its markers. The resulting map would look more like a plate of spaghetti than the straight arrows demanded by this so-called support vector machine. Even if we can’t picture it with our earthbound minds, the computer has no trouble depicting the documents—blog posts and splogs alike—as vectors. They all run neatly from one dimension through countless others and, more important, through every one of their distinguishing markers. Intergalactic arrows galore. But there’s a certain order to them. Documents that resemble each other, naturally enough, are neighbors in this vector space. The ones about Iraq congregate around one constellation, those about deodorants around another. A blog about deodorants in Iraq (believe me, they’re out there) spans the two constellations. Blogs that have a lot in common tend to point at similar angles.
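
  The idea that similar documents point at similar angles can be made concrete with a toy sketch. Nothing below comes from Umbria: the four markers, the sample sentences, and the function names are invented for illustration, and real systems track thousands of markers rather than four. Each document becomes a list of marker counts, and the cosine of the angle between two lists measures how closely they point the same way.

```python
import math
from collections import Counter

# Hypothetical markers: one dimension of the space per marker.
MARKERS = ["iraq", "troops", "deodorant", "scent"]

def to_vector(text, markers):
    """Count how often each marker word appears in the text."""
    counts = Counter(text.lower().split())
    return [counts[m] for m in markers]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

# Three invented posts: one per "constellation", and one spanning both.
iraq_post = to_vector("troops in iraq and more troops", MARKERS)
deodorant_post = to_vector("this deodorant has a fresh scent", MARKERS)
both_post = to_vector("troops in iraq want deodorant", MARKERS)

print(cosine_similarity(iraq_post, deodorant_post))  # no shared markers
print(cosine_similarity(iraq_post, both_post))       # overlaps on two markers
print(cosine_similarity(deodorant_post, both_post))  # overlaps on one marker
```

  In this sketch the post about deodorants in Iraq shares markers with both of the others, so it sits at a smaller angle to each of them than they do to one another.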

  In an ideal world, the splogs’ vectors would all reside in the same underworld. Then Nicolov and his team could sequester them. But in the beginning, they usually mingle with legitimate blogs. Their authors, after all, take great pains to make them fit in. This means that the Umbria team must dig for more variables, more qualities in a blog that set the humans apart. The process is similar to the fraud detection that humans have been engaged in throughout history. I remember reading about German spies in World War II who spoke perfect American English. They knew about Franklin Roosevelt’s fireside chats and Betty Grable’s famous legs. They could talk about high school life outside St. Louis, reminisce about dancing to Glenn Miller’s big trombone at the prom. Suspicious American interrogators had to look for precise markers to set them apart. They’d ask the spies about spitballs and double plays. Maybe they tried out some old “Knock, knock” jokes. In this same way, Nicolov’s team searches for variables that will betray the splogs—and set their nasty vectors careening into a neighborhood of their own.

  What next? The splog neighborhood must be cordoned off, condemned. Imagine placing a big shield between the good and bad vectors. Speaking geometrically, the shield is a plane. The spam fighters maneuver it with a mouse, up and down, this way and that. The plane defines the border between the two worlds, and as the scientists position it, the machine churns through thousands of rules and statistics that divide legitimate blogs from spam.
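
  A dividing plane of this kind can be sketched in a few lines. The text names a support vector machine; the sketch below deliberately swaps in a perceptron, a simpler method that also finds a separating plane but makes no attempt to place it as far as possible from both groups. The two features and every number are made up for illustration and are not Umbria's variables.

```python
# Hypothetical features per document: (links per 100 words, share of repeated keywords).
blogs = [(1.0, 0.10), (2.0, 0.20), (1.5, 0.05)]
splogs = [(8.0, 0.70), (9.0, 0.90), (7.0, 0.80)]

points = blogs + splogs
labels = [1] * len(blogs) + [-1] * len(splogs)  # +1 = human blog, -1 = splog

def score(w, b, x):
    """Signed position of point x relative to the plane w.x + b = 0."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def train_perceptron(points, labels, epochs=1000):
    """Nudge the plane toward each misclassified point until none remain."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            if y * score(w, b, x) <= 0:  # wrong side of the plane, or on it
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training point is on its correct side
            break
    return w, b

def side_of_plane(w, b, x):
    """Return +1 for the blog side of the plane, -1 for the splog side."""
    return 1 if score(w, b, x) > 0 else -1

w, b = train_perceptron(points, labels)
print([side_of_plane(w, b, x) for x in points])
```

  Where the spam fighters are described as steering the plane by hand, the loop here adjusts it automatically each time a document lands on the wrong side.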

  ONE FRIGID MORNING before dawn I sit in the only open café I can find in Boulder, and I blog. It’s a post kvetching about annoying advertisements on the U.S. Airways flight from Newark to Denver. Before I’ve finished my coffee, that post is rocketing to a computer server in New York City and up onto the blog. Like millions of other posts every hour, it sends out pings. These are updates to the computers monitoring the blog world. Thanks to these pings, search engines and blog analysts like Umbria don’t have to go hunting and gathering in the world of blogs. That would take too long for a medium that changes by the minute. They simply open their digital doors and the posts arrive. It’s as if they have subscriptions. Within minutes, Umbria’s spam-fighting system has received my humble post and drawn it as a vector—hopefully, one flying on the safe side of the spam plane. Later it is classified by gender and generation and sentiment.

  Okay. Umbria can turn my post into a vector. But can Nicolov and his colleagues turn me into a vector? After all, if each blog post can be defined geometrically, each blogger can be as well. It’s just a matter of breaking down our blogs into pieces, or variables. It might analyze the subjects and blogs we write about, where we come from, the language we write in. I ask Nicolov if it’s possible. Of course, he says. But it’s much simpler, for now, to ignore the individuals and focus on their opinions. Each post, in that sense, participates in surveys. If U.S. Airways hires Umbria, they will see that at least one blogger, apparently a middle-aged male, has negative sentiments about their onboard advertising. This analysis is a bit like an election or the census. All voices are equal.

  But as the computers cast an ever-wider net for our words and online gestures, growing numbers of Numerati will be training their vector machines on the individual. BuzzMetrics already models 2,000 of the most popular bloggers. Each is represented as an amalgam of their language, the themes they cover, and the other blogs they link to. Each of these bloggers is a hub of activity. Analysts can measure their influence and map the constellations of smaller bloggers deployed around them. With this information in hand, advertisers can buy space on targeted blogs and measure, hour by hour, the buzz that each one produces.

  Can the automatic technologies that parse the words of bloggers do the same thing for some mastermind in Islamabad or London who’s deploying battalions of suicide bombers? Can that person’s vector be isolated like a splog? And what about others of us—you, me, Tears of Lust—whose vectors just happen to be passing through the same neighborhood? Umbria can screw up hundreds of times each day. Agencies tracking down terrorists won’t have that luxury.

  Chapter 5

  * * *

  Terrorist

  A SCHOOL BUS pulls up beside my car. Kids stream out and make their noisy way into the National Cryptologic Museum in Fort Meade, Maryland, my destination as well. I’m a little early. Just across a broad avenue, beyond an imposing high-tech fence and a vast acreage of parking space, stands the country’s headquarters of electronic espionage, the National Security Agency. I recognize the black, glass-walled cubes of the NSA from a refrigerator magnet a friend gave me a couple of years ago. It shows a bolt of lightning shooting down through the purple evening sky onto the taller of the two buildings. It appears either to be smiting the secretive workshop or imbuing it with righteous force from above, depending on your perspective. I’m here for a talk with the NSA’s chief mathematician, James Schatz. It’s apparently a lot easier for him to cross the street to this little museum than for me to get security clearance and make my way into the NSA fortress.

  The NSA was at the center of the information war on terrorism long before 9/11. But the profile of the secretive spy agency rose following the attacks. It became all too clear that the United States lacked on-the-ground intelligence in the war against Al Qaeda. Most spies and special forces in the Mideast struggled even to make a phone call in Arabic. Few could hope to infiltrate the terror network, much less locate and capture Osama bin Laden. The answer to this shortage, for many, was to comb through digital data. “It will be their sons against our silicon,” wrote Peter Huber and Fred Mills of ICX Technology, a high-tech surveillance company, in the winter of 2002.

  What sorts of data would fuel the hunt for terrorists? Practically anything the government could get its hands on. In the years following 9/11, the government spent more than $1 billion to merge its enormous databases, including those of the FBI and the CIA. This would give data miners a single unified resource. But that wasn’t all. They would also trawl oceans of consumer and demographic details, airline records and hotel receipts, along with videos, photos, and millions of hours of international phone and Internet traffic harvested by the NSA. This trove matched anything that the Web giants Yahoo and Google were grappling with. In May 2006, news surfaced that the NSA was secretly extending its nets even further. USA Today reported that major phone companies had delivered hundreds of billions of phone records to the government. These provided details on who was calling whom, from where, for how long, and whether the call was forwarded. Were the NSA staff also listening in on the calls and reading the e-mails? There was no telling. But the Bush administration made clear that when it came to antiterrorism efforts, few legalities involving congressional disclosure or court approval should get in the way. Consequently, the details of our lives flow into those databases, and it’s up to government data miners to weed out the terrorists among us.

  Can the Numerati at the NSA use the statistical techniques we’ve seen in politics and advertising to trace the path of a terrorist? Are the behavior patterns of suicide bombers similar in meaningful ways to those of foreign-movie buffs on Netflix, social butterflies on Facebook, or outlier Republicans in Greenwich Village? These are the questions I mull as I sit outside the Cryptologic Museum.

  I was scheduled to have this meeting months ago. But the NSA got caught up in the controversy surrounding wiretapping done without a legal warrant, and I got put on hold. Every time I encountered one of the Numerati while the interview was pending, I asked about the challenges facing the data miners at the NSA. What I learned was sobering. The hazards of tracking terrorists along paths of electronic data are formidable, and the risks of screwing up enormous. No prudent rulers, I’m convinced, would dream of entrusting citizens’ lives to these methods—unless the safety of their country hinged on it and they had precious few other options. This, sadly, appears to be the fear. And many of us may find ourselves caught in their net.

  When it comes to data mining, potential terrorists differ from, say, caviar buyers at Safeway in three crucial ways. First, there’s a lack of historical record. It’s nearly impossible to build a predictive model of rare or unprecedented events, such as the attacks on Spanish trains, a nightclub in Bali, and the World Trade Center. This is because math-based predictions rely on patterns of past behavior. Let’s say I fly to Taiwan tomorrow and purchase 200 Michelin tires with my credit card. Within minutes, MasterCard will be calling my house in New Jersey, asking if that’s really me on an Asian spree. My buying patterns and those of card thieves are etched into their system. A computer program known as a neural network races through millions of transactions and establishes the limits of normal behavior. It throws up a red flag when it sees a deviation that could signal a stolen card. (This type of program detected financial irregularities on the part of New York governor Eliot Spitzer in 2007. The trail of these monies led to the discovery of payments for prostitutes and his resignation in March 2008.) But such tools are useless when it comes to recognizing or predicting something never seen before—the unexpected earth-shaking events that the author Nassim Nicholas Taleb discusses in his book The Black Swan.
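
  The red-flag idea can be illustrated with something far simpler than the neural network mentioned above. The sketch below substitutes a basic statistical rule: it learns the usual range of a cardholder's past purchases and flags any charge that falls too many standard deviations from the average. The purchase history, the threshold of three, and the function name are assumptions chosen for illustration, not a description of any card issuer's system.

```python
import statistics

def is_unusual(history, amount, threshold=3.0):
    """Flag an amount that lies more than `threshold` standard deviations
    from the mean of past purchases."""
    mean = statistics.mean(history)
    spread = statistics.stdev(history)
    if spread == 0:
        # All past purchases were identical; anything different stands out.
        return amount != mean
    return abs(amount - mean) / spread > threshold

# Invented purchase history, in dollars.
past_purchases = [42.0, 18.5, 63.0, 25.0, 51.0, 37.5]

print(is_unusual(past_purchases, 45.0))     # an ordinary charge
print(is_unusual(past_purchases, 24000.0))  # 200 tires in Taiwan
```

  The rule works only because there is a history to learn from, which is precisely what is missing for rare or unprecedented events.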

  The second problem is that suspected terrorists, unlike most shoppers or voters, take measures to blur the data signal to cover their tracks. The simplest way is to conduct important business off the network—to hold meetings face to face and send coded messages on paper or committed to the memory of human couriers. But terrorists can also manipulate the data that gets picked up, distorting what industry insiders call the “feedback loop.” They can run preparations for a big bombing or hijacking, for example, elicit a response from Western intelligence agencies—and then simply hold off the attack. Jerry Friedman, a statistics professor at Stanford, compares the effect of this tactic to car alarms that go off constantly, lulling people into ignoring them. From the data miners’ perspective, the non-event appears to be a false positive. They might conclude incorrectly that their algorithms need an overhaul. By tinkering with the data, the terrorists play with their heads.

  Finally, failure in this realm of data mining can destroy lives. Remember Ted Kremer’s shrug when his automatic reader at Umbria misread a blog post and concluded that the writer was down on Apple? Who cared? The machine got it right most of the time. Nor did it matter if Josh Gotbaum’s algorithms misclassified me as a Barn Raiser or a Right Click. My pile of junk mail would be only slightly less relevant than usual. The ideal industries for the Numerati are those in which they can goof up regularly and still top the status quo. This is hardly the case in the war against terrorism. Innocent people who get swept up in the terror net can find themselves living a nightmare. This is all the more true where traditional protections, such as presumption of innocence and the right of habeas corpus, are not guaranteed, and torture is tolerated.

  In the realm of counterterrorism, hundreds of millions of us are reduced to the role of supporting players, extras. The focus is no longer on us, as it was at the office and at the supermarket. The Numerati at the NSA and similar agencies around the world are attempting to track down only the tiny fraction of killers in our midst. But here’s the rub. For the researchers to pick out these outliers, they must first figure out what’s “normal.” Picture our society on a big piece of posterboard. At first glance it looks entirely blue, monochromatic. But step closer, and you’ll see tiny dots and strings of red. That background of blue represents boring, law-abiding (for the most part) us. Our only function on this display is to bring into relief the bits of red. Those are the suspected terrorists. Analysts paint that blue with the details of our lives. For this to happen, we must become known. And sometimes, if the algorithms are a little off or our behavior falls out of step with the pack, our blue gives off the slightest rosy glow. That’s when trouble calls.

  At precisely ten o’clock, James Schatz arrives at the museum with his press representative. He’s bald and neatly dressed in a pressed white shirt and a tie. He walks with the posture and precision of a geometer. I’ve been briefed that policy questions will be off limits. The discussion will center on the mathematical and statistical approaches to intelligence. As we make our way into a small, sparse conference room at the museum and set up our recorders, I recall a recent conversation with Prabhakar Raghavan, the chief of research at Yahoo. He was describing how some analysts get so tangled up in huge amounts of data that they slot two people who should be in the same bucket into different ones. Maybe one of them is 51 years old, for example, and the other’s 49. That bit of data sends them into separate buckets, though there’s no good reason that it should. I wonder if there’s a similar problem at the NSA.

 
