Voices from the Valley

by Ben Tarnoff

Tell us about that.

Anybody who’s been around the internet for long enough has seen a CAPTCHA. These days, it’s the little thing that pops up with a checkbox that says “I am not a robot.” And sometimes it asks you to prove it by clicking images that have a taxi or a traffic light or whatever.

The professor I was working for invented the original CAPTCHA for Yahoo. Back in the day, Yahoo had a bunch of people signing up for free email accounts and then using them to send spam. The CAPTCHA was supposed to put a check on that.

The idea was that you’d display this distorted text and tell the user to type it. A computer could generate these tests very easily and know what the right answer was. But at the time it was hard for computers to read the distorted text. So the CAPTCHA prevented people from writing programs to automatically create a hundred thousand Yahoo email accounts for sending spam.

CAPTCHAs started to get used everywhere on the web. At some point we did the math and figured out, “Wow, people are filling in millions, maybe billions of these a day. They are collectively wasting a huge chunk of time typing in these obnoxious characters. Why don’t we try to do some good for the world and use CAPTCHAs to digitize books?”

How?

It’s the same idea as the original CAPTCHA. But instead of displaying random words, you’re displaying words from old books or newspapers or magazines that optical character recognition software has trouble reading. So we get humans to read the words and tell us what they are.

We would display two words. One was a word that we actually knew. The other word was taken from a scanned book, and maybe we had some guesses. We would use the word we knew to confirm that the person was actually a human. Then, assuming that they passed the first word, we would count their answer for the other word as a vote for the correct spelling of that word.

If they happened to agree with the optical character recognition software, then, great—it was probably right. If they disagreed, then maybe you send the word out to a couple more people and try to get some agreement on what the word is.
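The two-word scheme described above can be sketched in a few lines of Python. This is only an illustration, not reCAPTCHA's actual code: the names `UnknownWord`, `submit`, and `resolve`, the case-insensitive comparison, and the two-vote threshold are all assumptions made for the example. The optical character recognition guess is seeded as one vote, so a single human who agrees with it settles the word, while a disagreement leaves it open until more people weigh in.

```python
from collections import Counter
from typing import Optional


def normalize(text: str) -> str:
    # Assumption: answers are compared case-insensitively, ignoring outer spaces.
    return text.strip().lower()


class UnknownWord:
    """A scanned word whose correct spelling has not been settled yet."""

    def __init__(self, ocr_guess: str) -> None:
        # Seed the tally with the OCR guess as a single vote (an assumption).
        self.votes = Counter({normalize(ocr_guess): 1})


def submit(known_answer: str, typed_known: str,
           unknown: UnknownWord, typed_unknown: str) -> bool:
    """Check the control word; count the other answer as a vote only on a pass."""
    if normalize(typed_known) != normalize(known_answer):
        return False  # failed the human check, so the second answer is discarded
    unknown.votes[normalize(typed_unknown)] += 1
    return True


def resolve(unknown: UnknownWord, votes_needed: int = 2) -> Optional[str]:
    """Return the agreed spelling, or None if the word should go to more people."""
    spelling, count = unknown.votes.most_common(1)[0]
    return spelling if count >= votes_needed else None


# One reader agrees with the OCR guess, so the word is settled.
easy = UnknownWord("morning")
submit("upon", "upon", easy, "morning")

# The first reader disagrees with the OCR guess; a second reader breaks the tie.
hard = UnknownWord("modem")
submit("upon", "upon", hard, "modern")
pending = resolve(hard)
submit("upon", "Upon", hard, "modern")
```

In this sketch, `pending` is `None` after the first disagreement, which corresponds to sending the word out to a couple more people.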

Where would you get the scanned books or newspapers or magazines?

Well, our idea was that there must be places that have old works that they want to digitize. We could partner with them. And that became the business model. It started out as an academic research project but, by the end of that summer—this was 2007—we had decided to make it into an actual company.

We ended up getting a contract with The New York Times. We started digitizing old years of the paper that were in the public domain. So we started with 1922 or 1923 and then kept going backward. We went in reverse chronological order because obviously the older scans were harder to read.

That was a really fun project to work on. We were a small team, maybe six people, and we never had an office. It felt like being at another university research lab.

Then, in 2009, we found out that Google was considering acquiring us.

Into the Mothership

Why did Google want reCAPTCHA?

They wanted to use it to digitize Google Books.

At that point, Google had been scanning books in the public domain from libraries for many years. And, it wasn’t announced publicly yet, but Google was getting into the e-book business.

They wanted an e-book service that could compete with the Amazon Kindle. And part of their plan for doing that was to say, “Hey, we can offer higher-quality versions of all these old public domain books than what Amazon has.”

At the time, Amazon was claiming to have four hundred thousand e-books. Three hundred thousand of those were public domain books that had been scanned by somebody and gone through one pass of optical character recognition. There would be typos or misspellings or wrong words everywhere. Not really joyful to read.

The idea was that Google could have a similar catalog of these older public domain works but that they would actually be readable. Honestly, I think it was largely a marketing thing. To be able to say, “We have five hundred thousand e-books and nobody else has five hundred thousand e-books.”

So reCAPTCHA would be the weapon that Google could use to beat Amazon at e-books, by offering a way to clean up text produced by not-so-great optical recognition software.

Yeah. Google had a team, and still does, that worked on optical recognition software. The project was open-source and called Tesseract. Tesseract was closely tied to the Google Books team. We met with them during our first few weeks, and sat next to them.

Tesseract was okay, but it wasn’t as good as the commercial software we had been using for reCAPTCHA, which was called ABBYY. So Google wanted reCAPTCHA to improve the text quality.

What was it like to be acquired by Google?

It was exciting. We had a lot of code that was specific to The New York Times. They had a particular format of articles and sections and so on. Now we were doing books, which is a very different sort of thing.

Also, the scale was completely new. At Google, we were working with millions of books. And we had way more access to computing power, obviously.

I’d imagine that Google had more computing resources available than your startup.

Yes, and reCAPTCHA was an old-school startup, before there was any of this cloud stuff. The front ends that served the CAPTCHAs were hosted on four servers: two on the East Coast and two on the West Coast. Eventually we added three servers in Europe for latency reasons—because of a big client that wanted low latency for their European users.

So we went from having a handful of servers that we had to manage ourselves to having as many resources as we wanted. In the first year or two at Google, we easily scaled up our traffic by six to eight times what it had been before.

When we got acquired, we were serving maybe four or five thousand CAPTCHAs per second. Which is not bad. Facebook used reCAPTCHA. So did Ticketmaster, Twitter, and a bunch of sites that were a big deal ten years ago and that nobody remembers anymore.

But within a year at Google, we were easily double or triple that. Not due to us doing any special marketing or anything. It was just organic growth from the sites that were already using us, plus others saying, “Okay, they’re part of Google now, so they’re not going to just disappear.” Which I guess is different than what people say when startups get acquired by Google these days!

These days, when they buy a startup, they usually just burn it down so it doesn’t become a competitor.

Yeah.

You mentioned that Google Books had already scanned millions of books. Where did those books come from originally?

By the time we arrived, Google Books had been going on for years. It was first announced back in 2004. All of the book scanning was done in collaboration with libraries. Harvard and the University of Michigan were the two largest ones in the U.S.

The way it worked was that books that weren’t checked out would get trucked off from the library to a Google scan center. There, they had people turning the pages and taking photos with cameras from above to scan them, to ensure it was a nondestructive scanning process.

I did get to see a scan center at one point. It’s one of the first situations that I became aware of where Google was using TVCs.3 Google wasn’t directly employing the people scanning the books—there was some third party that was responsible for the scan center operations at any given place.

Libraries liked the project, because their whole point was the preservation of written material. So preserving that material digitally for future generations seemed good. And the libraries got the scans of the books to do whatever they wanted to with them.

What was the original impetus for the project?

My best guess is that Larry Page just thought it would be cool. He probably decided it was worth doing because compared to the scale of Google, the amount of resources required was not huge. And it made sense given Google’s culture and mission back in 2004. Google was a search engine. The reasoning that I always heard was, “So you can search the web, but there’s a whole bunch of human knowledge that’s stuck in dead-tree form. Why can’t you search all of that as well?”

The Culture Is Changing

I’m curious about your evolution in how you saw the company. When you first joined, you were excited. Suddenly you had a huge amount of computing power at your disposal, and you were part of this big, ambitious project to digitize the world’s books. How did that feeling evolve over time?

Many different things changed at Google, both culturally and engineering-wise, over the nine years I was there.

The Google of nine years ago felt much closer to Larry and Sergey’s original vision. It was honest techno-utopianism. Google Books was a great example of that. “We’re just going to try and scan all the world’s books because we did some numbers on the back of an envelope and it seemed like we could.” And they did actually scan 20 percent of all the books that had ever been published!4

When I started, Eric Schmidt was in charge. He seemed fine with having a loose, university-style atmosphere. Different teams worked on different things. Some of them succeeded and some of them didn’t. But the company made a ridiculous amount of money from Search so it didn’t really matter that much.

Then Larry Page became CEO in 2011, and things started to change.

How?

He introduced a lot more structure and hierarchy. Previously, there were relatively few divisions. Most projects were under Search—one might be web search, another might be book search. Larry reorganized the company into major product areas. Search was a division, but not the only one; there was also Android, Cloud, and so on. And he put a senior vice president in charge of each of them.

Right away, there was less of that university atmosphere where you could just walk up and talk to anybody about their project and maybe help them out with it. Now the decisions were coming down from the senior vice president who was in charge of that product area. And those decisions were being driven by business objectives: overall, the company started caring more about the business and less about whimsical projects. Google started to become more like a typical big company.

You mentioned that you first encountered TVCs when working on Google Books. As the company began to change and become more corporate, did you notice more and more contractors around?

I can’t think of a specific moment where there was a sudden influx. But percentage-wise, more and more of Google became temporary workers. It’s now a majority—Google employs more TVCs than full-time employees.5 It was probably 10 or 20 percent when I started.

At first, the introduction of TVCs seemed justified. Google is not in the business of hiring people to do everything. It doesn’t have the time. So having some third party manage that seemed like it made sense. But over time, Google has gone from, “There’s this special project that needs a few hundred people who have skill sets that no Googler currently has,” to taking full-time positions and turning them into temporary or contract positions.

For example?

Recruiters. I had some awareness of this when I arrived, because it was in the middle of the Great Recession. Before, recruiters were full-time employees. But after the financial crisis, they scaled down hiring and basically fired a bunch of people because they had nothing to do. Then when they spun hiring back up again, résumé screeners and college recruiting coordinators were hired on a temporary basis. I remember older employees who had been there longer than me grumbling about it. They used to know the recruiting folks in this office or that office. Then they started turning over every year or two.

It makes everyone’s life worse. That’s the point. I worked on one project where we hired a third-party design firm to do the web design for a data visualization that we were gonna release publicly. And that was incredibly obnoxious from an engineering perspective because they can’t see our code base. So I’m writing some stuff and they’re writing some stuff and when we stick it together it’s a giant mess because we’re not all developing in the same place.

In addition to more hierarchy and more contractors, what were some of the other elements that started to change how you felt about the company?

The first thing that occurs to me is Google Plus.

Google Plus was meant to be a social networking competitor of Facebook. From the start, a decision was made that people were going to have to use their real names on Google Plus. A bunch of Googlers then pointed out that this policy was problematic for a bunch of reasons. Trans people may be known by different names in different contexts. Sex workers might not feel safe using their real name. More generally, anybody who doesn’t want to be automatically doxing6 themselves for the opinions they post on the internet might not want to use their real name.

Why did management want real names?

Their main argument was that anonymous discourse on the internet is toxic. The idea was that if you made people sign up with their real name, there would be less bad online behavior, less trolling.

I very specifically remember an exec making an analogy to a restaurant. When you go to a nice restaurant, you have to wear a shirt and pants. If you want to eat at home, you can eat wearing whatever you want. But as members of polite society, we accept certain restrictions.

The analogy landed extremely poorly on a bunch of people internally. These execs have millions of dollars and are basically public figures. Of course they don’t have a problem with using their real names, so they couldn’t possibly imagine why anyone wouldn’t want to. Also, from a logical standpoint, their argument didn’t make a lot of sense. You can come up with an alias that looks like a real name and post the most toxic stuff in the world. It’s not violating the names policy that’s the problem—it’s the behavior you’re engaging in.

So what happened?

It took a while, but Googlers were able to push back and get the policy changed. It ended up in the state that I think should have been the state initially: you can type anything you want into the name field, so long as it’s not offensive or you’re not impersonating anyone. In fact, I remember one particular instance in which we disabled Neil Gaiman’s account for impersonating Neil Gaiman! He escalated on Twitter, and Googlers went and fixed it.

It sounds like the real-names policy on Google Plus was an early example of a rift between rank-and-file Googlers and management.

Yes. Although in the subsequent years, many more rifts manifested and grew wider. Because back then, the feedback mechanism within Google was still working. There was still a measure of trust.

People on the Internet Are Jerks

When did that trust start to break down?

The Damore memo was definitely a turning point.

In July 2017, James Damore wrote and circulated his memo. He was fired the following month. Soon after, there were all these leaks of conversations within Google that got sent to his lawyers—screenshots of email threads or internal Google Plus threads. A lot of the posts had nothing to do with the memo. In many cases, they were written years before the memo. But Damore’s legal counsel used them to make Google look like one big evil leftist conspiracy.

Then the leaks ended up on right-wing sites. A bunch of Googlers found themselves getting doxed and getting death threats. A lot of people were pretty scared. People’s photos were getting posted on 4chan and Stormfront and 8chan and all these other terrible sites.7 A bunch of well-known alt-right provocateurs, including Vox Day and Milo Yiannopoulos and various others on Breitbart, were involved.

What was the reaction to all of that within Google?

We were completely blindsided. There had never been any culture of leaking internal posts to score political points before. Or of people getting doxed and threatened. And the company didn’t know how to deal with it.

Google has a physical security group that is very responsive. If there is an earthquake or a natural disaster or something, they call all the Googlers in the area to ensure they’re safe and to provide help if needed. But for this kind of online attack, they didn’t have a clue what to do.

Initially, there was no official support from Google for the Googlers who were affected. We got sent some useless stock resources that told us not to use our real name and address online—ironic, given the Google Plus controversy. Nobody seemed to have any idea what the hell was going on.

Was there any kind of response from upper management?

Google’s lawyers made the argument that the court should redact employees’ names because they weren’t relevant to the lawsuit. Eventually the judge agreed, and the docs that were leaked were retroactively redacted from the official court website.

But by then the screenshots were all over the most toxic parts of the right-wing internet. You can’t remove stuff once it’s gotten out there. Some Googlers put together a letter to management asking for more resources for keeping the workplace safe. Basic things, like having codes of conduct on internal mailing lists that were unmoderated. But the letter was largely ignored.

Were many people you worked with sympathetic to Damore?

I don’t know the numbers. Nobody I worked closely with. On some mailing lists, there were certainly people who took his firing as proof that Google was biased against conservative employees. Of course, the downside of Google’s mailing list culture is that it’s easy for twenty or thirty people to troll every thread.
