In The Plex
Having a human being determine the ratings was out of the question. First, it was inherently impractical. Further, humans were unreliable. Only algorithms—well drawn, efficiently executed, and based on sound data—could deliver unbiased results. So the problem became finding the right data to determine whose comments were more trustworthy, or interesting, than others. Page realized that such data already existed and no one else was really using it. He asked Brin, “Why don’t we use the links on the web to do that?”
Page, a child of academia, understood that web links were like citations in a scholarly article. It was widely recognized that you could identify which papers were really important without reading them—simply tally up how many other papers cited them in notes and bibliographies. Page believed that this principle could also work with web pages. But getting the right data would be difficult. Web pages made their outgoing links transparent: built into the code were easily identifiable markers for the destinations you could travel to with a mouse click from that page. But it wasn’t obvious at all what linked to a page. To find that out, you’d have to somehow collect a database of links that connected to some other page. Then you’d go backward.
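In code, the "go backward" step amounts to inverting the links a crawler can see (outgoing) into a table of the links it cannot (incoming). A minimal sketch in Python, with invented page names standing in for real URLs:

```python
from collections import defaultdict

# Hypothetical forward links: each page lists the pages it points to,
# which is the information visible in a page's own code.
forward_links = {
    "page_a": ["page_b", "page_c"],
    "page_b": ["page_c"],
    "page_c": ["page_a"],
}

# Invert the map: for every destination, record who points at it.
backlinks = defaultdict(list)
for source, destinations in forward_links.items():
    for dest in destinations:
        backlinks[dest].append(source)

print(backlinks["page_c"])  # ['page_a', 'page_b'] -- the pages linking to page_c
```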
That’s why Page called his system BackRub. “The early versions of hypertext had a tragic flaw: you couldn’t follow links in the other direction,” Page once told a reporter. “BackRub was about reversing that.”
Winograd thought this was a great idea for a project, but not an easy one. To do it right, he told Page, you’d really have to capture a significant chunk of the World Wide Web’s link structure. Page said, sure, he’d go and download the web and get the structure. He figured it would take a week or something. “And of course,” he later recalled, “it took, like, years.” But Page and Brin attacked it. Every other week Page would come to Garcia-Molina’s office asking for disks and equipment. “That’s fine,” Garcia-Molina would say. “This is a great project, but you need to give me a budget.” He asked Page to pick a number, to say how much of the web he needed to crawl, and to estimate how many disks that would take. “I want to crawl the whole web,” Page said.
Page indulged in a little vanity in naming the part of the system that rated websites by the incoming links: he called it PageRank. But it was a sly vanity; many people assumed the name referred to web pages, not a surname.
Since Page wasn’t a world-class programmer, he asked a friend to help out. Scott Hassan was a full-time research assistant at Stanford, working for the Digital Library Project while doing part-time grad work. Hassan was also good friends with Brin, whom he’d met at an Ultimate Frisbee game during his first week at Stanford. Page’s program “had so many bugs in it, it wasn’t funny,” says Hassan. Part of the problem was that Page was using the relatively new computer language Java for his ambitious project, and Java kept crashing. “I went and tried to fix some of the bugs in Java itself, and after doing this ten times, I decided it was a waste of time,” says Hassan. “I decided to take his stuff and just rewrite it into the language I knew much better that didn’t have any bugs.”
He wrote a program in Python—a more flexible language that was becoming popular for web-based programs—that would act as a “spider,” so called because it would crawl the web for data. The program would visit a web page, find all the links, and put them into a queue. Then it would check to see if it had visited those link pages previously. If it hadn’t, it would put the link on a queue of future destinations to visit and repeat the process. Since Page wasn’t familiar with Python, Hassan became a member of the team. He and another student, Alan Steremberg, became paid assistants to the project.
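The loop Hassan describes can be sketched in a few lines of Python. This is an illustration of the queue-and-visit pattern, not his program; fetch_links is a hypothetical stand-in for downloading a page and pulling out its links:

```python
from collections import deque

def crawl(start_url, fetch_links):
    """Queue-and-visit sketch; fetch_links(url) is assumed to return
    the outgoing links of the page at url."""
    queue = deque([start_url])
    visited = set()
    link_graph = {}
    while queue:
        url = queue.popleft()
        if url in visited:
            continue                      # already handled this page
        visited.add(url)
        links = fetch_links(url)          # visit the page, find its links
        link_graph[url] = links
        for link in links:
            if link not in visited:       # queue only pages not yet seen
                queue.append(link)
    return link_graph
```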
Brin, the math prodigy, took on the huge task of crunching the mathematics that would make sense of the mess of links uncovered by their monster survey of the growing web.
Even though the small team was going somewhere, they weren’t quite sure of their destination. “Larry didn’t have a plan,” says Hassan. “In research you explore something and see what sticks.”
By March 1996, they began a test, starting at a single page, the Stanford computer science department home page. The spider located the links on the page and fanned out to the sites they pointed to, then to the sites those pages linked to in turn. “That first one just used the titles of documents because collecting the documents themselves required a lot of data and work,” says Page. After they snared about 15 million of those titles, they tested the program to see which websites it deemed more authoritative.
“Even the first set of results was very convincing,” Hector Garcia-Molina says. “It was pretty clear to everyone who saw this demo that this was a very good, very powerful way to order things.”
“We realized it worked really, really well,” says Page. “And I said, ‘Wow, the big problem here is not annotation. We should now use it not just for ranking annotations, but for ranking searches.’” It seemed the obvious application for an invention that gave a ranking to every page on the web. “It was pretty clear to me and the rest of the group,” he says, “that if you have a way of ranking things based not just on the page itself but based on what the world thought of that page, that would be a really valuable thing for search.”
The leader in web search at that time was a program called AltaVista that came out of Digital Equipment Corporation’s Western Research Laboratory. A key designer was Louis Monier, a droll Frenchman and idealistic geek who had come to America with a doctorate in 1980. DEC had been built on the minicomputer, a once innovative category now rendered a dinosaur by the personal computer revolution. “DEC was very much living in the past,” says Monier. “But they had small groups of people who were very forward-thinking, experimenting with lots of toys.” One of those toys was the web. Monier himself was no expert in information retrieval but a big fan of data in the abstract. “To me, that was the secret—data,” he says. What the data was telling him was that if you had the right tools, it was possible to treat everything in the open web like a single document.
Even at that early date, the basic building blocks of web search had already been set in stone. Search was a four-step process. First came a sweeping scan of all the world’s web pages, via a spider. Second was indexing the information drawn from the spider’s crawl and storing the data on racks of computers known as servers. The third step, triggered by a user’s request, identified the pages that seemed best suited to answer that query; how well that was done came to be known as search quality. The final step involved formatting and delivering the results to the user.
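Steps two and three can be illustrated with a toy inverted index, the structure that lets a query jump straight to candidate pages. The documents below are invented, and a real engine spreads such an index across racks of servers:

```python
from collections import defaultdict

# Invented documents standing in for crawled pages.
docs = {
    "doc1": "stanford computer science department",
    "doc2": "computer history museum",
}

# Step two: build the index, mapping every word to the pages containing it.
index = defaultdict(set)
for doc_id, text in docs.items():
    for word in text.lower().split():
        index[word].add(doc_id)

# Step three: a query pulls candidate pages out of the index, which the
# engine then ranks and formats for the user.
print(sorted(index["computer"]))  # ['doc1', 'doc2']
```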
Monier was most concerned with the first step, the time-consuming process of crawling through millions of documents and scooping up the data. “Crawling at that time was slow, because the other side would take on average four seconds to respond,” says Monier. One day, lying by a swimming pool, he realized that you could get everything in a timely fashion by parallelizing the process, covering more than one page at a time. The right number, he concluded, was a thousand pages at once. Monier figured out how to build a crawler working on that scale. “On a single machine I had one thousand threads, independent processes asking things and not stepping on each other’s toes.”
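In present-day Python, the thousand-threads idea looks roughly like the sketch below. It is not Monier’s code, and the four-second response time is only simulated, but it shows why overlapping requests beats fetching pages one at a time:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fetch(url):
    # Stand-in for a real download: the dominant cost is waiting on the
    # remote server, simulated here with a fixed delay.
    time.sleep(4)
    return url, "<html>...</html>"

def crawl_batch(urls, workers=1000):
    # With a thousand worker threads, the slow responses overlap instead
    # of queuing one after another, so a batch of a thousand pages takes
    # roughly one four-second round trip instead of more than an hour.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(fetch, urls))
```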
By late 1995, people in DEC’s Western Research Lab were using Monier’s search engine. He had a tough time convincing his bosses to open up the engine to the public. They argued that there was no way to make money from a search engine but relented when Monier sold them on the public relations aspect. (The system would be a testament to DEC’s powerful new Alpha processing chip.) On launch day, AltaVista had 16 million documents in its indexes, easily besting anything else on the net. “The big ones then had maybe a million pages,” says Monier. That was the power of AltaVista: its breadth. When DEC opened it to outsiders on December 15, 1995, nearly 300,000 people tried it out. They were dazzled.
AltaVista’s actual search quality techniques—what determined the ranking of results—were based on traditional information retrieval (IR) algorithms. Many of those algorithms arose from the work of one man, a refugee from Nazi Germany named Gerard Salton, who had come to America, got a PhD at Harvard, and moved to Cornell University, where he cofounded its computer science department. Searching through databases using the same commands you’d use with a human—“natural language” became the term of art—was Salton’s specialty.
During the 1960s, Salton developed a system that was to become a model for information retrieval. It was called SMART, supposedly an acronym for “Salton’s Magical Retriever of Text.” The system established many conventions that still persist in search, including indexing and relevance algorithms. When Salton died in 1995, his techniques still ruled the field. “For thirty years,” wrote one academic in tribute a year later, “Gerry Salton was information retrieval.”
The World Wide Web was about to change that, but the academics didn’t know it—and neither did AltaVista. While its creators had the insight to gather all of the web, they missed the opportunity to take advantage of the link structure. “The innovation was that I was not afraid to fetch as much of the web as I could, store it in one place, and have a really fast response time. That was the novelty,” says Monier. Meanwhile, AltaVista analyzed what was on each individual page—using metrics like how many times each word appeared—to see if a page was a relevant match to a given keyword in a query.
Even though there was no clear way to make money from search, AltaVista had a number of competitors. By 1996, when I wrote about search for Newsweek, executives from several companies all boasted that theirs was the most useful service. When pressed, all of them would admit that in the race between the omnivorous web and their burgeoning technology, the web was winning. “Academic IR had thirty years to get to where it is—we’re breaking new ground, but it’s difficult,” complained Graham Spencer, the engineer behind the search engine created by a start-up called Excite. AltaVista’s director of engineering, Barry Rubinson, said that the best approach was to throw massive amounts of silicon toward the problem and then hope for the best. “The first problem is that relevance is in the eye of the beholder,” he said. The second problem, he continued, is making sense of the infuriatingly brief and cryptic queries typed into the AltaVista search field. He implied that the task was akin to voodoo. “It’s all wizardry and witchcraft,” he told me. “Anyone who tells you it’s scientific is just pulling your leg.”
No one at the web search companies mentioned using links.
The links were the reason that a research project running on a computer in a Stanford dorm room had become the top performer. Larry Page’s PageRank was powerful because it cleverly analyzed those links and assigned a number to them, a metric on a scale of 1 to 10, that allowed you to see the page’s prominence in comparison to every other page on the web. One of the early versions of BackRub had simply counted the incoming links, but Page and Brin quickly realized that it wasn’t merely the number of links that made things relevant. Just as important was who was doing the linking. PageRank reflected that information. The more prominent the status of the page that made the link, the more valuable the link was and the higher it would rise when calculating the ultimate PageRank number of the web page itself. “The idea behind PageRank was that you can estimate the importance of a web page by the web pages that link to it,” Brin would say. “We actually developed a lot of math to solve that problem. Important pages tended to link to important pages. We convert the entire web into a big equation with several hundred million variables, which are the PageRanks of all the web pages, and billions of terms, which are all the links.” It was Brin’s mathematical calculations on those possible 500 million variables that identified the important pages. It was like looking at a map of airline routes: the hub cities would stand out because of all the lines representing flights that originated and terminated there. Cities that got the most traffic from other important hubs were clearly the major centers of population. The same applied to websites. “It’s all recursive,” Page later said. “In a way, how good you are is determined by who links to you and who you link to determines how good you are. It’s all a big circle. But mathematics is great. You can solve this.”
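The “big equation” can be shown in miniature. The sketch below is the standard textbook formulation of PageRank, a damped iteration run here on three invented pages; it illustrates the recursive idea rather than Brin’s actual computation:

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each page to the pages it links to."""
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}          # start with equal scores
    for _ in range(iterations):
        new_rank = {p: (1 - damping) / n for p in pages}
        for page, outgoing in links.items():
            share = rank[page] / len(outgoing) if outgoing else 0.0
            for dest in outgoing:
                new_rank[dest] += damping * share   # pass score along each link
        rank = new_rank                              # repeat until scores settle
    return rank

# Everything points at page_c, so page_c ends up with the highest score.
print(pagerank({"page_a": ["page_c"], "page_b": ["page_c"], "page_c": ["page_a"]}))
```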
The PageRank score would be combined with a number of more traditional information retrieval techniques, such as comparing the keyword to text on the page and determining relevance by examining factors such as frequency, font size, capitalization, and position of the keyword. (Those factors help determine the importance of a keyword on a given page—if a term is prominently featured, the page is more likely to satisfy a query.) Such factors are known as signals, and they are critical to search quality. There are a few crucial milliseconds in the process of a web search during which the engine interprets the keyword and then accesses the vast index, where all the text on billions of pages is stored and ordered just like an index of a book. At that point the engine needs some help to figure out how to rank those pages. So it looks for signals—traits that can help the engine figure out which pages will satisfy the query. A signal says to the search engine, “Hey, consider me for your results!” PageRank itself is a signal. A web page with a high PageRank number sends a message to the search engine that it’s a more reputable source than those with lower numbers.
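How signals combine can be illustrated with a toy scoring function. The signal names and weights below are invented for the example; they are not Google’s formula:

```python
def score(page, query):
    words = page["text"].lower().split()
    signals = {
        "pagerank": page["pagerank"] / 10.0,                 # link-based reputation
        "query_in_title": query.lower() in page["title"].lower(),
        "query_frequency": words.count(query.lower()) / max(len(words), 1),
    }
    weights = {"pagerank": 0.5, "query_in_title": 0.3, "query_frequency": 0.2}
    return sum(weights[name] * float(value) for name, value in signals.items())

# A reputable page that mentions the query outranks an obscure one that doesn't.
print(score({"pagerank": 8, "title": "White House", "text": "the president"}, "president"))
print(score({"pagerank": 2, "title": "Joke of the Day", "text": "a joke"}, "president"))
```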
Though PageRank was BackRub’s magic wand, it was the combination of that algorithm with other signals that created the mind-blowing results. If the keyword matched the title of the web page or the domain name, that page would go higher in the rankings. For queries consisting of multiple words, documents containing all of the search query terms in close proximity would typically get the nod over those in which the phrase match was “not even close.” Another powerful signal was the “anchor text” of links that led to the page. For instance, if a web page used the words “Bill Clinton” to link to the White House, “Bill Clinton” would be the anchor text. Because of the high values assigned to anchor text, a BackRub query for “Bill Clinton” would lead to www.whitehouse.gov as the top result because numerous web pages with high PageRanks used the president’s name to link to the White House site. “When you did a search, the right page would come up, even if the page didn’t include the actual words you were searching for,” says Scott Hassan. “That was pretty cool.” It was also something other search engines failed to do. Even though www.whitehouse.gov was the ideal response to the Clinton “navigation query,” other commercial engines didn’t include it in their results. (In April 1997, Page and Brin found that a competitor’s top hit was “Bill Clinton Joke of the Day.”)
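The anchor-text trick amounts to crediting the words inside a link to the page the link points to, so a page can rank for terms that never appear on it. A small sketch with invented pages:

```python
from collections import defaultdict

# (source page, anchor text, destination) triples gathered during a crawl.
links = [
    ("news-site.example", "Bill Clinton", "www.whitehouse.gov"),
    ("blog.example", "Bill Clinton", "www.whitehouse.gov"),
]

# Credit each anchor word to the destination page, not the source.
anchor_index = defaultdict(list)
for source, anchor_text, destination in links:
    for word in anchor_text.lower().split():
        anchor_index[word].append(destination)

print(anchor_index["clinton"])  # ['www.whitehouse.gov', 'www.whitehouse.gov']
```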
PageRank had one other powerful advantage. To search engines that relied on the traditional IR approach of analyzing content, the web presented a terrible challenge. There were millions and millions of pages, and as more and more were added, the performance of those systems inevitably degraded. For those engines, the rapid expansion of the web was a problem, a drain on their resources. But because of PageRank, BackRub got better as the web grew. New sites meant more links. This additional information allowed BackRub to identify even more accurately the pages that might be relevant to a query. And the more recent links would improve the freshness of the site. “PageRank has the benefit of learning from the whole of the World Wide Web,” Brin would explain.
Of course, Brin and Page had the logistical problem of capturing the whole web. The Stanford team did not have the resources of DEC. For a while, BackRub could access only the bandwidth available to the Gates Building—10 megabits of traffic per second. But the entire university ran on a giant T3 line that could operate at 45 megabits per second. The BackRub team discovered that by retoggling an incorrectly set switch in the basement, it could get full access to the T3 line. “As soon as they toggled that, we were all the way up to the maximum of the entire Stanford network,” says Hassan. “We were using all the bandwidth of the network. And this was from a single machine doing this, on a desktop in my dorm room.”
In those days, people who ran websites—many of them with minimal technical savvy—were not used to their sites being crawled. Some of them would look at their logs, and see frequent visits
from www.stanford.edu, and suspect that the university was somehow stealing their information. One woman from Wyoming contacted Page directly to demand that he stop, but Google’s “bot” kept visiting. She discovered that Hector Garcia-Molina was the project’s adviser and called him, charging that the Stanford computer was doing terrible things to her computer. He tried to explain to her that being crawled is a harmless, nondestructive procedure, but she’d have none of it. She called the department chair and the Stanford security office. In theory, complainants could block crawlers by putting a small text file on their sites called robots.txt, but the angry webmasters weren’t receptive to the concept. “Larry and Sergey got annoyed that people couldn’t figure out /robots.txt,” says Winograd, “but in the end, they actually built an exclusion list, which they didn’t want to.” Even then, Page and Brin believed in a self-service system that worked at scale, serving vast populations. Handcrafting exclusions was anathema.
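The exclusion convention itself is nothing more than a plain text file served from the root of a site; a crawler that honors it reads the file first and stays away from whatever is disallowed. A minimal example asking every crawler to skip the whole site:

```
User-agent: *
Disallow: /
```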
Brin and Page fell into a pattern of rapid iterating and launching. If the pages for a given query were not quite in the proper order, they’d go back to the algorithm and see what had gone wrong. It was a tricky balancing act to assign the proper weights to the various signals. “You do the ranking initially, and then you look at the list and say, ‘Are they in the right order?’ If they’re not, we adjust the ranking, and then you’re like, ‘Oh this looks really good,’” says Page. Page used the ranking for the keyword “university” as a litmus test. He paid particular attention to the relative ranking of his alma mater, Michigan, and his current school, Stanford. Brin and Page assumed that Stanford would be ranked higher, but Michigan topped it. Was that a flaw in the algorithm? No. “We decided that Michigan had more stuff on the web, and that was reasonable,” says Page.