Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers

Page 3

by John MacCormick

Now, remember why we introduced these in-page word locations: it was to solve the problem of how to do phrase queries efficiently. So let's see how to do a phrase query with this new index. We'll work with the same query as before, “cat sat”. The first steps are the same as with the old index: extract the locations of the individual words from the index, so for “cat” we get 1-2, 3-2, and for “sat” we get 1-3, 3-7. So far, so good: we know that the only possible hits for the phrase query “cat sat” can be on pages 1 and 3. But just like before, we are not yet sure whether that exact phrase occurs on those pages—it is possible that the two words do appear, but not next to each other in the correct order. Luckily, it is easy to check this from the location information. Let's concentrate on page 1 initially. From the index information, we know that “cat” appears at position 2 on page 1 (that's what the 1-2 means). And we know that “sat” appears at position 3 on page 1 (that's what the 1-3 means). But if “cat” is at position 2, and “sat” is at position 3, then we know “sat” appears immediately after “cat” (because 3 comes immediately after 2)—and so the entire phrase we are looking for, “cat sat,” must appear on this page beginning at position 2!

Top: Our three web pages with in-page word locations added. Bottom: A new index that includes both page numbers and in-page word locations.

I know I am laboring this point, but the reason for going through this example in excruciating detail is to understand exactly what information is used to arrive at this answer. Note that we have found a hit for the phrase “cat sat” by looking only at the index information (1-2, 3-2 for “cat,” and 1-3, 3-7 for “sat”), not at the original web pages themselves. This is crucial, because we only had to look at the two entries in the index, rather than reading through all of the pages that might be hits—and there could be literally millions of such pages in a real search engine performing a real phrase query. To summarize: including the in-page word locations in the index has allowed us to find a phrase query hit by looking at only a couple of lines in the index, rather than reading through a large number of web pages. This simple word-location trick is one of the keys to making search engines work!

Actually, we haven't even finished working through the “cat sat” example. We finished processing the information for page 1, but not for page 3. But the reasoning for page 3 is similar: we see that “cat” appears at location 2, and “sat” occurs at location 7, so they cannot possibly occur next to each other—because 7 is not immediately after 2. So we know that page 3 is not a hit for the phrase query “cat sat”, even though it /s a hit for the multiword query cat sat.

By the way, the word-location trick is important for more than just phrase queries. As one example, consider the problem of finding words that are near to each other. On some search engines, you can do this with the NEAR keyword in the query. In fact, the AltaVista search engine offered this facility from its early days and still does at the time of writing. As a specific example, suppose that on some particular search engine, the query cat NEAR dog finds pages in which the word “cat” occurs within five words of the word “dog.” How can we perform this query efficiently on our data set? Using word locations, it's easy. The index entry for “cat” is 1-2, 3-2, and the index entry for “dog” is 2-2, 3-6. So we see immediately that page 3 is the only possible hit. And on page 3, “cat” appears at location 2, and “dog” appears at location 6. So the distance between the two words is 6 – 2, which is 4. Therefore, “cat” does appear within five words of “dog,” and page 3 is a hit for the query cat NEAR dog. Again, note how efficiently we could perform this query: there was no need to read through the actual content of any web pages—instead, only two entries from the index were consulted.

It turns out that NEAR queries aren't very important to search engine users in practice. Almost no one uses NEAR queries, and most major search engines don't even support them. But despite this, the ability to perform NEAR queries is actually crucial to real-life search engines. This is because the search engines themselves are constantly performing NEAR queries behind the scenes. To understand why, we first have to take a look at one of the other major problems that confronts modern search engines: the problem of ranking.

RANKING AND NEARNESS

So far, we've been concentrating on the matching phase: the problem of efficiently finding all of the hits for a given query. But as emphasized earlier, the second phase, “ranking,” is absolutely essential for a high-quality search engine: this is the phase that picks out the top few hits for display to the user.

Let's examine the concept of ranking a little more carefully. What does the “rank” of a page really depend on? The real question is not “Does this page match the query?” but rather “Is this page relevant to the query?” Computer scientists use the term “relevance” to describe how suitable or useful a given page is, in response to a particular query.

As a concrete example, suppose you are interested in what causes malaria, and you enter the query malaria cause into a search engine. To keep things simple, imagine there are only two hits for that query in the search engine—the two pages shown in the figure on the following page. Have a look at those pages now. It should be immediately clear to you, as a human, that page 1 is indeed about the causes of malaria, whereas page 2 seems to be the description of some military campaign which just happens, by coincidence, to use the words “cause” and “malaria.” So page 1 is undoubtedly more “relevant” to the query malaria cause than page 2. But computers are not humans, and there is no easy way for a computer to understand the topics of these two pages, so it might seem impossible for a search engine to rank these two hits correctly.

Top: Two example web pages that mention malaria.

Bottom: Part of the index built from the above two web pages.

However, there is, in fact, a very simple way to get the ranking right in this case. It turns out that pages where the query words occur near each other are more likely to be relevant than pages where the query words are far apart. In the malaria example, we see that the words “malaria” and “cause” occur within two words of each other in page 1, but are separated by 17 words in page 2. (And remember, the search engine can find this out efficiently by looking at just the index entries, without having to go back and look at the web pages themselves.) So although the computer doesn't really “understand” the topic of this query, it can guess that page 1 is more relevant than page 2, because the query words occur much closer on page 1 than on page 2.

To summarize: although humans don't use NEAR queries much, search engines use the information about nearness constantly to improve their rankings—and the reason they can do this efficiently is because they use the word-location trick.

An example set of web pages that each have a title and a body.

We already know that the Babylonians were using indexing 5000 years before search engines existed. It turns out that search engines did not invent the word-location trick either: this is a well-known technique that was used in other types of information retrieval before the internet arrived on the scene. However, in the next section we will learn about a new trick that does appear to have been invented by search engine designers: the metaword trick. The cunning use of this trick and various related ideas helped to catapult the AltaVista search engine to the top of the search industry in the late 1990s.

THE METAWORD TRICK

So far, we've been using extremely simple examples of web pages. As you probably know, most web pages have quite a lot of structure, including titles, headings, links, and images, whereas we have been treating web pages as just ordinary lists of words. We're now going to find out how search engines take account of the structure in web pages. But to keep things as simple as possible, we will introduce only one aspect of structuring: we will allow our pages to have a title at the top of the page, followed by the body of the page. The figure above shows our familiar three-page example with some titles added.

Actually, to analyze web page structur
e in the same way that search engines do, we need to know a little more about how web pages are written. Web pages are composed in a special language that allows web browsers to display them in a nicely formatted way. (The most common language for this purpose is called HTML, but the details of HTML are not important for this discussion.) The formatting instructions for headings, titles, links, images, and the like are written using special words called metawords. As an example, the metaword used to start the title of a web page might be , and the metaword for ending the title might be . Similarly, the body of the web page could be started with and ended with . Don't let the symbols “<” and “>” confuse you. They appear on most computer keyboards and are often known by their mathematical meanings as “less than” and “greater than.” But here, they have nothing whatsoever to do with math—they are just being used as convenient symbols to mark the metawords as different from regular words on a web page.

The same set of web pages as in the last figure, but shown as they might be written with metawords, rather than as they would be displayed in a web browser.

Take a look at the figure above, which displays exactly the same content as the previous figure, but now showing how the web pages were actually written, rather than how they would be displayed in a web browser. Most web browsers allow you to examine the raw content of a web page by choosing a menu option called “view source”—I recommend experimenting with this the next time you get a chance. (Note that the metawords used here, such as and , are fictitious, easily recognizable examples to aid our understanding. In real HTML, metawords are called tags. The tags for starting and ending titles in HTML are Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers (John MacCormick) » p.3 » Global Archive Voiced Books Online Free—search for these tags after using the “view source” menu option.)

When building an index, it is a simple matter to include all of the metawords. No new tricks are needed: you just store the locations of the metawords in the same way as regular words. The figure on the next page shows the index built from the three web pages with metawords. Take a look at this figure and make sure you understand there is nothing mysterious going on here. For example, the entry for “mat” is 1-11, 2-11, which means that “mat” is the 11th word on page 1 and also the 11th word on page 2. The metawords work the same way, so the entry for “,” which is 1-4, 2-4, 3-4, means that “” is the fourth word in page 1, page 2, and page 3.

We'll call this simple trick, of indexing metawords in the same way as normal words, the “metaword trick.” It might seem ridiculously simple, but this metaword trick plays a crucial role in allowing search engines to perform accurate searches and high-quality rankings. Let's look at a simple example of this. Suppose for a moment that a search engine supports a special type of query using the IN keyword, so that a query like boat IN TITLE returns hits only for pages that have the word “boat” in the title of the web page, and giraffe IN BODY would find pages whose body contains “giraffe.” Note that most real search engines do not provide IN queries in exactly this way, but some of them let you achieve the same effect by clicking on an “advanced search” option where you can specify that your query words must be in the title, or some other specific part of a document. We are pretending that the IN keyword exists purely to make our explanations easier. In fact, at the time of writing, Google lets you do a title search using the keyword intitle:, so the Google query intitle:boat finds pages with “boat” in the title. Try it for yourself!

The index for the web pages shown in the previous figure, including metawords.

How a search engine performs the search dog IN TITLE.

Let's see how a search engine could efficiently perform the query dog IN TITLE on the three-page example shown in the last two figures. First, it extracts the index entry for “dog,” which is 2-3, 2-7, 3-11. Then (and this might be a little unexpected, but bear with me for a second) it extracts the index entries for both and . That results in 1-1, 2-1, 3-1 for and 1-4, 2-4, 3-4 for . The information extracted so far is summarized in the figure above—you can ignore the circles and boxes for now.

The search engine then starts scanning the index entry for “dog,” examining each of its hits and checking whether or not it occurs inside a title. The first hit for “dog” is the circled entry 2-3, corresponding to the third word of page number 2. By scanning along the entries for , the search engine can find out where the title for page 2 begins—that should be the first number that starts with “2-.” In this case it arrives at the circled entry 2-1, which means that the title for page 2 begins at word number 1. In the same way, the search engine can find out where the title for page 2 ends. It just scans along the entries for , looking for a number that starts with “2-,” and therefore stops at the circled entry 2-4. So page 2's title ends at word 4.

Everything we know so far is summarized by the circled entries in the figure, which tell us the title for page 2 starts at word 1 and ends at word 4, and the word “dog” occurs at word 3. The final step is easy: because 3 is greater than 1 and less than 4, we are certain that this hit for the word “dog” does indeed occur in a title, and therefore page 2 should be a hit for the query dog IN TITLE.

The search engine can now move to the next hit for “dog.” This happens to be 2-7 (the seventh word of page 2), but because we already know that page 2 is a hit, we can skip over this entry and move on to the next one, 3-11, which is marked by a box. This tells us that “dog” occurs at word 11 on page 3. So we start scanning past the current circled locations in the rows for and , looking for entries that start with “3-.” (It's important to note that we do not have to go back to the start of each row—we can pick up wherever we left off scanning from the previous hit.) In this simple example, the entry starting with “3-” happens to be the very next number in both cases—3-1 for and 3-4 for . These are both marked by boxes for easy reference. Once again, we have the task of determining whether the current hit for “dog” at 3-11 is located inside a title or not. Well, the information in boxes tells us that on page 3, “dog” is at word 11, whereas the title begins at word 1 and ends at word 4. Because 11 is greater than 4, we know that this occurrence of “dog” occurs after the end of the title and is therefore not in the title—so page 3 is not a hit for the query dog IN TITLE.

So, the metaword trick allows a search engine to answer queries about the structure of a document in an extremely efficient way. The example above was only for searching inside page titles, but very similar techniques allow you to search for words in hyperlinks, image descriptions, and various other useful parts of web pages. And all of these queries can be answered as efficiently as the example above. Just like the queries we discussed earlier, the search engine does not need to go back and look at the original web pages: it can answer the query by consulting just a small number of index entries. And, just as importantly, it only needs to scan through each index entry once. Remember what happened when we had finished processing the first hit on page 2 and moved to the possible hit on page 3: instead of going back to the start of the entries for and , the search engine could continue scanning from where it had left off. This is a crucial element in making the IN query efficient.

Title queries and other “structure queries” that depend on the structure of a web page are similar to the NEAR queries discussed earlier, in that humans rarely employ structure queries, but search engines use them internally all the time. The reason is the same as before: search engines live or die by their rankings, and rankings can be significantly improved by exploiting the structure of web pages. For example, pages that have “dog” in the title are much more likely to contain information about dogs than pages that mention “dog” only in the body of the page. So when a user enters the simple query dog, a search engine could internally perform a dog IN TITLE search (even though the user did not explicitly request that) to find pages that are most likely to be
about dogs, rather than just happening to mention dogs.

INDEXING AND MATCHING TRICKS ARE NOT THE WHOLE STORY

Building a web search engine is no easy task. The final product is like an enormously complex machine with many different wheels, gears, and levers, which must all be set correctly for the system to be useful. Therefore, it is important to realize that the two tricks presented in this chapter do not by themselves solve the problem of building an effective search engine index. However, the word-location trick and the metaword trick certainly convey the flavor of how real search engines construct and use indexes.

The metaword trick did help AltaVista succeed—where others had failed—in finding efficient matches to the entire web. We know this because the metaword trick is described in a 1999 U.S. patent filing by AltaVista, entitled “Constrained Searching of an Index.” However, AltaVista's superbly crafted matching algorithm was not enough to keep it afloat in the turbulent early days of the search industry. As we already know, efficient matching is only half the story for an effective search engine: the other grand challenge is to rank the matching pages. And as we will see in the next chapter, the emergence of a new type of ranking algorithm was enough to eclipse AltaVista, vaulting Google into the forefront of the world of web search.

‹ Prev Next ›