The beauty of the random surfer trick is that, unlike the authority trick, it works perfectly well whether or not there are cycles in the hyperlinks. Going back to our earlier scrambled egg example (page 29), we can easily run a random surfer simulation. After several million visits, my own simulation produced the surfer authority scores shown in the figure above. Notice that, as with our earlier calculation using the authority trick, Bert's page receives a much higher score than Ernie's (28% vs. 1%)—despite the fact that each has exactly one incoming link. So Bert would be ranked higher in a web search query for “scrambled eggs.”
Now let's turn to the more difficult example from earlier: the figure on page 30, which caused an insurmountable problem for our original authority trick because of the cycle of hyperlinks. Again, it's easy to run a computer simulation of random surfers, producing the surfer authority scores in the figure above. The surfer authority scores determined by this simulation tell us the final ranking that would be used by a search engine when returning results: page A is highest, followed by B, then E, with C and D sharing last place.
Surfer authority scores for the earlier example with a cycle of hyperlinks (page 30). The random surfer trick has no trouble computing appropriate scores, despite the presence of a cycle (A –> B –> E –> A).
PAGERANK IN PRACTICE
The random surfer trick was described by Google's cofounders in their now-famous 1998 conference paper, “The Anatomy of a Large-scale Hypertextual Web Search Engine.” In combination with many other techniques, variants of this trick are still used by the major search engines. There are, however, numerous complicating factors, which mean that the actual techniques employed by modern search engines differ somewhat from the random surfer trick described here.
One of these complicating factors strikes at the heart of PageRank: the assumption that hyperlinks confer legitimate authority is sometimes questionable. We already learned that although hyperlinks can represent criticisms rather than recommendations, this tends not to be a significant problem in practice. A much more severe problem is that people can abuse the hyperlink trick to artificially inflate the ranking of their own web pages. Suppose you run a website called BooksBooksBooks.com that sells (surprise, surprise) books. Using automated technology, it's relatively easy to create a large number—say, 10,000—of different web pages that all have links to BooksBooksBooks.com. Thus, if search engines computed PageRank authorities exactly as described here, BooksBooksBooks.com might undeservedly get a score thousands of times higher than other bookstores, resulting in a high ranking and thus more sales.
Search engines call this kind of abuse web spam. (The terminology comes from an analogy with e-mail spam: unwanted messages in your e-mail inbox are similar to unwanted web pages that clutter the results of a web search.) Detecting and eliminating various types of web spam are important ongoing tasks for all search engines. For example, in 2004, some researchers at Microsoft found over 300,000 websites that had exactly 1001 pages linking to them—a very suspicious state of affairs. By inspecting these websites manually, the researchers found that the vast majority of these incoming hyperlinks were web spam.
Hence, search engines are engaged in an arms race against web spammers and are constantly trying to improve their algorithms in order to return realistic rankings. This drive to improve PageRank has spawned a great deal of academic and industrial research into other algorithms that use the hyperlink structure of the web for ranking pages. Algorithms of this kind are often referred to as link-based ranking algorithms.
Another complicating factor relates to the efficiency of PageRank computations. Our surfer authority scores were computed by running random simulations, but running a simulation of that kind on the entire web would take far too long to be of practical use. So search engines do not compute their PageRank values by simulating random surfers: they use mathematical techniques that give the same answers as our own random surfer simulations, but with far less computational expense. We studied the surfer-simulation technique because of its intuitive appeal, and because it describes what the search engines calculate, not how they calculate it.
It's also worth noting that commercial search engines determine their rankings using a lot more than just a link-based ranking algorithm like PageRank. Even in their original, published description of Google back in 1998, Google's cofounders mentioned several other features that contributed to the ranking of search results. As you might expect, the technology has moved on from there: at the time of writing, Google's own website states that “more than 200 signals” are used in assessing the importance of a page.
Despite the many complexities of modern search engines, the beautiful idea at the heart of PageRank—that authoritative pages can confer authority on other pages via hyperlinks—remains valid. It was this idea that helped Google to dethrone AltaVista, transforming Google from small startup to king of search in a few heady years. Without the core idea of PageRank, most web search queries would drown in a sea of thousands of matching, but irrelevant, web pages. PageRank is indeed an algorithmic gem that allows a needle to rise effortlessly to the top of its haystack.
4
Public Key Cryptography: Sending Secrets on a Postcard
Who knows those most secret things of me that are hidden from the world?
—BOB DYLAN, Covenant Woman
Humans love to gossip, and they love secrets. And since the goal of cryptography is to communicate secrets, we are all natural cryptographers. But humans can communicate secretly more easily than computers. If you want to tell a secret to your friend, you can just whisper in your friend's ear. It's not so easy for computers to do that. There's no way for one computer to “whisper” a credit card number to another computer. If the computers are connected by the internet, they have no control over where that credit card number goes, and which other computers get to find it out. In this chapter we'll find out how computers get around this problem, using one of the most ingenious and influential computer science ideas of all time: public key cryptography.
At this point, you may be wondering why the title of this chapter refers to “sending secrets on a postcard.” The figure on the facing page reveals the answer: communicating via postcards can be used as an analogy to demonstrate the power of public key cryptography. In real life, if you wanted to send a confidential document to someone, you would, of course, enclose the document in a securely sealed envelope before sending it. This doesn't guarantee confidentiality, but it is a sensible step in the right direction. If, on the other hand, you chose to write your confidential message on the back of a postcard before sending it, confidentiality is obviously violated: anyone who comes in contact with the postcard (postal workers, for example) can just look at the postcard and read the message.
This is precisely the problem that computers face when trying to communicate confidentially with each other on the internet. Because any message on the internet typically travels through numerous computers called routers, the contents of the message can be seen by anyone with access to the routers—and this includes potentially malicious eavesdroppers. Thus, every single piece of data that leaves your computer and enters the internet might as well be written on a postcard!
The postcard analogy: It's obvious that sending a postcard through the mail system will not keep the contents of the postcard secret. For the same reason, a credit card number sent from your laptop to Amazon.com can easily be snooped by an eavesdropper if it is not properly encrypted.
You may have already thought of a quick fix for this postcard problem. Why don't we just use a secret code to encrypt each message before writing it on a postcard? Actually, this works just fine if you already know the person to whom you're sending the postcard. This is because you could have agreed, at some time in the past, on what secret code to use. The real problem is when you send a postcard to someone you don't know. If you use a secret code on this postcard, the postal workers will not be able to read your message, but neither will the inte
nded recipient! The real power of public key cryptography is that it allows you to employ a secret code that only the recipient can decrypt—despite the fact that you had no chance to secretly agree on what code to use.
Note that computers face the same problem of communicating with recipients they don't “know.” For example, the first time you purchase something from Amazon.com using your credit card, your computer must transmit your credit card number to a server computer at Amazon. But your computer has never communicated before with the Amazon server, so there has been no opportunity in the past for these two computers to agree on a secret code. And any agreement they try to make can be observed by all routers on the internet route between them.
Let's switch back to the postcard analogy. Admittedly, the situation sounds like a paradox: the recipient will see exactly the same information as the postal workers, but somehow the recipient will learn how to decode the message, whereas the postal workers will not. Public key cryptography provides a resolution to this paradox. This chapter explains how.
ENCRYPTING WITH A SHARED SECRET
Let's start with a very simple thought experiment. We'll abandon the postcard analogy for something even simpler: verbal communication in a room. Specifically, you're in a room with your friend Arnold and your enemy Eve. You want to secretly communicate a message to Arnold, without Eve being able to understand the message. Maybe the message is a credit card number, but let's keep things simple and suppose that it's an incredibly short credit card number—just a single digit between 1 and 9. Also, the only way you're allowed to communicate with Arnold is by speaking out loud so that Eve can overhear. No sneaky tricks, like whispering or passing him a handwritten note, are permitted.
To be specific, let's assume the credit card number you're trying to communicate is the number 7. Here's one way you could go about it. First, try to think of some number that Arnold knows but Eve doesn't. Let's say, for example, that you and Arnold are very old friends and lived on the same street as children. In fact, suppose you both often played in the front yard of your family's house at 322 Pleasant Street. Also, suppose that Eve didn't know you as a child and, in particular, she doesn't know the address of this house where you and Arnold used to play. Then you can say to Arnold: “Hey Arnold, remember the number of my family's house on Pleasant Street where we used to play as children? Well, if you take that house number, and add on the 1-digit credit card number I'm thinking of right now, you get 329.”
The addition trick: The message 7 is encrypted by adding it to the shared secret, 322. Arnold can decrypt it by subtracting the shared secret, but Eve cannot.
Now, as long as Arnold remembers the house number correctly, he can work out the credit card number by subtracting off the house number from the total you told him: 329. He calculates 329 minus 322 and gets 7, which is the credit card number you were trying to communicate to him. Meanwhile, Eve has no idea what the credit card number is, despite the fact that she heard every word you said to Arnold. The figure above demonstrates the whole process.
Why does this method work? Well, you and Arnold have a thing that computer scientists call a shared secret: the number 322. Because you both know this number, but Eve doesn't, you can use the shared secret to secretly communicate any other number you want, just by adding it on, announcing the total, and letting the other party subtract the shared secret. Hearing the total is of no use to Eve, because she doesn't know what number to subtract from it.
Believe it or not, if you understood this simple “addition trick” of adding a shared secret to a private message like a credit card number, then you already understand how the vast majority of encryption on the internet actually works! Computers are constantly using this trick, but for it to be truly secure there are a few more details that need to be taken care of.
First, the shared secrets that computers use need to be much longer than the house number 322. If the secret is too short, anyone eavesdropping on the conversation can just try out all the possibilities. For example, suppose we used a 3-digit house number to encrypt a real 16-digit credit card number using the addition trick.
Note that there are 999 possible 3-digit house numbers, so an adversary like Eve who overheard our conversation could work out a list of 999 possible numbers, of which one must be the credit card number. It would take a computer almost no time at all to try out 999 credit card numbers, so we need to use a lot more than 3 digits in a shared secret if it is going to be useful.
In fact, when you hear about a type of encryption being a certain number of bits, as in the phrase “128-bit encryption,” this is actually a description of how long the shared secret is. The shared secret is often called a “key,” since it can be used to unlock, or “decrypt,” a message. If you work out 30% of the number of bits in the key, that tells you the approximate number of digits in the key. So because 30% of 128 is about 38, we know that 128-bit encryption uses a key that is a 38-digit number.1 A 38-digit number is bigger than a billion billion billion billion, and because it would take any known computer billions of years to try out that many possibilities, a shared secret of 38 digits is considered to be very secure.
There's one more wrinkle that prevents the simple version of the addition trick from working in real life: the addition produces results that can be analyzed statistically, meaning that someone could work out your key based on analyzing a large number of your encrypted messages. Instead, modern encryption techniques, called “block ciphers,” use a variant of the addition trick.
First, long messages are broken up into small “blocks” of a fixed size, typically around 10-15 characters. Second, rather than simply adding a block of the message and the key together, each block is transformed several times according to a fixed set of rules that are similar to addition but cause the message and the key to be mixed up more aggressively. For example, the rules could say something like “add the first half of the key to the last half of the block, reverse the result and add the second half of the key to the last half of the block”—although in reality the rules are quite a bit more complicated. Modern block ciphers typically use 10 or more “rounds” of these operations, meaning the list of operations is applied repeatedly. After a sufficient number of rounds, the original message is well and truly mixed up and will resist statistical attacks, but anyone who knows the key can run all the operations in reverse to obtain the original, decrypted message.
At the time of writing, the most popular block cipher is the Advanced Encryption Standard, or AES. AES can be used with a variety of different settings, but a typical application might use blocks of 16 characters, with 128-bit keys, and 10 rounds of mixing operations.
ESTABLISHING A SHARED SECRET IN PUBLIC
So far, so good. We've already found out how the vast majority of encryption on the internet actually works: chop the message up into blocks and use a variant of the addition trick to encrypt each block. But it turns out that this is the easy part. The hard part is establishing a shared secret in the first place. In the example given above, where you were in a room with Arnold and Eve, we actually cheated a bit—we used the fact that you and Arnold had been playmates as children and therefore already knew a shared secret (your family's house number) that Eve couldn't possibly know. What if you, Arnold, and Eve were all strangers, and we tried to play the same game? Is there any way that you and Arnold can set up a shared secret without Eve also knowing it? (Remember, no cheating—you can't whisper anything to Arnold or give him a note that Eve can't see. All communication must be public.)
At first this might seem impossible, but it turns out that there is an ingenious way of solving the problem. Computer scientists call the solution Diffie-Hellman key exchange, but we're going to call it the paint-mixing trick.
The Paint-Mixing Trick
To understand the trick, we're going to forget about communicating credit card numbers for a while, and instead imagine that the secret you would like to share is a particular color of paint. (Yes, this is a little weird
, but as we'll soon see, it's also a very useful way of thinking about the problem.) So now suppose that you are in a room with Arnold and Eve and each of you has a huge collection of various pots of paint. You are each given the same choice of colors—there are many different colors available, and each of you has many pots of each color. So running out of paint is not going to be a problem. Each pot is clearly labeled with its color, so it's easy to give specific instructions to someone else about how to mix various colors together: you just say something like “mix one pot of ‘sky blue' with six pots of ‘eggshell' and five pots of ‘aquamarine'.” But there are hundreds or thousands of colors of every conceivable shade, so it's impossible to work out which exact colors went into a mixture just by looking at it. And it's impossible to work out which colors went into a mixture by trial and error, because there are just too many colors to try.
Now, the rules of the game are going to change just a little bit. Each of you is going to have a corner of the room curtained off for privacy, as a place where you store your paint collection and where you can go to secretly mix paints without the others seeing. But the rules about communication are just the same as before: any communication between you, Arnold, and Eve must be out in the open. You can't invite Arnold into your private mixing area! Another rule regulates how you can share mixtures of paint. You can give a batch of paint to one of the other people in the room, but only by placing that batch on the ground in the middle of the room and waiting for someone else to pick it up. This means that you can never be sure who is going to pick up your batch of paint. The best way is to make enough for everybody, and leave several separate batches in the middle of the room. That way anyone who wants one of your batches can get it. This rule is really just an extension of the fact that all communication must be public: if you give a certain mixture to Arnold without giving it to Eve too, you have had some kind of “private” communication with Arnold, which is against the rules.
Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers Page 5