Now, the computer can tell that it may have been in the middle of a transaction when it crashed, because the log contains some information. But there are four planned actions listed in the log. How can we tell which ones have already been performed on the database and which ones remain to be done? The answer to this question is delightfully simple: it doesn't matter! The reason is that every entry in a database log is constructed so that it has the same effect whether it is performed once, twice, or any other number of times.
The technical word for this is idempotent, so a computer scientist would say that every action in the log must be idempotent. As an example, take a look at entry number 2, “Change Zadie checking from $800 to $600.” No matter how many times Zadie's balance is set to $600, the final effect will be the same. So if the database is recovering after a crash and it sees this entry in the log, it can safely perform the action without worrying about whether it was already performed before the crash too.
Thus, when recovering after a crash, a database can just replay the logged actions of any complete transactions. And it's easy to deal with incomplete transactions too. Any set of logged actions that doesn't finish with an “end transaction” entry simply gets undone in reverse order, leaving the database as if the transaction had never begun. We'll return to this notion of “rolling back” a transaction in the discussion of replicated databases on page 134.
Atomicity, in the Large and in the Small
There is another way of understanding transactions: from the point of view of the database user, every transaction is atomic. Although physicists have known how to split atoms for many decades, the original meaning of “atomic” came from Greek, where it means “indivisible.” When computer scientists say “atomic,” they are referring to this original meaning. Thus, an atomic transaction cannot be divided into smaller operations: either the whole transaction completes successfully, or the database is left in its original condition, as if the transaction had never been started.
So, the to-do list trick gives us atomic transactions, which in turn guarantee consistency. This is a key ingredient in our canonical example: an efficient and completely reliable database for online banking. We are not there yet, however. Consistency does not, by itself, yield adequate efficiency or reliability. When combined with the locking techniques to be described shortly, the to-do list trick maintains consistency even when thousands of customers are simultaneously accessing the database. This does yield tremendous efficiency, because many customers can be served at once. And the to-do list trick also provides a good measure of reliability, since it prevents inconsistencies. Specifically, the to-do list trick precludes data corruption, but does not eliminate data loss. Our next database trick—the prepare-then-commit trick—will produce significant progress toward the goal of preventing any loss of data.
THE PREPARE-THEN-COMMIT TRICK FOR REPLICATED DATABASES
Our journey through ingenious database techniques continues with an algorithm we'll call the “prepare-then-commit trick.” To motivate this trick, we need to understand two more facts about databases: first, they are often replicated, which means that multiple copies of the database are stored in different places; and second, database transactions must sometimes be canceled, which is also called “rolling back” or “aborting” a transaction. We'll briefly cover these two concepts before moving on to the prepare-then-commit trick.
Replicated Databases
The to-do list trick allows databases to recover from certain types of crashes, by completing or rolling back any transactions that were in progress at the time of the crash. But this assumes all the data that was saved before the crash is still there. What if the computer's hard drive is permanently broken and some or all of the data is lost? This is just one of many ways that a computer can suffer from permanent data loss. Other causes include software bugs (in the database program itself or in the operating system) and hardware failures. Any of these problems can cause a computer to overwrite data that you thought was safely stored on the hard drive, wiping it out and replacing it with garbage. Clearly, the to-do list trick can't help us here.
However, data loss is simply not an option in some circumstances. If your bank loses your account information, you will be extremely upset, and the bank could face serious legal and financial penalties. The same goes for a stockbroking firm that executes an order you've placed, but then loses the details of the sale. Indeed, any company with substantial online sales (eBay and Amazon being prime examples) simply cannot afford to lose or corrupt any customer information. But in a data center with thousands of computers, many components (especially hard drives) fail every day. The data on these components is lost every single day. How can your bank keep your data safe in the face of this onslaught?
The obvious, and widely used, solution is to maintain two or more copies of the database. Each copy of the database is called a replica, and the set of all copies taken together is called a replicated database. Often, the replicas are geographically separated (perhaps in different data centers that are hundreds of miles apart), so that if one of them is wiped out by a natural disaster, another replica is still available.
I once heard a computer company executive describe the experiences of the company's customers after the September 11, 2001 terrorist attacks on the twin towers of New York's World Trade Center. The computer company had five major customers in the twin towers, and all were running geographically replicated databases. Four of the five customers were able to continue their operations essentially uninterrupted on surviving database replicas. The fifth customer, unfortunately, had one replica in each tower and lost both! This customer could only resume operations after restoring its database from off-site archival backups.
Note that a replicated database behaves quite differently to the familiar concept of keeping a “backup” of some data. A backup is a snapshot of some data at a particular time—for manual backups, the snapshot is taken at the time you run your backup program, whereas automated backups often take a snapshot of a system at a particular time on a weekly or daily basis, such as every morning at 2 a.m. In other words, a backup is a complete duplicate of some files, or a database, or anything else for which you need a spare copy.
But a backup is, by definition, not necessarily up-to-date: if some changes are made after a backup is performed, those changes are not saved anywhere else. In contrast, a replicated database keeps all copies of the database in sync at all times. Every time the slightest change is made to any entry in the database, all of the replicas must make that change immediately.
Clearly, replication is an excellent way to guard against lost data. But replication has dangers, too: it introduces yet another type of possible inconsistency. What are we going to do if a replica somehow ends up with data that differs from another replica? Such replicas are inconsistent with each other, and it may be difficult or impossible to determine which replica has the correct version of the data. We will return to this issue after investigating how to roll back transactions.
Rolling Back Transactions
At the risk of being a little repetitive, let's try to recall exactly what a transaction is: it's a set of changes to a database that must all take place to guarantee the database remains consistent. In the earlier discussion of transactions, we were mostly concerned with making sure that a transaction would complete even if the database crashed in the middle of the transaction.
But it turns out that sometimes it is impossible to complete a transaction for some reason. For example, perhaps the transaction involves adding a large amount of data to the database, and the computer runs out of disk space halfway through the transaction. This is a very rare, but nevertheless important, scenario.
A much more common reason for failing to complete a transaction relates to another database concept called locking. In a busy database, there are usually many transactions executing at the same time. (Imagine what would happen if your bank permitted only one customer to transfer money at any one time—
the performance of this online banking system would be truly appalling.) But it is often important that some part of the database remains frozen during a transaction. For example, if transaction A is updating an entry to record that Rosina is now friends with Jingyi, it would be disastrous if a simultaneously running transaction B deleted Jingyi from the database altogether. Therefore, transaction A will “lock” the part of the database containing Jingyi's information. This means the data is frozen, and no other transaction can change it. In most databases, transactions can lock individual rows or columns, or entire tables. Obviously, only one transaction can lock a particular part of the database at any one time. Once the transaction completes successfully, it “unlocks” all of the data it has locked, and after this point other transactions are free to alter the previously frozen data.
Deadlock: When two transactions, A and B, both try to lock the same rows—but in the opposite order—they become deadlocked, and neither can proceed.
At first this seems like an excellent solution, but it can lead to a very nasty situation that computer scientists call a deadlock, as demonstrated in the figure above. Let's suppose that two long transactions, A and B, are running simultaneously. Initially, as in the top panel of the figure, none of the rows in the database are locked. Later,
as shown in the middle panel, A locks the row containing Marie's information, and B locks the row containing Pedro's information. Some time after this, A discovers that it needs to lock Pedro's row, and B discovers that it needs to lock Marie's row—this situation is represented in the bottom panel of the figure. Note that A now needs to lock Pedro's row, but only one transaction can lock any row at one time, and B has already locked Pedro's row! So A will need to wait until B finishes. But B can't finish until it locks Marie's row, which is currently locked by A. So B will need to wait until A finishes. A and B are deadlocked, because each must wait for the other to proceed. They will be stuck forever, and these transactions will never complete.
Computer scientists have studied deadlocks in great detail, and many databases periodically run a special procedure for deadlock detection. When a deadlock is found, one of the deadlocked transactions is simply canceled, so that the other can proceed. But note that, just as when we run out of disk space in the middle of a transaction, this requires the ability to abort or “roll back” the transaction that has already been partially completed. So we now know of at least two reasons why a transaction might need to be rolled back. There are many others, but there's no need for us to go into details. The fundamental point is that transactions frequently fail to complete for unpredictable reasons.
Rollback can be achieved using a slight tweak to the to-do list trick: the write-ahead log must contain enough additional information to undo each operation if necessary. (This contrasts with the earlier description, in which we emphasized that each log entry contains enough information to redo the operation after a crash.) This is easy to achieve in practice. In fact, in the simple examples we examined, the undo information and redo information are identical. An entry like “Change Zadie checking from $800 to $600” can easily be “undone”—by simply changing Zadie's checking balance from $600 to $800. To summarize: if a transaction needs to be rolled back, the database program can just work backward through the write-ahead log (i.e., the to-do list), reversing each operation in that transaction.
The Prepare-Then-Commit Trick
Now let's think about the problem of rolling back transactions in a replicated database. The big issue here is that one of the replicas might encounter a problem that requires rollback, while the others do not. For example, it's easy to imagine that one replica runs out of disk space while the others still have space available.
A simple analogy will help here. Suppose that you and three friends would all like to see a recently released movie together. To make things interesting, let's set this story in the 1980s, before the days of e-mail, so the movie trip is going to have to be organized by telephone. How do you go about it? One possible approach is as follows. Decide on a day and time for the movie that work for you, and—as far as you know—are likely to be suitable for your friends too. Let's suppose you choose Tuesday at 8 p.m. The next step is to call one of your friends and ask if he or she is free on Tuesday at 8. If the answer is yes, you'll say something like “great, please pencil that in, and I'll call you back later to confirm.” Then you'll call the next friend and do the same thing. Finally, you call the third and final friend with the same offer. If everyone is available on Tuesday at 8, you make the final decision to confirm the event and call back your friends to let them know.
That was the easy case. What happens if one of the friends is not available on Tuesday at 8? In this case, you will need to “roll back” all of the work done so far and start again. In reality, you would probably call each friend and immediately propose a new day and time, but to keep things as simple as possible here, let's instead assume that you call each friend and say “Sorry, Tuesday at 8 is no good, please erase that from your calendar, and I'll get back to you soon with a new proposal.” Once this is done, you can start the whole procedure all over again.
Notice that there are two distinct phases in your strategy for organizing the movie outing. In the first phase, the date and time have been proposed but are not yet 100% certain. Once you find out that the proposal is feasible for everyone, you know that the date and time are now 100% certain, but everyone else does not. Therefore, there is a second phase in which you call back all of your friends to confirm. Alternatively, if one or more friends were unable to make it, the second phase consists of calling back everyone to cancel. Computer scientists call this the two-phase commit protocol; we'll call it the “prepare-then-commit trick.” The first phase is called the “prepare” phase. The second phase is either a “commit” phase or an “abort” phase, depending on whether the initial proposal has been accepted by everyone or not.
Interestingly, there is a notion of database locking in this analogy. Although we didn't explicitly discuss it, each of your friends makes an implicit promise when they pencil in the movie outing: they are promising not to schedule something else for Tuesday at 8. Until they hear back from you with a confirmation or cancellation, that slot on the calendar is “locked” and cannot be altered by any other “transaction.” For example, what should happen if someone else calls your friend, sometime after the first phase but before the second phase, to propose watching a basketball game on Tuesday at 8? Your friend should say something like “Sorry, but I might have another appointment at that time. Until that appointment is finalized, I can't give you a firm answer about the basketball game.”
Let's now examine how the prepare-then-commit trick works for a replicated database. The figure on the next page demonstrates the idea. Typically, one of the replicas is the “master” that coordinates the transaction. To be specific, suppose there are three replicas, A, B, and C, with A being the master. Suppose the database needs to execute a transaction that inserts a new row of data into a table. The prepare phase begins with A locking that table, then writing the new data into its write-ahead log. At the same time, A sends the new data to B and C, which also lock their own copies of the table and write the new data in their logs. B and C then report back to A on whether they succeeded or failed in doing this. Now the second phase begins. If any of A, B, or C encountered a problem (such as running out of disk space or failing to lock the table), the master A knows that the transaction must be rolled back and informs all replicas of this—see the figure on page 140. But if all the replicas reported success from their prepare stages, A sends a message to each replica confirming the transaction, and the replicas then complete it (as in the figure on the next page).
So far we have two database tricks at our disposal: the to-do list trick and the prepare-then-commit trick. What do they buy us? By combining the two tricks, your bank—and any other online entity—can implement a replicated database with atomic transactions. And this permits simultaneous, eff
icient service to thousands of customers, with essentially zero chance of any inconsistency or data loss. However, we have not yet looked into the heart of the database: how is the data structured, and how are queries answered? Our final database trick will provide some answers to these questions.
RELATIONAL DATABASES AND THE VIRTUAL TABLE TRICK
In all of our examples so far, the database has consisted of exactly one table. But the true power of modern database technology is unleashed in databases that have multiple tables. The basic idea is that each table stores a different set of information, but that entities in the various tables are often connected in some way. So a company's database might consist of separate tables for customer information, supplier information, and product information. But the customer table might mention items in the product table, because customers order products. And perhaps the product table will mention items in the supplier table, because products are manufactured from the suppliers' goods.
The prepare-then-commit trick: The master replica, A, coordinates two other replicas (B, C) to add some new data to the table. In the prepare phase, the master checks whether all replicas will be able to complete the transaction. Once it gets the all clear, the master tells all replicas to commit the data.
The prepare-then-commit trick with rollback: The top panel of this figure is exactly the same as in the previous figure. But during the prepare phase, one of the replicas encounters an error. As a result, the bottom panel is an “abort” phase in which each replica must roll back the transaction.
Let's take a look at a small but real example: the information stored by a college, detailing which students are taking which courses. To keep things manageable, the example will have only a handful of students and courses, but hopefully it will be clear that the same principles apply when the amount of data is much larger.
Nine Algorithms That Changed the Future: The Ingenious Ideas That Drive Today's Computers Page 15