
The Soul of a New Machine


by Tracy Kidder


  Often, Guyer leaves around three in the morning. A few hours later, Veres comes in, and the first thing he does is to read Guyer's notes in the logbook and study the pictures Guyer has taken on the analyzers. Surprisingly often, that's all it takes. Veres knows right away what's wrong and how to fix it. Veres and Guyer make a marvelous debugging team, but only when they aren't working together.

  One time, Rasala came into the lab, gestured across the room at Veres and Guyer, and, smiling, said, "The two Jims." They were sitting face to face in front of Gollum, having an argument about some fine technical point. Guyer was saying, "No, no, no, no, no!" He proceeded to describe his own theory; his hands made bowls in the air, wrapped imaginary packages, scrubbed invisible windows. Veres sat still. You could see the muscles in his jaw flexing and unflexing. When Guyer's spiel wound down, Veres began his retort in a soft voice, telling Guyer that his theory was completely preposterous. Guyer interrupted him. Veres clamped his own mouth shut. Obviously, their temperaments don't always mix. Perhaps they have too much in common. "Guyer and I are very headstrong," says Veres. "And we each want to do it our own way. We get annoyed and lose our trains of thought if it's not being done our own way." Veres also says, later on, to gales of laughter from the other Hardy Boys: "Jim and I don't debug well together. He asks too many questions that I can't answer."

  Veres designed part of the IP, including some of the part known as the I-cache, and Guyer has made this piece of hardware his number-one suspect in almost every debugging case. "There was a time," Veres will say later, "when if anything went wrong, Jim would take out the I-cache. If Gollum ran out of paper, if it went up in smoke, he would take out the I-cache. I didn't appreciate Jim pulling the I-cache all the time."

  Taking out the I-cache means essentially disconnecting it from the rest of the machine. If the machine fails with the I-cache in place but works when the I-cache is taken out, then the problem must lie with the I-cache. That's the theory. But sometimes it's false. Taking out the I-cache makes Eagle very tolerant of faults in other pieces of hardware and in the microcode. "There was a time when everyone was claiming that the IP was wrong, and that put a lot of pressure on me," says Veres. "And from time to time I did end up proving that the problem was not on the IP." But, true, the IP and its cache have been found lying at the bottom of a number of devilish failures, and this bothers Veres. It does not matter that the IP is a much more complex piece of hardware than many others in Eagle, nor that when he helped to design it, Veres was truly a novice and pressed for time. Like everyone else in the group, he feels what Holberger calls "the peer pressure," which leads to an absolute determination not to be the one who fails. Under the circumstances, it is annoying to have the IP blamed every time something goes wrong. Veres helped to design this piece of the computer. It is, as he says, part of him now, and he doesn't like to see it get picked on unfairly.

  For all of their similarities, though, Guyer and Veres like and appreciate each other, and when alone, each will brag about the other's skill. And for short stretches they can in fact work together. They are doing so now. The second shift is just coming in. They're in that transitional period when the first shift, hot on the trail of the problem and reluctant to leave it, and the second shift, eager to get into the case, mingle and work together for a time. Holberger and Veres brief Guyer on the problem. They tell him that at the point of failure the I-cache contains the wrong instruction at the right address. There are, they figure, two possible culprits, the IP or else the System Cache.

  Laughing about it, the three decide to interrogate the System Cache first.

  Early in the debugging someone wrote in the logbook that the Sys Cache was "working perfectly." Since then, when others have blamed failures on this board, Mike Ziegler, the man who designed it, has retorted by quoting that line. "The Sys Cache is working perfectly."

  "It was just a matter of degree after that," says Holberger, who once made the same claim for the IP. "How perfect was the System Cache? Occasionally, we had to make it more perfect."

  This is, of course, just another friendly way to compete: proving that the problem lies on a board that someone else designed. Although Guyer designed neither the System Cache nor the IP, he, too, has a small incentive to hope that this problem lies with the Sys Cache. He has, he says, "been beatin' pretty hard on the IP lately." He knows that Veres feels annoyed with him. So for once, Guyer doesn't even suggest that they disable the I-cache. Instead, by common consent, they start hooking their probes to the System Cache. They look at a few pictures but receive no immediate enlightenment. Tired out now, after more than ten hours in the lab, Holberger and Veres depart and Guyer is left alone with Gollum. He spends the night with it. It is very late. As usual, Guyer is surrounded by logic analyzers, and he is peering at their screens, when suddenly he touches his mouth, wheels around, and starts flipping through one of the big books on the table. "A flash," he says.

  The diagnostic programmer, in writing Eclipse 21, has assigned data and instructions to specific mailboxes in the computer's memory system. From time to time, however, the programmer has switched the game, and has written what is known as "dirty code." What this means is that the program sometimes changes the contents of a mailbox; it moves an instruction, some data, or both from one mailbox to another. Guyer has been studying addresses. When the machine fails, the target instruction — the one it's supposed to execute — is supposed to be in the mailbox with the address 21766. Earlier on, somewhere back in the diagnostic program, Gollum successfully performs the same series of operations that it later fails to execute. And what is interesting is that when it does these operations correctly, the target instruction is actually located at mailbox number 21765. This is Guyer's flash. He takes up the huge, bound logbook and records a hypothesis. The following is a rough translation:

  The diagnostic program originally puts the target instruction at address 21765, and then, sometime later on, it moves the target instruction to 21766. But the IP never gets word of the change, though the System Cache does. Now, sometime after the target instruction is switched from mailbox 21765 to 21766, the program directs Gollum to execute the instruction at 21766. The IP receives this command and looks through its cache. It says to itself, in effect, "Mailbox 21766? I've got that address and there's an instruction in it. Let's run it." But in the I-cache, the target instruction is still at 21765, and mailbox number 21766 contains an error message. In short, the I-cache contains an outdated piece of memory. Why didn't it get updated along with the other parts of the memory system? Maybe, Guyer writes, the System Cache is to blame. The System Cache is supposed to know exactly what is in the I-cache. If an instruction or data gets moved to a new address, the System Cache is supposed to tell the IP to throw away the outdated mailbox and get the new one, the one with the target instruction in it. Somewhere back in the program, Guyer figures, the System Cache lost track of what was in the I-cache. It forgot that the IP had the target instruction in mailbox 21765, and so, when the change was made in the location of the target instruction, it never told the IP to get rid of the old, outdated mailbox. Guyer likes this hypothesis. He records it with mounting enthusiasm; and describing it later, he repossesses the feeling, speaking rapidly, gesturing with both hands. Then he stops, puts his hands on the table, and says, "Of course, it was completely wrong."

  It's very late now, past midnight. Guyer has an analyzer hooked up to the bus — a transmission line, as it were — that carries signals to and from the memory system. He is tracking, somewhat at random, farther and farther back through the diagnostic program, taking pictures of addresses that get generated. Just before he knocks off, he comes across another occasion in which Gollum performs the JSR and Return and hits the target instruction successfully. In this instance, the target instruction is residing at mailbox number 21772, in the same "block" of addresses as it is located later on in the program when the computer fails. Now Guyer knows that in the course of this long diagnostic program, Gollum is being sent to the same general area of memory, many times, to find the target instruction. There's no flash this time, though. Guyer feels "sort of neutral." "It's beginning to look more complex," he is thinking, on his way out of the lab.

  Veres arrives around dawn. He has made this a habit ever since he started to get the hang of debugging. Coming early these days assures him of several hours alone with Gollum.

  In the back of Veres's mind still lies a small suspicion that the problem might after all be noise. And now — much to Guyer's delight, when he finds out later on — it is Veres himself who disconnects the I-cache. Then he runs the program past the point of failure, and everything works. He puts the I-cache back in and once again Gollum fails. This doesn't prove that the IP is to blame, but it does tend to eliminate noise as a suspect, once and for all, because noise tends to cause failures unpredictably, with no discernable pattern.

  Veres turns to the logbook and Guyer's notes. Guyer's made some progress. He's proven that the program directs Gollum to the particular target instruction on a number of occasions. He's shown that the program changes the address of this target instruction many times, and also that the program is switching the contents of mailboxes in one fairly small region of memory. It looks to Veres like one of those problems involving a time bomb, and one way to tackle that sort of crock is to follow the clues back to the culprit. Veres sets out methodically. First, he finds out exactly where in the program the failure occurs. It's on the fourth pass, at the 158th iteration of the subtest that contains the misaddressed target instruction.

  Next, Veres examines addresses rather different from the ones Guyer sampled. These addresses are called "tags." The mailboxes in the machine's memory system are organized in neighborhoods, called "blocks." Each block contains 256 mailboxes. Like each mailbox, each block has a unique address, which is a number. This number is the tag. Veres sees in his analyzer that at the moment when the machine fails, the tag of the block of mailboxes in the I-cache is 21. But the tag for what should be the corresponding block in the System Cache is 45. The two tags should be the same. Veres goes searching to find out which one is right. The answer should reveal whether the culprit is the System Cache or the IP.
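  The block-and-tag bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: mailbox addresses are treated as plain integers, a block number is taken to be the address divided by the 256-mailbox block size, and the function names are hypothetical. The text does not say how Eagle actually encodes its tags, so 21 and 45 appear here only as the values Veres reads off the analyzer, not as numbers derived from this arithmetic.

```python
BLOCK_SIZE = 256  # mailboxes per block, as stated in the text


def block_number(address: int) -> int:
    """Return which block a mailbox address falls in (assumed: integer division)."""
    return address // BLOCK_SIZE


def offset_in_block(address: int) -> int:
    """Return the position of a mailbox inside its block."""
    return address % BLOCK_SIZE


def tags_agree(icache_tag: int, syscache_tag: int) -> bool:
    """The two caches are consistent only if they name the same block."""
    return icache_tag == syscache_tag


# The three mailbox addresses from Guyer's notes, read as plain integers
# (an assumption; the text does not state the number base).
guyer_addresses = [21765, 21766, 21772]
same_block = len({block_number(a) for a in guyer_addresses}) == 1

# The tags Veres observes at the moment of failure.
mismatch = not tags_agree(icache_tag=21, syscache_tag=45)
```

  Under these assumptions the three addresses land in one block, which matches Guyer's observation, and the comparison of 21 against 45 flags the disagreement that Veres sets out to explain.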

  Holberger has arrived by this time and has pulled up a chair. Veres, meantime, has set up the equipment for his search. He's hooked up a couple of logic analyzers in such a way that they will record the tag numbers in the System Cache and IP, both at the time of failure and in each of 256 previous ticks of the computer's clock. He's run the program all over again, up to the point of the failure, and now he's looking back, at one picture after another. This promises to be a long and tedious morning at Gollum. Holberger retreats, to work with Dave Epstein at Coke.

  Nothing turns up. By the time Guyer comes in, Veres hasn't found a single new clue. Holberger holds a short conference. "We need new ideas. We're gonna defer it."

  The Microkids have just delivered a new batch of rewritten microcode, and Guyer agrees to work on testing that for the evening. Veres goes home, with the two tag numbers — the 45 and the 21 — rolling around in his mind. Which one is right? It's a simple enough question, and obviously there's an answer. It's just a matter of finding a way to get it.

  "I think that in the lab Veres is the best of the college kids," Rasala says. "The reason I say that — he's driven. To solving a problem. His whole chemistry, his whole environment is that he'll just go after it and after it. His attitude is that if he can't get time alone on the machine to work on a problem, he'll come in at four in the morning. So he can do it his own way. And his own way is often right."

  Veres himself feels that by the time debugging begins, designs should be more nearly perfect than this one was. A bug in the logic of a design, though discovered and fixed in the lab, stands as a slight reproach to the designer. Not that Veres doesn't enjoy debugging. He just doesn't like mistakes. And as Rasala has noticed, he doesn't just work casually to find and fix a mistake, he attacks.

  So it's up early the next morning for Veres. He doesn't want any interference; he wants time alone with this bug. To Veres, debugging the machine — particularly the IP, the part that he helped to design — has become "a very personal thing. A computer, to people who designed it — it's part of them. You can almost feel what's wrong with it." This problem feels like a time bomb. But how to find it?

  "I get quite a lot of work done in the morning while taking a shower," says Veres. "Showers are kinda boring things, all things considered." Now in the shower, before leaving for work, he conceives a new approach.

  Yesterday he tried to find out whether 21 or 45 was the correct tag by looking back from the point of failure. Evidently, though, the answer lies farther back than an analyzer can look. So why not start searching from the other direction, forward instead of backward? He's done something like this before, on a similar problem. He'll run the program up to the fourth pass, and then every time Gollum performs the JSR and Return, he'll have the machine stop so he can get a picture with the analyzer and a printout on the system console. It'll take a while to go through the program that way. It'll be like searching every luggage locker at Kennedy Airport. It's a good thing he's up early, because Holberger might not stand for this. It's not the most elegant approach, but sometimes there is no elegant approach. He's willing to try this, he's decided by the time he gets to the lab.

  A few hours later Holberger drives into Westborough. The sun is in his eyes this morning, and he wonders in a detached sort of way where it will be hitting his windshield when they finish this job. Debugging Eagle has the feel of a career in itself. Holberger isn't thinking about any one problem, but about all the various problems at once, as he walks into the lab. What greets him there surprises him. He shows it by smiling wryly. A great heap of paper lies on the floor, a continuous sheet of computer paper streaming out of the carriage at Gollum's system console. Stretched out, the sheet would run across the room and back again several times. You could fit a fairly detailed description of American history from the Civil War to the present on it. Veres sits in the midst of this chaos, the picture of the scholar. He's examined it all. He turns to Holberger. "I found it," he says.

  At the 122nd iteration of the subtest in question, the I-cache contains the block of instructions with tag number 21. Millions of ticks of the computer's clock and thousands of instructions later, at iteration 151 of the subtest, Veres has observed the System Cache instructing the IP to replace tag number 21 with tag number 45. The System Cache has proved its innocence. The IP, by inference, must have disobeyed the order; at iteration 158 the I-cache is caught in the act of still harboring tag number 21. "Which, I'm very sorry to say, is wrong," Veres says. The IP, his board, is the villain.
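  Veres's forward search amounts to scanning a log of snapshots from the start and stopping at the first point where the I-cache's tag no longer matches what the System Cache has told it to hold. The Python below is a minimal sketch of that idea, not a reconstruction of his procedure: the `Snapshot` type and `first_divergence` function are hypothetical names, and the log contains only the three iterations the text mentions, with the expected tag at iteration 122 assumed to be 21.

```python
from typing import Iterable, NamedTuple, Optional


class Snapshot(NamedTuple):
    iteration: int     # iteration of the subtest
    icache_tag: int    # tag the I-cache actually holds
    expected_tag: int  # tag the System Cache has told the IP to hold


def first_divergence(log: Iterable[Snapshot]) -> Optional[Snapshot]:
    """Walk forward through the log; return the earliest mismatching snapshot."""
    for snap in sorted(log, key=lambda s: s.iteration):
        if snap.icache_tag != snap.expected_tag:
            return snap
    return None  # the caches never disagreed


# Illustrative log built from the three iterations named in the text.
log = [
    Snapshot(iteration=122, icache_tag=21, expected_tag=21),  # assumed consistent
    Snapshot(iteration=151, icache_tag=21, expected_tag=45),  # replace order ignored
    Snapshot(iteration=158, icache_tag=21, expected_tag=45),  # failure observed
]

culprit = first_divergence(log)
```

  Scanning forward finds the mismatch at iteration 151, seven iterations before the visible failure at 158, which is why looking back from the failure with an analyzer of limited depth turned up nothing.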

  Some problems are easy to find and hard to fix; some are hard to find and easy to fix; some go both ways. They have seen and will continue to encounter permutations of all three. This one was hard to find. It happens to be easy, almost trivial, to repair. Now Holberger and Veres know where the failure occurs. They move fast. Seen working at such a moment, they might remind you of a couple of airline pilots, in the cockpit of their big jet, preparing for takeoff—heroes of technique, flicking switches with both hands, reading dials, and talking to the tower all at once. Veres and Holberger play the program up to iteration 151, the place where the time bomb gets set, where the Sys Cache tells the IP to get rid of the invalidated block of instructions and to bring in tag 45. They hook up analyzers and look at a number of different kinds of pictures. When, in fairly short order, the crucial picture appears on the screen of one of the analyzers, they don't need to perform any exegesis at all.

  "There it is."

  "Yup."

  They see the IP throwing out tag number 45 and keeping the old, invalid tag 21. A few more pictures show that the IP is, quite literally, getting its signals crossed. The IP gets from the System Cache the signal to throw away tag 21, but before it can obey, the signal from the Sys Cache gets changed by another signal coming from another part of the machine. The solution lies in delaying the arrival of that second signal, so that the IP will always have time to clear out an old block before it's asked to do something else.

  The solution takes the material form of a circuit called a NAND gate, which reproduces the "not and" function of Boolean algebra. The part costs eight cents, wholesale. The NAND gate produces a signal. Writing up the ECO, Holberger christens this signal "NOT YET." He's very pleased with the name. Schematics he's seen from other companies use formal, technical names for signals. The Eclipse Group, by contrast, looks for something simple that fits and if they can't come up with something appropriate they're apt to use their own names. "NOT YET" perfectly describes what this signal does. That's the Eclipse Group's way, Holberger notes. It's the general approach that West has in mind when he says, "No muss, no fuss." It's also a way — a small one, to be sure — of leaving something of yourself inside your creations.
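  For readers unfamiliar with the "not and" function, a NAND gate's output is false only when both of its inputs are true. The short Python sketch below prints nothing and simply builds the truth table; the function name is illustrative, and the text does not say which signals feed the gate that produces NOT YET, so no wiring is modeled here.

```python
def nand(a: bool, b: bool) -> bool:
    """Boolean 'not and': false only when both inputs are true."""
    return not (a and b)


# Full truth table as (input_a, input_b, output) rows.
truth_table = [(a, b, nand(a, b)) for a in (False, True) for b in (False, True)]
```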

  This is fun. They install the NAND gate that produces the NOT YET signal, and Holberger writes in the logbook, "With this ECO installed, Eclipse 21 runs 10 passes." Just one more routine chore remains. They have to make certain that this change doesn't foul up something else in the machine. So they start rerunning all the other various diagnostic programs that Gollum has already passed, and everything is proceeding nicely, when all of a sudden the console starts scratching out a message.
