by Tracy Kidder
West had known such exultation himself — the joy derived from mastering machines, both building and repairing them. My wife mentioned to West one day that our son's record player had broken down. "Where is it? I'll take it with me," snapped West, adding with a look that my wife found a little frightening, "I can fix anything." What the thing was, whether a car's engine or a computer, did not matter; but since computers were among the most complex of all man-made things, they had always seemed to him, he said, to pose interesting challenges. Eagle, in this regard, was something special.
In his office one relatively serene afternoon, West said he had heard that IBM had canceled plans for a certain new computer, because the machine promised to be so complex that any given engineer would need more than a lifetime to understand it fully. "I don't know why they didn't just build the thing and see what it would do," said West. Eagle's complexity fell far short of that mark, but it was complicated enough to defy single-handed efforts. "I always wanted to do something like this," he said. "Build something larger than myself.
"Among those who chucked the established ways, including me, there's something awfully compelling about this," West said of building Eagle. "Some notion of insecurity and challenge, of where the edges are, of finding out what you can't do, all within a perfectly justifiable scenario. It's for the kind of guy who likes to climb up mountains." A couple of engineers in the group had taken up the rather dangerous hobby of rock-climbing. West may have had them in mind, but it sounded as though he was thinking of the scaling of Everest. On the wall of West's office, beside his chair and a little above his head, were the pictures of some of those old computers, the machines that he could not bear to examine. Were they there to comfort him or to keep him nervous? "Everyone assumes that this one will work because all those others have," he said, rolling a pencil around in his thumbs. The fact of a string of previous successes, though, could imply the imminence of failure. "Realistically, you gotta lose one sometime," he said with a small smile.
I was lying in West's guest bedroom very late one night. His wife and two daughters had turned in long ago. On the perimeter of sleep, I heard West out in the living room take up his guitar and sing. He sounded rusty, but his voice, a tenor, could carry a tune nicely. He did not sing the sorts of songs that I gathered he played currently with his friends at their jam sessions, but once-popular folk songs — "The Banks of the O-hi-o" and the like. Those are seductive ballads. If you listen to them long enough, you can start believing that your way in life is strewn with possibilities.
THE CASE OF THE MISSING NAND GATE
Late in April, after the first deadline had come and gone, Ed Rasala was sitting in his cubicle, examining his third, revised debugging schedule. Down the hall past his open doorway came Dave Epstein, one of the best of the Hardy Boys, whose humorous look, a grin that almost envelops his eyes, can by itself provoke smiles. Epstein was carrying a wire-wrapped board. He was holding it in both hands like a tray, with the side containing all the wires facing up. The thing looked terrible, the very picture of a kludge. For out of the nest of wires on the board, three little strands of wire stood straight up in the air, attached to nothing, with little bits of adhesive tape wrapped like tiny flags to their loose ends.
Rasala looked up, saw Epstein and his cargo, closed his eyes, shook his head, and looked again. "Hey!" Rasala cried.
Epstein stopped. He poked his head in around the partition and, grinning, made as if to offer the terrible-looking board to Rasala.
Rasala put his hands on his desk and buried his face in them.
It was just another routine day down at debugging headquarters. In theory, it would be possible to test fully a computer like Eagle, but it would take literally forever to do so. The veterans in the Eclipse Group maintained that most computers never get completely debugged. Typically, they said, a machine gets built and sent to market and in its first year out in public a number of small, and sometimes large, defects in its design crop up and get repaired. As the years go by, the number of bugs declines, but although no flaws in a computer's design might appear for years, defects would probably remain in it — ones so small and occurring only under such peculiar circumstances that they might never show up before the machine became obsolete or simply stopped functioning because of dust in its chips. Big bugs in the logic of a machine, however, and even what might seem to be fairly small ones, have to be found and cast out in the lab. The hardware of modern computers is remarkably reliable, and needs to be. A computer like Eagle does a cycle of work in 220 billionths of a second. If it tended to fail only once every million cycles, it would be a very unreliable contraption indeed.
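The arithmetic behind that last claim can be sketched in a few lines. The cycle time and the failure rate come from the passage above; the function names are hypothetical, chosen only for illustration.

```python
# Figures taken from the text: one cycle every 220 billionths of a second,
# and a hypothetical machine that fails once per million cycles.
CYCLE_SECONDS = 220e-9
CYCLES_PER_FAILURE = 1_000_000


def seconds_between_failures(cycle_seconds=CYCLE_SECONDS,
                             cycles_per_failure=CYCLES_PER_FAILURE):
    """Average running time between failures, in seconds."""
    return cycle_seconds * cycles_per_failure


def failures_per_second(cycle_seconds=CYCLE_SECONDS,
                        cycles_per_failure=CYCLES_PER_FAILURE):
    """Average number of failures in one second of running time."""
    return 1.0 / seconds_between_failures(cycle_seconds, cycles_per_failure)


if __name__ == "__main__":
    print(f"{seconds_between_failures():.2f} seconds between failures")
    print(f"{failures_per_second():.1f} failures per second")
```

A machine that failed once in a million cycles would, on these numbers, break down roughly every fifth of a second.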
At first, the Eclipse Group worked on Eagle board by board, trying to make the machine functional in the most basic ways. This took months, and might have cost more time if Carl Alsing and Rasala had not finally persuaded West of the need for microdiagnostic programs. Then they turned to the higher-level diagnostics, and the real fun began.
According to the Eclipse Group's theory of debugging a computer, you did not try to prove by exhaustive analysis that the machine was in all its details logically correct. You exercised the computer instead, and fixed it when it didn't work. The higher-level diagnostics had to provide the exhaustive analysis, in other words. They were crucial. They had to exercise Eagle strenuously. They had to be nasty, unfair and subtle, and not full of errors themselves.
A long list of programs that tested 16-bit Eclipses had been worked out over the years. The first ones on the list were fairly easy tests. They got progressively harder. Eagle would have to run all of them, and when it did they'd be able to say it was a bona fide 16-bit Eclipse. But it would also have to demonstrate that it could be a 32-bit Eagle, of course, and the diagnostics that would test Eagle's 32-bit-hood didn't exist when they started debugging. This worried Rasala. "West has everything in the company moving on. The printed-circuit boards are getting built. If there is a major flaw in Eagle, I want to find out about it now and minimize the impact." Rasala wanted 32-bit diagnostics that were tough and thorough and he wanted them right away. But the Diagnostics support group was producing them very slowly. Rasala feuded with Diagnostics. Fear made him angry. He shouted and he threatened, but none of it seemed to do much good. Those diagnostics came in gradually.
In the lab one day a sign appeared. Someone must have plucked it from the side of a road. It bore a picture of the national bird and the inscription Eagle Associates, Real Estate. Another small device sat on top of Coke beside the microNova. This little thing had been designed as an aid to debugging Eagle, but no one ever got around to debugging the device itself. Remembering the assemble-it-yourself kits of electronic equipment that have been the starting points for many an engineering career, someone attached to the undebugged debugging device a label that said Heathkit. Soon these things lost their power as jokes and became part of the furniture of the lab. When they had started debugging, the wires on the backs of the boards had all been blue. They made their changes with red wires. The backs of the wire-wrapped boards got redder and redder. Slowly, painfully, Eagle was becoming an Eclipse.
Once in a while someone would find a problem, fix it temporarily, and move on. The engineer would plan to come back and make a proper repair but might forget about it, and weeks later it would cause some mysterious failure. Inevitably, this happened. They lost some time that way. Through the winter, the wire-wrapped boards held up well, but come April, constant handling was beginning to make some of the many mechanical connections unreliable. Wires and the sockets in which the chips were housed began coming loose once in a while, causing "flakey" failures. These occurred erratically and were often hard to diagnose; as the debuggers liked to say, it's hard to fix something when it's working. Here and there a bad chip impeded their progress. "Just stuff you never account for in a schedule," Rasala said. "You assume it's not gonna happen, and it always does."
In addition, the Eclipse Group's engineers were finding plenty of bugs in the logic of their design. "We went with an imperfect design," said Rasala. "We knew we were pushing it." So his schedules slipped and slipped, and slipped again. "The way to stay on schedule," he said, "is to make another one." So far, though, West could say, as he almost always did when asked about the machine, "Nothing fatal yet." Rasala would say, "It's comin', it's comin'." Usually he would add, "There's still a good chance that we've totally blown it."
For many months, no single variety of problem fed his worry, just whatever the next diagnostic program happened to turn up. But Rasala told me early on, "I believe there's always a focal point in any machine." And by May, it had identified itself as far as he was concerned. It was the part of Eagle called the Instruction Processor, the IP. Perhaps this was, at last, the bogeyman.
West had helped to debug the first Eclipse and some of the subsequent models by using an oscilloscope to look inside the machines. "An oscilloscope," said Jim Veres of the Hardy Boys, "is what cavemen used to debug fire." The Hardy Boys had much better tools, computers essentially, to debug their new computer. Rasala liked to tell them, "You guys don't know what fun it is to bring up a machine." In fact, however, you probably couldn't build a computer of Eagle's class without the help of several functioning computers, and especially not without the help of the logic analyzers.
One crucial difference between Eagle and earlier machines involved the parts of Eagle known as accelerators. These were primarily the System Cache and the IP, both of which were designed in order to eliminate the bottleneck between the machine and its storage. Think of a program as a list of assembly-language instructions and data for the instructions to apply to. A computer without accelerators runs through this list in a halting manner. It must execute an instruction, then seek out in its memory or in peripheral-storage devices the next instruction, bring it back, figure out what it requires, and finally execute it. It can take a lot more time to retrieve the next instruction and prepare it for execution than it does actually to execute it. So if you can arrange to have those first two operations performed at the same time, you can greatly increase the speed of your computer.
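The gain from overlapping can be sketched with a toy timing model. Everything in it is an assumption made for illustration: the time units are arbitrary, the function names are hypothetical, and the model is a simple two-stage overlap (prepare the next instruction while the current one executes), not a description of Eagle's actual hardware.

```python
def sequential_time(n, fetch, decode, execute):
    """Total time when each instruction is fetched, decoded and executed
    strictly one after another."""
    return n * (fetch + decode + execute)


def overlapped_time(n, fetch, decode, execute):
    """Total time when fetching and decoding the NEXT instruction overlaps
    with executing the CURRENT one (a two-stage overlap)."""
    if n == 0:
        return 0
    prepare = fetch + decode
    # The first instruction must be prepared before anything can execute.
    # After that, each step costs whichever stage is slower, and the last
    # instruction still has to execute once nothing is left to prepare.
    return prepare + (n - 1) * max(prepare, execute) + execute


if __name__ == "__main__":
    # Hypothetical costs: 1 unit to fetch, 1 to decode, 2 to execute.
    n = 1000
    print("sequential:", sequential_time(n, 1, 1, 2))
    print("overlapped:", overlapped_time(n, 1, 1, 2))
```

With these made-up costs the overlapped machine finishes a thousand instructions in about half the time; the real ratio depends on how long retrieval takes compared with execution.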
That, approximately, was the theory behind the IP. While it is telling the computer what instruction to execute right now, the IP is, in a sense, making assumptions about what the next several instructions in a program will be. At any given time, the IP will have one instruction in execution, one instruction "decoded" and prepared for execution, one instruction in the process of being prepared and one or more instructions already retrieved and about to be decoded. The IP has a fairly small storage compartment of its own, in which it keeps instructions that are likely to be required next and also instructions that have been used recently. Computer programs tend to repeat themselves, or to "loop." In the best case, when a program loops, the IP already has the next instructions in its storage; therefore, it doesn't have to send out for them, and time is saved. The IP is a complex device.
The other main accelerator, the Sys Cache, also makes assumptions about what the computer will be asked to do next. It contains a storage compartment considerably larger than the IP's; like the IP's, the Sys Cache's storage consists of expensive memory chips that operate with great speed. Among other roles, the System Cache tries to keep handy commonly used instructions and data, so that if the IP doesn't have the necessary information to order up the next step in a program, it can get it quickly from the System Cache. In this situation, if what's needed is in the Sys Cache, time is saved; it would take much longer for the Sys Cache to get the necessary information out of Main Memory and then pass it on to the IP, because in Main Memory the chips operate at relatively slow speed and there are a great many instructions and data to sort through.
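The order in which the machine looks for an instruction can be sketched as follows. This is a minimal model, not Eagle's design: the class name, the access costs and the dictionary-based storage are all hypothetical, and the sketch never throws anything out of the small stores, which a real accelerator must do.

```python
# Hypothetical access costs in arbitrary time units (assumptions, not
# figures from the text): the smaller the store, the faster the access.
IP_COST = 1
SYS_CACHE_COST = 3
MAIN_MEMORY_COST = 20


class MemorySystem:
    """Toy model of the IP store, the Sys Cache and Main Memory."""

    def __init__(self, main_memory):
        self.main_memory = dict(main_memory)  # holds everything
        self.sys_cache = {}                   # larger fast store
        self.ip_store = {}                    # small fast store

    def fetch(self, address):
        """Return (instruction, cost), looking in the nearest store first."""
        if address in self.ip_store:
            return self.ip_store[address], IP_COST
        if address in self.sys_cache:
            value = self.sys_cache[address]
            self.ip_store[address] = value    # keep a copy close at hand
            return value, IP_COST + SYS_CACHE_COST
        value = self.main_memory[address]     # the slow trip
        self.sys_cache[address] = value
        self.ip_store[address] = value
        return value, IP_COST + SYS_CACHE_COST + MAIN_MEMORY_COST


if __name__ == "__main__":
    memory = MemorySystem({0: "LOAD", 1: "ADD"})
    print(memory.fetch(0))  # first use goes all the way to Main Memory
    print(memory.fetch(0))  # a repeat is served from the IP's own store
```

When a program loops, the second and later trips through the same addresses pay only the smallest cost, which is the saving the accelerators exist to provide.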
Accelerators represent one clever, well-known means of gaining speed, and they're well worth the hardware if what's needed is a fast computer. But from the point of view of debuggers, accelerators can be infernal devices, not because of the quantities but because of the kinds of problems they create. Inconsistency between them is one of the deadly species of crocks.
To visualize Eagle's memory system, imagine a funnel. At the narrow end are the IP's storage compartment and other small compartments located in other parts of the machine. The Sys Cache is a wider storage, and the boards of Main Memory are widest, the funnel's mouth. Main Memory holds a lot of data and instructions, including an exact copy of everything in the Sys Cache and in the IP. Exact copy is the crucial term.
The accelerators are forever throwing out blocks of information and bringing in new ones, making their assumptions according to certain fixed and clever rules. They must do a fair amount of internal housekeeping in order to make sure that they are always consistent with Main Memory and with each other. If, for instance, a given block of instructions is residing in both the IP and Sys Cache, both must contain identical copies of that block.
But suppose the Sys Cache is changing the contents of its storage compartment, and at just the wrong moment some electronic event intervenes, in some unforeseen way. Suppose further that this electronic event causes one or the other of the accelerators to err in its internal housekeeping. And suppose this error causes the IP to contain a block of instructions that "looks" the same but is in fact slightly different from the one that the System Cache contains. When this happens, the machine is prepared to fail. There's a time bomb set inside it. The IP is going to order the rest of the machine to execute the wrong instruction, sooner or later. Usually, it's later, and that's the rub. The machine will fail while running a diagnostic program, the debuggers will hook up their analyzers and get pictures of it failing, and nothing they see will account for the failure, because the real problem, the ticking bomb, was set somewhere far back in time, along the winding road of that diagnostic program.
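The time bomb can be imitated in a few lines. This is a sketch under loud assumptions: the function name, the addresses, the instruction names and the single skipped housekeeping step are all invented for illustration and stand in for whatever electronic event actually upset the real machine.

```python
def simulate(housekeeping_ok, loop_iterations=1000):
    """Return the step at which a wrong instruction is issued, or None.

    Step 0 is the moment the Sys Cache changes its contents. If the
    housekeeping fails there, the IP keeps a stale copy of address 7.
    """
    main_memory = {address: f"OP{address}" for address in range(8)}
    sys_cache = dict(main_memory)
    ip_store = {7: main_memory[7]}   # the IP holds its own copy of address 7

    # Step 0: the contents of address 7 change.
    main_memory[7] = "OP7-NEW"
    sys_cache[7] = "OP7-NEW"
    if housekeeping_ok:
        ip_store.pop(7, None)        # correct behavior: discard the old copy

    # The program loops over addresses 0..6 many times before it finally
    # reaches address 7, so the damage stays hidden for a long while.
    trace = list(range(7)) * loop_iterations + [7]
    for step, address in enumerate(trace, start=1):
        issued = ip_store.get(address, sys_cache[address])
        if issued != sys_cache[address]:
            return step              # the bomb goes off here
    return None


if __name__ == "__main__":
    print("failure step with bad housekeeping:", simulate(False))
    print("failure step with good housekeeping:", simulate(True))
```

The error is planted at step 0, but nothing goes wrong until thousands of steps later. A picture of the machine taken at the moment of failure shows only the wrong instruction being issued, not the event that planted it.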
This was the worst sort of problem that they were encountering that spring.
Early one morning in the middle of May, Chief Sergeant Detective of the Hardy Boys Ken Holberger turns his brown Saab down the road toward the red brick fortress of Westborough. It's a cloudy, hazy day. He notices, though, that muted sunlight comes in at the top of his windshield. Back when the debugging began, it was dark when he drove in and dark when he left. The slow advance of the morning sun across his windshield, when he makes the turn into the parking lot, has been one of the principal ways by which he has kept track of outside, planetary time. As for other events in the world at large — wars, famines, rock concerts — he is a little out of touch. When he goes home after a day on a difficult case, and sits down in his armchair and picks up the newspaper, he can't even read. He just stares at the front page.
Most of the group are working too hard now. Around this time, one of the Hardy Boys tells his wife that Data General provides its employees with alimony benefits, as well as medical insurance. And the funny part is, his wife believes him. "We're deep in the debug. Yeah, underground," says Holberger. "Bum-out city," he says.
Holberger wears a trim black beard. In his routine actions, the way he walks down a hall, the shape of his mouth when he gets ready to talk, Holberger seems to embody assertion, as West does, but more smoothly than West. Holberger has the appearance of one to whom all things come easily. You have the feeling that he couldn't look messy if he tried. You'd never see a vinyl penholder in his breast pocket, and you won't find him hanging around the basement after work to play Adventure or discuss machines. He says he doesn't spend any time thinking about what people do with computers. "A sense of the applications is somewhat missing, but it doesn't matter," he says, and he smiles slightly. "We say the ultimate goal is to build a machine to run a multiprogramming reliability test. But I understand that people who buy computers do run other programs on them, like Adventure and Star Trek and things."
Holberger joined the company three years ago and since then has risen from the status of a recruit to a position of importance; he has chief responsibility under Rasala for the details of the hardware of this major CPU. He is the right man for the job; by general consensus, he is the only member of the group with anything like a complete understanding of the new computer's hardware.
Holberger and Rasala are good friends. Rasala seems to feel something of an older brother's admiration and affection toward Holberger. "Holberger's a sharp cat," says Rasala. The faults that Rasala finds in him have nothing to do with ability, but with the fact that Holberger sometimes moves too quickly and makes careless mistakes.
And there is the problem of Holberger's style in the lab, for which Rasala is willing to accept much of the responsibility. Holberger is known as one of the tough guys in the basement. In part, this is the result of long hours and strain. Holberger says that when he can hear his mind buzzing from caffeine he tends to be his most abrupt. He also says he rarely brings frustration home; he gets it out at work, and when he's really angry, he takes it out on Rasala, because he knows that's safe. He doesn't waste time listening to people who aren't making good, relevant sense to him, just in order to be polite. If he's working on a problem with several other engineers and feels that too many are involved, he'll simply ignore what one or two of them have to say and eventually they'll get angry and go elsewhere. In local parlance, he's a "gunslinger," one who "shoots from the hip." If you can't get what you need from some manager at your level in another department, go to his boss — that's the way to get things done. He learned this style from Rasala, who learned it from West. They take the same general attitude toward their work. "It doesn't matter how hard you work on something," says Holberger. "What counts is finishing and having it work."