Where Wizards Stay Up Late
Barker spent months debugging the machine. Ornstein was overseeing the design corrections in the prototype and making sure they got relayed back to Honeywell’s engineers. The next machine Honeywell was scheduled to deliver would be the first ruggedized 516 with all the design bugs worked out: IMP Number One destined for UCLA. “The sweat was to get working designs in time to ship,” said Barker. The summer was upon them.
Heart’s wariness about the hordes of curious graduate students drove BBN to conceptualize even greater measures of protection for the IMPs. In time, among the most creative things Heart’s team did was invent ways of obtaining critical operating data from the network IMPs—reading the machines’ vital signs—unobtrusively from a distance. Heart wanted to be able to sit at a terminal in Cambridge and see what an IMP in Los Angeles, Salt Lake City or any place else was doing. BBN’s full implementation of remote diagnostic tools and debugging capabilities would later become a huge asset. When the network matured, remote control would enable BBN to monitor and maintain the whole system from one operations center, collecting data and diagnosing problems as things changed. Periodically, each IMP would send back to Cambridge a “snapshot of its status,” a set of data about its operating conditions. Changes in the network, minor and major, could be detected. Heart’s group envisioned someday being able to look across the network to know whether any machine was malfunctioning or any line was failing, or perhaps if anyone was meddling with the machines.
Nonetheless, BBN still hadn’t even gotten the prototype IMP to a state in which it could run operational code. And the programming team—Crowther, Walden, and Cosell—was now moving into difficult territory: the design of a flexible, or “dynamic” routing system, allowing alternative routing, so that packets would automatically flow around troubled nodes and links. A fixed-routing scheme would have been straightforward: You would send a packet with clear instructions to travel via x, y, and z points on the map. But if point y was knocked out, all traffic would be held up. And that would thwart one of the advantages of a network with multiple links and nodes.
The original request from ARPA had specified dynamic routing, without offering a clue as to how to make it work. Crowther had found a way to do it. He was building a system of dynamic-routing tables that would be updated every split second or so. These tables would tell the IMPs in which direction to forward each packet that hadn’t yet reached its destination. The routing tables would reflect such network conditions as line failures and congestion, and they would route packets the shortest way possible. Making this design actually work seemed a mind-bending problem—until Crowther came up with a simple, perfect set of code. Crowther “always had his head right down in the bits,” as Ornstein described Crowther’s uncanny, intuitive talent.
Crowther’s dynamic-routing algorithm was a piece of programming poetry. “It was incredibly minimalistic and worked astoundingly well,” Walden observed. Crowther was regarded by his colleagues as being within the top fraction of 1 percent of programmers in the world. On occasion the graceful minimalism of Crowther’s code wasn’t enough to handle the complexity of real-world systems. Other programmers would have to fine-tune what Crowther had created. But his core ideas were more often than not brilliant. “Most of the rest of us made our livings handling the details resulting from Will’s use of his brain,” Walden observed.
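The routing tables described above can be pictured with a short sketch. It is not Crowther's code, which the text does not reproduce; it assumes a generic distance-vector update in which each node merges the distances its neighbors report. The function name `update_table`, the table layout, and the numeric link costs are all hypothetical.

```python
INF = float("inf")

def update_table(table, neighbor, neighbor_table, link_cost):
    """Merge one neighbor's reported distances into this node's routing table.

    table maps destination -> (cost, next_hop). Returns True if anything changed.
    (Illustrative layout only; not taken from the IMP software.)
    """
    changed = False
    for dest, (cost, _) in neighbor_table.items():
        candidate = link_cost + cost
        current_cost, current_hop = table.get(dest, (INF, None))
        # Adopt a cheaper route, or refresh a route that already runs
        # through this neighbor (its cost may have risen after a failure).
        if candidate < current_cost or (
            current_hop == neighbor and candidate != current_cost
        ):
            table[dest] = (candidate, neighbor)
            changed = True
    return changed

# Node A learns about C through B, then reroutes when B loses its line to C.
table_a = {"A": (0, "A")}
update_table(table_a, "B", {"B": (0, "B"), "C": (1, "C")}, link_cost=1)
update_table(table_a, "B", {"B": (0, "B"), "C": (INF, None)}, link_cost=1)
update_table(table_a, "C", {"C": (0, "C")}, link_cost=5)
```

After the last call, traffic from A to C goes over the slower direct line instead of through B, which is the kind of automatic detour around a failed link that the tables were meant to provide.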
Flow control was another programming challenge. But when Kahn looked at Crowther’s code and saw how he had implemented control of the flow of packets from one side of the network to the other, he was worried. Messages between hosts were to be transmitted by the subnet over logical “links.” The subnet would accept one message at a time on a given link. Messages waiting to enter the subnet were stored in a memory buffer (a waiting area inside the machine), in a queue. The next message wasn’t sent until the sending IMP received an acknowledgment (similar to a return receipt) saying that the previous message had arrived error-free at the destination host. When one link was cleared and a new message could be sent, the subnet would notify the sending IMP by means of a special control signal that the BBN engineers called Ready for Next Message, or RFNM (pronounced RUFF-num). Messages in the sending host’s buffers, waiting for links to clear, were like patrons in a restaurant waiting for tables; RFNMs were the equivalent of the maître d’s announcing “Your table is ready.”
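The one-message-per-link rule can be sketched in a few lines. This is a simplified model under stated assumptions, not the IMP code: the class `Link` and its method names are hypothetical, and the acknowledgment is reduced to a single `rfnm()` call.

```python
from collections import deque

class Link:
    """One logical link: a single message in flight until its RFNM returns."""

    def __init__(self):
        self.queue = deque()    # messages waiting in the buffer
        self.in_flight = None   # message sent but not yet acknowledged
        self.delivered = []     # messages confirmed at the destination host

    def submit(self, message):
        """Host hands over a message; it is sent only if the link is clear."""
        self.queue.append(message)
        self._try_send()

    def _try_send(self):
        if self.in_flight is None and self.queue:
            self.in_flight = self.queue.popleft()

    def rfnm(self):
        """Ready for Next Message: the link is clear, so send the next one."""
        if self.in_flight is None:
            return
        self.delivered.append(self.in_flight)
        self.in_flight = None
        self._try_send()

link = Link()
for message in ["A", "B", "C"]:
    link.submit(message)
# "A" is in flight; "B" and "C" wait in the queue until an RFNM arrives.
link.rfnm()
# "A" is now delivered and "B" has taken its place on the link.
```

Because `B` cannot leave until the RFNM for `A` comes back, throughput on a single link is bounded by the round-trip time, which is the cost to the host that the next paragraph describes.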
This meant it was impossible to send a continuous stream of messages over any single link through the system from one host to another. RFNM was a congestion-control strategy designed to protect the IMPs from overload, but at the expense of reducing service to the host. Kahn had studied the problem of congestion control even before the ARPA network project began. His reaction to Crowther’s solution was that the links and RFNMs would still allow fatal congestion of the network’s arteries. The IMP’s buffers would fill up, he said. You’d have incomplete messages sitting in the receiving IMPs waiting for their final packets to arrive so the entire message could be reassembled, and there would be no room for the packets to arrive.
Kahn’s reassembly lockup scenario has an analog in the shipping business. Say that a Toyota dealership in Sacramento orders sets of replacement engine blocks and pistons from a warehouse in Yokohama. Both items together are essential for the jobs the dealer has at hand. In the Yokohama harbor, freighters are loading large containers, all the same size, filled with products of many kinds. The engine blocks and pistons wind up in separate ships. When the container of engine blocks arrives in San Francisco, it is unloaded to a warehouse of containers whose contents are also partial shipments awaiting the arrival of other parts before being sent onward: components for television sets, pin blocks for pianos, and so on. When the freighter with the pistons arrives, it finds the warehouse full. Every later ship has the same problem: Nobody can unload; nothing can leave the warehouse. Deadlock. Solution? The Yokohama shipper agrees to call ahead next time to reserve space for all containers that go together. If space is unavailable, he waits until it becomes available before shipping.
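The warehouse analogy translates into a small simulation. The sketch below is a toy model, not BBN's software: the function `deliver`, the buffer capacity, and the message sizes are hypothetical, and a packet that finds no room is simply retried on the next pass.

```python
def deliver(arrivals, sizes, capacity, reserve):
    """Feed packets to a receiving IMP that has `capacity` buffer slots.

    arrivals lists message ids, one entry per packet, in arrival order.
    sizes maps message id -> packets per message.
    Returns (completed message ids, packets stuck with no room).
    """
    held, reserved = {}, {}   # packets buffered / slots set aside, per message
    completed = []
    pending = list(arrivals)
    while pending:
        waiting, progress = [], False
        for msg in pending:
            if reserve:
                if msg not in reserved:
                    # Call ahead: accept a message only if all of it will fit.
                    if sum(reserved.values()) + sizes[msg] > capacity:
                        waiting.append(msg)
                        continue
                    reserved[msg] = sizes[msg]
            elif sum(held.values()) >= capacity:
                waiting.append(msg)   # warehouse full, packet cannot unload
                continue
            held[msg] = held.get(msg, 0) + 1
            progress = True
            if held[msg] == sizes[msg]:   # message whole: pass on, free slots
                completed.append(msg)
                del held[msg]
                reserved.pop(msg, None)
        if not progress:
            return completed, waiting     # lockup: nothing can move
        pending = waiting
    return completed, []

sizes = {"X": 3, "Y": 3}
packets = ["X", "Y", "X", "Y", "X", "Y"]   # two messages interleaved
locked = deliver(packets, sizes, capacity=4, reserve=False)  # ([], ["X", "Y"])
clear = deliver(packets, sizes, capacity=4, reserve=True)    # (["X", "Y"], [])
```

Without reservation, the four slots fill with two half-finished messages and neither final packet can land. With reservation, message Y waits at the sender until X has been reassembled and its slots are released.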
Kahn also predicted another type of deadlock that might lead to loss of packets. He said that it would occur in heavy traffic within the subnet when the buffers of one IMP were filled with packets routed to another, and vice versa. A kind of standoff would result, in which neither one would be able to accept the other’s packets. The way the routing software had been written the IMPs would discard the packets.
Kahn and Crowther debated the gridlock question at length. Over points such as these Kahn’s conceptual views came into open conflict with the pragmatic bent of the rest of the IMP Guys and cracked open a wide disagreement between them. The rest of the team just wanted to get the network up and running on schedule. As the network grew they’d have time to improve its performance, work out problems, and perfect the algorithms.
But Kahn persisted. “I could see things that to me were obvious flaws,” he said. “The most obvious one was that the network could deadlock.” Kahn was certain the network would lock up, and he told Heart and the others so immediately. They argued with him. “Bob was interested in the theory of things and the math, but he wasn’t really interested in the implementation,” said Crowther. Crowther and Kahn began to talk it through, and the two had what Crowther described as “grand little fights.” The flow-control scheme wasn’t designed for a huge network, and with a small number of nodes Crowther thought they could get by with it.
Heart thought that Kahn was worrying too much about hypothetical, unlikely network conditions. His approach was to wait and see. Some of the others thought Kahn didn’t understand many of the problems with which they were grappling. “Some of the things he was suggesting were off-the-wall, just wrong,” said Ornstein. Kahn wanted to watch simulations of network traffic on a screen. He wanted to have a program that would show packets moving through the network. In fact, the packets would never move at a humanly observable speed; they’d be going “zip-zip” in microseconds and milliseconds. “We said, ‘Bob, you’ll never understand the problems looking at it that way.’” The other IMP Guys respected Kahn, but some believed he was going in the wrong direction. Gradually, they paid less attention to him. “Most of us in the group were trying to get Kahn out of our hair,” Ornstein said.
Heart scotched Kahn’s suggestion that they use a simulation. Heart hated to see his programming team spend time on simulations or on writing anything but operational code. They were already becoming distracted by something else he disliked—building software tools. Heart feared delay. Over the years he had seen too many programmers captivated by tool building, and he had spent a career holding young engineers back from doing things that might waste money or time. The people in Heart’s division knew that if they asked him for the okay to clock hours writing editors, assemblers, and debuggers, they would meet with stiff resistance, perhaps even a shouting match. So no one ever asked; they just did it, building tools when they thought it was the right thing to do, regardless of what Heart thought. This was software they would eventually need when the time came to test the system. All were customized pieces of programming, specifically designed for the ARPA project.
As summer peaked, a troubling problem loomed: BBN was still awaiting Honeywell’s delivery of the first production IMP, with all the debugged interfaces built to BBN’s specifications. The programming team had given up waiting and gone ahead with its work by loading a lower-grade development machine with a simulation program the team had designed to mimic the operations of the production model IMP and its I/O interfaces. Still, testing the software on the real machine was the preferred approach. And whenever the machine came in, Barker would first need time to debug it. The time left was dwindling. By late summer, the machine still hadn’t crossed BBN’s loading dock. Scheduled delivery of the IMP to California was now only a few weeks away, and BBN’s own reputation was on the line.
The Bug
Finally, about two weeks before Labor Day, Honeywell rushed the first ruggedized 516 IMP out its shop door and over to Cambridge. As soon as the machine touched the floor at BBN, Barker was ready to work on it. He powered up IMP Number One in the backroom.
Barker loaded the IMP diagnostic code. When he tried running it, nothing happened. The machine didn’t respond. On closer inspection, it was apparent that the machine BBN received was not what it had ordered. This 516 had few of the modifications that Barker and Ornstein had worked out painstakingly in debugging the prototype; in fact, it was wired just like the original dysfunctional prototype had been wired. With the deadline closing in, Barker had only one recourse: fix it at BBN. This time at least he already knew where every wire should go. With the machine sitting in the middle of the large room, Barker went to work implementing all of the design modifications necessary to make it a functioning IMP.
Within a few days, Barker had coaxed the machine to life. He managed to activate the IMP’s interfaces—whereupon the computer began crashing at random intervals. The randomness of the crashes was unusually bad. Intermittent problems of this sort were the devil. The IMP would run for anywhere from twelve hours to forty hours at a stretch, then die and be somewhere “off in the boonies.” What to do? Recalled Ornstein, “We couldn’t figure out what the hell was going on.”
As Labor Day approached, they pressured the IMP, putting it through as many hard tests as possible. It might run fine for twenty-four hours, then inexplicably die. Barker would look for a clue, chase what appeared to be a problem, fix it, and still the machine would crash again. With only a few days left before the delivery deadline, it looked like they were not going to make it.
Barker, who had been nursing the computer, suspected the problem was in the machine’s timing chain. It was just a hunch.
The IMP had a clock used by the operating system to keep time in the machine, not as humans would by marking seconds, minutes or hours, but counting time in 1-microsecond (one million ticks per second) increments—fast for its day but a hundred times slower than today’s personal computers. This clock provided a framework in which the IMP operated, and it regulated the computer’s many functions synchronously. In a communications system, messages arrive unannounced; signals interrupt the machine asynchronously. Like a telephone call in the middle of dinner, an incoming packet shows up on its own schedule at the IMP’s door and says, “Take me now.”
The computer had a sophisticated system for handling the incoming interruptions in a methodical manner, so as not to upset the synchronous operation of all its functions. If not properly designed, such synchronizers can be thrown off by an incoming signal occurring at just the wrong moment. Synchronizer bugs are rare. But when they occur and the synchronizer fails to respond properly to an interrupt, the consequences are profoundly disturbing to the machine’s total operation. One might call it a nervous breakdown; computer scientists have another term for it: The synchronizer goes into a “metastable” condition. “Under such circumstances,” Ornstein said, “the machine invariably dies in a hopelessly confused state—different each time.”
Ornstein knew all too well about synchronizer bugs. He had dealt with the problem in the computer he and Wes Clark had built a few years earlier in St. Louis. Ornstein was the author of some of the first published papers on the subject, and was one of the few people in the world who actually had any experience with this particular gremlin.
Their unpredictability made synchronizer bugs among the most frustrating of bugs because of the absence of any recognizable pattern to the resulting crashes. Unlike most other problems that could cause computers to crash, a synchronizer bug left behind virtually no useful forensic evidence that might point a diagnostician to the problem. In fact, the absence of clues was one of the most useful clues. Furthermore, the failures caused by this bug were so infrequent (only once every day or so even in full-bore tests), that it was impossible to detect any evidence on an oscilloscope. Only the most astute debuggers had any idea what they were dealing with.
This seemed to be the problem Ornstein and Barker had on their hands. But who knew, because you couldn’t actually trace it. What to do now? The Honeywell 516 had never been used in an application as demanding as the packet-switching network. It was a fast machine; the IMP Guys had chosen it precisely for its I/O capabilities. No one else was likely ever to see the problem in a typical application of the 516 computer. “If their machine died once a year,” Ornstein said, “they’d never notice. They would just restart.” But the IMP Guys were driving the machine hard. The flow of packets into and out of the IMP happened faster than the Honeywell designers had anticipated. The 516 machine didn’t seem capable of handling such traffic. Maybe BBN had been overly optimistic. Ornstein and Barker went to Honeywell and insisted that the manufacturer “dig out of the woodwork, way in the backroom” the designer of the 516 computer. He was a very smart guy, Ornstein had to admit, but at first the Honeywell man refused to admit that a metastable state was possible in the machine. He had never read Ornstein’s papers, and had never seen the problem. “Though filled with disbelief,” said Ornstein, he “at least understood what we were saying.”
Under normal conditions, the 516 would run for years without experiencing the synchronizer problem. However, under ARPA’s packet-switching network conditions, the machine was failing once every day or so. Try telling Frank Heart, Mr. Reliability, that he’d just have to live with that.
Ornstein and Barker huddled. It was only a guess that the IMP had a synchronizer problem. To test the hypothesis, Ornstein designed and wired an “aggravator” that deliberately produced data requests at what Barker called a “fierce rate.” It increased the probability of getting interruptions at the exact nanosecond that would reveal the problem. The aggravator had a knob that worked like a tuner. Using the knob, Ornstein and Barker could “tune” the timing of requests to bring in a signal perfectly out of kilter with the clock, the worst case. Then, using an oscilloscope, they observed the machine’s “heartbeat” and other internal functions.
The debugging crew went to work. The patterns they were looking for on the oscilloscope would be so faint as to be visible only in a darkened room. So with all the lights out in the IMP room and with all their diagnostic equipment and the Honeywell turned on, they watched, while fooling with the aggravator. The traces they saw on the scope were bright, regularly positioned, and steadily paced—the vital signs of a healthy machine.
Even with the aggravator, it took the debugging team quite a while to find what it was looking for. Still, every few minutes a very faint ghost trace flitted across the oscilloscope. Was that it? The fleeting trace was perhaps the only telltale sign that the crashes were caused by a timing problem: a synchronizer stalled in a metastable condition for a few nanoseconds too long. It was the computer equivalent of the one split second of confusion or indecision by a race car driver that ends suddenly in a fatal crash. The evidence seemed fairly incontrovertible, and Honeywell finally acknowledged it.
In the meantime, Barker designed a possible fix, and rewired the IMP’s central timing chain. When Barker brought the machine back up, loaded in his diagnostic code, and looked in the scope, the ghost traces were gone.
While Barker and Ornstein were reasonably certain that the problem was fixed, they had no way of knowing for sure unless the machine ran for a few consecutive days without crashing. And they didn’t have a few days. Heart had already approved shipping the first Interface Message Processor to California the next day. IMP Number One was almost out the door.
5
Do It to It Truett
Steve Crocker and Vint Cerf had been best friends since attending Van Nuys High School in L.A.’s San Fernando Valley. They shared a love for science, and the two spent more than a few Saturday nights building three-dimensional chess games or trying to re-create Edwin Land’s experiments with color perception.