'What Do You Care What Other People Think?'


by Richard P Feynman

Space Shuttle Main Engines (SSME)

During the flight of 51-L the three space shuttle main engines all worked perfectly, even beginning to shut down in the last moments as the fuel supply began to fail. The question arises, however, as to whether, had the engines failed and had we investigated them in as much detail as we did the solid rocket booster, we would find a similar lack of attention to faults and deteriorating safety criteria. In other words, were the organizational weaknesses that contributed to the accident confined to the solid rocket booster sector, or were they a more general characteristic of NASA? To that end the space shuttle main engines and the avionics were both investigated. No similar study of the orbiter or the external tank was made.

The engine is a much more complicated structure than the solid rocket booster, and a great deal more detailed engineering goes into it. Generally, the engineering seems to be of high quality, and apparently considerable attention is paid to deficiencies and faults found in engine operation.

The usual way that such engines are designed (for military or civilian aircraft) may be called the component system, or bottom-up design. First it is necessary to thoroughly understand the properties and limitations of the materials to be used (turbine blades, for example), and tests are begun in experimental rigs to determine those. With this knowledge, larger component parts (such as bearings) are designed and tested individually. As deficiencies and design errors are noted they are corrected and verified with further testing. Since one tests only parts at a time, these tests and modifications are not overly expensive. Finally one works up to the final design of the entire engine, to the necessary specifications. There is a good chance, by this time, that the engine will generally succeed, or that any failures are easily isolated and analyzed because the failure modes, limitations of materials, et cetera, are so well understood. There is a very good chance that the modifications to get around final difficulties in the engine are not very hard to make, for most of the serious problems have already been discovered and dealt with in the earlier, less expensive stages of the process.

The space shuttle main engine was handled in a different manner—top down, we might say. The engine was designed and put together all at once with relatively little detailed preliminary study of the materials and components. But now, when troubles are found in bearings, turbine blades, coolant pipes, et cetera, it is more expensive and difficult to discover the causes and make changes. For example, cracks have been found in the turbine blades of the high-pressure oxygen turbopump. Are they caused by flaws in the material, the effect of the oxygen atmosphere on the properties of the material, the thermal stresses of startup or shutdown, the vibration and stresses of steady running, or mainly at some resonance at certain speeds, or something else? How long can we run from crack initiation to crack failure, and how does this depend on power level? Using the completed engine as a test bed to resolve such questions is extremely expensive. One does not wish to lose entire engines in order to find out where and how failure occurs. Yet, an accurate knowledge of this information is essential to acquiring a confidence in the engine reliability in use. Without detailed understanding, confidence cannot be attained.

A further disadvantage of the top-down method is that if an understanding of a fault is obtained, a simple fix—such as a new shape for the turbine housing—may be impossible to implement without a redesign of the entire engine.

The space shuttle main engine is a very remarkable machine. It has a greater ratio of thrust to weight than any previous engine. It is built at the edge of—sometimes outside of—previous engineering experience. Therefore, as expected, many different kinds of flaws and difficulties have turned up. Because, unfortunately, it was built in a top-down manner, the flaws are difficult to find and to fix. The design aim of an engine lifetime of 55 mission equivalents (27,000 seconds of operation, either in missions of 500 seconds each or on a test stand) has not been obtained. The engine now requires very frequent maintenance and replacement of important parts such as turbopumps, bearings, sheet metal housings, et cetera. The high-pressure fuel turbopump had to be replaced every three or four mission equivalents (although this may have been fixed, now) and the high-pressure oxygen turbopump every five or six. This was, at most, 10 percent of the original design specifications. But our main concern here is the determination of reliability.

In a total of 250,000 seconds of operation, the main engines have failed seriously perhaps 16 times. Engineers pay close attention to these failings and try to remedy them as quickly as possible by test studies on special rigs experimentally designed for the flaw in question, by careful inspection of the engine for suggestive clues (like cracks), and by considerable study and analysis. In this way, in spite of the difficulties of top-down design, through hard work many of the problems have apparently been solved.

A list of some of the problems (and their status) follows:

Turbine blade cracks in high-pressure fuel turbopumps (HPFTP). (May have been solved.)

Turbine blade cracks in high-pressure oxygen turbopumps (HPOTP). (Not solved.)

Augmented spark igniter (ASI) line rupture. (Probably solved.)

Purge check valve failure. (Probably solved.)

ASI chamber erosion. (Probably solved.)

HPFTP turbine sheet metal cracking. (Probably solved.)

HPFTP coolant liner failure. (Probably solved.)

Main combustion chamber outlet elbow failure. (Probably solved.)

Main combustion chamber inlet elbow weld offset. (Probably solved.)

HPOTP subsynchronous whirl. (Probably solved.)

Flight acceleration safety cutoff system (partial failure in a redundant system). (Probably solved.)

Bearing spalling. (Partially solved.)

A vibration at 4000 hertz making some engines inoperable. (Not solved.)

Many of these apparently solved problems were the early difficulties of a new design: 13 of them occurred in the first 125,000 seconds and only 3 in the second 125,000 seconds. Naturally, one can never be sure that all the bugs are out; for some, the fix may not have addressed the true cause. Thus it is not unreasonable to guess there may be at least one surprise in the next 250,000 seconds, a probability of 1/500 per engine per mission. On a mission there are three engines, but it is possible that some accidents would be self-contained and affect only one engine. (The shuttle can abort its mission with only two engines.) Therefore, let us say that the unknown surprises do not, in and of themselves, permit us to guess that the probability of mission failure due to the space shuttle main engines is less than 1/500. To this we must add the chance of failure from known, but as yet unsolved, problems. These we discuss below.
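The arithmetic behind the 1/500 figure can be sketched as follows. This is a minimal illustration, assuming the 500-second mission length quoted earlier in this report; the function and variable names are ours and purely illustrative.

```python
from fractions import Fraction

# Assumption: one mission is 500 seconds of operation per engine,
# the figure quoted earlier for a "mission equivalent".
MISSION_SECONDS = 500


def failures_per_engine_mission(failures: int, window_seconds: int) -> Fraction:
    """Convert a failure count over an operating window into a
    rate per engine per mission."""
    engine_missions = Fraction(window_seconds, MISSION_SECONDS)
    return Fraction(failures) / engine_missions


# One guessed surprise in the next 250,000 seconds of operation.
surprise_rate = failures_per_engine_mission(1, 250_000)

# Observed history: 13 failures in the first 125,000 seconds, 3 in the second.
early_rate = failures_per_engine_mission(13, 125_000)
late_rate = failures_per_engine_mission(3, 125_000)

print(surprise_rate, early_rate, late_rate)
```

Since 250,000 seconds is 500 mission equivalents for one engine, a single surprise in that span gives the 1/500 per engine per mission quoted above.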

(Engineers at Rocketdyne, the manufacturer, estimate the total probability as 1/10,000. Engineers at Marshall estimate it as 1/300, while NASA management, to whom these engineers report, claims it is 1/100,000. An independent engineer consulting for NASA thought 1 or 2 per 100 a reasonable estimate.)

The history of the certification principles for these engines is confusing and difficult to explain. Initially the rule seems to have been that two sample engines must each have had twice the time operating without failure, as the operating time of the engine to be certified (rule of 2x). At least that is the FAA practice, and NASA seems to have adopted it originally, expecting the certified time to be 10 missions (hence 20 missions for each sample). Obviously, the best engines to use for comparison would be those of greatest total operating time (flight plus test), the so-called fleet leaders. But what if a third sample engine and several others fail in a short time? Surely we will not be safe because two were unusual in lasting longer. The short time might be more representative of the real possibilities, and in the spirit of the safety factor of 2, we should only operate at half the time of the short-lived samples.

The slow shift toward a decreasing safety factor can be seen in many examples. We take that of the HPFTP turbine blades. First of all the idea of testing an entire engine was abandoned. Each engine has had many important parts (such as the turbopumps themselves) replaced at frequent intervals, so the rule of 2x must be shifted from engines to components. Thus we accept an HPFTP for a given certification time if two samples have each run successfully for twice that time (and, of course, as a practical matter, no longer insisting that this time be as long as 10 missions). But what is “successfully”? The FAA calls a turbine blade crack a failure, in order to really provide a safety factor greater than 2 in practice. There is some time that an engine can run between the time a crack originally starts and the time it has grown large enough to fracture. (The FAA is contemplating new rules that take this extra safety time into account, but will accept them only if it is very carefully analyzed through known models within a known range of experience and with materials thoroughly tested. None of these conditions applies to the space shuttle main engines.)

Cracks were found in many second-stage HPFTP turbine blades. In one case three were found after 1900 seconds, while in another they were not found after 4200 seconds, although usually these longer runs showed cracks. To follow this story further we must realize that the stress depends a great deal on the power level. The Challenger flight, as well as previous flights, was at a level called 104 percent of rated power during most of the time the engines were operating. Judging from some material data, it is supposed that at 104 percent of rated power, the time to crack is about twice that at 109 percent, or full power level (FPL). Future flights were to be at 109 percent because of heavier payloads, and many tests were made at this level. Therefore, dividing time at 104 percent by 2, we obtain units called equivalent full power level (EFPL). (Obviously, some uncertainty is introduced by that, but it has not been studied.) The earliest cracks mentioned above occurred at 1375 seconds EFPL.
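The EFPL conversion described above can be written out as a short sketch. It assumes only the stated rule that time at 104 percent counts half as much as time at 109 percent. The function name and the particular split of running time in the example are hypothetical, chosen for illustration and not taken from the test records.

```python
def efpl_seconds(seconds_at_104: float, seconds_at_fpl: float) -> float:
    """Equivalent full power level (EFPL) seconds.

    Time at 109 percent (FPL) counts in full; time at 104 percent
    is divided by 2, following the rule stated in the text.
    """
    return seconds_at_fpl + seconds_at_104 / 2.0


# Hypothetical split: a 1900-second history made up of 1050 seconds
# at 104 percent and 850 seconds at 109 percent.
example = efpl_seconds(seconds_at_104=1050, seconds_at_fpl=850)
print(example)
```

A 1900-second history with that split would come to 1375 seconds EFPL. Other splits give other values, which is one reason the uncertainty in the factor of 2 matters.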

Now the certification rule becomes “limit all second-stage blades to a maximum of 1375 seconds EFPL.” If one objects that the safety factor of 2 is lost, it is pointed out that the one turbine ran for 3800 seconds EFPL without cracks, and half of this is 1900 so we are being more conservative. We have fooled ourselves in three ways. First, we have only one sample, and it is not the fleet leader: the other two samples of 3800 or more seconds EFPL had 17 cracked blades between them. (There are 59 blades in the engine.) Next, we have abandoned the 2x rule and substituted equal time (1375). And finally, the 1375 is where a crack was discovered. We can say that no crack had been found below 1375, but the last time we looked and saw no cracks was 1100 seconds EFPL. We do not know when the crack formed between these times. For example, cracks may have been formed at 1150 seconds EFPL. (Approximately two-thirds of the blade sets tested in excess of 1375 seconds EFPL had cracks. Some recent experiments have, indeed, shown cracks as early as 1150 seconds.) It was important to keep the number high, for the shuttle had to fly its engines very close to their limit by the time the flight was over.
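To make the comparison concrete, here is a small sketch of what the original rule of 2x would give for each of the figures above. It reads the rule as "operate for no more than half the time of the sample in question", which is our interpretation of the safety factor of 2. The names are illustrative only.

```python
def certified_limit(sample_seconds_efpl: float, safety_factor: float = 2.0) -> float:
    """Operating limit allowed by a sample's running time under a
    given safety factor (our reading of the rule of 2x)."""
    return sample_seconds_efpl / safety_factor


ADOPTED_LIMIT = 1375.0  # the limit actually written into the rule

# The argument offered in defense: one turbine ran 3800 seconds without cracks.
limit_claimed = certified_limit(3800.0)

# Applying 2x to the time at which a crack was actually discovered.
limit_from_crack_found = certified_limit(1375.0)

# Applying 2x to the last inspection known to be free of cracks.
limit_from_last_clean = certified_limit(1100.0)

print(limit_claimed, limit_from_crack_found, limit_from_last_clean)
```

On this reading the adopted limit of 1375 seconds is twice what the 2x rule would allow from the crack discovery time (687.5), and two and a half times what it would allow from the last clean inspection (550).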

Finally, it is claimed that the criteria have not been abandoned, and that the system is safe, by giving up the FAA convention that there should be no cracks, and by considering only a completely fractured blade a failure. With this definition no engine has yet failed. The idea is that since there is sufficient time for a crack to grow to fracture, we can ensure that all is safe by inspecting all blades for cracks. If cracks are found, replace the blades, and if none are found, we have enough time for a safe mission. Thus, it is claimed, the crack problem is no longer a flight safety problem, but merely a maintenance problem.

This may in fact be true. But how well do we know that cracks always grow slowly enough so that no fracture can occur in a mission? Three engines have run for long time periods with a few cracked blades (about 3000 seconds EFPL), with no blade actually breaking off.

A fix for this cracking may have been found. By changing the blade shape, shot-peening the surface, and covering it with insulation to exclude thermal shock, the new blades have not cracked so far.

A similar story appears in the history of certification of the HPOTP, but we shall not give the details here.

In summary, it is evident that the flight readiness reviews and certification rules show a deterioration in regard to some of the problems of the space shuttle main engines that is closely analogous to the deterioration seen in the rules for the solid rocket boosters.

Avionics

By “avionics” is meant the computer system on the orbiter as well as its input sensors and output actuators. At first we will restrict ourselves to the computers proper, and not be concerned with the reliability of the input information from the sensors of temperature, pressure, et cetera; nor with whether the computer output is faithfully followed by the actuators of rocket firings, mechanical controls, displays to astronauts, et cetera.

The computing system is very elaborate, having over 250,000 lines of code. Among many other things it is responsible for the automatic control of the shuttle’s entire ascent into orbit, and for the descent until the shuttle is well into the atmosphere (below Mach 1), once one button is pushed deciding the landing site desired. It would be possible to make the entire landing automatic. (The landing gear lowering signal is expressly left out of computer control, and must be provided by the pilot, ostensibly for safety reasons.) During orbital flight the computing system is used in the control of payloads, in the display of information to the astronauts, and in the exchange of information with the ground. It is evident that the safety of flight requires guaranteed accuracy of this elaborate system of computer hardware and software.

In brief, hardware reliability is ensured by having four essentially independent identical computer systems. Where possible, each sensor also has multiple copies—usually four—and each copy feeds all four of the computer lines. If the inputs from the sensors disagree, either a certain average or a majority selection is used as the effective input, depending on the circumstances. Since each computer sees all copies of the sensors, the inputs are the same, and because the algorithms used by each of the four computers are the same, the results in each computer should be identical at each step. From time to time they are compared, but because they might operate at slightly different speeds, a system of stopping and waiting at specified times is instituted before each comparison is made. If one of the computers disagrees or is too late in having its answer ready, the three which do agree are assumed to be correct and the errant computer is taken completely out of the system. If, now, another computer fails, as judged by the agreement of the other two, it is taken out of the system, and the rest of the flight is canceled: descent to the landing site is instituted, controlled by the two remaining computers. It is seen that this is a redundant system since the failure of only one computer does not affect the mission. Finally, as an extra feature of safety, there is a fifth independent computer, whose memory is loaded with only the programs for ascent and descent, and which is capable of controlling the descent if there is a failure of more than two of the computers of the main line of four.
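The comparison scheme just described can be illustrated with a simplified sketch. This is not the flight software: the computer labels, the status strings, and the handling of a split with no strict majority (handed to the fifth computer here) are all assumptions made for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Optional, Set, Tuple


def compare_step(
    active: Set[str], outputs: Dict[str, Optional[Hashable]]
) -> Tuple[Set[str], str]:
    """One synchronized comparison among the active mainline computers.

    outputs maps a computer label to its answer, or None if the answer
    was not ready in time. Returns the computers kept in the system and
    a status: "continue", "abort_descent", or "backup".
    """
    answers = {c: outputs.get(c) for c in active}
    on_time = [a for a in answers.values() if a is not None]
    if not on_time:
        return set(), "backup"

    consensus, votes = Counter(on_time).most_common(1)[0]
    # Assumption: without a strict majority there is no trusted answer,
    # so control passes to the fifth, independent computer.
    if votes * 2 <= len(active):
        return set(), "backup"

    # Computers that disagree or are late are taken out of the system.
    survivors = {c for c, a in answers.items() if a == consensus}
    if len(survivors) >= 3:
        return survivors, "continue"
    if len(survivors) == 2:
        return survivors, "abort_descent"
    return survivors, "backup"


mainline = {"A", "B", "C", "D"}
kept, status = compare_step(mainline, {"A": 42, "B": 42, "C": 42, "D": 41})
print(sorted(kept), status)
```

In the usage shown, computer D disagrees with the other three and is removed, and the mission continues on three computers. A second removal leaves two computers and the status becomes a canceled flight with descent under their control.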

There is not enough room in the memory of the mainline computers for all the ascent, descent, and in-flight payload programs, so the memory is loaded by the astronauts about four times from tapes.

Because of the enormous effort required to replace the software for such an elaborate system and to check out a new system, no change in the hardware has been made since the shuttle transportation system began about fifteen years ago. The actual hardware is obsolete—for example, the memories are of the old ferrite-core type. It is becoming more difficult to find manufacturers to supply such old-fashioned computers that are reliable and of high enough quality. Modern computers are much more reliable, and they run much faster. This simplifies circuits and allows more to be done. Today’s computers would not require so much loading from tapes, for their memories are much larger.

The software is checked very carefully in a bottom-up fashion. First, each new line of code is checked; then sections of code (modules) with special functions are verified. The scope is increased step by step until the new changes are incorporated into a complete system and checked. This complete output is considered the final product, newly released. But working completely independently is a verification group that takes an adversary attitude to the software development group and tests the software as if it were a customer of the delivered product. There is additional verification in using the new programs in simulators, et cetera. An error during this stage of verification testing is considered very serious, and its origin is studied very carefully to avoid such mistakes in the future. Such unexpected errors have been found only about six times in all the programming and program changing (for new or altered payloads) that has been done. The principle followed is: all this verification is not an aspect of program safety; it is a test of that safety in a noncatastrophic verification. Flight safety is to be judged solely on how well the programs do in the verification tests. A failure here generates considerable concern.

To summarize, then, the computer software checking system is of highest quality. There appears to be no process of gradually fooling oneself while degrading standards, the process so characteristic of the solid rocket booster and space shuttle main engine safety systems. To be sure, there have been recent suggestions by management to curtail such elaborate and expensive tests as being unnecessary at this late date in shuttle history. Such suggestions must be resisted, for they do not appreciate the mutual subtle influences and sources of error generated by even small program changes in one part of a program on another. There are perpetual requests for program changes as new payloads and new demands and modifications are suggested by the users. Changes are expensive because they require extensive testing. The proper way to save money is to curtail the number of requested changes, not the quality of testing for each.

One might add that the elaborate system could be very much improved by modern hardware and programming techniques. Any outside competition would have all the advantages of starting over. Whether modern hardware is a good idea for NASA should be carefully considered now.
