
Human Error


by James Reason


  5.5. Ecological interface design

  This recent development is a product of the extremely influential Risoe National Laboratory in Denmark (Rasmussen & Vicente, 1987; Vicente & Rasmussen, 1987). Its goal is to produce meaningful representations of plant processes that simultaneously support the skill-based, rule-based and knowledge-based levels of operator performance.

  In order to identify those areas where design improvements are necessary, Rasmussen and Vicente focus upon four categories of error: (a) errors related to learning and adaptation, (b) interference among competing cognitive control structures, (c) lack of resources and (d) intrinsic human variability. These are also distinguished at the skill-based, rule-based and knowledge-based levels of performance.

  Ten guidelines for improved system design are presented. As indicated earlier, the aim of these guidelines is not to remove error, but to increase the system’s error tolerance by giving operators more cognitively natural means of limiting the effects of errors upon system performance. The guidelines are summarised below:

  (a) Designers should accept that ‘experiments’ are necessary to optimise the sensorimotor skills of system users. Interface design should aim at making the boundaries of acceptable performance visible to the users while their effects are still observable and reversible.

  (b) In general, the above guideline is only possible for direct dynamic interaction at the skill-based level. At the rule-based level, error observability is more difficult to achieve because (i) the effects of an error may be delayed and (ii) they may be rendered invisible by the defence-in-depth philosophy. Thus the designer should provide feedback to support functional understanding and knowledge-based monitoring during rule-based performance. It is also necessary to make visible latent constraints upon action.

  (c) In addition, for performance at the rule-based level, a display should represent cues for actions not only as readily interpretable signs, but also as indications of the preconditions for their validity. In other words, these signs should also have a symbolic content (a brief sketch of this idea follows the guidelines).

  (d) To assist operators in coping with unforeseen situations (by definition at the knowledge-based level), designers should provide them with tools to make experiments and test hypotheses without having to do these things directly upon a high-risk and potentially irreversible plant. The alternative is to make the system states always reversible.

  (e) To minimise the likelihood of attentional capture, designers should provide users with overview displays by which ‘free-running’ skill-based routines can be monitored on the fringes of consciousness.

  (f) At the rule-based level, steps should be taken to reduce the possibility of falling into procedural traps (i.e., the activation of ‘strong-but-wrong’ rules). This can be done by giving integrated patterns as cues for action. These can also provide some degree of symbolic representation necessary for the functional monitoring of performance.

  (g) At the knowledge-based level, reduce the chances of interference between possible competing ‘mental models’ by supporting memory with some externalised schematic of these alternatives.

  (h) To aid recovery from errors due to lack of resources, use the available data to present information that is simultaneously suitable for skill-based, rule-based and knowledge-based processing.

  (i) Causal reasoning in a complex functional network places excessive demands upon limited working memory resources. Information should be embedded in a structure that can serve as an externalised mental model. This representation should not aim at identifying a specific problem solution, but should aim at indicating an effective strategy (i.e., a category of possible solutions).

  (j) Provide the user with external memory aids to support the retention of items, acts and data that are not part of the current operational ‘gestalt’.

  Guidelines (a) to (d) are directed at errors associated with the learning process. Guidelines (e) to (g) are concerned with mitigating the effects of errors which arise from interference among cognitive control structures. Guidelines (h) and (i) are concerned with compensating for a lack of available cognitive resources. The final guideline seeks to minimise the effects of stochastic errors.
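  To make guideline (c) more concrete, the sketch below is one possible way of representing a display cue that carries both a readily interpretable sign and the symbolic preconditions for its validity, so that a violated precondition is surfaced alongside the prompt rather than left as a latent constraint. It is not taken from Rasmussen and Vicente; the pump names, state variables and the 300 kPa threshold are invented purely for illustration.

    from dataclasses import dataclass, field
    from typing import Callable, Dict, List


    @dataclass
    class ActionCue:
        """A display cue that works both as a sign and as a symbol.

        `prompt` is the readily interpretable sign ("what to do now");
        `preconditions` are its symbolic content: named checks on plant
        state that must hold for the cued rule to be valid.
        """
        prompt: str
        preconditions: Dict[str, Callable[[dict], bool]] = field(default_factory=dict)

        def evaluate(self, plant_state: dict) -> List[str]:
            """Return the names of any preconditions that do not hold."""
            return [name for name, check in self.preconditions.items()
                    if not check(plant_state)]


    # Hypothetical cue: "start emergency feedwater" is only a valid rule
    # if the flow path is actually open and suction pressure is adequate.
    cue = ActionCue(
        prompt="Start emergency feedwater pumps",
        preconditions={
            "block valves open": lambda s: s["block_valves_open"],
            "suction pressure adequate": lambda s: s["suction_pressure_kpa"] > 300,
        },
    )

    plant_state = {"block_valves_open": False, "suction_pressure_kpa": 450}

    violated = cue.evaluate(plant_state)
    if violated:
        # Surface the violated constraint alongside the sign, rather than hiding it.
        print(f"{cue.prompt}: preconditions not met -> {', '.join(violated)}")
    else:
        print(f"{cue.prompt}: preconditions satisfied")

  Presented in this way, the cue remains easy to act upon at the rule-based level, while its validity conditions stay visible for knowledge-based monitoring, in the spirit of guidelines (b) and (c).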

  5.6. Self-knowledge about error types and mechanisms

  The only satisfactory way to protect aircrew against the effects of in-flight disorientation is to (a) demonstrate directly the varieties of disorientation in both actual and simulated flight, and (b) instruct them as to the ways in which their earthbound senses provide them with erroneous position-and-motion information in three-dimensional flight. Perhaps our understanding of human error mechanisms has now progressed to a point where it may be possible to provide a useful degree of self-knowledge about these mechanisms to those for whom the consequences of slips, lapses and mistakes are unacceptable. The operators of high-risk technologies are informed about likely system breakdowns and how to deal with them. Why should they not be told something about their own potential for error? They and their colleagues do, after all, constitute the major hazard to such systems. Moreover, the descriptions of human error mechanisms given in this book and elsewhere are couched in broad engineering terms that would not be alien to most operators of high-risk systems.

  6. Epilogue

  Events drive fashions, particularly in the study of human fallibility. In the 1960s, research involving human error had two quite separate faces. In academic laboratories, experimental psychologists (as they were then called) treated error for the most part as just another performance index. In the applied world, behavioural technologists, called human engineers and ergonomists, strove to make cockpits and control rooms fit for humans to work in. Much of this work involved trying to mitigate the physical and psychological insults created by design engineers for whom the human-machine interface was often an afterthought, an area in which the bare control and display necessities were packed in as and where space could be found for them. And in both the academic and applied camps, the influences of Behaviourism were still very apparent. Their world was a fairly atheoretical place. Techniques were the thing.

  The 1970s saw many major changes. Academics (who now preferred to call themselves cognitive rather than experimental psychologists) strayed beyond the laboratory door to study the natural history of everyday errors. They did so primarily to find out more about the covert control processes of human cognition. Since these slips and lapses were collected across a wide range of cognitive activities, it soon became apparent that the limited data-bound theories of experimental psychology were inadequate to explain their evident regularities. Such a diversity of phenomena called for broadly-based framework theories that were more akin to the working approximations of engineering than to the ‘binarisms’ (see Chapter 2) of natural science. They were greatly helped in these formulations by major developments in AI and cognitive science: for example, the General Problem Solver in 1972 and the rebirth of the schema concept in the mid-1970s (see Chapter 2).

  Out in the ‘real’ world, a microchip revolution was taking place. The consequences for the human components of computerised systems were outlined in Chapter 7. Whereas the physical involvement of the human operator was significantly diminished, the imaginative scope for the design engineer was vastly enlarged. With the advent of soft displays on VDUs, the designer was freed from the physical constraints imposed by fixed display panels and large-scale mimic boards. In theory, at least, most interactions between the operator and the system could be confined to a screen and such input devices as keyboards, joysticks, rollerballs, mice and the like. In effect, the interface designer was provided with a blank sheet. But what should be put there?

  Questions such as these, along with regulatory demands to build a safe system from the outset, posed problems that went far beyond the previous concerns of the 1960s ergonomist. Two new breeds emerged: the cognitive ergonomist, someone for whom the major focus of concern was human-computer interaction (or HCI), and the human reliability analyst, whose activities we considered at length earlier in this chapter. In the beginning, neither strayed too far from traditional issues. The HCI specialist worried about VDU resolution, screen layout, the use of colour and other formal issues. The HRA specialist dealt at a very surface level with the probabilities of misreadings, omissions and commission errors. Then came Tenerife, Flixborough and Three Mile Island.

  For human factors specialists of all kinds, the TMI accident highlighted the distinction between slips and mistakes. In the immediate aftermath, the greatest emphasis was placed upon the diagnostic errors of the TMI-2 operators. These had important implications for both the HCI and the HRA practitioners. To the former, it was clear that these misdiagnoses had been aided and abetted by poor interface design. To the latter, it was clear that current probabilistic techniques were inadequate to capture the likelihood of operators basing their recovery actions upon an incorrect mental model of the plant state. At this point—in the early 1980s—the academics and the practitioners started talking to each other seriously, an interchange greatly facilitated by the existence of a theoretical lingua franca supplied by Jens Rasmussen’s skill-rule-knowledge framework.

  One of the aims of this book has been to capture something of the excitement of this often stormy debate. The theorists have been enriched by good applied data, and, it is hoped, the practitioners have been helped by an increasingly consensual approach to applied cognitive theorizing. And that would have made a suitably happy ending to this book had not events once again overtaken us. The systems disasters of the mid- to late 1980s, and particularly Chernobyl, make it apparent that a purely cognitive theory concerned with the mental processes of an individual, no matter how widely held, is quite inadequate to explain the actions of the Chernobyl operators. As discussed at length in Chapter 7, these actions originated in organisational failings at all levels of the Soviet nuclear establishment. So to whom can we now turn for help in shaping a new organisational ergonomics? Sociologists? But they are an almost extinct species, at least in Mrs Thatcher’s Britain. To parody an old sixties song: Where have all the sociologists gone? Gone for managers every one.

  Those concerned with maintaining the safety of complex, high-risk systems face a challenging time ahead. The engineers and technocrats who design and manage these installations have, to some extent, become the victims of their own success. Thanks to the effectiveness of engineered safety measures, complex technologies are now largely proof against single failures, either of humans or components. This is an enormous achievement. Yet it leaves these systems prey to the one hazard for which there is no technological remedy: the insidious concatenation of latent human failures that are an inevitable part of any large organisation.

  Future studies of human error will need to encompass organisational as well as individual fallibility. As this book has tried to show, we are beginning to have some understanding of the cognitive bases of human error: but we still know very little about how these individual tendencies interact within complex groupings of people working with high-risk technologies. It is just these social and institutional factors that now pose the greatest threat to our safety. But perhaps it was always so.

  Appendix

  * * *

  Case Study No. 1: Three Mile Island

  Chain of events and active errors, each followed by its contributing conditions and latent failures (failure types in parentheses)

  Maintenance crew introduces water into the instrument air system.

  Although this error had occurred on two previous occasions, the operating company had not taken steps to prevent its recurrence. (Management failure)

  Turbine tripped. Feedwater pumps shut down. Emergency feedwater pumps come on automatically, but flow blocked by two closed valves.

  The two block valves had been erroneously left in the closed position during maintenance, probably carried out two days prior to the accident sequence. One of the warning lights showing that the valves were closed was obscured by a maintenance tag. (Maintenance failures)

  Rapid rise in core temperature and pressure, causing the reactor to trip. Relief valve (PORV) opens automatically, but then sticks in the open position. The scene is now set for a loss of coolant accident (LOCA) 13 seconds into the emergency.

  During an incident at the Davis-Besse plant (another Babcock & Wilcox PWR) in September 1977, the PORV also stuck open. The incident was investigated by Babcock & Wilcox and the U.S. Nuclear Regulatory Commission. However, these analyses were not collated, and the information obtained regarding appropriate operator action was not communicated to the industry at large. (Regulatory failure)

  Operators fail to recognise that the relief valve is stuck open. The primary cooling system now has a hole in it through which radioactive water, under high pressure, pours into the containment area, and thence down into the basement.

  1. Operators were misled by control panel indications. Following an incident 1 year earlier, an indicator light had been installed. But this merely showed whether or not the valve had been commanded shut: it did not directly reveal valve status. (Design and management failures)

  2. Operators wrongly assumed that the high temperature at the PORV drain pipe was due to a chronically leaking valve. The pipe temperature normally registered high. (Management/procedural failure)

  Operators failed to diagnose the stuck-open PORV for more than 2 hours. The resulting water loss caused significant damage to the reactor.

  1. The control panel was poorly designed, with hundreds of alarms that were not organised in a logical fashion. Many key indications were sited on the back wall of the control room. More than 100 alarms were activated with no means of suppressing unimportant ones. Several instruments went off-scale, and the computer printer ran more than 2 hours behind events. (Design and management failures)

  2. Operator training, consisting largely of lectures and work in the reactor simulator, provided an inadequate basis for coping with real emergencies. Little feedback was given to students, and the training programme was insufficiently evaluated. (Training and management failures)

  The crew cut back the high-pressure injection (HPI) of water into the reactor coolant system, thus reducing the net flow rate from around 1000 gallons/min to about 25 gallons/min. This ‘throttling’ caused serious core damage.

  1. Training emphasised the dangers of flooding the core. But this took no account of the possibility of a concurrent LOCA. (Training and management failures)

  2. Following the 1977 Davis-Besse incident, the Nuclear Regulatory Commission issued a publication that made no mention of the fact that these operators had interrupted the HPI. The incident appeared under the heading of ‘valve malfunction’, not ‘operator error’. (Regulatory failure)

  Case Study No. 2: Bhopal

  Selected latent failures, with their origins in parentheses

  1. System errors

  Locating a high-risk plant close to a densely populated area. (Government/Management)

  Poor emphasis on system safety. No safety improvements after adverse audits. (Management)

  No improvement in safety measures, despite six prior accidents. (Government/Management)

  Storing 10 times more methyl isocyanate (MIC) than was needed daily. (Management)

  Poor evacuation measures. (Government/Management)

  Safety measures not upgraded when the plant switched to large-scale storage of MIC. (Management)

  Heavy reliance on inexperienced operators and supervisors. (Management)

  Factory inspector’s warning on washing MIC lines neglected. (Management)

  Failure to release telex message on MIC treatment. (Management)

  2. Operator errors

  Reduction in operating and maintenance staff. (Management)

  Using an untrained superintendent for the MIC plant. (Management)

  Repressurising the tank after it had initially failed to pressurise. (Management/Operator)

  Issuing orders for washing when the MIC tank failed to repressurise. (Management/Operator)

  Not operating the warning siren until the leak became severe. (Management)

  Switching off the siren immediately after starting it. (Management)

  Failure to recognise that the pressure rise was abnormal. (Management/Operator)

  Failure to use the empty MIC tank to release pressure. (Management/Operator)

  3. Hardware errors

  Insufficient scrubber capacity. (Design)

  Refrigeration plant not operational. (Management/Maintenance)

  No automatic sensors to warn of temperature increase. (Design/Management)

  Pressure and temperature indicators did not work. (Management/Maintenance)

  Insufficient gas masks available. (Management)

  Flare tower was disconnected.
