by James Reason
In their original description of the OATS technique, Wreathall and his coworkers (Hall et al., 1982) tentatively suggested that different time-reliability curves could be drawn for skill-based, rule-based and knowledge-based performance on the basis of time-reliability correlations developed by Joe Fragola. This idea has been developed by Hannaman and coauthors (1984) in the form of the human cognitive reliability model, or HCR.
HCR is predicated on the assumption that different kinds of cognitive activity take different times to execute. Its outputs are the time-dependent nonresponse probabilities of NPP operators confronting an abnormal state of the plant. Whereas OATS comprises only a single time-reliability relationship, HCR provides sets of curves, each one relating to a different kind of cognitive processing (SB, RB or KB), for modelling specific situations (e.g., steam generator tube rupture). As such, it provides an assessment of the probability of error persistence against time.
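The HCR correlation is commonly presented as a three-parameter Weibull function of normalised time (time available divided by the median crew response time), with a separate coefficient set for each of the SB, RB and KB levels. A minimal Python sketch of that form follows; the coefficient values are placeholders chosen only to illustrate the curve’s shape, not the published Hannaman et al. (1984) figures.

```python
import math

def hcr_nonresponse(t, t_median, gamma, eta, beta):
    """Crew non-response probability at time t after the initiating event.

    t_median: median crew response time for the task in question;
    gamma, eta, beta: shape coefficients, one set per cognitive level
    (skill-, rule- or knowledge-based).
    """
    t_norm = t / t_median            # HCR works in normalised time
    if t_norm <= gamma:
        return 1.0                   # no response credited before the time lag
    return math.exp(-(((t_norm - gamma) / eta) ** beta))

# Placeholder coefficients, chosen only to show the shape of one curve;
# they are NOT the published Hannaman et al. (1984) values.
for t in (60, 120, 300, 600):
    print(t, hcr_nonresponse(t, t_median=120, gamma=0.7, eta=0.4, beta=1.2))
```

The nonresponse probability stays at 1.0 until the lag parameter is exceeded, then falls towards zero as more time becomes available; the SB, RB and KB coefficient sets yield progressively slower-falling curves.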
Reviewers of the HCR correlational technique (Senders et al., 1985; Embrey & Lucas, 1989) have commended a number of its features:
(a) It is a quick and relatively convenient technique to apply.
(b) It takes account of the time-dependent nature of operator actions.
(c) Good fits have been found between the model and observed completion times in simulator studies.
(d) It covers knowledge-based behaviour as well as the more usually modelled skill-based and rule-based levels of performance.
(e) Its input data (times available from the onset of the emergency) are the same as those used by hardware reliability assessment techniques.
While the HCR model predicts time to task completion, it does not, of itself, constitute a model of error. As Senders and his coauthors noted (1985, p. 44): “It uses models or data on error as its input, rather than producing such data as an output. Specifically, the occurrence of error increases time to completion: but the model does not predict when or how often such an error will occur.”
According to Embrey and Lucas (1989), a major shortcoming of the HCR correlation is its limited focus upon nonresponse probabilities. Many of the critical operator errors observed in NPP emergencies involve commission as well as omission errors, for example, the rapid selection of wrong courses of action or the deliberate violation of established procedures. Embrey and Lucas (1989) also point out that the rules for assigning tasks to the various performance levels (SB, RB and KB) do not take account of the rapid switching between these levels apparent during the course of actual events (see Woods, 1982). Similarly, it is not easy to determine whether the nonresponse probabilities (obtained from the HCR curves) are due to the slow processing of information, or to a failure to detect the onset of the emergency. These are psychologically different processes, and it seems unlikely that they could be described by the same time-available/nonresponse probability curves.
2.3. Empirical technique to estimate operators’ errors (TESEO)
TESEO is an acronym for the Italian name: tecnica empirica stima errori operatori. It was developed by the Reliability Research Group of Ente Nazionale Idrocarburi (Bello & Colombari, 1980) from interview data collected in petrochemical process plants, but it is also applicable to nuclear power plants.
TESEO yields the probability of operator failure by multiplying together five error probability parameters, K1 to K5:
K1 = type of activity (routine or not routine, requiring close attention or not): probability parameters between 0.001 and 0.1.
K2 = a temporary stress factor for routine activities (assigned according to the time available): parameters between 10 and 0.5; a temporary stress factor for nonroutine activities (again depending upon the available time): parameters between 10 and 0.1.
K3 = operator qualities (assigned according to selection, expertise and training): parameters between 0.5 and 3.
K4 = an activity anxiety factor (dependent upon the situation, either a grave emergency, potential emergency or nominal conditions): parameters between 3 and 1.
K5 = an activity ergonomic factor (according to the quality of the microclimate and plant interface): parameters between 0.7 and 10.
Kletz (1985, p. 80) gives an example of how the technique is applied in practice: “Suppose a tank is filled once a day and the operator watches the level and closes a valve when it is full. The operation is a very simple one, with little to distract the operator who is out on the plant giving the job his full attention.” Assigning values for the five parameters in this situation, we get: K1 = 0.001; K2 = 0.5; K3 = 1; K4 = 1; K5 = 1; giving a predicted failure rate of 1 in 2,000 occasions, or roughly once every six years.
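The computation is a straight multiplication of the five K factors. The following minimal sketch restates Kletz’s example; only the factor values come from the text, the function name is ours:

```python
def teseo(k1, k2, k3, k4, k5):
    """TESEO point estimate: the five factors are simply multiplied."""
    return k1 * k2 * k3 * k4 * k5

# Kletz's tank-filling example: routine task, ample time, nominal conditions.
p_fail = teseo(k1=0.001, k2=0.5, k3=1, k4=1, k5=1)
print(p_fail)            # 0.0005, i.e. 1 failure in 2,000 filling operations
print(1 / p_fail / 365)  # ~5.5 years between failures at one filling per day
```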
Hannaman and coauthors (1984) judge the mathematical framework of this model to be generally useful for quantifying human reliability in specific process situations. It is relatively simple to use and its output compares reasonably well with the assessments of expert judges. Once again, though, its numerical basis is derived from informed guesses rather than hard data.
2.4. Confusion matrix
The confusion matrix was devised by Potash and his coworkers (1981) as a means of evaluating the errors of operators responding to abnormal plant conditions. It has been used for this purpose in two PRAs on American nuclear power plants: Oconee (Oconee PRA, 1984) and Seabrook (Pickard, Lowe & Garrick, 1983). Its distinctive feature is that it seeks to identify various modes of misdiagnosis for a range of possible events (see Swain & Weston, 1988).
The method relies upon the judgements of experts (usually the training staff of the plant in question) as to the likelihood of different misdiagnoses of specific critical plant states. These judgements are solicited in a structured and systematic fashion, allowing for the evaluation of probabilities at different times during a given accident sequence. Thus, its outputs represent the probabilities that operators will fail to respond correctly to events A, B, C, etc., at times t1, t2, ..., tn after the initiation of the sequence. In giving their judgements, the experts are encouraged to take account of such factors as the overlap of symptoms between different events, the operators’ expectations based on their previous experience, the effects of stress and the general ergonomic quality of the control room.
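The structure of the elicited output can be shown schematically. In the Python sketch below, every name and number is invented for illustration: each row of the matrix is a true plant state, each column a possible diagnosis, and a separate matrix would be elicited for each time point t1, t2, ..., tn:

```python
# Hypothetical confusion matrix at time t1: rows are true plant states,
# columns the operators' diagnoses; cells hold expert-judged probabilities.
events = ["SGTR", "LOCA", "loss_of_feedwater"]    # invented event set
confusion_t1 = {
    "SGTR":              {"SGTR": 0.80, "LOCA": 0.15, "loss_of_feedwater": 0.05},
    "LOCA":              {"SGTR": 0.10, "LOCA": 0.85, "loss_of_feedwater": 0.05},
    "loss_of_feedwater": {"SGTR": 0.02, "LOCA": 0.03, "loss_of_feedwater": 0.95},
}
# The off-diagonal mass is the misdiagnosis probability for that state;
# a matrix elicited at a later time t2 would typically show it shrinking.
p_misdiagnosis = 1 - confusion_t1["SGTR"]["SGTR"]   # 0.20 at t1
print(p_misdiagnosis)
```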
The principal advantage of this technique is that it provides a simple structure for helping analysts to identify situations not easily modelled by other HRA methods. It appears to have more value as a qualitative analytical tool than as a quantitative one. Considerable disagreements arise between the probability estimates of different experts. It also shares with other techniques the weakness of being based upon simplistic manipulations of subjective data that, in this case, are low-value absolute probabilities.
2.5. Success likelihood index methodology (SLIM)
The SLI methodology (Embrey, Humphreys, Rosa, Kirwan & Rea, 1984), like the confusion matrix, was developed to provide a means of eliciting and structuring expert judgements. The software products that support this methodology allow experts to generate models that connect error probabilities in a specific situation with the factors that influence that probability. The underlying rationale is that the likelihood of an error occurring in a particular situation depends upon the combined effects of a relatively small number of performance influencing factors (PIFs). This is a somewhat less behaviourist variant of the performance shaping factors (PSFs) used in THERP. The success likelihood index (SLI) is derived from a consideration of the typical variables known to influence error rates (e.g., quality of training, quality of procedures and time available for action). It is also assumed that judges can assess the relative importance of each PIF and give numerical ratings of how good or bad these PIFs are in a given situation. The importance weight and rating are multiplied together for each PIF, and the products are summed to give the success likelihood index. This index is presumed to relate to the probability of success that would be observed over the long run in the particular situation of interest.
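The arithmetic of the index can be made plain with a small sketch; the PIF names, the rating scale and all the numbers below are assumptions for illustration only:

```python
# A minimal sketch of the SLI computation with hypothetical inputs.
# Weights express the judged relative importance of each PIF (summing to 1);
# ratings express how good that PIF is in the situation (here on a 0-9 scale).
pifs = {
    # PIF name: (importance weight, rating)
    "quality_of_training":   (0.5, 7),
    "quality_of_procedures": (0.3, 4),
    "time_available":        (0.2, 8),
}
sli = sum(weight * rating for weight, rating in pifs.values())
print(sli)  # 6.3: a higher SLI implies a higher judged likelihood of success
```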
The SLI methodology has a number of attractive features. It is available in the form of two comprehensive, interactive software packages: SLIM-SAM (SLIM assessment module), which derives the success likelihood indices, and SLIM-SARAH (sensitivity analysis for reliability assessment of humans), which allows additional sensitivity and cost-benefit analyses to be performed. In order to establish the independence of the PIFs (an important assumption of the underlying model), the SLIM-SAM software checks the degree of shared variance between the ratings generated by the judges and informs the user if the ratings on two PIFs are correlated. In addition, up to 10 tasks can be evaluated within a single SLIM session, which substantially reduces the call upon the expert’s time.
At present, there are some difficulties with the calibration of SLIM. A basic assumption is that SLIM may be calibrated with reference to the log-linear equation: log HEP = a·SLI + b (where HEP is the human error probability). In theory, error probabilities can be obtained by reference to two calibration tasks whose error probabilities are objectively known. However, it turns out that the choice of these reference tasks is critical, and the assumed log-linear calibration equation has not been widely accepted. In addition, as will be discussed later, SLIM has not fared particularly well in independent validation studies. But this, as we shall see, is not unique to SLIM.
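A sketch of the two-point calibration follows; the reference-task SLIs and error probabilities are invented. It also makes the criticism concrete: moving either anchor shifts every predicted HEP.

```python
import math

# Solve log10(HEP) = a*SLI + b from two reference tasks with known HEPs,
# then convert any new SLI into an error probability. Numbers are invented.
sli_ref1, hep_ref1 = 2.0, 0.1    # poorly supported reference task
sli_ref2, hep_ref2 = 8.0, 0.001  # well supported reference task

a = (math.log10(hep_ref2) - math.log10(hep_ref1)) / (sli_ref2 - sli_ref1)
b = math.log10(hep_ref1) - a * sli_ref1

def hep_from_sli(sli):
    return 10 ** (a * sli + b)

print(hep_from_sli(6.3))  # ~0.0037: HEP implied for the task scored earlier
```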
Other human reliability assessment techniques that have emerged from this same prolific source include: STAHR (socio-technical assessment of human reliability), which uses influence diagrams to assess the effects upon error rates of complex socio-technical factors such as morale, organizational features and group dynamics (Phillips, Humphreys & Embrey, 1984); and SCHEMA (system for critical human error management and assessment) which also applies the basic SLIM methodology to a wide variety of process plant operations (Embrey & Lucas, 1989). To date, both of these techniques are still in their developmental stages, but they have attracted considerable interest within the human reliability community.
2.6. Systematic human action reliability procedure (SHARP)
It must be evident from even this far from comprehensive list of HRA methods that the practitioner faces a considerable problem in deciding which technique to employ, and when and where to apply it. To ease these difficulties, Hannaman and coworkers (1984) have devised a procedure with the acronym SHARP (systematic human action reliability procedure) “to help PRA analysts incorporate human interactions into a PRA study in a systematic, comprehensive and scrutable manner.” SHARP, as they point out, is neither a model nor a technique, but a means for guiding the selection of appropriate HRA models and techniques. Specifically, it indicates the available options with regard to the representation of operator actions (THERP, OATS, etc.) and identifies the kinds of models or data that underlie the various HRA techniques: human reliability data bases, time-reliability curves, mathematical models, or expert judgements of human reliability.
3. How good are HRA techniques?
3.1. Validation studies
A relatively comprehensive review of the very few studies designed to compare the validity of HRA techniques was carried out by Williams (1985). He strikes the predominant note in his opening sentence: “It must seem quite extraordinary to most scientists engaged in research into other areas of the physical and technological world that there has been little or no attempt made by human reliability experts to validate the human reliability assessment techniques which they so freely propagate, modify and disseminate” (p. 150). As one explanation for this embarrassing lack, he suggests that HRA specialists are too busy generating additional models. But he offers other possible explanations as well: “an aversion to validation in case the outcome is unattractive, political or economic pressures which dissuade modellers from exploring the validation loop, shortage of sufficient personnel to carry out validation exercises, and misguided research methodologies.”
In a ‘peer review’ of THERP, Brune, Weinstein and Fitzwater (1983) asked 29 human factors experts to carry out human reliability analyses covering a range of possible performance scenarios in a nuclear power plant setting. For any one scenario, they found a wide variation in the problem solutions developed by the experts. There were also large differences, as much as five orders of magnitude on some occasions, between their estimated probabilities of failure. This lack of interanalyst reliability continues to be a significant problem in current attempts to apply THERP to the design of the new British PWR at Sizewell ‘B’ (Jackson, Madden, Umbers & Williams, 1987; Whitworth, 1987).
How can one gauge the accuracy of HRA methods? Williams (1985) suggests that we should be looking for outputs that are at least comparable to PRA methods in their predictive precision. This means that HRA methods should produce predictions that are accurate to within a factor of 4 on about 90 per cent of occasions; or, somewhat more generously, predictions that are accurate to within one order of magnitude (a factor of 10) on 100 per cent of occasions.
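This criterion is straightforward to state computationally; the sketch below (function name ours) simply tests whether a prediction lies within a given multiplicative factor of the observed value:

```python
def within_factor(predicted, observed, k):
    """True if the prediction is within a multiplicative factor k of reality."""
    ratio = predicted / observed
    return 1 / k <= ratio <= k

# Williams' stricter criterion: accurate to within a factor of 4.
print(within_factor(predicted=4e-3, observed=1e-3, k=4))   # True (just inside)
print(within_factor(predicted=5e-3, observed=1e-3, k=4))   # False
```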
The Williams review covered a number of the techniques discussed above (THERP, TESEO, SLIM and OATS). In addition to other methods not mentioned here, it also included the very simple technique of absolute probability judgement (APJ), in which judges assign numerical failure likelihoods directly to the tasks being assessed. The absolute probability judgement for any given task is then represented by the median of the different judges’ estimates. In essence, it means taking the average of a set of informed guesses.
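In computational terms, APJ reduces to a single line; the judges’ estimates below are invented for illustration:

```python
import statistics

# Hypothetical direct estimates of a task's failure probability from five judges.
estimates = [3e-3, 1e-2, 5e-3, 2e-3, 8e-3]
apj = statistics.median(estimates)   # 5e-3 becomes the task's assessed HEP
print(apj)
```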
The overall conclusions of the review were that if high-risk technologists are looking for an HRA method that is comparable in terms of its predictive accuracy to general reliability assessments, then APJ is probably the best. Kirwan (1982) found that median absolute probability judgements were accurate to within a factor of 2 on 30 per cent of occasions, to within a factor of 4 on about 60 per cent of occasions and to within a factor of 10 on nearly all occasions when compared to the known probabilities of events. Interestingly, in the SLIM assessment, the combined generic weightings and task ratings showed little or no relationship to actual data, whereas the ‘untreated’ task ratings accounted for about 50 per cent of the variance. Kirwan also found that group consensus values offered no advantage over individual estimates, that no one ‘expert type’ performed significantly better than any other, and that calibration feedback did not significantly improve APJ estimates.
If the analyst is looking for scrutability, THERP offers the most comprehensible form of modelling. But, as Williams (1985, p. 160) concluded, “if they are seeking methods which are scrutable, accurate and usable by non-specialists, the short answer is that there is no single method to which they can turn. The developers of human reliability assessment techniques have yet to demonstrate, in any comprehensive fashion, that their methods possess much conceptual, let alone, empirical validity.”
3.2. The human factors reliability benchmark exercise
Recently, the Joint Research Centre (Ispra) of the European Commission (Poucet, 1988) organised a systematic comparison of HRA modelling techniques in the nuclear power context. Fifteen teams from eleven countries applied selected HRA techniques to two case studies: (a) the analysis of routine testing and maintenance with special regard to test-induced failures, and (b) the analysis of human actions during an operational transient with regard to the accuracy of operator diagnosis and the effectiveness of corrective actions. The methods applied included THERP, SLIM, HCR and TESEO.
In both cases, there was considerable variability in the quantitative results. The main contribution to this lack of agreement was the problem of mapping the complex reality on to these relatively simple models. This was particularly evident in the wide variety of modelling assumptions and in the different levels of decomposition used in the analyses. There was also considerable variation in the ways in which recovery and dependency structures were included in the chosen model. Mistakes were found to be far harder to quantify than slips and lapses.
Within a particular team, the results obtained from THERP and SLIM agreed fairly well. However, this concordance was largely due to the use of THERP data in SLIM calibrations. As indicated earlier, the choice of anchoring points (the calibration error probabilities) in the SLIM application greatly influenced its output.
The exercise also revealed some of the dangers in using a large and detailed error data base, such as that associated with the THERP technique. There was a marked tendency for analysts to model only those errors that appeared in the data base and to ignore others that qualitative analyses have shown to be important. Interestingly, the performance influencing factors given the highest importance by the experts in the SLIM technique were not considered in the THERP application because the THERP handbook had made no provision for them. THERP, however, was confirmed as being the reference method for human reliability assessment of procedural tasks. But this is hardly surprising since, for the present at least, the THERP handbook is the only readily available source of data on human error probabilities.
3.3. Qualitative criteria
A number of authors have sought to make explicit the general criteria that a good HRA technique should satisfy. Schurman and Banks (1984) provided a table for assessing the merits of the nine HRA models they reviewed. The criteria included cost, breadth of applicability, ease of use, face validity, precision, acceptance by professionals of the quality of the output, sensitivity to differences in the systems under evaluation, predictive validity, replicability, scrutability (this means no hidden assumptions, no special operations and full documentation) and how closely the output of the model approximates to a true ratio-scale measure. On these criteria, Schurman and Banks found no method scoring above 50 per cent of the possible total score when a quantitative rating scale was applied to each dimension.