Expert Political Judgment


by Philip E. Tetlock


  BACKGROUND INFORMATION

  To assist recall, we provided a chronology of key events, which began on October 16, 8:45 A.M., when Bundy broke the news to JFK that Soviet surface-to-surface missiles were being deployed in Cuba, and ended on October 29, when Stevenson and McCloy met with Kuznetsov in New York to work out details of the settlement that Kennedy and Khrushchev had already reached.

  RETROSPECTIVE PERCEPTIONS OF INEVITABILITY AND IMPOSSIBILITY

  The order of administration of these questions was always counterbalanced. The instructions for the inevitability curve exercise were as follows: “Let’s define the crisis as having ended at the moment on October 29 when Kennedy communicated to the Soviet leadership his agreement with Khrushchev’s radio message of October 28. At that juncture, we could say that, barring unforeseen problems of implementing the agreement, some form of peaceful resolution was a certainty—a subjective probability of 1.0. Going backward in time, day by day, from October 29 to October 16, trace on the graph your perceptions of how the likelihood of a peaceful resolution rose or fell during the fourteen critical days of the crisis. If you think the U.S. and USSR never came close to a military clash between October 16 and 29, then express that view by assigning consistently high probabilities to a peaceful resolution across all dates (indeed, as high as certainty, 1.0, if you wish). If you think the superpowers were very close to a military conflict throughout the crisis, then express that view by assigning consistently low probabilities to a peaceful resolution across all dates. Finally, if you think the likelihood of a peaceful resolution waxed and waned from episode to episode within the crisis, then express that view by assigning probabilities that rise or fall in accord with your intuitions about how close the U.S. and USSR came to a military clash at various junctures in the crisis. To start, we have set the subjective probability of peace at 1.0 (certainty) for October 29, marking the end of the crisis.”

  The instructions for filling in impossibility curves were as follows: “Let’s think of the Cuban missile crisis from a different perspective. Rather than focusing on the outcome that did occur (some form of peaceful resolution of the dispute), let’s focus on the set of all possible more violent endings of the crisis. So, let’s define the crisis as having ended at the moment on October 29 when Kennedy communicated to the Soviet leadership his agreement with Khrushchev’s radio message of October 28 (offering to withdraw missiles in return for a public pledge not to invade Cuba and a private commitment to withdraw U.S. missiles from Turkey). At that juncture, we could say that, barring unforeseen problems with implementing the agreement, the likelihood of alternative more violent endings of the crisis had fallen to zero. The alternative outcomes had, in effect, become impossible. Now, going backward in time, day by day, from October 29 to October 16, trace on the graph your perceptions of how the likelihood of all alternative, more violent, endings of the crisis rose or fell during the fourteen critical days of the crisis. If you believe that the USA and USSR never came close to a military clash at any point between October 16 and 29, then feel free to express that view by assigning consistently low probabilities to a violent ending across all dates (indeed, as low as impossibility or zero, if you wish). If you think the superpowers were very close to a military clash throughout the crisis, then feel free to express that view by assigning consistently high probabilities to a violent ending across all dates. Finally, if you think the likelihood of a peaceful resolution waxed and waned from episode to episode within the crisis, then you should feel free to express that view by assigning probabilities that rise or fall in accord with your intuitions. To start, we have set the subjective probability of war at 0.0 (impossible) for October 29, marking the end of the crisis.”

  Research Procedures for the Rise of the West Experiment (Study 4)

  This study included two conditions: (a) a no-unpacking control condition (n = 27) in which experts generated inevitability curves for some form of Western geopolitical domination and impossibility curves for the set of all possible alternatives to Western geopolitical domination (order counterbalanced); (b) an intensive unpacking condition (n = 36) in which experts were first asked to unpack the set of all possible alternatives to Western geopolitical domination into progressively more detailed subsets, beginning with classes of possible worlds in which no region of the world achieved global hegemony (either because of a weaker Europe or stiffer resistance from outside Europe) and moving on to classes of possible worlds in which a non-Western civilization achieved global hegemony (perhaps China, Islam, or the Mongols, or a less familiar alternative); then to rate the “imaginability” of each subset of scenarios; and then to complete the inevitability and impossibility curves, which began at A.D. 1000 and moved by fifty-year increments up to A.D. 1850 (where the subjective probability of Western dominance was fixed at 1.0 and that of possible alternatives at 0.0). The order of inevitability and impossibility judgments was, again, always counterbalanced.

  1 For details on integrative complexity coding, see P. Suedfeld, P. E. Tetlock, and S. Streufert, “Conceptual/Integrative Complexity,” in Motivation and Personality: Handbook of Thematic Content Analysis, ed. C. P. Smith (Cambridge: Cambridge University Press, 1992).

  Technical Appendix

  Phillip Rescober and Philip E. Tetlock

  WE DIVIDE our analysis into two sections: one organized around correspondence indicators of good judgment, which focus on the degree to which probability judgments mirror regularities in the external world, and the other organized around logical-coherence indicators, which focus on the degree to which probability judgments obey the formal axioms of probability theory.

  PART A: CORRESPONDENCE INDICATORS OF GOOD JUDGMENT

  Probability Scoring

  Our primary correspondence indicator is the probability score (PS). We shall discover that it is useful (1) to decompose this measure of the goodness of fit of subjective probabilities to objective reality into a variety of indicators (variability, calibration, and discrimination); (2) to modify this measure to take into account a variety of objections (we shall examine five types of scoring adjustments designed to address five categories of objections).

  The simplest equation is

  $$PS_i = (p_i - x_i)^2$$

  where $x_i$ (a dummy variable) equals 1 if outcome $i$ occurs and 0 otherwise, and $p_i$ is the forecaster’s prediction (or forecast) for a given outcome $i$.

  Forecasters receive the ideal score, zero, when they always assign predictions of 0 to outcomes that do not occur (in this case, $(p_i - x_i)^2 = (0 - 0)^2 = 0$) and always assign predictions of 1 to outcomes that do occur (in this case, $(p_i - x_i)^2 = (1 - 1)^2 = 0$).

  When the forecaster makes many (M) dichotomous predictions, the probability score is the average of the squared errors:

  $$PS = \frac{1}{M}\sum_{i=1}^{M}(p_i - x_i)^2$$

  Example: suppose a forecaster assigns a probability of 0.7 to an event that occurs and a probability of 0.2 to an event that does not occur. Then $PS = [(0.7 - 1)^2 + (0.2 - 0)^2]/2 = (0.09 + 0.04)/2 = 0.065$.

  We can readily adapt this procedure for forecasting problems with multiple outcomes. Assume a forecaster assigns probabilities of $p_A = 0.1$, $p_B = 0.4$, $p_C = 0.5$ to three mutually exclusive and exhaustive possibilities: the future value of an outcome must be (a) “better than,” (b) “the same as,” or (c) “worse than” its current value. Suppose (c) occurs. The probability score is

  $$PS = (0.1 - 0)^2 + (0.4 - 0)^2 + (0.5 - 1)^2 = 0.01 + 0.16 + 0.25 = 0.42$$
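  To make the computation concrete, here is a minimal Python sketch (ours, not the appendix's); the function names brier_score and multi_outcome_score, and the example data, are illustrative:

```python
def brier_score(forecasts):
    """Average probability score over (prediction, outcome) pairs,
    where outcome is 1 if the event occurred and 0 otherwise."""
    return sum((p - x) ** 2 for p, x in forecasts) / len(forecasts)

def multi_outcome_score(probs, occurred):
    """Probability score for one event with mutually exclusive outcomes.
    probs maps each outcome label to its forecast probability;
    occurred is the label of the outcome that happened."""
    return sum((p - (1 if label == occurred else 0)) ** 2
               for label, p in probs.items())

# The worked example from the text: p_A = 0.1, p_B = 0.4, p_C = 0.5; (c) occurs.
print(multi_outcome_score({"a": 0.1, "b": 0.4, "c": 0.5}, "c"))  # 0.42
```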

  Probability Score Decomposition

  Probability scoring has a certain elegant simplicity. It does not, however, tell us all we need to know to answer key questions about judgmental performance. It is necessary to take additional steps: (a) to decompose the variance in probability scores to obtain more refined estimates of how good a job people are doing at assigning realistic probabilities to possible futures (measures of environmental variability and forecaster calibration and discrimination); (b) to adjust probability scores to address a host of potentially valid objections (introducing difficulty, value, controversy, fuzzy-set, and probability-weighting adjustments).

  A forecaster could get a high or low PS as a result of many distinguishable influences. Our analytical starting point is the Murphy decomposition, which breaks probability scores into three components: variability index (VI), calibration index (CI), and discrimination index (DI).1 The decomposition of the probability score in the two-outcome case is

  $$PS = b(1 - b) + \frac{1}{N}\sum_{t=1}^{T} n_t (p_t - b_t)^2 - \frac{1}{N}\sum_{t=1}^{T} n_t (b_t - b)^2 = VI + CI - DI$$

  where $b$ is the base rate for a particular outcome (the proportion of times the outcome occurs over all events);

  $b_t$ is the base rate for a particular prediction category; for example, if a forecaster predicted on ten occasions that an event would occur with probability $p_t$, and the event occurred on six of those ten occasions, then $b_t = 0.6$;

  $N$ is the total number of events;

  $n_t$ is the number of predictions in the $t$th category;

  $T$ is the number of categories of predictions;

  $p_t$ is the prediction of the $t$th category.
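  The decomposition is easy to verify numerically. Below is a minimal Python sketch (our own, not part of the appendix); murphy_decomposition and the toy data are illustrative:

```python
from collections import defaultdict

def murphy_decomposition(forecasts):
    """Split the probability score into VI, CI, and DI for a list of
    (prediction, outcome) pairs, where outcome is 1 or 0.
    PS = VI + CI - DI by construction."""
    N = len(forecasts)
    b = sum(x for _, x in forecasts) / N        # overall base rate
    groups = defaultdict(list)                  # outcomes binned by prediction
    for p, x in forecasts:
        groups[p].append(x)
    VI = b * (1 - b)
    CI = sum(len(xs) * (p - sum(xs) / len(xs)) ** 2
             for p, xs in groups.items()) / N
    DI = sum(len(xs) * (sum(xs) / len(xs) - b) ** 2
             for xs in groups.values()) / N
    return VI, CI, DI

# Identity check on toy data: the direct score equals VI + CI - DI.
data = [(0.8, 1), (0.8, 1), (0.8, 0), (0.2, 0), (0.2, 0), (0.2, 1)]
VI, CI, DI = murphy_decomposition(data)
ps = sum((p - x) ** 2 for p, x in data) / len(data)
print(round(ps, 4), round(VI + CI - DI, 4))  # 0.24 0.24
```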

  Components of Probability Scores: Variability, Calibration, and Discrimination

  Figure A.1 lays out the interpretive challenges raised by each of the three components.

  Variability, defined as $VI = b(1 - b)$, is a measure of environmental (un)predictability. The range of values for the variability index is $0 \le VI \le 0.25$.

  CASE 1 Easiest-to-predict environments: If the base rate is either 0 or 1, there is no variability (VI = 0) and a simple always-predict-the-base-rate strategy is perfectly accurate and receives a perfect probability score, zero.

  CASE 2 Increasingly-difficult-to-predict environments: As the base rate approaches 0.5, it becomes harder to predict which outcome will occur. Suppose the base rate is 0.8. This situation is easier to predict than 0.5 because, in the former case, one outcome occurs four times more often than the other (0.8/0.2). An always-predict-the-base-rate strategy yields a better expected probability score in the 0.8 environment (0.16) than in the 0.5 environment (0.25). More generally, the probability score will be inflated by increasingly large VI values as the base rate approaches 0.5 (and VI approaches its maximum value of 0.25).
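  The expected score of the always-predict-the-base-rate strategy can be derived in one line (our addition, not the original appendix): predicting $p = b$ for every event yields

  $$E[PS] = b(b - 1)^2 + (1 - b)(b - 0)^2 = b(1 - b)\big[(1 - b) + b\big] = b(1 - b) = VI$$

  so the strategy's expected score equals the variability index itself: 0.16 at a base rate of 0.8 and 0.25 at 0.5.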

  Calibration is the weighted average of the mean-square differences between the proportion of predictions correct in each probability category and the probability value of that category. The range of values for the calibration index is 0 ≤ CI ≤ 1, with zero corresponding to perfect calibration and representing the best possible score.

  Consider five ideal-type cases for the two-outcome model:

  CASE 1 A perfect calibration score of zero is achieved if no events assigned 0.0 occur, 10 percent of events assigned 0.1 occur, and so forth. One could achieve this score by possessing a keen awareness of the limits of one’s knowledge (from experience, one has learned that when one feels x degree of confidence, things happen x percent of the time). Or one could achieve this score by adopting a cautious fence-sitting approach to assigning probabilities that never strays from the base-rate probability of events.

  Figure A.1. Possible interpretations of different components of probability scores.

  Figure A.2. Four forms of imperfect calibration. Adapted from D. Koehler et al., “The Calibration of Expert Judgment,” in Heuristics and Biases, ed. T. Gilovich et al. (Cambridge: Cambridge University Press, 2002).

  There are even more ways to be poorly calibrated. Cases 2 through 5 present four ways (each captured in figure A.2) of obtaining the same (rather poor) calibration score of 0.48.

  CASE 2 Overprediction across the entire probability scale (subjective likelihood always exceeds objective likelihood).

  CASE 3 Underprediction across the entire probability scale (subjective likelihood always less than objective likelihood).

  CASE 4 The overextremity pattern: underprediction for subjective probabilities below 0.5 and overprediction for subjective probabilities greater than 0.5.

  CASE 5 The underextremity pattern: overprediction for subjective probabilities below 0.5 and underprediction for subjective probabilities greater than 0.5.
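  A quick way to read CI magnitudes (our illustration, not the appendix's): because CI is a weighted mean of squared probability-frequency gaps, a forecaster whose stated probability overshoots the observed frequency by a constant $\Delta$ in every category earns

  $$CI = \frac{1}{N}\sum_{t=1}^{T} n_t (p_t - b_t)^2 = \frac{1}{N}\sum_{t=1}^{T} n_t \Delta^2 = \Delta^2$$

  so a uniform gap of 0.2 yields CI = 0.04, and the score of 0.48 shared by cases 2 through 5 corresponds to a root-mean-square gap of about 0.69.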

  Discrimination indexes the judge’s ability to sort the predictions into probability categories (zero, 0.1, etc.) such that the proportions of correct answers across categories are maximally different from each other. Higher scores indicate greater ability to use the probability scale to distinguish occurrences from nonoccurrences than could have been achieved by always predicting the base rate of occurrence for the target event (b).

  The range of possible values for the discrimination index is 0 ≤ DI ≤ b(1 – b), with b(1 – b) corresponding to perfect discrimination and representing the best possible score. Note that the inequality statement can be rewritten as 0 ≤ DI ≤ VI, where VI denotes the variability index.

  Consider five ideal-type cases for the two-outcome model:

  CASE 1 A perfect discrimination score can be achieved by being phenomenally prescient and assigning zero probability to all events that do not happen and a probability of 1.0 to all events that do happen.

  CASE 2 As a squared indicator, the discrimination index is insensitive to the direction of differences between the frequency with which events in a probability category occur and the base-rate frequency of those events. Thus, it is possible to achieve a perfect discrimination score by being phenomenally inaccurate and assigning a probability of 1.0 to nonoccurrences and a probability of zero to occurrences.

  CASE 3 A perfect discrimination score can be achieved when the probability of the predicted outcome materializing randomly alternates between 0 and 1 as one moves across the subjective probability scale.

  The worst discrimination score, zero, can be achieved when the probability of the predicted outcome materializing within each subjective probability category always equals the overall base rate. This happens when either:

  CASE 4 The forecaster fails to make any discriminations and always predicts the same probability—whether it is the base rate, the chance prediction, or some other value—for all possible outcomes. (Thus, T = 1 and $b_t = b$.)

  CASE 5 The forecaster assigns different subjective probabilities, but these distinctions have no predictive value: the base rate within each probability category equals the overall base rate. (Thus, T > 1 and $b_t = b$ for all $t = 1, 2, \ldots, T$.)
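  These extremes are easy to check numerically with the murphy_decomposition sketch given earlier (again our own illustration):

```python
# Case 1: an omniscient forecaster attains DI = VI (here b = 0.4, VI = 0.24).
omniscient = [(1.0, 1), (1.0, 1), (0.0, 0), (0.0, 0), (0.0, 0)]
print(murphy_decomposition(omniscient))    # VI = 0.24, CI = 0.0, DI = 0.24

# Case 4: a fence-sitter who always predicts the base rate attains DI = 0.
fence_sitter = [(0.4, 1), (0.4, 1), (0.4, 0), (0.4, 0), (0.4, 0)]
print(murphy_decomposition(fence_sitter))  # VI = 0.24, CI = 0.0, DI = 0.0
```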

  Normalized Discrimination Index (NDI)

  Discrimination scores can fall between total ignorance (zero) and omniscience (equal to VI). The NDI tells us how well forecasters discriminate occurrences from nonoccurrences relative to the benchmark of omniscience:

  $$NDI = \frac{DI}{VI}$$

  A forecaster with DI = 0.06 in an environment where VI = 0.24, for example, has achieved one-quarter of the discrimination an omniscient forecaster could.

  Calibration versus Discrimination: Distinguishing the Self-aware from Fence-sitters

  It is possible to have good calibration but poor discrimination scores if one is a fence-sitter who assigns only a narrow range of probability values (say, 0.4 to 0.6) and the target events occur between 0.4 and 0.6 of the time. And it is possible to have stellar discrimination and terrible calibration scores if one has a flair for making consistently and dramatically wrong predictions. Forecasters with good calibration and discrimination scores give us the best of both worlds: a realistic sense of what they can deliver and the ability to differentiate lower from higher probability events.
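  Continuing the same illustrative sketch, the perfectly wrong forecaster of discrimination Case 2 above shows how terrible calibration can coexist with perfect discrimination:

```python
# Probability 1.0 assigned to nonoccurrences, 0.0 to occurrences (b = 0.4):
inverted = [(0.0, 1), (0.0, 1), (1.0, 0), (1.0, 0), (1.0, 0)]
print(murphy_decomposition(inverted))  # VI = 0.24, CI = 1.0, DI = 0.24 (= VI)
```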

  Overconfidence versus Regression toward the Mean: Distinguishing Facts from Artifacts

  Demonstrating overconfidence requires more than demonstrating that, when experts assign 100 percent probability to x, x occurs less than 100 percent of the time, or that when experts assign zero probability to x, x occasionally occurs. We should expect such effects from regression toward the mean. For example, suppose that “overconfidence” is merely a by-product of measurement error in both subjective probabilities and objective frequencies. The best prediction of the objective frequency ($\hat{y}_i$) for events at a given level of subjective probability ($x_i$) would be

  $$\hat{y}_i = \bar{y} + r_{xy}\frac{s_y}{s_x}(x_i - \bar{x})$$

  where $\bar{x}$ is the average subjective probability;

  $s_x$ and $s_y$ represent the standard deviations of subjective probabilities and objective frequencies;

  $r_{xy}$ is the correlation between subjective probabilities and objective events;

  $\bar{y}$ is the average objective frequency.
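  The regression prediction is mechanical to compute; a brief Python sketch (ours; the function name is illustrative, and statistics.correlation requires Python 3.10 or later):

```python
import statistics

def regression_predicted_frequency(x_i, xs, ys):
    """Best linear prediction of the objective frequency for a forecast x_i,
    given paired samples of forecasts (xs) and observed frequencies (ys)."""
    x_bar, y_bar = statistics.mean(xs), statistics.mean(ys)
    s_x, s_y = statistics.stdev(xs), statistics.stdev(ys)
    r_xy = statistics.correlation(xs, ys)
    # With r_xy < 1, extreme forecasts map to less extreme predicted
    # frequencies: apparent "overconfidence" at the scale's ends.
    return y_bar + r_xy * (s_y / s_x) * (x_i - x_bar)
```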

  The substantive question is whether forecasters are so overconfident—across probability values—that it becomes implausible to dismiss the phenomenon as a “mere regression artifact.” We have a compelling argument against this claim if we can show that the average probability judgment is significantly different from the average objective frequency (or the base rate) for well-specified classes of events. A significant difference is evidence of systematic forecasting error that is logically independent of regression:

  $$t = \frac{\bar{p} - b}{\sqrt{(s_p^2 + s_b^2)/N}}$$

  where $\bar{p}$ is the average prediction;

  $b$ is the base rate;

  $N$ is the number of predictions made;

  $s_p^2$ is the unbiased sample variance in the predictions made; it is calculated by

  $$s_p^2 = \frac{1}{N - 1}\sum_{i=1}^{N}(p_i - \bar{p})^2$$

  $s_b^2$ is the variance in the base rate; it is calculated by

  $$s_b^2 = \frac{1}{N - 1}\sum_{i=1}^{N}(x_i - b)^2 = \frac{N}{N - 1}\,b(1 - b)$$
  Another approach to escaping the regression-toward-the-mean criticism is to demonstrate that the average forecasting error in one group differs from that in another group. To do this, one compares the difference of differences between subjective probability and objective frequency:

  $$t = \frac{(\bar{p}_1 - b_1) - (\bar{p}_2 - b_2)}{\sqrt{s_1^2/N_1 + s_2^2/N_2}}$$

  where $\bar{p}_i$ is the average prediction for group i;

  $b_i$ is the base rate for group i;

  $N_i$ is the number of predictions that group i had to make;

  $s_i^2$ is the unbiased sample variance for group i, defined over the per-forecast errors $p_{ij} - x_{ij}$ (the jth prediction and outcome in group i) by

  $$s_i^2 = \frac{1}{N_i - 1}\sum_{j=1}^{N_i}\big[(p_{ij} - x_{ij}) - (\bar{p}_i - b_i)\big]^2$$
  A substantive, as opposed to chance-driven, interpretation gains credibility to the degree that one can show that, holding measurement error roughly constant, the probability-reality gap varies predictably as a function of independent variables hypothesized to inflate overconfidence (cognitive style, expertise, extremism, short-term versus long-term forecasts, etc.)—in effect, a construct-validational argument.

 
