
Expert Political Judgment


by Philip E. Tetlock


  Example:

  Imagine three possible outcomes: status quo (SQ), change for the better (UP), and change for the worse (DOWN). Suppose a forecaster always predicts 0.8 for SQ, 0.1 for UP, and 0.1 for DOWN. Assume that the base rates for SQ, UP, and DOWN are 0.5, 0.25, and 0.25, respectively.

  Note that the value adjustments are

  • kSQ = SQ – bSQ = 0.8 – 0.5 = 0.3

  • kUP = UP – bUP = 0.1 – 0.25 = –0.15

  • kDOWN = DOWN – bDOWN = 0.1 – 0.25 = –0.15

  The expected unadjusted probability score is given by
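
  A sketch of the calculation, treating the score as the standard multioutcome quadratic (Brier) score and taking the expectation over the assumed base rates:

  E(PS) = 0.5[(0.8 – 1)² + 0.1² + 0.1²] + 0.25[0.8² + (0.1 – 1)² + 0.1²] + 0.25[0.8² + 0.1² + (0.1 – 1)²] = 0.5(0.06) + 0.25(1.46) + 0.25(1.46) = 0.76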

  The expected value-adjusted probability score is given by
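
  Under the same assumptions, subtracting each k from the corresponding forecast (which, in this example, makes the adjusted forecasts equal the base rates 0.5, 0.25, and 0.25):

  E(PSadj) = 0.5[(0.5 – 1)² + 0.25² + 0.25²] + 0.25[0.5² + (0.25 – 1)² + 0.25²] + 0.25[0.5² + 0.25² + (0.25 – 1)²] = 0.5(0.375) + 0.25(0.875) + 0.25(0.875) = 0.625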

  This method of value-adjusting probability scores says to forecasters: “It doesn’t matter whether you over- or underpredict. We will take out the difference between your average forecast and the base rate for the outcome.” This value adjustment will thus have larger effects to the degree that forecasters repeatedly make errors in the same direction and to the same degree.

  There is understandable controversy over how far to take value adjustments. For instance, one possible concession to the “I made the right mistake defense” would be to apply separate adjustments to underprediction and overprediction, thus correcting for the average error in each direction. This concession is, however, too generous. For one thing, it makes broken-clock predictors (that always make the same mistakes) look perfectly calibrated. For another, separate adjustments “assume away the forecasting problem.” They rest on the far-fetched assumption that forecasters always knew in advance which error they were going to make. Accordingly, we opt here for single, one-size-fits-all-forecasts adjustments, but we recognize that others may legitimately opt to go further by configuring value adjustments to specific country-variable combinations.

  Just as we recognized earlier that probability scores can be decomposed into VI, CI, and DI, the same can be done for the k-value adjusted probability score for N two-outcome events:
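
  A reconstruction using the standard decomposition, in which nt is the number of forecasts assigned probability pt, dt is the observed frequency of occurrence within that category, and b is the overall base rate, gives three terms,

  PSk = b(1 – b) + (1/N) Σt nt(pt – k – dt)² – (1/N) Σt nt(dt – b)²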

  which correspond to VI, CI, and DI respectively. Notice that the only difference between the original and the adjusted decomposition is that k appears in the CI term. This value adjustment improves probability scores entirely by improving calibration scores.

  The Differential-weighting Method

  Another method of value-adjusting probability scores avoids the logical paradoxes of the k method. This alternative method applies differential weights to errors of overprediction (a0) and underprediction (a1):
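
  One form consistent with this description (a reconstruction in which a0 multiplies the squared errors on events that did not occur and a1 multiplies the squared errors on events that did occur) is

  PSadj = (1/N)[a0 Σ(pi – 0)² + a1 Σ(pi – 1)²]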

  where N0 is the number of events that did not occur

  N1 is the number of events that occurred

  N = N0 + N1

  When a0 = a1 = 1, the value-adjusted PS is the unadjusted PS. When a0 and a1 take on different values, the value-adjusted PS has the potential to stray far from the unadjusted PS (depending, of course, on the gap between a0 and a1, and the size of the gap between errors of under- and overprediction).

  We tested many combinations of a1 and a0, subject to a “constraint function”, h(a1, a0). The special case of an unadjusted probability score occurs when a1 = a0 = 1. Thus, the point (1, 1) must fall in the domain of this function. All other (a1, a0) must satisfy the equality h(a1, a0) = h(1, 1).

  For example, if we set h(a1, a0) = a1 + a0, then h(1, 1) = 2. Thus we should test the points (a1, a0) such that a0 + a1 = 2. As Figure A.4 illustrates, other constraint functions we explored include h(a1, a0) = a1 a0 and h(a1, a0) = exp(a0) + exp(a1).

  Recall that overprediction was defined as

  and underprediction was defined as
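
  A reconstruction consistent with the weighting scheme above: overprediction is the average squared error across events that did not occur, (1/N0) Σ(pi – 0)², and underprediction is the average squared error across events that did occur, (1/N1) Σ(pi – 1)².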

  Figure A.4 applies three value-adjustment functions to the data: linear, multiplicative, and natural logarithmic, each subject to constraints on the weights listed in the box. As the functions descend, they place progressively less value on avoiding false alarms (lower values of a0) and progressively greater importance on avoiding misses (higher values of a1). The functions intersect at value neutrality.

  Figure A.4 shows the results of multiplying the original under- and over-prediction by the new a1 and a0. The circled area represents no value adjustment (a1 = a0 = 1).

  The constraints placed on a0 and a1 are obviously somewhat arbitrary, but if we arrive at the same conclusion about group differences in forecasting skill after applying a wide range of plausible (a0, a1) adjustments, we can be reasonably confident that the observed group differences reflect differences in forecasting skill, not policy priorities.
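
  To make this robustness check concrete, here is a minimal sketch in Python (with hypothetical forecasts and outcomes, not the study's data) that sweeps weight pairs along the linear constraint a0 + a1 = 2 and recomputes the value-adjusted score at each point:

  import numpy as np

  def value_adjusted_ps(p, x, a0, a1):
      # Differentially weighted quadratic score: a0 weights squared errors on
      # nonoccurrences (false alarms), a1 weights squared errors on occurrences (misses).
      p, x = np.asarray(p, float), np.asarray(x, float)
      weights = np.where(x == 0, a0, a1)
      return np.mean(weights * (p - x) ** 2)

  p = np.array([0.8, 0.6, 0.9, 0.2, 0.7])   # hypothetical forecasts
  x = np.array([1, 0, 1, 0, 0])             # hypothetical outcomes (1 = occurred)

  for a1 in (0.5, 0.75, 1.0, 1.25, 1.5):    # points on the constraint a0 + a1 = 2
      a0 = 2.0 - a1
      print(f"a1={a1:.2f}  a0={a0:.2f}  adjusted PS={value_adjusted_ps(p, x, a0, a1):.3f}")

  If the ranking of forecaster groups is stable across such sweeps, the conclusion does not hinge on any particular choice of weights.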

  Probability-weighting Adjustments to Probability Scores

  In expected-utility theory, probability estimates enter the utility calculus in a straightforward, linear fashion. In choice theories that specify belief-weighting functions, probability estimates undergo complex transformations into decision weights.3

  Drawing on this latter tradition, we developed a method of adjusting probability scores that takes into account the nonlinear nature of subjective probabilities. For instance, prospect theory posits that the shape of the probability-weighting function is determined by the psychophysics of diminishing sensitivity: marginal impact diminishes with distance from reference points. For monetary outcomes, the status quo serves as the sole reference point that distinguishes losses from gains. The resulting value function is concave for gains and convex for losses. For probability assessments, there are two reference points: impossibility (zero) and certainty (1.0). Diminishing sensitivity here implies an S-shaped weighting function that is concave near zero and convex near 1.0. This means that the impact of a given change in a probability estimate diminishes with its distance from the natural boundaries of zero (impossibility) and 1.0 (certainty). Among other things, the probability-weighting function helps to explain the Allais paradox, in which increasing the probability of winning a prize from .99 to 1.0 has more impact on choice than increasing the probability of winning from .10 to .11.

  We apply the formula directly to the forecast p as follows:
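
  One functional form with the required properties is the following reconstruction; it returns the original probability when γ = 1 and, through the exponent ln 3/ln 2, pulls every intermediate probability toward the trichotomous maximum-uncertainty value of .33 as γ approaches zero:

  w(p, γ) = p^γ / [p^γ + (1 – p)^γ]^(ln 3/ln 2)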

  where 0 < γ ≤ 1. We then enter this adjusted subjective probability forecast into the probability-scoring function which is now defined as
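
  A sketch, entering the weighted forecast into the usual quadratic score:

  PS(γ) = (1/N) Σt [w(pt, γ) – xt]²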

  When γ = 1, the adjusted prediction equals the original prediction (w(p, γ) = p), and so the adjusted probability score equals the original probability score. As γ approaches zero, we treat all forecasts made with some doubt (subjective probabilities in the broadly defined “maybe zone” from .1 to .9) as increasingly equivalent. In other words, the differences among weighted subjective probabilities in the .1 to .9 range shrink as the gamma weights approach zero. The question, of course, is: how extreme should gamma be?

  The most extreme psychological argument asserts: (a) people can reliably discriminate only three levels of subjective probability—impossibility (0), certainty (1), and a clump of intermediate “maybe” values (in which any expression of uncertainty is equivalent to any other); (b) the natural default assignment for the weighted probability is the state of maximum uncertainty in a trichotomous prediction task (.33), hence the rationale for the exponent LN3/LN2. More moderate psychological arguments would allow for the possibility that people can reliably discriminate several levels of subjective probability along the belief continuum from zero to 1.0. From this standpoint, the right value of gamma might well be one that yields a functional form similar to that posited in cumulative prospect theory (say γ between .5 and .7).

  Figure A.5. shows the impact of gamma adjustments of varying extremity on probability scores. Extreme gamma adjustments flatten the midsection of the S-shaped curve by treating all probability judgments in the broadly defined “maybe zone” (from .1 to .9) as increasingly equivalent to each other. The more extreme the gamma adjustment, the less sensitive probability scores become to accuracy in the middle range of the probability scale and the more sensitive they become to accuracy at the end points where outcomes are declared either impossible or inevitable.

  The net effect of these probability-weighting adjustments is threefold: (a) to reward experts who get it “really right” and assign undiluted-by-probabilistic-caveats predictions of zero to things that do not happen and predictions of 1.0 to things that do happen; (b) to punish experts who get it really wrong and assign predictions of zero to things that happen and predictions of 1.0 to things that do not happen; (c) to count movement in the wrong direction at the extremes as a particularly serious mistake (moving from 1.0 to .8 when the outcome happens makes one look more inaccurate than moving from .8 to .6).

  Controversy-adjusted Probability Scores

  Regardless of whether we rely on traditional scores or on adjusted or weighted scoring systems, the computations rest on assumptions about states of the world: either x happened (reality takes on the value of 1.0) or x did not happen (reality takes on the value of 0.0). Notwithstanding our efforts to elicit forecasts that passed the clairvoyance test, controversies sometimes arose as to what happened. Uncertainty shrouded:

  a. Casualty estimates in poorly reported conflicts. How many deaths were traceable to sectarian violence within Nigeria or Liberia or the Congo between 1992 and 1997?

  b. Who was in charge during power transitions or struggles? Was Deng Xiaoping—even in retirement in the early 1990s—still the de facto head of state in China? Was Khatami or Khamenei the real head of state in Iran in the late 1990s?

  c. The classification of powers as nuclear or not (e.g., North Korea in 1998?)

  d. The truthfulness of official government statistics on spending, debt, and macroeconomic performance. Did the Italian government “cook the books” on government finances to squeeze past the Maastricht criteria for currency convergence? How much faith can we put in the official economic growth or government debt or defense spending figures issued by the People’s Republic of China?

  e. The treatment of cross-border conflicts (should we count Russian support for separatists in the Abkhazian region of Georgia and Russian military pursuit of Chechen guerrillas into Georgia?).

  When plausible challenges surfaced (roughly 15 percent of the time), we recomputed probability scores to ensure that the conclusions drawn about expert performance do not hinge on arbitrary judgment calls. These controversy adjustments provided lower- and upper-bound estimates of accuracy, with lower-bound estimates indicating how poorly a group would perform if we worked with classifications of reality that worked to the maximum disadvantage of the group’s forecasts and upper bounds indicating how well the group would perform if the classifications consistently worked to the group’s advantage. It is also worth noting that we explored an alternative method of controversy adjustment which operates in the same fashion as fuzzy-set adjustments in the next section and which requires modifying our coding of nonoccurrences as zero and occurrences as 1.0 so that (a) nonoccurrences shift up in value and occurrences shift down; (b) the size of the shift is in proportion to the frequency and credibility of the controversy challenges.

  Fuzzy-set Adjustments

  The final modification of probability scores has a more radical, even postmodern, character: “fuzzy-set” adjustments that challenge the “binary objectivism” of probability scoring. Advocates of fuzzy-set measurement models argue that it is profoundly misleading to squeeze ambiguous and continuous outcomes into sharply defined, dichotomous categories. They urge us to treat such outcomes as they are: both “perspectival” (formally acknowledge ambiguity by allowing the coding of reality to vary with the vantage point of the observer) and a matter of degree (formally acknowledge the continuous character of such outcomes by allowing the coding of reality to take on values between zero and 1.0 and scoring predictions as true or false to varying degrees).4

  Rising to this measurement challenge is, however, daunting: we need to figure out defensible ways of quantifying the degree to which things almost happened (or, in the case of controversy adjustments, uncertainty over what actually happened). We do not claim to have final answers. But we do offer our provisional solution, which was to take seriously what forecasters told us after the fact about how far off they felt their own predictions were. As noted in chapter 4, forecasters often invoked a wide range of arguments that asserted that, although the unexpected happened, they were not as wrong as they might appear because (a) their most likely outcome almost happened (close-call defense); (b) that outcome might yet happen (off-on-timing defense); (c) that outcome would have happened but for external shocks that no one could reasonably be expected to have foreseen (exogenous-shock defense).

  We initially applied fuzzy-set adjustments simply in proportion to the degree to which forecasters relied on each of these belief system defenses. Let us begin by considering a forecaster who assigns a probability of .9 to a nonoccurrence. The unadjusted probability score is
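
  PS = (.9 – 0)² = .81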

  The forecaster might also advance one or more of the belief system defenses. Fuzzy sets proceed to give some benefit of the doubt to the forecaster. Rather than assigning x = 0, we might estimate the proportion of times groups of forecasters (say, hedgehogs versus foxes) offer belief system defenses and then adjust the classification of reality accordingly (say, if forecasters offer defenses 30 percent of the time when they assigned 0.9 to things that did not happen, the adjustment might shift the reality classification from zero to .30). The probability score would then be given by
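
  PS = (.9 – .30)² = .36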

  Conversely, consider a forecaster who assigns a probability of zero to an event that does occur. The probability score is
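
  PS = (0 – 1)² = 1.0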

  If forecasters argue 30 percent of the time that x “nearly did not occur or will soon disappear,” we let x = .7 and the probability score becomes
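
  PS = (0 – .7)² = .49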

  Thus far, we have implicitly agreed with forecasters whenever they offer a defense, but we can also attach credibility weights to defenses that range from zero (completely reject the defense) to 1.0 (completely accept the defense).

  In adjusting the probability score, we can now reduce the gaps between subjective probabilities and objective reality in proportion to both the credibility weighting attached to defenses and the frequency with which defenses were offered. The general formula is
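
  A reconstruction consistent with the worked examples that follow: the coding of reality for each event E is shifted toward the forecast by adj.(E), the credibility weight attached to the relevant defenses multiplied by the frequency with which those defenses were offered, so that nonoccurrences are recoded from zero to adj.(E) and occurrences from 1.0 to 1 – adj.(E), and

  PSadj = (1/N) ΣE (pE – xE,adj.)²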

  where “adj.” is short for “adjustment” and E short for Event.

  Consider another stylized example. Suppose that a forecaster only assigns the probability value of 0.3 to each of 100 predictions; suppose further that the event occurs on 40 occasions. Of the 60 remaining nonoccurrences, the forecaster offers a belief system defense 20 times (one-third of the time). If we gave zero credibility to the belief system defenses, we would compute the first part of the numerator in the traditional manner:
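
  60 × (.3 – 0)² = 5.4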

  If we gave 75 percent credibility weight to the defenses (accept 75 percent of what experts say as true), these 60 scores would be calculated as
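
  60 × [.3 – (.75)(20/60)]² = 60 × (.3 – .25)² = .15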

  Conversely, notice that the event occurred 40 times, meaning that the forecaster incorrectly assigned a low probability (p < 1) to 40 events that occurred. Of these 40 occurrences, the forecaster offers a defense 10 times. Once again, in adjusting the probability score with the fuzzy-set concepts, we may assign credibility weights ranging from zero to 1.0. If we gave zero credibility weight to these belief system defenses, we would compute the second part of the numerator as
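
  40 × (.3 – 1)² = 19.6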

  If we gave 50 percent credibility weight to the defenses (accept 50 percent of what experts say as true), these 40 scores would be calculated as
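
  40 × [.3 – (1 – (.5)(10/40))]² = 40 × (.3 – .875)² = 13.225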

  The previous example illustrates the procedure for calculating the fuzzy-set-adjusted score when pt = .3. In general, for each of the pt = {0, .1, .2, …, 1} subjective probability categories, we can separate the predictions by those in which the event either occurred or did not occur. Within these subcategories, experts may or may not offer a defense. We then have the option of how large a credibility weight to assign these defenses.

  For a more general statement of the fuzzy-set adjustment procedure, let dk1 be the proportion of forecasts in the kth probability category for which a defense was offered when the event occurred, and let dk0 be the proportion of forecasts in the kth probability category for which a defense was offered when the event did not occur. Let Nk1 be the number of predictions in the kth probability category when the event occurred, and let Nk0 be the number of predictions in the kth probability category when the event did not occur.

  The probability score would then be calculated by
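
  Using this notation, and writing c1 and c0 for the credibility weights attached to defenses on occurrences and nonoccurrences respectively, a reconstruction consistent with the stylized example above is

  PS = (1/N) Σk {Nk1[pk – (1 – c1dk1)]² + Nk0(pk – c0dk0)²},  where N = Σk (Nk1 + Nk0)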

  Thus far, fuzzy-set adjustments have been applied in direct proportion to how often belief system defenses were offered. It is noteworthy, however, that forecasters rarely used close-call arguments to imply that they themselves might have been lucky when they got it right (thus implying they were almost wrong). To address this objection, we apply a self-serving correction factor that reduces credibility weights as a function of the following fraction: “percentage of occasions when experts say something else almost happened when they got it right (assign .8, .9, or 1.0 to events that occur and 0.0, .1, or .2 to events that do not occur)” divided by “percentage of occasions when experts say something else almost happened when they got it wrong (assign 0.0, .1, or .2 to events that occur and .8, .9, or 1.0 to events that do not occur).” The smaller the fraction, the lower the credibility weights we give belief system defenses.

  Summing Up Adjustments of Probability Scores

  Estimates of forecasting skill rest on assumptions about both the external world (base rates of events and classifications of reality) and the forecasters themselves (their values and how they use probability estimates). That is why we strove to test the robustness of conclusions about the superior forecasting skill of foxes across a wide range of probability scoring adjustments. Recurring questions have been: When do hedgehogs “catch up?” And why?

  It was not easy to produce catch-up effects in the current dataset. It is possible, however, to use Monte Carlo simulation methods that treat the current dataset as a special case of the vast range of variation in underlying forecasting-environment and forecaster-response-style parameters, including the base-rate distributions of events (e.g., status quo vs. change for either the better or the worse), the response distributions of forecasters’ judgments across probability scales, and the value priorities that forecasters place on avoiding false alarms or misses.
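
  As an illustration of that logic, here is a minimal Monte Carlo sketch in Python (with hypothetical response styles and base rates, not the simulation code actually used), which draws outcomes from candidate base-rate distributions and compares the expected probability scores of a more extreme and a more moderate forecasting style:

  import numpy as np

  rng = np.random.default_rng(0)

  def expected_ps(forecast, base_rate, n_sim=10_000):
      # Average quadratic score of a fixed three-outcome forecast against outcomes
      # drawn from the assumed base-rate distribution.
      outcomes = rng.choice(3, size=n_sim, p=base_rate)
      truth = np.eye(3)[outcomes]          # one-hot coding of reality
      return np.mean(np.sum((forecast - truth) ** 2, axis=1))

  extreme = np.array([0.9, 0.05, 0.05])    # hypothetical hedgehog-like response style
  moderate = np.array([0.6, 0.2, 0.2])     # hypothetical fox-like response style

  for p_sq in (0.5, 0.7, 0.9):             # sweep the base rate of the status quo
      base = np.array([p_sq, (1 - p_sq) / 2, (1 - p_sq) / 2])
      print(f"P(SQ)={p_sq:.1f}  extreme PS={expected_ps(extreme, base):.3f}  "
            f"moderate PS={expected_ps(moderate, base):.3f}")

  Sweeping such parameters indicates the conditions under which a bolder response style starts to pay off.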

 
