Expert Political Judgment


by Philip E. Tetlock


  EXTENSION TO MULTIPLE OUTCOME CASE

  In the three-outcome case, we can compute variability, calibration, and discrimination indices by averaging across all possible pairwise discrimination tasks: the ability to differentiate the status quo from change for either the better or the worse, the ability to differentiate change for the worse from either the status quo or change for the better, and the ability to differentiate change for the better from either the status quo or change for the worse.
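  As a rough illustration of the ingredients involved, the Python sketch below computes variability (VI), calibration (CI), and discrimination (DI) indices for a single binary collapse of the three-outcome case (for example, "change for the worse" versus everything else); the full three-outcome indices would then be averages over the three such pairwise tasks. The binning of forecasts and the index formulas follow the standard Murphy decomposition, which is an assumption about the exact computation rather than a quotation of it.

from collections import defaultdict

def vi_ci_di(forecasts, outcomes):
    """Variability, calibration, and discrimination indices for one binary task.

    forecasts: subjective probabilities assigned to the target outcome
    outcomes:  1 if the target outcome occurred on that occasion, else 0
    """
    n = len(forecasts)
    base_rate = sum(outcomes) / n

    # Group occasions on which the same probability value was used.
    bins = defaultdict(list)
    for f, x in zip(forecasts, outcomes):
        bins[round(f, 2)].append(x)

    vi = base_rate * (1 - base_rate)
    ci = sum(len(xs) * (f - sum(xs) / len(xs)) ** 2 for f, xs in bins.items()) / n
    di = sum(len(xs) * (sum(xs) / len(xs) - base_rate) ** 2 for f, xs in bins.items()) / n
    return vi, ci, di

# One pairwise task: "change for the worse" versus status quo or change for the better.
probs    = [0.1, 0.1, 0.33, 0.33, 0.7, 0.7, 0.9]
occurred = [0,   0,   0,    1,    1,   0,   1]
print(vi_ci_di(probs, occurred))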

  OPERATIONALIZING THE “MINDLESS” COMPETITION

  We could have operationalized the dart-throwing chimp by selecting probability predictions from random number tables. However, the long-term expected value of this strategy (assuming we counterbalance the order in which the three possible futures were assigned probability values) converges on .33. To compute the probability score, the formula was

  PSj = Σ (ci – xi)^2, with the sum running over the M possible outcomes i = 1, . . . , M,

  where ci is the long-term expected value of just guessing, which is 1/M

  xi takes on the values of 0 or 1 depending on whether the event in question occurred

  M is the number of outcomes for this particular event

  Note that this strategy will underperform the base-rate extrapolation algorithms to the degree outcomes were not equiprobable.
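  As a minimal Python sketch of this scoring rule (the function name and example values are illustrative only), the guessing strategy assigns 1/M to every outcome and is scored by summing the squared gaps between those probabilities and what occurred:

def guessing_probability_score(outcome_index, m=3):
    """Quadratic probability score for the just-guessing strategy:
    the sum over the M outcomes of (c_i - x_i)^2, with every c_i = 1/M."""
    c = 1.0 / m
    return sum((c - (1 if i == outcome_index else 0)) ** 2 for i in range(m))

# With three possible futures, the score is the same no matter which one occurs.
print(guessing_probability_score(outcome_index=0))   # 0.667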

  We operationalized the base-rate prediction strategy by selecting probability predictions that corresponded to the observed frequency with which outcomes occurred for particular variables and populations of comparison cases. Table 6.1 (Chapter 6) presents the percentages used in the contemporaneous base-rate algorithm: the frequency with which the status quo, change for the better, and change for the worse occurred across all outcome variables in the forecasting periods in which we were assessing the accuracy of human forecasters. To compute the probability score for an event with M outcomes, the formula was

  PSj = Σ (bi – xi)^2, with the sum running over the M possible outcomes i = 1, . . . , M,

  where bi is the base-rate probability prediction and

  xi takes on the values of 0 or 1 depending on whether the event in question occurred

  M is the number of outcomes for this particular event

  Base rate estimates depend on how restrictively or expansively, historically or contemporaneously, we define comparison populations. But, as noted in chapter 6, the current results are robust across a wide range of plausible estimates used for computing difficulty-adjusted probability scores.
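  The base-rate strategy plugs observed frequencies into the same quadratic rule in place of 1/M. The sketch below uses made-up frequencies rather than the Table 6.1 values; its only purpose is to show that, when outcomes are not equiprobable, the base-rate forecaster's expected score beats the dart-throwing chimp's.

def probability_score(forecast, outcome_index):
    """Sum over the M outcomes of (forecast_i - x_i)^2."""
    return sum((p - (1 if i == outcome_index else 0)) ** 2
               for i, p in enumerate(forecast))

base_rates = [0.60, 0.25, 0.15]   # status quo, better, worse (illustrative, not Table 6.1)
uniform    = [1/3, 1/3, 1/3]

# Expected scores when outcomes occur with the assumed base-rate frequencies.
expected_base  = sum(f * probability_score(base_rates, i) for i, f in enumerate(base_rates))
expected_chimp = sum(f * probability_score(uniform, i)    for i, f in enumerate(base_rates))
print(round(expected_base, 3), round(expected_chimp, 3))   # 0.555 versus 0.667: lower is better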

  We operationalized the case-specific and time series extrapolation algorithms by basing predictions on the trend lines for specific variables and countries. Cautious case-specific extrapolations assigned probabilities 50 percent greater than guessing (.33) to possible futures that simply extended recent trends. For trichotomous futures, the values would thus be 50 percent (trend continuation), 25 percent, and 25 percent. Aggressive case-specific extrapolations assigned probabilities twice the value of guessing (67 percent, 16.5 percent, 16.5 percent). Hyperaggressive extrapolations placed 100 percent confidence in the proposition that the recent past of country x would also be its near-term future.
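  Under the same quadratic rule, the three extrapolation variants differ only in how much probability they load onto the trend-continuation future, and therefore in how badly they are punished when the trend breaks. A brief sketch:

def probability_score(forecast, outcome_index):
    return sum((p - (1 if i == outcome_index else 0)) ** 2
               for i, p in enumerate(forecast))

# Index 0 = the recent trend continues; indexes 1 and 2 = the other two futures.
cautious        = [0.50, 0.25, 0.25]
aggressive      = [0.67, 0.165, 0.165]
hyperaggressive = [1.00, 0.00, 0.00]

for name, forecast in [("cautious", cautious), ("aggressive", aggressive),
                       ("hyperaggressive", hyperaggressive)]:
    trend_holds  = probability_score(forecast, 0)
    trend_breaks = probability_score(forecast, 1)
    print(name, round(trend_holds, 3), round(trend_breaks, 3))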

  OPERATIONALIZING THE SOPHISTICATED COMPETITION

  If experts could not beat informal predict-the-past algorithms, one might wonder what motive, aside from Schadenfreude, would prompt us to bring on even more formidable competition from formal statistical algorithms.

  It turns out, however, that, although experts do indeed lose by greater margins to formal statistical models (generalized autoregressive distributed lag models), they do not lose by massively greater margins. This is so because the formal models, although they outperform the informal ones, do not do so by massive margins. This result suggests that the true stochastic process governing many of the variables being forecast (call them yt) is well approximated by autoregressive processes of order one. In this situation, forecasters will do well by adopting simple rules such as “always predict rho * yt–1 + (1 – rho) * m,” where rho is some constant less than or equal to 1 which indicates the variable’s “persistence” and m is the unconditional mean to which the variable reverts over time (e.g., when rho = 1, the variable follows a random walk).
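  The "always predict rho * yt–1 + (1 – rho) * m" rule is easy to express in code; the parameter values below are illustrative only:

def ar1_forecast(y_prev, rho, m):
    """One-step-ahead forecast for a mean-reverting first-order autoregressive process."""
    return rho * y_prev + (1 - rho) * m

# Illustrative values: persistence of 0.8 and a long-run mean of 2.0.
print(ar1_forecast(y_prev=5.0, rho=0.8, m=2.0))   # 4.4: partial reversion toward the mean
print(ar1_forecast(y_prev=5.0, rho=1.0, m=2.0))   # 5.0: a random walk, predict the recent past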

  There are other good reasons for determining, at least roughly, the predictability of the outcome variables that confronted forecasters (“roughly” because there is no guarantee that any statistical model incorporates all useful information available ex ante). To obtain crude approximations of predictability, we relied on generalized autoregressive models that lagged each outcome variable by two time periods on itself (first- and second-order autocorrelations) as well as by one time period on the three most highly correlated variables in our dataset (variables that should not have predictive power in pure AR1 processes but do occasionally have predictive power here).
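  A rough sketch of how such a predictability ceiling can be estimated: regress each outcome variable on its own first two lags plus one lag of three other series and read off the squared multiple correlation. The simulated data and the plain least-squares fit below are stand-ins for the actual generalized autoregressive models, not a reproduction of them.

import numpy as np

rng = np.random.default_rng(0)
T = 60
z = rng.normal(size=(T, 3))              # three correlated predictor series (simulated)
y = np.zeros(T)                          # the outcome variable (simulated)
for t in range(2, T):
    y[t] = 0.6 * y[t - 1] + 0.1 * y[t - 2] + 0.3 * z[t - 1].sum() + rng.normal()

# Design matrix: intercept, y lagged one and two periods, each z lagged one period.
X = np.column_stack([np.ones(T - 2), y[1:-1], y[:-2],
                     z[1:-1, 0], z[1:-1, 1], z[1:-1, 2]])
target = y[2:]

beta, *_ = np.linalg.lstsq(X, target, rcond=None)
fitted = X @ beta
r_squared = 1 - np.sum((target - fitted) ** 2) / np.sum((target - target.mean()) ** 2)
print(round(r_squared, 2))               # a crude approximation of the variable's predictability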

  The squared multiple correlations in these “optimal” forecasting equations ranged from .21 (long-range predictions of inflation) to .78 (short-term predictions of government spending priorities). These equations define plausible maximum performance ceilings for each outcome variable. Indeed, we can compare human forecasters directly to these equations if we treat the statistically predicted values of the outcome variables as if they were subjective probability forecasts. For instance, one translation rule is to stipulate that, whenever the statistically predicted value falls within one of the three-possible-future ranges of values, and the 95 percent confidence band around the predicted value does not cross the borders between possible futures, assign a value of 1.0—otherwise assign .75. When we implement rules of this sort, across outcome variables, we discover that the discrimination scores for the equations handily surpass those of all groupings of human forecasters (equation ranges between .05 and .10 versus human range between .01 and .04) and the calibration scores rival those of the best human forecasters (average CI for equations is .011). Such results demonstrate there was considerable predictability that experts, regardless of cognitive style, failed to capture (a point underscored by figures 2.5 and 3.2).
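  The translation rule can be sketched as follows. The cut points separating the three possible futures, and the even split of the leftover .25 between the two futures that do not contain the predicted value, are illustrative assumptions; the text specifies only the 1.0-versus-.75 rule for the future that does contain it.

def equation_as_forecast(predicted, half_width_95, lower_cut, upper_cut):
    """Translate a statistically predicted value (plus its 95 percent band) into
    pseudo-subjective probabilities for three futures: below lower_cut = change
    for the worse, between the cuts = status quo, above upper_cut = change for
    the better. Returns (worse, status quo, better)."""
    lo, hi = predicted - half_width_95, predicted + half_width_95

    if predicted < lower_cut:
        region, band_inside = 0, hi < lower_cut
    elif predicted > upper_cut:
        region, band_inside = 2, lo > upper_cut
    else:
        region, band_inside = 1, (lower_cut < lo and hi < upper_cut)

    confidence = 1.0 if band_inside else 0.75
    probs = [(1.0 - confidence) / 2] * 3
    probs[region] = confidence
    return tuple(probs)

print(equation_as_forecast(predicted=2.4, half_width_95=0.3, lower_cut=-1.0, upper_cut=2.0))
print(equation_as_forecast(predicted=2.1, half_width_95=0.3, lower_cut=-1.0, upper_cut=2.0))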

  These equations also allow us to gauge how the relative size of the group gaps in forecasting performance varied as a function of the relative predictability of outcome variables (we discover, for example, that the fox advantage over hedgehogs was relatively evenly spread out across outcome variables with small to large squared multiple correlations). These equations also remind us that, although each expert made many short-term and long-term predictions across three policy domains, each predictive failure or success should not be viewed as an independent statistical estimate of predictive ability—hence the importance of making conservative assumptions about degrees of freedom in significance testing (treating forecasters rather than forecasts as the principal observational unit).

  Adjustments of Probability Scores to Address Conceptual Objections

  One can raise at least five categories of objections to probability scoring as a valid measure of forecasting skill:

  1. Probability scores are sensitive to irrelevant factors, such as variation in the predictability of political environments (hence, the need for difficulty adjustments).

  2. Probability scores are insensitive to relevant factors, such as variation in the political values that forecasters try to achieve (hence the need for value adjustments that reflect the shifting weights experts place on avoiding errors of underprediction and overprediction).

  3. The reality checks used to index accuracy are flawed and require controversy adjustments that reflect disagreement over what actually happened.

  4. The reality checks used to index accuracy are flawed and require fuzzy-set adjustments that reflect disagreement over what almost happened or might yet happen.

  5. Probability scores are based on the assumption that subjective probabilities have simple linear properties, but there is considerable evidence that subjective probabilities have nonlinear properties and that people treat errors in the middle range of the scale as less consequential than errors at the extremes.

  Difficulty-adjusted Probability Scores

  Probability scores can be inflated by two logically separable sources of variation: the shortcomings of forecasters and the unpredictability of the world. Difficulty adjustments “control” for the second source of variation, thereby giving us a “cleaner” measure of forecasting skill. The rationale is simple. A weather forecaster in Arizona (where it rains 5 percent of the time) has an easier job than one in Oregon (where it rains 50 percent of the time). Winkler’s method of difficulty-adjusting probability scores takes into account such variation in predictability, and tells us whether performance differentials between groups persist after leveling the playing field.2 We slightly modify Winkler’s original difficulty-adjusted formula so that

  S*(p) = [S(b) – S(p)] / T(b)

  where S*(p) is the skill score that is applied to the expert’s forecast, p, S(p) is the expert’s probability score based on the quadratic scoring rule, and S(b) is the expert’s probability score if she always used the base rate for her forecasts. T(b) is the denominator, and its value depends on whether the forecast is higher or lower than the base rate (taking on the value of (1 – b)^2 when a prediction is greater than or equal to the base rate, and b^2 when the prediction is lower than the base rate). A score of zero implies that the forecaster tells us nothing about future states of the world that we could not glean from crude base-rate summaries of past states of the world. For example, a weather forecaster in Phoenix might have a fantastic probability score (close to zero, say .0025) but will get a skill score of zero if the base predictor (say, always assign a .95 to sunshine) has the same fantastic probability score: S*(p) = (.0025 – .0025)/T(b) = 0. (Note that now higher skill scores are better.) A weather forecaster in Portland, however, with the same probability score, .0025, will have an impressive skill score because the base-rate predictor (say, always assign a .5 to “sunshine”) has such an anemic probability score (.25) and T(b) must take on a relatively small value because the forecaster, in order to achieve his probability score in the high-variance Portland environment, had to make a lot of predictions that both strayed from the base rate and strayed in the correct direction.

  Difficulty adjustments take into account that when the base rates are extreme, anyone can look farsighted by predicting that rare outcomes will not occur and common ones will. To give experts the incentive to outperform predictively potent base rates, we need to reduce the penalties for going out on a limb and overpredicting low-base-rate events that fail to materialize, and to increase the penalties for mindlessly relying on the base rate and underpredicting low-base-rate events that do materialize. Difficulty-adjusted scoring does that. Figure A.3 shows that the difficulty-adjusted scoring curves for events and nonevents always intersect at the probability value on the x axis that corresponds to the base rate and at the score value on the y axis that corresponds to zero. In effect, the policy is “If you cannot tell us anything more than we would know from looking at the base rate, you deserve a score of zero.” The sharp inflection in the scoring curves at the intersection point in skewed base-rate environments serves two purposes: it punishes experts who miss uncommon outcomes because they always predicted the base rate (hence the sharp downward slope into negative territory for experts who assign probabilities close to zero for rare events that occur as well as for experts who assign probabilities close to 1.0 for common events that fail to occur), and it rewards experts for having the courage to assign greater-than-base-rate probabilities to uncommon events that occur and lower-than-base-rate probabilities to common events that fail to occur (hence these curves rise into positive territory much earlier than the curves of experts who endorsed the base-rate prediction and were correct).

  When skill scoring is applied to a dichotomous event, the computational formulas for S*(p) take the following specific forms: when the event occurs, S*(p) = [(1 – b)^2 – (1 – p)^2] / T(b); when it does not occur, S*(p) = [b^2 – p^2] / T(b); and in both cases T(b) equals (1 – b)^2 when the forecast p is greater than or equal to the base rate b, and b^2 when it is lower.
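  A short sketch of the skill score for a dichotomous event, using only the definitions above; the Phoenix and Portland forecasters from the text come out at 0 and roughly .99 respectively.

def quadratic_score(p, occurred):
    """Quadratic probability score for one forecast: (1 - p)^2 if the event occurred, p^2 if not."""
    return (1 - p) ** 2 if occurred else p ** 2

def skill_score(p, base_rate, occurred):
    """Difficulty-adjusted score S*(p) = [S(b) - S(p)] / T(b); higher is better,
    and always forecasting the base rate scores exactly zero."""
    t_b = (1 - base_rate) ** 2 if p >= base_rate else base_rate ** 2
    return (quadratic_score(base_rate, occurred) - quadratic_score(p, occurred)) / t_b

# Phoenix: the base rate of sunshine is .95, so echoing it earns no skill.
print(skill_score(p=0.95, base_rate=0.95, occurred=True))   # 0.0
# Portland: with a .5 base rate, the same confident, correct forecast earns substantial skill.
print(skill_score(p=0.95, base_rate=0.50, occurred=True))   # 0.99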

  Figure A.3. How the subjective probability forecasts translate into difficulty-adjusted forecasting scores when the target event either does or does not occur and the base rate for the target event is low (0.1), moderate (0.5), or high (0.9). Higher scores represent better performance, and scores fall off steeply when forecasters underpredict rare events that do occur (left panel) and overpredict common events that fail to occur.

  Figure A.3 also shows how subjective probability forecasts translate into difficulty-adjusted forecasting skill across the full range of possible values. The easiest environments are those with extreme base rates in the first and third panels. The specified outcome occurs either 10 percent or 90 percent of the time. The hardest environment is the middle panel. Here, the base rate offers no more guidance than one could glean from chance (in the two-possible-futures case, a base rate of 0.5).

  Difficulty-adjustment formulas level the playing field for comparisons of the forecasting performance of experts who work in easier- versus harder-to-predict environments. There are, however, good reasons for caution in interpreting such scores. First, the scoring procedure is open to a general challenge. By aggressively encouraging experts to search for regularities beyond the safe predict-the-base-rate strategy, the formula reduces the scoring penalties imposed on suboptimal prediction strategies (such as probability matching discussed in chapter 2) that look for patterns in randomness. Critics can argue, in effect, that difficulty adjustments encourage the sorts of specious reasoning that led the Yale undergraduates to do a worse job than Norwegian rats in guessing which side of the T-maze would contain the yummy pellets.

  Second, the difficulty-adjusted scoring is open to the objection that, in contrast to fields such as radiology, where there is consensus on the right reference populations for estimating the base rates of malignancy of tumors, there is often sharp disagreement among political observers over what, if any, statistical generalizations apply. This is so, in part, because of the heterogeneity of the polities, cultures, and economies of the entities being judged and, in part, because of the dearth of well-established laws for guiding judgment.

  We hedged our bets. We reported a range of difficulty adjustments that reflected experts’ judgments of which base rates were most relevant to particular nations and periods. One can raise the estimated base rate of an outcome (such as nuclear proliferation between 1988 and 1998) by limiting the reference population to only high-risk states such as Iran, North Korea, and Pakistan, in which case proliferation definitely occurred 33 percent of the time, or one could lower the base rate by expanding the population to encompass lower-risk states such as Libya, Egypt, Brazil, and so on, in which case proliferation occurred less than 10 percent of the time. Chapter 6 assessed the stability of difficulty-adjusted scores by varying values of the base-rate parameter.

  Value-adjusted Probability Scores

  Value adjustments give forecasters varying benefits of the doubt when they under- or overpredict status quo, change-for-better, and change-for-worse outcomes. Overprediction (or a “false alarm”) occurs whenever a forecaster gives a probability greater than zero to a nonoccurrence. Underprediction (a “miss”) occurs whenever a forecaster gives a probability less than 1.0 to an occurrence. Overprediction was defined mathematically as

  where pj is the forecast on the jth occasion of events that did not occur

  N0 is the number of events that did not occur

  Similarly, underprediction was defined as

  where pj is the forecast on the jth occasion of events that occurred

  N1 is the number of events that occurred
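  A sketch of the two indices, on the assumption that each is the mean squared gap between the forecasts and the relevant outcome value (0 for nonoccurrences, 1 for occurrences), so that the overall probability score is their frequency-weighted average; the exact functional form is not reproduced above.

def over_under_prediction(forecasts, outcomes):
    """Split squared forecasting error into an overprediction component (computed
    on nonoccurrences) and an underprediction component (computed on occurrences)."""
    non = [p for p, x in zip(forecasts, outcomes) if x == 0]
    occ = [p for p, x in zip(forecasts, outcomes) if x == 1]
    overprediction  = sum(p ** 2 for p in non) / len(non)         # gaps from 0
    underprediction = sum((1 - p) ** 2 for p in occ) / len(occ)   # gaps from 1
    return overprediction, underprediction

probs    = [0.7, 0.6, 0.2, 0.9, 0.4]
occurred = [0,   1,   0,   1,   1]
print(over_under_prediction(probs, occurred))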

  We explored two methods of value-adjusting probability scores:

  1. The k method, which searched for a single value of k designed to minimize gaps between subjective probabilities and objective reality; and

  2. The “differential-weighting” method, which explored the impact of a wide range of adjustments on errors of overprediction (a0) and underprediction (a1), within certain mathematical constraints on a0 and a1.

  The former method gives forecasters the most unconditional benefit of the doubt: whatever mistakes experts are most prone to make, it assumes that those mistakes are purposeful, and introduces a correction factor k that reduces the gaps between specific subjective probabilities and objective frequencies by the average magnitude of the gap. The latter method requires the researcher to specify the direction and degree of the bias to be corrected (drawing, for instance, on experts’ expressed error-avoidance priorities), without looking at the mistakes experts actually made.
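  A rough sketch of the differential-weighting idea: scale the penalties attached to the two error types by a0 and a1 before recombining them. The mathematical constraints actually imposed on a0 and a1 are not reproduced here; the sketch simply keeps both weights nonnegative and recovers the unweighted score when a0 = a1 = 1.

def weighted_probability_score(forecasts, outcomes, a0=1.0, a1=1.0):
    """Mean probability score with overprediction errors (on nonoccurrences)
    weighted by a0 and underprediction errors (on occurrences) weighted by a1."""
    total = sum(a1 * (1 - p) ** 2 if x == 1 else a0 * p ** 2
                for p, x in zip(forecasts, outcomes))
    return total / len(forecasts)

probs    = [0.7, 0.6, 0.2, 0.9, 0.4]
occurred = [0,   1,   0,   1,   1]
print(weighted_probability_score(probs, occurred))                  # unweighted score
print(weighted_probability_score(probs, occurred, a0=0.5, a1=1.5))  # forgive false alarms, punish misses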

  The k Method

  This method of value adjusting takes the following form:

  PSj = Σ ((pi – ki) – xi)^2, with the sum running over the M possible outcomes i = 1, . . . , M,

  where PSj is the probability score for the jth occasion when one of M outcomes must occur

  pi is the forecast (probability estimate) of the ith outcome

  ki is the value adjustment for the ith outcome

  xi is either 0 or 1 depending on whether the ith outcome occurred

  M is the number of possible outcomes for a given occasion

  The restriction is such that the adjustments sum to zero across the M possible outcomes:

  k1 + k2 + . . . + kM = 0

  DERIVING ki

  Consider an occasion j with M possible outcomes. The unadjusted probability score is

  PSj = Σ (pij – xij)^2, with the sum running over the M possible outcomes i = 1, . . . , M

  Summing up the PSj over N occasions, we find that

  Σj PSj = Σj Σi (pij – xij)^2, with j running over the N occasions and i over the M outcomes,

  where pij is the forecast (probability estimate) of the ith outcome for the jth occasion

  xij is 1 or 0 depending on whether the ith outcome occurred on the jth occasion

  The value-adjusted probability score would be

  Σj Σi ((pij – ki) – xij)^2

  Now we can find the value of ki that minimizes the probability score. Setting the derivative with respect to kl equal to zero, –2 Σj (plj – kl – xlj) = 0, gives

  kl = (1/N) Σj plj – (1/N) Σj xlj

  Thus, the best value adjustment, kl, for the lth outcome is equal to the average forecast for the lth outcome minus the base rate for the lth outcome. The k parameter adjusts experts’ probability estimates by whatever amount they on average differed from the mean observed outcome. Note that applying k in this fashion can push some individual adjusted forecasts, pi – ki, above 1.0 or below 0 and, for this reason, k-adjusted probability scores should be interpreted only at the aggregate level.
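  Putting the k method together as a sketch: estimate each ki as the mean forecast for that outcome minus its base rate, subtract it from the forecasts, and rescore. Function names and example data are illustrative.

def k_adjustments(forecasts, outcomes):
    """k_i = mean forecast for outcome i minus the observed base rate of outcome i.

    forecasts: one length-M probability vector per occasion
    outcomes:  one length-M 0/1 vector per occasion
    """
    n, m = len(forecasts), len(forecasts[0])
    return [sum(f[i] for f in forecasts) / n - sum(x[i] for x in outcomes) / n
            for i in range(m)]

def k_adjusted_score(forecasts, outcomes, k):
    """Mean over occasions of the sum over outcomes of ((p_i - k_i) - x_i)^2.
    Adjusted forecasts can fall outside [0, 1], so interpret only in the aggregate."""
    return sum(sum((f[i] - k[i] - x[i]) ** 2 for i in range(len(k)))
               for f, x in zip(forecasts, outcomes)) / len(forecasts)

# Three-outcome example: this forecaster systematically overpredicts change for the worse.
forecasts = [[0.2, 0.3, 0.5], [0.3, 0.2, 0.5], [0.4, 0.2, 0.4]]
outcomes  = [[1, 0, 0],       [1, 0, 0],       [0, 1, 0]]

k = k_adjustments(forecasts, outcomes)
print([round(v, 3) for v in k])                        # positive k where the outcome was overpredicted
print(round(k_adjusted_score(forecasts, outcomes, k), 3))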

 
