Expert Political Judgment


by Philip E. Tetlock


  CHAPTER 4

  Honoring Reputational Bets

  FOXES ARE BETTER BAYESIANS THAN HEDGEHOGS

  When the facts change, I change my mind. What do you do, sir?

  —JOHN MAYNARD KEYNES

  CHAPTERS 2 AND 3 measured expert performance against correspondence benchmarks of good judgment. The test was “getting it right”: affixing realistic probabilities to possible futures. The spotlight shifts in chapters 4 and 5 to coherence and process benchmarks of good judgment. The focus is on “thinking the right way”: judging judgment on the logical soundness of how we go about drawing inferences, updating beliefs, and evaluating evidence. These alternative conceptions of good judgment are more complementary than competitive. It would be odd if people who think the right way failed to get more things right in the long run. Indeed, if they did not, should we not—before questioning common sense—question our measures?

  Chapter 4 relies on logical-coherence and process tests of good judgment derived from Bayesian probability theory. The coherence tests are static. They require single carefully aimed snapshots to capture the extent to which belief systems hang together in logically consistent ways. The process tests—which play the more central role here—are dynamic. They require at least two snapshots of forecasters’ belief systems, one before and one after they learn what happened. Good judges should be good hypothesis testers: they should update their beliefs in response to new evidence and do so in proportion to the extremity of the odds they placed on possible outcomes before they learned which one occurred. And good judges should not be revisionist historians: they should remember what they once thought and resist the temptation of the hindsight or “I knew it all along” bias.

  We shall discover that (a) even accomplished professionals often fail these tests of good judgment; (b) the same people who fared poorly against the correspondence tests in chapter 3 fare poorly against the coherence and process tests in chapter 4; (c) similar psychological processes underlie performance deficits on both correspondence and coherence tests of good judgment.

  A LOGICAL-COHERENCE TEST

  If I labored under any illusion that people are natural Bayesians, that illusion was dispelled well before I could check whether people are good belief updaters. The Bayesian framework rests on logical identities, and we can tell whether those identities are satisfied from single snapshots of belief systems at isolated slices of time. Imagine an observer who distinguishes only two possible interpretations of the world: his own and a rival’s. We can deduce the likelihood that observer should attach to an outcome (X1) from knowledge of (a) how confident the observer is in his own versus his rival’s reading of reality; (b) how likely the observer believes X1 to be if his versus his rival’s reading of reality is correct:

  P(X1) = P(X1 | Observer’s hypothesis)P(Observer’s hypothesis) + P(X1 | Rival hypothesis)P(Rival hypothesis)
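
  As a quick numerical illustration of this identity (the numbers are hypothetical, not drawn from the study): an observer who is 70 percent confident in his own reading of reality, and who thinks an outcome 80 percent likely on his view but only 30 percent likely on the rival view, should rate the outcome about 65 percent likely, as the sketch below computes.

```python
# Hypothetical illustration of the identity above; none of these numbers
# come from the forecasting exercises.
p_own = 0.70                  # P(Observer's hypothesis)
p_rival = 1.0 - p_own         # P(Rival hypothesis), assuming the two views exhaust the options
p_x1_given_own = 0.80         # P(X1 | Observer's hypothesis)
p_x1_given_rival = 0.30       # P(X1 | Rival hypothesis)

# The coherent unconditional estimate is the confidence-weighted average:
p_x1 = p_x1_given_own * p_own + p_x1_given_rival * p_rival
print(round(p_x1, 2))         # 0.65, noticeably lower than the 0.80 conditional estimate
```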

  When we asked experts to make predictions in eleven regional forecasting exercises by filling in values for each variable on the left and right sides of this equation (see Methodological Appendix), it rarely dawned on anyone to base their likelihood-of-x estimates on anything beyond the conditional likelihood of x predicated on their own view of the world. Their answers to these two questions were almost interchangeable (r = .83). It was as though experts were 100 percent confident that they were right and everyone else wrong, so that the probability of x given their construal of the forces at work had to be the same thing as the probability of x.

  A charitable view chalks this “mistake” up to linguistic confusion. People understandably think that, when we ask them about the likelihood of an event, we want their point of view, not someone else’s. But the results hold up even when we press the issue and, just prior to asking about the likelihood of an outcome, we solicit separate judgments of the likelihood of the rival perspective being true and of the likelihood of the outcome if the rival perspective were true. In estimating the likelihood of x, experts do not compute weighted averages of the likelihood of x conditional on various interpretations of the world being correct, with the weights proportional to experts’ confidence in each interpretation. They consult a gut-level intuition anchored around one point of view, their own, which they treat as an existential certainty.1

  There would be nothing logically wrong with considering only one’s own view of the world if one were totally confident one was right. The second term on the right-hand side of the identity would fall to zero. But most participants—including hedgehogs—were not that sure of themselves. When we asked forecasters about the likelihood that other points of view might be correct, they assigned values substantially greater than zero (average .27). There would also be nothing wrong with considering only one’s own point of view if one believed that other perspectives made precisely the same predictions. But when we asked forecasters about the likelihood of their “most likely futures” conditional on other views being correct, they assigned values substantially lower than those they assigned conditional on their own view being correct (average gap of .46). The net result—as shown in figure 9.6 in the Technical Appendix—was an “egocentricity gap”: the probability that experts assigned their most likely futures was consistently higher than the value they should have assigned those futures if they were good Bayesians who took other points of view into account.
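
  A rough sketch of how that gap arises, built from the average values just quoted (.27 confidence in rival views and a .46 conditional-likelihood gap); the 0.75 conditional estimate and the assumption that confidence in the two views sums to one are illustrative simplifications, not figures from the study.

```python
# Rough sketch of the egocentricity gap, built around the averages quoted above.
# The 0.75 conditional estimate and the assumption that confidence in the two
# views sums to 1.0 are simplifying assumptions for illustration only.
p_rival = 0.27                            # average confidence that a rival view is correct
p_own = 1.0 - p_rival                     # simplifying assumption
p_x_given_own = 0.75                      # illustrative estimate of one's "most likely future"
p_x_given_rival = p_x_given_own - 0.46    # average conditional-likelihood gap of .46

bayesian_p_x = p_x_given_own * p_own + p_x_given_rival * p_rival
egocentricity_gap = p_x_given_own - bayesian_p_x
print(round(bayesian_p_x, 2), round(egocentricity_gap, 2))   # ~0.63 and a gap of ~0.12
```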

  This slighting of alternative perspectives is no harmless technicality. If forecasters had been better Bayesians, their forecasts would have been better calibrated. They would have assigned more realistic probability estimates to their most likely futures, shrinking probability-reality gaps by up to 26 percent. Both foxes and hedgehogs would have benefited, with estimated reductions up to 18 percent for foxes and 32 percent for hedgehogs (for details, see Technical Appendix).

  The pattern of results is an early warning that experts are not natural Bayesians who routinely treat experience as an opportunity for adjusting the odds ratios of competing hypotheses. A more plausible model is that we are naturally egocentric. In sizing up situations, we have difficulty taking other points of view seriously. Few of us spontaneously factor other views into our assessments—even points of view that, on second thought, we acknowledge have a nonnegligible likelihood of being right.2

  A DYNAMIC-PROCESS TEST: BAYESIAN UPDATING

  Giving short shrift to other points of view proves a recurring theme when we turn to process tests that probe forecasters’ willingness to change their minds in response to new evidence. Bayes’s theorem again sets the gold standard. Once we learn what happened in a forecasting exercise, the theorem tells us how much confidence we should retain in the hypotheses that underlie accurate and inaccurate forecasts. That final confidence ratio (the posterior odds ratio) should be a function of the confidence we initially had in the clashing hypotheses about the drivers of events (the prior odds ratio) multiplied by the beliefs we once held about the likelihood of the observed outcome assuming the correctness of either our own or other points of view (the likelihood ratio):

  P(Your hypothesis | x) / P(Rival hypothesis | x) = [P(Your hypothesis) / P(Rival hypothesis)] × [P(x | Your hypothesis) / P(x | Rival hypothesis)]
  Applying this framework was straightforward in early laboratory studies of belief updating. Researchers would tell participants that there was an unknown proportion of red and blue poker chips in a bag (thus, prior odds were 50/50) and then randomly sample ten chips from the bag, x of which turned out to be red and y blue. Researchers would then compare how much people changed their minds about the color ratio with how much Bayes’s theorem says they should have changed their minds (posterior odds).3
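
  A minimal sketch of that laboratory benchmark, under assumed bag compositions of 70 percent versus 30 percent red (the text above specifies only that the prior odds were 50/50, so the compositions are an assumption for illustration):

```python
# Minimal sketch of the poker-chip updating task under assumed 70/30 vs 30/70
# bag compositions; the text above specifies only that prior odds were 50/50.
def posterior_odds(prior_odds, p_red_h1, p_red_h2, n_red, n_blue):
    """Posterior odds of bag 1 over bag 2 after observing the sampled chips."""
    likelihood_ratio = ((p_red_h1 / p_red_h2) ** n_red *
                        ((1 - p_red_h1) / (1 - p_red_h2)) ** n_blue)
    return prior_odds * likelihood_ratio

# Ten draws, six red and four blue, from a bag that is either 70% or 30% red:
odds = posterior_odds(prior_odds=1.0, p_red_h1=0.7, p_red_h2=0.3, n_red=6, n_blue=4)
print(round(odds, 2))               # ~5.44 to 1 in favor of the mostly-red bag
print(round(odds / (1 + odds), 2))  # prescribed posterior probability ~0.84
```

  In classic studies of this kind, participants typically revised their probabilities by considerably less than the prescribed amount, the “cognitive conservatism” that reappears later in this chapter.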

  We lose this precision when we gauge experts’ reactions to unfolding real-world events, such as market meltdowns and mass murder. The dividing line between the rational and irrational, between the defensible and indefensible, becomes blurrier to the degree there is room for our prior beliefs to bias our assessments of whether we were right or wrong: there is little such room for judging the color of poker chips but a lot when it comes to judging movement toward political or economic freedom. That said, studies of judgments of real-world events are far from irrelevant. Such studies still speak volumes about how difficult it is for experts to concede that they were wrong. We can discover a great deal by examining the regional forecasting studies in which we reduced forecasters’ wiggle room by eliciting ex ante commitments on the probative value of possible outcomes. Forecasters’ answers to the following questions gave us the inputs we needed for computing—to a crude first order of approximation—whether they were good belief updaters:

  1. How likely do you estimate each possible future to be if your understanding of the underlying forces at work is correct? In belief-updating equations, we designated these variables as p(x1 | your hypothesis), p(x2 | your hypothesis) … where x1, x2 … refer to the set of possible futures and “your hypothesis” refers to your view of the underlying forces at work.

  2. How much confidence do you have in your understanding of the underlying forces at work? We designated this variable p(your hypothesis).

  3. Think of the most influential alternative to your perspective on the underlying forces. How likely is that perspective to be correct? We designated this variable as p(rival hypothesis).

  4. How likely do you think each possible future is if this alternative perspective is correct? We designated these variables p(x1 | rival hypothesis), p(x2 | rival hypothesis)…. We used this format in seven forecasting domains, including the Soviet Union (1988), South Africa (1988), the Persian Gulf War of 1991, Canada (1992), Kazakhstan (1992), the U.S. presidential election of 1992, and the European Monetary Union (1992), as well as a different format in four other domains, including the European Monetary Union (1998), China (1992), Japan (1992), and India (1992).4

  These exercises are “reputational bets”: they ask experts to specify, as exactly as would an odds-setting bookie, predictions predicated on competing views of reality. After we have learned what happened, we can compute the posterior odds: the “correct” Bayesian answer to the question of how much experts should have increased or decreased their confidence in their prior worldviews. We can also recontact experts and pop the big question: Given your earlier reputational bet and the course of subsequent events, how much do you wish to change your confidence in your understanding of the forces at work? We can then see how well experts as a whole, or subgroups such as hedgehogs and foxes, stack up against Bayesian benchmarks of good judgment.
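
  A sketch of how those elicited quantities can be combined into the Bayesian benchmark once we know which future materialized; the specific numbers, and the normalization of the two confidence judgments so that they sum to one, are assumptions for illustration:

```python
# Sketch: converting a reputational bet into a prescribed posterior after the
# fact. All numbers are hypothetical, and normalizing the two confidence
# judgments so they sum to 1.0 is a simplifying assumption for illustration.
def prescribed_posterior(p_hyp, p_rival, p_x_given_hyp, p_x_given_rival):
    """Prescribed confidence in one's own hypothesis after outcome x occurs."""
    prior_odds = p_hyp / p_rival
    likelihood_ratio = p_x_given_hyp / p_x_given_rival
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1.0 + posterior_odds)

# A losing bet: 70% prior confidence in one's own view, but the outcome that
# occurred was rated only 20% likely on that view versus 60% on the rival view.
prescribed = prescribed_posterior(p_hyp=0.70, p_rival=0.30,
                                  p_x_given_hyp=0.20, p_x_given_rival=0.60)
print(round(prescribed, 2))   # ~0.44: confidence should fall from 0.70 to roughly 0.44
```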

  Reactions to Winning and Losing Reputational Bets

  There are good reasons for expecting smart people to be bad Bayesians. Decades of laboratory research on “cognitive conservatism” warn us that even highly educated people tend to be balky belief updaters who admit mistakes grudgingly and defend their prior positions tenaciously.5 And decades of research on cognitive styles warn us that this problem will be more pronounced among thinkers who fit the hedgehog rather than the fox profile.6 Chapter 3 showed that hedgehogs are attracted to grand schemes that promise to explain a lot but that do not translate into a lot of forecasting successes. Hedgehogs should thus be more disappointed by disconfirmation, and delighted by confirmation, than foxes. Assuming no more than a common desire to maintain a positive-mood equilibrium, it follows that hedgehogs should try harder to neutralize disconfirming data by subjecting it to withering scrutiny and to savor confirming data by taking it at face value.

  Figure 4.1 shows that hedgehogs made bolder reputational bets and put more professional esteem on the line by emphatically differentiating their conditional expectations from those of rival perspectives. But their timing was bad: they made more of these dramatic predictions when they were dramatically wrong (their most likely futures failed to materialize) than when they were dramatically right (their most likely futures materialized). Thus, when hedgehogs lost, the Bayesian-prescribed confidence “hit” they suffered was larger than that for foxes; but when they won, the prescribed boost they enjoyed was only roughly equal to that for foxes.7

  Figure 4.1. The relative willingness of hedgehogs, hybrids (hedge-foxes and fox-hogs), and foxes to change their minds in response to relatively expected or unexpected events, and the actual amounts of belief adjustment compared to the Bayesian-prescribed amounts of belief adjustment.

  But betting is one thing, paying up another. Focusing just on reactions to losing reputational bets, figure 4.1 shows that neither hedgehogs nor foxes changed their minds as much as Reverend Bayes says they should have. But foxes moved further in the Bayesian direction than hybrids and hedgehogs did. And this greater movement is all the more impressive in light of the fact that the Bayesian updating formula demanded less movement from foxes than from the other groups. Foxes moved 59 percent of the prescribed amount, whereas hedgehogs moved only 19 percent. Indeed, in two regional forecasting exercises, hedgehogs moved their opinions in the direction opposite to that prescribed by Bayes’s theorem, nudging up their confidence in their prior point of view after the unexpected happened. This latter pattern is not just contra-Bayesian; it is incompatible with all normative theories of belief adjustment.8
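
  These percentages can be read as ratios of observed to prescribed belief change; the sketch below uses invented numbers chosen to land near the hedgehog average reported above:

```python
# Sketch of the "percentage of prescribed movement" metric behind figure 4.1.
# The individual numbers are invented for illustration.
prior_confidence = 0.70        # confidence in one's own view before the outcome
prescribed_posterior = 0.44    # Bayesian-prescribed confidence after a losing bet
reported_posterior = 0.65      # confidence the forecaster actually reports afterward

prescribed_change = prior_confidence - prescribed_posterior   # 0.26
observed_change = prior_confidence - reported_posterior       # 0.05
fraction_moved = observed_change / prescribed_change
print(round(fraction_moved, 2))   # ~0.19, i.e., about 19 percent of the prescribed movement
```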

  Shifting to reactions to winning reputational bets, figure 4.1 shows that everyone—hedgehogs, foxes, and hybrids—seems eager to be good Bayesians when that role requires reaffirming the correctness of their prior point of view. Belief adjustments now hover in the vicinity of 60 percent of the prescribed amount for foxes and 80 percent of the prescribed amount for hedgehogs.

  Taken together, these results replicate and extend two classic psychological effects. One is cognitive conservatism: the reluctance of human beings to admit mistakes and update beliefs.9 The other is the “self-serving” attribution bias: the enthusiasm of human beings for attributing success to “internal” causes, such as the shrewdness of one’s opinions, and failure to external ones, such as task difficulty, unfair testing conditions, or bad luck.

  Psychologists find it reassuring when their laboratory effects hold up so well in the messy real world. These effects are not, however, reassuring to those who believe the world would be a better place if people adhered to Bayesian canons of rationality. From this latter standpoint, it is critical to understand how experts manage to retain so much confidence in their prior beliefs when they “get it wrong.” Which belief system defenses do they switch on to take the sting out of disconfirmation? Do hedgehogs rely more on these defenses than foxes? And, most intriguing, are the contours of a theory of good judgment emerging: a theory that posits a self-reinforcing virtuous circle in which self-critical thinkers are better at figuring out the contradictory dynamics of evolving situations, more circumspect about their forecasting prowess, more accurate in recalling mistakes, less prone to rationalize those mistakes, more likely to update their beliefs in a timely fashion, and—as a cumulative result of these advantages—better positioned to affix realistic probabilities in the next round of events?

  Belief System Defenses

  Forecasters embraced a variety of ingenious arguments for reneging on reputational bets, but we reduce all of them here to seven categories of “belief system defenses”: challenging whether the logical conditions for hypothesis testing were satisfied; invoking the exogenous-shock, close-call counterfactual, and off-on-timing arguments; declaring politics hopelessly indeterminate; defiantly insisting they made the right mistake (and would do it again); and making the metaphysical point that unlikely things sometimes happen.

  We used both qualitative and quantitative research methods to explore how forecasters who got it wrong managed to preserve so much confidence they were right. Qualitative methods—to which we turn first—shed light on how forecasters interpret outcomes and why they frequently feel justified in not changing their minds. Listening to forecasters’ arguments reminds us why we should not write off all resistance as ego-defensive whining. Determining whether people are good Bayesians in real-world settings proves more than a matter of arithmetic. “Belief system defenses” may often be defensible efforts to redefine the likelihood ratios that determine how much belief change is warranted (a point we revisit in chapter 6).

 
  Quantitative methods—to which we turn second—remind us that, however justified forecasters’ arguments may be, there is strong statistical evidence of a self-serving bias operating in the overall pattern of argumentation. Experts invoke arguments from the list of seven only when they make big mistakes and virtually never when they “get it right.” There is also a strong statistical connection between how often experts invoke arguments from the list and how far short they fall as Bayesians.

 
