Expert Political Judgment


by Philip E. Tetlock


  26 N. Silver, “A User’s Guide To FiveThirtyEight’s 2016 General Election Forecast,” FiveThirtyEight (ESPN), published June 29, 2016, retrieved November 15, 2016, http://fivethirtyeight.com/features/a-users-guide-to-fivethirtyeights-2016-general-election-forecast; J. Katz, “Who Will Be President?” New York Times, published November 8, 2016, retrieved November 15, 2016, http://www.nytimes.com/interactive/2016/upshot/presidential-polls-forecast.html?_r=0.

  27 Readers might be curious about the pros and cons of prediction markets versus tournaments as sources of nonpartisan probability pipelines into policy debates. Both beat the status quo handily but I prefer tournaments for a host of practical reasons. See P. E. Tetlock, B. A. Mellers, and J. Peter Scoblic, “Bringing Probability Judgments into Policy Debates via Forecasting Tournaments,” Science, 355, 481–83; also Atanasov et al., “Distilling the Wisdom of Crowds.”

  28 M. Lewis, Moneyball: The Art of Winning an Unfair Game (New York: Norton, 2004); M. Lewis, The Undoing Project: A Friendship that Changed Our Minds (New York: Norton, 2017).

  29 Kahneman, Thinking, Fast and Slow.

  30 Nassim Taleb denounces both the silliness of putting probability on events like the Arab Spring uprisings of 2011 and the absurdity of the poker analogy. Poker is a well-defined game of chance with random draws from a well-specified sampling universe, whereas life is chaotic. We never know for sure whether we are living in what Taleb calls Mediocristan, a well-behaved place of Gaussian-distributed outcomes, or Extremistan, where radical surprises pop up far more often than “Gaussians” would predict. One difference between Taleb’s views and mine is that he treats Black Swan events as categorical whereas I see a continuum of Swans, of varying grayness. N. N. Taleb and M. Blyth, “The Black Swan of Cairo: How Suppressing Volatility Makes the World Less Predictable and More Dangerous,” Foreign Affairs (2011): 33–39.

  31 R. Atkins, “Euro Project Has Mountains to Climb: Lessons That Can Be Learnt from Greece’s Flirtation with Euro Exit,” www.ft.com, July 16, 2015, https://www.ft.com/content/b28de122-2aeb-11e5-8613-e7aedbb7bdb7; R. Atkins, “Why Grexit Odds Are Probably 99 Percent Wrong: Range of Estimates Reveals Markets’ Inability to Price Political Risks,” www.ft.com, July 2, 2015, https://www.ft.com/content/bb18e6e6-1ff8-11e5-aa5a-398b2169cf79.

  32 E. Rosch, “Principles of Categorization” in The Motion Aftereffect: A Modern Perspective, ed. G. Mather, F. Verstraten, and S. Anstis (Cambridge: MIT Press, 1998); E. Rosch, “Principles of Categorization” in Cognition and Categorization, ed. E. Rosch and B. Lloyd (Hillsdale: Lawrence Erlbaum, 1978); E. Rosch and C. Mervis, “Family Resemblances: Studies in the Internal Structure of Categories,” Cognitive Psychology 7, no. 4 (1975): 573–605; L. Zadeh, “The Concept of a Linguistic Variable and its Application to Approximate Reasoning-I,” Information Sciences 8, no. 3 (1975): 199–249; L. Zadeh, “Outline of a New Approach to the Analysis of Complex Systems and Decision Processes,” IEEE Transactions on Systems, Man, and Cybernetics SMC-3, no. 1 (1973): 28–44; L. Zadeh, “Fuzzy Sets,” Information and Control 8, no. 3 (1965): 338–53.

  33 Chang et al., “Developing Expert Political Judgment.”

  34 J. Friedman, J. Baker, B. Mellers, P. Tetlock, and R. Zeckhauser, “The Value of Precision in Probability Assessment: Evidence from a Large-Scale Geopolitical Forecasting Tournament,” International Studies Quarterly, in press.

  35 “Weapons of mass destruction” was an umbrella term that included biological and chemical as well as nuclear weapons. Lowering the probability from 1.0 to .8 or even .6 might have moved Congress but probably not certain decision makers in the Executive branch. In a conversation in the White House Situation Room, Cheney commented on WMDs: “If there’s a one percent chance that Pakistani scientists are helping al Qaeda build or develop a nuclear weapon, we have to treat it as a certainty in terms of our response.” Cheney continued, “It’s not about our analysis, or finding a preponderance of evidence. It’s about our response.” See R. Suskind, The One Percent Doctrine: Deep Inside America’s Pursuit of Its Enemies Since 9/11 (New York: Simon and Schuster, 2006), pp. 61–62.

  36 Tom Nichols captures the backlash against predictive accuracy: “Failed predictions do not mean very much in terms of judging expertise . . . predictive failure does not retroactively strip experts of their claim to know more than laypeople.” The Death of Expertise (New York: Oxford University Press, 2016), pp. 202–3. I doubt Nichols sees predictive accuracy as 100% irrelevant—and I don’t see good judgment as 100% reducible to accuracy. But we can’t take the next step and clarify the key phrase “very much” unless we adopt a much more granular language for expressing uncertainty.

  37 S. P. Huntington, “The clash of civilizations?” in Culture and Politics, 99–118. (New York: Palgrave Macmillan, 2000); D. P. Moynihan, Pandaemonium: Ethnicity in International Politics (New York: Oxford University Press, 1993).

  38 G. Allison, “The Thucydides Trap: Are the U.S. and China Headed Toward War?,” The Atlantic, September 24, 2015.

  39 R. Jervis, System Effects: Complexity in Political and Social Life (Princeton, NJ: Princeton University Press, 1998).

  40 Nasty altercations between ships in the East China Sea are moderately common—and seem to occur quasi-randomly. Algorithms work best at this micro-level of analysis. Human judgment comes increasingly into play as we try to assess the risk that any one of these incidents will rise to serious diplomatic or military significance.

  41 J. Snyder, “Richness, Rigor, and Relevance in the Study of Soviet Foreign Policy,” International Security 9, no. 3 (1984): 89–108.

  42 Tetlock and Gardner, Superforecasting, pp. 110–13, 114.

  43 B. Mellers, R. Hertwig, and D. Kahneman. “Do Frequency Representations Eliminate Conjunction Effects? An Exercise in Adversarial Collaboration.” Psychological Science 12, no. 4 (2001): 269–75.

  44 Tom Nichols quotes Dilbert cartoonist and Trump backer, Scott Adams, to illustrate how U.S. public opinion has crossed the line from justified skepticism of expertise into Know-Nothingism. Adams proposes that intelligent but ill-informed policy-makers can learn all they need from short expert briefings—so don’t fret over candidates’ grasp of details during debates. Experts find this stance exasperating. But how wrong is it? Daniel Kahneman was the source in EPJ of the conjecture that experts would be hard-pressed to out-predict “attentive readers of the New York Times.” Our world is so unpredictable that we reach the inflection point of diminishing predictive returns to knowledge disconcertingly fast. And EPJ did find that “sophisticated dilettantes” gave experts hard runs for their money. Insofar as policy advice can be reduced to conditional forecasts about policy options, Kahneman’s conjecture does indeed imply that good briefers should get most policymakers to the optimal forecasting frontier quite quickly. Of course, this does assume that the policymakers are all good listeners. And we all suspect there is much more to good judgment than probability estimates. But the “much-more” argument must be made much more rigorously. Assertions, even emphatic ones like “I know it when I see it,” do not count as evidence.

  45 N. Silver, “Dear Media, Stop Freaking Out About Donald Trump’s Polls,” November 23, 2015, http://fivethirtyeight.com/features/dear-media-stop-freaking-out-about-donald-trumps-polls/

  46 H. D. Lasswell, Psychopathology and Politics (Chicago: University of Chicago Press, 1986).

  47 Haidt, The Righteous Mind.

  48 R. B. Cialdini and K. D. Richardson, “Two Indirect Tactics of Image Management: Basking and Blasting,” Journal of Personality and Social Psychology 39, no. 3 (1981): 406–15.


  CHAPTER 1

  Quantifying the Unquantifiable

  I do not pretend to start with precise questions. I do not think you can start with anything precise. You have to achieve such precision as you can, as you go along.

  —BERTRAND RUSSELL

  EVERY DAY, countless experts offer innumerable opinions in a dizzying array of forums. Cynics groan that expert communities seem ready at hand for virtually any issue in the political spotlight—communities from which governments or their critics can mobilize platoons of pundits to make prepackaged cases on a moment’s notice.

  Although there is nothing odd about experts playing prominent roles in debates, it is odd to keep score, to track expert performance against explicit benchmarks of accuracy and rigor. And that is what I have struggled to do in twenty years of soliciting and scoring experts’ judgments on a wide range of issues. The key term is “struggled.” For, if it were easy to set standards for judging judgment that would be honored across the opinion spectrum and not glibly dismissed as another sneaky effort to seize the high ground for a favorite cause, someone would have patented the process long ago.

  The current squabble over “intelligence failures” preceding the American invasion of Iraq is the latest illustration of why some esteemed colleagues doubted the feasibility of this project all along and why I felt it essential to push forward anyway. As I write, supporters of the invasion are on the defensive: their boldest predictions of weapons of mass destruction and of minimal resistance have not been borne out.

  But are hawks under an obligation—the debating equivalent of Marquis of Queensbury rules—to concede they were wrong? The majority are defiant. Some say they will yet be proved right: weapons will be found—so, be patient—or that Baathists snuck the weapons into Syria—so, broaden the search. Others concede that yes, we overestimated Saddam’s arsenal, but we made the right mistake. Given what we knew back then—the fragmentary but ominous indicators of Saddam’s intentions—it was prudent to over- rather than underestimate him. Yet others argue that ends justify means: removing Saddam will yield enormous long-term benefits if we just stay the course. The know-it-all doves display a double failure of moral imagination. Looking back, they do not see how terribly things would have turned out in the counterfactual world in which Saddam remained ensconced in power (and France wielded de facto veto power over American security policy). Looking forward, they do not see how wonderfully things will turn out: freedom, peace, and prosperity flourishing in lieu of tyranny, war, and misery.1

  The belief system defenses deployed in the Iraq debate bear suspicious similarities to those deployed in other controversies sprinkled throughout this book. But documenting defenses, and the fierce conviction behind them, serves a deeper purpose. It highlights why, if we want to stop running into ideological impasses rooted in each side’s insistence on scoring its own performance, we need to start thinking more deeply about how we think. We need methods of calibrating expert performance that transcend partisan bickering and check our species’ deep-rooted penchant for self-justification.2

  The next two sections of this chapter wrestle with the complexities of the process of setting standards for judging judgment. The final section previews what we discover when we apply these standards to experts in the field, asking them to predict outcomes around the world and to comment on their own and rivals’ successes and failures. These regional forecasting exercises generate winners and losers, but they are not clustered along the lines that partisans of the left or right, or of fashionable academic schools of thought, expected. What experts think matters far less than how they think. If we want realistic odds on what will happen next, coupled with a willingness to admit mistakes, we are better off turning to experts who embody the intellectual traits of Isaiah Berlin’s prototypical fox—those who “know many little things,” draw from an eclectic array of traditions, and accept ambiguity and contradiction as inevitable features of life—than we are turning to Berlin’s hedgehogs—those who “know one big thing,” toil devotedly within one tradition, and reach for formulaic solutions to ill-defined problems.3 The net result is a double irony: a perversely inverse relationship between my prime exhibit indicators of good judgment and the qualities the media prizes in pundits—the tenacity required to prevail in ideological combat—and the qualities science prizes in scientists—the tenacity required to reduce superficial complexity to underlying simplicity.

  HERE LURK (THE SOCIAL SCIENCE EQUIVALENT OF) DRAGONS

  It is a curious thing. Almost all of us think we possess it in healthy measure. Many of us think we are so blessed that we have an obligation to share it. But even the savvy professionals recruited from academia, government, and think tanks to participate in the studies collected here struggled to define it. When pressed for a precise answer, a disconcerting number fell back on Potter Stewart’s famous definition of pornography: “I know it when I see it.” And, of those participants who ventured beyond the transparently tautological, a goodly number offered definitions that were in deep, even irreconcilable, conflict. However we set up the spectrum of opinion—liberals versus conservatives, realists versus idealists, doomsters versus boomsters—we found little agreement on either who had it or what it was.

  The elusive it is good political judgment. And some reviewers warned that, of all the domains I could have chosen—many, like medicine or finance, endowed with incontrovertible criteria for assessing accuracy—I showed suspect scientific judgment in choosing good political judgment. In their view, I could scarcely have chosen a topic more hopelessly subjective and less suitable for scientific analysis. Future professional gatekeepers should do a better job stopping scientific interlopers, such as the author, from wasting everyone’s time—perhaps by posting the admonitory sign that medieval mapmakers used to stop explorers from sailing off the earth: hic sunt dracones.

  This “relativist” challenge strikes at the conceptual heart of this project. For, if the challenge in its strongest form is right, all that follows is for naught. Strong relativism stipulates an obligation to judge each worldview within the framework of its own assumptions about the world—an obligation that theorists ground in arguments that stress the inappropriateness of imposing one group’s standards of rationality on other groups.4 Regardless of precise rationale, this doctrine imposes a blanket ban on all efforts to hold advocates of different worldviews accountable to common norms for judging judgment. We are barred from even the most obvious observations: from pointing out that forecasters are better advised to use econometric models than astrological charts or from noting the paucity of evidence for Herr Hitler’s “theory” of Aryan supremacy or Comrade Kim Il Sung’s juche “theory” of economic development.

  Exasperation is an understandable response to extreme relativism. Indeed, it was exasperation that, two and a half centuries ago, drove Samuel Johnson to dismiss the metaphysical doctrines of Bishop Berkeley by kicking a stone and declaring, “I refute him thus.” In this spirit, we might crankily ask what makes political judgment so special. Why should political observers be insulated from the standards of accuracy and rigor that we demand of professionals in other lines of work?

  But we err if we shut out more nuanced forms of relativism. For, in key respects, political judgment is especially problematic. The root of the problem is not just the variety of viewpoints. It is the difficulty that advocates have pinning each other down in debate. When partisans disagree over free trade or arms control or foreign aid, the disagreements hinge on more than easily ascertained claims about trade deficits or missile counts or leaky transfer buckets. The disputes also hinge on hard-to-refute counterfactual claims about what would have happened if we had taken different policy paths and on impossible-to-refute moral claims about the types of people we should aspire to be—all claims that partisans can use to fortify their positions against falsification. Without retreating into full-blown relativism, we need to recognize that political belief systems are at continual risk of evolving into self-perpetuating worldviews, with their own self-serving criteria for judging judgment and keeping score, their own stocks of favorite historical analogies, and their own pantheons of heroes and villains.

  We get a clear picture of how murky things can get when we explore the difficulties that even thoughtful observers run into when they try (as they have since Thucydides) to appraise the quality of judgment displayed by leaders at critical junctures in history. This vast case study literature underscores—in scores of ways—how wrong Johnsonian stone-kickers are if they insist that demonstrating defective judgment is a straightforward “I refute him thus” exercise.5 To make compelling indictments of political judgment—ones that will move more than one’s ideological soul mates—case study investigators must show not only that decision makers sized up the situation incorrectly but also that, as a result, they put us on a manifestly suboptimal path relative to what was once possible, and they could have avoided these mistakes if they had performed due diligence in analyzing the available information.

  These value-laden “counterfactual” and “decision-process” judgment calls create opportunities for subjectivity to seep into historical assessments of even exhaustively scrutinized cases. Consider four examples of the potential for partisan mischief:

  a. How confident can we now be—sixty years later and after all records have been declassified—that Harry Truman was right to drop atomic bombs on Japan in August 1945? This question still polarizes observers, in part, because their answers hinge on guesses about how quickly Japan would have surrendered if its officials had been invited to witness a demonstration blast; in part, because their answers hinge on values—the moral weight we place on American versus Japanese lives and on whether we deem death by nuclear incineration or radiation to be worse than death by other means; and, in part, because their answers hinge on murky “process” judgments—whether Truman shrewdly surmised that he had passed the point of diminishing returns for further deliberation or whether he acted impulsively and should have heard out more points of view.6

 
