Statistical Inference as Severe Testing


by Deborah G Mayo


  Navigating the reforms requires a roadmap. In Tour I we’ll visit the gallimaufry of very different notions of probability in current Bayesian discussions. Concerned that today’s practice isn’t captured by either traditional Bayesian or frequentist philosophies, new foundations are being sought – that’s where we’ll travel in Tour II.

  Strange bedfellows: the classical subjective Bayesian and the classical frequentist tribes are at one in challenging non-subjective, default Bayesians. The small, but strong tribes of subjective Bayesians, we may imagine, ask them:

  How can you declare scientists want highly probable hypotheses (or comparatively highly probable ones) if your probabilities aren’t measuring reasonable beliefs or plausibility (or the like)?

  Frequentist error statisticians concur, but also, we may imagine, inquire:

  What’s so good about high posterior probabilities if a method frequently assigns them to poorly tested claims?

  Let’s look back at Souvenir D, where Reid and Cox (2015, p. 295) press the weak repeated sampling requirement on non-frequentist assessments of uncertainty.

  The role of calibration seems essential: even if an empirical frequency-based view of probability is not used directly as a basis for inference, it is unacceptable if a procedure yielding regions of high probability in the sense of representing uncertain knowledge would, if used repeatedly, give systematically misleading conclusions.

  Frequentist performance is a necessary, though not a sufficient, condition for severe testing. Even those who deny an interest in performance might not want to run afoul of the minimal requirement for severity. The onus is on those who declare that what we really want in statistical inference are probabilities on hypotheses to show, for existing ways of obtaining them, why.

  Notice that the largest statistical territory is inhabited by practitioners who identify as eclecticists, using a toolbox of various and sundry methods. Some of the fastest growing counties of machine learners and data scientists point to spell checkers and self-driving cars that learn by conjecture-and-refutation algorithms, at times sidestepping probability models altogether. The year 2013 was dubbed the International Year of Statistics partly to underscore the importance of statistics to the Big Data revolution. The best AI algorithms appear to lack a human’s grasp of deception based on common sense. That little skirmish is ongoing. Eclecticism gives all the more reason to clearly distinguish the meanings of numbers that stem from methods evaluating different things. This is especially so when it comes to promoting scientific integrity and reproducibility, and in the waves of methodological reforms from journals and reports. Efron has it right: “Unlike most philosophical arguments, this one has important practical consequences” (2013, p. 130). Let’s land this balloon; we’re heading back to the Museum of Statistics. If you’ve saved your stub from Excursion 1, it’s free.

  6.1 Bayesian Ways: From Classical to Default

  Let’s begin Excursion 6 on the museum floor devoted to classical, philosophical, subjective Bayesianism (which I’m not distinguishing from personalism). This will give us a thumbnail of the position that contemporary non-subjective Bayesians generally reject as a description of what they do. An excellent starting point that is not ancient history, and also has the advantage of contemporary responses, is Dennis Lindley’s (2000) “Philosophy of Statistics.” We merely click on the names, and authentic-looking figures light up and speak. Here’s Lindley:

  The suggestion here is that statistics is the study of uncertainty (Savage 1977): that statisticians are experts in handling uncertainty …

  (p. 294)

  [Consider] any event, or proposition, which can either happen or not, be true or false. It is proposed to measure your uncertainty associated with the event … If you think that the event is just as uncertain as the random drawing of a red ball from an urn containing N balls, of which R are red, then the event has uncertainty R/N for you.

  (p. 295)

  Historically, uncertainty has been associated with games of chance and gambling. Hence one way of measuring uncertainty is through the gambles that depend on it.

  (p. 297)

  Consider before you an urn containing a known number N of balls that are as nearly identical as modern engineering can make them. Suppose that one ball is drawn at random from the urn … it is needful to define randomness. Imagine that the balls are numbered consecutively from 1 to N and suppose that, at no cost to you, you were offered a prize if ball 57 were drawn … [and] the same prize if ball 12 were drawn. If you are indifferent between the two propositions and, in extension, between any two numbers between 1 and N, then, for you, the ball is drawn at random. Notice that the definition of randomness is subjective; it depends on you.

  (p. 295)

  It is immediate from [Bayes’ Theorem] that the only contribution that the data make to inference is through the likelihood function for the observed x. This is the likelihood principle: that values of x, other than that observed, play no role in inference.

  (pp. 309–10)

  [U]nlike the frequency paradigm with its extensive collection of specialized methods, the coherent view provides a constructive method of formulating and solving any and every uncertainty problem of yours.

  (p. 333) 1

  This is so clear, clean, and neat. The severe tester, by contrast, doesn’t object that “specialized” methods are required to apply formal statistics. Satisfying the requirements of severe testing demands it, and that’s unity enough. But let’s see what some of Lindley’s critical responders said in 2000. Press the buttons under their names. I’ll group by topic:

  1. Subjectivity.

  Peter Armitage: “The great merit of the Fisherian revolution, apart from the sheer richness of the applicable methods, was the ability to summarize, and to draw conclusions from, experimental and observational data without reference to prior beliefs. An experimental scientist needs to report his or her findings, and to state a range of possible hypotheses with which these findings are consistent. The scientist will undoubtedly have prejudices and hunches, but the reporting of these should not be a primary aim of the investigation. … There were indeed important uncertainties, about possible biases … [and the] existence of confounding factors. But the way to deal with them was … by scrupulous argument rather than by assigning probabilities …” (ibid., pp. 319–20)

  David Cox: “It seems to be a fundamental assumption of the personalistic theory that all probabilities are comparable. Moreover, so far as I understand it, we are not allowed to attach measures of precision to probabilities. They are as they are … I understand Dennis Lindley’s irritation at the cry ‘where did the prior come from?’ I hope that it is clear that my objection is rather different: why should I be interested in someone else’s prior and why should anyone else be interested in mine? (ibid., p. 323) … [I]n my view the personalistic probability is virtually worthless for reasoned discussion unless it is based on information, often directly or indirectly of a broadly frequentist kind. … For example, how often have very broadly comparable laboratory studies been misleading as regards human health? How distant are the laboratory studies from a direct process affecting health?” (ibid., p. 322)

  2. Non-ampliative.

  David Sprott: “This paper relegates statistical and scientific inference to a branch (probability) of pure mathematics, where inferences are deductive statements of implication: if H1 then H2. This can say nothing about whether there is reproducible objective empirical evidence for H1 or H2, as is required by a scientific inference.” (ibid., p. 331)

  3. Science is Open-Ended.

  John Nelder: “Statistical science is not just about the study of uncertainty, but rather deals with inferences about scientific theories from uncertain data. … [Theories] are essentially open ended; at any time someone may come along and produce a new theory outside the current set. This contrasts with probability, where to calculate a specific probability it is necessary to have a bounded universe of possibilities over which the probabilities are defined. When there is intrinsic open-endedness it is not enough to have a residual class of all the theories that I have not thought of yet [the catchall].” (ibid., p. 324)

  David Sprott: “Bayes’s Theorem (1) requires that all possibilities H1, H2, …, Hk be specified in advance, along with their prior probabilities. Any new, hitherto unthought of hypothesis or concept H will necessarily have zero prior probability. From Bayes’s Theorem, H will then always have zero posterior probability no matter how strong the empirical evidence in favour of H.” (ibid., p. 331)
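  Sprott’s worry is a direct consequence of Bayes’ Theorem: a hypothesis H assigned zero prior probability can never acquire positive posterior probability, however strong the evidence x in its favor:

```latex
P(H \mid x) \;=\; \frac{P(x \mid H)\, P(H)}{P(x)}
            \;=\; \frac{P(x \mid H) \cdot 0}{P(x)} \;=\; 0 .
```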

  4. Likelihood Principle.

  Brad Efron: “The likelihood principle seems to be one of those ideas that is rigorously verifiable and yet wrong.” (Efron 2000, p. 330) 2

  There are also supporters of course, notably O’Hagan and Dawid, whose remarks we take up elsewhere. The fact that classical Bayesianism reduces statistical inference to probability theory – the very reason many take it as a respite from the chaos of frequentism – could also, Dawid observes, be thought to make it boring: “What is the principal distinction between Bayesian and classical statistics? It is that Bayesian statistics is fundamentally boring. There is so little to do: just specify the model and the prior, and turn the Bayesian handle.” (ibid., p. 326). He’s teasing, I’m sure, but let’s step back.

  The error statistician agrees with all these criticisms. In her view, statistics is collecting, modeling, and using data to make inferences about aspects of what produced them. Inferences, being error prone, are qualified by reports of the error-probing capacities of the inferring method. There is a cluster of error types: real versus spurious effect, wrong magnitude for a parameter, violated statistical assumptions, and flaws in connecting formal statistical inference to substantive claims. Error statistics splits problems off piecemeal; there’s no need for an exhaustive list of hypotheses that could explain data. Being able to directly pick up on gambits like cherry picking and optional stopping is essential for an account to be up to the epistemological task of determining if claims are poorly tested. While for Lindley this leads to incoherence (violations of the likelihood principle), for us it is the key to assessing if your tool is capable of deceptions. According to Efron: “The two philosophies, Bayesian and frequentist, are more orthogonal than antithetical” (Efron 2013, p. 145). Given the radical difference in goals between classical Bayesians and classical frequentists, he might be right. Vive la différence!
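  The error-statistical concern with optional stopping can be made concrete with a small simulation (a hypothetical sketch of mine, not from the text): draw data from a true null hypothesis, but test after every observation and stop as soon as “significance” is reached. Although each look uses a nominal 5% level, the overall rate of erroneously declaring an effect climbs far above 0.05 – exactly the kind of deception an account keyed only to the likelihood function cannot register.

```python
import math
import random

def optional_stopping_trial(max_n=100, z_crit=1.96, rng=random):
    """Draw standard normal data under a true null (mu = 0), computing
    the z-statistic after every observation; return True if the test
    ever 'rejects' (|z| > z_crit) within max_n observations."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.gauss(0.0, 1.0)
        z = total / math.sqrt(n)  # z-statistic for H0: mu = 0, known sigma = 1
        if abs(z) > z_crit:
            return True  # stop as soon as 'significance' appears
    return False

random.seed(1)
trials = 5000
rejections = sum(optional_stopping_trial() for _ in range(trials))
rate = rejections / trials
# Each look uses a nominal 5% level, yet trying again and again
# drives the overall erroneous-rejection rate well above 0.05.
print(f"erroneous rejection rate with optional stopping: {rate:.3f}")
```

A sampling plan with a fixed n would keep the error rate at 5%; the two plans can yield identical likelihood functions, which is why the likelihood principle cannot distinguish them.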

  But Now Things Have Changed

  What should we say now that the landscape has changed? That’s what we’ll explore in Excursion 6. We’ll drop in on some sites we only visited briefly or passed up the first time around. We attempt to disinter the statistical philosophy practiced by the most popular of Bayesian tribes, those using non-subjective or default priors, picking up on Section 1.3, “The Current State of Play”. Around 20 years ago, it began to be conceded that “non-informative priors do not exist” (Bernardo 1997). In effect, they couldn’t transcend the problems of “the principle of indifference,” wherein lacking a reason to distinguish the probabilities of different values of θ is taken to render them all equally probable. The definitive review of default methods in statistics is Kass and Wasserman (1996). The default/non-subjective Bayesian focuses on priors that, in some sense, give heaviest weight to the data. Impressive technical complexities notwithstanding, there’s a multiplicity of incompatible ways to go about this job, none obviously superior. (A few are maximum entropy, invariance, maximizing the missing information, and coverage matching.) The problem is redolent of Carnap’s problem of being faced with a continuum of inductive logics (Section 2.1). Even for simple problems, recommended default Bayesian procedures differ.
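  To see the multiplicity concretely, here is a minimal sketch (the numbers are mine, not from the text) comparing two common default priors for a binomial proportion: the uniform Beta(1, 1) and the Jeffreys (invariance-based) Beta(1/2, 1/2). On the very same data they deliver different posteriors.

```python
def beta_posterior_mean(successes, n, a, b):
    """Posterior mean of a binomial proportion under a Beta(a, b) prior:
    the posterior is Beta(a + successes, b + n - successes), whose mean
    is (a + successes) / (a + b + n)."""
    return (a + successes) / (a + b + n)

s, n = 3, 10  # observed: 3 successes in 10 Bernoulli trials
uniform = beta_posterior_mean(s, n, 1.0, 1.0)   # uniform 'flat' prior, Beta(1, 1)
jeffreys = beta_posterior_mean(s, n, 0.5, 0.5)  # Jeffreys prior, Beta(1/2, 1/2)
print(f"posterior mean, uniform prior:  {uniform:.4f}")   # 4/12  = 0.3333
print(f"posterior mean, Jeffreys prior: {jeffreys:.4f}")  # 3.5/11 = 0.3182
```

The gap shrinks as n grows, but for small samples the choice of default prior is doing real inferential work – which is the point of the objection.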

  If the proponents of this view thought their choice of a canonical prior were intellectually compelling, they would not feel attracted to a call for an internationally agreed convention on the subject, as have Berger and Bernardo (1992, p. 57) and Jeffreys (1955, p. 277).

  (Kadane 2011, pp. 445–6)

  No such convention has been held.

  Default/non-subjective Bayesianism is often offered as a way to unify Bayesian and frequentist approaches. 3 It gives frequentist error statisticians a clearer and less contentious (re)entry into statistical foundations than when Bayesian “personalists” reigned (e.g., Lindley, Savage). At an earlier time, as Cox tells it, confronted with the position that “arguments for this personalistic theory were so persuasive that anything to any extent inconsistent with that theory should be discarded” (Cox 2006a, p. 196), frequentists might have felt alienated when it came to foundations. The discourse was snarky and divisive. Nowadays, Bayesians are more diffident. It’s not that unusual to hear Bayesians admit that the older appeals to ideals of rationality were hyped. Listen to passages from Gelman (2011), Kass (2011), and Spiegelhalter (2004):

  Frequentists just took subjective Bayesians at their word and quite naturally concluded that Bayesians had achieved the goal of coherence only by abandoning scientific objectivity. Every time a prominent Bayesian published an article on the unsoundness of p-values, this became confirming evidence of the hypothesis that Bayesian inference operated in a subjective zone bounded by the prior distribution.

  (Gelman 2011, p. 71)

  [T]he introduction of prior distributions may not have been the central bothersome issue it was made out to be. Instead, it seems to me, the really troubling point for frequentists has been the Bayesian claim to a philosophical high ground, where compelling inferences could be delivered at negligible logical cost.

  (Kass 2011, p. 6)

  The general statistical community, who are not stupid, have justifiably found somewhat tiresome the tone of hectoring self-righteousness that has often come from the Bayesian lobby. Fortunately that period seems to be coming to a close, and with luck the time has come for the appropriate use of Bayesian thinking to be pragmatically established.

  (Spiegelhalter 2004, p. 172)

  Bayesian empathy with objections to subjective foundations – “we feel your pain” – is a big deal, and still rather new to this traveler’s ears. What’s the new game all about? There’s an important thread that needs to be woven into any answer. Not so long after the retreat from classical subjective Bayes, though it’s impossible to give dates (early 2000s?), we saw the rise of irreproducible results and the statistical crisis in science. A new landscape of statistical conflict followed, but it grew largely divorced from the older Bayesian–frequentist battles. “Younger readers … may not be fully aware of the passionate battles over Bayesian inference among statisticians in the last half of the twentieth century” (Gelman and Robert 2013, p. 1). Opening with Lindley’s statistical philosophy lets us launch into the newer battles. Finding traditional Bayesian foundations ripped open, coupled with invitations for Bayesian panaceas to the reproducibility crisis, we are swept into a dizzying whirlpool where deeper and more enigmatic puzzles swirl. Do you still have that quicksand stick (Section 3.6)? Grab it and join me on some default, pragmatic, and eclectic Bayesian pathways.

  (Note: Since we are discussing existing frequentist–Bayesian arguments, I’ll usually use “frequentism” in this excursion, rather than our preferred “error statistics.”)

  6.2 What Are Bayesian Priors? A Gallimaufry

  The prevalent Bayesian position might be said to be: there are shortcomings or worse in standard frequentist methods, but classical subjective Bayesianism is, well, too subjective, so default Bayesianism should be used. Yet when you enter default Bayesian territory you’ll need to juggle a plethora of competing meanings given to Bayesian priors, and consequently to posteriors.

  To show you what I mean, look at a text by Ghosh, Delampady, and Samanta (2010): They say they will stress “objective” (default) priors, “because it still seems difficult to elicit fully subjective priors … If a fully subjective prior is available we would indeed use it” (p. 36). Can we slip in and out of non-subjective and subjective priors so easily? Several contemporary Bayesian texts say yes. How should a default prior be construed? Ghosh et al. say that “it represents a shared belief or shared convention,” while on the same page it is “to represent small or no information” (p. 30). Maybe it can be all three. The salient points to keep in mind are spelled out by Bernardo:

  By definition, ‘non-subjective’ prior distributions are not intended to describe personal beliefs, and in most cases, they are not even proper probability distributions in that they often do not integrate [to] one. Technically they are only positive functions to be formally used in Bayes’ theorem to obtain ‘non-subjective posteriors’ …

  (Bernardo 1997, pp. 159–60)

  Bernardo depicts them as a convention chosen “to make precise the type of prior knowledge which,” for a given inference problem within a model, “would make the data dominant” (ibid., p. 163). Can you just hear Fisher reply (as he did regarding priors washing out), “we may well ask what [the prior] is doing in our reasoning at all” (1934b, p. 287)? Bernardo might retort: they are merely formal tools “which, for a given model, are supposed to describe whatever the data ‘have to say’ about some particular quantity” (1997, p. 160). The desire for an inductive logic of probabilism is familiar to us. Statistician Christian Robert echoes this sentiment: “Having a prior attached to [a parameter θ] has nothing to do with ‘reality,’ it is a reference measure that is necessary for making probability statements” (2011, pp. 317–18). How then do we interpret the posterior, Cox asks? “If the prior is only a formal device and not to be interpreted as a probability, what interpretation is justified for the posterior as an adequate summary of information?” (2006a, p. 77) 4
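  Bernardo’s remark that default priors “often do not integrate to one” can be illustrated with a standard textbook example (mine, not from the text): a flat prior on a normal mean μ, with known σ, is improper, yet combining it with the likelihood yields a perfectly proper posterior:

```latex
\pi(\mu) \propto 1
\quad \text{(improper: } \textstyle\int_{-\infty}^{\infty} 1 \, d\mu = \infty \text{)},
\qquad
\pi(\mu \mid x_1, \ldots, x_n)
\propto \prod_{i=1}^{n} \exp\!\left\{-\tfrac{(x_i - \mu)^2}{2\sigma^2}\right\}
\;\Longrightarrow\;
\mu \mid x_1, \ldots, x_n \sim N\!\left(\bar{x},\ \sigma^2/n\right).
```

The formal machinery turns the handle either way, which is precisely what gives Cox’s question its bite: if π(μ) is not a probability, what licenses reading the resulting posterior as one?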

 
