Statistical Inference as Severe Testing


by Deborah G Mayo


  Nearly all tribes are becoming aware that today’s practice isn’t captured by tenets of classical probabilism. Even some subjective Bayesians, we saw, question updating by Bayes’ Rule. Temporal incoherence can require a do-over. The most appealing aspects of non-subjective/default Bayesianism – a way to put in background information while allowing the data to dominate – are in tension with each other, and with updating. The gallimaufry of priors alone is an obstacle to scrutinizing the offerings. There are a few tribes where brand new foundations are being sought – that’s our last port of call.

  1 “Frequency, however, is not adequate because there is ordinarily no repetition of parameters; they have unique unknown values … with the result that it has been necessary for them to develop incoherent concepts like confidence intervals.” (p. 311) There are, however, repetitions of types of methods.

  2 I have argued (e.g., Mayo 2014b) that the alleged verifications are circular. Efron, in private communication, said that he tried to argue against the result, but gave up; he was glad I did not.

  3 Since we’ll be talking a lot about default Bayesians in this tour, I’ll use “default/non-subjective” lest I be seen as taking away an appealing name.

  4 “When the parameter space is finite it [Bernardo reference priors] produces the maximum entropy prior of E. T. Jaynes and, for a one-dimensional parameter, the Jeffreys prior” (Cox 2006b, p. 6).

  5 The default Bayesian needn’t give probability 1 to data, but it’s unclear how they proceed with Bayes’ Rule or other computations with a probability on the data and assumptions. Rejecting this possibility, Box and others use frequentist methods for model testing.

  6 One way to link the LP violation with Bayesian incoherence is to show that the posterior depends on the order of two independent experiments for the same parameter. We know the Binomial and Negative Binomial experiments have different sample spaces (Section 4.9), and yet are not distinguished under the LP. Default priors, however, are sample-space dependent. If the first experiment is Binomial and the second Negative Binomial, both for inferences about the probability of success on each trial, a different posterior results depending on the order in which the default rule is applied, as the sketch below illustrates. Excellent discussions are in Seidenfeld (1979) and Kass and Wasserman (1996, p. 1359). Note that some consider that coherence only concerns the assignment of the prior; a violation of Bayes’ Rule is called a failure of Bayesian conditionalization.
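
  To see the order dependence concretely, here is a minimal sketch in Python (not from the text): the pooled counts s and f are hypothetical, and the priors are the standard Jeffreys priors, Beta(1/2, 1/2) for the Binomial and, for the Negative Binomial, the improper prior proportional to θ^(−1)(1 − θ)^(−1/2).

```python
from scipy import stats

# Hypothetical pooled data: s successes and f failures in total across the
# two experiments. Under the LP both designs contribute the same
# likelihood kernel, theta^s * (1 - theta)^f.
s, f = 7, 13

# Default rule applied to the Binomial experiment first: Jeffreys prior
# Beta(1/2, 1/2), so the final posterior is Beta(s + 1/2, f + 1/2).
binomial_first = stats.beta(s + 0.5, f + 0.5)

# Default rule applied to the Negative Binomial first: Jeffreys prior
# proportional to theta^(-1) * (1 - theta)^(-1/2), so the final
# posterior is Beta(s, f + 1/2).
negbinomial_first = stats.beta(s, f + 0.5)

print(binomial_first.mean(), negbinomial_first.mean())  # ~0.357 vs ~0.341
```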

  7 I’ve no objection to Berger’s viewing “probability as a primitive concept” (p. 385). Theoretical concepts may arise in models and receive their explication through applications. It’s problematic when the subsequent meanings shift, as happens with default probabilities.

  8 The noteworthy Carnapian part is his relativization to first order languages, rather than to statistical models.

  9 They have in mind maximal entropy methods advanced by Jaynes and also developed by Roger Rosenkrantz (1977). It is thought to work well in contexts where the current experiment may be seen as a typical instance of a known physical θ-generating process.

  Tour II

  Pragmatic and Error Statistical Bayesians

  6.5 Pragmatic Bayesians

  The protracted battle for the foundations of statistics, joined vociferously by Fisher, Jeffreys, Neyman, Savage and many disciples, has been deeply illuminating, but it has left statistics without a philosophy that matches contemporary attitudes.

  (Robert Kass 2011, p. 1)

  Is there a philosophy that “matches contemporary attitudes”? Worried that “our textbook explanations have not caught up with the eclecticism of statistical practice,” Robert Kass puts forward a statistical pragmatism “as a foundation for inference” (ibid.) as now practiced. It reflects the current disinclination to embrace subjective Bayes, skirts the rigid behavioral decision model of frequentist inference, and hangs on to a kind of frequentist calibration. “Subjectivism was never satisfying as a logical framework: an important purpose of the scientific enterprise is to go beyond personal decision-making” (ibid., p. 6). Nevertheless, Kass thinks “it became clear, especially from the arguments of Savage … the only solid foundation for Bayesianism is subjective” (ibid., p. 7). Statistical pragmatism pulls us out of that “solipsistic quagmire” (ibid.). Statistical pragmatism, says Kass, holds that confidence intervals, statistical significance, and posterior probability are all valuable tools. So long as we recognize that our statistical models exist in an abstract theoretical world, Kass avers, we can safely employ methods from either tribe, retaining unobjectionable common denominators: frequencies without long runs and Bayesianism without the subjectivity.

  The advantage of the pragmatic framework is that it considers frequentist and Bayesian inference to be equally respectable and allows us to have a consistent interpretation, without feeling as if we must have split personalities in order to be competent statisticians.

  (ibid., p. 7)

  Kass offers a valuable analysis of the conundrums of members of today’s eclectic and non-subjective/default Bayesian tribes. We want to expose some hidden layers only visible after peeling away a more familiar mindset. Do we escape split personality? Or has he, perhaps inadvertently, explained the necessity for a degree of schizophrenia in current eclectic practice? That’s where today’s journey begins.

  Kass and the Pragmatists

  Bayes–frequentist agreement is typically closer with confidence intervals than with tests. Kass (2011) uses a “paradigm case of confidence and posterior intervals for a Normal mean based on a sample of size n [49], with the standard deviation [1] being known” (ibid., pp. 3–4). Both the frequentist and the Bayesian arrive at the same interval, but he’ll supply “mildly altered interpretations of frequentist and Bayesian inference” (ibid., p. 3).

  We assume X1, X2, …, X49 are IID random variables from N(μ, 1). The 0.95 two-sided confidence interval estimator, we know, is

  X̄ ± Z0.025(1/√49), i.e., (X̄ − Z0.025/7, X̄ + Z0.025/7).

  Following Kass, take Z0.025 to be 2 (instead of the more accurate 1.96):

  We observe x̄ = 10.2. Plugging in 1/7 ≈ 0.14 and 2/7 ≈ 0.28, the particular interval estimate I is

  (10.2 − 2/7, 10.2 + 2/7) = [9.92, 10.48].
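
  For readers who want to verify the arithmetic, here is a minimal sketch in Python, following the text’s roundings (Z0.025 taken as 2, and 2/7 taken as 0.28):

```python
import math

n, sigma, xbar = 49, 1.0, 10.2
z = 2.0                    # the text's rounding of z_0.025 = 1.96
se = sigma / math.sqrt(n)  # 1/7, which the text takes as 0.14
margin = 0.28              # the text's value for 2 * se = 2/7 (~0.2857)
print(f"[{xbar - margin:.2f}, {xbar + margin:.2f}]")  # [9.92, 10.48]
```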

  He contrasts the usual frequentist interpretation with the pragmatic one he recommends:

  FREQUENTIST INTERPRETATION … Under the assumptions above, if we were to draw infinitely many random samples from a N(μ, 1) distribution, 95% of the corresponding confidence intervals (X̄ − 2/7, X̄ + 2/7) would cover μ.

  (ibid., p. 4)

  PRAGMATIC INTERPRETATION … If we were to draw a random sample according to the assumptions above, the resulting confidence interval (X̄ − 2/7, X̄ + 2/7) would have probability 0.95 of covering μ. Because the random sample lives in the theoretical world, this is a theoretical statement. Nonetheless, substituting … we obtain the interval I, and are able to draw useful conclusions as long as our theoretical world is aligned well with the real world that produced the data.

  (ibid.)

  Kass’s pragmatic Bayesian treatment runs parallel to his construal of the frequentist, except that the Bayesian also assumes a prior distribution for parameter μ. It is given its own mean μ0 and variance τ².

  BAYESIAN ASSUMPTIONS: Suppose X1, X2, … Xn form a random sample from a N(μ, 1) distribution and the prior distribution of μ is N(μ0, τ²), with τ² ≫ 1/49 and 49τ² ≫ |μ0|.

  (ibid.)

  This conjugate prior might be invoked as what Berger called a “prior completion strategy” or simply a choice of a prior enabling the data to sufficiently dominate. The results: the posterior distribution of μ is Normal, with mean (μ0/τ² + 49x̄)/(1/τ² + 49) and variance 1/(1/τ² + 49).

  Given the stipulations that τ² ≫ 1/49 and 49τ² ≫ |μ0|, we get approximately the same interval as the frequentist, only it gets a posterior probability, not just a coverage performance, of 0.95.
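
  A minimal sketch of the conjugate Normal update behind this agreement, in Python; the particular prior values μ0 = 0 and τ² = 100 are hypothetical choices satisfying the two stipulations, not values from the text:

```python
import math

n, xbar = 49, 10.2
mu0, tau2 = 0.0, 100.0  # hypothetical prior: tau2 >> 1/49 and 49*tau2 >> |mu0|

# Conjugate Normal update with known sigma = 1:
post_prec = 1 / tau2 + n  # posterior precision
post_mean = (mu0 / tau2 + n * xbar) / post_prec
post_sd = math.sqrt(1 / post_prec)

# Central 0.95 credible interval, using z = 2 as in the text:
print(f"[{post_mean - 2 * post_sd:.2f}, {post_mean + 2 * post_sd:.2f}]")
# [9.91, 10.48] -- essentially the frequentist interval
```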

  BAYESIAN INTERPRETATION… Under the assumptions above, the probability that μ is in the interval I is 0.95.

  PRAGMATIC INTERPRETATION … If the data were a random sample [as described and X̄ = x̄], and if the assumptions above were to hold, then the probability that μ is in the interval I would be 0.95. This refers to a hypothetical value x̄ of the random variable X̄, and because X̄ lives in the theoretical world the statement remains theoretical. Nonetheless, we are able to draw useful conclusions from the data as long as our theoretical world is aligned well with the real world that produced the data.

  (ibid., p. 4)

  We get an agreement on numbers and we can cross over the Bayesian–frequentist bridge with aplomb. Assuming, that is, we don’t notice the crocodiles in the water waiting to bite us!

  Analysis.

  First off, the frequentist readily agrees with Kass that “long-run frequencies may be regarded as consequences of the Law of Large Numbers rather than as part of the definition of probability or confidence” (p. 2). The probability is methodological: the proportion of correct estimates produced in hypothetical applications is 0.95. The long run need only be long enough to see the pattern emerge, and is hypothetical. One can simulate the statistical mechanism associated with the model to produce realizations of the process on a computer, today, not in a long run (Spanos 2013c).
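
  A minimal simulation sketch in this spirit, in Python; the seed and the 100,000 replications are arbitrary choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(1)
n, mu, reps = 49, 10.2, 100_000

# Draw many hypothetical samples from N(mu, 1) and record how often
# the interval xbar +/- 2/7 covers the true mu.
xbars = rng.normal(mu, 1.0, size=(reps, n)).mean(axis=1)
covered = (xbars - 2 / 7 <= mu) & (mu <= xbars + 2 / 7)
print(covered.mean())  # ~0.954: the coverage of the z = 2 interval
```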

  According to Kass (2011), “the commonality between frequentist and Bayesian inferences is the use of theoretical assumptions …” (p. 4). The prior becomes part of the model in the Bayesian case. For both cases, “When we use a statistical model to make a statistical inference we implicitly assert that the variation exhibited by data is captured reasonably well by the statistical model” (p. 2). Everything is left as a subjunctive conditional: if the “theoretical world is aligned well with the real world that produced the data” (p. 4) then such and such follows. “Perhaps we might even say that most practitioners are subjunctivists” (p. 7).

  Would either frequentists or Bayesians be content with this? The frequentist error statistician would not, because there’s no detachment of an inference. It remains a subjunctive or ‘would be’ claim that is entirely deductive, not ampliative. Moreover, she insists on checking the adequacy of the model. I doubt the Bayesian would be satisfied with life as a subjunctivist either. The payoff for the extra complexity in positing a prior is the ability to detach the claim that the probability μ is in (10.2 − 2/7, 10.2 + 2/7) is 0.95. Thus, locating the common frequentist/Bayesian ground in assuming the “theoretical world is aligned well with the real world that produced the data” doesn’t get us very far, and it keeps key differences in goals under wraps. We can hear some Bayesian tribe members grumbling at the very assumption that they’re modeling “the real world that produced the data” (p. 4).

  A Deeper Analysis.

  We have developed enough muscle from our workouts to peel back some layers here. The deeper layer only comes near the end of his paper, where Kass tells us what he says to his class (citing Brown and Kass 2009). I’ll number his points, the highlight being (3):

  (1) I explicitly distinguish the use of probability to describe variation and to express knowledge. … [These] are sometimes considered to involve two different kinds of probability … ‘aleatory probability’ and ‘epistemic probability’ [the latter in the sense of] quantified belief … .

  (2) Bayesians merge these, applying the laws of probability to go from quantitative description to quantified belief (ibid., p. 5).

  (3) But in every form of statistical inference aleatory probability is used, somehow, to make epistemic statements (ibid., pp. 5–6).

  What do the frequentist and Bayesian say to (1)–(3)? I can well imagine all tribes mounting resistance. Let’s assume for the moment the model–theory match entitles the detachment. Then we can revisit the frequentist and Bayesian inferences.

  What does the frequentist do to get her epistemological claim, according to Kass? Since the probability that the estimator yields true estimates is 0.95 (performance), Kass’s frequentist can assign probability 0.95 to the estimate. But this is a fallacy. Worse, it’s to be understood as degree of belief (probabilism) rather than confidence (in the performance sense). If a frequentist is not to be robbed of a notion of statistical inference, “epistemic” couldn’t be limited to posterior probabilism. An informal notion of probability can work, but that is still to rob probability of its intended frequentist role.

  What about the pragmatic Bayesian? The pragmatic Bayesian infers ‘I’m 95% sure that μ is in (10.2 − 2/7, 10.2 + 2/7).’ The model, Kass says, gives the variability (here of μ), and “Bayesians merge these” (p. 5), variability with belief. That’s what the theory–real-world match means for the Bayesian. One might rightly wonder if the permission to merge isn’t tantamount to blessing the split personality we are to avoid.

  Suppose you are a student sitting in Professor Kass’s class. Probability as representing random variability is quite different from its expression as degree of belief, or uncertainty of knowledge, Professor Kass begins in (1). Nevertheless, Kass will tell you how to juggle them. If you’re a pragmatic frequentist, you get your inferential claim by slyly misinterpreting the confidence coefficient as a (degree of belief) probability on the estimate. If you’re a pragmatic Bayesian, however, you merge these. Aha, relief. But won’t you wonder at the rationale? Kass might say ‘in the cases where the numbers match, viewing probability as both variability and belief is unproblematic’ for a Bayesian. What about where the Bayesian and frequentist numbers don’t match? Then “statistical pragmatism is agnostic” (p. 7). In such cases, Kass avers, “procedures should be judged according to their performance under … relevant real-world variation” (ibid.). But if the probabilistic assessment doesn’t supply performance measures (as in the case of a mismatch), where do they come from?

  Bayesians are likely to bristle at the idea of adjudicating disagreement by appeal to frequentist performance; whereas a frequentist like Fraser (2011) argues that it’s misleading even to use “probability” if it doesn’t have the performance sense of confidence. Christian Robert (commenting on Fraser) avers: “the Bayesian perspective on confidence (or credible) regions and statements does not claim ‘correct coverage’ from a frequentist viewpoint since it is articulated in terms of the parameters” (Robert 2011, p. 317). He suggests that “the chance identity occurring for location parameters is a coincidence” (ibid.), which raises doubts about a genuine consilience even in the case of frequentist matching. Even where they match, they mean different things.

  Frequentist error probabilities relate to the sampling distribution, where we consider hypothetically different outcomes that could have occurred in investigating this one system of interest. The Bayesian allusion to frequentist ‘matching’ refers to the fixed data and considers frequencies over different systems (that could be investigated by a model like the one at hand).

  (Cox and Mayo 2010, p. 302)

  These may be entirely different hypotheses, even in different fields, in reasoning about this particular H. A reply might be that, when there’s matching, the frequentist uses error probability₁; the Bayesian uses error probability₂ (Section 3.6).

  The frequentist may welcome adjudication by performance, so she feels she’s doing well by Kass. But she still must carve out a kosher epistemological construal, keeping only to frequency. Now we get to the crux of our entire journey, hinted at way back in Excursion 1 (Souvenir D): “we may avoid the need for a different version of probability … by assessing the performance of proposed methods under hypothetical repetition” (Reid and Cox 2015, p. 295). But how? The common answer is for the error probability of the method to rub off on a particular inference in the form of confidence. I grant this is an evidential or epistemological use of probability.¹ Yet the severe tester isn’t quite happy with it. It’s still too performance oriented: we detach the inference, and the performance measure qualifies it by the trustworthiness of the proceeding. Our assessment of well-testedness is different. It’s the sampling distribution of the given experiment that informs us of the capabilities of the method to have unearthed erroneous interpretations of data. From here, we reason, counterfactually, to what is well and poorly warranted. What justifies this reasoning? The severity requirements, weak and strong.²

  We’ve often enough visited the severity interpretation of confidence intervals (Sections 3.7 and 4.3). Kass’s 0.95 interval estimate is (9.92, 10.48). There’s a good indication that μ > 9.92 because if μ were 9.92 (or lower), we very frequently would have gotten a smaller value of X̄ than we did. There’s a poor indication that μ ≥ 10.2 because we’d frequently observe values even larger than we did in a world where μ < 10.2; indeed, we’d expect this around 50% of the time. The evidence becomes indicative of μ < μ′ as μ′ moves from 10.2 toward the upper bound 10.48. What would occur in hypothetical repetitions, under various claims about parameters, is not for future assurance in using the rule, but to understand what caused the events on which this inference is based. If a method is incapable of reflecting changes in the data-generating source, its inferences are criticized on grounds of severity. It’s possible that no parameter values are ruled out with reasonable severity, and we are left with the entire interval.
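
  A minimal sketch of these severity assessments, in Python; the helper name sev_greater is hypothetical, computing the probability of a smaller X̄ than the one observed, were μ fixed at the benchmark μ′:

```python
from scipy.stats import norm

n, xbar, sigma = 49, 10.2, 1.0
se = sigma / n**0.5  # 1/7

def sev_greater(mu_prime):
    # Severity for the claim mu > mu_prime: the probability of
    # observing a smaller value of X-bar than we did, were mu = mu_prime.
    return norm.cdf((xbar - mu_prime) / se)

print(sev_greater(9.92))  # ~0.975: good indication that mu > 9.92
print(sev_greater(10.2))  # 0.5: poor indication that mu > 10.2
```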

 
