
Statistical Inference as Severe Testing


by Deborah G. Mayo


  What then of Kass’s attempt at peace? It may well describe a large tribe of “contemporary attitudes.” As Larry Wasserman points out, if you ask people if a 95% CI means 95% of the time the method used gets it right, they will say yes. If you ask if 0.95 describes their belief in a resulting estimate, they may say yes too (2012c). But Kass was looking for respite from conceptual confusion. Maybe all Kass is saying is that in the special case of frequentist matching priors, the split personality scarcely registers, and no one will notice.

  A philosophy limited to frequentist matching results would very much restrict the Bayesian edifice. Notice we’ve been keeping to the simple examples on which nearly all statistics battles are based. Here matching is at least in principle possible. Fraser (2011, p. 299) shows that when we move away from examples on “location,” as with mean μ, the matching goes by the board. We cannot be accused of looking to examples where the frequentist and Bayesian numbers necessarily diverge, no “gotcha” examples from our side. Nevertheless, they still diverge because of differences in meaning and goals.

  Optional Stopping and Bayesian Intervals

  Disagreement on numbers can also be due to different attitudes toward gambits that alter error probabilities but not likelihoods. Way back in Section 1.5, we illustrated a two-sided Normal test, i.e., H₀: μ = 0 vs. H₁: μ ≠ 0, σ = 1, with a rule that keeps sampling until H₀ is rejected. At the 1962 Savage Forum, Armitage needled Savage that the same thing happens with Bayesian methods:

  The departure of the mean by two standard errors corresponds to the ordinary five per cent level. It also corresponds to the null hypothesis being at the five per cent point of the posterior distribution.

  (Armitage 1962, p. 72)

  The identical point can be framed in terms of the corresponding 95% confidence interval method. I follow Berger and Wolpert (1988, pp. 80–1).

  Suppose a default/non-subjective Bayesian [they call him an “objective conditionalist”] states that with a fixed sample size n, he would use the interval

  x̄ ± 1.96/√n.

  He “would not interpret confidence in the frequency sense” but instead would “use a posterior Bayesian viewpoint with the non-informative prior density π(θ) = 1, which leads to a posterior” [given the variance is 1] (ibid., p. 80).
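
  To see why the numbers match in this simple case, here is a minimal sketch under the stated assumptions (σ = 1, flat prior); the notation is mine, not Berger and Wolpert’s. With X₁, …, Xₙ i.i.d. N(θ, 1) and π(θ) = 1, the posterior is θ | x ~ N(x̄, 1/n), so the central 95% posterior interval is again x̄ ± 1.96/√n. The 0.95, however, now attaches to θ given the data, not to the method’s long-run coverage.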

  Consider the rule: keep sampling until the 95% confidence interval excludes 0.

  Berger and Wolpert concede that the Bayesian “being bound to ignore the stopping rule will still use [the above interval] as his confidence interval, but this can never contain zero” (ibid., p. 81). The experimenter using this stopping rule “has thus succeeded in getting the [Bayesian] conditionalist to perceive that θ ≠ 0, and has done so honestly” (ibid.). This is so despite the Bayesian interval assigning a probability of 0.95 to the interval estimate. “The ‘misleading,’ however, is solely from a frequentist viewpoint, and will not be of concern to a conditionalist” (ibid.). Why are they unconcerned?
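
  To make the stopping-rule effect concrete, here is a minimal simulation sketch in Python (my own illustration, not from Berger and Wolpert): each trial samples from N(0, 1), so θ really is 0, and keeps going until the nominal 95% interval x̄ ± 1.96/√n excludes 0, up to a cap of 10,000 observations.

    import numpy as np

    rng = np.random.default_rng(2018)

    def stopping_n(n_max=10_000):
        """Keep sampling X_i ~ N(0, 1); stop once the nominal 95% CI excludes 0."""
        x = rng.normal(0.0, 1.0, size=n_max)
        n = np.arange(1, n_max + 1)
        xbar = np.cumsum(x) / n                                # running sample mean
        hits = np.flatnonzero(np.abs(xbar) > 1.96 / np.sqrt(n))
        return int(hits[0]) + 1 if hits.size else None         # stopping n, or None if capped

    trials = 2_000
    stops = [stopping_n() for _ in range(trials)]
    rejections = sum(s is not None for s in stops)
    print(f"Interval excluded 0 in {rejections / trials:.1%} of trials "
          f"(a fixed-n 95% interval would do so only 5% of the time).")

  Remove the cap and the exclusion is guaranteed eventually (by the law of the iterated logarithm), which is the frequentist’s complaint; the likelihood, and hence the flat-prior posterior, is unchanged by the stopping rule, which is why the conditionalist is unmoved.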

  It’s hard to pin down their response; they go on to other examples.3 I take it they are unconcerned because they are not in the business of computing frequentist error probabilities. From their perspective, taking into account the stopping rule is tantamount to taking account of the experimenter’s intentions (when to stop). Moreover, from the perspective of what the agent believes, Berger and Wolpert explain, he is not really being misled. They further suggest we should trust our intuitions about the Likelihood Principle in simple situations “rather than in extremely complex situations such as” with our stopping rule (ibid., p. 83).

  As they surmise, this won’t “satisfy a frequentist’s violated intuition” (ibid.). It kills her linchpin for critically interpreting the data.4

  It isn’t that the stopping rule problem is such a big deal; but it’s routinely given as an exemplar in favor of ignoring the sampling distribution, and Berger and Wolpert (1988) is the standard to which Bayesian texts refer. Ironically, as Roderick Little observes: “This example is cited as a counterexample by both Bayesians and frequentists! If we statisticians can’t agree which theory this example is counter to, what is a clinician to make of this debate?” (Little 2006, p. 215). But by 2006, Berger embraces default priors that are model dependent, “leading to violations of basic principles, such as the likelihood principle and the stopping rule principle” (Berger 2006, p. 394). This is all part of the abandonment of Bayesian foundations. Still, he admits to having “trouble with saying that good frequentist coverage is a necessary requirement” (ibid., p. 463). Conditioning on the data, he says, is more important. “Since the calibrated Bayes inference is Bayesian, there are no penalties for peeking …” (Little 2006, p. 220). The disparate attitudes in default/non-subjective Bayesian texts are common – even in the same book.

  For example, at the opening of Ghosh et al. (2010) we hear of the “stopping rule paradox in classical inference” (p. 38), happily avoided by the Bayesian. Then later on it’s granted: “Inference based on objective [default] priors does violate the stopping rule principle, which is closely related to the likelihood principle” (p. 148). Arguably, any texts touting the stopping rule principle, while using default priors that violate it, should grant as much. Yet this doesn’t bring conceptual clarity to their readers. Why are they upending their core principle?5 Readers are assured that the violations of the Likelihood Principle are “minor,” so the authors’ hearts aren’t in it. An error statistician wants major violations of a principle that denies the relevance of error probabilities. If the violation is merely an unfortunate quirk growing from the desire for a post-data probabilism, but with default priors, it may seem they gave up on those splendid foundations too readily. The non-subjective Bayesian might be seen as left in the worst of both frequentist and Bayesian worlds. They violate dearly held principles and relinquish the simplicity of classical Bayesianism, while coming up short on error statistical performance properties.

  6.6 Error Statistical Bayesians: Falsificationist Bayesians

  A final view, which can be seen as either more middle of the road or more extreme, is that of Andrew Gelman (writing both on his own and with others). It’s the former for an error statistician, the latter for a Bayesian. I focus on what Gelman calls “falsificationist Bayesianism” as developed in Gelman and Shalizi (2013). You could see it as the outgrowth of an error-statistical-Bayesian pow-wow: Shalizi being an error statistician and Gelman a Bayesian. But it’s consistent with Gelman’s general (Bayesian) research program both before and after, and I don’t think anyone can come away from it without recognizing how much statistical foundations are currently in flux. I begin by listing three striking points in their work (Mayo 2013b).

  (1) Methodology is ineluctably bound up with philosophy. If nothing else, “strictures derived from philosophy can inhibit research progress” (Gelman and Shalizi 2013, p. 11), as when Bayesians are reluctant to test their models either because they assume they’re subjective, or because checking involves non-Bayesian methods (Section 4.9).

  (2) Bayesian methods need a new foundation. Although the subjective Bayesian philosophy, “strongly influenced by Savage (1954), is widespread and influential in the philosophy of science (especially in the form of Bayesian confirmation theory …),” and while many practitioners see the “rising use of Bayesian methods in applied statistical work” (ibid., p. 9) as supporting this Bayesian philosophy, the authors flatly declare that “most of the standard philosophy of Bayes is wrong” (p. 10, n. 2). While granting that “a statistical method can be useful even if its common philosophical justification is in error” (ibid.), their stance will rightly challenge many a Bayesian. This is especially so in considering their third thesis.

  (3) The new foundation uses error statistical ideas. While at first professing that their “perspective is not new,” but rather follows many other statisticians in regarding “the value of Bayesian inference as an approach for obtaining statistical methods with good frequency properties” (p. 10), they admit they are “going beyond the evaluation of Bayesian methods based on their frequency properties – as recommended by Rubin (1984), Wasserman (2006), among others” (p. 21). “Indeed, crucial parts of Bayesian data analysis, such as model checking, can be understood as ‘error probes’ in Mayo’s sense” (p. 10), which might be seen as using modern statistics to implement the Popperian criteria of severe tests.

  Testing in Their Data-Analysis Cycle

  Testing in their “data-analysis cycle” involves a “non-Bayesian checking of Bayesian models.” This is akin to Box’s eclecticism, except that their statistical analysis is used “not for computing the posterior probability that any particular model was true – we never actually did that” (p. 13), but rather “to fit rich enough models” and, upon discerning that aspects of the model “did not fit our data”, to build a more complex and better fitting model, which in turn called for alteration when faced with new data. They look to “pure significance testing” (p. 20) with just a null hypothesis, where no specific alternative models are considered. In testing for misspecifications, we saw, the null hypothesis asserts that a given model is adequate, and a relevant test statistic is sought whose distribution may be computed, at least under the null hypothesis (Section 4.9).

  They describe their P-values as “generalizations of classical p-values, replacing point estimates of parameters θ with averages over the posterior distribution …” (p. 18). If a pivotal characteristic is available, their approach matches the usual significance test. The difference is that where the error statistician estimates the “nuisance” parameters, they supply them with priors. It may require a whole hierarchy of priors so that, once integrated out, you are left assigning probabilities to a distance measure d(X). This allows complex modeling, and the distance measures aren’t limited to those independent of unknowns, but if it seems hard to picture the distribution of μ, try to picture the distribution of a whole iteration of parameters. These will typically be default priors, with their various methods and meanings as we’ve seen. The approach is a variant on the “Bayesian P-value” research program, developed by Gelman, Meng, and Stern (1996) (Section 4.8). The role of the sampling distribution is played by what they call the posterior predictive distribution. The usual idea of the sampling distribution now refers to the probability of d(X) > d(x) in future replications, averaging over all the priors. These are generally computed by simulation. Their P-value is a kind of error probability. Like the severe tester, I take it the concern is well-testedness, not ensuring good long-run performance decisions about misspecification. So, to put it in severe testing terms, if the model, which now includes a series of priors, was highly incapable of generating results so extreme, they infer a statistical indication of inconsistency. Some claim that, at least for large sample sizes, Gelman’s approach leads essentially to “rediscovering” frequentist P-values (Ghosh et al. 2010, pp. 181–2; Bayarri and Berger 2004), which may well be the reason we so often agree. Moreover, as Gelman and Shalizi (2013, p. 18, n. 11) observe, all participants in the Bayesian P-value program implicitly “disagree with the standard inductive view” of Bayesianism – at least insofar as they are engaged in model checking (ibid.).
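
  To fix ideas, here is a toy sketch (my own, not one of Gelman, Meng, and Stern’s examples) of a posterior predictive P-value for yᵢ ~ N(μ, 1) with the flat default prior π(μ) = 1, so that the posterior is μ | y ~ N(ȳ, 1/n); the discrepancy, the largest absolute observation, is chosen purely for illustration.

    import numpy as np

    rng = np.random.default_rng(1996)

    def posterior_predictive_pvalue(y, n_rep=5_000):
        """Pr(d(Y_rep) >= d(y) | y), with replications averaged over the posterior."""
        n = len(y)
        d_obs = np.max(np.abs(y))                        # discrepancy on the observed data
        exceed = 0
        for _ in range(n_rep):
            mu = rng.normal(np.mean(y), 1 / np.sqrt(n))  # draw mu from its posterior
            y_rep = rng.normal(mu, 1.0, size=n)          # replicated data, same size and shape
            exceed += np.max(np.abs(y_rep)) >= d_obs
        return exceed / n_rep

    y = rng.normal(0.0, 3.0, size=50)   # data more dispersed than the assumed model allows
    print("posterior predictive P-value:", posterior_predictive_pvalue(y))

  A small value indicates that replications from the fitted model – prior and likelihood together – rarely yield so extreme a discrepancy; that is the sense in which their P-value functions as a kind of error probability.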

  Non-significant results: They compute whether “the observed data set is the kind of thing that the fitted model produces with reasonably high probability” assuming the replicated data are of the same size and shape as y0, “generated under the assumption that the fitted model, prior and likelihood both, is true” (2013, pp. 17–18). If the Bayesian P-value is reasonably high, then the data are “unsurprising if the model is true” (ibid., p. 18). However, as the authors themselves note, “whether this is evidence for the usefulness of the model depends how likely it is to get such a high p-value when the model is false: the ‘severity’ of the test” (ibid.) associated with H: the adequacy of the combined model with prior. I’m not sure if they entertain alternatives needed for this purpose.

  Significant results: A small P-value, on the other hand, is taken as evidence of incompatibility between model and data. The question that arises here is: what kind of incompatibility are we allowed to say this is evidence of? The particular choice of test statistic goes hand in hand with a type of discrepancy from the null. But this does not exhaust the possibilities. It would be fallacious to take a small P-value as directly giving evidence for a specific alternative that “explains” the effect – at least not without further work to pass the alternative with severity. (See Cox’s taxonomy, Section 3.3.) Else there’s a danger of a fallacy of rejection we’re keen to avoid.

  What lets the ordinary P-value work in M-S tests is that the null, a mere implicationary assumption, allows a low P-value to point the finger at the null – as a hypothesis about what generated this data. Can’t they still see their null hypothesis as implicationary? Yes, but indicating a misfit doesn’t pinpoint what’s warranted to blame. The problem traces back to the split personality involved in interpreting priors. In most Bayesian model testing, it seems the prior is kept sufficiently vague (using default priors), so that the main work is finding flaws in the likelihood part of the model. They claim their methods provide ways for priors to be tested. To check if something is satisfying its role, we had better be clear on what its intended role is. Gelman and Shalizi tell us what a prior need not be: it need not be a default prior (p. 19), nor need it represent a statistician’s beliefs. They suggest the model combines the prior and the likelihood, “each of which represents some compromise among scientific knowledge, mathematical convenience, and computational tractability” (p. 20). It may be “a regularization device” (p. 19) to smooth the likelihood, making fitted models less sensitive to details of the data. So if the prior fails the goodness-of-fit test, it could mean it represented false beliefs, or that it was not so convenient after all, or …? Duhemian problems loom large; there are all kinds of things one might consider changing to make it all fit.

  There is no difficulty with the prior serving different functions, so long as its particular role is pinned down for the given case at hand.6 Since Gelman regards the test as error statistical, it might work to use the problem-solving variation on severe testing (Souvenir U), the problem, I surmise, being one of prediction. For prediction, it might be that the difference between M-S testing and Gelman’s tests is more of a technical issue (about the best way to deal with nuisance parameters), and less a matter of foundations, on which we often seem to agree. I don’t want to downplay differences and others are better equipped to locate them.7

  What’s most provocative and welcome is that by moving away from probabilisms (posteriors and Bayes factors) and inviting error statistical falsification, we alter the conception of which uses of probability are direct, and which indirect.8 In Gelman and Hennig (2017), “falsificationist Bayesianism” is described as:

  a philosophy that openly deviates from both objectivist and subjectivist Bayesianism, integrating Bayesian methodology with an interpretation of probability that can be seen as frequentist in a wide sense and with an error statistical approach to testing assumptions …

  (p. 991)

  In their view, falsification requires something other than probabilism of any type. “Plausibility and belief models can be modified by data in ways that are specified a priori, but they cannot be falsified by data” (p. 991). Actually any Bayesian (or even Likelihoodist) account can become falsificationist, indirectly, by adding a falsification rule – provided it has satisfactory error probabilities. But Gelman and Hennig are right that subjective and default Bayesianism, in current formulations, do not falsify, although they can undergo prior redos or shifts. The Bayesian probabilist regards error probabilities as indirect because they seek a posterior; for the Bayesian falsificationist, like the severe tester, the shoe is on the other foot.

  Souvenir Z: Understanding Tribal Warfare

  We began this tour asking: Is there an overarching philosophy that “matches contemporary attitudes”? More important is changing attitudes. Not to encourage a switch of tribes, or even a tribal truce, but something more modest and actually achievable: to understand and get beyond the tribal warfare. To understand them, at minimum, requires grasping how the goals of probabilism differ from those of probativeness. This leads to a way of changing contemporary attitudes that is bolder and more challenging. Snapshots from the error statistical lens let you see how frequentist methods supply tools for controlling and assessing how well or poorly warranted claims are. All of the links, from data generation to modeling, to statistical inference and from there to substantive research claims, fall into place within this statistical philosophy. If this is close to being a useful way to interpret a cluster of methods, then the change in contemporary attitudes is radical: it has never been explicitly unveiled. Our journey was restricted to simple examples because those are the ones fought over in decades of statistical battles. Much more work is needed. Those grappling with applied problems are best suited to develop these ideas, and see where they may lead. I never promised, when you bought your ticket for this passage, to go beyond showing that viewing statistics as severe testing will let you get beyond the statistics wars.

  6.7 Farewell Keepsake

  Despite the eclecticism of statistical practice, conflicting views about the roles of probability and the nature of statistical inference – holdovers from long-standing frequentist–Bayesian battles – still simmer below the surface of today’s debates. Reluctance to reopen wounds from old battles has allowed them to fester. To assume all we need is an agreement on numbers – even if they’re measuring different things – leads to statistical schizophrenia. Rival conceptions of the nature of statistical inference show up unannounced in the problems of scientific integrity, irreproducibility, and questionable research practices, and in proposed methodological reforms. If you don’t understand the assumptions behind proposed reforms, their ramifications for statistical practice remain hidden from you.

 
