Statistical Inference as Severe Testing


by Deborah G. Mayo


  It may seem surprising for a subjective Bayesian like Howson to reject Bayes’ Rule: typically, it’s the subjectivist who recoils at finding the default/non-subjective tribes living in conflict with it. Jon Williamson, a non-subjective Bayesian philosopher in a Carnap-maximum entropy mold,8 identifies the problem in these examples as stemming from two sources of probabilistic information (Williamson 2010). Relative frequency information tells us Pr′(S) = 0.1, but also, since this is known, Pr′(Pr′(S) = 0.1) = 1. Bayes’ Rule holds, he allows, just when it holds. When there’s a conflict with Bayes’ Rule, default Bayesian “updating” takes place to reassign priors.
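  For reference, Bayes’ Rule (conditionalization) says that on learning evidence e, the new probability of S becomes Pr′(S) = Pr(S | e) = Pr(e | S)Pr(S)/Pr(e). Reassigning a default prior, by contrast, replaces the probability function Pr itself – on Williamson’s brand of objective Bayesianism, as I understand it, by recomputing a maximum entropy prior subject to the updated constraints – rather than conditioning the old function on e.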

  The position of Howson and Williamson is altogether plausible if one is stuck with the given assignments. Ian Hacking, who is not a Bayesian, sympathizes, and blames universal Bayesianism. Universal Bayesianism, Hacking (1965, p. 223) remarks, forces Savage (Savage 1962, p. 16) to hold “if you come to doubt [e.g., the Normality of a sample], you must always have had a little doubt”. To Hacking, it’s plausible to be completely certain of something, betting the whole house and more, and later come to doubt it. All the more reason we should be loath to assign probability 1 to “known” data, while seeking a posterior probabilism.

  Bayesian statisticians, at least of the default/non-subjective variety, follow suit, though for different reasons: “Betting incoherency thus seems to be too strong a condition to apply to communication of information” (J. Berger 2006, p. 395). Berger avers that even subjective Bayesianism is not coherent in practice, “except for trivial versions such as always estimate θ ∈ (0, ∞) by 17.35426 (a coherent rule, but not one particularly attractive in practice)” (pp. 396–7). His point appears to be that, while incoherence is part and parcel of default/non-subjective Bayesian accounts, in practice, idealizations lead the subjectivist to be incoherent as well. It gets worse: “in practice, subjective Bayesians will virtually always experience what could be called practical marginalization paradoxes” (p. 397), where posteriors don’t sum to 1. If this is so, it’s very hard to see how they can be happy using any kind of probability logic.

  There are a great many complex twists and turns to the discussions of Dutch books – too many to do justice to with a sample list.

  Can You Change Your Bayesian Prior?

  As an exercise in mathematics [computing a posterior based on the client’s prior probabilities] it is not superior to showing the client the data, eliciting a posterior distribution and then calculating the prior distribution; as an exercise in inference Bayesian updating does not appear to have greater claims than ‘downdating’ …

  (Senn 2011, p. 59)

  If you could really express your uncertainty as a prior distribution, then you could just as well observe data and directly write your subjective posterior distribution, and there would be no need for statistical analysis at all.

  (Gelman 2011, p. 77)
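  Senn’s ‘downdating’ quip is easy to make concrete. The sketch below is mine, with wholly invented numbers: given the likelihoods of the observed data under a few candidate parameter values, show the client the data, elicit a posterior, and run Bayes’ Rule in reverse to recover the prior the client ‘must have had.’

```python
# A minimal sketch of "downdating" (illustrative numbers, not from the text):
# elicit a posterior after the data are seen, then invert Bayes' Rule to
# recover the implied prior.
likelihood = {"theta1": 0.30, "theta2": 0.10, "theta3": 0.05}          # Pr(data | theta)
elicited_posterior = {"theta1": 0.60, "theta2": 0.30, "theta3": 0.10}  # from the client

# Bayes' Rule: posterior ∝ prior × likelihood, hence prior ∝ posterior / likelihood.
raw = {t: elicited_posterior[t] / likelihood[t] for t in likelihood}
total = sum(raw.values())
implied_prior = {t: round(raw[t] / total, 3) for t in raw}

print(implied_prior)  # {'theta1': 0.286, 'theta2': 0.429, 'theta3': 0.286}
```

  As a piece of mathematics, nothing privileges one direction over the other, which is exactly Senn’s point.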

  Lindley’s answer is that it would be a lot harder to be coherent that way; however, he’s prepared to allow: “[I]f a prior leads to an unacceptable posterior then I modify it to cohere with properties that seem desirable in the inference” (Lindley 1971, p. 436). He resists saying the rejected prior was wrong, though. Not wrong? No, just failing to cohere with desirable properties. I. J. Good (1971b) advocated his device of “imaginary results” whereby a subjective Bayesian would envisage all possible results in advance (p. 431) and choose a prior that she can live with regardless of results. Recognizing that his device is so difficult to apply that most are prepared to bend the rules, Good allowed “that it is possible after all to change a prior in the light of actual experimental results” (ibid.) – appealing to an informal, second-order rationality of “type II.”

  So can you change your Bayesian prior? I don’t mean update it, but reject the one you had and replace it with another. I raised this question on my blog (June 18, 2015), hoping to learn what current practitioners think. Over 30 competing answers ensued (from over 100 comments), contributed by statisticians from different tribes. If the answer is yes you can, then how do they avoid the verification biases we are keen to block? Lindley seems to be saying it’s quite open-ended. Cox says, “there is nothing intrinsically inconsistent in changing prior assessments” in the light of data; the danger, however, is that “even initially very surprising effects can post hoc be made to seem plausible” (Cox 2006b, p. 78). Berger had said elicitation typically takes place after “the expert has already seen the data” (2006, p. 392), a fact of life he understandably finds worrisome. If the prior is determined post-data, then it’s not reflecting information independent of the data. All the work would have been done by the likelihoods, normalized to be in the form of a probability distribution or density. No wonder many look askance at changing priors based on the data.

  [N]o conceivable possible constellation of results can cause you to wish to change your prior distribution. If it does, you had the wrong prior distribution and this prior distribution would therefore have been wrong even for cases that did not leave you wishing to change it.

  (Senn 2011, p. 63)

  The prior, after all, is “like already having some data, but what statistical procedure would allow you to change your data?” (Senn 2015b). As with Good’s appeal to type II rationality, Senn is saying this is tantamount to admitting “the informal has to come to the rescue of the formal” (Senn 2011, p. 58), which would otherwise permit counterintuitive results. He makes the interesting point that the post-data adjustment of priors could conceivably be taken account of in the posterior: “If you see there is a problem with the placeholder model and replace it, it may be that you can somehow reflect this ‘sensible cheating’ in your posterior probabilities” (Senn 2015b; see also Senn 2013a). I think Senn is applying a frequentist error statistical mindset to the Bayesian analysis, wherein the posterior might be qualified by an error statistical assessment. Bayesians would need a principle to this effect.

  Dawid (2015) weighs in with his “prequential” approach: “In this approach the prior is constructed, not regarded as given in advance”. Maybe the idea is that the subjective Bayesian is trying to represent her psychological states; a posterior that fails to make sense indicates her first stab at a prior failed to do so, so it makes sense to change it. The main thing for Dawid is to have a coherent package; his Bayesian starts over with a better prior and a new test. But it’s not obvious how you block yourself from engineering the result you want. Gelman and Hennig say “priors in the subjectivist Bayesian conception are not open to falsification … because by definition they have to be fixed before observation” (2017, p. 989). Now Howson’s Bayesian changes his mind post-data, but admittedly this is not the same as falsification. In his comment on the blog discussion, Gelman (2015) says that “if some of the [posterior] inferences don’t ‘make sense,’ this implies that you have additional information that has not been incorporated into the model” and it should be improved. But not making sense might just mean that more information would be necessary to get an answer, not that you rightfully have it. Shouldn’t we worry that, among the many ways you fix things, you choose one that protects or enhances a favored view, even if poorly probed? A reply might be that frequentists worry about data-dependent adjustments as well. There’s one big difference.

  In order for a methodological “worry” to be part of an inference account, it needs an explicit rationale, not generally found in contemporary Bayesianism – though Gelman is an exception. An error statistician changes her model in order to ensure the reported error probabilities are close to the actual ones (whether for performance or severe testing). There seem to be at least two situations where the default/non-subjective Bayesian may start over. The first, already noted, is when there’s a conflict with Bayes’ Rule: “updating” takes place by going back to assign new prior probabilities using a chosen default prior. Philosophers Gaifman and Vasudevan (2012) describe it thus: “… the revision of an agent’s [rational] subjective probabilities proceeds in fits and starts, with periods of conditionalization punctuated by abrupt alterations of the prior” (p. 170).9 Why wouldn’t this be taken as questioning the entire method of reaching a default prior assignment? Surely it relinquishes a key feature Bayesianism claims to provide: a method of accumulating and updating knowledge probabilistically. A second situation might be finding information that statistical assumptions are violated. But this brings up Duhemian problems, as we’ll see in Section 6.6.

  The Bayesian Catchall

  The key obstacle to probabilistic updating, and to viewing an evidential assessment at a given time in terms of a posterior probabilism, is the Bayesian catchall hypothesis. One is supposed to save some probability for a catchall hypothesis: “everything else,” in case new hypotheses are introduced, which they certainly will be. Follow me to the gallery on the 1962 Savage Forum for a snippet from the discussion between Savage and Barnard (Savage 1962, pp. 79–84):

  Barnard: … Professor Savage, as I understood him, said earlier that a difference between likelihoods and probabilities was that probabilities would normalize because they integrate to one, whereas likelihoods will not. Now probabilities integrate to one only if all possibilities are taken into account. This requires in its application to the probability of hypotheses that we should be in a position to enumerate all possible hypotheses which might explain a given set of data. Now I think it is just not true that we ever can enumerate all possible hypotheses. … If this is so we ought to allow that in addition to the hypotheses that we really consider we should allow something that we had not thought of yet, and of course as soon as we do this we lose the normalizing factor of the probability, and from that point of view probability has no advantage over likelihood.

  (p. 80)

  Savage: … The list can, however, always be completed by tacking on a catchall ‘something else.’ … In practice, the probability of a specified datum given ‘something else’ is likely to be particularly vague – an unpleasant reality. The probability of ‘something else’ is also meaningful of course, and usually, though perhaps poorly defined, it is definitely very small.

  Barnard: Professor Savage says in effect, ‘add at the bottom of list H1, H2, … “something else”.’ But what is the probability that a penny comes up heads given the hypothesis ‘something else’? We do not know.

  Suppose a researcher makes the catchall probability small, as Savage recommends, and yet the true hypothesis is not in the set so far envisaged; call this set H. Little by little, data might erode the probabilities in H, but it could take a very long time until the catchall is probable enough for a researcher to begin developing new theories. On the other hand, if a researcher suspects the existing hypothesis set H is inadequate, she might give the catchall a high prior. In that case, Barnard points out, it may be that none of the available hypotheses in H get a high posterior, even if one or more are adequate. Perhaps by suitably restricting the space (“small worlds”) this can work, but the idea of inference as continually updating goes by the board.
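  A toy calculation – mine, with purely illustrative numbers – shows how slowly a small catchall prior can grow even when the truth lies outside H. Two listed hypotheses get equal per-datum likelihoods; the catchall’s likelihood is ‘particularly vague,’ so any value we give it is itself a stipulation:

```python
# Illustrative sketch of Barnard's worry: hypotheses H1 and H2 plus a
# catchall C ("something else") with a small prior, as Savage recommends.
# All likelihood values are stipulations; the truth lies outside {H1, H2}.
posterior = {"H1": 0.55, "H2": 0.40, "C": 0.05}
likelihood = {"H1": 0.20, "H2": 0.20, "C": 0.21}   # Pr(x | h) per observation

for n in range(1, 151):
    norm = sum(posterior[h] * likelihood[h] for h in posterior)          # Pr(x)
    posterior = {h: posterior[h] * likelihood[h] / norm for h in posterior}
    if n in (1, 30, 60, 150):
        print(n, {h: round(p, 3) for h, p in posterior.items()})
# Even with a per-datum edge of 0.21 vs. 0.20, the catchall needs about
# 60 observations to pass 0.5; a vaguer edge stretches this indefinitely.
```

  Raising the catchall’s prior instead drags down the posterior of every member of H, which is Barnard’s worry from the other direction.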

  The open-endedness of science is essential – as pointed out by Nelder and Sprott. The severe tester agrees. Posterior probabilism, with its single probability pie, is inimical to scientific discovery. Barnard’s point at the Savage Forum was: why not settle for comparative likelihoods? I think he has a point, but, for error control, that limited us to predesignated hypotheses. Nelder was a Likelihoodist, and there’s a lot of new work that goes beyond Royall’s Likelihoodism – suitable for future journeys. The error statistician still seeks an account of severe testing, and it’s hard to see that comparativism can ever give that. Despite science’s open-endedness, hypotheses can pass tests with high severity. Accompanying reports of poorly tested claims point the way to novel theories. Remember Neyman’s modeling the variation in larvae hatched from moth eggs (Section 4.8)? As Donald Gillies (2001) stresses, “Neyman did not consider any hypotheses other than that of the Poisson distribution” (p. 366) until it was refuted by statistical tests, which stimulated developing alternatives.

  Yet it is difficult to see how all these changes in degrees of belief by Bayesian conditionalisation could have produced the solution to the problem, … The Bayesian mechanism seems capable of doing no more than change the statistician’s degree of belief in particular values of λ [in the Poisson distribution].

  (Gillies 2001, p. 367)

  At the stage of inventing new models, Box had said, the Bayesian should call in frequentist tests. This is also how GTR and HEP scientists set out to extend their theories into new domains. In describing the goal of “efficient tests of hypotheses,” Pearson said that if a researcher is going to have to abandon his hypothesis, he would like to do so quickly. The Bayesian, Gillies observes, might have to wait a very long time or never discover the problem (ibid., p. 368). By contrast, “The classical statisticians do not need to indulge in such toil. They can begin with any assumption (or conjecture) they like, provided only they obey the golden rule of testing it severely” (ibid., p. 376).

  Souvenir Y: Axioms Are to Be Tested by You (Not Vice Versa)

  Axiomatic Challenge.

  What do you say if you’re confronted with a very authoritative-sounding challenge like this: to question classic subjective Bayesian tenets (e.g., that your beliefs are captured by probability, must be betting coherent, and are updated via Bayes’ Rule) is to run up against accepted mathematical axioms? First, recall a point from Section 2.1: you’re free to use any formal deductive system; the issue will be soundness. Axioms can’t run up against empirical claims: they are formal stipulations of a system that gets meaning, and thus truth value, by interpretations. Carefully cashed out, the axioms they have in mind subtly assume your beliefs are well represented by probability, and usually that belief change follows Bayes’ Theorem. If this captures your intuitions, fine, but there’s no non-circular proof of this.

  Empirical Studies.

  We skipped over a wing of the museum that is at least worth mentioning: there have been empirical studies over many years that refute the claim that people are intuitive Bayesians: “we need not pursue this debate any further, for there is now overwhelming empirical evidence that no Bayesian model fits the thoughts or actions of real scientists” (Giere 1988, p. 149). The empirical studies refer to experiments conducted since the 1960s to assess how well people obey Bayes’ Theorem. These experiments, such as those performed by Daniel Kahneman, Paul Slovic, and Amos Tversky (1982), reveal substantial deviations from the Bayesian model even in simple cases where the prior probabilities are given, and even with statistically sophisticated subjects. Some of the errors may result from terminology, such as the ordinary tendency to treat “probable” and “likely” as synonyms. I don’t know if anyone has debunked the famous “Linda paradox” this way, but given the data, it’s more likely that Linda’s a feminist and a bank teller than that she’s a bank teller, in the technical sense of “likely.” Gerd Gigerenzer (1991) gives a thorough analysis showing that rephrasing the most popular probability violations in frequentist terms makes them disappear.
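  To spell out that technical sense, with illustrative numbers of my own: writing E for the description given of Linda, the likelihood of a hypothesis h is Pr(E | h). Nothing prevents Pr(E | Linda is a feminist bank teller) = 0.9 while Pr(E | Linda is a bank teller) = 0.1, making the conjunction nine times likelier; the probability calculus would be violated only by holding Pr(feminist bank teller | E) > Pr(bank teller | E), since a conjunction can never be more probable than either of its conjuncts.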

  What is called in the heuristics and biases literature the “normative theory of probability” or the like is in fact a very narrow kind of neo-Bayesian view …

  (p. 86)

  … Since “cognitive illusions” tend to disappear in frequency judgments, it is tempting to think of the intuitive statistics of the mind as frequentist statistics.

  (ibid., p. 104)

  While interesting in their own right, I don’t regard these studies as severe tests of whether Bayesian models are a good representation for scientific inference. Why? Because in these experiments the problem is set up to be one in which the task is calculating probabilities; the test-taker is right to assume they are answerable by probabilities.

  Normative Epistemology.

  We have been querying the supposition that what we really want for statistical inference is a probabilism. What might appear as a direct way to represent beliefs may not at all be a direct way to use probability for a normative epistemology, to determine claims that are and are not evidentially warranted. An adequate account must be able to falsify claims statistically, and in so doing it’s always from demonstrated effects to hypotheses, theories, or models. Neither a posterior probability nor a Bayes factor falsifies. Even to corroborate a real effect depends on falsifying “no effect” hypotheses. Granted, showing that you have a genuine effect is just a first step in the big picture of scientific inference. You need also to show you’ve correctly pinpointed causes, that you can triangulate with other ways of measuring the same quantity, and, more strongly still, that you understand a phenomenon well enough to exploit it to probe new domains. These abilities are what demarcate science and non-science (Section 2.3). Formal statistics hardly makes these assessments automatic, but we want piecemeal methods ready to serve these ends. If our language had kept to the root of probability, probare, to demonstrate or show how well you can put a claim to the test, and have it survive, we’d find it more natural to speak of claims being well probed rather than highly probable. Severity is not to be considered the goal of science or a sum-up of the growth of knowledge, but it has a crucial role in statistical inference.

  Someone is bound to ask: can a severity assessment be made to obey the probability axioms? If the severity for the statistical hypothesis H is high, then little problem arises in having a high degree of belief in H. But we know the axioms don’t hold. Consider H: humans will be cloned by 2030. Both H and ~H are poorly tested on current evidence. This always happens unless one of H, ~H is corroborated. Moreover, passing with low severity isn’t akin to having a little bit of evidence, but rather to having no evidence to speak of, or a poor test. What if we omitted cases of low severity due to failed audits (from violated assumptions or selection effects)? I still say no, but committed Bayesians might want to try. Since it would require the assessments to make use of sampling distributions and all that error statistics requires, it could at most be seen as a kind of probabilistic bookkeeping of inferences done in an entirely different way.

 
