
Statistical Inference as Severe Testing


by Deborah G Mayo


  Beyond Persuasion and Coercion

  The true blue subjectivists regard the call to free Bayesianism from beliefs as a cop-out. As they see it, statisticians ought to take responsibility for their personal assessments.

  To admit that my model is personal means that I must persuade you of the reasonableness of my assumptions in order to convince you … To claim objectivity is to try to coerce you into consenting, without requiring me to justify the basis for the assumptions.

  (Kadane 2006, p. 434)

  The choice should not be persuasion or coercion. Perhaps the persuasion ideal served at a time when a small group of knowledgeable Bayesians could be counted on to rigorously critique each other’s outputs. Now we have massive data sets and powerful data-dredging tools. What about the allegation of coercion? I guess being told it’s an Objective prior (with a capital O) can sound coercive. Yet anyone who has met Jim Berger will be inclined to agree that the line between persuasion and coercion is quite thin. His assurances that we’re controlling error probabilities (even if he’s slipped into error probability₂) can feel more seductive than coercive (Excursion 3).

  Wash-out Theorems

  If your prior beliefs are not too extreme, and if model assumptions hold, then if you continually observe data on H and update by Bayes’ Theorem, in some long run the posteriors will converge – assuming your beliefs about the likelihoods providing a random sample are correct. It isn’t just that these wash-out theorems have limited guarantees or that they depend on agents assigning non-zero priors to the same set of hypotheses, or that even with non-extreme prior probabilities, and any body of evidence, two scientists can have posteriors that differ by arbitrary amounts (Kyburg 1992, p. 146); it’s that appeals to consilience of beliefs in an asymptotic long run have little relation to the critical appraisal that we demand regarding the case at hand. The error statistician, and the rest of us, can and will raise criticisms of bad evidence, no test (BENT) regarding today’s study. Ironically, the Bayesians appeal to a long run of repeated applications of Bayes’ Theorem to argue that their priors would wash out eventually. Look who is appealing to long runs! Fisher’s response to the possibility of priors washing out is that far from showing their innocuousness, “we may well ask what the expression is doing in our reasoning at all, and whether, if it were altogether omitted, we could not without its aid draw whatever inferences may, with validity, be inferred from the data” (Fisher 1934b, p. 287).
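  The wash-out behavior described above can be sketched numerically. What follows is a minimal illustration, not anything taken from the text: it assumes a conjugate Beta-Binomial model with i.i.d. Bernoulli data, and the function names, the two priors Beta(2, 8) and Beta(8, 2), and the true success rate 0.3 are all hypothetical choices made for the example.

```python
import random

def posterior_mean(a, b, successes, n):
    """Mean of the Beta(a + successes, b + n - successes) posterior."""
    return (a + successes) / (a + b + n)

def washout_gaps(prior_1, prior_2, true_p=0.3,
                 checkpoints=(0, 10, 100, 1000, 10000), seed=0):
    """Gap between two agents' posterior means as shared data accumulate.

    Each prior is an (a, b) pair for a Beta(a, b) distribution; both
    agents see the same simulated Bernoulli(true_p) observations.
    """
    rng = random.Random(seed)
    successes = 0
    gaps = {}
    for n in range(max(checkpoints) + 1):
        if n > 0:
            successes += rng.random() < true_p  # a bool adds as 0 or 1
        if n in checkpoints:
            m1 = posterior_mean(*prior_1, successes, n)
            m2 = posterior_mean(*prior_2, successes, n)
            gaps[n] = abs(m1 - m2)
    return gaps

# One agent starts out skeptical, the other optimistic (illustrative priors).
gaps = washout_gaps((2, 8), (8, 2))
for n, gap in gaps.items():
    print(n, round(gap, 4))
```

  Under these assumptions the gap shrinks toward zero only as n grows large; after ten observations the two agents still disagree substantially. The convergence guarantee is asymptotic and says little about how far apart two analysts are on the study actually in hand.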

  Take Account of “Subjectivities”

  This is radically ambiguous! A well-regarded position about objectivity in science is that it is best promoted by excluding personal opinions, biases, preferences, and interests; if you can’t exclude these, you ought at least to take account of them. How should you do this? It seems obvious it should be done in a way that excludes biasing influences from claims as to what the data have shown. Or if not, their influence should be made explicit in a report of findings. There’s a very different way of “taking account of” them. To wit: view them as beliefs in the claim of inquiry, quantify them probabilistically, and blend them into the data. If they are to be excluded, they can’t at the same time be blended; one can’t have it both ways. Consider some exhibits regarding taking account of biases.

  Exhibit (i): Prior Probabilities Let Us Be Explicit about Bias.

  There’s a constellation of positions along these lines, but let’s consider Nate Silver, the well-known pollster and data analyst. I was sitting in the audience when he gave the invited president’s address for the American Statistical Association in 2013. He told us the reason he favored the Bayesian philosophy of statistics is that people – journalists in particular – should be explicit about the biases, prior conceptions, wishes, and goals that invariably enter into the collection and interpretation of data.

  How would this work, say in FiveThirtyEight, the online statistically based news source of which Silver is editor-in-chief? Perhaps it would go like this: if a journalist is writing on, say, GM foods, she should declare at the outset she believes their risks are exaggerated (or the other way around). Then the reader can understand that her selection and interpretation of facts was through the lens of the “GM is safe” theory. Isn’t this tantamount to saying she’s unable to evaluate the data impartially – belying the goal of news based on “hard numbers”? Perhaps to some degree this is true. However, if people are inclined to see the world using tunnel vision, what’s the evidence they’d be able or willing to be explicit about their biases? Imagine for the moment they would. Suppose further that prior probabilities are to be understood as expressing these biases – say the journalist’s prior probability in GM risks is low.

  Now if the prior was kept separate, readers could see if the data alone point to increased GM risks. If so, they reveal how the journalist’s priors biased the results. But if only the posterior probability was reported, they cannot. Even reporting the priors may not help if they’re complicated, which, to an untutored reader, they always are. Further, how is the reader to even trust the likelihoods? Even if they could be, why would the reader want the journalist to blend her priors – described by Silver as capturing biases – into the data? It would seem to be just the opposite. Someone might say they checked the insensitivity of an inference over a range of priors. That can work in some cases, but remember they selected the priors to look at. To you and me, these points seem to go without saying, but in today’s environment, it’s worth saying them.

  Exhibit (ii): Prior Probabilities Allow Combining Background Information with Data.

  In a published, informal spoken exchange between Cox and me, the question of background information arose.

  Cox: Fisher’s resolution of this issue in the context of the design of experiments was essentially that in designing an experiment you do have all sorts of prior information, and you use that to set up a good experimental design. Then when you come to analyze it, you do not use the prior information. In fact you have very clever ways of making sure that your analysis is valid even if the prior information is totally wrong.

  (Cox and Mayo 2011, pp. 104–5)

  Mayo: But they should use existing knowledge.

  Cox: Knowledge yes … It’s not evidence that should be used if let’s say a group of surgeons claim we are very, very strongly convinced, maybe to probability 0.99, that this surgical procedure works and is good for patients, without inquiring where the 0.99 came from. It’s a very dangerous line of argument. But not unknown.

  (ibid., p. 107)

  Elsewhere, Cox remarks (2006a, p. 200):

  Expert opinion that is not reasonably firmly evidence-based may be forcibly expressed but is in fact fragile. The frequentist approach does not ignore such evidence but separates it from the immediate analysis of the specific data under consideration.

  Admittedly, frequentists haven’t been clear enough as to the informal uses of background knowledge, especially at the stage of “auditing.” They leave themselves open to the kind of challenge Andrew Gelman (2012) puts to Cox, in reference to Cox and Mayo (2011).

  Surely, Gelman argues, there are cases where the background knowledge is so strong that it should be used in the given inference.

  Where did Fisher’s principle go wrong here? The answer is simple – and I think Cox would agree with me here. We’re in a setting where the prior information is much stronger than the data. … it is essential to use prior information (even if not in any formal Bayesian way) to interpret the data and generalize from sample to population.

  (Gelman 2012, p. 53)

  Now, in the same short paper, Gelman, who identifies as Bayesian, declares: “Bayesians Want Everybody Else to be Non-Bayesian.”

  Bayesian inference proceeds by taking the likelihoods from different data sources and then combining them with a prior (or, more generally, a hierarchical model). The likelihood is key. … No funny stuff, no posterior distributions, just the likelihood … I don’t want everybody coming to me with their posterior distribution – I’d just have to divide away their prior distributions before getting to my own analysis.

  (ibid., p. 54)

  No funny stuff, no posterior distributions, says Gelman. Thus, he too is recommending the priors and likelihoods be kept separate, at least for this purpose (scrutinizing an inquiry using background).
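  Gelman’s remark about having to “divide away” a prior can be made concrete with a small grid calculation. This is only a sketch under stated assumptions: the binomial data (7 successes, 3 failures), the Beta(8, 2)-shaped prior, and the helper `normalize` are hypothetical and chosen for illustration; nothing here comes from Gelman’s paper.

```python
def normalize(weights):
    """Rescale non-negative weights so they sum to 1."""
    total = sum(weights)
    return [w / total for w in weights]

# Grid of candidate success probabilities, avoiding 0 and 1 so that
# the prior is strictly positive everywhere we divide by it.
grid = [i / 100 for i in range(1, 100)]

# Illustrative data: 7 successes and 3 failures.
likelihood = [p**7 * (1 - p)**3 for p in grid]

# Illustrative analyst's prior, proportional to a Beta(8, 2) density.
prior = [p**7 * (1 - p) for p in grid]

# What the analyst would report if only the posterior were published.
posterior = normalize([l * pr for l, pr in zip(likelihood, prior)])

# A reader who knows the prior can divide it away to get back the
# (normalized) likelihood; a reader who does not know it cannot.
recovered = normalize([post / pr for post, pr in zip(posterior, prior)])
expected = normalize(likelihood)
max_err = max(abs(r - e) for r, e in zip(recovered, expected))

likelihood_peak = grid[likelihood.index(max(likelihood))]
posterior_peak = grid[posterior.index(max(posterior))]
print(likelihood_peak, posterior_peak, max_err)
```

  In this toy case the likelihood peaks at 0.7 while the posterior peaks near 0.78, so a report of the posterior alone does not let the reader see what the data by themselves indicate unless the prior is also disclosed.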

  So is he agreeing or disagreeing with Cox? Perhaps Gelman is saying: don’t combine the prior with the likelihood, but allow well-corroborated background to be used as grounds for scrutinizing, or, in my terms, conducting an “audit” of, the statistical inference. A statistical inference fails an audit if either the statistical assumptions aren’t adequately met, or the error probabilities are invalidated by biasing selection effects. In that case there’s no real disagreement with Cox’s use of background. Still, there is something behind Gelman’s lament that deserves to be made explicit. There’s no reason for the frequentist to restrict background knowledge to pre-data experimental planning and test specification. We showed how the background gives the context for a FIRST interpretation in Section 3.3. Audits also employ background, and may likely be performed by a different party than those who designed and conducted the study. This would not be a Bayesian updating to a posterior probability, but would use any well-corroborated background knowledge in auditing. A background repertoire of the slings and arrows known to threaten the type of inquiry may show a statistical inference fails an audit, or ground suspicion that it would fail an audit.

  Exhibit (iii): Use Knowledge of a Repertoire of Mistakes.

  The situation is analogous, though not identical, when background knowledge shows a hypothesized effect to have been falsified: since the effect doesn’t exist, any claim to have found it is due to some flaw; unless there was a special interest in pinpointing it, that would suffice. This is simple deductive reasoning. It’s fair to say that experimental ESP was falsified some time in the 1980s, even though one can’t point to a single bright line event. You might instead call it a “degenerating program” (to use Lakatos’ term): anomalies regularly must be explained away by ad hoc means. In each case, Persi Diaconis (1978), statistician and magician, explains that “controls often are so loose that no valid statistical analysis is possible. Some common problems are multiple end points, subject cheating, and unconscious sensory cueing” (p. 131). There may be a real effect, but it’s not ESP. It may be that Geller bent the spoon when you weren’t looking, or that flaws entered in collecting, selecting, and reporting data. A severe tester would infer that experimental ESP doesn’t exist, that the purported reality of the effect had been falsified on these grounds.

  Strictly speaking, even falsifications may be regarded as provisional, and the case reopened. Human abilities could evolve. However, anyone taking up an effect that has been manifested only with highly questionable research practices or insevere tests, must, at the very least, show they have avoided the well-known tricks in the suitcase of mistakes that a researcher in the field should be carrying. If they do not, or worse, openly flout requirements to avoid biasing selection effects, then they haven’t given a little bit of evidence – as combining prior and likelihood could allow – but rather an inference that’s BENT. A final exhibit:

  Exhibit (iv): Objectivity in Epistemology.

  Kent Staley is a philosopher of science who has developed the severity account based on error statistics (he calls it the ES account), linking it to more traditional distinctions in epistemology, notably between “internalist” and “externalist” accounts. In a paper with Aaron Cobb:

  … there seems to be a resemblance between ES and a paradigmatically externalist account of justification in epistemology. Just as Alvin Goldman’s reliabilist theory makes justification rest on the tendency of a belief-forming process to produce true rather than false beliefs (Goldman 1986, 1999), ES links the justification of an inference to its having resulted from a testing procedure with low error probabilities (Woodward 2000).

  (Staley and Cobb 2011, p. 482)

  The last sentence would need to read “low error probabilities relevant for satisfying severity,” since low error probabilities won’t suffice for a good test. My problem with the general epistemological project of giving necessary and sufficient conditions for knowledge or justified belief or the like is that it does not cash out terms such as “reliability” by alluding to actual methods. The project is one of definition. That doesn’t mean it’s not of interest to try and link to the more traditional epistemological project to see where it leads. In so doing, Staley and Cobb are right to note that the error-statistician will not hold a strictly externalist view of justification. The trouble with “externalism” is that it makes it appear that a claim (or “belief” as many prefer), is justified so long as a severity relationship SEV holds between data, hypotheses, and a test. It needn’t be able to be shown or known. The internalist view, like the appeal to inner coherence in subjective Bayesianism, has a problem in showing how internally justified claims link up to truth. The analytical internal/external distinction isn’t especially clear, but from the perspective of that project, Staley and Cobb are right to view ES as a “hybrid” view. In the ES view, the reliability of a method is independent of what anybody knows, but the knower or group of knowers must be able to respond to skeptical challenges such as: you’re overlooking flaws, you haven’t taken precautions to block errors and so on. They must display the ability to put to rest reasonable skeptical challenges. (Not just any skeptical doubts count, as discussed in solving induction in Section 2.7.) This is an integral part of being an adequate scientific researcher in a domain. (We can sidestep the worry epistemologists might voice that this precludes toddlers from having knowledge; even toddlers can non-verbally display their know-how.) Without showing a claim has been well probed, it has not been well corroborated. Warranting purported severity claims is the task of auditing.

  There are interesting attempts to locate objectivity in science in terms of the diversity and clout of the members of the social groups doing the assessing (Longino 2002). Having the stipulated characteristics might even correlate with producing good assessments, but it seems to get the order wrong (Miller 2008). It’s necessary to first identify the appropriate requirements for objective criticism. What matters are methods whose statistical properties may be shown in relation to probes on real experiments and data.

  Souvenir P: Transparency and Informativeness

  There are those who would replace objectivity with the fashionable term “transparency.” Being transparent about what was done and how one got from the raw data to the statistical inferences certainly promotes objectivity, provided I can use that information to critically appraise the inference. For example, being told about stopping rules, cherry picking, altered endpoints, and changed variables is useful in auditing your error probabilities. Simmons, Nelson, and Simonsohn (2012) beg researchers to “just say it,” if you didn’t p-hack or commit other QRPs. They offer a “21 word solution” that researchers can add to a Methods section: “We report how we determined our sample size, all data exclusions (if any), all manipulations, and all measures in the study” (p. 4). If your account doesn’t use error probabilities, however, it’s unclear how to use reports of what would alter error probabilities.
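  To see why a report of the stopping rule matters for auditing error probabilities, consider a small simulation. It is a sketch under assumptions that are not in the text: a two-sided z-test of a zero mean with known unit variance, a nominal 0.05 level, a maximum of 100 observations, and a hypothetical researcher who tests after every observation from the tenth onward and stops at the first nominally significant result.

```python
import random
from math import sqrt

Z_CRIT = 1.96  # two-sided cutoff for a nominal 0.05 level

def rejects(rng, n_max, peek_from=None):
    """Run one study under a true null; return True if it rejects.

    With peek_from=None the test is applied once, at n_max.
    Otherwise it is applied at every n >= peek_from, stopping at the
    first nominally significant result (optional stopping).
    """
    total = 0.0
    for n in range(1, n_max + 1):
        total += rng.gauss(0.0, 1.0)  # the null hypothesis is true
        peeking = peek_from is not None and n >= peek_from
        if (peeking or n == n_max) and abs(total / sqrt(n)) > Z_CRIT:
            return True
    return False

def type_one_rate(n_trials, n_max, peek_from=None, seed=1):
    """Relative frequency of erroneous rejections over many studies."""
    rng = random.Random(seed)
    hits = sum(rejects(rng, n_max, peek_from) for _ in range(n_trials))
    return hits / n_trials

fixed = type_one_rate(2000, 100)                   # pre-set sample size
optional = type_one_rate(2000, 100, peek_from=10)  # try-and-try-again
print(fixed, optional)
```

  Under these assumptions the fixed-sample design erroneously rejects at close to the nominal rate, while the optional-stopping design does so several times as often, even though each individual test is reported at “the 0.05 level.” Knowing the stopping rule is what lets an auditor tell the two apart.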

  You can’t make your inference objective merely by announcing your choices and your reasons; there needs to be a system in place to critically evaluate that information. It should not be assumed the scientist is automatically to be trusted. Leading experts might arrive at rival statistical inferences, each being transparent as to their choices of a host of priors and models. What then? It’s likely to descend into a battle of the experts. Salesmanship, popularity, and persuasiveness are already too much a part of what passes for knowledge. On the other hand, if well-understood techniques are provided for critical appraisal of the elements of the statistical inference, then transparency could have real force.

  One last thing. Viewing statistical inference as severe testing doesn’t mean our sole goal is severity. “Shun error” is not a terribly interesting rule to follow. To merely state tautologies is to state objectively true claims, but they are vacuous. We are after the dual aims of severity and informativeness. Recalling Popper, we’re interested in “improbable” claims – claims with high information content that can be subjected to more stringent tests, rather than low content claims. Fisher had said that in testing causal claims you should “make [your] theories elaborate by which he meant … [draw out] implications” for many different phenomena, increasing the chance of locating any flaw (Mayo and Cox 2006, p. 264). As I see it, the goals of stringent testing and informative theories are mutually reinforcing. Let me explain.

  To attain stringent tests, we seek strong arguments from coincidence, and “corroborative tendrils” in order to triangulate results. In so doing, we seek to make our theories more general as Fisher said. A more general claim not only has more content, opening itself up to more chances of failing, it enables cross-checks to ensure that a mistake not caught in one place is likely to ramify somewhere else. A hypothesis H* with greater depth or scope than another H may be said to be at a “higher level” than H in my horizontal “hierarchy” (Figure 2.1). For instance, the full GTR is at a higher level than the individual hypothesis about light deflection; and current theories about prion diseases are at a higher level than Prusiner’s initial hypotheses limited to kuru. If a higher level theory H* is subjected to tests with good capacity (high probability) of finding errors, it would be necessary to check and rule out more diverse phenomena than the more limited lower level hypothesis H. Were H* to nevertheless pass tests, then it does so with higher severity than does H.

 
