
Statistical Inference as Severe Testing


by Deborah G. Mayo


  I would have liked to report a more exciting ending for our tour. The promising bump or “resonance” disappeared as more data became available, drowning out the significant indications seen in April. Its reality was falsified.

  Souvenir O: Interpreting Probable Flukes

  There are three ways to construe a claim of the form: A small P-value indicates it’s improbable that the results are statistical flukes.

  (1) The person is using an informal notion of probability, common in English. They mean a small P-value gives grounds (or is evidence) of a genuine discrepancy from the null. Under this reading there is no fallacy. Having inferred H*: Higgs particle, one may say informally, “so probably we have experimentally demonstrated the Higgs,” or “probably, the Higgs exists.” “So probably” H₁ is merely qualifying the grounds upon which we assert evidence for H₁.

  (2) An ordinary error probability is meant. When particle physicists associate a 5-sigma result with claims like “it’s highly improbable our results are a statistical fluke,” the reference for “our results” includes: the overall display of bumps, with significance growing with more and better data, along with satisfactory cross-checks. Under this reading, again, there is no fallacy.

  To turn the tables on the Bayesians a bit, maybe they’re illicitly sliding from what may be inferred from an entirely legitimate high probability. The reasoning is this: With probability 0.9999997, our methods would show that the bumps disappear, under the assumption the data are due to background H₀. The bumps don’t disappear but grow. Thus, infer H*: real particle with thus and so properties. Granted, unless you’re careful about forming probabilistic complements, it’s safer to adhere to claims along the lines of U-1 through U-3. But why not be careful in negating D claims? An interesting phrase ATLAS sometimes uses is in terms of “the background fluctuation probability”: “This observation, which has a significance of 5.9 standard deviations, corresponding to a background fluctuation probability of 1.7 × 10⁻⁹, is compatible with … the Standard Model Higgs boson” (2012b, p. 1).
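  The correspondence between a sigma level and a “background fluctuation probability” is just an upper tail of the standard normal distribution. Here is a minimal sketch in Python (assuming scipy is available); the simple tail area only lands in the same ballpark as ATLAS’s reported 1.7 × 10⁻⁹, which comes from their full analysis:

```python
from scipy.stats import norm

# One-sided upper-tail probability under the background-only hypothesis,
# i.e. Pr(Z >= n_sigma) for a standard normal test statistic.
for n_sigma in (5.0, 5.9):
    print(n_sigma, norm.sf(n_sigma))
# roughly 2.9e-7 at 5 sigma and 1.8e-9 at 5.9 sigma,
# close to (though not identical with) the reported 1.7e-9
```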

  (3) The person is interpreting the P-value as a posterior probability of null hypothesis H₀ based on a prior probability distribution: p = Pr(H₀ | x). Under this reading there is a fallacy. Unless the P-value tester has explicitly introduced a prior, it would be “ungenerous” to twist probabilistic assertions into posterior probabilities. It would be a kind of “confirmation bias” whereby one insists on finding a sentence among many that could be misinterpreted Bayesianly.
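  To see why the two readings come apart numerically, here is a toy computation of my own (not from the text), contrasting a one-sided P-value with a posterior Pr(H₀ | x) under an explicitly introduced prior; the hypotheses, prior weights, and observed value are all hypothetical:

```python
from scipy.stats import norm

x0 = 1.96                                    # hypothetical observed z-statistic
p_value = norm.sf(x0)                        # one-sided P-value, about 0.025

# Toy two-point comparison: H0: mu = 0 vs H1: mu = 1, with equal prior weights
lik_h0 = norm.pdf(x0, loc=0)
lik_h1 = norm.pdf(x0, loc=1)
posterior_h0 = 0.5 * lik_h0 / (0.5 * lik_h0 + 0.5 * lik_h1)

print(p_value, posterior_h0)                 # ~0.025 vs ~0.19: different quantities
```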

  ASA 2016 Guide: Principle 2 reminds practitioners that P-values aren’t Bayesian posterior probabilities, but it slides into questioning an interpretation sometimes used by practitioners – including Higgs researchers:

  P-values do not measure (a) the probability that the studied hypothesis is true, or (b) the probability that the data were produced by random chance alone.

  (Wasserstein and Lazar 2016, p. 131)⁴

  I insert the (a), (b), absent from the original principle 2, because, while (a) is true, phrases along the lines of (b) should not be equated to (a).

  Some might allege that I’m encouraging a construal of P-values that physicists have bent over backwards to avoid! I admitted at the outset that “the problem is a bit delicate, and my solution is likely to be provocative.” My question is whether it is legitimate to criticize frequentist measures from a perspective that assumes a very different role for probability. Let’s continue with the ASA statement under Principle 2:

  Researchers often wish to turn a p-value into a statement about the truth of a null hypothesis, or about the probability that random chance produced the observed data. The p-value is neither. It is a statement about data in relation to a specified hypothetical explanation, and is not a statement about the explanation itself.

  (Wasserstein and Lazar 2016, p. 131)

  Start from the very last point: what does it mean that it’s not “about the explanation”? I think they mean it’s not a posterior probability on a hypothesis, and that’s correct. The P-value is a methodological probability that can be used to quantify “how well probed” rather than “how probable.” Significance tests can be the basis for, among other things, falsifying a proposed explanation of results, such as that they’re “merely a statistical fluctuation.” So the statistical inference that emerges is surely a statement about the explanation. Even proclamations issued by high priests – especially where there are different axes to grind – should be taken with severe grains of salt.

  As for my provocative interpretation of “probable fluctuations,” physicists might aver, as does Cousins, that it’s the science writers who take liberties with the physicists’ careful U-type statements, turning them into D-type statements. There’s evidence for that, but I think physicists may be reacting to criticisms based on how things look from Bayesian probabilists’ eyes. For a Bayesian, once the data are known, they are fixed; what’s random is an agent’s beliefs or uncertainties on what’s unknown – namely the hypothesis. For the severe tester, considering the probability of {d(X) ≥ d(x₀)} is scarcely irrelevant once d(x₀) is known. It’s the way to determine, following the severe testing principles, whether the null hypothesis can be falsified. ATLAS reports, on the basis of the P-value display, that “these results provide conclusive evidence for the discovery of a new particle with mass [approximately 125 GeV]” (ATLAS collaboration 2012b, p. 15).

  Rather than seek a high probability that a suggested new particle is real, the scientist wants to find out if it disappears in a few months. As with GTR (Section 3.1), at no point does it seem we want to give a high formal posterior probability to a model or theory. We’d rather vouchsafe some portion, say the Standard Model with the Higgs particle, and let new data reveal ways, perhaps entirely unexpected, to extend the model further. The open-endedness of science must be captured in an adequate statistical account. Most importantly, the 5-sigma report, or corresponding P-value, strictly speaking, is not the statistical inference. Severe testing premises – or something like them – are needed to move from statistical data plus background (theoretical and empirical) to detach inferences with lift-off.

  1 To avoid confusion, note the duality is altered accordingly. If we set out the test rule for T+, H₀: μ ≤ μ₀ vs. H₁: μ > μ₀, as reject H₀ iff X̅ ≥ μ₀ + cα(σ/√n), then we do not reject H₀ iff X̅ < μ₀ + cα(σ/√n). This is the same as μ₀ > X̅ − cα(σ/√n), the corresponding lower CI bound. If the test rule is X̅ > μ₀ + cα(σ/√n), the corresponding lower bound is μ₀ ≥ X̅ − cα(σ/√n).
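  A minimal numerical sketch of this duality in Python, assuming scipy and numpy; the values of μ₀, σ, n, α, and the observed mean are hypothetical choices for illustration:

```python
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025   # hypothetical test setup
c_alpha = norm.ppf(1 - alpha)                # cut-off for test T+
se = sigma / np.sqrt(n)

xbar = 0.3                                   # hypothetical observed mean
reject = xbar >= mu0 + c_alpha * se          # test T+ rule
ci_lower = xbar - c_alpha * se               # (1 - alpha) lower confidence bound

# Not rejecting H0 coincides exactly with mu0 lying above the lower CI bound
print((not reject) == (mu0 > ci_lower))      # True
```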

  2 For the computations, in test T+: H₀: μ ≤ μ₀ against H₁: μ > μ₀, suppose the observed mean x̄ just reaches the cα cut-off: x̄ = μ₀ + cα(σ/√n). The (1 − α) CI lower bound, CIL, is x̄ − cα(σ/√n), so CIL = μ₀. Standardize to get Z: Z = (X̅ − CIL)/(σ/√n). So the severity for μ > μ₀ = Pr(test T+ does not reject H₀; μ = CIL) = Pr(Z < cα) = (1 − α).
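  The same computation can be sketched in Python under the hypothetical setup of the previous note (the specific values of μ₀, σ, n, and α are again illustrative):

```python
import numpy as np
from scipy.stats import norm

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025   # hypothetical test setup
c_alpha = norm.ppf(1 - alpha)
se = sigma / np.sqrt(n)

xbar = mu0 + c_alpha * se                    # observed mean just reaching the cut-off
ci_lower = xbar - c_alpha * se               # equals mu0 in this boundary case

# SEV(mu > CIL) = Pr(X-bar < xbar; mu = CIL) = Pr(Z < c_alpha) = 1 - alpha
severity = norm.cdf((xbar - ci_lower) / se)
print(ci_lower, severity, 1 - alpha)         # approximately 0.0, 0.975, 0.975
```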

  3 The inference to (2) is a bit stronger than merely falsifying the null because certain properties of the particle must be shown at the second stage.

  4 The ASA 2016 Guide’s Six Principles:

  1. P-values can indicate how incompatible the data are with a specified statistical model.

  2. P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.

  3. Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.

  4. Proper inference requires full reporting and transparency.

  5. A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.

  6. By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.

  These principles are of minimal help when it comes to understanding and using P-values. The first thing that jumps out is the absence of any mention of P-values as error probabilities. (Fisher-N-P Incompatibilist tribes might say “they’re not!” In tension with this is the true claim (under #4) that cherry picking results in spurious P-values; p. 132.) The ASA effort has merit, and should be extended and deepened.

  Excursion 4

  Objectivity and Auditing

  Itinerary

  Tour I The Myth of “The Myth of Objectivity”

  4.1 Dirty Hands: Statistical Inference Is Sullied with Discretionary Choices

  4.2 Embrace Your Subjectivity

  Tour II Rejection Fallacies: Who’s Exaggerating What?

  4.3 Significant Results with Overly Sensitive Tests: Large n Problem

  4.4 Do P-Values Exaggerate the Evidence?

  4.5 Who’s Exaggerating? How to Evaluate Reforms Based on Bayes Factor Standards

  Tour III Auditing: Biasing Selection Effects and Randomization

  4.6 Error Control Is Necessary for Severity Control

  4.7 Randomization

  Tour IV More Auditing: Objectivity and Model Checking

  4.8 All Models Are False

  4.9 For Model-Checking, They Come Back to Significance Tests

  4.10 Bootstrap Resampling: My Sample Is a Mirror of the Universe

  4.11 Misspecification (M-S) Testing in the Error Statistical Account

  Tour I

  The Myth of “The Myth of Objectivity”

  Objectivity in statistics, as in science more generally, is a matter of both aims and methods. Objective science, in our view, aims to find out what is the case as regards aspects of the world [that hold] independently of our beliefs, biases and interests; thus objective methods aim for the critical control of inferences and hypotheses, constraining them by evidence and checks of error.

  (Cox and Mayo 2010, p. 276)

  Whenever you come up against blanket slogans such as “no methods are objective” or “all methods are equally objective and subjective,” it is a good guess that the problem is being trivialized into oblivion. Yes, there are judgments, disagreements, and values in any human activity, but that alone is too trivial an observation to distinguish among the very different ways that threats of bias and unwarranted inferences may be controlled. Is the objectivity–subjectivity distinction really toothless, as many would have you believe? I say no. I know it’s a meme promulgated by statistical high priests, but you agreed, did you not, to use a bit of chutzpah on this excursion? Besides, cavalier attitudes toward objectivity are at odds with even more widely endorsed grass-roots movements to promote replication, reproducibility, and to come clean on a number of sources behind illicit results: multiple testing, cherry picking, failed assumptions, researcher latitude, publication bias, and so on. The moves to take back science are rooted in the supposition that we can more objectively scrutinize results – even if it’s only to point out those that are BENT. The fact that these terms are used equivocally should not be taken as grounds to oust them but rather to engage in the difficult work of identifying what there is in “objectivity” that we won’t give up, and shouldn’t.

  The Key Is Getting Pushback!

  While knowledge gaps leave plenty of room for biases, arbitrariness, and wishful thinking, we regularly come up against data that thwart our expectations and disagree with the predictions we try to foist upon the world. We get pushback! This supplies objective constraints on which our critical capacity is built. Our ability to recognize when data fail to match anticipations affords the opportunity to systematically improve our orientation. Explicit attention needs to be paid to communicating results to set the stage for others to check, debate, and extend the inferences reached. Which conclusions are likely to stand up? Where do the weakest parts remain? Don’t let anyone say you can’t hold them to an objective account.

  Excursion 2, Tour II led us from a Popperian tribe to a workable demarcation for scientific inquiry. That will serve as our guide now for scrutinizing the myth of the myth of objectivity. First, good sciences put claims to the test of refutation, and must be able to embark on an inquiry to pin down the sources of any apparent effects. Second, refuted claims aren’t held on to in the face of anomalies and failed replications; they are treated as refuted in further work (at least provisionally); well-corroborated claims are used to build on theory or method: science is not just stamp collecting. The good scientist deliberately arranges inquiries so as to capitalize on pushback, on effects that will not go away, on strategies to get errors to ramify quickly and force us to pay attention to them. The ability to register how hunting, optional stopping, and cherry picking alter a method’s error-probing capacities is a crucial part of its objectivity. In statistical design, day-to-day tricks of the trade to combat bias are consciously amplified and made systematic. It is not because of a “disinterested stance” that we invent such methods; it is that we, quite competitively and self-interestedly, want our theories to succeed in the marketplace of ideas.
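  How optional stopping alters a method’s error-probing capacity can be seen in a small simulation. This is a sketch of my own (not from the text), assuming numpy: data are generated under the null, and “significance” is declared the first time a nominal 0.05-level z-test fires.

```python
import numpy as np

rng = np.random.default_rng(1)
n_trials, n_max, z_crit = 5000, 100, 1.96

false_rejections = 0
for _ in range(n_trials):
    x = rng.standard_normal(n_max)        # data generated under the null (mean 0)
    z = np.cumsum(x) / np.sqrt(np.arange(1, n_max + 1))  # z after each new observation
    if np.any(np.abs(z) > z_crit):        # stop and report the first "significant" result
        false_rejections += 1

print(false_rejections / n_trials)        # far above the nominal 0.05 error rate
```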

  Admittedly, that desire won’t suffice to incentivize objective scrutiny if you can do just as well producing junk. Successful scrutiny is very different from success at getting grants, publications, and honors. That is why the reward structure of science is so often blamed nowadays. New incentives – gold stars and badges for sharing data and for resisting the urge to cut corners – are being adopted in some fields. Fortunately for me, our travels will bypass lands of policy recommendations, where I have no special expertise. I will stop at the perimeters of scrutiny of methods, which at least provide us citizen scientists armor against being misled. Still, if the allure of carrots has grown stronger than the sticks, we need stronger sticks.

  Problems of objectivity in statistical inference are deeply intertwined with a jungle of philosophical problems, in particular with questions about what objectivity demands, and disagreements about “objective versus subjective” probability. On to the jungle!

  4.1 Dirty Hands: Statistical Inference Is Sullied with Discretionary Choices

  If all flesh is grass, kings and cardinals are surely grass, but so is everyone else and we have not learned much about kings as opposed to peasants.

  (Hacking 1965, p. 211)

  Trivial platitudes can appear as convincingly strong arguments that everything is subjective. Take this one: no human learning is pure, so anyone who demands objective scrutiny is being unrealistic and demanding immaculate inference. This is an instance of Hacking’s “all flesh is grass.” In fact, Hacking is alluding to the subjective Bayesian de Finetti (who “denies the very existence of the physical property [of] chance” (ibid.)). My one-time colleague, I. J. Good, used to poke fun at the frequentist as “denying he uses any judgments!” Let’s admit right up front that every sentence can be prefaced with “agent x judges that,” and not sweep it under the carpet (SUTC), as Good (1976) alleges. Since that can be done for any statement, it cannot be relevant for making the distinctions in which we are interested, and which we know can be made, between warranted or well-tested claims and those so poorly probed as to be BENT. You’d be surprised how far into the thicket you can cut your way by brandishing this blade alone.

  It is often urged that, however much we may aim at objective constraints, we can never have clean hands, free of the influence of beliefs and interests. We invariably sully methods of inquiry by the entry of background beliefs and personal judgments in their specification and interpretation. But the real issue is not that a human is doing the measuring; the issue is whether what is being measured is something we can reliably use to solve some problem of inquiry. An inference done by machine, untouched by human hands, wouldn’t thereby be objective in any interesting sense. There are three distinct requirements for an objective procedure of inquiry:

  1. Relevance: It should be relevant to learning about what is being measured; having an uncontroversial way to measure something is not enough to make it relevant to solving a knowledge-based problem of inquiry.

  2. Reliably capable: It should not routinely declare the problem solved when it is not (or solved incorrectly); it should be capable of controlling reports of erroneous solutions to problems with reliability.

  3. Able to learn from error: If the problem is not solved (or poorly solved) at a given point, the method should set the stage for pinpointing why.

  Yes, there are numerous choices in collecting, analyzing, modeling, and drawing inferences from data, and there is often disagreement about how they should be made, and about their relevance for scientific claims. Why suppose that this introduces subjectivity into an account, or worse, means that all accounts are in the same boat as regards subjective factors? It need not, and they are not. An account of inference shows itself to be objective precisely in how it steps up to the plate in handling potential threats to objectivity.

  Dirty Hands Argument.

  To give these arguments a run for their money, we should try to see why they look so plausible. One route is to view the reasoning as follows:

  1. A variety of human judgments go into specifying experiments, tests, and models.

  2. Because there is latitude and discretion in these specifications, which may reflect a host of background beliefs and aims, they are “subjective.”

  3. Whether data are taken as evidence for a statistical hypothesis or model depends on these subjective methodological choices.

 
