
Statistical Inference as Severe Testing


by Deborah G Mayo


  Probabilistic Instantiation Fallacy.

  Suppose we did manage to do an experiment involving a random selection from an urn of null hypotheses, 100θ% assumed to be true. The outcome may be X = 1 or 0 according to whether the hypothesis we’ve selected is true. Even allowing it’s known that the probability of X = 1 is 0.5, it does not follow that a specific hypothesis we might choose (say, your blood pressure drug is effective) has a frequentist probability of 0.5 of being true – any more than a particular 0.95 confidence interval estimate has a probability of 0.95. The issue, in this form, often arises in “base rate” criticisms (Mayo 1997a, 1997b, 2005b, 2010c; Spanos 2010b).
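  To make the urn picture concrete, here is a minimal simulation sketch. Python with NumPy, the 50–50 urn composition, and the particular hypothesis singled out are all illustrative assumptions on my part, not anything given in the text: the relative frequency of drawing a true hypothesis is about 0.5, yet any specific hypothesis, once fixed, is simply true or false.

```python
# Illustrative sketch (not from the text): an urn of null hypotheses, half of them true.
import numpy as np

rng = np.random.default_rng(0)
urn_truth = np.array([True] * 50 + [False] * 50)  # 100 hypotheses, 50% assumed true

# The frequentist probability attaches to the selection procedure:
draws = rng.choice(urn_truth, size=100_000, replace=True)
print("Relative frequency of drawing a true hypothesis:", draws.mean())  # ~0.5

# But any *specific* hypothesis (say, "your blood pressure drug is effective")
# has a fixed truth value; asking about it repeatedly does not vary the answer:
specific = urn_truth[7]  # a particular hypothesis, fixed once chosen (index is arbitrary)
print("Its truth value on every 'trial':", specific)  # always True or always False
```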

  Is the PPV computation relevant to the very thing that working scientists want to assess: strength of the evidence for effects or their degree of corroboration?

  Crud Factor.

  It is supposed in many fields of social and biological science that nearly everything is related to everything: “all nulls are false.” Meehl dubbed this the crud factor. He describes how he and David Lykken conducted a study of the crud factor in psychology in 1966. They used a University of Minnesota student questionnaire sent to 57,000 high school seniors, including family facts, attitudes toward school, leisure activities, educational plans, etc. Cross-tabulating variables including parents’ occupation, education, siblings, birth order, family attitudes, sex, religious preferences, 22 leisure time activities, MCAT scores, etc., all 105 cross-tabulations were statistically significant at incredibly small levels.

  These relationships are not, I repeat, Type I errors. They are facts about the world, and with N = 57,000 they are pretty stable. Some are theoretically easy to explain, others more difficult, others completely baffling. The ‘easy’ ones have multiple explanations, sometimes competing, usually not. Drawing theories from a pot and associating them whimsically with variable pairs would yield an impressive batch of H0-refuting ‘confirmations.’

  (Meehl 1990, p. 206)

  He estimates the crud factor correlation at around 0.3 or 0.4.

  So let’s apply Ioannidis’ analysis to two cases. In the first case, we’ve randomly selected a hypothesis from a social science urn with high crud factor. Even if I searched and cherry picked, perhaps looking for ones that correlate well with a theory I have in mind, statistical significance at the 0.05 level would still result in a fairly high prevalence of true claims (D’s) among those found statistically significant. Since the test they passed lacked stringency, I wouldn’t be able to demonstrate a genuine reproducible effect – in the manner that is understood in science. So nothing has been demonstrated about replicability or knowledge of real effects by dint of a high PPV.
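  For concreteness, here is a hedged sketch of the PPV arithmetic being applied in this first case. The prevalence of 0.5, the power of 0.8, and the 0.05 level are illustrative numbers I am supplying, not figures from the text: with high prevalence the PPV comes out high even though nothing stringent has been done.

```python
# Hedged illustration of the PPV computation in a high-prevalence ("crud factor") field.
# The numerical settings are assumptions for illustration, not values given in the text.

def ppv(prevalence, alpha, power):
    """Positive predictive value: Pr(effect is real | test is significant)."""
    true_positives = power * prevalence
    false_positives = alpha * (1 - prevalence)
    return true_positives / (true_positives + false_positives)

print(ppv(prevalence=0.5, alpha=0.05, power=0.8))    # ~0.94: high PPV, low stringency
print(ppv(prevalence=0.001, alpha=0.05, power=0.8))  # ~0.016: low PPV at low prevalence
```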

  You might say high prevalence could never happen with things like correlating genes and disease. But how can we count up the hypotheses? Should they include molecular biology, proteomics, stem cells, etc.? Do we know what hypotheses will be conjectured next year? Why not combine fields for estimating prevalence? With a little effort, one could claim to have as high a prevalence as desired.

  Now let’s assume we are in one of those low prevalence situations. If I’ve done my homework and gone beyond the one P-value before going into print, checked flaws, tested for violated assumptions, then even if I don’t yet know the causal explanation, I may have a fairly good warrant for taking the effect as real. Having obeyed Fisher, I am in a good position to demonstrate the reality of the published finding. Avoiding bias and premature publication is what’s doing the work, not prevalence.

  There is a seductive blurring of rates of false positives over an imagined population, PPVs, on the one hand, with an assessment of what we know about reproducing any particular effect, on the other, and fans of the DS model fall into this equivocal talk. In other words, “positive predictive value,” in this context, is a misnomer. The number isn’t telling us how valuable the statistically significant result is for predicting the truth or reproducibility of that effect. Nor is it even assuring that lots of the findings in the group will be reproducible over time. We want to look at how well tested the particular hypothesis of interest is. We might assess the prevalence with which hypotheses pass highly stringent tests, if false. Now look what’s happened. We have come full circle to evaluating the severity of tests passed. Prevalence has nothing to do with it.

  I am reminded of the story of Isaac. Not in the Bible, but in a discussion I had with Erich Lehmann in Princeton (when his wife was working at the Educational Testing Services). It coincided with a criticism by Colin Howson (1997a, b) to the effect that low prevalence (or “base rates”) negates severity of test. Isaac is a high school student who has passed (+) a battery of tests for D: “college-readiness.” It is given that Pr(+|~D) is 0.05, while Pr(+|D) ≈ 1. But because he was randomly selected from Fewready town, where the prevalence of readiness is only 0.001, Pr(D|+) is still very low. Had Isaac been randomly selected from Manyready suburb with high Pr(D), then Pr(D|+) would be high. In fact Isaac, from Fewready town, would have to score quite a bit higher than if he had come from Manyready suburb for the same PPV. There is a real policy question here that officials disagree on. Should we demand higher test scores from students in Fewready town to ensure overall college-readiness amongst those accepted by college admissions boards? Or would that be a kind of reverse affirmative action?

  We might go further and imagine Alex from Manyready scored lower than Isaac, maybe even cheated on just one or two questions. Even if their PPVs are equal, I submit that Isaac is in a better position to demonstrate his college readiness.10
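  A minimal sketch of the Bayes computation behind Isaac’s case above: Pr(+|D) is set to exactly 1 here, and the Manyready prevalence of 0.5 is my own illustrative assumption; only Pr(+|~D) = 0.05 and the Fewready prevalence of 0.001 come from the text.

```python
# Hedged sketch of the screening calculation for Isaac (Fewready) vs. a high-prevalence town.
# Pr(+|D) = 1 and the Manyready prevalence of 0.5 are illustrative assumptions.

def pr_ready_given_pass(prevalence, pr_pass_given_ready=1.0, pr_pass_given_not=0.05):
    """Pr(D | +) by Bayes' theorem."""
    numerator = pr_pass_given_ready * prevalence
    denominator = numerator + pr_pass_given_not * (1 - prevalence)
    return numerator / denominator

print(pr_ready_given_pass(prevalence=0.001))  # Fewready: ~0.02 despite passing the battery
print(pr_ready_given_pass(prevalence=0.5))    # Manyready (assumed 0.5): ~0.95
```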

  The Dangers of the Diagnostic Screening Model for Science

  What then can we infer is replicable? Claims that have passed with severity. If subsequent tests corroborate the severity assessment of an initial study, then it is replicated. But severity is not the goal of science; lots of true but trivial claims are not the goal either. Science seeks growth of knowledge and understanding. To take the diagnostic-screening model literally, by contrast, would point the other way: keep safe.

  Large-scale evidence should be targeted for research questions where the pre-study probability is already considerably high, so that a significant research finding will lead to a post-test probability that would be considered quite definitive.

  (Ioannidis 2005, p. 0700)

  Who would pursue seminal research that challenged the reigning biological paradigm, as did Prusiner, doggedly pursuing, over decades, the cause of mad cow and related diseases, and the discovery of prions? Would Eddington have gone to all the trouble of testing the deflection effect in Brazil? Newton was predicting fine. Replication is just a small step toward getting real effects. Lacking the knowledge of how to bring about an effect, and how to use it to change other known and checkable effects, your PPV may be swell but your science could be at a dead-end. To be clear: its advocates surely don’t recommend the “keep safe” consequence, but addressing it is worthwhile to further emphasize the difference between good science and a good scorecard.

  There are contexts in which the screening viewpoint is useful. Beyond diagnostic screening of disease, high-throughput testing of microarray data seeks to control the rates of genes worth following up. Nevertheless, we argue that the PPV does not quantify how well tested, warranted, or plausible a given scientific hypothesis is (including ones about genetic associations where a DS model is apt). I’m afraid the DS model has introduced confusion into the literature, by mixing up the probability of a Type I error (often called the “false positive rate”) with the posterior probability given by the FFR: Pr(H0 | H0 is rejected). Equivocation is encouraged. In frequentist tests, reducing the Type II error probability results in increasing the Type I error probability: there is a trade-off. In the DS model, the trade-off disappears: reducing the Type II error rate also reduces the FFR.
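  To make the contrast concrete, here is a hedged numerical sketch; the prevalence of true nulls, the cut-offs, the effect size, and the use of SciPy are all assumptions for illustration. In the DS formula the FFR, Pr(H0 | H0 is rejected), falls as power rises at a fixed α, whereas in an actual fixed-sample-size test raising power means lowering the rejection cut-off, which raises α.

```python
# Hedged sketch contrasting the DS-model FFR with the frequentist alpha/beta trade-off.
# All numerical settings (prevalence of true nulls, cut-offs, effect size) are assumptions.
from scipy.stats import norm

def ffr(prev_true_nulls, alpha, power):
    """DS model: Pr(H0 true | H0 rejected)."""
    false_finds = alpha * prev_true_nulls
    true_finds = power * (1 - prev_true_nulls)
    return false_finds / (false_finds + true_finds)

# In the DS model, raising power at fixed alpha lowers the FFR -- no trade-off appears:
print(ffr(0.5, alpha=0.05, power=0.5))  # ~0.09
print(ffr(0.5, alpha=0.05, power=0.9))  # ~0.05

# In a one-sided z-test with n fixed, raising power requires lowering the cut-off,
# which raises the Type I error probability: the trade-off is real.
effect = 1.0  # standardized true effect under the alternative (assumed)
for cutoff in (1.96, 1.645, 1.28):
    alpha = 1 - norm.cdf(cutoff)
    power = 1 - norm.cdf(cutoff - effect)
    print(f"cutoff={cutoff}: alpha={alpha:.3f}, power={power:.3f}")
```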

  Much of Ioannidis’ work is replete with sagacious recommendations for better designs. My aim here was the limited one of analyzing the diagnostic screening model of tests. That it’s the basis for popular reforms underscores the need for scrutiny.

  1 That is, β(μ.84) = Pr(d < 0.4; μ = 0.6) = Pr(Z < −1) = 0.16.
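  As a quick numerical check of note 1: treating the standard error of d as 0.2, which is how the stated standardization to Z = −1 comes out, is an inference on my part rather than something spelled out in the note.

```python
# Numerical check of note 1: beta(mu_.84) = Pr(d < 0.4; mu = 0.6), with SE(d) = 0.2 (assumed).
from scipy.stats import norm

beta = norm.cdf((0.4 - 0.6) / 0.2)  # = Pr(Z < -1)
print(round(beta, 2))  # 0.16, so the power at mu = 0.6 is 0.84
```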

  2 In deciphering existing discussions on ordinary power analysis, we can suppose that d(x0) happens to be exactly at the cut-off for rejection, in discussing significant results; and just misses the cut-off for discussions on insignificant results in test T+. Then att-power for μ1 equals ordinary power for μ1.

  3 Note: we are in “Case 2” where we’ve added 1.65 to the cut-off, meaning the power is the area to the right of −1.65 under the standard Normal curve (Section 5.1).

  4 You can obtain these from the severity curves in Section 5.4.

  5 There are slight differences from their using a two-sided test, but we hardly add anything for the negative direction: for (a), it is ≃ 0. The severe tester would not compute power using both directions once she knew the result.

  6 The point can also be made out by increasing power by dint of sample size. If n = 10,000, (σ/√n) = 0.1. Test T+(n = 10,000) rejects H0 at the 0.025 level if the sample mean is 150.2 or greater (150 + 1.96(0.1)). A 95% confidence interval is [150, 150.4]. With n = 100, the just 0.025 significant result, 152, corresponds to the interval [150, 154]. The latter is indicative of a larger discrepancy. Granted, sample size must be large enough for the statistical assumptions to pass an audit.
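  A hedged sketch of the comparison in note 6, taking H0: μ ≤ 150, σ = 10, and a one-sided 0.025-level cut-off (z = 1.96). These particulars are my reconstruction, inferred from σ/√n = 0.1 at n = 10,000 and the intervals quoted; they are not spelled out in the note.

```python
# Hedged reconstruction of note 6: the same just-significant result at different sample sizes.
# Assumes H0: mu <= 150, sigma = 10, one-sided test at the 0.025 level (z = 1.96).
import math

def just_significant_ci(n, mu0=150.0, sigma=10.0, z=1.96):
    se = sigma / math.sqrt(n)
    xbar = mu0 + z * se  # observed mean sitting right at the rejection cut-off
    return xbar, (xbar - z * se, xbar + z * se)

for n in (100, 10_000):
    xbar, ci = just_significant_ci(n)
    print(f"n={n}: mean={xbar:.1f}, 95% CI=({ci[0]:.1f}, {ci[1]:.1f})")
# n=100:    mean ~152.0, CI ~ (150, 154)   -> indicative of a larger discrepancy
# n=10,000: mean ~150.2, CI ~ (150, 150.4)
```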

  7 Some call it the false discovery rate, but that was already defined by Benjamini and Hochberg in connection with the problem of multiple comparisons (see Section 4.6).

  8 The screening model used here has also been criticized by many even for screening itself. See, for example, Dawid (1976).

  9 Even without bias, it’s expected that only 50% of statistically significant results will replicate as significantly on the next try, but such a probability is to be expected (Senn 2002). Senn regards such probabilities as irrelevant.

  10 Peter Achinstein and I have debated this on and off for years (Achinstein 2010; Mayo 1997a, 2005b, 2010c).

  Tour III

  Deconstructing the N-P versus Fisher Debates

  [Neyman and Pearson] began an influential collaboration initially designed primarily, it would seem, to clarify Fisher’s writing. This led to their theory of testing hypotheses and to Neyman’s development of confidence intervals, aiming to clarify Fisher’s idea of fiducial intervals. As late as 1932 Fisher was writing to Neyman encouragingly about this work, but relations soured, notably when Fisher greatly disapproved of a paper of Neyman’s on experimental design and no doubt partly because their being in the same building at University College London brought them too close to one another!

  (Cox 2006a, p. 195)

  Who but David Cox could so expertly distill the nitty-gritty of a long story in short and crisp terms? It hits all the landmarks we want to visit in Tour III. Wearing error statistical spectacles gives a Rosetta Stone for a novel deconstruction of some of the best known artifacts. We begin with the most famous passage from Neyman and Pearson (1933), often taken as the essence of the N-P philosophy. We’ll make three stops:

  First, we visit a local theater group performing “Les Miserables Citations”;

  Next, I’ve planned a daytrip to Fisher’s Fiducial Island for some little explored insights into our passage;

  Third, we’ll get a look at how philosophers of statistics have deconstructed that same passage.

  I am using “deconstruct” in the sense of “analyze or reduce to expose assumptions or reinterpret.” A different sense goes along with “deconstructionism,” whereby it’s thought texts lack fixed meaning. I’m allergic to the relativistic philosophies associated with this secondary sense. Still, here we’re dealing with methods about which advocates say the typical performance metaphor is just a heuristic, not an instruction for using the methods. In using them, they are to be given a subtle evidential reading. So it’s fitting to speak of disinterring a new meaning.

  5.7 Statistical Theatre: “Les Miserables Citations”

  We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis.

  But we may look at the purpose of tests from another view-point. Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong.

  (Neyman and Pearson 1933, pp. 141–2)

  Neyman and Pearson wrote these paragraphs once upon a time when they were still in the midst of groping toward the basic concepts of tests – for example, “power” had yet to be coined. Yet they are invariably put forward as proof positive that N-P tests are relevant only for a crude long-run performance goal. I’m not dismissing the centrality of these passages, nor denying the 1933 paper records some of the crucial early developments. I am drawn to these passages because taken out of context, as they so often are, they have led to knee-jerk interpretations to which our famous duo would have objected. What was the real context of those passages? The paper opens, just five paragraphs earlier, with a discussion of two French probabilists – Joseph Bertrand, author of Calculus of Probabilities (1889), and Émile Borel, author of Le Hasard (1914)!

  Neyman had attended Borel’s lectures in Paris, and he returns to the Bertrand–Borel debate in no less than five different papers – one “an appreciation” for Egon Pearson when he died – and in recounting core influences on N-P theory to biographer Constance Reid. Erich Lehmann (1993a) wrote an entire paper on “The Bertrand-Borel Debate and the Origins of the Neyman-Pearson Theory.”1 A deconstruction of the debate illuminates the inferential over the behavioristic construal of tests – somewhat surprisingly given the behavioristic-sounding passage to follow. We’re in time for a matinee where the key characters are placed in an (imaginary) theater production. It’s titled “Les Miserables Citations.” (Lehmann’s translation from the French is used where needed.)

  The curtain opens with a young Neyman and Pearson (from 1933) standing mid-stage, lit by a spotlight. (All speaking parts are exact quotes; Neyman does the talking.)

  Neyman and Pearson (N-P): Bertrand put into statistical form a variety of hypotheses, as for example the hypothesis that a given group of stars … form a ‘system.’ His method of attack, which is that in common use, consisted essentially in calculating the probability, P, that a certain character, x, of the observed facts would arise if the hypothesis tested were true. If P were very small, this would generally be considered as an indication that … H was probably false, and vice versa. Bertrand expressed the pessimistic view that no test of this kind could give reliable results.

  Borel, however, … considered that the method described could be applied with success provided that the character, x, of the observed facts were properly chosen – were, in fact, a character which he terms ‘en quelque sorte remarquable’ [in some sense remarkable].

  (Neyman and Pearson 1933, p. 141).

  The stage fades to black, then a spotlight shines on Bertrand, stage right.

  Bertrand: How can we decide on the unusual results that chance is incapable of producing? … The Pleiades appear closer to each other than one would naturally expect … In order to make the vague idea of closeness more precise, should we look for the smallest circle that contains the group? the largest of the angular distances? the sum of squares of all the distances? … Each of these quantities is smaller for the group of the Pleiades than seems plausible. Which of them should provide the measure of implausibility?

  He turns to the audience, shaking his head.

  The application of such calculations to questions of this kind is a delusion and an abuse.

  (Bertrand 1889, p. xvii; Lehmann 1993a, p. 963)

  The stage fades to black, then a spotlight appears on Borel, stage left.

  Borel: The particular form that problems of causes often take … is the following: Is such and such a result due to chance or does it have a cause? It has often been observed how much this statement lacks in precision. Bertrand has strongly emphasized this point. But … to refuse to answer under the pretext that the answer cannot be absolutely precise, is to … misunderstand the essential nature of the application of mathematics. [Bertrand considers the Pleiades.] If one has observed a [precise angle between the stars] … in tenths of seconds … one would not think of asking to know the probability [of observing exactly this observed angle under chance] because one would never have asked that precise question before having measured the angle…

  The question is whether one has the same reservations in the case in which one states that one of the angles of the triangle formed by three stars has ‘une valeur remarquable’ [a striking or noteworthy value], and is for example equal to the angle of the equilateral triangle…

 
