Statistical Inference as Severe Testing
Basis for the joke: An N-P test bases error probabilities on all possible outcomes or measurements that could have occurred in repetitions, but did not.
As with many infamous pathological examples, often presented as knockdown criticisms of all of frequentist statistics, this was invented by a frequentist, Cox (1958). It was a way to highlight what could go wrong in the case at hand, if one embraced an unthinking behavioral-performance view. Yes, error probabilities are taken over hypothetical repetitions of a process, but not just any repetitions will do. Here’s the statistical formulation.
We flip a fair coin to decide which of two instruments, E1 or E2, to use in observing a Normally distributed random sample Z to make inferences about mean θ. E1 has variance of 1, while that of E2 is 10⁶. Any randomizing device used to choose which instrument to use will do, so long as it is irrelevant to θ. This is called a mixture experiment. The full data would report both the result of the coin flip and the measurement made with that instrument. We can write the report as having two parts: first, which experiment was run and, second, the measurement: (Ei, z), i = 1 or 2.
In testing a null hypothesis such as θ = 0, the same z measurement would correspond to a much smaller P-value were it to have come from E1 rather than from E2: denote them as p1(z) and p2(z), respectively. The overall significance level of the mixture, [p1(z) + p2(z)]/2, would give a misleading report of the precision of the actual experimental measurement. The claim is that N-P statistics would report the average P-value rather than the one corresponding to the scale you actually used! These are often called the unconditional and the conditional test, respectively. The claim is that the frequentist statistician must use the unconditional test.
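To see the contrast in numbers, here is a minimal sketch in Python (my own illustration, not from Cox; the observed z = 2.0 and the one-sided test of H0: θ = 0 are assumptions) that computes the conditional P-values p1(z) and p2(z) along with their unconditional average:

```python
from scipy.stats import norm

# Two instruments for observing a single Normal measurement z with mean theta:
# E1 has variance 1 (sd 1); E2 has variance 10**6 (sd 1000).
SD1, SD2 = 1.0, 1000.0

def p_value(z, sd):
    """One-sided P-value for H0: theta = 0 vs. theta > 0 on the given instrument."""
    return norm.sf(z, loc=0, scale=sd)   # Pr(Z >= z; theta = 0)

z = 2.0                           # an assumed recorded measurement

p1 = p_value(z, SD1)              # P-value had E1 (the precise instrument) been used
p2 = p_value(z, SD2)              # P-value had E2 (the imprecise instrument) been used
p_mix = 0.5 * (p1 + p2)           # unconditional level, averaged over the coin flip

print(f"p1(z)           = {p1:.4f}")     # about 0.023
print(f"p2(z)           = {p2:.4f}")     # about 0.499
print(f"mixture average = {p_mix:.4f}")  # about 0.261
```

Whichever instrument actually generated z, the averaged figure misstates the precision of that measurement.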
Suppose that we know we have observed a measurement from E2 with its much larger variance:
The unconditional test says that we can assign this a higher level of significance than we ordinarily do, because if we were to repeat the experiment, we might sample some quite different distribution. But this fact seems irrelevant to the interpretation of an observation which we know came from a distribution [with the larger variance].
(Cox 1958, p. 361)
Once it is known which Ei has produced z, the P-value or other inferential assessment should be made with reference to the experiment actually run. As we say in Cox and Mayo (2010):
The point essentially is that the marginal distribution of a P-value averaged over the two possible configurations is misleading for a particular set of data. It would mean that an individual fortunate in obtaining the use of a precise instrument in effect sacrifices some of that information in order to rescue an investigator who has been unfortunate enough to have the randomizer choose a far less precise tool. From the perspective of interpreting the specific data that are actually available, this makes no sense.
(p. 296)
To scotch his famous example, Cox (1958) introduces a principle: weak conditionality.
Weak Conditionality Principle (WCP): If a mixture experiment (of the aforementioned type) is performed, then, if it is known which experiment produced the data, inferences about θ are appropriately drawn in terms of the sampling behavior in the experiment known to have been performed.
(Cox and Mayo 2010, p. 296)
It is called weak conditionality because there are more general principles of conditioning that go beyond the special case of mixtures of measuring instruments.
While conditioning on the instrument actually used seems obviously correct, nothing precludes the N-P theory from choosing the procedure “which is best on the average over both experiments” (Lehmann and Romano 2005, p. 394), and it’s even possible that the average or unconditional power is better than the conditional. In the case of such a conflict, Lehmann says relevant conditioning takes precedence over average power (1993b). He allows that in some cases of acceptance sampling, the average behavior may be relevant, but in scientific contexts the conditional result would be the appropriate one (see Lehmann 1993b, p. 1246). Context matters. Did Neyman and Pearson ever weigh in on this? Not to my knowledge, but I’m sure they’d concur with N-P tribe leader Lehmann. Admittedly, if your goal in life is to attain a precise α level, then when discrete distributions preclude this, a solution would be to flip a coin to decide the borderline cases! (See also Example 4.6, Cox and Hinkley 1974, pp. 95–6; Birnbaum 1962, p. 491.)
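Lehmann’s possibility is easy to exhibit with a toy calculation (a Python sketch; the alternative θ = 1 and the 0.09/0.01 split of α are my own illustrative choices, not from the text): an unconditional test that spends more of its α on the precise instrument can beat the pair of conditional α = 0.05 tests on average power, while both have overall size 0.05.

```python
from scipy.stats import norm

SD1, SD2 = 1.0, 1000.0    # E1 precise; E2 has variance 10**6
theta_alt = 1.0           # the alternative at which power is evaluated (an assumption)

def power(alpha, sd, theta):
    """Power at theta of the one-sided size-alpha test of H0: theta = 0
    (reject if Z >= c) on an instrument with standard deviation sd."""
    c = norm.isf(alpha, loc=0, scale=sd)
    return norm.sf(c, loc=theta, scale=sd)

# Conditional tests: size 0.05 on whichever instrument was actually used.
cond = 0.5 * (power(0.05, SD1, theta_alt) + power(0.05, SD2, theta_alt))

# One unconditional test of overall size 0.05 that spends more alpha on the
# precise instrument: size 0.09 on E1 and 0.01 on E2 (averaging to 0.05).
uncond = 0.5 * (power(0.09, SD1, theta_alt) + power(0.01, SD2, theta_alt))

print(f"average power, conditional 0.05/0.05 tests: {cond:.3f}")    # about 0.155
print(f"average power, unbalanced 0.09/0.01 test  : {uncond:.3f}")  # about 0.188
```

The conditional tests nonetheless give the relevant assessment for the measurement actually in hand, which is Lehmann’s resolution.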
Is There a Catch?
The “two measuring instruments” example occupies a famous spot in the pantheon of statistical foundations, regarded by some as causing “a subtle earthquake” in the field. Analogous examples are made out in terms of confidence interval estimation methods (Tour III, Exhibit (viii)). It is a warning to the most behavioristic accounts of testing, from which we have already distinguished the present approach. Yet justification for the conditioning (WCP) is fully within the frequentist error statistical philosophy, for contexts of scientific inference. There is no suggestion, for example, that only the particular data set be considered. That would entail abandoning the sampling distribution as the basis for inference, and with it the severity goal. Yet we are told that “there is a catch” and that WCP leads to the Likelihood Principle (LP)!
It is not uncommon to see statistics texts argue that in frequentist theory one is faced with the following dilemma: either to deny the appropriateness of conditioning on the precision of the tool chosen by the toss of a coin, or else to embrace the strong likelihood principle, which entails that frequentist sampling distributions are irrelevant to inference once the data are obtained. This is a false dilemma. Conditioning is warranted to achieve objective frequentist goals, and the [weak] conditionality principle coupled with sufficiency does not entail the strong likelihood principle. The ‘dilemma’ argument is therefore an illusion.
(Cox and Mayo 2010, p. 298)
There is a large literature surrounding the argument for the Likelihood Principle, made famous by Birnbaum (1962). Birnbaum hankered for something in between radical behaviorism and throwing error probabilities out the window. Yet he himself had apparently proved there is no middle ground (if you accept WCP)! Even people who thought there was something fishy about Birnbaum’s “proof” were discomfited by the lack of resolution to the paradox. It is time for post-LP philosophies of inference. So long as the Birnbaum argument, which Savage and many others deemed important enough to dub a “breakthrough in statistics,” went unanswered, the frequentist was thought to be boxed into the pathological examples. She is not.
In fact, I show there is a flaw in his venerable argument (Mayo 2010b, 2013a, 2014b). That’s a relief. Now some of you will howl, “Mayo, not everyone agrees with your disproof! Some say the issue is not settled.” Fine, please explain where my refutation breaks down. It’s an ideal brainbuster to work on along the promenade after a long day’s tour. Don’t be dismayed by the fact that it has been accepted for so long. But I won’t revisit it here.
3.5 P-values Aren’t Error Probabilities Because Fisher Rejected Neyman’s Performance Philosophy
Both Neyman–Pearson and Fisher would give at most lukewarm support to standard significance levels such as 5% or 1%. Fisher, although originally recommending the use of such levels, later strongly attacked any standard choice.
(Lehmann 1993b, p. 1248)
Thus, Fisher rather incongruously appears to be attacking his own past position rather than that of Neyman and Pearson.
(Lehmann 2011, p. 55)
By and large, when critics allege that Fisherian P-values are not error probabilities, what they mean is that Fisher wanted to interpret them in an evidential manner, not along the lines of Neyman’s long-run behavior. I’m not denying there is an important difference between using error probabilities inferentially and behavioristically. The truth is that N-P and Fisher used P-values and other error probabilities in both ways.¹ What they didn’t give us is a clear account of the former. A big problem with figuring out the “he said/they said” between Fisher and Neyman–Pearson is that “after 1935 so much of it was polemics” (Kempthorne 1976), reflecting a blow-up which had to do with professional rivalry rather than underlying philosophy. Juicy details later on.
We need to be clear on the meaning of an error probability. A method of statistical inference moves from data to some inference about the source of the data as modeled. Associated error probabilities refer to the probability the method outputs an erroneous interpretation of the data. Choice of test rule pins down the particular error; for example, it licenses inferring there’s a genuine discrepancy when there isn’t (perhaps of a given magnitude). The test method is given in terms of a test statistic d(X), so the error probabilities refer to the probability distribution of d(X), the sampling distribution, computed under an appropriate hypothesis. Since we need to highlight subtle changes in meaning, call these ordinary “frequentist” error probabilities. (I can’t very well call them error statistical error probabilities, but that’s what I mean.)² We’ll shortly require subscripts, so let this be error probability₁. Formal error probabilities have almost universally been associated with N-P statistics, and those with long-run performance goals. I have been disabusing you of such a straitjacketed view; they are vital in assessing how well probed the claim in front of me is. Yet my reinterpretation of error probabilities does not change their mathematical nature.
We can attach a frequentist performance assessment to any inference method. Post-data, these same error probabilities can, though they need not, serve to quantify the severity associated with an inference. Looking at the mathematics, it’s easy to see the P-value as an error probability. Take Cox and Hinkley (1974):
For given observations y we calculate t = t_obs = t(y), say, and the level of significance p_obs by p_obs = Pr(T ≥ t_obs; H0).
… Hence p_obs is the probability that we would mistakenly declare there to be evidence against H0, were we to regard the data under analysis as just decisive against H0.
(p. 66)
Thus p_obs would be the Type I error probability associated with the test procedure consisting of finding evidence against H0 when reaching p_obs.³ Thus the P-value equals the corresponding Type I error probability. [I’ve been using upper case P, but it’s impossible to unify the literature.] Listen to Lehmann, speaking for the N-P camp:
[I]t is good practice to determine not only whether the hypothesis is accepted or rejected at the given significance level, but also to determine the smallest significance level … at which the hypothesis would be rejected for the given observation. This number, the so-called P-value gives an idea of how strongly the data contradict the hypothesis. It also enables others to reach a verdict based on the significance level of their choice.
(Lehmann and Romano 2005, pp. 63–4)
N-P theorists have no compunctions in talking about N-P tests using attained significance levels or P-values. Bayesians Gibbons and Pratt (1975) echo this view:
The P-value can then be interpreted as the smallest level of significance, that is, the ‘borderline level’, since the outcome observed would … not [be] significant at any smaller levels. Thus it is sometimes called the ‘level attained’ by the sample … Reporting a P-value … permits each individual to choose his own … maximum tolerable probability for a Type I error.
(p. 21)
Is all this just a sign of texts embodying an inconsistent hybrid? I say no, and you should too.
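To back that claim with a bit of arithmetic, here is a Python sketch (my own illustration; the one-sided Normal test and the observed t_obs = 1.80 are assumptions): the attained level Pr(T ≥ t_obs; H0) agrees, up to simulation error, with the Type I error rate of the rule “reject whenever T ≥ t_obs.”

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)

# One-sided test of H0: mu = 0 with sigma = 1 known; test statistic T = sqrt(n) * xbar.
n = 25
t_obs = 1.80                           # an assumed observed value of the statistic

p_obs = norm.sf(t_obs)                 # attained level: Pr(T >= t_obs; H0)

# Type I error rate of the rule "reject H0 whenever T >= t_obs", estimated by
# simulating many repetitions of the experiment with H0 true (mu = 0):
reps = 200_000
T = np.sqrt(n) * rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)

print(f"attained P-value      : {p_obs:.4f}")               # about 0.036
print(f"simulated Type I rate : {np.mean(T >= t_obs):.4f}")  # agrees up to noise
```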
A certain tribe of statisticians professes to be horrified by the remarks of Cox and Hinkley, Lehmann and Romano, Gibbons and Pratt and many others. That these remarks come from leading statisticians, members of this tribe aver, just shows the depth of a dangerous “confusion over the evidential content (and mixing) of p’s and α’s” (Hubbard and Bayarri 2003, p. 175). On their view, we mustn’t mix what they call “evidence and error”: F and N-P are incompatible. For the rest of this tour, we’ll alternate between the museum and engaging the Incompatibilist tribes themselves. When viewed through the tunnel of the Incompatibilist statistical philosophy, these statistical founders appear confused.
The distinction between evidence (p’s) and error (α’s) is not trivial … it reflects the fundamental differences between Fisher’s ideas on significance testing and inductive inference, and [N-P’s] views on hypothesis testing and inductive behavior.
(Hubbard and Bayarri 2003, p. 171)
What’s fascinating is that the Incompatibilists admit it’s the philosophical difference they’re on about, not a mathematical one. The paper that has become the centerpiece for the position in this subsection is Berger and Sellke (1987). They ask:
Can P values be justified on the basis of how they perform in repeated use? We doubt it. For one thing, how would one measure the performance of P values? With significance tests and confidence intervals, they are either right or wrong, so it is possible to talk about error rates. If one introduces a decision rule into the situation by saying that H0 is rejected when the P value < 0.05, then of course the classical error rate is 0.05.
(p. 136)
Good. Then we can agree a P-value is, mathematically, an error probability. Berger and Sellke are merely opining that Fisher wouldn’t have justified their use on grounds of error rate performance. That’s different. Besides, are we so sure Fisher wouldn’t sully himself with crass error probabilities, and dichotomous tests? Early on at least, Fisher appears as a behaviorist par excellence. That he is later found “attacking his own position,” as Lehmann puts it, is something else.
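Their concession is easily checked by simulation (a Python sketch, my own illustration): with a continuous test statistic, the P-value is uniformly distributed under H0, so the rule “reject when the P-value is below 0.05” errs in 5% of repetitions in which H0 is true.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(7)

# Many repetitions of a one-sided z-test with H0: mu = 0 true (sigma = 1, n = 30),
# applying the decision rule "reject H0 when the P-value is below 0.05".
n, reps = 30, 200_000
xbar = rng.normal(0.0, 1.0, size=(reps, n)).mean(axis=1)
pvals = norm.sf(np.sqrt(n) * xbar)     # P-value in each repetition

print(f"long-run rejection rate under H0: {np.mean(pvals < 0.05):.4f}")  # near 0.05
```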
Mirror Mirror on the Wall, Who’s the More Behavioral of Them All?
N-P were striving to emulate the dichotomous interpretation they found in Fisher:
It is open to the experimenter to be more or less exacting in respect of the smallness of the probability he would require before he would be willing to admit that his observations have demonstrated a positive result. It is obvious that an experiment would be useless of which no possible result would satisfy him. … It is usual and convenient for the experimenters to take 5 per cent as a standard level of significance, in the sense that they are prepared to ignore all results which fail to reach this standard, and, by this means, to eliminate from further discussion the greater part of the fluctuations which chance causes have introduced into their experimental results.
(Fisher 1935a, pp. 13–14)
Fisher’s remark can be taken to justify the tendency to ignore negative results or stuff them in file drawers, somewhat at odds with his next lines, the ones that I specifically championed in Excursion 1: “we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result …” (1935a, p. 14).⁴ This would require us to keep the negative results around for a while. How else could we see if we are rarely failing, or often succeeding?
What I mainly want to call your attention to now are the key phrases “willing to admit,” “satisfy him,” “deciding to ignore.” What are these, Neyman asks, but actions or behaviors? He’d learned from R. A. Fisher! So, while many take the dichotomous “up-down” spirit of tests as foreign to Fisher, it is not foreign at all. Again from Fisher (1935a):
Our examination of the possible results of the experiment has therefore led us to a statistical test of significance, by which these results are divided into two classes with opposed interpretations … those which show a significant discrepancy from a certain hypothesis; … and on the other hand, results which show no significant discrepancy from this hypothesis.
(pp. 15–16)
No wonder Neyman could counter Fisher’s accusations that he’d turned his tests into tools for inductive behavior by saying, in effect, look in the mirror (for instance, in the acrimonious exchange of 1955–6, 20 years after the blow-up): Pearson and I were only systematizing your practices for how to interpret data, taking explicit care to prevent untoward results that you only managed to avoid on intuitive grounds!
Fixing Significance Levels.
What about the claim that N-P tests fix the Type I error probability in advance, whereas P-values are post-data? Doesn’t that prevent a P-value from being an error probability? First, we must distinguish between fixing the significance level for a test prior to data collection, and fixing a threshold to be used across one’s testing career. Fixing α and power is part of specifying a test with reasonable capabilities of answering the question of interest. Having done so, there’s nothing illicit about reporting the achieved or attained significance level, and it is even recommended by Lehmann. As for setting a threshold for habitual practice, that’s actually more Fisher than N-P.
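Here is what that division of labor might look like (a Python sketch with made-up numbers for the effect size, σ, and the observed mean): α and power are fixed pre-data to determine the sample size, and the attained significance level is reported once the data are in.

```python
import math
from scipy.stats import norm

# Pre-data specification: one-sided z-test of H0: theta = 0 with sigma = 1 known,
# fixing alpha = 0.05 and requiring power 0.9 at the alternative theta = 0.5.
alpha, beta, theta1, sigma = 0.05, 0.10, 0.5, 1.0
n = math.ceil(((norm.isf(alpha) + norm.isf(beta)) * sigma / theta1) ** 2)
print(f"required sample size: n = {n}")              # 35 with these inputs

# Post-data: nothing bars reporting the attained significance level alongside alpha.
xbar_obs = 0.41                                      # a made-up observed sample mean
t_obs = math.sqrt(n) * xbar_obs / sigma
print(f"attained significance level: {norm.sf(t_obs):.4f}")
```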
Lehmann is flummoxed by the association of fixed levels of significance with N-P since “[U]nlike Fisher, Neyman and Pearson (1933, p. 296) did not recommend a standard level but suggested that ‘how the balance [between the two kinds of error] should be struck must be left to the investigator’” (Lehmann 1993b, p. 1244). From their earliest papers, Neyman and Pearson stressed that the tests were to be “used with discretion and understanding” depending on the context (Neyman and Pearson 1928, p. 58). In a famous passage, Fisher (1956) raises the criticism – but without naming names:
A man who ‘ rejects’ a hypothesis provisionally, as a matter of habitual practice, when the significance is at the 1% level or higher, will certainly be mistaken in not more than 1% of such decisions. For when the hypothesis is correct he will be mistaken in just 1% of these cases, and when it is incorrect he will never be mistaken in rejection … However, the calculation is absurdly academic, for in fact no scientific worker has a fixed level of significance at which from year to year, and in all circumstances, he rejects hypotheses; he rather gives his mind to each particular case in the light of his evidence and his ideas.