
Statistical Inference as Severe Testing


by Deborah G Mayo


  d(X) ~ N(δ1, 1), where δ1 = √n(µ1 − µ0)/σ.

  (2) One-sided Student’s t test. Each Xi is NIID, N(µ, σ²), with σ unknown: H0: µ ≤ µ0 against H1: µ > µ0, with test statistic d(X) = √n(X̄ − µ0)/s, where s is the sample standard deviation.

  Evaluating the Type I error probability requires the distribution of d(X) under H0: d(X) ~ St(n − 1), the Student’s t distribution with (n − 1) degrees of freedom (df).

  Evaluating the Type II error probability (and power) requires the distribution of d(X) under H1 [µ = µ1]: d(X) ~ St(δ1, n − 1), where δ1 = √n(µ1 − µ0)/σ is the non-centrality parameter.

  This is the UMP, unbiased test.

  Two-sided Normal test of the mean, H0: µ = µ0 against H1: µ ≠ µ0, based on |d(X)| = |√n(X̄ − µ0)/σ|.

  (3) The difference between two means (where it is assumed the variances are equal): H0: µ1 = µ2 against H1: µ1 ≠ µ2.

  A Uniformly Most Powerful Unbiased (UMPU) test is defined by the two-sample Student’s t statistic. Under H0, d(X) ~ St(n1 + n2 − 2); under H1, d(X) follows the corresponding non-central Student’s t distribution.

  Many excellent sources of types of tests exist, so I’ll stop with these.
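  To make the sampling distributions in (1) and (2) concrete, here is a minimal numerical sketch (the values n = 25, µ0 = 0, µ1 = 0.5, σ = 1, and α = 0.025 are illustrative assumptions, not taken from the text). It computes the Type I error probability and the power of the one-sided Normal and Student’s t tests from their distributions under H0 and H1, assuming SciPy is available.

```python
# Sketch: Type I error and power for the one-sided Normal and Student's t tests.
# Illustrative assumptions (not from the text): n = 25, mu0 = 0, mu1 = 0.5, sigma = 1.
import numpy as np
from scipy import stats

n, mu0, mu1, sigma, alpha = 25, 0.0, 0.5, 1.0, 0.025
delta1 = np.sqrt(n) * (mu1 - mu0) / sigma          # non-centrality parameter

# (1) One-sided Normal test (sigma known): d(X) ~ N(0, 1) under H0, N(delta1, 1) under H1.
c = stats.norm.ppf(1 - alpha)                      # cut-off c_alpha (about 1.96)
type1_normal = stats.norm.sf(c)                    # Pr(d >= c; H0) = alpha by construction
power_normal = stats.norm.sf(c, loc=delta1)        # Pr(d >= c; mu = mu1)

# (2) One-sided Student's t test (sigma unknown): St(n-1) under H0, St(delta1, n-1) under H1.
c_t = stats.t.ppf(1 - alpha, df=n - 1)
type1_t = stats.t.sf(c_t, df=n - 1)                # = alpha
power_t = stats.nct.sf(c_t, df=n - 1, nc=delta1)   # non-central t under H1

print(c, type1_normal, power_normal, c_t, type1_t, power_t)
```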

  Exhibit (i): N-P Methods as Severe Tests: First Look (Water Plant Accident).

  There’s been an accident at a water plant where our ship is docked, and the cooling system had to be repaired. It is meant to ensure that the mean temperature of discharged water stays below the temperature that threatens the ecosystem, perhaps not much beyond 150 degrees Fahrenheit. There were 100 water measurements taken at randomly selected times and the sample mean X̄ computed, each measurement having a known standard deviation σ = 10. When the cooling system is effective, each measurement is like observing X ~ N(150, 10²). Because of this variability, we expect different 100-fold water samples to lead to different values of X̄, but we can deduce its distribution. If each X ~ N(μ = 150, 10²), then X̄ is also Normal with μ = 150, but the standard deviation of X̄ is only σ/√n = 10/√100 = 1. So X̄ ~ N(150, 1).

  It is the distribution of X̄ that is the relevant sampling distribution here. Because it’s a large random sample, the sampling distribution of X̄ is Normal or approximately so, thanks to the Central Limit Theorem. Note that the mean of the sampling distribution of X̄ is the same as the underlying mean: both are μ. The frequency link was created by randomly selecting the sample, and we assume for the moment it was successful. Suppose they are testing

  H 0 : μ ≤ 150 vs. H 1 : μ > 150.

  The test rule for α = 0.025 is: reject H0 iff X̄ ≥ 150 + 1.96σX̄, where σX̄ = σ/√n = 1 (so the cut-off is 151.96).

  For simplicity, let’s go to the 2-standard error cut-off for rejection: reject H0 (i.e., infer an indication of a discrepancy from 150) iff X̄ ≥ 152.

  The test statistic d(X) = √n(X̄ − 150)/σ = (X̄ − 150)/σX̄ is a standard Normal variable under μ = 150; for x̄ = 152 it equals 2. The area to the right of 2 under the standard Normal is around 0.025.
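  These quantities can be checked directly from the numbers given in the text (σ = 10, n = 100, α = 0.025, x̄ = 152); a small sketch, assuming SciPy:

```python
# Water plant sketch: sampling distribution of the sample mean and the alpha = 0.025 test rule.
from math import sqrt
from scipy import stats

mu0, sigma, n = 150.0, 10.0, 100
se = sigma / sqrt(n)                              # standard error of the sample mean = 1

cutoff_exact = mu0 + stats.norm.ppf(0.975) * se   # 151.96, the alpha = 0.025 cut-off
cutoff_2se = mu0 + 2 * se                         # 152, the 2-standard-error cut-off

xbar = 152.0
d = (xbar - mu0) / se                             # standardized test statistic, here 2
p_value = stats.norm.sf(d)                        # area to the right of 2, about 0.023

print(cutoff_exact, cutoff_2se, d, p_value)
```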

  Now we begin to move beyond the strict N-P interpretation. Say x̄ is just significant at the 0.025 level (x̄ = 152). What warrants taking the data as indicating μ > 150 is not that they’d rarely be wrong in repeated trials on cooling systems by acting this way – even though that’s true. There’s a good indication that it’s not in compliance right now. Why? The severity rationale: were the mean temperature no higher than 150, then over 97% of the time their method would have resulted in a lower mean temperature than observed. Were it clearly in the safe zone, say μ = 149 degrees, a lower observed mean would be even more probable. Thus, x̄ = 152 indicates some positive discrepancy from H0 (though we don’t consider it rejected by a single outcome). They’re going to take another round of measurements before acting. In the context of a policy action, to which this indication might lead, some type of loss function would be introduced. We’re just considering the evidence, based on these measurements; all for illustrative purposes.

  Severity Function

  I will abbreviate “the severity with which claim C passes test T with data x”:

  SEV(test T, outcome x, claim C).

  Reject/Do Not Reject will be interpreted inferentially, in this case as an indication or evidence of the presence or absence of discrepancies of interest.

  Let us suppose we are interested in assessing the severity of C: μ > 153. I imagine this would be a full-on emergency for the ecosystem!

  Reject H0.

  Suppose the observed mean is x̄ = 152, just at the cut-off for rejecting H0.

  The data reject H0 at level 0.025. We want to compute SEV(T+, x̄ = 152, C: μ > 153).

  We may say: “the data accord with C: μ > 153,” that is, severity condition (S-1) is satisfied; but severity requires there to be at least a reasonable probability of a worse fit with C if C is false (S-2). Here, “worse fit with C” means x̄ ≤ 152 (i.e., d(x0) ≤ 2). Given that the statistic is continuous, as with all the following examples, < and ≤ give the same result; the context indicates which is more useful. This probability must be high for C to pass severely; if it’s low, it’s BENT.

  We need Pr(X̄ ≤ 152; C is false). To say μ > 153 is false is to say μ ≤ 153. So we want Pr(X̄ ≤ 152; μ ≤ 153). But we need only evaluate severity at the point μ = 153, because this probability is even greater for μ < 153:

  SEV(C: μ > 153) = Pr(X̄ ≤ 152; μ = 153).

  Here, Pr(X̄ ≤ 152; μ = 153) = Pr(Z ≤ −1) = 0.16, with Z standard Normal. Thus SEV(C: μ > 153) = 0.16. Very low. Our minimal severity principle blocks μ > 153 because it’s fairly probable (84% of the time) that the test would yield an even larger mean temperature than we got, if the water samples came from a body of water whose mean temperature is 153. Table 3.1 gives the severity values associated with different claims, given x̄ = 152. Call tests of this form T+.
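  The severity computation just carried out is a single Normal tail area; a minimal sketch of it (same σX̄ = 1 as above):

```python
# SEV(C: mu > 153) with observed xbar = 152: Pr(Xbar <= 152; mu = 153).
from scipy import stats

se = 1.0                                   # sigma / sqrt(n) = 10 / 10
xbar, mu1 = 152.0, 153.0
sev = stats.norm.cdf((xbar - mu1) / se)    # Pr(Z <= -1), about 0.16
worse = 1 - sev                            # Pr of an even larger mean if mu = 153, about 0.84
print(sev, worse)
```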

  Table 3.1 Reject in test T+: H0: μ ≤ 150 vs. H1: μ > 150 with x̄ = 152

  Claim        Severity
  μ > μ1       Pr(X̄ ≤ 152; μ = μ1)
  μ > 149      0.999
  μ > 150      0.97
  μ > 151      0.84
  μ > 152      0.5
  μ > 153      0.16

  In each case, we are making inferences of the form μ > μ1 = 150 + γ, for different values of γ. To merely infer μ > 150, the severity is 0.97, since Pr(X̄ ≤ 152; μ = 150) = Pr(Z ≤ 2) = 0.97. While the data give an indication of non-compliance, μ > 150, to infer C: μ > 153 would be making mountains out of molehills. In this case, the observed difference just hit the cut-off for rejection. N-P tests leave things at that coarse level in computing power and the probability of a Type II error, but severity will take into account the actual outcome. Table 3.2 gives the severity values associated with different claims, given x̄ = 153.

  Table 3.2 Reject in test T+: H0: μ ≤ 150 vs. H1: μ > 150 with x̄ = 153

  Claim        Severity (with x̄ = 153)
  μ > μ1       Pr(X̄ ≤ 153; μ = μ1)
  μ > 149      ~1
  μ > 150      0.999
  μ > 151      0.97
  μ > 152      0.84
  μ > 153      0.5

  If “the major criticism of the Neyman–Pearson frequentist approach” is that it fails to provide “error probabilities fully varying with the data,” as J. Berger alleges (2003, p. 6), then we’ve answered the major criticism.
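  Every entry in Tables 3.1 and 3.2 has the form Pr(X̄ ≤ x̄; μ = μ1). The short sketch below reproduces both tables; the helper name sev_reject is just an illustrative label, not the book’s.

```python
# Reproduce Tables 3.1 and 3.2: SEV(mu > mu1) = Pr(Xbar <= xbar_obs; mu = mu1), with se = 1.
from scipy import stats

def sev_reject(xbar_obs, mu1, se=1.0):
    """Severity for the claim mu > mu1 after rejecting H0 with observed mean xbar_obs."""
    return stats.norm.cdf((xbar_obs - mu1) / se)

for xbar_obs in (152.0, 153.0):                    # Table 3.1, then Table 3.2
    for mu1 in (149, 150, 151, 152, 153):
        print(f"xbar = {xbar_obs}: SEV(mu > {mu1}) = {sev_reject(xbar_obs, mu1):.3f}")
```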

  Non-rejection.

  Now suppose x̄ = 151, so the test does not reject H0. The standard formulation of N-P (as well as Fisherian) tests stops there. But we want to be alert to a fallacious interpretation of a “negative” result: inferring there’s no positive discrepancy from μ = 150. No (statistical) evidence of non-compliance isn’t evidence of compliance; here’s why. We have (S-1): the data “accord with” H0, but what if the test had little capacity to have alerted us to discrepancies from 150? The alert comes by way of “a worse fit” with H0 – namely, a mean X̄ > 151. Condition (S-2) requires us to consider Pr(X̄ > 151; μ = 150), which is only 0.16. To get this, standardize X̄ to obtain a standard Normal variate: Z = (X̄ − 150)/σX̄; here Z = (151 − 150)/1 = 1, and Pr(Z > 1) = 0.16. Thus SEV(μ ≤ 150) = 0.16. Table 3.3 gives the severity values associated with different inferences of the form μ ≤ μ1 = 150 + γ, given x̄ = 151.

  Table 3.3 Non-reject in test T+: H0: μ ≤ 150 vs. H1: μ > 150 with x̄ = 151

  Claim        Severity
  μ ≤ μ1       Pr(X̄ > 151; μ = μ1)
  μ ≤ 150      0.16
  μ ≤ 150.5    0.3
  μ ≤ 151      0.5
  μ ≤ 152      0.84
  μ ≤ 153      0.97

  Can they at least say that x̄ = 151 is a good indication that μ ≤ 150.5? No: SEV(μ ≤ 150.5) = Pr(X̄ > 151; μ = 150.5) = 0.3 [Z = (151 − 150.5)/1 = 0.5]. But x̄ = 151 is a good indication that μ ≤ 152 and μ ≤ 153 (with severity indications of 0.84 and 0.97, respectively).
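  For the non-rejection case, severity for μ ≤ μ1 is the complementary tail, Pr(X̄ > x̄; μ = μ1). A sketch reproducing Table 3.3 (again, sev_nonreject is just an illustrative label):

```python
# Reproduce Table 3.3: SEV(mu <= mu1) = Pr(Xbar > xbar_obs; mu = mu1), with xbar_obs = 151.
from scipy import stats

def sev_nonreject(xbar_obs, mu1, se=1.0):
    """Severity for the claim mu <= mu1 after a non-rejection with observed mean xbar_obs."""
    return stats.norm.sf((xbar_obs - mu1) / se)

for mu1 in (150, 150.5, 151, 152, 153):
    print(f"SEV(mu <= {mu1}) = {sev_nonreject(151.0, mu1):.2f}")
```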

  You might say that assessing severity is no different from what we would do with a judicious use of existing error probabilities. That’s what the severe tester says. Formally speaking, it may be seen merely as a good rule of thumb to avoid fallacious interpretations. What’s new is the statistical philosophy behind it. We no longer seek either probabilism or performance, but rather the use of relevant error probabilities to assess and control severity. 5

  3.3 How to Do All N-P Tests Do (and More) While a Member of the Fisherian Tribe

  When Karl Pearson retired in 1933, he refused to let his chair go to Fisher, so they split the department into two: Fisher becomes Galton Chair and Head of Eugenics, while Egon Pearson becomes Head of Applied Statistics. They are one floor removed (Fisher on top)! The Common Room had to be “carefully shared,” as Constance Reid puts it: “Pearson’s group had afternoon tea at 4; and at 4:30, when they were safely out of the way, Fisher and his group trouped in” (C. Reid 1998, p. 114). Fisher writes to Neyman in summer of 1933:

  You will be interested to hear that the Dept. of Statistics has now been separated officially from the Galton Laboratory. I think Egon Pearson is designated as Reader in Statistics. This arrangement will be much laughed at, but it will be rather a poor joke … I shall not lecture on statistics, but probably on ‘the logic of experimentation’.

  (cited in Lehmann 2011, p. 58)

  Finally E. Pearson was able to offer Neyman a position at University College, and Neyman, greatly relieved to depart Poland, joins E. Pearson’s department in 1934. 6

  Neyman doesn’t stay long. He leaves London for Berkeley in 1938, and develops the department into a hothouse of statistics until his death in 1981. His first Ph.D. student is Erich Lehmann in 1946. Lehmann’s Testing Statistical Hypotheses (1959), the canonical N-P text, developed N-P methods very much in the mode of the N-P-Wald, behavioral-decision language. I find it interesting that even Neyman’s arch opponent, subjective Bayesian Bruno de Finetti, recognized that “inductive behavior … that was for Neyman simply a slogan underlining and explaining the difference between his own, the Bayesian and the Fisherian formulations” became, with Wald’s work, “something much more substantial.” De Finetti called this “the involuntarily destructive aspect of Wald’s work” (1972, p. 176). Cox remarks:

  [T]here is a distinction between the Neyman–Pearson formulation of testing regarded as clarifying the meaning of statistical significance via hypothetical repetitions and that same theory regarded as in effect an instruction on how to implement the ideas by choosing a suitable α in advance and reaching different decisions accordingly. The interpretation to be attached to accepting or rejecting a hypothesis is strongly context-dependent …

  (Cox 2006a, p. 36)

  If N-P long-run performance concepts serve to clarify the meaning of statistical significance tests, yet are not to be applied literally, but rather in some inferential manner – call this the meaning vs. application distinction – the question remains – how?

  My answer, in terms of severity, may be used whether you prefer the N-P tribe (tests or confidence intervals) or the Fisherian tribe. What would that most eminent Fisherian, Sir David Cox, say? In 2004, in a session we were in on statistical philosophy, at the semi-annual Lehmann conference, we asked: Was it possible to view “Frequentist Statistics as a Theory of Inductive Inference”? If this sounds familiar it’s because it echoes a section from Neyman’s quarrel with Carnap (Section 2.7), but how does a Fisherian answer it? We began “with the core elements of significance testing in a version very strongly related to but in some respects different from both Fisherian and Neyman–Pearson approaches, at least as usually formulated” (Mayo and Cox 2006, p. 80). First, there is no suggestion that the significance test would typically be the only analysis reported. Further, we agree that “the justification for tests will not be limited to appeals to long-run behavior but will instead identify an inferential or evidential rationale” (ibid., p. 81).

  With N-P results available, it became easier to understand why intuitively useful tests worked for Fisher. N-P and Fisherian tests, while starting from different places, “lead to the same destination” (with few exceptions) (Cox 2006a, p. 25). Fisher begins with seeking a test statistic that reduces the data as much as possible, and this leads him to a sufficient statistic. Let’s take a side tour to sufficiency.

  Exhibit (ii): Side Tour of Sufficient Statistic.

  Consider n independent trials X ≔ (X1, X2, …, Xn), each with a binary outcome (0 or 1), where the probability of success is an unknown constant θ associated with Bernoulli trials. The number of successes in n trials, Y = X1 + X2 + … + Xn, is Binomially distributed with parameters θ and n. The sample mean, which is just X̄ = Y/n, is a natural estimator of θ with a highly desirable property: it is sufficient, i.e., it is a function of the sufficient statistic Y. Intuitively, a sufficient statistic reduces the n-dimensional sample X into a statistic of much smaller dimensionality without losing any relevant information for inference purposes. Y reduces the n-fold outcome x to one dimension: the number of successes in n trials. The parameter of the Binomial model, θ, also has one dimension (the probability of success on each trial).

  Formally, a statistic Y is said to be sufficient for θ when the distribution of the sample is no longer a function of θ once we condition on Y, i.e., f(x | y) does not depend on θ:

  f(x; θ) = f(y; θ) f(x | y).

  Knowing the distribution of the sufficient statistic Y suffices to compute the probability of any data set x. The test statistic d(X) in the Binomial case is √n(X̄ − θ0)/σ, with σ = √[θ0(1 − θ0)], and, as required, it gets larger as X̄ deviates from θ0. Thanks to being a function of the sufficient statistic Y, it provides a test with maximal sensitivity to inconsistencies with the null hypothesis.

  The Binomial experiment is equivalent to having been given the data x0 = (x1, x2, …, xn) in two stages (Cox and Mayo 2010, p. 285):

  First, you’re told the value of Y, the number of successes out of n Bernoulli trials. Then an inference can be drawn about θ using the sampling distribution of Y.

  Second, you learn the value of the specific data, e.g., the first k trials are successes, the rest failures. The second stage is equivalent to observing a realization of the conditional distribution of X given Y = y. If the model is appropriate then “the second phase is equivalent to a random draw from a totally known distribution.” All permutations of the sequence of successes and failures are equally probable (ibid., pp. 284–5).

  “Because this conditional distribution is totally known, it can be used to assess the validity of the assumed model.” (ibid.) Notice that for a given x within a given Binomial experiment, the ratio of likelihoods at two different values of θ depends on the data only through Y. This is called the weak likelihood principle, in contrast to the general (or strong) LP of Section 1.5.
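  That the likelihood ratio at two values of θ depends on the data only through Y is easy to check numerically; a minimal sketch with illustrative data (two different 0/1 sequences sharing the same number of successes):

```python
# Sketch: for Bernoulli trials the likelihood ratio at two theta values depends
# on the data only through Y, the number of successes (sufficiency / weak LP).

def likelihood(x, theta):
    """Likelihood of a 0/1 sequence x under independent Bernoulli(theta) trials."""
    y = sum(x)
    return theta**y * (1 - theta)**(len(x) - y)

# Two different sequences with the same number of successes (illustrative data).
x1 = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]
x2 = [0, 0, 1, 0, 1, 0, 1, 0, 1, 0]

theta_a, theta_b = 0.3, 0.6
ratio1 = likelihood(x1, theta_a) / likelihood(x1, theta_b)
ratio2 = likelihood(x2, theta_a) / likelihood(x2, theta_b)
print(ratio1, ratio2)   # identical: the ratio is a function of Y alone
```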

  Principle of Frequentist Evidence, FEV

  Returning to our topic, “Frequentist Statistics as a Theory of Inductive Inference,” let me weave together three threads: (1) the Frequentist Principle of Evidence (Mayo and Cox 2006), (2) the divergent interpretations growing out of Cox’s taxonomy of test hypotheses, and (3) the links to statistical inference as severe tests. As a starting point, we identified a general principle that we dubbed the Frequentist Principle of Evidence, FEV:

  FEV(i): x is … evidence against H0 (i.e., evidence of a discrepancy from H0), if and only if, were H0 a correct description of the mechanism generating x, then, with high probability, this would have resulted in a less discordant result than is exemplified by x. (Mayo and Cox 2006, p. 82; substituting x for y)

  This sounds wordy and complicated. It’s much simpler in terms of a quantitative difference, as in significance tests. Putting FEV(i) in terms of formal P-values, or the test statistic d (abbreviating d(X)):

  FEV(i): x is evidence against H0 (i.e., evidence of discrepancy from H0), if and only if the P-value Pr(d ≥ d0; H0) is very low (equivalently, Pr(d < d0; H0) = 1 − P is very high).

  (We used “strong evidence”, although I would call it a mere “indication” until an appropriate audit was passed.) Our minimalist claim about bad evidence, no test (BENT) can be put in terms of a corollary (from contraposing FEV(i)):

  FEV(ia): x is poor evidence against H0 (poor evidence of discrepancy from H0), if there’s a high probability the test would yield a more discordant result, if H0 is correct.

  Note the one-directional ‘if’ claim in FEV(ia). We wouldn’t want to say this is the only way x can be BENT.

  Since we wanted to move away from setting a particular small P-value, we refer to “P-small” (such as 0.05, 0.01) and “P-moderate”, or “not small” (such as 0.3 or greater). We need another principle for dealing with non-rejections or insignificant results. They are often imbued with two very different false interpretations: one is that (a) non-significance indicates the truth of the null; the other is that (b) non-significance is entirely uninformative.
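  The two categories are easy to picture with the one-sided Normal test from the water plant example; a small sketch contrasting a P-small and a P-moderate outcome (the d0 values are illustrative):

```python
# FEV(i) in P-value terms: Pr(d >= d0; H0) small counts as evidence of a discrepancy,
# while a P-moderate result does not. Standard Normal null distribution assumed.
from scipy import stats

for d0 in (2.0, 0.5):                 # illustrative "P-small" and "P-moderate" outcomes
    p = stats.norm.sf(d0)             # P-value: Pr(d >= d0; H0)
    print(f"d0 = {d0}: P = {p:.3f}, Pr(d < d0; H0) = 1 - P = {1 - p:.3f}")
```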

  The difficulty with (a), regarding a modest P-value as evidence in favor of H0, is that accordance between H0 and x may occur even if rivals to H0 that are seriously different from H0 are true. This issue is particularly acute when the test’s capacity to have detected discrepancies is low. However, as against (b), null results have an important role, ranging from the scrutiny of substantive theory and the setting of bounds on parameters to scrutinizing the capability of a method for finding something out. In sync with our “arguing from error” (Excursion 1), we may infer that a discrepancy from H0 is absent if our test very probably would have alerted us to its presence (by means of a more significant P-value).

 
