Statistical Inference as Severe Testing

by Deborah G Mayo


  FEV(ii): A moderate P-value is evidence of the absence of a discrepancy δ from H0, only if there is a high probability the test would have given a worse fit with H0 (i.e., a smaller P-value) were a discrepancy δ to exist.

  (ibid., pp. 83–4)

  This again is an “if-then” or conditional claim. These are canonical pieces of statistical reasoning, in their naked form as it were. To dress them up to connect with actual questions and problems of inquiry requires context-dependent background information.

  FIRST Interpretations: Fairly Intimately Related to the Statistical Test – Cox’s Taxonomy

  In the statistical analysis of scientific and technological data, there is virtually always external information that should enter in reaching conclusions about what the data indicate with respect to the primary question of interest. Typically, these background considerations enter not by a probability assignment but by identifying the question to be asked, designing the study, interpreting the statistical results and relating those inferences to primary scientific ones …

  (Mayo and Cox 2006, p. 84)

  David Cox calls for an interpretive guide between a test’s mathematical formulation and substantive applications: “I think that more attention to these rather vague general issues is required if statistical ideas are to be used in the most fruitful and wide-ranging way” (Cox 1977, p. 62). There are aspects of the context that go beyond the mathematics but which are Fairly Intimately Related to the Statistical Test (FIRST) interpretations. I’m distinguishing these FIRST interpretations from wider substantive inferences, though there’s no strict red line between them.

  While warning that “it is very bad practice to summarise an important investigation solely by a value of P” (1982, p. 327), Cox gives a rich taxonomy of null hypotheses that recognizes how significance tests can function as part of complex and context-dependent inquiries (1977, pp. 51–2). Pure or simple Fisherian significance tests (with no explicit alternative) are housed within the taxonomy, not separated out as some radically distinct entity. If his taxonomy had been incorporated into the routine exposition of tests, we could have avoided much of the confusion we are still suffering from. The proper way to view significance tests acknowledges a variety of problem situations:

  Are we testing parameter values within some overriding model? (fully embedded)

  Are we merely checking if a simplification will do? (nested alternative)

  Do we merely seek the direction of an effect already presumed to exist? (dividing)

  Would a model pass an audit of its assumptions? (test of assumption)

  Should we worry about data that appear anomalous for a theory that has already passed severe tests? (substantive)

  Although Fisher, strictly speaking, had only the null hypothesis, with context directing an appropriate test statistic, the result of such a selection is that the test is sensitive to a type of discrepancy. Even if alternatives only become explicit after identifying a test statistic – which some (e.g., Senn) regard as more basic – we may still regard them as alternatives.

  Sensitivity Achieved or Attained

  For a Fisherian like Cox, a test’s power only has relevance pre-data, in planning tests, but, like Fisher, he can measure “sensitivity”:

  In the Neyman–Pearson theory of tests, the sensitivity of a test is assessed by the notion of power, defined as the probability of reaching a preset level of significance … for various alternative hypotheses. In the approach adopted here the assessment is via the distribution of the random variable P, again considered for various alternatives.

  (Cox 2006a, p. 25)

  This is the key: Cox will measure sensitivity by a function we may abbreviate as Π(γ). Computing Π(γ) may be regarded as viewing the P-value as a statistic. That is:

  Π(γ) = Pr(P ≤ p_obs; µ0 + γ).

  The alternative is µ1 = µ0 + γ. Using the P-value distribution has a long history and is part of many approaches. Given that the P-value inverts the distance, it is clearer and less confusing to formulate Π(γ) in terms of the test statistic d. Π(γ) is very similar to power in relation to alternative µ1, except that Π(γ) considers the observed difference rather than the N-P cut-off cα:

  Π(γ) = Pr(d ≥ d0; µ0 + γ),

  POW(γ) = Pr(d ≥ cα; µ0 + γ).

  Π may be called a “sensitivity function,” or we might think of Π(γ) as the “attained power” to detect discrepancy γ (Section 5.3). The nice thing about power is that it is always computed relative to a designated test or null hypothesis, which gives it a reference. Let’s agree that Π will likewise always relate to an observed difference from a designated test hypothesis H0.
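  To see the contrast numerically, here is a minimal sketch (not from the text) comparing Π(γ) with POW(γ) for a one-sided test of a Normal mean with known σ; the values of µ0, σ, n, α, and the observed mean are illustrative assumptions.

```python
# Illustrative sketch (assumed values): attained sensitivity Π(γ) vs. power POW(γ)
# for a one-sided test of H0: µ = µ0 against µ > µ0, Normal data, known σ.
from scipy.stats import norm

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025   # all assumed for illustration
se = sigma / n ** 0.5                        # standard error of the sample mean
xbar_obs = 0.4                               # observed sample mean (assumed)
d_obs = (xbar_obs - mu0) / se                # observed test statistic d0
c_alpha = norm.ppf(1 - alpha)                # N-P cut-off c_alpha

def attained_sensitivity(gamma):
    """Π(γ) = Pr(d ≥ d0; µ0 + γ): under µ0 + γ the statistic d is N(γ/se, 1)."""
    return norm.sf(d_obs - gamma / se)

def power(gamma):
    """POW(γ) = Pr(d ≥ c_alpha; µ0 + γ): same distribution, preset cut-off instead."""
    return norm.sf(c_alpha - gamma / se)

for gamma in (0.0, 0.2, 0.4, 0.6):
    print(f"γ = {gamma:.1f}:  Π(γ) = {attained_sensitivity(gamma):.3f}"
          f"   POW(γ) = {power(gamma):.3f}")
```

  The only difference between the two functions is the reference point: the observed d0 for Π(γ), the preset cut-off cα for POW(γ).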

  Aspects of Cox’s Taxonomy

  I won’t try to cover Cox’s full taxonomy, which has taken different forms. I propose that the main delineating features are, first, whether the null and alternative exhaust the answers or parameter space for the given question, and, second, whether the null hypothesis is considered a viable basis for a substantive research claim, or merely as a reference for exploring the way in which it is false. None of these are hard and fast distinctions, but you’ll soon see why they are useful. I will adhere closely to what Cox has said about the taxonomy and the applications of FEV; all I add is a proposed synthesis. I restrict myself now to a single parameter. We assume the P-value has passed an audit (except where noted).

  1. Fully embedded. Here we have exhaustive parametric hypotheses governed by a parameter θ, such as the mean deflection of light at the 1919 eclipse, or the mean temperature. H0: µ = µ0 vs. H1: µ > µ0 is a typical N-P setting. Strictly speaking, we may have θ = (µ, k) with additional parameters k to be estimated. This formulation, Cox notes, “will suggest the most sensitive test statistic, essentially equivalent to the best estimate of µ” (Cox 2006a, p. 37).

  A. P-value is modest (not small): Since the data accord with the null hypothesis, FEV directs us to examine the probability of observing a result more discordant from H0 if µ = µ0 + γ: Π(γ) = Pr(d ≥ d0; µ0 + γ).

  If that probability is very high, following FEV(ii), the data indicate that µ < µ0 + γ.

  Here Π(γ) gives the severity with which the test has probed the discrepancy γ. So we don’t merely report “no evidence against the null,” we report a discrepancy that can be ruled out with severity. “This avoids unwarranted interpretations of consistency with H0 with insensitive tests … [and] is more relevant to specific data than is the notion of power” (Mayo and Cox 2006, p. 89).

  B. P-value is small: From FEV(i), a small P-value indicates evidence of some discrepancy µ > µ0, since Pr(d < d0; H0) = 1 − P is large. This is the basis for ordinary (statistical) falsification.

  However, we add, “if a test is so sensitive that a P-value as small as or even smaller than the one observed is probable even when µ ≤ µ0 + γ, then a small value of P” is poor evidence of a discrepancy from H0 in excess of γ (ibid.). That is, from FEV(ia), if Π(γ) = Pr(d ≥ d0; µ0 + γ) is moderately high (greater than 0.3, 0.4, 0.5), then there are poor grounds for inferring µ > µ0 + γ. This is equivalent to saying that SEV(µ > µ0 + γ) is low.

  There’s no need to set a sharp line between significance and non-significance in this construal – extreme cases generally suffice. FEV leads to an inference as to both what’s indicated, and what’s not indicated. Both are required by a severe tester. Go back to our accident at the water plant. The non-significant result, x̄ = 151, in testing µ ≤ 150 vs. µ > 150, only attains a P-value of 0.16. SEV(C: µ > 150.5) is 0.7 (Table 3.3). Not terribly high, but if that discrepancy were of interest, it wouldn’t be ignorable. A reminder: we are not making inferences about point values, even though we need only compute Π at a point. In this first parametric pigeonhole, confidence intervals can be formed, though we wouldn’t limit them to the typical 0.95 or 0.99 levels.7
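  These numbers are easy to reproduce. A standard error of 1 for the sample mean is assumed in the sketch below (it is not restated in this passage, but it is the value consistent with a P-value of 0.16 for an observed mean of 151 when testing µ ≤ 150):

```python
# Sketch of the water-plant numbers quoted above; the standard error of 1 is an
# assumption consistent with the reported P-value of 0.16 (x̄ = 151, H0: µ ≤ 150).
from scipy.stats import norm

mu0, se, xbar = 150.0, 1.0, 151.0
d_obs = (xbar - mu0) / se

p_value = norm.sf(d_obs)            # Pr(d ≥ d0; µ = 150) ≈ 0.16
# SEV(C: µ > 150.5) after the non-significant result: the probability of a result
# according even less well with C than x̄ = 151 does (i.e., X̄ ≤ 151), were C false
# (worst case µ = 150.5).
sev_C = norm.cdf(d_obs - 0.5 / se)  # Φ(0.5) ≈ 0.69, the 0.7 of Table 3.3

print(f"P-value ≈ {p_value:.2f}, SEV(µ > 150.5) ≈ {sev_C:.2f}")
```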

  2. Nested alternative (non-exhaustive). In a second pigeonhole an alternative statistical hypothesis H1 is considered not “as a possible base for ultimate interpretation but as a device for determining a suitable test statistic” (Mayo and Cox 2006, p. 85). Erecting H1 may be only a convenient way to detect small departures from a given statistical model. For instance, one may use a quadratic model H1 to test the adequacy of a linear relation. Even though polynomial regressions are a poor base for final analysis, they are very convenient and interpretable for detecting small departures from linearity. (ibid.)

  Failing to reject the null (moderate P-value) might be taken to indicate the simplifying assumption is adequate, whereas rejecting the null (small P-value) is not evidence for alternative H1. That’s because there are lots of non-linear models not probed by this test. The H0 and H1 do not exhaust the space.

  A. P-value is modest (not small): At best it indicates adequacy of the model in the respects well probed; that is, it indicates the absence of discrepancies that, very probably, would have resulted in a smaller P-value.

  B. P-value is small: This indicates discrepancy from the null in the direction of the alternative, but it is unwarranted to infer the particular H1 insofar as other non-linear models could be responsible.

  We are still employing the FEV principle, even where it is qualitative.
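  To make the linearity example concrete, here is a sketch with simulated data (the data, model, and numbers are assumptions for illustration, not from the text). The quadratic alternative serves only as a device: a small P-value for the quadratic coefficient signals a departure from linearity in the respect probed, but it is not grounds for inferring the quadratic model itself.

```python
# Sketch (simulated data, assumed model): a quadratic alternative used only as a
# device for detecting departures from a linear relation (nested-alternative case).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 60)
y = 2.0 + 0.5 * x + 0.05 * x**2 + rng.normal(scale=1.0, size=x.size)  # mildly non-linear

# Fit H1: y = b0 + b1*x + b2*x^2 by least squares; H0 (linearity) corresponds to b2 = 0.
X = np.column_stack([np.ones_like(x), x, x**2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]
resid = y - X @ beta
df = x.size - X.shape[1]
cov = (resid @ resid / df) * np.linalg.inv(X.T @ X)
t_b2 = beta[2] / np.sqrt(cov[2, 2])
p_b2 = 2 * stats.t.sf(abs(t_b2), df)     # two-sided P-value for the quadratic term

print(f"t for quadratic term: {t_b2:.2f}, P-value: {p_b2:.3g}")
# A small P-value indicates some departure from linearity, but many non-linear
# models not probed here could be responsible; H0 and H1 do not exhaust the space.
```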

  3. Dividing nulls: H0: µ = µ0 vs. H1: µ > µ0 and H0: µ = µ0 vs. H1: µ < µ0. In this pigeonhole, we may already know or suspect the null is false and discrepancies exist: but which? Suppose the interest is comparing two or more treatments. For example, compared with a standard, a new drug may increase or may decrease survival rates.

  The null hypothesis of zero difference divides the possible situations into two qualitatively different regions with respect to the feature tested. To look at both directions, one combines two tests, the first to examine the possibility that µ > µ0, say, the second for µ < µ0. The overall significance level is twice the smaller P-value, because of a “selection effect.” One may be wrong in two ways. It is standard to report the observed direction and double the initial P-value (if it’s a two-sided test).
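  In code the convention is simple (a minimal sketch, assuming a Normal test statistic; the observed value is made up for illustration):

```python
# Sketch (assumed Normal test statistic): report the observed direction and double
# the smaller one-sided P-value, as described above for the dividing case.
from scipy.stats import norm

d_obs = 1.9                              # observed standardized difference (assumed)
p_upper = norm.sf(d_obs)                 # Pr(d ≥ d_obs; µ = µ0), probes µ > µ0
p_lower = norm.cdf(d_obs)                # Pr(d ≤ d_obs; µ = µ0), probes µ < µ0

direction = "µ > µ0" if p_upper < p_lower else "µ < µ0"
p_two_sided = 2 * min(p_upper, p_lower)  # doubling reflects the post-data selection
print(f"direction: {direction}, two-sided P ≈ {p_two_sided:.3f}")
```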

  While a small P-value indicates a direction of departure (e.g., which of two treatments is superior), failing to get a small P-value here merely tells us the data do not provide adequate evidence even of the direction of any difference. Formally, the statistical test may look identical to the fully embedded case, but the nature of the problem, and your background knowledge, yields a more relevant construal. These interpretations are still FIRST. You can still report the upper bound ruled out with severity, bringing this case closer to the fully embedded case (Table 3.4).

  Table 3.4 FIRST Interpretations

  Taxon: 1. Fully embedded (exhaustive). Remarks: H1 may be the basis for a substantive interpretation. Small P-value: indicates µ > µ0 + γ iff Π(γ) is low. P-value not small: if Π(γ) is high, there’s poor indication of µ > µ0 + γ.

  Taxon: 2. Nested alternatives (non-exhaustive). Remarks: H1 is set out to search for departures from H0. Small P-value: indicates a discrepancy from H0, but not grounds to infer H1. P-value not small: indicates H0 is adequate in the respect probed.

  Taxon: 3. Dividing (exhaustive). Remarks: µ ≤ µ0 vs. µ > µ0; a discrepancy is presumed, but in which direction? Small P-value: indicates the direction of discrepancy; if Π(γ) is low, µ > µ0 + γ is indicated. P-value not small: data aren’t adequate even to indicate the direction of departure.

  Taxon: 4. Model assumptions (i) omnibus (exhaustive). Remarks: e.g., non-parametric runs test for IID (may have low power). Small P-value: indicates departure from the assumptions probed, but not a specific violation. P-value not small: indicates the absence of violations the test is capable of detecting.

  Taxon: 4. Model assumptions (ii) focused (non-exhaustive). Remarks: e.g., parametric test for a specific type of dependence. Small P-value: indicates departure from assumptions in the direction of H1, but can’t infer H1. P-value not small: indicates the absence of violations the test is capable of detecting.

  4. Null hypotheses of model adequacy. In “auditing” a P-value, a key question is: how can I check that the model assumptions hold adequately for the data in hand? We distinguish two types of tests of assumptions (Mayo and Cox 2006, p. 89): (i) omnibus and (ii) focused.

  (i) With a general omnibus test, a group of violations is checked all at once. For example: H0: the IID (independent and identically distributed) assumptions hold vs. H1: IID is violated. The null and its denial exhaust the possibilities, for the question being asked. However, sensitivity can be so low that failure to reject may be uninformative. On the other hand, a small P-value indicates H1: there’s a departure somewhere. The very fact of its low sensitivity indicates that when the alarm goes off something’s there. But where? Duhemian problems loom. A subsequent task would be to pin this down.

  (ii) A focused test is sensitive to a specific kind of model inadequacy, such as a lack of independence. This lands us in a situation analogous to the non-exhaustive case in “nested alternatives.” Why? Suppose you erect an alternative H1 describing a particular type of non-independence, e.g., Markov. While a small P-value indicates some departure, you cannot infer H1 so long as various alternative models, not probed by this test, could account for it. It may only give suggestions for alternative models to try. The interest may be in the effect of violated assumptions on the primary (statistical) inference, if any. We might ask: are the assumptions sufficiently questionable to preclude using the data for the primary inference? After a lunch break at Einstein’s Cafe, we’ll return to the museum for an example of that.
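  To illustrate the two kinds of audit, here is a sketch with simulated, mildly dependent data (the particular checks – a runs test on signs about the median and a lag-1 autocorrelation check – are illustrative choices, not prescribed by the text):

```python
# Sketch (simulated data; the checks shown are illustrative): an omnibus runs test
# for IID vs. a focused check for lag-1 serial dependence.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
e = rng.normal(size=200)
x = np.empty_like(e)
x[0] = e[0]
for t in range(1, e.size):                # mild positive dependence: AR(1), phi = 0.4
    x[t] = 0.4 * x[t - 1] + e[t]

# (i) Omnibus: Wald–Wolfowitz runs test on signs about the median (normal approximation).
s = x > np.median(x)
runs = 1 + np.count_nonzero(s[1:] != s[:-1])
n1, n2 = s.sum(), (~s).sum()
mean_r = 2 * n1 * n2 / (n1 + n2) + 1
var_r = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
p_runs = 2 * norm.sf(abs(runs - mean_r) / np.sqrt(var_r))

# (ii) Focused: lag-1 autocorrelation; under independence, r1 * sqrt(n) is roughly N(0, 1).
xc = x - x.mean()
r1 = (xc[1:] * xc[:-1]).sum() / (xc ** 2).sum()
p_lag1 = 2 * norm.sf(abs(r1) * np.sqrt(x.size))

print(f"omnibus runs test P ≈ {p_runs:.3f}; focused lag-1 test P ≈ {p_lag1:.3f}")
# A small omnibus P-value says only that IID fails somewhere; the focused test points
# to serial dependence, but even then other departures could be responsible.
```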

  Scotching a Famous Controversy

  At a session on the replication crisis at a 2015 meeting of the Society for Philosophy and Psychology, philosopher Edouard Machery remarked that, even in so heralded a case as the eclipse tests of GTR, one of the results didn’t replicate the other two. The third result pointed, not to Einstein’s prediction, but, as Eddington ([1920] 1987) declared, “with all too good agreement to the ‘half-deflection,’ that is to say, the Newtonian value” (p. 117). He was alluding to a famous controversy that grew up around the allegation that Eddington selectively ruled out data that supported the Newtonian “half-value” against the Einsteinian one. Earman and Glymour (1980), among others, alleged that Dyson and Eddington threw out the results unwelcome for GTR for political purposes (“… one of the chief benefits to be derived from the eclipse results was a rapprochement between German and British scientists and an end to talk of boycotting German science” (p. 83)).8 Failed replication may indeed be found across the sciences, but this particular allegation is mistaken. The museum’s display on “Data Analysis in the 1919 Eclipse” shows a copy of the actual notes penned on the Sobral expedition before any data analysis:

  May 30, 3 a.m., four of the astrographic plates were developed … It was found that there had been a serious change of focus … This change of focus can only be attributed to the unequal expansion of the mirror through the sun’s heat … It seems doubtful whether much can be got from these plates.

  (Dyson et al. 1920, p. 309)

  Although a fair amount of (unplanned) data analysis was required, it was concluded that no usable standard error of the estimate could be computed. The hypothesis:

  The data x0 (from Sobral astrographic plates) were due to systematic distortion by the sun’s heat, not to the deflection of light,

  passes with severity. An even weaker claim is all that’s needed: we can’t compute a valid estimate of error. And notice how very weak the claim to be corroborated is!

  The mirror distortion hypothesis hadn’t been predesignated, but it is altogether justified to raise it in auditing the data: it could have been chewing gum or spilled coffee that spoilt the results. Not only that, the same data hinting at the mirror distortion are to be used in testing the mirror distortion hypothesis (though differently modeled)! That sufficed to falsify the requirement that there was no serious change of focus (scale effect) between the eclipse and night plates. Even small systematic errors are crucial because the resulting scale effect from an altered focus quickly becomes as large as the Einstein-predicted effect. Besides, the many staunch Newtonian defenders would scarcely have agreed to discount an apparently pro-Newtonian result.

  The case was discussed and soon settled in the journals of the time: the brouhaha came later. It turns out that, if these data points are deemed usable, the results actually point to the Einsteinian value, not the Newtonian value. A reanalysis in 1979 supports this reading (Kennefick 2009). Yes, in 1979 the director of the Royal Greenwich Observatory took out the 1919 Sobral plates and used a modern instrument to measure the star positions, analyzing the data by computer.

  [T]he reanalysis provides after-the-fact justification for the view that the real problem with the Sobral astrographic data was the difficulty … of separating the scale change from the light deflection.

  (Kennefick 2009, p. 42)

  What was the result of this herculean effort to redo the data analysis from 60 years before?

  Ironically, however, the 1979 paper had no impact on the emerging story that something was fishy about the 1919 experiment … so far as I can tell, the paper has never been cited by anyone except for a brief, vague reference in Stephen Hawking’s A Brief History of Time [which actually gets it wrong and was corrected].

 
