Statistical Inference as Severe Testing

by Deborah G. Mayo


  A generic 1 − α lower confidence interval estimator is: μ > X̄ − c_α(σ/√n).

  A specific 1 − α lower confidence interval estimate is: μ > x̄ − c_α(σ/√n).

  The corresponding value for α is close enough to 0.005 to allow c_0.005 = 2.5 (it’s actually closer to 0.006). The impressive thing is that, regardless of the true value of μ, these rules have high coverage probability. If, for any observed x̄, in our example, you shout out

  μ > x̄ − 2.5(σ/√n),

  your assertions will be correct 99.5% of the time. The specific inference results from plugging the observed x̄ in for X̄. The specific 0.995 lower limit is x̄ − 2.5(σ/√n), and the specific 0.995 estimate is μ > x̄ − 2.5(σ/√n). This inference is qualified by the error probability of the method, namely the confidence level 0.995. But the upshot of this qualification is often misunderstood. Let’s have a new example to show the duality between the lower confidence interval estimator and the generic (α level) test T+ of form: H0: μ ≤ μ0 against H1: μ > μ0. The “accident at the water plant” example has a nice standard error of 1, but that can mislead about the role of sample size n. Let σ = 1, n = 25, so σ/√n = 0.2. (Even though we’d actually have to estimate σ, the logic is the same and it’s clearer.) I use σ/√n rather than σ_x̄ when a reminder of sample size seems needed.

  Work backwards. Suppose we’ve collected the 25 samples and observed sample mean x̄ = 0.6. (The 0.6 has nothing to do with the polling example at the outset.) For what value of μ0 would x̄ exceed μ0 by 2.5(σ/√n)? Since 2.5(σ/√n) = 0.5, the answer is μ0 = 0.1. If we were testing H0: μ ≤ 0.1 vs. H1: μ > 0.1 at level 0.005, we’d reject H0 with this outcome. The corresponding 0.995 lower estimate would be

  μ > 0.1.

  (see Note 1).
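  For readers who like to check the arithmetic, here is a minimal Python sketch of the computation just described; the scipy usage and variable names are illustrative, assuming the stated Normal model with σ = 1 and n = 25.

```python
from scipy import stats

sigma, n = 1, 25
se = sigma / n ** 0.5            # sigma/sqrt(n) = 0.2
x_bar = 0.6                      # observed sample mean
c = 2.5                          # cut-off used in the text for alpha ~ 0.005

lower_limit = x_bar - c * se     # 0.6 - 2.5(0.2) = 0.1, the 0.995 lower bound
print(lower_limit)

# Duality with test T+: H0: mu <= 0.1 vs. H1: mu > 0.1
z = (x_bar - lower_limit) / se          # 2.5
p_value = 1 - stats.norm.cdf(z)         # ~0.006, "close enough to 0.005"
print(z, p_value)
```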

  Now for the duality. x̄ = 0.6 is not statistically significantly greater than any μ value larger than 0.1 (e.g., 0.15, 0.2, etc.) at the 0.005 level. A test of form T+ would fail to reject each of the μ0 values in the CI (μ > 0.1) at the 0.005 level, with x̄ = 0.6. Since X̄ is continuous, it does not matter if the cut-off is at 0.1 or greater than or equal to 0.1. By contrast, if we were testing μ0 values 0.1 or less (T+: H0: μ ≤ 0.1 against H1: μ > 0.1), these nulls would be rejected by x̄ = 0.6 at the 0.005 level (or even lower for values less than 0.1). That is, under the supposition that the data were generated from a world where H0: μ ≤ 0.1, at least 99.5% of the time a smaller x̄ than what was observed (0.6) would occur:

  The probability of observing X̄ ≥ 0.6 would be low, 0.005.

  Severity Fact (for test T+): Taking an outcome x̄ that just reaches the α level of significance (x̄_α) as warranting H1: μ > μ0 with severity (1 − α) is mathematically the same as inferring μ > x̄ − c_α(σ/√n) at level (1 − α).

  Hence, there’s an intimate mathematical relationship between severity and confidence limits. However, severity will break out of the fixed (1 − α) level, and will supply a non-behavioristic rationale that is now absent from confidence intervals (see Note 2).

  Severity and Capabilities of Methods

  Begin with an instance of our “Fact”: To take an outcome that just reaches the 0.005 significance level as warranting H1 with severity 0.995 is the same as taking the observed x̄ = 0.6 and inferring that μ just exceeds the 0.995 lower confidence bound: μ > 0.1. My justification for inferring μ > 0.1 (with x̄ = 0.6) is this. Suppose my inference is false. Take the smallest value that renders it false, namely μ = 0.1. Were μ = 0.1, then the test very probably would have resulted in a smaller observed x̄ than I got (0.6). That is, 99.5% of the time it would have produced a result less discordant with claim μ > 0.1 than what I observed. (For μ values less than 0.1 this probability is increased.) Given that the method was highly incapable of having produced a value of x̄ as large as 0.6, if μ ≤ 0.1, we argue that there is an indication at least (if not full-blown evidence) that μ > 0.1. The severity with which μ > 0.1 “passes” (or is indicated by) this test is approximately 0.995.

  Some caveats: First, throughout this exercise, we are assuming these values are “audited,” and the assumptions of the model permit the computations to be licit. Second, we recognize full well that we merely have a single case, and inferring a genuine experimental effect requires being able to produce such impressive results somewhat regularly. That’s why I’m using the word “indication” rather than evidence. Interestingly though, you don’t see the same admonition against “isolated” CIs as with tests. (Rather than repeating these auditing qualifications, I will assume the context directs the interpretation.)

  Severity versus Performance.

  The severity interpretation differs from both of the construals that are now standard in confidence interval theory: the first is the coverage probability construal, and the second I’m calling rubbing-off. The coverage probability rationale is straightforwardly performance oriented. The rationale for the rule, infer

  μ > X̄ − 2.5(σ/√n),

  is simply that you will correctly cover the true value at least 99.5% of the time in repeated use (we can allow the repetitions to be actual or hypothetical):

  Pr(X̄ − 2.5(σ/√n) < μ; μ) ≥ 0.995, whatever the true value of μ.

  Aside: The equation above is not treating μ as a random variable, although it might look that way; X̄ is the random variable. It’s the same as asserting Pr(X̄ < μ + 2.5(σ/√n); μ) ≥ 0.995. Is this performance-oriented interpretation really all you can say? The severe tester says no. Here’s where different interpretive philosophies enter.
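  A rough simulation sketch makes the performance rationale concrete (the numpy usage and names are illustrative, assuming the Normal model with σ = 1 and n = 25):

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, c = 1.0, 25, 2.5
se = sigma / np.sqrt(n)

# Whatever the true mu, the rule "infer mu > X_bar - 2.5(sigma/sqrt n)"
# covers it in roughly 99.4% of repetitions, i.e., Pr(Z < 2.5).
for true_mu in (0.0, 0.1, 0.6, 3.0):
    x_bar = rng.normal(true_mu, sigma, size=(100_000, n)).mean(axis=1)
    coverage = np.mean(x_bar - c * se < true_mu)
    print(true_mu, round(coverage, 4))
```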

  Cox and Hinkley (1974) do not adhere to a single choice of 1 − α. Rather, to assert a 0.995 CI estimate, they say, is to follow:

  … a procedure that would be wrong only in a proportion α of cases, in hypothetical repeated applications, whatever may be the true value μ. Note that this is a hypothetical statement that gives an empirical meaning, which in principle can be checked by experiment, rather than a prescription for using confidence limits. In particular, we do not recommend or intend that a fixed value α0 should be chosen in advance and the information in the data summarized in the single assertion [μ > x̄ − c_α0(σ/√n)].

  (p. 209, μ is substituted for their θ)

  We have the meaning versus application gap again, which severity strives to close. “[W]e define procedures for assessing evidence that are calibrated by how they would perform were they used repeatedly. In that sense they do not differ from other measuring instruments” (Cox 2006a, p. 8). Yet this performance is not the immediate justification for the measurement in the case at hand. What I mean is, it’s not merely that if you often use a telescope with good precision, your measurements will have a good track record – no more than with my scales (in Section 1.1). Rather, the thinking is, knowing how they would perform lets us infer how they’re performing now. Good long-run properties “rub off” in some sense on the case at hand (provided at least they are the relevant ones).

  It’s not so clear what’s being rubbed off. You can’t say the probability it’s correct in this case is 0.995, since either it’s right or not. That’s why “confidence” is introduced. Some people say from the fact that the procedure is rarely wrong we may assign a low probability to its being wrong in the case at hand. First, this is dangerously equivocal, since the probability properly attaches to the method of inferring. Some espouse it as an informal use of “probability” outside of statistics, for instance, that confidence is “the degree of belief of a rational person that the confidence interval covers the parameter” (Schweder and Hjort 2016, p. 11). They call this “epistemic probability.” My main gripe is that neither epistemic probability, whatever it is, nor performance gives a report of well-testedness associated with the claim at hand.

  By providing several limits at different values, we get a more informative assessment, sometimes called a confidence distribution (CD). An early reference is Cox (1958). “The set of all confidence intervals at different levels of probability… [yields a] confidence distribution” (Cox 1958, p. 363). We’ll visit others later. The severe tester still wants to nudge the CD idea; whether it’s a large or small nudge is unclear because members of CD tribes are unclear. By and large, they’re either a tad bit too performance oriented or too close to a form of probabilism for a severe tester. Recall I’ve said I don’t see the severity construal out there, so I don’t wish to saddle anyone with it. If that is what some CD tribes intend, great.

 
  The severity logic is the counterfactual reasoning: were μ less than the 0.995 lower limit (0.1), then it is very probable (> 0.995) that our procedure would yield a smaller sample mean than 0.6. This probability gives the severity. To echo Popper, the claim μ > 0.1 is corroborated (at level 0.995) because it may be presented as a failed attempt to falsify it statistically. The severe testing philosophy hypothesizes that this is how humans reason. It underwrites formal error statistics as well as day-to-day reasoning.

  Exhibit (vii): Capability.

  Let’s see how severity is computed for the CI claim μ > 0.1 with x̄ = 0.6:

  1. The particular assertion h is μ > 0.1.

  2. The observed x̄ = 0.6 accords with h, an assertion about a positive discrepancy from 0.1.

  3. Values of x̄ less than 0.6 accord less well with h. So we want to compute the probability Pr(X̄ < 0.6; μ) just at the point that makes h false: μ = 0.1.

  Pr(method would yield X̄ < 0.6; μ = 0.1) = 0.995.

  4. From (3), SEV( μ > 0.1) = 0.995 (or we could write ≥ , but our convention will be to write =).
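  In code, the computation in step (3) is a single line (a sketch under the Normal assumptions; scipy’s norm.cdf gives the area under N(0,1)):

```python
from scipy import stats

se = 1 / 25 ** 0.5                # sigma/sqrt(n) = 0.2
x_bar, mu_0 = 0.6, 0.1            # observed mean; the point that just makes h false

sev = stats.norm.cdf((x_bar - mu_0) / se)   # Pr(X_bar < 0.6; mu = 0.1) = Pr(Z < 2.5)
print(round(sev, 3))                        # ~0.994, reported as 0.995 in the text
```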

  Although we are moving between values of the parameter and values of X̄, so long as we are careful, there is no illegitimacy. We can see that CI limits follow severity reasoning. For general lower 1 − α limits, with small level α:

  The inference of interest is h: μ > x̄ − c_α(σ/√n).

  Since Pr(X̄ < x̄; μ = x̄ − c_α(σ/√n)) = 1 − α,

  it follows that SEV(h) = (1 − α).

  (Lower case h emphasizes these are typically members of the full alternative in a test.) Table 3.5 gives several examples.

  Table 3.5 Lower confidence limits with x̄ = 0.6, α ranging from 0.001 to 0.84, in T+: σ = 1, n = 25, σ/√n = 0.2

  α (c_α)        μ_{1−α} = 0.6 − c_α(0.2)  (h1: μ > μ_{1−α})    Pr(X̄ ≥ 0.6; μ = μ_{1−α})    SEV(h1)

  0.001 (3)      0    (μ > 0)      0.001    0.999
  0.005 (2.5)    0.1  (μ > 0.1)    0.005    0.995
  0.025 (2)      0.2  (μ > 0.2)    0.025    0.975
  0.07 (1.5)     0.3  (μ > 0.3)    0.07     0.93
  0.16 (1)       0.4  (μ > 0.4)    0.16     0.84
  0.3 (0.5)      0.5  (μ > 0.5)    0.3      0.7
  0.5 (0)        0.6  (μ > 0.6)    0.5      0.5
  0.69 (−0.5)    0.7  (μ > 0.7)    0.69     0.31
  0.84 (−1)      0.8  (μ > 0.8)    0.84     0.16
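  The rows of Table 3.5 can be regenerated with a few lines (a sketch under the Normal assumptions; the table’s c_α values are rounded, so the computed α and SEV values differ slightly in the later decimal places):

```python
from scipy import stats

sigma, n, x_bar = 1, 25, 0.6
se = sigma / n ** 0.5                      # 0.2

for c_alpha in (3, 2.5, 2, 1.5, 1, 0.5, 0, -0.5, -1):
    alpha = 1 - stats.norm.cdf(c_alpha)    # Pr(X_bar >= 0.6; mu = mu_low) = Pr(Z >= c_alpha)
    mu_low = x_bar - c_alpha * se          # lower bound mu_{1-alpha}
    sev = 1 - alpha                        # SEV(mu > mu_low)
    print(round(alpha, 3), c_alpha, round(mu_low, 2), round(sev, 3))
```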

  Perhaps “capability or incapability” of the method can serve to get at what’s rubbing off. The specific moral I’ve been leading up to can be read right off the Table, as we vary the value for α (from 0.001 to 0.84) and form the corresponding lower confidence bounds, from μ > 0 to μ > 0.8.

  The higher the test’s capability to produce such large (or even larger) differences as we observe, under the assumption that μ = μ_{1−α}, the less severely tested is the assertion μ > μ_{1−α}. (See Figure 3.3.)

  The third column of Table 3.5 gives the complement to the severity assessment: the capability of a more extreme result, which in this case is α: Pr(X̄ ≥ 0.6; μ = μ_{1−α}) = α. This is the Π function – the attained sensitivity in relation to μ: Π(γ) (Section 3.3) – but there may be too many moving parts to see this simply right away. You can return to it later.

  Figure 3.3 Severity for the inference μ > μ_{1−α}, with x̄ = 0.6.

  We do not report a single confidence limit, but rather several, and the corresponding inferences of form h1. Take the third row. The 0.975 lower limit that would be formed from x̄ = 0.6, with c_0.025 = 2, is μ = 0.2. The estimate takes the form μ > 0.2. Moreover, the observed mean, 0.6, is statistically significantly greater than 0.2 at level 0.025. Since μ = 0.2 would very probably (0.975) produce X̄ < 0.6, the severe tester takes the outcome as a good indication of μ ≥ 0.2. I want to draw your attention to the fact that the probability of producing an X̄ ≥ 0.6 ranges from 0.005 to 0.5 for values of μ between 0.1 and the observed 0.6. It never exceeds 0.5. To see this, compute Pr(X̄ ≥ 0.6; μ = μ′), letting μ′ range from 0.1 to 0.6. We standardize X̄ to get Z = (X̄ − μ′)/(σ/√n), which is N(0,1). To find Pr(X̄ ≥ 0.6; μ = μ′), compute z0 = (0.6 − μ′)/0.2 and use the areas under the standard Normal curve to get Pr(Z ≥ z0), for μ′ ranging from 0.1 to 0.6.
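  The computation just described can be sketched as follows (illustrative code under the same Normal assumptions):

```python
from scipy import stats

se, x_bar = 0.2, 0.6
for mu_prime in (0.1, 0.2, 0.3, 0.4, 0.5, 0.6):
    z0 = (x_bar - mu_prime) / se           # (0.6 - mu')/0.2
    tail = 1 - stats.norm.cdf(z0)          # Pr(Z >= z0) = Pr(X_bar >= 0.6; mu')
    print(mu_prime, round(z0, 1), round(tail, 3))
# The tail probability runs from ~0.006 at mu' = 0.1 up to 0.5 at mu' = 0.6;
# it exceeds 0.5 only for negative z0, i.e., only once mu' exceeds 0.6.
```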

  Do you notice it is only for negative z values that the area to the right of z exceeds 0.5? The test only begins to have more than 50% capability of generating observed means as large as 0.6, when μ is larger than 0.6. An important benchmark enters. The lower 0.5 bound is 0.6. Since a result even larger than observed is brought about 50% of the time when μ = 0.6 , we rightly block the inference to μ > 0.6 .

  Go to the next to last row: to infer μ > 0.7 from x̄ = 0.6 is to use a lower confidence limit at level 0.31! Now nobody goes around forming confidence bounds at level 0.5, let alone 0.31, but they might not always realize that’s what they’re doing! We could give a performance-oriented justification: the inference to μ > 0.7 from x̄ = 0.6 is an instance of a rule that is correct only 31% of the time. Or we could use counterfactual, severity reasoning: even if μ were only 0.7, we’d get a larger x̄ than we observed a whopping 69% of the time. Our observed x̄ = 0.6 is terrible grounds to suppose μ must exceed 0.7. If anything, we’re starting to get an indication that μ < 0.7! Observe that, with larger α, the argument is more forceful by emphasizing >, rather than ≥, but it’s entirely correct either way, as X̄ is continuous.

  In grasping the duality between tests and confidence limits we consider the general form of the test in question; here we considered T+. Given the general form, we imagine the test hypotheses varying, with a fixed outcome x̄ = 0.6. Considering other instances of the general test T+ is a heuristic aid in interpreting confidence limits using the idea of statistical inference as severe testing. We will often allude to confidence limits to this end. However, the way the severe tester will actually use the duality is best seen as a post-data way to ask about the various discrepancies indicated. For instance, in testing H0: μ ≤ 0 vs. H1: μ > 0, we may wish to ask, post-data, about a discrepancy such as μ > 0.2. That is, we ask, for each of the inferences, how severely it has passed.

  Granted, this interval estimator has a nice pivot. If I thought the nice cases weren’t the subject of so much misinterpretation, I would not start there. But there’s no chance of seeing one’s way into more complex cases if we are still hamstrung by the simple ones. In fact, the vast majority of criticism and proposed reforms revolve around our test T+ and two-sided variants. If you grasp the small cluster of cases that show up in the debates, you’ll be able to extend the results. The severity interpretation enables confidence intervals to get around some of their current problems. Let’s visit a few of them now. (See also Excursion 4 Tour II, Excursion 5 Tour II.)

  Exhibit (viii): Vacuous and Empty Confidence Intervals: Howlers and Chestnuts.

  Did you hear the one about the frequentist who reports a confidence level of 0.95 despite knowing the interval must contain the true parameter value?

  Basis for the joke: it’s possible that CIs wind up being vacuously true, including all possible parameter values. “Why call it a 95 percent CI if it’s known to be true?” the critics ask. The obvious, performance-based answer is that the confidence level refers to the probability the method outputs true intervals; it’s not an assignment of probability to the specific interval. It’s thought to be problematic only by insisting on a probabilist interpretation of the confidence level. Jose Bernardo thinks that the CI user “should be subject to some re-education using well-known, standard counterexamples. … conventional 0.95-confidence regions may actually consist of the whole real line” (Bernardo 2008, p. 453). Not so.

  Cox and Hinkley (1974, p. 226) proposed interpreting confidence intervals, or their corresponding confidence limits (lower or upper), as the set of parameter values consistent with the data at the confidence level.

  This interpretation of confidence intervals also scotches criticisms of examples where, due to given restrictions, it can happen that a (1 − α) estimate contains all possible parameter values. Although such an inference is ‘trivially true,’ it is scarcely vacuous in our construal. That all parameter values are consistent with the data is an informative statement about the limitations of the data to detect discrepancies at the particular level.

  (Cox and Mayo 2010 , p. 291)

  Likewise it can happen that all possible parameter points are inconsistent with the data at the (1 − α) level. Criticisms of “vacuous” and empty confidence intervals stem from a probabilist construal of (1 − α) as the degree of support, belief, or probability attached to the particular interval; but this construal isn’t endorsed by CI methodology. There is another qualification to add: the error probability computed must be relevant; it must result from the relevant sampling distribution.

  Pathological Confidence Set.

  Here’s a famous chestnut that is redolent of Exhibit (vi) in Section 3.4 (Cox’s 1958 two measuring instruments with different precisions). It is usually put in terms of a “confidence set” with n = 2. It could also be put in the form of a test. Either way, it is taken to question the relevance of error statistical assessments in the case at hand (e.g., Berger and Wolpert 1988, Berger 2003, p. 6). Two independent and identically distributed observations are to be made, represented by random variables X1, X2. Each X can take either the value ψ − 1 or ψ + 1, each with probability 0.5, where ψ is the unknown parameter to be estimated using the data. The data can result in both outcomes being the same, or both different.

  Consider the second case: with both different, we know they will differ by 2. A possibility might be ⟨9, 11⟩. Right away, we know ψ must be 10. What luck! We know we’re right to infer ψ is 10. To depict this case more generally, the two outcomes are x1 = x′ − 1 and x2 = x′ + 1, for some value x′.

  Consider now that the first case obtains. We are not so lucky. The two outcomes are the same: x1 = x2 (maybe they’re both 9, or whatever). What should we infer about the value of parameter ψ? We know ψ is either x1 − 1 or x1 + 1 (e.g., 8 or 10); each accords equally well with the data. Say we infer ψ is x1 − 1. This method is correct with probability 0.5. Averaging over the two possible data configurations (outcomes the same or different), the probability of an erroneous inference is 0.25, so the overall confidence coefficient is 0.75. Now suppose I was lucky and observed two different outcomes. Then I know the value of ψ, so it makes no sense to infer “ψ is (x1 + x2)/2” while attaching a confidence coefficient of 0.75.
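  A short simulation (an illustrative sketch of the example as described, with ψ = 10 chosen arbitrarily) shows both the 0.75 average performance and why that average is beside the point once we know which case obtained:

```python
import numpy as np

rng = np.random.default_rng(1)
psi = 10.0                                   # true (unknown) parameter, for illustration
x = psi + rng.choice([-1.0, 1.0], size=(200_000, 2))
x1, x2 = x[:, 0], x[:, 1]

# Rule: if the outcomes differ, estimate psi by their midpoint; if they agree, use x1 - 1.
estimate = np.where(x1 != x2, (x1 + x2) / 2, x1 - 1)
correct = estimate == psi
differ = x1 != x2

print(correct.mean())            # ~0.75: the overall "confidence coefficient"
print(correct[differ].mean())    # 1.0: when the outcomes differ, we know psi exactly
print(correct[~differ].mean())   # ~0.5: when they agree, it is a coin toss
```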

 
