That is, if the attained power (att-power) of T+ against μ1 = μ0 + γ is very high, the inference to μ ≤ μ0 + γ is warranted with severity. (As always, whether it’s a mere indication or genuine evidence depends on whether it passes an audit.) If you elect to use the term attained power, you’ll have to avoid confusing it with animals given similar names; I’ll introduce you to them shortly.
Compare power analytic reasoning with severity (or att-power) reasoning from a negative or insignificant result from T+.
Power Analysis: If Pr(d(X) ≥ cα; μ1) = high and the result is not significant, then it’s an indication or evidence that μ ≤ μ1.
Severity Analysis: If Pr(d(X) ≥ d(x0); μ1) = high and the result is not significant, then it’s an indication or evidence that μ ≤ μ1.
Severity replaces the predesignated cut-off cα with the observed d(x0). Thus we obtain the same result if we choose to remain in the Fisherian tribe, as seen in the Frequentist Evidential Principle FEV(ii) (Section 3.3).
We still abide by the logic of power analysis, since if Pr(d(X) ≥ cα; μ1) = high, then Pr(d(X) ≥ d(x0); μ1) = high, at least in a test with a sensible distance measure like T+. (With an insignificant result, d(x0) < cα, so the event {d(X) ≥ cα} is contained in {d(X) ≥ d(x0)}.) In other words, power analysis is conservative. It gives a sufficient but not a necessary condition for warranting an upper bound: μ ≤ μ1. But it can be way too conservative, as we just saw.
(1) Pr(d(X) ≥ cα; μ = μ0 + γ): Power to detect γ.
Ordinary power analysis requires (1) to be high (for non-significance to warrant μ ≤ μ0 + γ).
Just missing the cut-off cα is the worst case. It is more informative to look at (2):
(2) Pr(d(X) ≥ d(x0); μ = μ0 + γ): Attained power (Π(γ)).
(1) can be low while (2) is high. The computation in (2) measures the severity (or degree of corroboration) for the inference μ ≤ μ0 + γ. The analysis with Cox kept to Π(γ) (or “attained sensitivity”), keeping “power” out of it.
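To see the contrast numerically, here is a minimal sketch in Python, with hypothetical numbers not from the text: cut-off c0.025 = 1.96, an insignificant observed d(x0) = 0.5, and discrepancy γ = 1.5, with d(X) ~ N(δ, 1) as in test T+.

```python
# Sketch: ordinary power (1) vs. attained power (2) for test T+.
# Hypothetical setup: d(X) ~ N(delta, 1), delta = (mu - mu0)/SE.
from scipy.stats import norm

c_alpha = 1.96   # cut-off for alpha = 0.025
d_obs = 0.5      # observed (insignificant) test statistic
gamma = 1.5      # discrepancy of interest (in SE units)

power = 1 - norm.cdf(c_alpha - gamma)    # (1) Pr(d(X) >= c_alpha; mu0 + gamma)
att_power = 1 - norm.cdf(d_obs - gamma)  # (2) Pr(d(X) >= d_obs; mu0 + gamma) = Pi(gamma)

print(f"(1) power          = {power:.2f}")      # ~0.32
print(f"(2) attained power = {att_power:.2f}")  # ~0.84
```

Ordinary power analysis would find (1) too low to warrant μ ≤ μ0 + 1.5, yet the attained power (2) is 0.84: the result passes μ ≤ μ0 + 1.5 with fairly high severity.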
As an entrée to Exhibit (iv): Isn’t severity just power? This is to compare apples and frogs. The power of a test to detect an alternative is an error probability of a method (one minus the probability of the corresponding Type II error). Power analysis is a way of using power to assess a statistical inference in the case of a negative result. Severity, by contrast, is always an assessment of a particular claim or inference C, from a test T and an outcome x. So with that out of the way, what if the question is put properly thus: If a result from test T+ is just statistically insignificant at level α, then is the test’s power to detect μ1 equal to the severity for inference C: μ > μ1? The answer is no. It would be equal to the severity for inferring the denial of C! See Figure 5.4 comparing SEV(μ > μ1) and POW(μ1).
Figure 5.4 Severity for (μ > μ1) vs. power (μ1).
Figure 5.5 The observed difference is 90, each group has n = 200 patients, and the standard deviation is 450.
Exhibit (iv): Difference Between Means Illustration.
I’ve been making use of Senn’s points about the nonsensical and the ludicrous in motivating the severity assessment in relation to power. I want to show you severity for a difference between means, and fortuitously Senn (2019) alludes to severity in the latest edition of his textbook to make the same point. An example is a placebo-controlled trial in asthma where the amount of air a person can exhale in one second, the forced expiratory volume, is measured. “The clinically relevant difference is presumed to be 200 ml and the standard deviation 450 ml” (p. 197). It’s a test of
H0: δ = μ1 − μ0 ≤ 0 vs. H1: δ > 0.
He will use a one-sided test at the 2.5% level, using 200 patients in each group, yielding a standard error (SE) for the difference between the two means equal to 450√(2/200) = 45.⁶ The test has over 0.99 power for detecting δ = 200. The observed difference is d = 90, which is statistically significant at the 0.025 level (90/SE = 90/45 = 2), but it wouldn’t warrant inferring δ > 200, and we can see the severity for δ > 200 is extremely low. The observed difference is statistically significantly different from H0 in accord with (or in the direction of) H1, so severity computes the probability of a worse fit with H1 under δ = 200:
SEV(δ > 200) = Pr(d < 90; δ = 200).
While this is strictly a Student’s t distribution, given the large sample size, we can use the Normal distribution:⁷
SEV(δ > 200) = the area to the left of z = (90 − 200)/45 = −2.44 (see Figure 5.5), which is 0.007.
For a few examples:
SEV(δ > 100) = 0.412, the area to the left of z = (90 − 100)/45 = −0.222;
SEV(δ > 50) = 0.813, the area to the left of z = (90 − 50)/45 = 0.889;
SEV(δ > 10) = 0.962, the area to the left of z = (90 − 10)/45 = 1.778.
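These are routine Normal tail areas. A minimal sketch reproduces them under the text’s Normal approximation (observed difference 90 ml, SE 45 ml):

```python
# Sketch: severity for delta > delta_prime in Senn's asthma example,
# using the Normal approximation: SEV = Pr(d < 90; delta = delta_prime).
from scipy.stats import norm

d_obs, se = 90, 45  # observed difference (ml) and SE of the difference

def sev_greater(delta_prime):
    """Severity for the claim delta > delta_prime."""
    return norm.cdf((d_obs - delta_prime) / se)

for dp in (200, 100, 50, 10):
    print(f"SEV(delta > {dp}) = {sev_greater(dp):.3f}")
# -> 0.007, 0.412, 0.813, 0.962, matching the text.
```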
Senn gives another way to view the severity assessment of δ > δ′, namely “adopt [δ = δ′] as a null hypothesis and then turn the significance test machinery on it” (2019). In the case of testing δ = 200, the P-value would be 0.993. Scarcely evidence against it. We first visited this mathematical link in touring the connection between severity and confidence intervals in Section 3.8. As noted, the error statistician is loath to advocate modifying the null hypothesis, because the point of a severity assessment is to supply a basis for interpreting tests that is absent in existing tests. Since significance tests aren’t explicit about assessing discrepancies, and since the rationale for P-values is questioned in all the ways we’ve espied, it’s best to supply a fresh rationale. I have offered the severity rationale as a basis for understanding, if not buying, error statistical reasoning. The severity computation might be seen as a rule of thumb to avoid misinterpretations; it could be arrived at through other means, including varying the null hypotheses. It’s the idea of viewing statistical inference as severe testing that invites a non-trivial difference with probabilism.
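To check the duality Senn points to: under the same Normal approximation, the one-sided P-value from testing δ = 200 (against δ > 200) equals one minus the severity for δ > 200.

```python
# Sketch: turn the significance test machinery on delta = 200
# (Normal approximation; observed difference 90 ml, SE 45 ml).
from scipy.stats import norm

d_obs, se = 90, 45
p_value = 1 - norm.cdf((d_obs - 200) / se)  # Pr(d >= 90; delta = 200)
print(f"P-value testing delta = 200: {p_value:.3f}")  # 0.993 = 1 - SEV(delta > 200)
```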
5.4 Severity Interpretation of Tests: Severity Curves
We visit severity tribes who have prepared an overview that synthesizes non-significant results from Fisherian tests as well as “do not reject” results from N-P tests. Following the minimal principle of severity:
(a) If data d(x) are not statistically significantly different from H0, but the capability of detecting discrepancy γ is low, then d(x) is not good evidence that the actual discrepancy is less than γ.
What counts as a discrepancy “of interest” is a separate question, outside of statistics proper. You needn’t know it to ask: What discrepancies, if they existed, would very probably have led your method to show a more significant result than you found? Upon finding this, you may infer that, at best, the test can rule out increases of that extent.
(b) If data d(x) are not statistically significantly different from H0, but the probability to detect discrepancy γ is high, then x is good evidence that the actual discrepancy is less than or equal to γ.
We are not changing the original null and alternative hypotheses! We’re using the severe testing concept to interpret the negative results – the kind of scrutiny in which one might be interested, to follow Neyman, “when we are faced with … interpreting the results of an experiment planned and performed by someone else” (Neyman 1957b, p. 15). We want to know how well tested are claims of the form μ ≤ μ1, where μ1 = (μ0 + γ), for some γ ≥ 0.
Why object to applying the severity analysis by changing the null hypothesis and doing a simple P-value computation? P-values, especially if plucked from thin air this way, are themselves in need of justification. That’s a major goal of this journey. It’s only by imagining we have either a best or good test or corresponding distance measure (let alone assuming we don’t have to deal with lots of nuisance parameters) that substituting different null hypotheses works out.
Pre-data, we need a test with good error probabilities (as discussed in Section 3.2). That assures we avoid worst cases. Post-data we go further.
For a claim H to pass with severity requires not just that (S-1) the data accord with H, but also that (S-2) the test probably would have produced a worse fit, if H were false in specified ways. We often let the measure of accordance (in (S-1)) vary and train our critical focus on (S-2), but here it’s a best test. Consider statistically insignificant results from test T+. The result “accords with” H0, so we have (S-1), but we’re wondering about (S-2): how probable is it that test T+ would have produced a result that accords less well with H0 than x0 does, were H0 false? An equivalent but perhaps more natural phrase for “a result that accords less well with H0” is “a result more discordant.” Your choice.
Souvenir W: The Severity Interpretation of Negative Results (SIN) for Test T+
Applying our general abbreviation, SEV(test T+, outcome x, inference H), we get “the severity with which μ ≤ μ1 passes test T+, with data x0”:
SEV(T+, d(x0), μ ≤ μ1),
where μ1 = (μ0 + γ), for some γ ≥ 0. If it’s clear which test we’re discussing, we use our abbreviation: SEV(μ ≤ μ1). We obtain a companion to the severity interpretation of rejection (SIR), Section 4.4, Souvenir R:
SIN (Severity Interpretation for Negative Results)
(a) If there is a very low probability that d(x0) would have been larger than it is, even if μ > μ1, then μ ≤ μ1 passes with low severity: SEV(μ ≤ μ1) is low.
(b) If there is a very high probability that d(x0) would have been larger than it is, were μ > μ1, then μ ≤ μ1 passes the test with high severity: SEV(μ ≤ μ1) is high.
To break it down, in the case of a statistically insignificant result:
SEV(μ ≤ μ1) = Pr(d(X) > d(x0); μ ≤ μ1 false).
We look at {d(X) > d(x0)} because severity directs us to consider a “worse fit” with the claim of interest. That μ ≤ μ1 is false within our model means that μ > μ1. Thus:
SEV(μ ≤ μ1) = Pr(d(X) > d(x0); μ > μ1).
Now μ > μ1 is a composite hypothesis, containing all the values in excess of μ1. How can we compute it? As with power calculations, we evaluate severity at the point μ1 = (μ0 + γ), for some γ ≥ 0, because for values μ ≥ μ1 the probability of a worse fit only increases: μ = μ1 gives the lower bound. So we need only compute
SEV(μ ≤ μ1) ≥ Pr(d(X) > d(x0); μ = μ1).
To compute SEV we compute Pr(d(X) > d(x0); μ = μ1) for any μ1 of interest. Swapping out the claims of interest (in significant and insignificant results) gives us a single criterion of a good test: severity.
Exhibit (v): Severity Curves.
The severity tribes want to present severity using a standard Normal example, one where σx̄ = 1 (as in the water plant accident). For this illustration:
If α = 0.025, we reject H0 iff d(X) ≥ c0.025 = 1.96.
Suppose test T+ yields the statistically insignificant result d(x0) = 1.5. Under the alternative μ = μ1, d(X) is N(δ, 1), where δ = (μ1 − μ0)/σx̄.
Even without identifying a discrepancy of importance ahead of time, the severity associated with various inferences can be evaluated.
The severity curves (Figure 5.6) show d(x0) = 0.5, 1, 1.5, and 1.96.
Say we want SEV(μ ≤ 0.5), taking μ0 = 0. The easiest way to compute it is to go back to the observed x̄, which would be 1.5. Here, Z = [(1.5 − 0.5)/1] = 1, and the area under the standard Normal distribution to the right of 1 is 0.16. Lousy. We can read it off the curve, looking at where the d(x0) = 1.5 curve hits the bottom-most dotted line. The severity (vertical) axis hits 0.16, and the corresponding value on the μ axis is 0.5. This could be used more generally as a discrepancy axis, as I’ll show.
Figure 5.6 Severity curves.
We can find some discrepancy from the null that this statistically insignificant result warrants ruling out at a reasonable level – one that very probably would have produced a more significant result than was observed. The value d(x0) = 1.5 yields a severity of 0.84 for a discrepancy of 2.5: SEV(μ ≤ 2.5) = 0.84 with d(x0) = 1.5. Compare this to d(x0) = 1.95, just failing to trigger the significance alarm. Now a larger upper bound is needed for severity 0.84, namely μ ≤ 2.95. If we have discrepancies of interest, by setting a high power to detect them, we ensure – ahead of time – that any insignificant result entitles us to infer “it’s not that high.” Power against μ1 evaluates the worst (i.e., lowest) severity for values μ ≤ μ1 over the outcomes that lead to non-rejection. This test ensures any insignificant result entitles us to infer μ ≤ 2.95, call it μ ≤ 3. But we can determine discrepancies that pass with severity, post-data, without setting them at the outset. Compare four different outcomes:
d(x0) = 0.5, SEV(μ ≤ 1.5) = 0.84; d(x0) = 1, SEV(μ ≤ 2) = 0.84;
d(x0) = 1.5, SEV(μ ≤ 2.5) = 0.84; d(x0) = 1.95, SEV(μ ≤ 2.95) = 0.84.
In relation to test T+ (standard Normal): if you add 1 to d(x0), the result being μ1, then SEV(μ ≤ μ1) = 0.84.⁸
We can also use severity curves to compare the severity for a given claim, say μ ≤ 1.5:
d(x0) = 0.5, SEV(μ ≤ 1.5) = 0.84; d(x0) = 1, SEV(μ ≤ 1.5) = 0.7;
d(x0) = 1.5, SEV(μ ≤ 1.5) = 0.5; d(x0) = 1.95, SEV(μ ≤ 1.5) = 0.3.
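Both comparisons reduce to one Normal tail area. A minimal sketch under the exhibit’s setup (μ0 = 0, σx̄ = 1, so SEV(μ ≤ μ1) = Pr(Z > d(x0) − μ1)):

```python
# Sketch: severity for mu <= mu1 after an insignificant result in test T+
# (standard Normal setup from the exhibit: mu0 = 0, SE = 1).
from scipy.stats import norm

def sev_upper(d_obs, mu1):
    """SEV(mu <= mu1) = Pr(d(X) > d_obs; mu = mu1)."""
    return 1 - norm.cdf(d_obs - mu1)

for d_obs in (0.5, 1.0, 1.5, 1.95):
    # Adding 1 to d(x0) always gives severity Phi(1) = 0.84.
    print(f"d = {d_obs}: SEV(mu <= {d_obs + 1}) = {sev_upper(d_obs, d_obs + 1):.2f}, "
          f"SEV(mu <= 1.5) = {sev_upper(d_obs, 1.5):.2f}")
# -> 0.84 down the first column; 0.84, 0.69, 0.50, 0.33 for the fixed claim mu <= 1.5.
```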
Low and high benchmarks convey what is and is not licensed, and suffice for avoiding fallacies of acceptance. We can deduce SIN from the case where T+ has led to a statistically significant result, SIR. In that case, the inference that passes the test is of the form μ > μ1, where μ1 = μ0 + γ. Because (μ > μ1) and (μ ≤ μ1) partition the parameter space of μ, we get SEV(μ > μ1) = 1 − SEV(μ ≤ μ1).
The more devoted amongst you will want to improve and generalize my severity curves. Some of you are staying the night at Confidence Court Inn, others at Best Bet and Breakfast. We meet at the shared lounge, Calibation. Here’s a souvenir of SIR and SIN.
Souvenir X: Power and Severity Analysis
Let’s record some highlights from Tour I:
First, ordinary power analysis versus severity analysis for Test T+:
Ordinary Power Analysis: If Pr(d(X) ≥ cα; μ1) = high and the result is not significant, then it’s an indication or evidence that μ ≤ μ1.
Severity Analysis: If Pr(d(X) ≥ d(x0); μ1) = high and the result is not significant, then it’s an indication or evidence that μ ≤ μ1.
It can happen that the claim μ ≤ μ1 is warranted by severity analysis but not by power analysis.
Now an overview of severity for test T+ (Normal testing: H0: μ ≤ μ0 vs. H1: μ > μ0, with σ known). The severity reinterpretation is set out using discrepancy parameter γ. We often use μ1, where μ1 = μ0 + γ.
Reject H0 (with x0) licenses inferences of the form μ > [μ0 + γ], for some γ ≥ 0, but with a warning as to μ ≤ [μ0 + κ], for some κ ≥ 0.
Non-reject H0 (with x0) licenses inferences of the form μ ≤ [μ0 + γ], for some γ ≥ 0, but with a warning as to values fairly well indicated: μ > [μ0 + κ], for some κ ≥ 0.
The severe tester reports the attained significance levels and at least two other benchmarks: claims warranted with severity, and ones that are poorly warranted.
Talking through SIN and SIR. Let d0 = d(x0).
SIN (Severity Interpretation for Negative Results)
(a) low: If there is a very low probability that d0 would have been larger than it is, even if μ > μ1, then μ ≤ μ1 passes with low severity: SEV(μ ≤ μ1) is low (i.e., your test wasn’t very capable of detecting discrepancy μ1 even if it existed, so when it’s not detected, it’s poor evidence of its absence).
(b) high: If there is a very high probability that d0 would have been larger than it is, were μ > μ1, then μ ≤ μ1 passes the test with high severity: SEV(μ ≤ μ1) is high (i.e., your test was highly capable of detecting discrepancy μ1 if it existed, so when it’s not detected, it’s a good indication of its absence).
SIR (Severity Interpretation for Significant Results)
If the significance level is small, it’s indicative of some discrepancy from H0; we’re concerned about the magnitude:
(a) low: If there is a fairly high probability that d0 would have been larger than it is, even if μ = μ1, then d0 is not a good indication that μ > μ1: SEV(μ > μ1) is low.⁹
(b) high: Here are two ways; choose your preferred:
• (b-1) If there is a very high probability that d0 would have been smaller than it is, if μ ≤ μ1, then when you observe so large a d0, it indicates μ > μ1: SEV(μ > μ1) is high.
• (b-2) If there’s a very low probability that so large a d0 would have resulted, if μ were no greater than μ1, then d0 indicates μ > μ1: SEV(μ > μ1) is high.¹⁰
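Since SEV(μ > μ1) = 1 − SEV(μ ≤ μ1), for test T+ it is Pr(d(X) ≤ d0; μ = μ1). A minimal sketch with hypothetical numbers (standard Normal test, μ0 = 0, σx̄ = 1, and a significant d0 = 2.0):

```python
# Sketch: SIR, severity for mu > mu1 after a significant result
# (hypothetical standard Normal test T+: mu0 = 0, SE = 1, d0 = 2.0).
from scipy.stats import norm

d0 = 2.0  # observed statistic, exceeding c_0.025 = 1.96

def sev_greater(mu1):
    """SEV(mu > mu1) = Pr(d(X) <= d0; mu = mu1)."""
    return norm.cdf(d0 - mu1)

for mu1 in (0.0, 0.5, 1.0, 2.0):
    print(f"SEV(mu > {mu1}) = {sev_greater(mu1):.2f}")
# -> 0.98, 0.93, 0.84, 0.50
```

Reading across the outputs: this significant result warrants μ > 1 with severity 0.84, but inferring μ > 2 would pass with severity only 0.5, illustrating SIR (a).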
1 We saw this in Section 4.5. Goodman also shows the results for the maximum likely alternative.
2 He writes the two descriptions as p = α vs. p > α, but I think it’s clearer using corresponding z values.
3 In private conversation.
4 A key medical paper is Freiman et al. (1978).
5 To show, as crude power analysis does not, that the severity is high when the observed mean falls 0.2 below the null value (with σx̄ = 0.2): we standardize to get z = (−0.2 − 0.2)/0.2 = −2, and so the area to the right of −2 is 0.977.
6 From the general case, SE = σ√(1/n1 + 1/n2), we get the case where n1 = n2 = n: SE = σ√(2/n) = 450√(2/200) = 45.