
Statistical Inference as Severe Testing


by Deborah G. Mayo



  8 If you add k to d(x₀), k > 0, the result being μ₁, then SEV(μ ≤ μ₁) = area to the right of −k under the standard Normal (SEV > 0.5).

  If you subtract k from d(x₀), the result being μ₁, then SEV(μ ≤ μ₁) = area to the right of k under the standard Normal (SEV ≤ 0.5).

  For the general case of Test T+, you’d be adding or subtracting kσ_x̄ to x̄₀. We know that adding 0.85σ_x̄, 1σ_x̄, and 1.28σ_x̄ to the cut-off for rejection in a test T+ results in μ values against which the test has 0.8, 0.84, and 0.9 power. If you treat the observed x̄₀ as if it were being contemplated as the cut-off, and add 0.85σ_x̄, 1σ_x̄, and 1.28σ_x̄, you will arrive at μ₁ values such that SEV(μ ≤ μ₁) = 0.8, 0.84, and 0.9, respectively. That’s because severity goes in the same direction as power for non-rejection in T+. For familiar numbers of σ_x̄’s added/subtracted to x̄₀:

  Claim:  μ ≤ x̄₀ − 1σ_x̄    μ ≤ x̄₀    μ ≤ x̄₀ + 1σ_x̄    μ ≤ x̄₀ + 1.65σ_x̄    μ ≤ x̄₀ + 1.96σ_x̄

  SEV:    0.16              0.5       0.84              0.95                 0.975
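
  These benchmarks can be checked directly: SEV(μ ≤ x̄₀ + kσ_x̄) is the area to the right of −k under the standard Normal, i.e., Φ(k). A minimal sketch in Python (not from the text; it assumes only the Normal model of Test T+ and uses scipy):

```python
from scipy.stats import norm

# SEV(mu <= xbar_0 + k*sigma_xbar) = Pr(Z > -k) = Phi(k)
for k in (-1, 0, 1, 1.65, 1.96):
    print(f"k = {k:5.2f}   SEV = {norm.cdf(k):.3f}")
# k = -1.00   SEV = 0.159
# k =  0.00   SEV = 0.500
# k =  1.00   SEV = 0.841
# k =  1.65   SEV = 0.951
# k =  1.96   SEV = 0.975
```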

  9 A good rule of thumb to ascertain if a claim C is warranted is to think of a statistical modus tollens argument, and find what would occur with high probability, were claim C false.

  10 For a shorthand that covers both severity and FEV for Test T+ with small significance level (Section 3.1):

  (FEV/SEV): If d(x₀) is not statistically significant, then μ ≤ x̄₀ + kε σ_x̄ passes the test T+ with severity (1 − ε);

  (FEV/SEV): If d(x₀) is statistically significant, then μ > x̄₀ − kε σ_x̄ passes test T+ with severity (1 − ε),

  where Pr(d(X) > kε) = ε (Mayo and Spanos 2006; Mayo and Cox 2006).

  Tour II

  How Not to Corrupt Power

  5.5 Power Taboos, Retrospective Power, and Shpower

  Let’s visit some of the more populous tribes who take issue with power – by which we mean ordinary power – at least its post-data uses. Power Peninsula is often avoided due to various “keep out” warnings and prohibitions, or researchers come during planning, never to return. Why do some people consider it a waste of time, if not totally taboo, to compute power once we know the data? A degree of blame must go to N-P, who emphasized the planning role of power, and only occasionally mentioned its use in determining what gets “confirmed” post-data. After all, it’s good to plan how large a boat we need for a philosophical excursion to the Lands of Overlapping Statistical Tribes, but once we’ve made it, it doesn’t matter that the boat was rather small. Or so the critic of post-data power avers. A crucial disanalogy is that with statistics, we don’t know that we’ve “made it there” when we arrive at a statistically significant result. The statistical significance alarm goes off, but you are not able to see the underlying discrepancy that generated the alarm you hear. The problem is to make the leap from the perceived alarm to an aspect of a process, deep below the visible ocean, responsible for its having been triggered. Then it is of considerable relevance to exploit information on the capability of your test procedure to result in alarms going off (perhaps with different decibels of loudness), due to varying values of the parameter of interest. There are also objections to power analysis with insignificant results.

  Exhibit (vi): Non-significance + High Power Does Not Imply Support for the Null over the Alternative.

  Sander Greenland (2012) has a paper with this title. The first step is to understand the assertion, giving the most generous interpretation. It deals with non-significance, so our ears are perked for a fallacy of non-rejection. Second, we know that “high power” is an incomplete concept, so he clearly means high power against “the alternative.” We have a handy example: alternative μ.84 in T+ (POW(T+, μ.84) = 0.84). Use the water plant case, T+: H₀: μ ≤ 150 vs. H₁: μ > 150, σ = 10, n = 100. With α = 0.025, z0.025 = 1.96, and the corresponding cut-off in terms of x̄ is [150 + 1.96(10)/√100] = 151.96, so μ.84 = 152.96.

  Now a title like this is supposed to signal a problem, a reason for those “keep out” signs. His point, in relation to this example, boils down to noting that an observed difference may not be statistically significant – may fail to make it to the cut-off – and yet be closer to μ.84 than to μ₀. This happens because the Type II error probability β (here, 0.16)1 is greater than the Type I error probability (0.025).

  For a quick computation, let the cut-off be 152 and μ.84 = 153. Halfway between the alternative 153 and the null 150 is 151.5. Any observed mean greater than 151.5 but less than the cut-off, 152, will be an example of Greenland’s phenomenon: such values are closer to 153, the alternative against which the test has 0.84 power, than to 150, and thus, by a likelihood measure, support 153 more than 150 – even though POW(μ = 153) is high (0.84). Having established the phenomenon, your next question is: so what?
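
  A quick check of Greenland’s phenomenon in these numbers (a sketch in Python, not from the text; the observed mean 151.7 is my illustrative pick from the (151.5, 152) window):

```python
from scipy.stats import norm

sigma_xbar = 10 / 100 ** 0.5        # sigma/sqrt(n) = 1
cutoff, mu_0, mu_84 = 152, 150, 153

# Power against mu = 153: Pr(X_bar >= 152; mu = 153) = Pr(Z >= -1) ~ 0.84
power_153 = norm.sf((cutoff - mu_84) / sigma_xbar)

xbar = 151.7                        # insignificant (below 152), yet closer to 153 than to 150
lik_153 = norm.pdf(xbar, loc=mu_84, scale=sigma_xbar)
lik_150 = norm.pdf(xbar, loc=mu_0, scale=sigma_xbar)
print(round(power_153, 2), lik_153 > lik_150)    # 0.84 True
```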

  It would be problematic if power analysis took the insignificant result as evidence for μ = 150 – maintaining compliance with the ecological stipulation – and I don’t doubt some try to construe it as such, nor that Greenland has been put in the position of needing to correct them. Power analysis merely licenses μ ≤ μ.84, where 0.84 was chosen for “high power.” Glance back at Souvenir X. So at least one of the “keep out” signs can be removed.

  Shpower and Retrospective Power Analysis

  It’s unusual to hear books condemn an approach in a hush-hush sort of way without explaining what’s so bad about it. This is the case with something called post hoc power analysis, practiced by some who live on the outskirts of Power Peninsula. Psst, don’t go there. We hear “there’s a sinister side to statistical power, … I’m referring to post hoc power” (Cumming 2012, pp. 340–1), also called observed power and retrospective (retro) power. I will be calling it shpower analysis. It distorts the logic of ordinary power analysis (from insignificant results). The “post hoc” part comes in because it’s based on the observed results. The trouble is that ordinary power analysis is also post-data. The criticisms are often wrongly taken to reject both.

  Shpower evaluates power with respect to the hypothesis that the population effect size (discrepancy) equals the observed effect size, for example, that the parameter μ equals the observed mean. In T+ this would be to set μ = x̄₀. Conveniently, their examples use variations on test T+. We may define:

  Shpower: POW(μ = x̄₀) = Pr(d(X) ≥ cα; μ = x̄₀).

  The thinking, presumably, is that, since we don’t know the value of μ, we might use the observed x̄₀ to estimate it, and then compute power in the usual way, except substituting the observed value. But a moment’s thought shows the problem – at least for the purpose of using power analysis to interpret insignificant results. Why?

  Since the alternative μ is set equal to the observed x̄₀, and x̄₀ is given as statistically insignificant, we know we are in Case 1 from Section 5.1: the power can never exceed 0.5. In other words, since x̄₀ is below the cut-off, the shpower, Pr(d(X) ≥ cα; μ = x̄₀), is less than 0.5. But power analytic reasoning is all about finding an alternative against which the test has high capability to have rung the significance bell, were that the true parameter value – high power. Shpower is always “slim” (to echo Neyman) against such alternatives. Unsurprisingly, then, shpower analytical reasoning has been roundly criticized in the literature. But the critics think they’re maligning power analytic reasoning.

  Now we know the severe tester insists on using attained power Pr(d(X) ≥ d(x₀); μ′) to evaluate severity, but when addressing the criticisms of power analysis, we have to stick to ordinary power:2

  Ordinary power POW(μ′): Pr(d(X) ≥ cα; μ′)
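
  To fix the contrast numerically, here is a minimal sketch (my numbers, purely for illustration) using the water plant version of T+ with σ_x̄ = 1 and cut-off 152: it computes ordinary power and attained power against an alternative of interest, and shpower, for an insignificant observed mean.

```python
from scipy.stats import norm

sigma_xbar, cutoff = 1.0, 152.0
xbar_0 = 151.5        # observed mean, statistically insignificant
mu_1 = 153.0          # alternative of interest (not the observed value)

ordinary_power = norm.sf((cutoff - mu_1) / sigma_xbar)    # Pr(d(X) >= c_alpha; mu_1)        ~ 0.84
attained_power = norm.sf((xbar_0 - mu_1) / sigma_xbar)    # Pr(d(X) >= d(x_0); mu_1)         ~ 0.93
shpower        = norm.sf((cutoff - xbar_0) / sigma_xbar)  # Pr(d(X) >= c_alpha; mu = xbar_0) ~ 0.31

# For any insignificant xbar_0 (xbar_0 < cutoff) the shpower is below 0.5.
print(round(ordinary_power, 2), round(attained_power, 2), round(shpower, 2))
```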

  An article by Hoenig and Heisey (2001) (“The Abuse of Power”) calls power analysis abusive. Is it? Aris Spanos and I say no (in a 2002 note on them), but the journal declined to publish it. Since then their slips have spread like kudzu through the literature.

  Howlers of Shpower Analysis

  Hoenig and Heisey notice that within the class of insignificant results, the more significant the observed difference is, the higher the “observed power” against μ = x̄₀, until it reaches 0.5 (when x̄₀ reaches the cut-off and becomes significant). “That’s backwards!” they howl. It is backwards if “observed power” is defined as shpower. Because, if you were to regard higher shpower as indicating better evidence for the null, you’d be saying the more statistically significant the observed difference (between x̄₀ and μ₀), the more the evidence of the absence of a discrepancy from the null hypothesis μ₀. That would contradict the logic of tests.
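
  The pattern they describe is easy to reproduce (a sketch with illustrative numbers, again taking σ_x̄ = 1 and cut-off 152): within the class of insignificant results, shpower climbs toward 0.5 as the observed mean approaches the cut-off.

```python
from scipy.stats import norm

sigma_xbar, cutoff = 1.0, 152.0
for xbar_0 in (150.5, 151.0, 151.5, 151.9):              # all statistically insignificant
    shpower = norm.sf((cutoff - xbar_0) / sigma_xbar)    # Pr(d(X) >= c_alpha; mu = xbar_0)
    print(xbar_0, round(shpower, 2))
# 150.5 0.07   151.0 0.16   151.5 0.31   151.9 0.46
```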

  Two fallacies are being committed here. The first we dealt with in discussing Greenland: namely, supposing that a negative result, with high power against μ₁, is evidence for the null rather than merely evidence that μ ≤ μ₁. The more serious fallacy is that their “observed power” is shpower. Neither Cohen nor Neyman defines power analysis this way. They conclude that power analysis is paradoxical and inconsistent with P-value reasoning. You should really only conclude that shpower analytic reasoning is paradoxical. If you’ve redefined a concept and find that a principle that held with the original concept is contradicted, you should suspect your redefinition. It might have other uses, but there is no warrant to discredit the original notion.

  The shpower computation is asking: what’s the probability of getting a statistically significant result, under μ = x̄₀? We still have that the larger the shpower (against μ = x̄₀), the better x̄₀ indicates that μ ≤ x̄₀ – as in ordinary power analysis – it’s just that the indication is never more than 0.5. Other papers and even instructional manuals (Ellis 2010) assume shpower is what retrospective power analysis must mean, and ridicule it because “a nonsignificant result will almost always be associated with low statistical power” (p. 60). Not so. I’m afraid that observed power and retrospective power are both used in the literature to mean shpower. What about my use of severity? Severity will replace the cut-off for rejection with the observed value of the test statistic (i.e., Pr(d(X) ≥ d(x₀); μ₁)), but not the parameter value μ. You might say, we don’t know the value of μ₁. True, but that doesn’t stop us from forming power or severity curves and interpreting results accordingly. Let’s leave shpower and consider criticisms of ordinary power analysis. Again, pointing to Hoenig and Heisey’s article (2001) is ubiquitous.

  Anything Tests Can Do CIs Do Better

  CIs do anything better than tests … No they don’t, yes they do … Annie Get Your Gun is one of my favorite musicals, and while we’ve already seen the close connection between confidence limits and severity, they do not usurp tests. Hoenig and Heisey claim that power, by which they now mean ordinary power, is superfluous – once you have confidence intervals. We focused on CIs with a significant result (Section 4.3, Exhibit (vi)); our example now is a non-significant result. Let’s admit right away that error statistical computations are interrelated, and if you have the correct principle directing you, you could get the severity computations by other means. The big deal is having the correct principle directing you, and this we’ll see is what Hoenig and Heisey are missing.

  Hoenig and Heisey consider an instance of our test T+: a one-sided Normal test of H₀: μ ≤ 0 vs. H₁: μ > 0. The best way to address a criticism is to use the numbers given: “One might observe a sample mean x̄ = 1.4 with σ_x̄ = 1. Thus Z = 1.4 and P = 0.08, which is not significant at α = 0.05” (ibid., p. 3). They don’t tell us the sample size n; it could be that σ = 5 and n = 25, or any other combination to yield (σ/√n) = 1. Since the P-value is 0.08 (Pr(Z > 1.4; μ = 0) = 0.081), this is not significant at the 0.05 level (which requires z = 1.65), leading to a failure to reject H₀. They then point out that the power against μ = 3.29 is high, 0.95 (i.e., Pr(Z > 1.645; μ = 3.29) = 0.95).3 Thus the power analyst would take the result as indicating μ < 3.29. So what’s the problem according to them?

  They note that a 95% upper confidence bound on μ would be 3.05 (1.4 + 1.65), the implication being that it is more informative than what is given by the conservative power analysis. True, they get a tighter upper bound using the observed insignificant result, just as we do with severity. This they take to show that, “once we have constructed a confidence interval, power calculations yield no additional insights” (ibid., p. 4). Superfluous. There’s one small problem: this is not the confidence interval that corresponds to test T+. The 95% confidence interval corresponding to test T+ is a one-sided interval, μ > −0.245, not μ < 3.05. That is, it corresponds to a one-sided lower bound, not an upper bound.
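
  Their numbers are easy to verify (a sketch; it takes σ_x̄ = 1 and z0.05 = 1.645 as in their example):

```python
from scipy.stats import norm

sigma_xbar, z_alpha, xbar = 1.0, 1.645, 1.4

p_value   = norm.sf(xbar / sigma_xbar)              # Pr(Z > 1.4; mu = 0)       ~ 0.081
power_329 = norm.sf(z_alpha - 3.29 / sigma_xbar)    # Pr(Z > 1.645; mu = 3.29)  ~ 0.95
upper_95  = xbar + z_alpha * sigma_xbar             # one-sided 95% upper bound ~ 3.045
lower_95  = xbar - z_alpha * sigma_xbar             # lower bound of the CI dual to T+ ~ -0.245
print(f"{p_value:.3f} {power_329:.2f} {upper_95:.3f} {lower_95:.3f}")
# 0.081 0.95 3.045 -0.245
```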

  From the duality between CIs and tests (Section 3.7), as Hoenig and Heisey correctly state, “all values covered by the confidence interval could not be rejected” (ibid.). More specifically, the confidence interval contains the values that could not be rejected by the given test at the specified level of significance (Neyman 1937). But μ < 3.045 does not give the set of values that T+ would fail to reject, were those values substituted for 0 in the null hypothesis of T+; there are plenty of μ values less than 3.045 that would be rejected, were they the ones tested, for instance, any μ < −1. The CI corresponding to test T+, namely, that μ exceeds the lower confidence bound, doesn’t help with the fallacy of insignificant results – the fallacy at which power analysis is aimed.

  We don’t deny it’s useful to look at an upper bound (e.g., 3.05) to avoid the fallacy of non-rejection, just as it was to block fallacies of rejection (Section 4.3), but there needs to be a principled basis for this move, and that’s what severity gives us. Power analysis is a variation on the severity principle where the observed d(x₀) is replaced by the cut-off cα. But Hoenig and Heisey are at pains to declare power analysis superfluous! They plainly cannot have it both ways – they must either supplement confidence intervals with an adjunct along severity lines or be left with no way to avoid fallacies of insignificant results with the test they consider. Such an adjunct would require relinquishing their assertion: “It would be a mistake to conclude that the data refute any value within the confidence interval” (ibid., p. 4). The one-sided interval is [−0.245, infinity). We assume, of course, they don’t literally mean “refute.”

  Now maybe they (or you) will say I’m being unfair, that one should always do a two-sided interval (corresponding to a two-sided test). But they are keen to argue that power analysis is superfluous for interpreting insignificant results from tests. Suppose we chuck tests and always do two-sided 1 − α confidence intervals. We are still left with inadequacies already noted: First, the justification is purely performance: that the interval was obtained from a procedure with good long-run coverage; second, it relies on choosing a single confidence level and reporting, in effect, whether parameter values are inside or outside. Too dichotomous. Most importantly: The warrant for the confidence interval is just the one given by using attained power in a severity analysis. If this is right, it would make no sense for a confidence interval advocate to reject a severity analysis. You can see this revisiting Section 3.7 on capability and severity. (An example on bone sorting is Byrd 2008).

  Inconclusive?

  Not only do we get an inferential construal of confidence intervals that differentiates the points within the interval rather than treating them all as on a par, we avoid a number of shortcomings of confidence intervals. Here’s one: It is commonly taught that if a 1 − α confidence interval contains both the null and a threshold value of interest, then only a diagnosis of “inconclusive” is warranted. While the inconclusive reading may be a reasonable rule of thumb in some cases, it forfeits distinctions that even ordinary significance levels and power analyses can reveal, if they are not limited to one fixed level. Ecologist Mark Burgman (2005, p. 341) shows how a confidence interval on the decline of threatened species reports the results as inconclusive, whereas a severity assessment shows non-trivial evidence of decline.

  Go back to Hoenig and Heisey and x̄ = 1.4. Their two-sided 95% interval would be [−0.245, 3.04]. Suppose one were quite interested in a μ value in excess of 0.4. Both 0 and 0.4 are in the confidence interval. Are the results really uninformative about 0.4? Recognizing that the test would fairly often (84% of the time) get such an insignificant result even if μ were as large as 0.4 should lead us to say no. Dichotomizing parameter values as rejected or not, as they do, turns the well-known arbitrariness in prespecifying confidence levels into an invidious distinction. Thus, we should deny Hoenig and Heisey’s allegation that power analysis is “logically doomed” (p. 22), while endorsing a more nuanced use of both tests and intervals as in a severity assessment.
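
  The 84% figure is just the probability of a result as small as the observed 1.4, were μ as large as 0.4 (a sketch using the same illustrative setup, σ_x̄ = 1):

```python
from scipy.stats import norm

sigma_xbar, xbar, mu_alt = 1.0, 1.4, 0.4
# Pr(X_bar <= 1.4; mu = 0.4) = Phi((1.4 - 0.4)/1) = Phi(1.0)
pr_as_small = norm.cdf((xbar - mu_alt) / sigma_xbar)
print(round(pr_as_small, 2))    # 0.84
```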

  Our next exhibit looks at retrospective power in a different manner, and in relation, not to insignificant, but to significant results. It’s not an objection to power analysis, but it appears to land us in a territory at odds with severity (as well as CIs and tests).

  Exhibit (vii): Gelman and Carlin (2014) on Retrospective Power.

  They agree with the critiques of performing post-experiment power calculations (which are really shpower calculations), but consider “retrospective design analysis to be useful … in particular when apparently strong (statistically significant) evidence for nonnull effects has been found” (ibid., p. 2). They worry about “magnitude error,” essentially our fallacy of making mountains out of molehills (MM). Unlike shpower, they don’t compute power in relation to the observed effect size, but rather “on an effect size that is determined from literature review or other information external to the data at hand” (ibid.). They claim if you reach a just statistically significant result, yet the test had low power to detect a discrepancy from the null that is known from external sources to be correct, then the result “exaggerates” the magnitude of the discrepancy. In particular, when power gets much below 0.5, they say, statistically significant findings tend to be much larger in magnitude than true effect sizes. By contrast, “if the power is this high [.8], … overestimation of the magnitude of the effect will be small” (ibid., p. 3).
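
  Their point is easy to see in a small simulation (a sketch, not their code; the true effect 0.4 with σ_x̄ = 1 is my illustrative choice, giving power of only about 0.11 in the one-sided 0.05 test): conditioning on statistical significance, the significant estimates overshoot the true effect several-fold.

```python
import numpy as np

rng = np.random.default_rng(1)
mu_true, sigma_xbar, z_alpha = 0.4, 1.0, 1.645

xbar = rng.normal(mu_true, sigma_xbar, size=100_000)   # many repetitions of the study
significant = xbar > z_alpha * sigma_xbar              # one-sided rejections of mu <= 0

power = significant.mean()                             # ~0.11: low power against mu = 0.4
exaggeration = xbar[significant].mean() / mu_true      # mean significant estimate / true effect
print(round(power, 2), round(exaggeration, 1))         # roughly 0.11 and about 5-fold overestimation
```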

  From the MM Fallacy, if POW(μ₁) is high then a just significant result is poor evidence that μ >
