Let’s have a look at (1): comparing the likelihoods of μ0 and μ.9 given z = 1.96. Why look at an alternative so far away from μ0? The value μ.9 gets a low likelihood, even given statistically significant zα supplying a small denominator in (1). A classic issue with Bayes factors or likelihood ratios is the ease of finding little evidence against a null, and even evidence for it, by choosing an appropriately distant alternative. But μ.9 is a “typical choice” in epidemiology, Goodman notes (ibid., p. 491). Sure, it’s a discrepancy we want high power to detect. That’s different from using it to form a comparative likelihood. As Senn remarks, it’s “ludicrous” to consider just the null and μ.9, which scarcely exhaust the parameter space, and “nonsense” to take a single statistically significant result as even decent evidence for the discrepancy we should not like to miss. μ.9 is around 1.28 standard errors in excess of the one-sided cut-off: μ.9 = x̄α + 1.28σx̄ (about 3.24σx̄ from μ0). We know right away that any μ value in excess of the observed statistically significant zα is one we cannot have good evidence for. We can only have decent evidence that μ exceeds values that would form the CI lower bound for a confidence (or severity) level of at least 0.7, 0.8, 0.9, etc. That requires subtracting from the observed data, not adding to it. Were the data generated by μ.9, then 90% of the time we’d have gotten a larger difference than we observed. Goodman might reply that I am using error probability criteria to judge his Bayesian analysis; but his Bayesian analysis was to show the significance test exaggerates evidence against the test hypothesis H0. To the tester, it’s his analysis that’s exaggerating, by giving credibility 0.75 to μ.9. Perhaps it makes sense with the 0.5 prior, but does that make sense?
Assigning 0.75 to the alternative μ.9 (using the likelihood ratio in (1)) does not convey that this is terrible evidence for a discrepancy that large. Granted, the posterior for μ.9 would be even higher using the tail areas in (2), namely, 0.97 – and that is his real point, which I’ve yet to consider. He’s right that using (2) in his Bayesian computation gives a whopping 0.97 posterior probability to μ.9 (instead of merely 0.75, as on the analysis he endorses). Yet a significance tester wouldn’t compute (2); it’s outside significance testing. Considering the tails makes it harder, not easier, to find evidence against a null – when properly used in a significance test.
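To keep the numbers straight, here is a minimal numerical sketch of the two computations just contrasted. It assumes the setup described above: a just-significant z0 = 1.96 with one-sided α = 0.025, the alternative μ.9 placed 1.28 standard errors beyond the cut-off (so roughly 3.24 standard-error units from μ0), and spiked 0.5 priors on the two point hypotheses; the variable names are mine, not Goodman’s.

```python
from scipy.stats import norm

z0 = 1.96                      # just statistically significant standardized difference
alpha = norm.sf(z0)            # one-sided alpha, ~0.025
mu9 = z0 + 1.28                # mu_.9 in standard-error units, ~3.24
power = norm.sf(z0 - mu9)      # POW(mu_.9) = Pr(Z >= 1.96; mu_.9), ~0.90

# (1) point likelihood ratio at z0, then the posterior with 0.5/0.5 spiked priors
lr = norm.pdf(z0 - mu9) / norm.pdf(z0)   # ~3.0 in favor of mu_.9
post_lr = lr / (1 + lr)                  # ~0.75

# (2) tail-area ratio POW(mu_.9)/alpha, then the analogous posterior
rr = power / alpha                       # ~36
post_rr = rr / (1 + rr)                  # ~0.97

# Error-statistical check: were mu_.9 true, a larger difference than z0 = 1.96
# would occur about 90% of the time -- poor grounds for a discrepancy that large.
print(round(post_lr, 2), round(post_rr, 2), round(power, 2))   # 0.75 0.97 0.9
```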
He’s right that using α/POW(μ.9) as a likelihood ratio in computing a posterior probability for the point alternative μ.9 (using spiked priors of 0.5) gives some very strange answers. The tester isn’t looking to compare point values, but to infer discrepancies. I’m doing the best I can to relate his criticism to what significance testers do.
The error statistician deservedly conveys the lack of stringency owed to any inference to μ.9. She finds the data give fairly good grounds that μ is less than μ.9 (confidence level 0.9 if one-sided, 0.8 if two-sided). Statistical quantities are like stew ingredients that you can jumble together into a farrago of computations, but there’s a danger it produces an inference at odds with error statistical principles. For a significance tester, the alleged criticism falls apart; but it also leads me to question Goodman’s posterior in (1).
What’s Wrong with Using (1 − β)/α (or α/(1 − β)) to Compare Evidence?
I think the test taker from last year’s cruise did pretty well, don’t you? But her “farrago” remark leads to a general point about the ills of using (1 − β)/α and α/(1 − β) to compare the evidence in favor of H1 over H0 or H0 over H1, respectively. Let’s focus on the (1 − β)/α formulation. Benjamin and J. Berger (2016) call it the pre-data rejection ratio:
It is the probability of rejection when the alternative hypothesis is true, divided by the probability of rejection when the null hypothesis is true, i.e., the ratio of the power of the experiment to the Type I error of the experiment. The rejection ratio has a straightforward interpretation as quantifying the strength of evidence about the alternative hypothesis relative to the null hypothesis conveyed by the experimental result being statistically significant.
(p. 1)
But does it? I say no (and J. Berger says he concurs3). Let’s illustrate. Looking at Figure 5.3, we can select an alternative associated with power as high as we like by dragging the curve representing H1 to the right.
Imagine pulling it even further than alternative μ.9. How about μ.999? If we consider such an alternative, then POW(T+, μ1) is the area to the right of −3 under the Normal curve, which is a whopping 0.999. For some numbers, use our familiar T+: H0: μ ≤ 0 vs. H1: μ > 0, α = 0.025, n = 25, σ = 1, σx̄ = σ/√n = 0.2. So the cut-off is x̄0.025 = 0 + 1.96σx̄ = 0.392. Thus, μ.999 is approximately 1, since POW(T+, μ = 1) = Pr(X̄ ≥ 0.392; μ = 1) = Pr(Z ≥ −3.04) ≈ 0.999.
Let the observed outcome just reach the cut-off to reject H0, x̄0 = 0.392. The rejection ratio is
POW(T+, μ = 1)/α = 40 (i.e., 0.999/0.025)!
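For anyone who wants to check the arithmetic, here is a short sketch under the stated T+ specification (α = 0.025, n = 25, σ = 1, so σx̄ = 0.2 and cut-off 0.392); the final lines add the point likelihoods at the just-significant outcome, which pull in the opposite direction.

```python
from scipy.stats import norm

sigma_xbar = 1 / 25 ** 0.5          # sigma / sqrt(n) = 0.2
cut = 1.96 * sigma_xbar             # cut-off for the sample mean, 0.392
alpha = norm.sf(1.96)               # ~0.025
power_at_1 = norm.sf((cut - 1.0) / sigma_xbar)   # Pr(Xbar >= 0.392; mu = 1), ~0.999

rejection_ratio = power_at_1 / alpha             # ~40

# Point likelihoods at the just-significant xbar = 0.392 (standardized under each mu):
# the outcome is far more probable under mu = 0 than under mu = 1.
xbar = cut
lik_at_0 = norm.pdf((xbar - 0.0) / sigma_xbar)   # ~0.058
lik_at_1 = norm.pdf((xbar - 1.0) / sigma_xbar)   # ~0.004
print(round(rejection_ratio, 1), round(lik_at_0 / lik_at_1, 1))   # ~40.0, ~15.0
```

So while the rejection ratio is about 40, the likelihoods at the actual outcome favor μ = 0 over μ = 1 by roughly 15 to 1.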
Would even a Likelihoodist wish to say the strength of evidence for μ = 1 is 40 times that of H0? The data are even closer to 0 than to 1.
How then can it seem plausible, for comparativists, to compute a relative likelihood this way?
We can view it through their eyes as follows: take Z ≥ 1.96 as the lump outcome and reason along Likelihoodist lines. The probability is very high that Z ≥ 1.96 under the assumption that μ = 1:
Pr(Z ≥ 1.96; μ = μ.999) = 0.999.
The probability is low that Z ≥ 1.96 under the assumption that μ = μ0 = 0:
Pr(Z ≥ 1.96; μ = μ0) = 0.025.
We’ve observed z0 = 1.96 (so Z ≥ 1.96).
Therefore, μ.999 (i.e., 1) makes the result Z ≥ 1.96 more probable than does μ = 0.
Therefore, the result is better evidence that μ = 1 than it is for μ = 0. But this likelihood reasoning only holds for the specific value of z, not the lump outcome Z ≥ 1.96. Granted, Bayarri, Benjamin, Berger, and Sellke (2016) recommend the pre-data rejection ratio before the data are in, and “the ‘post-experimental rejection ratio’ (or Bayes factor) when presenting their experimental results” (p. 91). The authors regard the pre-data rejection ratio as frequentist, but it turns out they’re using Berger’s “frequentist principle,” which, you will recall, is in terms of error probability₂ (Section 3.6). A creation built on frequentist measures doesn’t mean the result captures frequentist error statistical₁ reasoning. It might be a kind of Frequentstein entity!
Notably, power works in the opposite way. If there’s a high probability you would have observed a larger difference than you did, assuming the data came from a world where μ = μ1, then the data indicate you’re not in a world where μ > μ1.
If Pr(Z > z0; μ = μ1) = high, then Z = z0 is strong evidence that μ ≤ μ1!
Rather than being evidence for μ1, the just statistically significant result, or one that just misses, is evidence against μ being as high as μ1. POW(μ1) is not a measure of how well the data fit μ1, but rather a measure of a test’s capability to detect μ1 by setting off the significance alarm (at size α). Having set off the alarm, you’re not entitled to infer μ1, but only discrepancies that have passed with severity (SIR). Else you’re making mountains out of molehills.
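A quick way to see the reversal, in a sketch using the same assumed T+ numbers (σx̄ = 0.2, cut-off 0.392) with the observed mean just reaching the cut-off:

```python
from scipy.stats import norm

sigma_xbar = 0.2
xbar_0 = 0.392                   # observed mean just reaching the cut-off
# Probability of a larger difference than observed, were mu = 1:
pr_larger_under_1 = norm.sf((xbar_0 - 1.0) / sigma_xbar)
print(round(pr_larger_under_1, 3))   # ~0.999: grounds that mu <= 1, not evidence for mu = 1
```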
5.3 Insignificant Results: Power Analysis and Severity
We’re back at the Museum, and the display on power analysis. It is puzzling that many psychologists talk as if they’re stuck with an account that says nothing about what may be inferred from negative results, when Cohen, the leader of power analysis, was toiling amongst them for years; and a central role for power, post-data, is interpreting non-significant results. The attention to power, of course, is a key feature of N-P tests, but apparently the prevalence of Fisherian tests in the social sciences, coupled, some speculate, with the difficulty in calculating power, resulted in power receiving short shrift. Cohen’s work was to cure all that. Cohen supplied a multitude of tables (when tables were all we had) to encourage researchers to design tests with sufficient power to detect effects of interest. He bemoaned the fact that his efforts appeared to be falling on deaf ears. Even now, problems with power persist, and its use post-data is mired in controversy.
The focus, in the remainder of this Tour, is on negative (or non-statistically significant) results. Test T+ fails to reject the null when the test statistic fails to reach the cut-off point for rejection, i.e., d(x0) ≤ cα. A classic fallacy is to construe no evidence against H0 as evidence of the correctness of H0. A canonical example was in the list of slogans opening this book: failing to find an increased risk is not evidence of no risk increase, if your test had little capability of detecting risks, even if present (as when you made your hurdles too high). The problem is the flip side of the fallacy of rejection: here the null hypothesis “survives” the test, but merely surviving can occur too frequently, even when there are discrepancies from H0.
Power Analysis Follows Significance Test Reasoning
Early proponents of power analysis that I’m aware of include Cohen (1962), Gibbons and Pratt (1975), and Neyman (1955). It was the basis for my introducing severity, Mayo (1983). Both Neyman and Cohen make it clear that power analysis uses the same reasoning as does significance testing.4 First Cohen:
[F]or a given hypothesis test, one defines a numerical value i (for iota) for the [population] ES (effect size), where i is so small that it is appropriate in the context to consider it negligible (trivial, inconsequential). Power (1 − β) is then set at a high value, so that β is relatively small. When, additionally, α is specified, n can be found. Now, if the research is performed with this n and it results in nonsignificance, it is proper to conclude that the population ES is no more than i, i.e., that it is negligible … .
(Cohen 1988, p. 16; α, β substituted for his a, b)
Here Cohen imagines the researcher sets the size of a negligible discrepancy ahead of time – something not always available. Even where a negligible i may be specified, it’s rare that the power to detect it is high. Two important points can still be made: First, Cohen doesn’t instruct you to infer there’s no discrepancy from H0, merely that it’s “no more than i.” Second, even if your test doesn’t have high power to detect negligible i, you can infer the population discrepancy is less than whatever γ your test does have high power to detect. Some call this its detectable discrepancy size.
A little note on language. Cohen distinguishes the population ES and the observed ES, both in σ units. Keeping to Cohen’s ESₛ for “the effect size in the sample” (1988, p. 17) prevents a tendency to run them together. I continue to use “discrepancy” and “difference” for the population and observed differences, respectively, indicating the units being used.
Exhibit (iii): Ordinary Power Analysis.
Now for how the inference from power analysis is akin to significance testing. Let μ1−β be the alternative against which the null in T+ has high power, 1 − β. Power analysis sanctions the inference that would accrue if we switched the null and alternative, yielding the one-sided test in the opposite direction, T−, we might call it. That is, T− tests H0: μ ≥ μ1−β vs. H1: μ < μ1−β at the β level. The test rejects H0 (at level β) when X̄ ≤ μ1−β − zβσx̄. Such a significant result would warrant inferring μ < μ1−β at level β. Using power analysis doesn’t require making this switcheroo. The point is that there’s essentially no new reasoning involved in power analysis, which is why members of the Fisherian tribe manage it without mentioning power.
Ordinary Power Analysis: If data x are not statistically significantly different from H0, and the power to detect discrepancy γ is high, then x indicates that the actual discrepancy is no greater than γ.
A simple example: Use μ.84 in test T+ (H0: μ ≤ 0 vs. H1: μ > 0, α = 0.025, n = 25, σx̄ = 0.2) to create test T−. Test T+ has 0.84 power against μ = 0.6 (with our usual rounding, μ.84 = 0.392 + 1(0.2) ≈ 0.6). So test T− is H0: μ ≥ 0.6 vs. H1: μ < 0.6, and a result is statistically significantly smaller than 0.6 at level 0.16 whenever the sample mean x̄ ≤ 0.4. To check, note that Pr(X̄ ≤ 0.4; μ = 0.6) = Pr(Z ≤ −1) ≈ 0.16.
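A sketch of this check, under the same assumed T+ numbers (α = 0.025, n = 25, σ = 1, σx̄ = 0.2) and the rounding used above (z0.16 ≈ 1, μ.84 ≈ 0.6):

```python
from scipy.stats import norm

sigma_xbar = 0.2
cut_T_plus = 1.96 * sigma_xbar                          # 0.392
power_at_0_6 = norm.sf((cut_T_plus - 0.6) / sigma_xbar) # ~0.85 (0.84 with the usual rounding)

# Test T-: H0: mu >= 0.6 vs. H1: mu < 0.6 at level beta = 0.16,
# rejecting whenever xbar <= 0.6 - 1*(0.2) = 0.4.
cut_T_minus = 0.6 - 1 * sigma_xbar                      # 0.4
beta_check = norm.cdf((cut_T_minus - 0.6) / sigma_xbar) # Pr(Xbar <= 0.4; mu = 0.6), ~0.16
print(round(power_at_0_6, 2), round(beta_check, 2))     # 0.85 0.16
```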
It will be useful to look at the two-sided alternative: test T±. We’d combine the above one-sided test with a test of H0: μ ≤ −μ1−β vs. H1: μ > −μ1−β at the β level. This will be to test H0: μ ≤ −0.6 vs. H1: μ > −0.6, and find a result statistically significantly greater than −0.6 at level 0.16 whenever the sample mean x̄ ≥ −0.4 (i.e., −0.6 + 1(0.2)). If both nulls are rejected (at the 0.16 level), we infer |μ| < 0.6, but the two-sided test has double the Type I error probability: 0.32.
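And a companion sketch for the two-sided version, under the same assumptions: it simply combines the two one-sided rejection regions, with the stated nominal Type I error of 2 × 0.16.

```python
from scipy.stats import norm

sigma_xbar = 0.2
beta = norm.cdf(-1)                      # ~0.16 for each one-sided test
upper_reject = 0.6 - 1 * sigma_xbar      # reject mu >= 0.6 when xbar <= 0.4
lower_reject = -0.6 + 1 * sigma_xbar     # reject mu <= -0.6 when xbar >= -0.4
combined_alpha = 2 * beta                # ~0.32, as stated in the text
print(round(beta, 2), round(upper_reject, 2), round(lower_reject, 2), round(combined_alpha, 2))
```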
How high a power should be regarded as high? How low as low? A power of 0.8 or 0.9 is common, we saw, for detecting “clinically relevant” discrepancies. To anyone who complains that there’s no way to draw a cut-off, note that we merely need to distinguish blatantly high from rather low values. Why have the probability of a Type II error exceed that of the Type I error? Some critics give Neyman and Pearson a hard time about this, but there’s nothing in N-P tests to require it. Balance the errors as you like, N-P say. N-P recommend, based on tests in use, first specifying the test so that the Type I error is treated as more serious than a Type II error, and second, choosing a test that minimizes the Type II error probability given the fixed Type I. In an example of testing a medical risk, Neyman says he places “a risk exists” as the test hypothesis, since it’s worse (for the consumer) to erroneously infer risk absence (1950, chapter V). Promoters of the precautionary principle are often surprised to learn this about N-P tests. However, there’s never an automatic “accept/reject” in a scientific context.
Neyman Chides Carnap, Again
Neyman was an early power analyst? Yes, it’s in his “The Problem of Inductive Inference” (1955) where we heard Neyman chide Carnap for ignoring the statistical model (Section 2.7). Neyman says:
I am concerned with the term ‘ degree of confirmation’ introduced by Carnap … We have seen that the application of the locally best one-sided test to the data … failed to reject the hypothesis [that the n observations come from a source in which the null hypothesis is true]. The question is: does this result ‘ confirm’ the hypothesis [that H 0 is true of the particular data set]?
(ibid., p. 40)
The answer … depends very much on the exact meaning given to the words ‘confirmation,’ ‘confidence,’ etc. If one uses these words to describe one’s intuitive feeling of confidence in the hypothesis tested H0, then … the attitude described is dangerous … the chance of detecting the presence [of discrepancy from the null], when only [n] observations are available, is extremely slim, even if [the discrepancy is present]. Therefore, the failure of the test to reject H0 cannot be reasonably considered as anything like a confirmation of H0. The situation would have been radically different if the power function [corresponding to a discrepancy of interest] were, for example, greater than 0.95.
(ibid., p. 41)
Ironically, Neyman also criticizes Fisher’s move from a large P-value to inferring the null hypothesis as:
much too automatic [because] … large values of P may be obtained when the hypothesis tested is false to an important degree. Thus, … it is advisable to investigate … what is the probability (probability of error of the second kind) of obtaining a large value of P in cases when the [null is false … to a specified degree].
(1957a, p. 13)
Should this calculation show that the probability of detecting an appreciable error in the hypothesis tested was large, say 0.95 or greater, then and only then is the decision in favour of the hypothesis tested justifiable in the same sense as the decision against this hypothesis is justifiable when an appropriate test rejects it at a chosen level of significance.
(1957b, pp. 16–17)
Typically, the hypothesis tested, [H0] in the N-P context, could be swapped with the alternative. Let’s leave the museum, where a leader of the severe testing tribe makes some comparisons.
Attained Power Π
So power analysis is in the spirit of severe testing. Still, power analysis is calculated relative to an outcome just missing the cut-off cα. This corresponds to an observed difference whose P-value just exceeds α. This is, in effect, the worst case of a negative result. What if the actual outcome yields an even smaller difference (larger P-value)?
Consider test T+ (α = 0.025) above. No one wants to turn pages, so here it is: H0: μ ≤ 0 vs. H1: μ > 0, α = 0.025, n = 25, σ = 1, σx̄ = 0.2. So the cut-off is x̄0.025 = 0.392 or, with the 2σx̄ cut-off, 0.4. Consider an arbitrary inference μ ≤ 0.2. We know POW(T+, μ = 0.2) = 0.16 (0.2 is 1σx̄ subtracted from the 0.4 cut-off). A value of 0.16 is quite lousy power. It follows that no statistically insignificant result can warrant μ ≤ 0.2 for the power analyst. Power analysis only allows ruling out values as high as μ.8, μ.84, μ.9, and so on. The power of a test is fixed once and for all and doesn’t change with the observed mean x̄. Why consider every non-significant result as if it just barely missed the cut-off? Suppose x̄ = −0.2. This is 2σx̄ lower than 0.2. Surely that should be taken into account? It is: 0.2 is the upper 0.975 confidence bound, and the severity for μ ≤ 0.2 is 0.975.5
What enables substituting the observed value of the test statistic, d(x0), is the counterfactual reasoning of severity:
If, with high probability, the test would have resulted in a larger observed difference (a smaller P-value) than it did, were the discrepancy as large as γ, then there’s a good indication that the discrepancy is no greater than γ, i.e., that μ ≤ μ0 + γ.
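As a wrap-up, here is a small sketch of this severity reasoning for test T+ under the running assumptions (μ0 = 0, σ = 1, n = 25); the function name is mine, not standard notation.

```python
from scipy.stats import norm

def severity_upper(xbar, gamma, mu0=0.0, sigma=1.0, n=25):
    """SEV(mu <= mu0 + gamma) = Pr(Xbar > xbar; mu = mu0 + gamma)."""
    sigma_xbar = sigma / n ** 0.5
    return norm.sf((xbar - (mu0 + gamma)) / sigma_xbar)

# Worst-case negative result (observed mean just misses the 0.392 cut-off):
# this is, in effect, what ordinary power analysis evaluates.
print(round(severity_upper(0.392, 0.6), 2))   # ~0.85: fairly good grounds for mu <= 0.6
print(round(severity_upper(0.392, 0.2), 2))   # ~0.17: poor grounds for mu <= 0.2

# The attained result discussed above, xbar = -0.2: mu <= 0.2 is now well indicated.
print(round(severity_upper(-0.2, 0.2), 3))    # ~0.977 (0.975 with the 2-SE rounding)
```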