
Statistical Inference as Severe Testing


by Deborah G Mayo


  Granted, as Senn admits, the test “lacks ambition” (ibid., p. 202), but with more data and with results surpassing the minimal cut-off, we may uncover a clinically relevant discrepancy. Why not just set up the test to enable the clinically relevant discrepancy to be inferred whenever the null is rejected?

  H0: μ ≤ Δ vs. H1: μ > Δ.

  This requires redefining Δ. “It is no longer ‘the difference we should not like to miss’ but instead becomes ‘the difference we should like to prove obtains’” (ibid.). Some call this the “clinically irrelevant difference” (ibid.). But then we can’t also have high power to detect H1: μ > Δ.

  [I]f the true treatment difference is Δ, then the observed treatment difference will be less than Δ in approximately 50% of all trials. Therefore, the probability that it is less than the critical value must be greater than 50%.

  (ibid., p. 202)

  Indeed, it will be approximately 1 − α. So the power – the probability the observed difference exceeds the critical value under H1 – is, in this case, around α. The researcher is free to specify the null as H0: μ ≤ Δ, but Senn argues against doing so, at least in drug testing, because “a nonsignificant result will often mean the end of the road for a treatment. It will be lost forever. However, a treatment which shows a ‘significant’ effect will be studied further” (ibid.). This goes beyond issues of interest now. The point is: Δ cannot be the value in H0 and also the value against which we want 0.8 power to detect, i.e., μ.8.
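  Senn’s arithmetic is easy to check with a minimal sketch (not from the book; the values of Δ, σ, and n below are illustrative assumptions): when the null is H0: μ ≤ Δ and the true difference is exactly Δ, the observed difference falls below the critical value with probability 1 − α, so the power at Δ is only α.

```python
# Sketch: power of a one-sided Normal test at the null boundary (illustrative values).
from scipy.stats import norm

alpha, Delta, sigma, n = 0.025, 1.0, 1.0, 30   # hypothetical numbers, for illustration only
se = sigma / n**0.5
cutoff = Delta + norm.isf(alpha) * se          # critical value for H0: mu <= Delta

p_below_cutoff = norm.cdf(cutoff, loc=Delta, scale=se)   # Pr(observed < cutoff; mu = Delta)
power_at_Delta = norm.sf(cutoff, loc=Delta, scale=se)    # Pr(observed >= cutoff; mu = Delta)

print(p_below_cutoff)   # ~0.975 = 1 - alpha
print(power_at_Delta)   # ~0.025 = alpha
```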

  If testing H0: μ ≤ 0 vs. H1: μ > 0, then a just α-significant result is poor evidence for μ ≥ μ.8 (or another alternative against which the test has high power). To think it’s good evidence is nonsense. Senn’s related point is that it is ludicrous to assume the effect is either 0 or a clinically relevant difference, as if we are testing

  H0: μ = 0 vs. H1: μ > Δ.

  “But where we are unsure whether a drug works or not, it would be ludicrous to maintain that it cannot have an effect which, while greater than nothing, is less than the clinically relevant difference” (ibid., p. 201). That is, it is ludicrous to cut out everything in between 0 and Δ. By the same token, it would seem odd to give a 0.5 prior probability to H0, and the remaining 0.5 to H1. We will have plenty of occasions to return to Senn’s points about what’s nonsensical and ludicrous.

  Trade-offs and Benchmarks

  Between H0 and the cut-off x̄_α, the power goes from α to 0.5.

  Keeping to our simple test T+ will amply reward us here.

  a. The power against H0 is α. We can use the power function to define the probability of a Type I error or the significance level of the test:

  POW(T+, μ0) = Pr(X̄ ≥ x̄_α; μ0), where x̄_α = μ0 + z_α σ_x̄.

  Standardizing X̄, we get Z = (X̄ − μ0)/σ_x̄.

  The power at the null is Pr(Z ≥ z_α; μ0) = α.

  It’s the low power against H0 that warrants taking a rejection as evidence that μ > μ0. This is desirable: we infer an indication of discrepancy from H0 because a null world would probably have resulted in a smaller difference than we observed.

  b. The power of T+ at the cut-off, μ1 = x̄_α (= μ0 + z_α σ_x̄), is 0.5. In that case, Z = (x̄_α − μ1)/σ_x̄ = 0, and Pr(Z ≥ 0) = 0.5, so POW(T+, μ1 = μ0 + z_α σ_x̄) = 0.5.

  The power only gets to be greater than 0.5 for alternatives that exceed the cut-off x̄_α, whatever it is. As noted, μ.8 exceeds the cut-off by 0.85σ_x̄: μ.8 = x̄_α + 0.85σ_x̄. Tests ensuring 0.9 power are also often of interest: μ.9 = x̄_α + 1.28σ_x̄. We get these shortcuts:

  Case 1: POW(T+, μ1) for μ1 between H0 and the cut-off x̄_α:

  If μ1 = x̄_α − kσ_x̄, then POW(T+, μ1) = area to the right of k under N(0,1) (≤ 0.5).

  Case 2: POW(T+, μ1) for μ1 greater than the cut-off x̄_α:

  If μ1 = x̄_α + kσ_x̄, then POW(T+, μ1) = area to the right of −k under N(0,1) (> 0.5).

  Remember x̄_α is μ0 + z_α σ_x̄.
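  A minimal sketch (mine, not the text’s) reproduces these benchmarks numerically; the particular μ0, σ, n, and α below are assumptions chosen only for illustration.

```python
# Sketch: power of T+ (H0: mu <= mu0 vs. H1: mu > mu0) as a function of mu1.
from scipy.stats import norm

mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025     # illustrative assumptions
se = sigma / n**0.5
z_alpha = norm.isf(alpha)                      # ~1.96
cutoff = mu0 + z_alpha * se                    # x-bar_alpha

def power(mu1):
    # Pr(X-bar >= cutoff; mu1) = area to the right of (cutoff - mu1)/se under N(0,1)
    return norm.sf((cutoff - mu1) / se)

print(power(mu0))                 # ~0.025 = alpha   (power at the null)
print(power(cutoff))              # 0.5              (power at the cut-off)
print(power(cutoff - 1 * se))     # Case 1, k = 1 -> ~0.16 (<= 0.5)
print(power(cutoff + 1 * se))     # Case 2, k = 1 -> ~0.84 (> 0.5), i.e., mu_.84
print(power(cutoff + 1.28 * se))  # ~0.90, i.e., mu_.9
```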

  Trade-offs Between the Type I and Type II Error Probability

  We know that, for a given test, as the probability of a Type I error goes down, the probability of a Type II error goes up (and power goes down). And as the probability of a Type II error goes down (and power goes up), the probability of a Type I error goes up, assuming we leave everything else the same. There’s a trade-off between the two error probabilities. (No free lunch.) So if someone said: As the power increases, the probability of a Type I error decreases, they’d be saying, as the Type II error decreases, the probability of a Type I error decreases. That’s the opposite of a trade-off! You’d know automatically they had made a mistake or were simply defining things in a way that differs from standard N-P statistical tests. Now you may say, “I don’t care about Type I and II errors, I’m interested in inferring estimated effect sizes.” I too want to infer magnitudes. But those will be ready to hand once we tell what’s true about the existing concepts.

  While μ.8 is obtained by adding 0.85σ_x̄ to the cut-off x̄_α, in day-to-day rounding, if you’re like me, you’re more likely to remember the result of adding 1σ_x̄ to x̄_α. That takes us to a value of μ against which the test has 0.84 power, μ.84:

  μ.84 = x̄_α + 1σ_x̄: the power of test T+ to detect an alternative that exceeds the cut-off by 1σ_x̄ is 0.84.

  In test T+ the range of possible values of X̄ and µ are the same, so we are able to set µ values this way, without confusing the parameter and sample spaces.

  Exhibit (i).

  Let test T+ (α = 0.025) be H0: μ ≤ 0 vs. H1: μ > 0, with n = 25, σ = 1, so σ_x̄ = 0.2. Using the 2 cut-off: x̄_0.025 = 0 + 2(0.2) = 0.4 (using 1.96 it’s 0.392). Suppose you are instructed to decrease the Type I error probability α to 0.001 but it’s impossible to get more samples. This requires the hurdle for rejection to be higher than in our original test. The new cut-off for test T+ will be x̄_0.001 = 0 + 3(0.2) = 0.6: it must be 3σ_x̄ greater than 0 rather than only 2σ_x̄. We decrease α (the Type I error probability) from 0.025 to 0.001 by moving the hurdle over to the right by 1 unit (one σ_x̄). But we’ve just made the power lower for any discrepancy or alternative. For what value of µ does this new test have 0.84 power?

  POW(T+, α = 0.001, µ.84 = ?) = 0.84.

  We know: µ.84 = 0.6 + 0.2 = 0.8. So, POW(T+, α = 0.001, µ = 0.8) = 0.84. Decreasing the Type I error by moving the hurdle over to the right by 1 unit results in the alternative against which we have 0.84 power, µ.84, also moving over to the right by 1 unit, from 0.6 to 0.8 (Figure 5.1). We see the trade-off very neatly, at least in one direction.

  Figure 5.1 POW(T+, α = 0.001, µ = 0.8) = 0.84.

  Consider the discrepancy of μ = 0.6 (Figure 5.2). The power to detect 0.6 in test T+ (α = 0.001) is now only 0.5! In test T+ (α = 0.025) it is 0.84. Test T+ (α = 0.001) is less powerful than T+ (α = 0.025).

  Figure 5.2 POW(T+, α = 0.001, µ = 0.6) = 0.5.
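  Readers who want to check Exhibit (i) numerically can use this short sketch (my own, not the book’s); it uses the rounded cut-offs of 2 and 3 standard errors, just as the exhibit does.

```python
# Sketch: the numbers in Exhibit (i), with rounded z cut-offs (2 and 3).
from scipy.stats import norm

sigma, n, mu0 = 1.0, 25, 0.0
se = sigma / n**0.5                 # 0.2

cut_025 = mu0 + 2 * se              # 0.4  (alpha ~ 0.025)
cut_001 = mu0 + 3 * se              # 0.6  (alpha ~ 0.001)

def power(cutoff, mu1):
    # Pr(X-bar >= cutoff; mu1)
    return norm.sf((cutoff - mu1) / se)

print(cut_025 + se, power(cut_025, cut_025 + se))   # mu_.84 = 0.6 for T+(alpha = 0.025), power ~0.84
print(cut_001 + se, power(cut_001, cut_001 + se))   # mu_.84 = 0.8 for T+(alpha = 0.001), power ~0.84
print(power(cut_025, 0.6))                          # ~0.84 (Figure 5.2 comparison)
print(power(cut_001, 0.6))                          # 0.5
```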

  Should you hear someone say that the higher the power, the higher the hurdle for rejection, you’d know they are confused or using terms in an incorrect way. (The hurdle is how large the cut-off must be before rejecting at the given level.) Why then do Ziliak and McCloskey, popular critics of significance tests, announce: “refutations of the null are trivially easy to achieve if power is low enough or the sample is large enough” (2008a, p. 152)? Increasing sample size means increased power, so the second disjunct is correct. The first disjunct is not. One might be tempted to suppose they mean “power is high enough,” but one would be mistaken. They mean what they wrote. Aris Spanos (2008a) points this out (in a review of their book), and I can’t figure out why they dismiss such corrections as “a lot of technical smoke” (2008b, p. 166).

  Ziliak and McCloskey Get Their Hurdles in a Twist

  Still, their slippery slides are quite illuminating.

  If the power of a test is low, say, 0.33, then the scientist will two times in three accept the null and mistakenly conclude that another hypothesis is false. If on the other hand the power of a test is high, say, 0.85 or higher, then the scientist can be reasonably confident that at minimum the null hypothesis (of, again, zero effect if that is the null chosen) is false and that therefore his rejection of it is highly probably correct.

  (Ziliak and McCloskey 2008a, pp. 132–3)

  With a wink and a nod, the first sentence isn’t too bad, even though, at the very least, it is mandatory to specify a particular “another hypothesis,” µ′. But what about the statement: if the power of a test is high, then a rejection of the null is probably correct?

  We follow our rule of generous interpretation to try to see it as true. Let’s allow the “;” in the first premise to be a conditional probability “|”, using μ.84:

  1. Pr(Test T+ rejects the null | μ.84) = 0.84.

  2. Test T+ rejects the null hypothesis.

  Therefore, the rejection is correct with probability 0.84.

  Oops. The premises are true, but the conclusion fallaciously transposes premise 1 to obtain the conditional probability Pr(μ.84 | test T+ rejects the null) = 0.84. What I think they want to say, or at any rate what would be correct, is

  Pr(Test T+ does not reject the null hypothesis | μ.84) = 0.16.

  So the Type II error probability is 0.16. Looking at it this way, the flaw is in computing the complement of premise 1 by transposing (as we saw in the Higgs example, Section 3.8). Let’s be clear about significance levels and hurdles. According to Ziliak and McCloskey:

  It is the history of Fisher significance testing. One erects little “significance” hurdles, six inches tall, and makes a great show of leaping over them, … If a test does a good job of uncovering efficacy, then the test has high power and the hurdles are high not low.

  (ibid., p. 133)

  They construe “little significance” as little hurdles! It explains how they wound up supposing high power translates into high hurdles. It’s the opposite. The higher the hurdle, the more difficult it is to reject, and the lower the power. High hurdles correspond to insensitive tests, like insensitive fire alarms. It might be that using “sensitivity” rather than power would make this abundantly clear. We may coin: The high power = high hurdle (for rejection) fallacy. A powerful test does give the null hypothesis a harder time in the sense that it’s more probable that discrepancies from it are detected. That makes it easier for H1. Z & M have their hurdles in a twist.
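  Before leaving this exhibit, a small numerical sketch of the transposition flaw noted above: Pr(test rejects | μ.84) = 0.84 does not transpose into Pr(μ.84 | test rejects) without priors. The 0.5/0.5 prior below is purely an illustrative assumption of mine, no part of the N-P test.

```python
# Sketch: transposing Pr(reject | mu_.84) into Pr(mu_.84 | reject) requires a prior.
pr_reject_given_alt  = 0.84    # power against mu_.84 (premise 1)
pr_reject_given_null = 0.025   # Type I error probability

prior_alt = prior_null = 0.5   # illustrative assumption only
posterior_alt = (pr_reject_given_alt * prior_alt) / (
    pr_reject_given_alt * prior_alt + pr_reject_given_null * prior_null)

print(posterior_alt)   # ~0.97, not 0.84: the conditional probability does not simply transpose
```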

  5.2 Cruise Severity Drill: How Tail Areas (Appear to) Exaggerate the Evidence

  The most influential criticisms of statistical significance tests rest on highly plausible intuitions, at least from the perspective of a probabilist. We are about to visit a wonderfully instructive example from Steven Goodman. It combines the central skills gathered up from our journey, but with a surprising twist. As always, it’s a canonical criticism – not limited to Goodman. He happens to give a much clearer exposition than most, and, on top of that, is frank about his philosophical standpoint. Let’s listen:

  To examine the inferential meaning of the p value, we need to review the concept of inductive evidence. An inductive measure assigns a number (a measure of support or credibility) to a hypothesis, given observed data. … By this definition the p value is not an inductive measure of evidence, because it involves only one hypothesis and because it is based partially on unobserved data in the tail region.

  To assess the quantitative impact of these philosophical issues, we need to turn to an inductive statistical measure: mathematical likelihood.

  (Goodman 1993, p. 490)

  Well that settles things quickly. Influenced by Royall, Goodman has just listed the keynotes from the standpoint of “evidence is comparative likelihood” seen as far back as Excursion 1 (Tour II). Like the critics we visited in Excursion 4 (Sections 4.4 and 4.5), Goodman finds that the P-value exaggerates the evidence against a null hypothesis because the likelihood ratio (or Bayes factor) in favor of a chosen alternative is not as large as the P-value would suggest. He admits that one’s assessment here will turn on philosophy. On Goodman’s philosophy, it’s the use of the tail area that deceitfully blows up the evidence against the null hypothesis. Now in Section 3.4, Jeffreys’ tail area criticism, we saw that considering the tails makes it harder, not easier, to find evidence against a null. Goodman purports to show the opposite. That’s the new twist.

  Three Steps to the Argument

  Goodman’s context involves statistically significant results – he writes them as z values, as it is a case of Normal testing. We’re not given the sample size or the precise test, but it won’t matter for the key argument. He gives the two-sided α value, although “[w]e assume that we know the direction of the effect” (ibid., p. 491). I am not objecting. Even if we run a two-sided test, once we see the direction, it makes sense to look at the power of the relevant one-sided test, but double the α value. There are three steps. First step: form the likelihood ratio of the statistically significant outcome z_0.025 (i.e., 1.96) under the null hypothesis and some alternative (where Pr is the density):

  Pr(z_α; H′)/Pr(z_α; H0).

  But which alternative H′? Alternatives against which the test has high power, say 0.8, 0.84, or 0.9, are of interest, and he chooses μ.9. He writes this alternative as μ = Δ_0.05, 0.9, “the difference against which the hypothesis test has two-sided α = 0.05 and one-sided β = 0.10 (power = 0.90)” (ibid., p. 496). We know from our benchmarks that μ.9 is approximately 1.28 standard errors from the one-sided cut-off: μ.9 = x̄_0.025 + 1.28σ_x̄, about 3.24σ_x̄ above μ0. The likelihood for alternative μ.9 is smaller than if one had used the maximum likely alternative (as in Johnson). 1

  I’ll follow Goodman in computing the likelihood of the null over the alternative, although most of the authors we’ve considered do the reverse. Not that it matters so long as you keep them straight.

  “Two likelihood ratios are compared, one for a ‘precise’ p value, e.g., p = 0.03, and one for … p ≤ α = 0.03. The alternative hypothesis used here is the one against which the hypothesis test has 90 percent power (two-sided α = 0.05) …” (ibid., pp. 490–1). The precise and imprecise P-values correspond to reporting z = 1.96 and z ≥ 1.96, respectively. 2 The likelihood ratio for the precise P-value “corresponds to the ratio of heights of the two probability densities” (ibid., p. 491). Number this as (1):

  (1) Pr(Z = 1.96; μ0)/Pr(Z = 1.96; μ.9) = 0.058/0.176 = 0.33.

  These are the ordinates not the tail areas of the Normal distribution.
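  Here is a sketch of the computation behind (1); it assumes, as in our benchmarks, that μ.9 lies about 1.28 standard errors beyond the 1.96 cut-off (roughly 3.24 standard errors from μ0), and works on the standardized z scale for convenience.

```python
# Sketch: the "precise P-value" likelihood ratio (1), in standardized (z) units.
from scipy.stats import norm

z_obs = 1.96
z_alt = 1.96 + 1.28                   # mu_.9 in z units: ~1.28 SE beyond the cut-off

lik_null = norm.pdf(z_obs)            # ordinate at 1.96 under the null:  ~0.058
lik_alt  = norm.pdf(z_obs - z_alt)    # ordinate at 1.96 under mu_.9:     ~0.176
print(lik_null, lik_alt, lik_null / lik_alt)   # ratio ~0.33
```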

  That’s the first step. The second step is to consider the likelihood ratio for the imprecise P-value, where the result is described coarsely as {z ≥ 1.96} rather than z = 1.96 (or equivalently p ≤ 0.025 rather than p = 0.025):

  (2) Pr(Z ≥ 1.96; μ0)/Pr(Z ≥ 1.96; μ.9).

  We see at once that the value of (2) is

  α/POW(μ.9) = 0.025/0.9 ≈ 0.03.
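  A companion sketch for (2), again placing μ.9 roughly 3.24 standard errors above the null in z units (an assumption carried over from the benchmarks above): the ratio of tail areas rather than ordinates.

```python
# Sketch: the "imprecise P-value" ratio (2), using tail areas rather than ordinates.
from scipy.stats import norm

z_cut, z_alt = 1.96, 1.96 + 1.28
tail_null = norm.sf(z_cut)            # Pr(Z >= 1.96; mu0)   ~0.025
tail_alt  = norm.sf(z_cut - z_alt)    # Pr(Z >= 1.96; mu_.9) ~0.90 (the power)
print(tail_null / tail_alt)           # ~0.028, i.e., roughly alpha / POW(mu_.9) = 0.03
```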

  The comparative evidence for the null using (1) is considerably larger (0.33) than using (2) where it’s only 0.03. Figure 5.3 shows what’s being compared.

  Figure 5.3 Comparing precise and imprecise P-values. The likelihood ratio is A/B, the ratio of the curve heights at the observed data. The likelihood ratio associated with the imprecise P-value (p ≤ α) is the ratio of the small darkly shaded area to the total shaded area.

  (adapted from Goodman 1993, Figure 1, p. 492)

  The difference Goodman wishes to highlight looks even more impressive if we flip the ratios in (1) and (2) to get the comparative evidence for the alternative compared to the null. We get 0.176/0.058 = 3.03 using z = 1.96, and 0.9/0.025 = 36 using z ≥ 1.96. Either way you look at it, using the tail areas to compare support exaggerates the evidence against the null in favor of μ.9. Or so it seems.

  Now for the third step: assign priors of 0.5 to μ0 and to μ.9:

  With this alternative hypothesis [μ.9], ‘p ≤ 0.05’ represents 11 times (= 0.33/0.03) less evidence in support of the null hypothesis than does ‘p = 0.05.’ Using Bayes’ Theorem, with initial probabilities of 50 percent on both hypotheses (i.e., initial odds = 1), this means that after observing p = 0.05, the probability that the null hypothesis is true falls only to 25 percent (= 0.33/(1 + 0.33)). When p ≤ 0.05, the truth probability of the null hypothesis drops to 3 percent (= 0.03/(1 + 0.03)).

  (ibid., p. 491)
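  For completeness, here is the arithmetic of the third step as a sketch; the 0.5/0.5 priors and the rounded likelihood ratios are taken from the quotation above, and the helper function is mine, for illustration only.

```python
# Sketch: Goodman's third step with 0.5/0.5 priors on mu0 and mu_.9.
lr_precise, lr_imprecise = 0.33, 0.03    # likelihood ratios for H0 from (1) and (2), as rounded above

def posterior_null(lr_null_vs_alt, prior_odds=1.0):
    # With prior odds of 1, posterior odds = likelihood ratio, so Pr(H0 | data) = lr / (1 + lr).
    post_odds = lr_null_vs_alt * prior_odds
    return post_odds / (1 + post_odds)

print(posterior_null(lr_precise))    # ~0.25: "p = 0.05" drops the null to ~25 percent
print(posterior_null(lr_imprecise))  # ~0.03: "p <= 0.05" drops it to ~3 percent
```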

  He concludes:

  When we use the tail region to represent a result that is actually on the border, we misrepresent the evidence, making the case against the null hypothesis look much stronger than it actually is.

 
(ibid.)

  The posterior probabilities, he says, are assessments of credibility in the hypothesis. Join me for a break at the coffeehouse, called “Take Type II”, where we’ll engage a live exhibit.

  Live Exhibit (ii): Drill Prompt.

  Using any relevant past travels, tours, and souvenirs, critically appraise his argument. Resist the impulse to simply ask, “where do significance tests recommend using error probabilities to form a likelihood ratio, and treat the result as a relative support measure?” Give him the most generous interpretation in order to see what’s being claimed.

  How do you grade the following test taker?

  Goodman actually runs two distinct issues together: The first issue contrasts a report of the observed P-values with a report of whether or not a predesignated cut-off is met (his “imprecise” P-value); the second issue is using likelihood ratios (as in (1)) as opposed to using tail areas. As I understand him, Goodman treats a report of the precise P-value as calling for a likelihood analysis as in (1), supplemented in the third step to get a posterior probability. But, even a reported P-value will lead to the use of tail areas in Fisherian statistics. This might not be a major point, but it is worth noting. For the error statistician, use of the tail area isn’t to throw away the particular data and lump together all z ≥ z_α. It’s to signal an interest in the probability the method would produce z ≥ z_α under various hypotheses, to determine its capabilities (Translation Guide, Souvenir C). Goodman’s hypothesis tester only reports z ≥ z_α and so he portrays the Bayesian (or Likelihoodist) as forced to compute (2). The error statistical tester doesn’t advocate this.

 
