
Statistical Inference as Severe Testing


by Deborah G Mayo

μ₁; while if POW(μ₁) is low it's good evidence that μ > μ₁. Is their retrospective design analysis at odds with severity, P-values, and confidence intervals? Here's one way of making their assertion true using test T+: If you take the observed mean x̄ as the estimate of μ, and you happen to know the true value of μ is smaller than x̄ – between μ = μ₀ and x̄ (where the power ranges from α to 0.5) – then x̄ obviously exceeds ("exaggerates") μ. Still I'm not sure this brings agreement.

  Let's use our water plant accident testing µ ≤ 150 vs. µ > 150 (with σ = 10, n = 100, so the standard error of the mean is 1). The critical value for α = 0.025 is d_0.025 = 1.96, or x̄_0.025 = 151.96. You observe a just statistically significant result x̄ = 152. You reject the null hypothesis and infer µ > 150. Gelman and Carlin write:

  [An] unbiased estimate will have 50% power if the true effect is 2 standard errors away from zero, it will have 17% power if the true effect is 1 standard error away from 0, and it will have 10% power if the true effect is 0.65 standard errors away from 0.

  (ibid., p. 4)

  These correspond to µ = 152, µ = 151, and µ = 150.65. It's odd to talk of an estimate having power; what they mean is that the test T+ has a power of 0.5 to detect a discrepancy 2 standard errors away from 150, and so on. The "unbiased estimate" here is the statistically significant x̄ = 152. To check that we match their numbers, compute POW(µ = 152), POW(µ = 151), and POW(µ = 150.65) 4:

  (a) POW(µ = 152) = Pr(X̄ ≥ 151.96; µ = 152) ≈ 0.5;

  (b) POW(µ = 151) = Pr(X̄ ≥ 151.96; µ = 151) ≈ 0.17;

  (c) POW(µ = 150.65) = Pr(X̄ ≥ 151.96; µ = 150.65) ≈ 0.1.
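  To check these numbers yourself, here is a minimal sketch in Python, assuming (as in the example) a standard error of 1 and the cutoff x̄_0.025 = 151.96; the code is mine, not Gelman and Carlin's:

```python
from scipy.stats import norm

# Sketch of the power of test T+ (H0: mu <= 150 vs. H1: mu > 150),
# assuming sigma = 10 and n = 100, so the standard error of the mean is 1.
se = 1.0
cutoff = 150 + 1.96 * se          # x-bar_0.025 = 151.96

def power(mu_1, cutoff=cutoff, se=se):
    """POW(mu_1) = Pr(X-bar >= cutoff; mu = mu_1)."""
    return 1 - norm.cdf((cutoff - mu_1) / se)

for mu_1 in (152, 151, 150.65):
    print(f"POW(mu = {mu_1}) = {power(mu_1):.2f}")
# -> roughly 0.52, 0.17, 0.10, matching (a)-(c) up to rounding
```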

  They appear to be saying that there's better evidence for µ ≥ 152 than for µ ≥ 151 than for µ ≥ 150.65, since the power assessments go down. Nothing changes if we write >. Notice that the SEV computations for µ ≥ 152, µ ≥ 151, and µ ≥ 150.65 are the complements of the corresponding powers: 0.49, 0.83, and 0.9. So the lower the power for μ₁, the stronger the evidence for µ > μ₁. Thus there's disagreement. But let's try to pursue their thinking.
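  A sketch of the severity side of that duality, taking the just-significant observed mean at the cutoff 151.96 with the same assumed standard error of 1:

```python
from scipy.stats import norm

# Sketch: severity for mu > mu_1 given a just-significant observed mean,
# taking x-bar_obs at the cutoff 151.96 and se = 1 (assumed as above).
se = 1.0
x_obs = 151.96

def sev_greater(mu_1, x_obs=x_obs, se=se):
    """SEV(mu > mu_1) = Pr(X-bar <= x_obs; mu = mu_1)."""
    return norm.cdf((x_obs - mu_1) / se)

for mu_1 in (152, 151, 150.65):
    print(f"SEV(mu > {mu_1}) = {sev_greater(mu_1):.2f}")
# -> about 0.48, 0.83, 0.90 (rounding aside): one minus the corresponding
#    powers, so lower power at mu_1 goes with stronger evidence for mu > mu_1.
```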

  Suppose we observe x̄ = 152. Say we have excellent reason to think it's too big. We're rather sure the mean temperature is no more than ~150.25 or 150.5, judging from previous cooling accidents, or perhaps from the fact that we don't see some drastic effects expected from water that hot. Thus 152 is an overestimate. The observed mean "exaggerates" what you know on good evidence to be the correct mean (< 150.5). No one can disagree with that, although they measure the exaggeration by a ratio. 5 Is this "power analytic" reasoning? No, but no matter. Some remarks:

  First, the inferred estimate would not be 152 but rather the lower confidence bound, say µ > 152 − 1.96(1) = 150.04, i.e., µ > 150 (for a 0.975 lower confidence bound). True, but suppose the lower bound at a reasonable confidence level is still at odds with what we assume is known. For example, a lower 0.93 bound is µ > 150.5. What then? Then we simply have a conflict between what these data indicate and assumed background knowledge.
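  The bounds cited here follow from x̄ = 152 and a standard error of 1; a quick sketch:

```python
from scipy.stats import norm

# Sketch: one-sided lower confidence bounds from x-bar = 152 with se = 1
# (the same assumed figures as above).
x_obs, se = 152.0, 1.0

def lower_bound(confidence, x_obs=x_obs, se=se):
    """Lower confidence bound: mu > x_obs - z_c * se."""
    return x_obs - norm.ppf(confidence) * se

print(lower_bound(0.975))   # -> 150.04, i.e. mu > 150
print(lower_bound(0.93))    # -> about 150.5
```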

  Second, do they really want to say that the statistically significant x̄ = 152 fails to warrant µ ≥ µ₁ for any µ₁ between 150 and 152, on grounds that the power in this range is low (going from 0.025 to 0.5)? If so, the result surely couldn't warrant values larger than 152. So it appears no µ values could be inferred from the result.

  A way to make sense of their view is to construe it as saying the observed mean is so out of line with what's known that we suspect the assumptions of the test are questionable or invalid. Suppose you have considerable grounds for this suspicion: signs of cherry picking, multiple testing, artificiality of experiments, publication bias, and so forth – as are rife in both examples given in Gelman and Carlin's paper. You have grounds to question the result because you question the reported error probabilities. Indeed, no µ values can be inferred if the error probabilities are spurious; the severity is automatically low.

  One reasons: if the assumptions are met and the error probabilities are approximately correct, then the statistically significant result would indicate µ > 150.5 at P-value 0.07, or severity level 0.93. But you happen to know that µ ≤ 150.5. Thus, that's grounds to question whether the assumptions are met. You suspect it would fail an audit. In that case, put the blame where it belongs. 6
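  Those two numbers again follow from x̄ = 152 with standard error 1; a quick check:

```python
from scipy.stats import norm

# Sketch of the numbers cited here (x-bar = 152, se = 1 assumed as before):
x_obs, se, mu_1 = 152.0, 1.0, 150.5
z = (x_obs - mu_1) / se                     # = 1.5
print(1 - norm.cdf(z))   # Pr(X-bar >= 152; mu = 150.5) ~ 0.07, the P-value
print(norm.cdf(z))       # SEV(mu > 150.5) ~ 0.93
```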

  Recall the Sebastiani et al. (2010) study purporting to show genetic signatures of longevity (Section 4.3). Researchers found the observed differences suspiciously large, and sure enough, once reanalyzed, the data were found to suffer from the confounding of batch effects. When results seem out of whack with what's known, it's grounds to suspect the assumptions. That's how I propose to view Gelman and Carlin's argument; whether they concur is for them to decide.

  5.6 Positive Predictive Value: Fine for Luggage

  Many alarming articles about questionable statistics rely on alarmingly questionable statistics. Travelers on this cruise are already very familiar with the computations, because they stem from one or another of the "P-values exaggerate evidence" arguments in Sections 4.4, 4.5, and 5.2. They are given yet another new twist, which I will call the diagnostic screening (DS) criticism of significance tests. To understand how the DS criticism of tests really took off, we should go back to a paper by John Ioannidis (2005):

  Several methodologists have pointed out that the high rate of nonreplication (lack of confirmation) of research discoveries is a consequence of the convenient, yet ill-founded strategy of claiming conclusive research findings solely on the basis of a single study assessed by formal statistical significance, typically for a p-value less than 0.05. Research is not most appropriately represented and summarized by p-values, but, unfortunately, there is a widespread notion that medical research articles should be interpreted based only on p-values. Research findings are defined here as any relationship reaching formal statistical significance, e.g., effective interventions, informative predictors, risk factors or associations.

  It can be proven that most claimed research findings are false.

  (p. 0696)

  First, do medical researchers claim to have "conclusive research findings" as soon as a single statistically significant result is spewed out of their statistical industrial complexes? Do they go straight to press? Ioannidis says that they do. Fisher's ghost is screaming. (He is not talking of merely identifying a possibly interesting result for further analysis.) However absurd such behavior sounds 80 years after Fisher exhorted us never to rely on "isolated results," let's suppose Ioannidis is right. But it gets worse. Even the single significant result is very often the result of the cherry picking and multiple testing we are all too familiar with:

  … suppose investigators manipulate their design, analysis, and reporting so as to make more relationships cross the p = 0.05 threshold … Such manipulation could be done, for example, with serendipitous inclusion or exclusion of certain patients or controls, post hoc subgroup analyses, investigation of genetic contrasts that were not originally specified … Commercially available 'data mining' packages actually are proud of their ability to yield statistically significant results through data dredging.

  (ibid., p. 0699)

  The DS criticism of tests shows that if

  1. you publish upon getting a single P-value < 0.05,

  2. you dichotomize tests into "up-down" outputs rather than report discrepancies and magnitudes of effect,

  3. you data dredge and cherry pick and/or

  4. there is a sufficiently low probability of genuine effects in your field (a notion of probability to be unpacked),

  then the probability of true nulls among those rejected as statistically significant – a value we call the false finding rate (FFR) 7 – differs from, and can be much greater than, the Type I error set by the test, as the sketch below illustrates.
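  A minimal sketch of that claim, under the screening model's own ingredients (an urn of nulls with a given prevalence of true "no effect" hypotheses, a common α, and a common power); the function and its name are mine:

```python
# Sketch: the false finding rate (FFR) under the diagnostic-screening model.
# Ingredients are that model's assumptions: alpha, power, and the prevalence
# of true "no effect" nulls among the hypotheses tested.
def ffr(alpha, power, prev_true_nulls):
    """Pr(null true | test rejects), by Bayes' Rule over the urn of nulls."""
    false_pos = alpha * prev_true_nulls
    true_pos = power * (1 - prev_true_nulls)
    return false_pos / (false_pos + true_pos)

print(ffr(0.05, 0.8, 0.5))   # ~0.06: close to alpha
print(ffr(0.05, 0.8, 0.9))   # ~0.36: far above alpha when true nulls dominate
```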

  However one chooses to measure "bad evidence, no test" (BENT) results, nobody is surprised that such bad behavior qualifies. For the severe tester, committing #3 alone is suspect, unless an adjustment to get proper error probabilities is achieved. Even if there's no cherry picking, and your test has a legitimate Type I error probability of 0.05, a critic will hold that the FFR can be much higher than 0.05, if you've randomly selected your null hypothesis from a group with a sufficiently high proportion of true "no effect" nulls. So is the criticism a matter of transposing probabilistic conditionals, only with the twist of trying to use "prevalence" for a prior? It is, but that doesn't suffice to dismiss the criticism. The critics argue that the quantity we should care about is the FFR, or its complement, the positive predictive value (PPV). Should we? Let's look at all this in detail.

  Diagnostic Screening

  In scrutinizing a statistical argument, particularly one that has so widely struck a nerve, we attempt the most generous interpretation. Still, if we are not to jumble up our freshly acquired clarity on statistical power, we need to use the proper terms for diagnostic screening, at least one model for it. 8

  We are all plagued by the TSA (Transportation Security Administration) screening in airports, although thankfully they have gotten rid of those whole body scanners in which all "your junk" is revealed to anonymous personnel. The latest test, we are told, would very rarely miss a dangerous item in your carry-on, and rarely trigger the alarm (+) for nothing. Yet most of the alarms are false alarms. That's because the dangerous items are relatively rare. On the other hand, sending positive (+) results for further scrutiny – usually in front of gloved representatives who have pulled you aside as they wave special wands and powder – ensures that, taken together, the false findings are quite rare. On the retest, they will usually discover you'd simply forgotten to remove that knife or box cutter from the last trip. Interestingly, the rarity of dangerous bags – that is, the low prevalence of D's (D for danger) – means we can be comforted in a negative result. So we'd often prefer not to lower the sensitivity, but control false positives relying on the follow-up retest given to any "+" result (Mayo and Morey 2017).

  Positive Predictive Value (PPV) = 1 − FFR.

  To get the (PPV) we are to apply Bayes’ Rule using the given relative frequencies (or prevalences):

  The sensitivity is the probability that a randomly selected item with D will be identified as “ positive” (+):

  SENS: Pr(+|D).

  The specificity is the probability a randomly selected item lacking D will be found negative (−):

  SPEC: Pr(−|~D).

  The prevalence is just the relative frequency of D in some population.

  We run the test on the item (be it a person, a piece of luggage, or a hypothesis) and report either + or −. Instead of populations of carry-on bags and luggage, imagine an urn of null hypotheses, 50% of which are true. Randomly selecting a hypothesis, we run a test and output + (statistically significant) or − (non-significant). So our urn represents the proverbial "prior" probability of 50% true nulls.
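  Since the criticism turns on randomly drawing hypotheses from such an urn, the picture can also be mimicked by simulation. A toy sketch, assuming the 50% figure together with the α = 0.05 and power 0.8 used just below:

```python
import random

# Sketch: the "urn of nulls" picture as a simulation. Assumed ingredients:
# 50% of the nulls in the urn are true, alpha = 0.05, power = 0.8.
random.seed(1)
alpha, power, prev_true = 0.05, 0.8, 0.5
n_draws = 200_000

rejected, rejected_true_nulls = 0, 0
for _ in range(n_draws):
    null_is_true = random.random() < prev_true
    p_reject = alpha if null_is_true else power
    if random.random() < p_reject:
        rejected += 1
        rejected_true_nulls += null_is_true

print(rejected_true_nulls / rejected)       # ~0.06: the FFR for this urn
print(1 - rejected_true_nulls / rejected)   # ~0.94: the PPV
```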

  The criticism turns on the PPV being too low. Even with Pr(D) = 0.5, with Pr(+|~D) = 0.05 and Pr(+|D) = 0.8, we still get a rather high PPV:

  PPV = Pr(D|+) = Pr(+|D)Pr(D)/[Pr(+|D)Pr(D) + Pr(+|~D)Pr(~D)] = (0.8)(0.5)/[(0.8)(0.5) + (0.05)(0.5)] ≈ 0.94.

  With Pr(D) = 0.5, all we need for a PPV greater than 0.5 is for Pr(+|~D) to be less than Pr(+|D). It suffices that the probability of ringing the alarm when we shouldn't is less than the probability of ringing it when we should. With a prevalence Pr(D) very small, e.g., the 0.1% in the last row of Table 5.1, the PPV plummets even before bias enters the picture.

  Ioannidis rightly points out that many researchers are guilty of cherry picking and selection effects under his "bias" umbrella. The actual Pr(+|~D), with bias, is now the probability the "+" was generated by chance plus the probability it was generated by "bias." ~D plays the role of H₀. Even the lowest presumed bias, 0.10, changes a 0.05 into 0.14.

  Actual Pr(+|~D) = "alleged" Pr(+|~D) + Pr(−|~D)(0.10) = 0.05 + (0.95)(0.10) ≈ 0.14.

  The PPV has now gone down to 0.85. Or consider if you're lucky enough to get a TSA official with 30% bias. Your "alleged" Pr(+|~D) is again 0.05, but with 30% bias, the actual Pr(+|~D) = 0.05 + (0.95)(0.3) ≈ 0.33. Table 5.1 lists some of the top (better) and bottom (worse) entries from Ioannidis' Table, keeping the notation of diagnostic tests. Some of the entries, especially for exploratory research with lots of data dredging, have very low PPVs.
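  These with- and without-bias numbers can be packaged into one small function; a sketch applying only the simple bias adjustment above to Pr(+|~D) (the function and its name are mine, not Ioannidis's):

```python
# Sketch: PPV with the simple bias adjustment described in the text,
# where bias inflates Pr(+|~D) from alpha to alpha + (1 - alpha) * bias.
def ppv(power, prev, alpha=0.05, bias=0.0):
    false_pos = alpha + (1 - alpha) * bias      # actual Pr(+|~D)
    num = power * prev
    return num / (num + false_pos * (1 - prev))

print(ppv(0.8, 0.5))              # ~0.94: no bias
print(ppv(0.8, 0.5, bias=0.10))   # ~0.85: first row of Table 5.1
print(ppv(0.95, 0.67, bias=0.30)) # ~0.85: second row of Table 5.1
```

  (The lowest entries in the table depend on exactly how bias enters Ioannidis's own formula, so this sketch should not be read as reproducing the whole table.)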

  Table 5.1 Selected entries from Ioannidis (2005)

  Pr(+|D) | PREV of D | Bias | Practical example | PPV

  0.8 | 50% | 0.10 | Adequately powered RCT, little bias | 0.85

  0.95 | 67% | 0.30 | Confirmatory meta-analysis of good-quality RCTs | 0.85

  0.8 | 9% | 0.30 | Adequately powered exploratory epidemiological study | 0.20

  0.2 | 0.1% | 0.80 | Discovery-oriented exploratory research with massive testing | 0.001

  Where do his bias adjustments come from? These are just guesses he puts forward. It would be interesting to see if they correlate with some of the better-known error adjustments, as with multiple testing. If so, maybe Ioannidis' bias assignments can be seen as giving another way to adjust error probabilities. The trouble is, the dredging can be so tortured in many cases that we'd be inclined to dismiss the study rather than give it a PPV number. (Perhaps confidence intervals around the PPV estimate should be given?)

  Ioannidis will also adjust the prevalence according to the group that your research falls into, leading Goodman and Greenland (2007) to charge him with punishing the epidemiologist twice: by bias and low prevalence! I'm sympathetic with those who protest that rather than assume guilt (or innocence) by association (with a given field), it's better to see what crime was actually committed or avoided by the study at hand. Even bias violations are open to appeal, and may have been gotten around by other means. (No mention is made of failed statistical assumptions, which can quickly turn the reported error probabilities to mush, and preclude the substantive inference that is the actual output of research. Perhaps this could be added.)

  Others who mount the DS criticism allege that the problem holds even accepting the small α level and no bias. 9 Their gambit is to sufficiently lower the prevalence of D – which now stands for the probability of a "true effect" – so that the PPV is low (e.g., Colquhoun 2014). Colquhoun's example retains Pr(+|~D) = 0.05 and Pr(+|D) = 0.8, but shrinks the prevalence Pr(D) of true effects down to 10%. That is, 90% of the nulls in your research universe are true. This yields a PPV of 64%. The Pr(~D|+) is 0.36, much greater than Pr(+|~D) = 0.05.
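  Colquhoun's arithmetic, as a quick check (a sketch using only the numbers quoted above):

```python
# Sketch of Colquhoun's numbers: Pr(+|D) = 0.8, Pr(+|~D) = 0.05, Pr(D) = 0.1.
prev, power, alpha = 0.1, 0.8, 0.05
ppv = power * prev / (power * prev + alpha * (1 - prev))
print(ppv)       # 0.64
print(1 - ppv)   # 0.36 = Pr(~D|+), much larger than Pr(+|~D) = 0.05
```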

  So the DS criticism appears to go through with these computations. What about exporting the terms from significance tests into FFR or PPV assessments? We haven't said anything about treating ~D as H₀ in the DS criticism.

  [α/(1 − β)] Again

  Although we are keen to get away from coarse dichotomies, in the DS model of tests we are to consider just two possibilities: "no effect" and "real effect." The null hypothesis is treated as H₀: 0 effect (μ = 0), while the alternative H₁ is the discrepancy against which the test has power (1 − β). It is assumed the probability of finding any effect, regardless of size, is the same (Ioannidis 2005, p. 0696). Then [α/(1 − β)] is used as the likelihood ratio to compute the posterior of either H₀ or H₁ – a problematic move, as we know.

  An example of one of their better tests might have H₁: μ = μ.9, where μ.9 is the alternative against which the test has 0.9 power. But now the denial of the alternative H₁ does not yield the same null hypothesis used to obtain the Type I error probability of 0.05. Instead the probability of rejecting under that denial would be high, nearly as high as 0.9. Likewise, if the null is chosen to have low α, then its denial won't be one against which the test has high power (it will be close to α). Thus, the identifications of "effect" and "no effect" with the hypotheses used to compute the Type I error probability and the power are inconsistent with one another. The most plausible way to construe the DS argument is to assume the critics have in mind a test between a point null H₀, or a small interval around it, and a non-exhaustive alternative hypothesis against which there is a specified power such as 0.9. It is known that there are intermediate values of μ, but the inference will just compare the two.
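  To see the mismatch concretely, return to the water-plant test T+; a sketch under the same assumed standard error of 1 and cutoff 151.96 as before:

```python
from scipy.stats import norm

# Sketch of the mismatch: for test T+ with se = 1 and cutoff 151.96
# (corresponding to a one-sided alpha of 0.025), find mu_.9, the
# alternative against which the test has 0.9 power.
se, cutoff = 1.0, 151.96
mu_09 = cutoff + norm.ppf(0.9) * se          # ~153.24

def power(mu):
    return 1 - norm.cdf((cutoff - mu) / se)

print(power(mu_09))    # 0.9 by construction
print(power(153.2))    # ~0.89: a point falling under the *denial* of
                       # H1: mu = mu_.9 is rejected almost as often,
                       # nowhere near the alpha of 0.025
```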

  The DS critics will give a high PPV to alternatives with high power, which is often taken to be 0.8 or 0.9. We know the computation from Goodman (Section 5.2) that "the truth probability of the null hypothesis drops to 3 percent (= 0.03/(1 + 0.03))." The PPV for μ.9 is 0.97. We haven't escaped Senn's points about the nonsensical and the ludicrous, or making mountains out of molehills. To infer μ.9 based on α = 0.025 (one-sided) is to be wrong 90% of the time: we'd expect a more significant result 90% of the time were μ.9 correct. I don't want to repeat what we've seen many times. Even using Goodman's "precise P-value" yields a high posterior. A DS critic could say: you compute error probabilities but we compute PPV, and our measure is better. So let's take a look at what the computation might mean.
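  Goodman's quoted computation, as a quick check (a sketch; equal priors on the two hypotheses are assumed, as in that DS-style calculation):

```python
# Sketch of the quoted computation: treat alpha/(1 - beta) as a likelihood
# ratio with equal priors on the two hypotheses (the DS-style assumption).
alpha, power = 0.025, 0.9
lr = alpha / power                    # ~0.03
posterior_null = lr / (1 + lr)        # ~0.03: "truth probability of the null"
print(lr, posterior_null, 1 - posterior_null)   # PPV for mu_.9 ~ 0.97
```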

  In the typical illustrations it's the prevalence that causes the low PPV. But what is it? Colquhoun (2014) identifies Pr(D) with "the proportion of experiments we do over a lifetime in which there is a real effect" (p. 9). Ioannidis (2005) identifies it with "the number of 'true relationships' … among those tested in the field" (p. 0696). What's the relevant reference class for the prevalence Pr(D)? We scarcely have a list of all hypotheses to be tested in a field, much less do we know the proportion that are "true." With continuous parameters, it could be claimed there are infinitely many hypotheses; individuating true ones could be done in multiple ways. Even limiting considerations to discrete claims (effect/no effect) will quickly land us in quicksand. Classifying by study type makes sense, but any umbrella will house studies from different fields with different proportions of true claims.

  One might aver that the PPV calculation is merely a heuristic to show the difference between α and the FFR, or between (1 − α) and the PPV. It should always be kept in mind that even when a critic has performed a simulation, it is a simulation that assumes these ingredients. If aspects of the calculation fail, then of what value is the heuristic? Furthermore, it is clear that the PPV calculation is intended to assess the results of actual tests. Even if we agreed on a reference class – say, the proportion of true effects over your lifetime of testing is θ – this probability θ wouldn't be the probability that a particular selected effect is "true." It would not be a frequentist prior probability for the randomly selected hypothesis. We now turn to this.

 
