The contrast, Senn observes, is that of Cox’s distinction between “precise and dividing hypotheses” (Section 3.3). “[F]rom a common belief in the drug’s efficacy they have moved in opposite directions” (ibid., pp. 200–201). Senn riffs on Jeffreys’ well-known joke that we heard in Section 3.4:
It would require that a procedure is dismissed [by significance testers] because, when combined with information which it doesn’t require and which may not exist, it disagrees with a [Bayesian] procedure that disagrees with itself.
(ibid., p. 195)
In other words, if Bayesians disagree with each other even when they’re measuring the same thing – posterior probabilities – why be surprised that disagreement is found between posteriors and P-values? The most common argument behind the “P-values exaggerate evidence” claim appears not to hold water. Yet it won’t be zapped quite so easily, and will reappear in different forms.
Exhibit (vii): Contrasting Bayes Factors and Jeffreys–Lindley Paradox.
We’ve uncovered some interesting bones in our dig. Some lead to seductive arguments purporting to absolve the latitude in assigning priors in Bayesian tests. Take Wagenmakers and Grünwald (2006, p. 642): “Bayesian hypothesis tests are often criticized because of their dependence on prior distributions … [yet] no matter what prior is used, the Bayesian test provides substantially less evidence against H0 than” P-values, in the examples we’ve considered. Be careful in translating this. We’ve seen that what counts as “less” evidence runs from seriously underestimating to overestimating the discrepancy we are entitled to infer with severity. Begin with three types of priors appealed to in some prominent criticisms revolving around the Fisher–Jeffreys disagreement.
1. Jeffreys-type prior with the “spike and slab” in a two-sided test. Here, with large enough n, a statistically significant result becomes evidence for the null; the posterior on H0 exceeds the lump prior.
2. Likelihood ratio most generous to the alternative. Here, there’s a spike to a point null, H0: θ = θ0, to be compared with the point alternative that’s maximally likely, θmax. Often, both H0 and Hmax are given 0.5 priors.
3. Matching. Instead of a spike prior on the null, it uses a smooth diffuse prior, as in the “dividing” case. Here, the P-value “is an approximation to the posterior probability that θ < 0” (Pratt 1965, p. 182).
In sync with our attention to high-energy particle physics (HEP) in Section 3.6, consider an example that Aris Spanos (2013b) explores in relation to the Jeffreys–Lindley paradox. The example is briefly noted in Stone (1997).
A large number (n = 527,135) of independent collisions that can be of either type A or type B are used to test whether the proportion of type A collisions is exactly 0.2, as opposed to any other value. It’s modeled as n Bernoulli trials testing H0: θ = 0.2 vs. H1: θ ≠ 0.2. The observed proportion of type A collisions, k/n = 106,298/527,135 ≈ 0.2017, is scarcely greater than the point null of 0.2.
The significance level against H0 is small (so there’s evidence against H0)
The test statistic is d(X) = √n(θ̂ − 0.2)/√(0.2(1 − 0.2)), where θ̂ is the sample proportion of type A collisions; under the null, d(X) is distributed approximately as standard Normal, N(0, 1). Here d(x0) ≈ 3.0, and the significance level associated with d(x0) in this two-sided test is

Pr(|d(X)| > |d(x0)|; H0) = 0.0027.
So the result is highly significant, even though it’s scarcely different from the point null.
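To make the arithmetic concrete, here is a minimal sketch (in Python; the code is mine, not Spanos’) reproducing the test statistic and two-sided P-value from the reported counts:

```python
from math import sqrt
from scipy.stats import norm

n, k = 527_135, 106_298   # total collisions and observed type A count
theta0 = 0.2              # the point null value
theta_hat = k / n         # observed proportion, about 0.2017

# Standardized test statistic; approximately N(0, 1) under the null
d = sqrt(n) * (theta_hat - theta0) / sqrt(theta0 * (1 - theta0))

# Two-sided significance level: Pr(|d(X)| > |d(x0)|; H0)
p_value = 2 * norm.sf(abs(d))
print(d, p_value)         # d ≈ 3.0, P ≈ 0.0027
```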
The Bayes factor in favor of H0 is high
H0 is given the spiked prior of 0.5, and the remaining 0.5 is spread equally among the values in H1. I follow Spanos’ computations:⁶

B01 = Lik(θ0)/∫₀¹ Lik(θ)dθ = [(0.2)^k(0.8)^(n−k)]/[∫₀¹ θ^k(1 − θ)^(n−k)dθ] ≈ 8,

where n = 527,135 and k = 106,298 (the binomial coefficient cancels in the ratio).
While the likelihood of H0 in the numerator is tiny, the (average) likelihood under H1 is even tinier. Since B01 in favor of H0 is 8, which is greater than 1, the posterior for H0 goes up, even though the outcome is statistically significantly greater than the null.
There’s no surprise once you consider the Bayesian question here: compare the likelihood of a result scarcely different from 0.2 being produced by a universe where θ = 0.2 – where this has been given a spiked prior of 0.5 under H0 – with the likelihood of that result being produced by any θ in a small band of θ values, which have been given a very low prior under H1. Clearly, θ = 0.2 is more likely, and we have an example of the Jeffreys–Fisher disagreement.
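A sketch of the Bayes factor computation (again my own Python, working on the log scale to avoid numerical underflow with such large n):

```python
from math import lgamma, log, exp

n, k = 527_135, 106_298
theta0 = 0.2

# Log likelihood at the point null; the binomial coefficient is omitted
# because it cancels in the ratio
log_num = k * log(theta0) + (n - k) * log(1 - theta0)

# Log of the likelihood averaged over H1 with a uniform prior on theta:
# the integral of theta^k (1 - theta)^(n-k) dtheta is Beta(k+1, n-k+1)
log_den = lgamma(k + 1) + lgamma(n - k + 1) - lgamma(n + 2)

B01 = exp(log_num - log_den)
print(B01)   # ≈ 8: the point null is favored despite the significant result
```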
Who should be afraid of this disagreement (to echo the title of Spanos’ paper)? Many tribes, including some Bayesians, think it only goes to cast doubt on this particular Bayes factor. Compare it with proposal 2 in Exhibit (vii), the Likelihood ratio most generous to the alternative: Lik(0.2)/Lik(θmax). We know the maximally likely value for θ: it is the observed proportion, θmax = k/n ≈ 0.2017.
Now B01 is 0.01 and B10, Lik(θmax)/Lik(0.2), is 89.
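A sketch of that comparison (my Python, with the same log-scale trick):

```python
from math import log, exp

n, k = 527_135, 106_298
theta0 = 0.2
theta_max = k / n        # the MLE: the observed proportion, ≈ 0.2017

def loglik(theta):
    # Binomial log likelihood without the binomial coefficient,
    # which cancels in any likelihood ratio
    return k * log(theta) + (n - k) * log(1 - theta)

B10 = exp(loglik(theta_max) - loglik(theta0))
print(B10, 1 / B10)      # ≈ 89 and ≈ 0.011: theta_max is ~89 times as likely
```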
Why should a result 89 times more likely under alternative θmax than under θ = 0.2 be taken as strong evidence for θ = 0.2? It shouldn’t, according to some, including Lindley’s own student, default Bayesian José Bernardo (2010). Presumably, the Likelihoodist concurs. There are family feuds within and between the diverse tribes of probabilisms.⁷
Greenland and Poole
Given how often spiked priors arise in foundational arguments, it’s worth noting that even Bayesians Edwards, Lindman, and Savage (1963, p. 235), despite raising the “P-values exaggerate” argument, aver that for Bayesian statisticians, “no procedure for testing a sharp null hypothesis is likely to be appropriate unless the null hypothesis deserves special initial credence.” Epidemiologists Sander Greenland and Charles Poole, who claim not to identify with any one statistical tribe, but who often lead critics of significance tests, say:
Our stand against spikes directly contradicts a good portion of the Bayesian literature, where null spikes are used too freely to represent the belief that a parameter ‘differs negligibly’ from the null. In many settings … even a tightly concentrated probability near the null has no basis in genuine evidence. Many scientists and statisticians exhibit quite a bit of irrational prejudice in favor of the null … .
(2013, p. 77)
They angle to reconcile P-values and posteriors, and to this end they invoke the matching result in #3 of Exhibit (vii). An uninformative prior, assigning equal probability to all values of the parameter, allows the P-value to approximate the posterior probability that θ < 0 in one-sided testing (θ ≤ 0 vs. θ > 0). In two-sided testing, the posterior probability that θ is on the opposite side of 0 from the observed difference is P/2. They proffer this as a way “to live with” P-values. Commenting on them, Andrew Gelman (2013, p. 72) raises this objection:
[C]onsider what would happen if we routinely interpreted one-sided P-values as posterior probabilities. In that case, an experimental result that is 1 standard error from zero – that is, exactly what one might expect from chance alone – would imply an 83% posterior probability that the true effect in the population has the same direction as the observed pattern in the data at hand. It does not make sense to me to claim 83% certainty – 5 to 1 odds [to H1] …
(The P-value is 0.16.) Rather than relying on non-informative priors, Gelman prefers to use prior information that leans toward the null; this avoids giving H1 as high a posterior as the matching result yields.
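The matching arithmetic behind Gelman’s example is easy to reproduce (a Python sketch of my own, assuming the standard Normal setup with known standard error):

```python
from scipy.stats import norm

z = 1.0                     # observed effect: 1 standard error from zero
p_one_sided = norm.sf(z)    # one-sided P-value ≈ 0.16

# With a flat prior on mu, the posterior for mu given z is N(z, 1), so the
# posterior probability that mu shares the sign of the observed effect is
# exactly 1 minus the one-sided P-value
post_same_direction = 1 - p_one_sided
print(p_one_sided, post_same_direction)   # ≈ 0.16 and ≈ 0.84
```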
Greenland and Poole respond that Gelman is overlooking the hazard of “strong priors that are not well founded. … Whatever our prior opinion and its foundation, we still need reference analyses with weakly informative priors to alert us to how much our prior probabilities are driving our posterior probabilities” (2013, p. 76). They rightly point out that, in some circles, giving weight to the null can be the outgrowth of some ill-grounded metaphysics about “simplicity.” Or it may be seen as an assumption akin to a presumption of innocence in law. So the question turns on the appropriate prior on the null.
Look what has happened! The problem was simply to express “I’m not impressed” with a result reaching a P-value of 0.16: differences even larger than 1 standard error are not so very infrequent – they occur 16% of the time – even if there’s zero effect. So I’m not convinced of the reality of the effect, based on this result. P-values did their job, reflecting as they do the severity requirement. H1 has passed a lousy test. That’s that. No prior probability assignment to H0 is needed. Problem solved.
But there’s a predilection for changing the problem (if you’re a probabilist). Greenland and Poole feel they’re helping us to live with P-values without misinterpretation. By choosing the prior so that the P-value matches the posterior on H0, they supply us “with correct interpretations” (ibid., p. 77), where “correct interpretations” are ones on which interpreting a P-value as a posterior on the null is no longer a misinterpretation. To a severe tester, this completely changes the problem: from assessing how well tested the reality of the effect is, with the given data, to what odds I would give in betting, or the like. We land in the same uncharted waters as with other attempts to fix P-values, when we could have stayed on the cruise ship, interpreting P-values as intended.
Souvenir Q: Have We Drifted From Testing Country? (Notes From an Intermission)
Before continuing, let’s pull back for a moment, and take a coffee break at a place called Spike and Smear. Souvenir Q records our notes. We’ve been exploring the research program that appears to show, quite convincingly, that significance levels exaggerate the evidence against a null hypothesis, based on evidential assessments endorsed by various Bayesian and Likelihoodist accounts. We suspended the impulse to deny that it can make sense to use a rival inference school to critique significance tests. We sought to explore whether there’s something to the cases they bring as ammunition to this conflict. The Bayesians say the disagreement between their numbers and P-values is relevant for impugning P-values, so we try to go along with them.
Reflect just on the first argument, pertaining to the case of two-sided Normal testing H0: μ = 0 vs. H1: μ ≠ 0, which was the most impressive, particularly with n ≥ 50. It showed that a statistically significant difference from a test hypothesis at familiar levels, 0.05 or 0.025, can correspond to a result that a Bayesian takes as evidence for H0. The prior for this case is the spike and smear, where the smear will be of the sort leading to J. Berger and Sellke’s results, or similar. The test procedure is to move from a statistically significant result at the 0.025 level, say, and infer the posterior for H0.
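To see the phenomenon numerically, here is a sketch (my Python, assuming one common choice of smear, a standard Normal prior on μ under H1; Berger and Sellke consider several):

```python
from math import sqrt, exp, pi

def posterior_h0(z: float, n: int, spike: float = 0.5) -> float:
    """Posterior of a point null given a just-significant z, with prior mass
    'spike' on H0 and the rest smeared as mu ~ N(0, sigma^2) under H1."""
    # Marginal density of z: N(0, 1) under H0, N(0, 1 + n) under H1
    lik0 = exp(-z**2 / 2) / sqrt(2 * pi)
    lik1 = exp(-z**2 / (2 * (1 + n))) / sqrt(2 * pi * (1 + n))
    b01 = lik0 / lik1
    odds = b01 * spike / (1 - spike)
    return odds / (1 + odds)

print(posterior_h0(z=1.96, n=100))   # ≈ 0.60: the 0.5 spike grows, despite
                                     # significance at the 0.05 level
```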
Now our minimal requirement for data x to provide evidence for a claim H is that
(S-1) H accords with (agrees with) x, and
(S-2) there’s a reasonable, preferably a high, probability that the procedure would have produced disagreement with H, if in fact H were false.
So let’s apply these severity requirements to the data taken as evidence for H0 here.
Consider (S-1). Is a result that is 1.96 or 2 standard errors away from 0 in good accord with 0? Well, 0 is excluded from the corresponding 95% confidence interval. That does not seem to be in accord with 0 at all. Still, they have provided measures whereby x does accord with H0: the likelihood ratio or posterior probability on H0. So, in keeping with the most useful and most generous way to use severity, let’s grant that (S-1) holds.
What about (S-2)? Has anything been done to probe the falsity of H0? Let’s allow that H0 is not a precise point, but some very small set of values around 0. This is their example, and we’re trying to give it as much credibility as possible. Did the falsity of H0 have a good chance of showing itself? The falsity of H0 here is H1: μ ≠ 0. What’s troubling is that we found the probability of failing to pick up on population discrepancies as large as 1 standard error in excess of 0 is rather high (0.84) with n = 100. Larger sample sizes yield even less capability. Nor are they merely announcing “no discrepancy from 0” in this case. They’re finding evidence for 0!
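That 0.84 is quick to check (a Python sketch of my own, assuming the 2-standard-error rejection cutoff used in this discussion):

```python
from scipy.stats import norm

z_crit = 2.0       # rejection cutoff in standard error units
discrepancy = 1.0  # true mean lies 1 standard error above the null

# If mu is 1 SE above 0, the test statistic is distributed N(1, 1), so the
# probability of failing to reject is Phi(z_crit - discrepancy)
miss_prob = norm.cdf(z_crit - discrepancy)
print(miss_prob)   # ≈ 0.84: the test usually misses a 1 SE discrepancy
```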
So how did the Bayesian get the bump in posterior probability on the null? It was based on a spiked prior of 0.5 to H0. All the other points get minuscule priors, having to share the remaining 0.5 probability. What was the warrant for the 0.5 prior to H0? J. Berger and Sellke are quite upfront about it: if they allowed the prior spike to be low, then a rejection of the null would merely be showing an improbable hypothesis got more improbable. “[W]ho, after all, would be convinced,” recall their asking, if “my conclusion is that H0 has posterior probability 0.05 and should be rejected” since it previously had probability, say, 0.1 (1987, p. 115). A slight lowering of probability won’t cut it. Moving from a low prior to a slightly higher one also lacks punch.
This explains their high prior (at least 0.5) on H0, but is it evidence for it? Clearly not, nor does it purport to be. We needn’t deny there are cases where a theoretical parameter value has passed severely (we saw this in the case of GTR in Excursion 3). But that’s not what’s happening here. Here they intend for the 0.5 prior to show, in general, that statistically significant results problematically exaggerate evidence.⁸
A tester would be worried when the rationale for a spike is to avoid looking foolish when rejecting with a small drop; she’d be worried too by a report: “I don’t take observing a mean temperature of 152 in your 100 water samples as indicating it’s hotter than 150, because I give a whopping spike to our coolants being in compliance.” That is why Casella and R. Berger describe J. Berger and Sellke’s spike and smear as maximally biased toward the null (1987a, p. 111). Don’t forget the powerful role played by the choice of how to smear the 0.5 over the alternative! Bayesians might reassure us that the high Bayes factor for a point null doesn’t depend on the priors given to H0 and H1, when what they mean is that it depends only on the priors given to discrepancies under H1. It was the diffuse prior on the effect size that gave rise to the Jeffreys–Lindley paradox. It affords huge latitude in what gets supported.
We thought we were traveling in testing territory; now it seems we’ve drifted off to a different place. It shouldn’t be easy to take data as evidence for a claim when that claim is false; but here it is easy (the claim here being H0). How can this be one of a handful of main ways to criticize significance tests as exaggerating evidence? Bring in a navigator from a Popperian testing tribe before we all feel ourselves at sea:
Mere supporting instances are as a rule too cheap to be worth having … any support capable of carrying weight can only rest upon ingenious tests, undertaken with the aim of refuting our hypothesis, if it can be refuted.
(Popper 1983, p. 130)
The high spike and smear tactic can’t be taken as a basis from which to launch a critique of significance tests, because it fails rather glaringly a minimum requirement for evidence, let alone a test. We met Bayesians who don’t approve of these tests either, and I’ve heard it said that Bayesian testing is still a work in progress (Bernardo). Yet a related strategy is at the heart of some recommended statistical reforms.
4.5 Who’s Exaggerating? How to Evaluate Reforms Based on Bayes Factor Standards
Edwards, Lindman, and Savage (E, L, & S) – who were perhaps the first to raise this criticism – say this:
Imagine all the density under the alternative hypothesis concentrated at x̄, the place most favored by the data. …
Even the utmost generosity to the alternative hypothesis cannot make the evidence in favor of it as strong as classical significance levels might suggest.
(1963, p. 228)
The example is the Normal testing case of J. Berger and Sellke, but they compare it to a one-tailed test of H0: μ = 0 vs. H1: μ = μ1 = μmax (entirely sensibly in my view). We abbreviate H1 by Hmax. Here the likelihood ratio Lik(μmax)/Lik(μ0) = exp[z²/2]; the inverse is Lik(μ0)/Lik(μmax) = exp[−z²/2]. I think the former makes their case stronger, yet you will usually see the latter. (I record their values in Note 9.) What is μmax? It’s the observed mean x̄, the place most “favored by the data.” In each case we consider x̄ as the result that is just statistically significant at the indicated P-value, or its standardized z form.
With a P-value of 0.025, Hmax is “only” 6.84 times as likely as the null. I put quotes around “only” not because I think 6.84 is big; I’m never clear what’s to count as big until I have information about the associated error probabilities. If you seek to ensure Hmax: μ = μmax is 28 times as likely as H0: μ = μ0, you need to use a P-value of ~0.005, with z value of 2.58, call it 2.6. Compare the corresponding error probabilities. Were there 0 discrepancy from the null, a difference smaller than 1.96 would occur 97.5% of the time; one smaller than 2.6 would occur 99.5% of the time. In both cases, the two-sided 95% and one-sided 97.5% confidence intervals entirely exclude 0. The two one-sided lower intervals are μ > 0 and μ > ~0.64. Both outcomes are good indications of μ > 0: the difference between the likelihood ratios 6.8 and 28 doesn’t register as very much when it comes to indicating a positive discrepancy. Surely E, L, & S couldn’t expect Bayes factors to match error probabilities when they are the ones who showed how optional stopping can alter the latter and not the former (Section 1.5).
Table 4.2 Upper Bounds on the Comparative Likelihood

P-value (one-sided)   z     Lik(μmax)/Lik(μ0)
0.05                  1.65  3.87
0.025                 1.96  6.84
0.01                  2.33  15
0.005                 2.58  28
0.0005                3.29  227
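The entries in Table 4.2 all follow from the exp[z²/2] formula (a Python sketch of my own):

```python
from math import exp
from scipy.stats import norm

# Upper bound on Lik(mu_max)/Lik(mu_0) at a just-significant result:
# with mu_max set to the observed mean, the ratio is exp(z^2 / 2)
for p in (0.05, 0.025, 0.01, 0.005, 0.0005):
    z = norm.isf(p)   # one-sided z cutoff for P-value p
    print(f"{p:<7} {z:5.2f} {exp(z**2 / 2):7.1f}")
# Close to Table 4.2's 3.87, 6.84, 15, 28, 227; small differences
# come from how z is rounded
```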
Valen Johnson (2013a, b) offers a way to bring the likelihood ratio more into line with what counts as strong evidence according to a Bayes factor. He begins with a review of “Bayesian hypothesis tests.” “The posterior odds between two hypotheses H1 and H0 can be expressed as”

P(H1 | x)/P(H0 | x) = BF10(x) × [P(H1)/P(H0)],

where BF10(x) is the Bayes factor in favor of H1.
Like classical statistical hypothesis tests, the tangible consequence of a Bayesian hypothesis test is often the rejection of one hypothesis, say H0, in favor of the second, say H1. In a Bayesian test, the null hypothesis is rejected if the posterior probability of H1 exceeds a certain threshold.
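A minimal sketch of that decision structure (my Python; the 0.95 threshold and the input numbers are illustrative assumptions, not Johnson’s):

```python
def reject_null(bf10: float, prior_odds: float = 1.0,
                threshold: float = 0.95) -> bool:
    """Reject H0 when the posterior probability of H1 exceeds the threshold."""
    posterior_odds = bf10 * prior_odds   # posterior odds = BF x prior odds
    post_h1 = posterior_odds / (1 + posterior_odds)
    return post_h1 > threshold

print(reject_null(bf10=6.84))   # False: posterior for H1 ≈ 0.87 at even prior odds
print(reject_null(bf10=28.0))   # True:  posterior for H1 ≈ 0.97
```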