Why Does the LL Reject Composite Hypotheses?
Royall tells us that his account is unable to handle composite hypotheses, even this one (for which there is a uniformly most powerful [UMP] test over all points in H0). He does not conclude that his account comes up short. He and other Likelihoodists maintain that any genuine test or “rule of rejection” should be restricted to comparing the likelihood of H versus some point alternative H′ relative to fixed data x (Royall 1997, pp. 19–20). To them, this restriction is a virtue. No wonder the Likelihoodist disagrees with the significance tester. In their view, a simple significance test is not a “real” testing account because it is not a comparative appraisal. Elliott Sober, a well-known philosopher of science, echoes Royall: “The fact that significance tests don’t contrast the null with alternatives suffices to show that they do not provide a good rule for rejection” (Sober 2008, p. 56). Now, Royall’s significance test has an alternative H1: θ > 0.2! It’s just not a point alternative but is compound or composite (including all values greater than 0.2). The form of inference, admittedly, is not of the comparative (“evidence favoring”) variety. In this discussion, H0 and H1 replace his H1 and H2.
What untoward consequences occur if we consider composite hypotheses (according to the Likelihoodist)? The problem is that even though the likelihood of θ = 0.2 is small, there are values within alternative H1: θ > 0.2 that are even less likely on the data. For instance, consider θ = 0.9.
[B]ecause H0 contains some simple hypotheses that are better supported than some hypotheses in H1 (e.g., θ = 0.2 is better supported than θ = 0.9 by a likelihood ratio of LR = (0.2/0.9)^9 (0.8/0.1)^8 = 22.2), the law of likelihood does not allow the characterization of these observations as strong evidence for H1 over H0.
(Royall 1997, p. 20)
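As a quick check on that arithmetic, here is a minimal sketch in Python. The exponents in the quoted LR point to 9 “successes” and 8 “failures” (n = 17), which also gives the sample proportion of about 0.53 cited below; I take those numbers as given rather than as anything stated explicitly in the quote.

```python
# Likelihood ratio for theta = 0.2 versus theta = 0.9 in Royall's binomial example,
# assuming the data are 9 successes and 8 failures (n = 17), as the exponents indicate.
successes, failures = 9, 8

def likelihood(theta):
    # Binomial likelihood kernel for the fixed data (constant factors cancel in the ratio)
    return theta**successes * (1 - theta)**failures

print(round(likelihood(0.2) / likelihood(0.9), 1))  # ~22.2: theta = 0.2 beats theta = 0.9
```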
For Royall, rejecting H0: θ ≤ 0.2 and inferring H1: θ > 0.2 is to assert every parameter point within H1 is more likely than every point in H0. That seems an idiosyncratic meaning to attach to “infer evidence of θ > 0.2”; but it explains this particular battle. It still doesn’t explain the alleged problem for the significance tester who just takes it to mean what it says:
To reject H0: θ ≤ 0.2 is to infer some positive discrepancy from 0.2.
We readily agree with Royall that there’s a problem with taking a rejection of H0: θ ≤ 0.2, with x̄ = 0.53, as evidence of a discrepancy as large as θ = 0.9. It’s terrible evidence even that θ is as large as 0.7 or 0.8. Here’s how a tester articulates this terrible evidence.
Consider the test rule: infer evidence of a discrepancy from 0.2 as large as 0.9, based on observing x̄ = 0.53. The data differ from 0.2 in the direction of H1, but to take that difference as indicating an underlying θ > 0.9 would be wrong with probability ~1. Since the standard error of the mean, σ/√n, is approximately 0.1, the alternative 0.9 is more than 3 standard errors greater than 0.53. The inference gets low severity.
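To put rough numbers on this, here is a minimal sketch, again assuming 9 successes in n = 17 Bernoulli trials (so that x̄ ≈ 0.53 and the standard error is about 0.1):

```python
from scipy import stats

n, observed = 17, 9                    # 9 successes in 17 trials: sample proportion ~0.53
se = (0.2 * 0.8 / n) ** 0.5            # standard error under theta = 0.2: ~0.1
print(round(se, 3))

# If we inferred "theta > 0.9" whenever the sample proportion is at least 0.53, then even
# when theta = 0.9 exactly (so the claim is false) we would make that inference almost surely:
print(stats.binom.sf(observed - 1, n, 0.9))   # Pr(X >= 9; theta = 0.9) is essentially 1

# A severity-style check for the claim "theta > 0.9": the probability of a result according
# less well with the claim than 0.53 does, computed under theta = 0.9 -- about 10**-4.
print(stats.binom.cdf(observed, n, 0.9))      # Pr(X <= 9; theta = 0.9): very low severity
```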
We’ll be touring significance tests and confidence bounds in detail later. We’re trying now to extract some core contrasts between error statistical methods and logics of evidence such as the LL. According to the LL, so long as there is a point within H1 that is less likely given x than is H0, the data are “evidence in favor of the null hypothesis, not evidence against it” (Sober 2008, pp. 55–6). He should add “as compared to” some less likely alternative. We never infer a statistical hypothesis according to the LL, but rather a likelihood ratio of two hypotheses, neither of which might be likely. The significance tester and the comparativist hold very different images of statistical inference.
Can an account restricted to comparisons answer the questions: is x good evidence for H? Or is it a case of bad evidence, no test? Royall says no. He declares that all attempts to say whether x is good evidence for H, or even if x is better evidence for H than is y, are utterly futile. Similarly, “What does the [LL] say when one hypothesis attaches the same probability to two different observations? It says absolutely nothing … [it] applies when two different hypotheses attach probabilities to the same observation” (Royall 2004, p. 148). That cuts short important tasks of inferential scrutiny. Since model checking concerns the adequacy of a single model, the Likelihoodist either forgoes such checks or must go beyond the paradigm.
Still, if the model can be taken as adequate, and the Likelihoodist gives a sufficiently long list of comparisons, the differences between us don’t seem so marked. Take Royall:
One statement that we can make is that the observations are only weak evidence in favor of θ = 0.8 versus θ = 0.2 (LR = 4) … and at least moderately strong evidence for θ = 0.5 over any value θ > 0.8 (LR > 22).
(Royall 1997, p. 20)
Nonetheless, we’d want to ask: what do these numbers mean? Is 22 a lot? Is 4 small? We’re back to Hacking’s attempt to compare tank cars with widths of a grating. How do we calibrate them? Neyman and Pearson’s answer, we’ll see, is to look at the probability of so large a likelihood ratio, under various hypotheses, as in (*).
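To give a feel for that move, here is a minimal sketch (again assuming the binomial setup with n = 17 and 9 successes): it computes how often a likelihood ratio at least as large as the observed one, in favor of θ = 0.2 over θ = 0.9, would arise under each hypothesis.

```python
from scipy import stats

n, observed = 17, 9

def lik(theta, x):
    # Binomial likelihood kernel for x successes out of n trials
    return theta**x * (1 - theta)**(n - x)

lr_obs = lik(0.2, observed) / lik(0.9, observed)   # ~22.2, the LR in Royall's quote

# Probability of an LR at least this large in favor of theta = 0.2 over theta = 0.9,
# computed under different hypotheses about the true theta:
for true_theta in (0.2, 0.9):
    prob = sum(stats.binom.pmf(x, n, true_theta)
               for x in range(n + 1)
               if lik(0.2, x) / lik(0.9, x) >= lr_obs)
    print(true_theta, round(prob, 4))
# ~0.999 when theta = 0.2, ~0.0001 when theta = 0.9: the sort of error-probability
# calibration of "is 22 a lot?" gestured at in (*).
```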
LRs and Posteriors.
Royall is loath to add prior probabilities to the assessment of the import of the evidence. This, he says, allows the LR to be “a precise and objective numerical measure of the strength of evidence” in comparing hypotheses (2004, p. 123). At the same time, Royall argues, the LL “constitutes the essential core of the Bayesian account of evidence … the Bayesian who rejects the [LL] undermines his own position” (ibid., p. 146). The LR, after all, is the factor by which the ratio of posterior probabilities is changed by the data. Consider just two hypotheses, switching from the “;” in the significance test to conditional probability “|”:1
Pr(H0 | x) = Pr(x | H0)Pr(H0)/Pr(x).
Likewise:
Pr(H1 | x) = Pr(x | H1)Pr(H1)/Pr(x).
The denominators equal Pr(x), so they cancel in the LR:
Pr(H0 | x)/Pr(H1 | x) = [Pr(x | H0)/Pr(x | H1)] × [Pr(H0)/Pr(H1)].
All of this assumes the likelihoods and the model are deemed adequate.
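A quick numerical check of that cancellation, as a minimal sketch (the likelihoods and priors below are arbitrary illustrative numbers, not anything Royall supplies):

```python
# Posterior ratio = likelihood ratio x prior ratio; the marginal Pr(x) cancels.
lik_h0, lik_h1 = 0.30, 0.10        # Pr(x | H0), Pr(x | H1) -- illustrative values
prior_h0, prior_h1 = 0.5, 0.5      # Pr(H0), Pr(H1)         -- illustrative values

marginal = lik_h0 * prior_h0 + lik_h1 * prior_h1     # Pr(x), for two exhaustive hypotheses
post_h0 = lik_h0 * prior_h0 / marginal
post_h1 = lik_h1 * prior_h1 / marginal

print(post_h0 / post_h1)                             # 3.0
print((lik_h0 / lik_h1) * (prior_h0 / prior_h1))     # 3.0 -- the same number
```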
Data Dredging: Royall Bites the Bullet
Return now to our most serious problem: The Law of Likelihood permits finding evidence in favor of a hypothesis deliberately arrived at using the data, even in the extreme case that it is Gellerized. Allan Birnbaum, who had started out as a Likelihoodist, concludes, “the likelihood concept cannot be construed so as to allow useful appraisal, and thereby possible control, of probabilities of erroneous interpretations” (Birnbaum 1969, p. 128). But Royall has a clever response: he thinks control of error probabilities arises only in answering his second question, about action, not evidence. He is prepared to bite the bullet. He himself gives the example of a “trick deck.” You’ve shuffled a deck of ordinary-looking playing cards; you turn over the top card and find an ace of diamonds:
According to the law of likelihood, the hypothesis that the deck consists of 52 aces of diamonds (H1) is better supported than the hypothesis that the deck is normal (HN) [by the factor 52] … Some find this disturbing.
(Royall 1997, pp. 13–14)
Royall does not. He admits:
… it seems unfair; no matter what card is drawn, the law implies that the corresponding trick-deck hypothesis (52 cards just like the one drawn) is better supported than the normal-deck hypothesis. Thus even if the deck is normal we will always claim to have found strong evidence that it is not.
(ibid.)
What he is admitting then is, given any card:
Pr(LR favors trick deck hypothesis; normal deck) = 1.
Even though different trick deck hypotheses would be formed for different outcomes, we may compute the sampling distribution (*). The severity for “trick deck” would be 0. It need not be this extreme to have BENT results, but you get the idea.
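A minimal simulation sketch of that computation: whichever card turns up, the matching trick-deck hypothesis gives the draw probability 1 while the normal-deck hypothesis gives it 1/52, so the LR favors the (data-dependent) trick-deck hypothesis by 52 on every single draw from a perfectly normal deck.

```python
import random

ranks = [str(r) for r in range(2, 11)] + ["J", "Q", "K", "A"]
suits = ["clubs", "diamonds", "hearts", "spades"]
deck = [(r, s) for r in ranks for s in suits]        # an ordinary 52-card deck

trials, favors_trick = 10_000, 0
for _ in range(trials):
    card = random.choice(deck)                       # draw from a NORMAL deck
    lik_trick = 1.0                                  # "the deck is 52 copies of this card"
    lik_normal = 1 / 52
    if lik_trick / lik_normal > 1:                   # LR = 52, so this always holds
        favors_trick += 1

print(favors_trick / trials)   # 1.0: Pr(LR favors some trick-deck hypothesis; normal deck) = 1
```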
What’s Royall’s way out? At the level of a report on comparative likelihoods, Royall argues, there’s no need for a way out. To Royall, it only shows a confusion between evidence and belief.2 If you’re not convinced the deck has 52 aces of diamonds rather than being a normal deck, “it does not mean that the observation is not strong evidence in favor of H1 versus HN” where HN is a normal deck (ibid., p. 14). It just wasn’t strong enough to overcome your prior beliefs. If you regard the maximally likely alternative as unpalatable, you should have given it a suitably low prior degree of probability. The more likely hypothesis is still favored on grounds of evidence, but your posterior belief may be low. Don’t confuse evidence with belief! For the question of evidence, your beliefs have nothing to do with it, according to Royall’s Likelihoodist.
What if we grant the Likelihoodist this position? What do we do to tackle the essential challenge to the credibility of statistical inference today, when it’s all about Texas Marksmen, hunters, snoopers, and cherry pickers? These moves, which play havoc with a test’s ability to control erroneous interpretations, do not alter the evidence at all, say Likelihoodists. The fairest reading of Royall’s position might be this: the data indicate only the various LRs. If they are the same, it matters not whether hypotheses arose through data dredging – at least, so long as you stay in the category of “what the data say.” As soon as you’re troubled, you slip into the category of belief. What if we’re troubled by the ease of exaggerating findings when you’re allowed to rummage around? What if we wish to clobber the Texas sharpshooter method, never mind our beliefs in the particular claims inferred? You might aver that we should never be considering trick deck hypotheses, but this is the example Royall gives, and he is a, if not the, leading Likelihoodist.
To him, appealing to error probabilities is relevant only pre-data, which wouldn’t trouble the severe tester so much if Likelihoodists didn’t regard them as relevant only for a performance goal, not inference. Given that frequentists have silently assented to the performance use of error probabilities, it’s perhaps not surprising that others accept this. The problem with cherry picking is not about long runs; it’s that a poor job has been done in the case at hand. The severity requirement reflects this intuition. By contrast, Likelihoodists hold that likelihood ratios, and unadjusted P-values, still convey what the data say, even with claims arrived at through data dredging. It’s true you can explore, arrive at H, then test H on other data; but isn’t the reason there’s a need to test on new data that your assessment will otherwise fail to convey how well tested H is?
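The severe tester’s complaint can be made concrete with a toy simulation, a minimal sketch of my own (none of the numbers come from Royall): if a researcher is free to rummage through, say, 20 independent and truly null effects and report whichever looks best, the chance of turning up at least one nominally significant result at the 0.05 level is about 1 − 0.95^20 ≈ 0.64.

```python
import random
from scipy import stats

random.seed(1)
looks, trials, hits = 20, 5_000, 0
for _ in range(trials):
    # 20 independent "effects", all truly null: each z-score is standard Normal
    pvals = [2 * stats.norm.sf(abs(random.gauss(0, 1))) for _ in range(looks)]
    if min(pvals) < 0.05:                 # report only the best-looking result
        hits += 1

print(hits / trials)                      # ~0.64, i.e., about 1 - 0.95**20
```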
Downsides to the “Appeal to Beliefs” Solution to Inseverity
What’s wrong with Royall’s appeal to prior beliefs to withhold support from a “just so” hypothesis? It may get you out of a jam in some cases. Here’s why the severe tester objects. First, she insists on distinguishing the evidential warrant for one and the same hypothesis H in two cases: one where it was constructed post hoc, cherry picked, and so on; a second where it was predesignated. A cherry-picked hypothesis H could well be believable, but we’d still want to distinguish the evidential credit H deserves in the two cases. Appealing to priors can’t help, since here there’s one and the same H. Perhaps someone wants to argue that the mode of testing alters the degree of belief in H, but this would be non-standard (violating the Likelihood Principle, to be discussed shortly). Philosopher Roger Rosenkrantz puts it thus: the LL entails the irrelevance “of whether the theory was formulated in advance or suggested by the observations themselves” (Rosenkrantz 1977, p. 121). For Rosenkrantz, a default Bayesian last I checked, this irrelevance of predesignation is altogether proper. By contrast, he admits, “Orthodox (non-Bayesian) statisticians have found this to be strong medicine indeed!” (ibid.). Many might say instead that it is bad medicine. Take, for instance, something called the CONSORT, the Consolidated Standards of Reporting Trials, for RCTs in medicine:
Selective reporting of outcomes is widely regarded as misleading. It undermines the validity of findings, particularly when driven by statistical significance or the direction of the effect [4], and has memorably been described in the New England Journal of Medicine as “Data Torturing” [5].
(COMpare Team 2015)
This gets to a second problem with relying on beliefs to block data-dredged hypotheses. Post-data explanations, even if it took a bit of data torture, are often incredibly convincing, and you don’t have to be a sleaze to really believe them. Goldacre (2016) expresses shock that medical journals continue to report outcomes that were altered post-data – he calls this outcome-switching. Worse, he finds, some journals defend the practice because they are convinced that their very good judgment entitles them to determine when to treat post-designated hypotheses as if they were predesignated. Unlike the LL, the CONSORT and many other best practice guides view these concerns as an essential part of reporting what the data say. Now you might say this is just semantics, as long as, in the end, they report that outcome-switching occurred. Maybe so, provided the report mentions why it would be misleading to hide the information. At least people have stopped referring to frequentist statistics as “Orthodox.”
There is a third reason to be unhappy with supposing the only way to block evidence for “just so” stories is by the deus ex machina of a low prior degree of belief: it misidentifies what the problem really is. The influence of the biased selection is not on the believability of H but rather on the capability of the test to have unearthed errors. The error probing capability of the testing procedure is being diminished. If you engage in cherry picking, you are not “sincerely trying,” as Popper puts it, to find flaws with claims, but instead you are finding evidence in favor of a well-fitting hypothesis that you deliberately construct – barred only if your intuitions say it’s unbelievable. The job that was supposed to be accomplished by an account of statistics now has to be performed by you. Yet you are the one most likely to follow your preconceived opinions, biases, and pet theories. If an account of statistical inference or evidence doesn’t supply self-critical tools, it comes up short in an essential way. So says the severe tester.
Souvenir B: Likelihood versus Error Statistical
The gift shop on this tour proffers pamphlets from these two perspectives, like handouts from competing political parties.
To the Likelihoodist, points in favor of the LL are:
The LR offers “a precise and objective numerical measure of the strength of statistical evidence” for one hypothesis over another; it is a frequentist account and does not use prior probabilities (Royall 2004, p. 123).
The LR is fundamentally related to Bayesian inference: the LR is the factor by which the ratio of posterior probabilities is changed by the data.
A Likelihoodist account does not consider outcomes other than the one observed, unlike P-values and Type I and II errors. (Irrelevance of the sample space.)
Fishing for maximally fitting hypotheses and other gambits that alter error probabilities do not affect the assessment of evidence; they may be blocked by moving to the “belief” category.
To the error statistician, problems with the LL include:
LRs do not convey the same evidential appraisal in different contexts.
The LL denies it makes sense to speak of how well or poorly tested a single hypothesis is on evidence, essential for model checking; it is inapplicable to composite hypothesis tests.
A Likelihoodist account does not consider outcomes other than the one observed, unlike P-values and Type I and II errors. (Irrelevance of the sample space.)
Fishing for maximally fitting hypotheses and other gambits that alter error probabilities do not affect the assessment of evidence; they may be blocked by moving to the “belief” category.
Notice that the last two points are identical for both: what’s a selling point for the Likelihoodist is a problem for the error statistician.
1.5 Trying and Trying Again: The Likelihood Principle
The likelihood principle emphasized in Bayesian statistics implies, among other things, that the rules governing when data collection stops are irrelevant to data interpretation.
(Edwards, Lindman, and Savage 1963, p. 193)
Several well-known gambits make it altogether easy to find evidence in support of favored claims, even when they are unwarranted. A responsible statistical inference report requires information about whether the method used is capable of controlling such erroneous interpretations of data or not. Now we see that adopting a statistical inference account is also to buy into principles for processing data, hence criteria for “what the data say,” hence grounds for charging an inference as illegitimate, questionable, or even outright cheating. The best way to survey the landscape of statistical debates is to hone in on some pivotal points of controversy – saving caveats and nuances for later on.
Consider, for example, the gambit of “trying and trying again” to achieve statistical significance, stopping the experiment only when reaching a nominally significant result. Kosher, or not? Suppose somebody reports data showing a statistically significant effect, say at the 0.05 level. Would it matter to your appraisal of the evidence if you found out that each time they failed to find significance, they went on to collect more data, until finally they did? A rule for when to stop sampling is called a stopping rule.
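Here is a minimal simulation sketch of what is at stake, using the Normal setup described just below (H0: μ = 0 true, σ = 1, nominal two-sided 0.05 test) and assuming a “try and try again” rule that tests after each new observation and stops at the first nominally significant result; the cap of 100 observations is my own addition, just to keep the simulation finite.

```python
import random

random.seed(2)
trials, cap, rejections = 5_000, 100, 0
for _ in range(trials):
    xs = []
    for n in range(1, cap + 1):
        xs.append(random.gauss(0, 1))          # H0 is true: mu = 0, sigma = 1
        xbar = sum(xs) / n
        if abs(xbar) > 1.96 / n**0.5:          # nominal 0.05-level cutoff at this n
            rejections += 1
            break                              # stop at the first "significant" result

print(rejections / trials)   # far above the nominal 0.05 of a fixed-n test
```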
The question is generally put by considering a random sample X that is Normally distributed with mean μ and standard deviation σ = 1, and we are testing the hypotheses:
H0: μ = 0 against H1: μ ≠ 0.
This is a two-sided test: a discrepancy in either direction is sought. (The details of testing are in Excursion 3 and thereafter.) To ensure a significance level of 0.05, H0 is rejected whenever the sample mean differs from 0 by more than 1.96σ/√n, and, since σ = 1, the rule is: declare x̄ statistically significant at the 0.05 level whenever |x̄| > 1.96/√n. However, instead of fixing the sample size in advance, n is determined by the optional stopping rule: