Equivalently, in terms of the test statistic d(X):

Keep sampling until |d(X)| ≥ 1.96.
Our question was: would it be relevant to your evaluation of the evidence if you learned she’d planned to keep running trials until reaching 1.96? Having failed to rack up a 1.96 difference after, say, 10 trials, she goes on to 20, and failing yet again, she goes to 30 and on and on until finally, say, on trial 169 she gets a 1.96 difference. Then she stops and declares the statistical significance is ~0.05.
This is an example of what’s called a proper stopping rule: the probability it will stop in a finite number of trials is 1, regardless of the true value of μ. Thus, in one of the most seminal papers in statistical foundations, Ward Edwards, Harold Lindman, and Leonard (Jimmie) Savage (E, L, & S) tell us, “if an experimenter uses this procedure, then with probability 1 he will eventually reject any sharp null hypothesis, even though it be true” (1963, p. 239). Understandably, they observe, the significance tester frowns on optional stopping, or at least requires that the P-value be adjusted for it. Had n been fixed, the significance level would be 0.05, but with optional stopping it increases.
Imagine instead an account that advertised itself as ignoring stopping rules. What if an account declared:
In general, suppose that you collect data of any kind whatsoever – not necessarily Bernoullian, nor identically distributed, nor independent of each other… – stopping only when the data thus far collected satisfy some criterion of a sort that is sure to be satisfied sooner or later, then the import of the sequence of n data actually observed will be exactly the same as it would be had you planned to take exactly n observations in the first place.
(ibid., pp. 238–9)
I’ve been teasing you, because these same authors who warn that to ignore stopping rules is to guarantee rejecting the null hypothesis even if it’s true are the individuals who tout the irrelevance of stopping rules in the above citation – E, L, & S. They call it the Stopping Rule Principle. Are they contradicting themselves?
No. It is just that what looks to be, and indeed is, cheating from the significance testing perspective is not cheating from these authors’ Bayesian perspective. “[F]requentist test results actually depend not only on what x was observed, but on how the experiment was stopped” (Carlin and Louis 2008, p. 8). Yes, but shouldn’t they? Take a look at Table 1.1: by the time one reaches 50 trials, the probability of attaining a nominally significant 0.05 result is not 0.05 but 0.32. The actual or overall significance level is the probability of finding a 0.05 nominally significant result at some stopping point or other, up to the point it stops. The actual significance level accumulates.
Table 1.1 The effect of repeated significance tests (the “try and try again” method)

Number of trials n    Probability of rejecting H0 with a result nominally significant at the 0.05 level at or before n trials, given H0 is true

       1      0.05
       2      0.083
      10      0.193
      20      0.238
      30      0.280
      40      0.303
      50      0.320
      60      0.334
      80      0.357
     100      0.375
     200      0.425
     500      0.487
     750      0.512
    1000      0.531
Infinity      1.000
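Table 1.1’s entries can be checked directly. The following is a minimal Monte Carlo sketch (not from the text; the replication count and seed are illustrative): it draws standard normal observations under a true null μ = 0 and keeps sampling until |d(X)| = |Σxᵢ|/√n ≥ 1.96, or until n_max trials are exhausted.

```python
import math
import random

def reject_by_n(n_max, reps=10000, seed=1):
    """Estimate Pr(|d(X)| >= 1.96 at or before n_max trials; H0 true),
    where d(X) = sum(x_i)/sqrt(n) for x_i drawn from N(0, 1)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        total = 0.0
        for n in range(1, n_max + 1):
            total += rng.gauss(0.0, 1.0)
            if abs(total) / math.sqrt(n) >= 1.96:  # "try and try again"
                hits += 1
                break
    return hits / reps

print(reject_by_n(1))    # close to 0.05
print(reject_by_n(50))   # close to 0.32, as in Table 1.1
```

With a fixed n = 1 the rejection rate sits at the nominal 0.05; allowing the test to stop anywhere up to n = 50 inflates it to roughly 0.32, matching the table row.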
Well-known statistical critics from psychology, Joseph Simmons, Leif Nelson, and Uri Simonsohn, place at the top of their list of requirements the need to block flexible stopping: “Researchers often decide when to stop data collection on the basis of interim data analysis … many believe this practice exerts no more than a trivial influence on false-positive rates” (Simmons et al. 2011, p. 1361). “Contradicting this intuition,” they show the probability of erroneous rejections balloons. “A researcher who starts with 10 observations per condition and then tests for significance after every new … observation finds a significant effect 22% of the time” erroneously (ibid., p. 1362). Yet the followers of the Stopping Rule Principle deny it makes a difference to evidence. On their account, it doesn’t. It’s easy to see why there’s disagreement.
The Likelihood Principle
By what magic can such considerations disappear? One way to see the vanishing act is to hold, with Royall, that “ what the data have to say” is encompassed in likelihood ratios. This is the gist of a very important principle of evidence, the Likelihood Principle (LP). Bayesian inference requires likelihoods plus prior probabilities in hypotheses; but the LP has long been regarded as a crucial part of their foundation: to violate it is to be incoherent Bayesianly. Disagreement about the LP is a pivot point around which much philosophical debate between frequentists and Bayesians has turned. Here is a statement of the LP:
According to Bayes’s Theorem, Pr(x|μ) … constitutes the entire evidence of the experiment, that is it tells all that the experiment has to tell. More fully and more precisely, if y is the datum of some other experiment, and if it happens that Pr(x|μ) and Pr(y|μ) are proportional functions of μ (that is constant multiples of each other), then each of the two data x and y have exactly the same thing to say about the value of μ …
(Savage 1962, p. 17; substituting μ for λ)
Some go further and claim that if x and y give the same likelihood, “ they should give the same inference, analysis, conclusion, decision, action or anything else” (Pratt et al. 1995 , p. 542). Does the LP entail the LL? No. Bayesians, for example, generally hold to the LP, but would insist on priors that go beyond the LL. Even the converse may be denied (according to Hacking) but this is not of concern to us.
Weak Repeated Sampling Principle.
For sampling theorists (my error statisticians), by contrast, this example “taken in the context of examining consistency with θ = 0, is enough to refute the strong likelihood principle” (Cox 1978, p. 54), since, with probability 1, it will stop with a “nominally” significant result even though θ = 0. It contradicts what Cox and Hinkley call “the weak repeated sampling principle” (Cox and Hinkley 1974, p. 51). “[W]e should not follow procedures which for some possible parameter values would give, in hypothetical repetitions, misleading conclusions most of the time” (ibid., pp. 45–6).
For Cox and Hinkley, to report a 1.96 standard deviation difference from optional stopping just the same as if the sample size had been fixed is to discard relevant information for inferring inconsistency with the null, while “according to any approach that is in accord with the strong likelihood principle, the fact that this particular stopping rule has been used is irrelevant” (ibid., p. 51). What they call the “strong” likelihood principle will just be called the LP here. (A weaker form boils down to sufficiency; see Excursion 3.)
Exhibit (ii): How Stopping Rules Drop Out.
Our question remains: by what magic can such considerations disappear? Formally, the answer is straightforward. Consider two versions of the above experiment: in the first, 1.96 is reached via fixed sample size (n = 169); in the second, by means of optional stopping that ended at 169. While d(x) = d(y), because of the stopping rule the likelihood of y differs from that of x by a constant k, that is,
Pr(y|Hi) = k Pr(x|Hi), for some constant k > 0.
Given that likelihoods enter only as ratios, such proportional likelihoods are often said to be the “same.” Now suppose inference is by Bayes’s Theorem: since likelihoods enter as ratios, the constant k drops out. This is easily shown; I follow E, L, & S, p. 237.
For simplicity, suppose the possible hypotheses are exhausted by two, H 0 and H 1 , neither with probability of 0.
To show Pr(H0|y) = Pr(H0|x):
(1) We are given the proportionality of likelihoods, for a constant k:
Pr(y|H0) = k Pr(x|H0),
Pr(y|H1) = k Pr(x|H1).
(2) By definition (Bayes’s Theorem):
Pr(H0|y) = Pr(y|H0) Pr(H0)/Pr(y),
where the denominator Pr(y) = Pr(y|H0) Pr(H0) + Pr(y|H1) Pr(H1).
Now substitute for each term in (2) the proportionality claims in (1). That is, replace Pr(y|H0) with k Pr(x|H0) and Pr(y|H1) with k Pr(x|H1).
(3) The result is
Pr(H0|y) = k Pr(x|H0) Pr(H0)/[k Pr(x|H0) Pr(H0) + k Pr(x|H1) Pr(H1)] = Pr(x|H0) Pr(H0)/[Pr(x|H0) Pr(H0) + Pr(x|H1) Pr(H1)] = Pr(H0|x),
since k cancels from numerator and denominator. The posterior probabilities are the same whether the 1.96 result emerged from optional stopping, Y, or fixed sample size, X.
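The cancellation of k can be made concrete with a toy computation. (The likelihood values, prior, and the constant k below are hypothetical, chosen only for illustration.)

```python
def posterior_H0(lik0, lik1, prior0=0.5):
    """Bayes' Theorem with two exhaustive hypotheses H0 and H1."""
    prior1 = 1.0 - prior0
    return (lik0 * prior0) / (lik0 * prior0 + lik1 * prior1)

# Hypothetical likelihoods for the fixed-sample-size experiment x:
lik_x_H0, lik_x_H1 = 0.10, 0.30
k = 2.7  # arbitrary positive constant induced by the stopping rule
# Proportional likelihoods for the optional-stopping experiment y:
lik_y_H0, lik_y_H1 = k * lik_x_H0, k * lik_x_H1

p_x = posterior_H0(lik_x_H0, lik_x_H1)
p_y = posterior_H0(lik_y_H0, lik_y_H1)
print(p_x, p_y)  # equal: k cancels in numerator and denominator
```

Whatever positive k the stopping rule induces, the posterior for H0 is unchanged, which is exactly the formal content of the Stopping Rule Principle here.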
This essentially derives the LP from inference by Bayes’s Theorem, and shows the equivalence for the particular case of interest, optional stopping. As always, when showing a Bayesian computation I use the conditional probability “|” rather than the “;” of the frequentist.3
The 1959 Savage Forum: What Counts as Cheating?
My colleague, well-known Bayesian I. J. Good, would state it as a “paradox”:
[I]f a Fisherian is prepared to use optional stopping (which usually he is not) he can be sure of rejecting a true null hypothesis provided that he is prepared to go on sampling for a long time. The way I usually express this ‘paradox’ is that a Fisherian [but not a Bayesian] can cheat by pretending he has a plane to catch like a gambler who leaves the table when he is ahead.
(Good 1983, p. 135)
The lesson about who is allowed to cheat depends on your statistical philosophy. Error statisticians require that the overall and not the “computed” significance level be reported. To them, cheating would be to report the significance level you got after trying and trying again in just the same way as if the test had a fixed sample size (Mayo 1996, p. 351). Viewing statistical methods as tools for severe tests, rather than as probabilistic logics of evidence, makes a deep difference to the tools we seek. Already we find ourselves thrust into some of the knottiest and most intriguing foundational issues.
This is Jimmie Savage’s message at a 1959 forum deemed sufficiently important to occupy a large gallery of the Museum of Statistics (hereafter “The Savage Forum” (Savage 1962)). Attendees include Armitage, Barnard, Bartlett, Cox, Good, Jenkins, Lindley, Pearson, Rubin, and Smith. Savage announces to this eminent group of statisticians that if adjustments in significance levels are required for optional stopping, which they are, then the fault must be with significance levels. Not all agreed. Needling Savage on this issue was Peter Armitage:
I feel that if a man deliberately stopped an investigation when he had departed sufficiently far from his particular hypothesis, then ‘Thou shalt be misled if thou dost not know that.’ If so, prior probability methods seem to appear in a less attractive light than frequency methods where one can take into account the method of sampling.
(Armitage 1962, p. 72)
Armitage, an expert in sequential trials in medicine, is fully in favor of them, but he thinks stopping rules should be reflected in overall inferences. He goes further:
[Savage] remarked that, using conventional significance tests, if you go on long enough you can be sure of achieving any level of significance; does not the same sort of result happen with Bayesian methods?
(ibid., p. 72)
He has in mind using a type of uniform prior probability for μ, wherein the posterior for the null hypothesis matches the significance level. (We return to this in Excursion 6. For σ = 1, the posterior distribution of μ is Normal(x̄, 1/n).)
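Armitage’s matching result can be checked numerically. A minimal sketch, assuming σ = 1 and an illustrative sample size and observed mean: with a flat prior, the posterior Normal(x̄, 1/n) assigns the region μ ≤ 0 exactly the one-sided P-value for testing μ = 0.

```python
import math

def Phi(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

n, xbar = 100, 0.20  # illustrative sample size and observed mean, sigma = 1

# One-sided P-value for H0: mu <= 0 against mu > 0, at the observed mean:
p_value = 1.0 - Phi(math.sqrt(n) * xbar)

# Flat prior on mu gives posterior Normal(xbar, 1/n), so the posterior
# probability of the null region mu <= 0 is:
posterior_null = Phi((0.0 - xbar) * math.sqrt(n))

print(p_value, posterior_null)  # the two quantities coincide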
Not all cases of trying and trying again injure error probabilities. Think of trying and trying again until you find a key that fits a lock. When you stop, there’s no probability of being wrong. (We return to this in Excursion 4.)
Savage’s Sleight of Hand
Responding to Armitage, Savage engages in a bit of sleight of hand. Moving from the problematic example to one of two predesignated point hypotheses, H0: μ = μ0 and H1: μ = μ1, he shows that the error probabilities are controlled in that case. In particular, the probability of obtaining a result that makes H1 r times more likely than H0 is less than 1/r: Pr(LR > r; H0) < 1/r. But that wasn’t Armitage’s example; nor does Savage return to it. Now, it is open to Likelihoodists to resist being saddled “with ideas that are alien to them” (Sober 2008, p. 77). Since the Likelihoodist keeps to this type of comparative appraisal, they can set bounds to the probabilities of error. However, the bounds are no longer impressively small as we add hypotheses, even if they are predesignated4 (Mayo and Kruse 2001).
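Savage’s bound Pr(LR > r; H0) < 1/r can be illustrated by simulation. This is a sketch under hypothetical choices (point hypotheses H0: μ = 0 and H1: μ = 1, a single N(μ, 1) observation, illustrative replication count): the bound holds because the likelihood ratio has expectation 1 under H0.

```python
import math
import random

def lr_exceeds(r, mu1=1.0, reps=100000, seed=7):
    """Under H0: mu = 0, with one N(mu, 1) draw, estimate the probability
    that the likelihood ratio in favor of H1: mu = mu1 exceeds r."""
    rng = random.Random(seed)
    count = 0
    for _ in range(reps):
        x = rng.gauss(0.0, 1.0)
        lr = math.exp(mu1 * x - 0.5 * mu1 * mu1)  # f(x; mu1) / f(x; 0)
        if lr > r:
            count += 1
    return count / reps

for r in (2, 5, 10):
    print(r, lr_exceeds(r), "bound 1/r =", 1 / r)
```

For each r, the estimated probability of misleading comparative evidence stays below 1/r, which is the pre-data error control Savage points to in the predesignated two-hypothesis case.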
Something more revealing is going on when the Likelihoodist sets pre-data bounds. Why the sudden concern with showing the rule for comparative evidence would very improbably find evidence in favor of the wrong hypothesis? This is an error probability. So it appears they also care about error probabilities – at least before-trial – or they are noting, for those of us who do, that they also have error control in the simple case of predesignated point hypotheses. The severe tester asks: If you want to retain these pre-data safeguards, why allow them to be spoiled by data-dependent hypotheses and stopping rules?
Some have said: the evidence is the same, but you take into account things like stopping rules and data-dependent selections afterwards. When making an inference, this is afterwards, and we need an epistemological rationale to pick up on their influences now. Perhaps knowing someone uses optional stopping warrants a high belief he’s trying to deceive you, leading to a high enough prior belief in the null. Maybe so, but this is to let priors reflect methods in a non-standard way. Besides, Savage (1961, p. 583) claimed optional stopping “is no sin,” so why should it impute deception? So far as I know, subjective Bayesians have resisted the idea that rules for stopping alter the prior. Couldn’t you pack the concern in some background B? You could, but you would need another account to justify doing so, thereby only pushing back the issue. I’ve discussed an assortment of attempts elsewhere: Mayo (1996), Mayo and Kruse (2001), Mayo (2014b). Others have too, discussed here and elsewhere; please see our online sources (preface).
Arguments from Intentions: All in Your Head?
A funny thing happened at the Savage Forum: George Barnard announces he no longer holds the LP for the two-sided test under discussion, only for the predesignated point alternatives. Savage is shocked to hear it:
I learned the stopping rule principle from Professor Barnard, in conversation in the summer of 1952. Frankly, I then thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe that some people resist an idea so patently right.
(Savage 1962, p. 76)
The argument Barnard gave him was that the plan for when to stop was a matter of the researchers’ intentions, all wrapped up in their heads. While Savage denies he was ever sold on the argument from intentions, it’s a main complaint you will hear about taking account, not just of stopping rules, but of error probabilities in general. Take the subjective Bayesian philosophers Howson and Urbach (1993):
A significance test inference, therefore, depends not only on the outcome that a trial produced, but also on the outcomes that it could have produced but did not. And the latter are determined by certain private intentions of the experimenters, embodying their stopping rule. It seems to us that this fact precludes a significance test delivering any kind of judgment about empirical support.
(p. 212)
The truth is, whether they’re hidden or not turns on your methodology being able to pick up on them. So the deeper question is: ought your account pick up on them?
The answer isn’t a matter of mathematics; it depends on your goals and perspective – yes, on your philosophy of statistics. Ask yourself: what features lead you to worry about cherry picking and selective reporting? Why do the CONSORT and myriad other best-practice manuals care? Looking just at the data and hypotheses – as a “logic” of evidence would – you will not see the machinations. Nevertheless, these machinations influence the capabilities of the tools. Much of the handwringing about irreproducibility is the result of wearing blinders as to the construction and selection of both hypotheses and data. In one sense, all test specifications are determined by a researcher’s intentions; that doesn’t make them private or invisible to us. They’re visible to accounts with antennae to pick up on them!
You might try to deflect the criticism of stopping rules by pointing out that some stopping rules do alter priors. Armitage wasn’t ignoring that, nor are we. These are called informative stopping rules, and examples are rather contrived. For instance, “a man who wanted to know how frequently lions watered at a certain pool was chased away by lions” (E, L, & S 1963, p. 239). They add, “we would not give a facetious example had we been able to think of a serious one.” In any event, this is irrelevant for the Armitage example, which is non-informative.
Error Probabilities Violate the LP
[I]t seems very strange that a frequentist could not analyze a given set of data, such as (x1, …, xn), if the stopping rule is not given … [D]ata should be able to speak for itself.
(Berger and Wolpert 1988, p. 78)
Inference by Bayes’s Theorem satisfies this intuition, which sounds appealing; but for our severe tester, data no more speak for themselves in the case of stopping rules than with cherry picking, hunting for significance, and the like. We may grant to the Bayesian that
[The] irrelevance of stopping rules to statistical inference restores a simplicity and freedom to experimental design that had been lost by classical emphasis on significance levels (in the sense of Neyman and Pearson).
(E, L, & S 1963, p. 239)
The question is whether this latitude is desirable. If you are keen to use statistical methods critically, as our severe tester, you’ll be suspicious of a simplicity and freedom to mislead.
Admittedly, this should have been more clearly spelled out by Neyman and Pearson. They rightly note: