Decoupling.
On the horizon is the idea that statistical methods may be decoupled from the philosophies in which they are traditionally couched. In an attempted meeting of the minds (Bayesian and error statistical), Andrew Gelman and Cosma Shalizi (2013) claim that “implicit in the best Bayesian practice is a stance that has much in common with the error-statistical approach of Mayo” (p. 10). In particular, Bayesian model checking, they say, uses statistics to satisfy Popperian criteria for severe tests. The idea of error statistical foundations for Bayesian tools is not as preposterous as it may seem. The concept of severe testing is sufficiently general to apply to any of the methods now in use. On the face of it, any inference, whether to the adequacy of a model or to a posterior probability, can be said to be warranted just to the extent that it has withstood severe testing. Where this will land us is still futuristic.
Why Our Journey?
We have all, or nearly all, moved past these old [Bayesian-frequentist] debates, yet our textbook explanations have not caught up with the eclecticism of statistical practice.
(Kass 2011, p. 1)
When Kass proffers “a philosophy that matches contemporary attitudes,” he finds resistance to his big tent. Being hesitant to reopen wounds from old battles does not heal them. Distilling them in inoffensive terms just leads to the marshy swamp. Textbooks can’t “catch up” by soft-pedaling competing statistical accounts. The old disputes show up in the current problems of scientific integrity, irreproducibility, and questionable research practices, and in the swirl of methodological reforms and guidelines that spin their way down from journals and reports.
From an elevated altitude we see how it occurs. Once high-profile failures of replication spread to biomedicine, and other “hard” sciences, the problem took on a new seriousness. Where does the new scrutiny look? By and large, it collects from the earlier social science “significance test controversy” and the traditional philosophies coupled to Bayesian and frequentist accounts, along with the newer Bayesian–frequentist unifications we just surveyed. This jungle has never been disentangled. No wonder leading reforms and semi-popular guidebooks contain misleading views about all these tools. No wonder we see the same fallacies that earlier reforms were designed to avoid, and even brand new ones. Let me be clear: I’m not speaking about flat-out howlers such as interpreting a P-value as a posterior probability. By and large, the problems are more subtle; you’ll want to reach your own position on them. It’s not a matter of switching your tribe, but of excavating the roots of tribal warfare to tell what’s true about these methods. I don’t mean understanding them at the socio-psychological level, although there’s a good story there (and I’ll leak some of the juicy parts during our travels).
How can we make progress when it is difficult even to tell what is true about the different methods of statistics? We must start afresh, taking responsibility to offer a new standpoint from which to interpret the cluster of tools around which there has been so much controversy. Only then can we alter and extend their limits. I admit that the statistical philosophy that girds our explorations is not out there ready-made; if it were, there would be no need for our holiday cruise. While there are plenty of giant shoulders on which we stand, we won’t be restricted by the pronouncements of any of the high and low priests, as sagacious as many of their words have been. In fact, we’ll brazenly question some of their most entrenched mantras. Grab on to the gondola, our balloon’s about to land.
In Tour II, I’ll give you a glimpse of the core of the statistics battles, with a firm promise to retrace the steps more slowly in later trips.
1 This contrasts with the use of “metaresearch” to describe work on methodological reforms by non-philosophers. This is not to say they don’t tread on philosophical territory often: they do.
2 This is the traditional use of “bias” as a systematic error. Ioannidis (2005) alludes to biasing as behaviors that result in a reported significance level differing from the value it actually has or ought to have (e.g., post-data endpoints, selective reporting). I will call those biasing selection effects.
Tour II
Error Probing Tools versus Logics of Evidence
1.4 The Law of Likelihood and Error Statistics
If you want to understand what’s true about statistical inference, you should begin with what has long been a holy grail – to use probability to arrive at a type of logic of evidential support – and in the first instance you should look not at full-blown Bayesian probabilism, but at comparative accounts that sidestep prior probabilities in hypotheses. An intuitively plausible logic of comparative support was given by the philosopher Ian Hacking (1965) – the Law of Likelihood. Fortunately, the Museum of Statistics is organized by theme, and the Law of Likelihood and the related Likelihood Principle is a big one.
Law of Likelihood (LL): Data x are better evidence for hypothesis H₁ than for H₀ if x is more probable under H₁ than under H₀: Pr(x; H₁) > Pr(x; H₀), that is, the likelihood ratio (LR) of H₁ over H₀ exceeds 1.
H₀ and H₁ are statistical hypotheses that assign probabilities to values of the random variable X. A fixed value of X is written x₀, but we often want to generalize about this value, in which case, following others, I use x. The likelihood of the hypothesis H, given data x, is the probability of observing x, under the assumption that H is true or adequate in some sense. Typically, the ratio of the likelihood of H₁ over H₀ also supplies the quantitative measure of comparative support. Note that when X is continuous, the probability is assigned over a small interval around x, to avoid probability 0.
Does the Law of Likelihood Obey the Minimal Requirement for Severity?
Likelihoods are vital to all statistical accounts, but they are often misunderstood because the data are fixed and the hypothesis varies. Likelihoods of hypotheses should not be confused with their probabilities. There are two ways to see this. First, suppose you discover that all of the stocks in Pickrite’s promotional letter went up in value (x) – all winners. A hypothesis H to explain this is that their method always succeeds in picking winners. H entails x, so the likelihood of H given x is 1. Yet we wouldn’t say H is therefore highly probable, especially without reason to put to rest the possibility that they culled the winners post hoc. For a second way, at any time the same phenomenon may be perfectly predicted or explained by two rival theories; so both theories are equally likely on the data, even though they cannot both be true.
Suppose Bristol-Roach, in our Bernoulli tea-tasting example, got two correct guesses followed by one failure. The observed data can be represented as x₀ = ⟨1,1,0⟩. Let the hypotheses be different values for θ, the probability of success on each independent trial. The likelihood of the hypothesis H₀: θ = 0.5, given x₀, which we may write as Lik(0.5), equals (1/2)(1/2)(1/2) = 1/8. Strictly speaking, we should write Lik(θ; x₀), because it’s always computed given data x₀; I will do so later on. The likelihood of the hypothesis θ = 0.2 is Lik(0.2) = (0.2)(0.2)(0.8) = 0.032. In general, the likelihood in the case of Bernoulli independent and identically distributed trials takes the form Lik(θ) = θ^s(1 − θ)^f, 0 < θ < 1, where s is the number of successes and f the number of failures. Infinitely many values for θ between 0 and 1 yield positive likelihoods; clearly then, likelihoods do not sum to 1, or to any number in particular. Likelihoods do not obey the probability calculus.
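To make the arithmetic concrete, here is a minimal sketch (my own illustration, not code from the text) that computes these Bernoulli likelihoods for x₀ = ⟨1,1,0⟩ and checks that, unlike probabilities, the likelihoods over θ need not sum or integrate to 1:

```python
# Illustrative sketch: Bernoulli likelihoods for x0 = <1,1,0> (2 successes, 1 failure).
import numpy as np

x0 = np.array([1, 1, 0])
s, f = x0.sum(), len(x0) - x0.sum()

def lik(theta):
    """Lik(theta; x0) = theta^s * (1 - theta)^f."""
    return theta**s * (1 - theta)**f

print(lik(0.5))   # 0.125 = 1/8
print(lik(0.2))   # 0.032 (up to floating point)

# The likelihood function over theta is not a probability distribution:
thetas = np.linspace(0.001, 0.999, 999)
print(np.trapz(lik(thetas), thetas))   # ~0.083, not 1
```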
The Law of Likelihood (LL) will immediately be seen to fail our minimal severity requirement – at least if it is taken as an account of inference. Why? There is no onus on the Likelihoodist to predesignate the rival hypotheses – you are free to search, hunt, and post-designate a more likely, or even maximally likely, rival to a test hypothesis H₀.
Consider the hypothesis that θ = 1 on trials one and two and θ = 0 on trial three. That makes the probability of x maximal. For another example, hypothesize that the observed pattern would always recur in three trials of the experiment (I. J. Good said that in his cryptanalysis work these were called “kinkera”). Hunting for an impressive fit, or trying and trying again, one is sure to find a rival hypothesis H₁ much better “supported” than H₀, even when H₀ is true. As George Barnard puts it, “there always is such a rival hypothesis, viz. that things just had to turn out the way they actually did” (1972, p. 129).
Note that for any outcome of n Bernoulli trials, the likelihood of H₀: θ = 0.5 is (0.5)^n, which is quite small. The likelihood ratio (LR) of a best-supported alternative compared to H₀ would be quite high. Since one could always erect such an alternative,
(*) Pr(LR in favor of H₁ over H₀; H₀) = maximal.
Thus the LL permits BENT evidence. The severity for H₁ is minimal, though the particular H₁ is not formulated until the data are in hand. I call such maximally fitting, but minimally severely tested, hypotheses Gellerized, since Uri Geller was apt to erect a way to explain his results in ESP trials. Our Texas sharpshooter is analogous because he can always draw a circle around a cluster of bullet holes, or around each single hole. One needn’t go to such an extreme rival, but it suffices to show that the LL does not control the probability of erroneous interpretations.
What do we do to compute (*)? We look beyond the specific observed data to the behavior of the general rule or method, here the LL. The output is always a comparison of likelihoods. We observe one outcome, but we can consider that for any outcome, unless it makes H₀ maximally likely, we can find an H₁ that is more likely. This lets us compute the relevant properties of the method: its inability to block erroneous interpretations of data. As always, a severity assessment is one level removed: you give me the rule, and I consider its latitude for erroneous outputs. We’re actually looking at the probability distribution of the rule, over outcomes in the sample space. This distribution is called a sampling distribution. It’s not a very apt term, but nothing has arisen to replace it. For those who embrace the LL, once the data are given, it’s irrelevant what other outcomes could have been observed but were not. Likelihoodists say that such considerations make sense only if the concern is the performance of a rule over repetitions, but not for inference from the data. Likelihoodists hold to “the irrelevance of the sample space” (once the data are given). This is the key contrast between accounts based on error probabilities (error statistical accounts) and logics of statistical inference.
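Here is a minimal simulation sketch of what (*) expresses, with settings I have chosen for illustration (n = 10 trials, 100,000 repetitions): generate data under H₀: θ = 0.5 and post-designate the Gellerized rival that matches each observed trial exactly; the likelihood ratio favors that rival every single time.

```python
# Sketch of (*): under H0, a post-designated, perfectly fitting rival always wins on the LR.
import numpy as np

rng = np.random.default_rng(2018)
n, reps = 10, 100_000
favors_rival = 0
for _ in range(reps):
    x = rng.integers(0, 2, size=n)   # n Bernoulli(0.5) trials, generated under H0
    lik_H0 = 0.5**n                  # H0 gives every outcome the same small likelihood
    lik_H1 = 1.0                     # Gellerized rival: theta_i set equal to x_i on each trial
    favors_rival += (lik_H1 / lik_H0 > 1)   # LR = 2^n, so this always holds
print(favors_rival / reps)           # 1.0: Pr(LR favors H1 over H0; H0) is maximal
```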
Hacking: “There is No Such Thing as a Logic of Statistical Inference”
Hacking’s (1965) book was so ahead of its time that by the time philosophers of science started to get serious about philosophy of statistics, he had already broken the law he had earlier advanced. Hacking (1972, 1980) admits to having been caught up in the “logicist” mindset, wherein we assume a logical relationship exists between any data and hypothesis, and he even denies (1980, p. 145) that there is any such thing.
In his review of A. W. F. Edwards’ (1972) book Likelihood, Hacking (1972) gives his main reasons for rejecting the LL:
We capture enemy tanks at random and note the serial numbers on their engines. We know the serial numbers start at 0001. We capture a tank number 2176. How many did the enemy make? On the likelihood analysis, the best-supported guess is: 2176. Now one can defend this remarkable result by saying that it does not follow that we should estimate the actual number as 2176; only that, comparing individual numbers, 2176 is better supported than any larger figure. My worry is deeper. Let us compare the relative likelihood of the two hypotheses, 2176 and 3000. Now pass to a situation where we are measuring, say, widths of a grating in which error has a normal distribution with known variance; we can devise data and a pair of hypotheses about the mean which will have the same log-likelihood ratio. I have no inclination to say that the relative support in the tank case is ‘exactly the same as’ that in the normal distribution case, even though the likelihood ratios are the same.
(pp. 136–7)
Likelihoodists will insist that the law may be upheld by appropriately invoking background information, and by drawing distinctions between evidence, belief, and action.
Royall’ s Road to Statistical Evidence
Statistician Richard Royall, a longtime leader of Likelihoodist tribes, has had a deep impact on current statistical foundations. His views are directly tied to recent statistical reforms – even if those reformers go Bayesian rather than stopping, like Royall, with comparative likelihoods. He provides what many consider a neat proposal for settling disagreements about statistical philosophy. He distinguishes three questions: belief, action, and evidence:
1. What do I believe, now that I have this observation?
2. What should I do, now that I have this observation?
3. How should I interpret this observation as evidence regarding [H₀] versus [H₁]?
(Royall 1997, p. 4)
Can we line up these three goals with my probabilism, performance, and probativeness (Section 1.2)? No. Probativeness gets no pigeonhole. According to Royall, what to believe is captured by Bayesian posteriors; how to act is captured by frequentist performance (in some cases he will add costs). What’s his answer to the evidence question? The Law of Likelihood.
Let’s use one of Royall’s first examples, appealing to Bernoulli distributions again – independent, dichotomous trials, “success” or “failure”:
Medical researchers are interested in the success probability, θ, associated with a new treatment. They are particularly interested in how θ relates to the old treatment’s success probability, believed to be about 0.2. They have reason to hope that θ is considerably greater, perhaps 0.8 or even greater.
(Royall 1997, p. 19)
There is a set of possible outcomes, a sample space, S, and a set of possible parameter values, a parameter space Ω. He considers two hypotheses:
θ = 0.2 and θ = 0.8.
These are simple or point hypotheses. To illustrate, take a miniature example with only n = 4 trials, where each can be a “success” {X = 1} or a “failure” {X = 0}. A possible result might be x₀ = ⟨1,1,0,1⟩. Since Pr(X = 1) = θ and Pr(X = 0) = 1 − θ, and the trials are independent, the probabilities multiply: the probability of x₀ is (θ)(θ)(1 − θ)(θ). Under the two hypotheses, given ⟨1,1,0,1⟩, the likelihoods are
Lik(H₀) = (0.2)(0.2)(0.8)(0.2) = 0.0064,
Lik(H₁) = (0.8)(0.8)(0.2)(0.8) = 0.1024.
A hypothesis that would make the data most probable would be that θ = 1 on the three trials that yield successes, and θ = 0 on the trial that yields a failure.
We typically denigrate “just so” stories, purposely erected to fit the data, as “unlikely.” Yet they are most likely in the technical sense! So in hearing likelihood used formally, you must continually keep this swap of meanings in mind. (We call them Gellerized only if they pass with minimal severity.) If θ is to be constant on each trial, as in the Bernoulli model, the maximally likely hypothesis equates θ with the relative frequency of success, 0.75. [Exercise for reader: find Lik(0.75).]
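As a check on these numbers (and on the reader exercise), here is a minimal sketch, my own illustration rather than Royall’s code, computing the likelihoods for x₀ = ⟨1,1,0,1⟩:

```python
# Likelihoods for Royall's miniature example: x0 = <1,1,0,1>, 3 successes and 1 failure.
x0 = [1, 1, 0, 1]
s, f = sum(x0), len(x0) - sum(x0)

def lik(theta):
    return theta**s * (1 - theta)**f

print(lik(0.2))              # 0.0064 = Lik(H0)
print(lik(0.8))              # 0.1024 = Lik(H1)
print(lik(0.8) / lik(0.2))   # 16.0: the LR of H1 over H0
print(lik(0.75))             # ~0.1055: the maximally likely constant-theta value (reader exercise)
```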
Exhibit (i): Law of Likelihood Compared to a Significance Test.
Here Royall contrasts his handling of the medical example with the standard significance test:
A standard statistical analysis of their observations would use a Bernoulli (θ) statistical model and test the composite hypotheses H₀: θ ≤ 0.2 versus H₁: θ > 0.2. That analysis would show that H₀ can be rejected in favor of H₁ at any significance level greater than 0.003, a result that is conventionally taken to mean that the observations are very strong evidence supporting H₁ over H₀.
(Royall 1997, p. 19; substituting H₀ and H₁ for H₁ and H₂.)
So the significance tester looks at the composite hypotheses H₀: θ ≤ 0.2 vs. H₁: θ > 0.2, rather than his point hypotheses θ = 0.2 and θ = 0.8. Here, she would look at how much larger the mean success rate in the sample is, which we abbreviate as x̄, compared to what is expected under H₀, put in standard deviation units. Using Royall’s numbers, with n = 17 trials, the observed success rate is x̄ ≈ 0.53.
The test statistic is d(X) = √n(X̄ − 0.2)/σ, where σ = √[0.2(1 − 0.2)] = 0.4; it gets larger and larger the more the data deviate from what is expected under H₀, as is sensible for a good test statistic. Its value is
d(x₀) = √17(0.53 − 0.2)/0.4 ≈ 3.4.
The significance level associated with d(x₀) is
Pr(d(X) ≥ d(x₀); H₀) ≃ 0.003.
This is read: “the probability that d(X) would be at least as large as the particular value d(x₀), under the supposition that H₀ adequately describes the data generation procedure” (see Souvenir C). It’s not strictly a conditional probability – a subtle point that won’t detain us here. We continue to follow Royall’s treatment, though we’d want to distinguish the mere indication of an isolated significant result from strong evidence. We’d also have to audit for model assumptions and selection effects, but we assume these check out; after all, Royall’s likelihood account also depends on the model holding.
We’d argue along the following lines: were H₀ a reasonable description of the process, then with very high probability you would not be able to regularly produce d(x) values as large as this:
Pr(d(X) < d(x₀); H₀) ≃ 0.997.
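These figures can be reproduced with a short sketch; note that the count of 9 successes (giving x̄ ≈ 0.53 with n = 17) is my reading of Royall’s numbers, not a quantity stated explicitly above:

```python
# Significance-test computations for the medical example (assumed: 9 successes in n = 17 trials).
from math import comb, sqrt

n, successes, theta0 = 17, 9, 0.2
xbar = successes / n                            # ~0.53
sigma = sqrt(theta0 * (1 - theta0))             # 0.4

d_x0 = sqrt(n) * (xbar - theta0) / sigma        # standardized distance from what H0 expects
print(round(d_x0, 2))                           # ~3.4

# Exact binomial tail: Pr(d(X) >= d(x0); H0) = Pr(X >= 9; n = 17, theta = 0.2)
p_tail = sum(comb(n, k) * theta0**k * (1 - theta0)**(n - k)
             for k in range(successes, n + 1))
print(round(p_tail, 4))                         # ~0.003
print(round(1 - p_tail, 3))                     # ~0.997
```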
So if you manage to get such a large difference, I may infer that x indicates a genuine effect. Let’s go back to Royall’s contrast, because he’s very unhappy with this.