We first need to define a mixed experiment. Suppose that there are two experiments, E1 = {y, θ, p(y|θ)} and E2 = {z, θ, p(z|θ)}, and that the random variable k is such that
$$P(k=1)=P(k=2)=\tfrac{1}{2},$$
whatever θ is and independently of y and z. Then the mixed experiment E* consists of carrying out E1 if k = 1 and E2 if k = 2. It can also be defined as the triple E* = {x, θ, p(x|θ)} where
$$x=\begin{cases}(1,y)&\text{if }k=1\\(2,z)&\text{if }k=2\end{cases}$$
and
$$p(x\mid\theta)=\begin{cases}\tfrac{1}{2}\,p(y\mid\theta)&\text{if }k=1\\ \tfrac{1}{2}\,p(z\mid\theta)&\text{if }k=2.\end{cases}$$
We only need to assume the following rather weak form of the principle:
Weak conditionality principle. If E1, E2 and E* are as defined earlier, then
$$\mathrm{Ev}(E^*,(1,y))=\mathrm{Ev}(E_1,y)\qquad\text{and}\qquad \mathrm{Ev}(E^*,(2,z))=\mathrm{Ev}(E_2,z),$$
that is, the evidence about θ from E* is just the evidence from the experiment actually performed.
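As a simple illustration (the particular distributions are only illustrative choices), suppose that E1 consists of a single measurement y ~ N(θ, 1) made with an accurate instrument and E2 of a single measurement z ~ N(θ, 100) made with a much less accurate one, the instrument actually used being decided by the toss of a fair coin. The principle then asserts that, for example,
$$\mathrm{Ev}(E^*,(1,y))=\mathrm{Ev}(E_1,y),$$
so that once we know the accurate instrument was the one actually used, inferences about θ should be based on the variance 1 of the measurement actually made and not on any average over the instrument that might have been used but was not.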
7.1.3 The sufficiency principle
The sufficiency principle says that if t(x) is sufficient for θ given x, then any inference we may make about θ may be based on the value of t, and once we know that value we have no need of the value of x itself. We have already seen in Section 2.9 that Bayesian inference satisfies the sufficiency principle. The form in which the sufficiency principle will be used in this section is as follows:
7.1.3.1 Weak sufficiency principle
Consider the experiment E = {x, θ, p(x|θ)} and suppose that t=t(x) is sufficient for θ given x. Then if t(x1)=t(x2)
$$\mathrm{Ev}(E,x_1)=\mathrm{Ev}(E,x_2).$$
This clearly implies that, as stated in Corollary 2.1 in Section 2.9.3, ‘For any prior distribution, the posterior distribution of θ given x is the same as the posterior distribution of θ given a sufficient statistic t’. In Bayesian statistics inference is based on the posterior distribution, but this principle makes it clear that even if we had some other method of arriving at conclusions, x1 and x2 would still lead to the same conclusions.
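As a simple instance of the principle, if x consists of a fixed number of Bernoulli trials with probability π of success, then the total number t = Σ xi of successes is sufficient for π, and the two particular sequences x1 = (1, 0, 0, 1) and x2 = (0, 1, 1, 0) both have t = 2 and likelihood
$$\pi^2(1-\pi)^2,$$
so that the principle requires Ev(E, x1) = Ev(E, x2).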
7.1.4 The likelihood principle
For the moment, we will state what the likelihood principle is – its implications will be explored later.
7.1.4.1 Likelihood principle
Consider two different experiments E1 = {y, θ, p1(y|θ)} and E2 = {z, θ, p2(z|θ)} where θ is the same quantity in each experiment. Suppose that there are particular possible outcomes y* of experiment E1 and z* of E2 such that
$$p_1(y^*\mid\theta)=c\,p_2(z^*\mid\theta)$$
for some constant c, that is, the likelihoods of θ as given by these possible outcomes of the two experiments are proportional, so that
$$l(\theta\mid y^*)\propto l(\theta\mid z^*).$$
Then
$$\mathrm{Ev}(E_1,y^*)=\mathrm{Ev}(E_2,z^*).$$
The following theorem [due to Birnbaum (1962)] shows that the likelihood principle follows from the other two principles described earlier.
Theorem 7.1 The likelihood principle follows from the weak conditionality principle and the weak sufficiency principle.
Proof. If E1 and E2 are the two experiments about θ figuring in the statement of the likelihood principle, consider the mixed experiment E* which arose in connection with the weak conditionality principle. Define a statistic t by
$$t(x)=\begin{cases}(1,y^*)&\text{if }x=(2,z^*)\\ x&\text{otherwise.}\end{cases}$$
(Note that if experiment 2 is performed and we observe the value z*, then by the assumption of the likelihood principle there is a value y* such that p1(y*|θ) = c p2(z*|θ), so we can take this value of y* in the proof.) Now note that if x = (1, y*) then
$$P\bigl(x=(1,y^*)\mid t=(1,y^*),\theta\bigr)=\frac{\tfrac12\,p_1(y^*\mid\theta)}{\tfrac12\,p_1(y^*\mid\theta)+\tfrac12\,p_2(z^*\mid\theta)}=\frac{c}{c+1},$$
whereas if x = (2, z*) then
$$P\bigl(x=(2,z^*)\mid t=(1,y^*),\theta\bigr)=\frac{1}{c+1},$$
and if t ≠ (1, y*) but x = t then
$$P(x=t\mid t,\theta)=1,$$
while for t ≠ (1, y*) and all other x we have P(x|t, θ) = 0. In no case does P(x|t, θ) depend on θ and hence, from the definition given when sufficiency was first introduced in Section 2.9, t is sufficient for θ given x. It follows from the weak sufficiency principle that Ev(E*, (1, y*)) = Ev(E*, (2, z*)). But the weak conditionality principle now ensures that
$$\mathrm{Ev}(E_1,y^*)=\mathrm{Ev}(E^*,(1,y^*))=\mathrm{Ev}(E^*,(2,z^*))=\mathrm{Ev}(E_2,z^*),$$
establishing the likelihood principle.
Corollary 7.1 If E = {x, θ, p(x|θ)} is an experiment, then Ev(E, x) should depend on E and x only through the likelihood
$$l(\theta\mid x)\propto p(x\mid\theta).$$
Proof. For any one particular value x1 of x define the random variable
$$y=\begin{cases}1&\text{if }x=x_1\\ 0&\text{otherwise,}\end{cases}$$
so that p(y = 1|θ) = p(x1|θ) (since we have assumed for simplicity that everything is discrete this will not, in general, be zero). Now let the experiment E1 consist simply of observing y, that is, of noting whether or not x=x1. Then the likelihood principle ensures that Ev(E, x1) = Ev(E1, 1), and E1 depends solely on p(x1|θ) and hence solely on the likelihood of the observation actually made.
Converse 7.1 If the likelihood principle holds, then so do the weak conditionality principle and the weak sufficiency principle.
Proof. Using the notation introduced earlier for the mixed experiment, we see that if x=(1, y) then
$$p(x\mid\theta)=\tfrac12\,p(y\mid\theta)\propto p(y\mid\theta)$$
and so by the likelihood principle Ev(E*, (1, y)) = Ev(E1, y), implying the weak conditionality principle. Moreover, if t is a sufficient statistic and t(x1)=t(x2), then x1 and x2 have proportional likelihood functions, so that the likelihood principle implies the weak sufficiency principle.
7.1.5 Discussion
From the formulation of Bayesian inference as ‘posterior is proportional to prior times likelihood,’ it should be clear that Bayesian inference obeys the likelihood principle. It is not logically necessary that if you find the arguments for the likelihood principle convincing, you have to accept Bayesian inference, and there are some authors, for example, Edwards (1992), who have argued for a non-Bayesian form of inference based on the likelihood. Nevertheless, I think that Savage was right in saying in the discussion on Birnbaum (1962) that ‘… I suspect that once the likelihood principle is widely recognized, people will not long stop at that halfway house but will go forward and accept the implications of personalistic probability for statistics’.
Conversely, much of classical statistics notably fails to obey the likelihood principle – any use of tail areas (e.g. the probability of observing a value as large as that seen or greater) evidently involves matters other than the likelihood of the observations actually made. Another quotation from Savage, this time from Savage et al. (1962), may help to point to some of the difficulties that arise in connection with confidence intervals.
Imagine, for example, that two Meccans carefully drawn at random differ from each other in height by only 0.01 mm. Would you offer 19 to 1 odds that the standard deviation of the height of Meccans is less than 1.13 mm? That is the 95 per cent upper confidence limit computed with one degree of freedom. No, I think you would not have enough confidence in that limit to offer odds of 1 to 1.
In fact, the likelihood principle has serious consequences for both classical and Bayesian statisticians, and some of these consequences will be discussed in Sections 7.2–7.4. For classical statisticians, one of the most serious is the stopping rule principle, while for Bayesians one of the most serious is that Jeffreys’ rule for finding reference priors is incompatible with the likelihood principle.
7.2 The stopping rule principle
7.2.1 Definitions
We shall restrict ourselves to a simple situation, but it is possible to generalize the following account considerably; see Berger and Wolpert (1988, Section 4.2). Basically, in this section, we will consider a sequence of experiments which can be terminated at any stage in accordance with a rule devised by the experimenter (or forced upon him).
Suppose that the observations x1, x2, … are independently and identically distributed with density p(x|θ) and let
$$x^{(m)}=(x_1,x_2,\dots,x_m).$$
We say that s is a stopping rule or a stopping time if it is a random variable whose values are finite natural numbers with probability one, and is such that whether or not s > m depends solely on x^(m). In a sequential experiment E we observe the values x1, x2, …, xs, where s is such a stopping rule, and then stop. The restriction on the distribution of s means simply that whether or not you decide to stop cannot depend on future observations (unless you are clairvoyant), but only on the ones you have available to date.
7.2.2 Examples
1. Fixed sample size from a sequence of Bernoulli trials. Suppose that the xi are independently 1 with probability π (representing ‘success’) or 0 with probability 1 − π (representing ‘failure’). If s=n where n is a constant, then we have the usual situation which gives rise to the binomial distribution for the total number of successes.
2. Stopping after the first success in Bernoulli trials. With the xi as in the previous example, we could stop after the first success, so that
$$s=\min\{m:\ x_m=1\}.$$
Because the probability that s > n is (1 − π)^n, which tends to 0 as n → ∞, s is finite with probability 1.
3. A compromise between the first two examples. With the xi as in the previous two examples, we could stop after the first success if that occurs at or before the nth trial, but if there has not been a success by then, stop at the nth trial, so that
$$s=\min\bigl\{\,\min\{m:\ x_m=1\},\ n\,\bigr\}.$$
4. Fixed size sample from the normal distribution. If the xi are independently N(θ, 1) and s=n where n is a constant, then we have a case which has arisen often before of a sample of fixed size from a normal distribution.
5. Stopping when a fixed number of standard deviations from the mean. Still taking the xi to be independently N(θ, 1), we could have
$$s=\min\{m:\ |\overline{x}_m|\geqslant c/\sqrt{m}\}\qquad\text{where }\overline{x}_m=(x_1+x_2+\dots+x_m)/m,$$
which, as the standard deviation of x̄_m is 1/√m, means stopping as soon as we observe a value of x̄_m that is at least c standard deviations away from zero. It is not obvious in this case that s is finite with probability 1, but it follows from the law of the iterated logarithm, a proof of which can be found in any standard text on probability.
7.2.3 The stopping rule principle
The stopping rule principle is that in a sequential experiment, if the observed value of the stopping rule is m, then the evidence provided by the experiment about the value of θ should not depend on the stopping rule.
Before deciding whether it is valid, we must consider what it means. It asserts, for example, that if you observe ten Bernoulli trials, nine of which result in failure and only the last in success, then any inference about the probability π of success cannot depend on whether the experimenter had all along intended to carry out ten trials and had, in fact, observed one success, or whether he or she had intended to stop the experiment as soon as the first success was observed. Thus, it amounts to an assertion that all that matters is what actually happened and not the intentions of the experimenter if something else had happened.
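A worked check makes the connection with the likelihood principle explicit. Under the fixed-sample-size design the likelihood of one success in ten trials is
$$\binom{10}{1}\pi(1-\pi)^9\propto\pi(1-\pi)^9,$$
while under the stop-at-the-first-success design the likelihood of stopping at the tenth trial is
$$\pi(1-\pi)^9.$$
The two are proportional, so the likelihood principle already implies that the evidence about π is the same in both cases.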
Theorem 7.2 The stopping rule principle follows from the likelihood principle, and hence is a logical consequence of the Bayesian approach.
Proof. If the xi are discrete random variables, then it suffices to note that the likelihood of the observations actually made is
$$l(\theta\mid x_1,x_2,\dots,x_m)\propto p(x_1\mid\theta)\,p(x_2\mid\theta)\cdots p(x_m\mid\theta)$$
(the event that sampling stops after m observations depends only on x^(m) and not on θ, so it contributes only a constant factor), which clearly does not depend on the stopping rule. There are some slight complications in the continuous case, which are largely measure-theoretic and in particular concern events of probability zero, but a general proof from the so-called relative likelihood principle is more or less convincing; for details, see Berger and Wolpert (1988, Sections 3.4.3 and 4.2.6).
7.2.4 Discussion
The point about this is as follows. A classical statistician is supposed to choose the stopping rule before the experiment and then follow it exactly. In actual practice, the ideal is often not adhered to; an experiment can end because the data already look good enough, or because there is no more time or money, and yet the experiment is often analyzed as if it had a fixed sample size. Although stopping for some reasons would be harmless, stopping ‘when the data look good’, a practice which is sometimes described as optional (or optimal) stopping, can produce seriously misleading results when the data are then given a classical analysis.
It is often argued that a single number which is a good representation of our knowledge of a parameter should be unbiased, that is, should be such that its expectation over repeated sampling is equal to that parameter. Thus, if we have a sample of fixed size from a Bernoulli distribution [example (1), mentioned before], then E(x̄) = π, so that the observed proportion x̄ of successes is in that sense a good estimator of π. However, if the stopping rule in example (2) or that in example (3) is used, then the proportion will, on average, be more than π. If, for example, we take example (3) with n = 2, then
$$E\!\left(\frac{\text{number of successes}}{s}\right)=\pi\times 1+(1-\pi)\pi\times\tfrac12+(1-\pi)^2\times 0=\pi+\tfrac12\pi(1-\pi)>\pi$$
for 0 < π < 1.
Thus, a classical statistician who used the proportion of successes actually observed as an estimator of the probability of success would be accused of ‘making the probability of success look larger than it is’.
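This expectation is easy to confirm by simulation. The following is a minimal Monte Carlo sketch of example (3) with n = 2; the value π = 0.3 and the number of replications are arbitrary illustrative choices.

```python
import random

def observed_proportion(pi_true, n_max=2):
    """Example (3): stop at the first success, or after n_max trials at most."""
    xs = []
    while True:
        xs.append(1 if random.random() < pi_true else 0)
        # The stopping decision uses only the observations made so far.
        if xs[-1] == 1 or len(xs) == n_max:
            return sum(xs) / len(xs)   # proportion of successes when we stop

random.seed(1)
pi_true, reps = 0.3, 200_000
average = sum(observed_proportion(pi_true) for _ in range(reps)) / reps
print(average)   # roughly 0.405 = pi + pi*(1 - pi)/2, noticeably more than pi = 0.3
```

On average the observed proportion comes out near π + π(1 − π)/2 rather than π, in agreement with the calculation above.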
The stopping rule principle also plays havoc with classical significance tests. A particular case can be constructed from example (5) above with, for example, c = 2. If a classical statistician were to consider data from an N(θ, 1) population in which (unknown to him or her) θ = 0, then, because s is so constructed that the value of x̄_s is necessarily at least c standard deviations from the mean of zero, an analysis which treated the data as a sample of fixed size s would necessarily lead to a rejection of the null hypothesis that θ = 0 at the 5% level. By taking other values of c, it can be seen that a crafty classical statistician could arrange to reject a null hypothesis that was, in fact, true, at any desired significance level.
It can thus be seen that the stopping rule principle is very hard to accept from the point of view of classical statistics. It is for these reasons that Savage said that
I learned the stopping rule principle from Professor Barnard, in conversation in the summer of 1952. Frankly, I then thought it a scandal that anyone in the profession could advance an idea so patently wrong, even as today I can scarcely believe some people can resist an idea so patently right (Savage et al., 1962, p. 76).
From a Bayesian viewpoint, there is nothing to be said for unbiased estimates, while a test of a sharp null hypothesis would be carried out in quite a different way, and if (as is quite likely if in fact θ = 0) the sample size resulting in example (5) were very large, then the posterior probability that θ = 0 would remain quite large. It can thus be seen that if the stopping rule principle is accepted as plausible, and it is difficult to avoid it in view of the arguments for the likelihood principle in the last section, then Bayesian statisticians are not embarrassed in the way that classical statisticians are.
7.3 Informative stopping rules
7.3.1 An example on capture and recapture of fish
A stopping rule s is said to be informative if its distribution depends on θ in such a way that it conveys information about θ in addition to that available from the values of x1, x2, …, xs. The point of this section is to give a non-trivial example of an informative stopping rule; the example is due to Roberts (1967).
Consider a capture–recapture situation for a population of fish in a lake. The total number N of fish is unknown and is the parameter of interest (i.e. it is the θ of the problem). It is known that R of the fish have been captured, tagged and released, and we shall write S for the number of untagged fish. Because S=N–R and R is known, we can treat S as the unknown parameter instead of N, and it is convenient to do so. A random sample of n fish is then drawn (without replacement) from the lake. The sample yields r tagged fish and s=n–r untagged ones.
Assume that each fish is caught with the same unknown probability π, independently of the others. Then the stopping rule is given by the binomial distribution as
$$p(n\mid N,\pi)=\binom{N}{n}\pi^{n}(1-\pi)^{N-n},$$
so that π is a nuisance parameter such that 0 ⩽ π ⩽ 1. Note that this stopping rule is informative, because it depends on N=R+S.
Conditional on R, N, and n, the probability of catching r tagged fish out of n=r+s is given by the hypergeometric distribution
$$p(r\mid R,N,n)=\binom{R}{r}\binom{S}{s}\bigg/\binom{N}{n}.$$
Because we know r and s if and only if we know r and n, it follows that
$$p(r,s\mid R,S,\pi)=p(n\mid N,\pi)\,p(r\mid R,N,n)=\binom{R}{r}\binom{S}{s}\pi^{r+s}(1-\pi)^{R+S-r-s},$$
the factor $\binom{N}{n}$ cancelling.
7.3.2 Choice of prior and derivation of posterior
We assume that not much is known about the number of fish in the lake a priori, and we can represent this by an improper prior which is uniform over the possible values of S,
$$p(S)\propto 1\qquad(S=0,1,2,\dots).$$
On the other hand, in the process of capturing the first sample R for tagging, some knowledge will have been gained about the probability π of catching a fish. Suppose that this knowledge can be represented by a beta prior, so that π ~ Be(p′, q′), that is,
$$p(\pi)\propto \pi^{p'-1}(1-\pi)^{q'-1},$$
independently of S. It follows that
$$p(S,\pi\mid R,r,s)\propto \binom{S}{s}\,\pi^{p'+r+s-1}\,(1-\pi)^{q'+R-r-1}\,(1-\pi)^{S-s},$$
where the constant of proportionality depends on neither S nor π.
It follows that for given π the distribution of S is such that S–s has a negative binomial distribution NB(s+1, π) (see Appendix A). Summing over S from s to ∞, it can also be seen that
$$p(\pi\mid R,r,s)\propto \pi^{p'+r-2}(1-\pi)^{q'+R-r-1},$$
so that the posterior for π is Be(p′+r−1, q′+R−r).
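The summation uses the negative binomial series: writing k = S − s,
$$\sum_{S=s}^{\infty}\binom{S}{s}(1-\pi)^{S-s}=\sum_{k=0}^{\infty}\binom{s+k}{s}(1-\pi)^{k}=\pi^{-(s+1)},$$
which supplies the factor π^{−(s+1)} needed to pass from the joint posterior to the marginal posterior for π.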
To find the unconditional distribution of S, it is necessary to integrate the joint posterior for S and π over π. It can be shown without great difficulty that the result is that
$$p(S\mid R,r,s)\propto \binom{S}{s}\,B\bigl(p'+r+s,\ q'+R+S-r-s\bigr),$$
where B(α, β) = Γ(α)Γ(β)/Γ(α+β) is the usual beta function. This distribution is sometimes known as the beta-Pascal distribution, and its properties are investigated by Raiffa and Schlaifer (1961, Section 7.11). It follows from there that the posterior mean of S is
$$E(S\mid R,r,s)=s+(s+1)\,\frac{q'+R-r}{p'+r-2},$$
from which the posterior mean of N follows, since N=R+S.
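The same expression can be checked by iterated expectation using the conditional results above: since S − s is NB(s+1, π) for given π, and π has the Be(p′+r−1, q′+R−r) posterior,
$$E(S\mid R,r,s)=E_{\pi}\!\left[s+(s+1)\frac{1-\pi}{\pi}\right]=s+(s+1)\,\frac{q'+R-r}{p'+r-2},$$
using the fact that E{(1 − π)/π} = β/(α − 1) when π ~ Be(α, β).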
7.3.3 The maximum likelihood estimator
A standard classical approach would seek to estimate S, or equivalently N, by the maximum likelihood estimator, that is, by the value of N which maximizes
$$p(r\mid R,N,n)=\binom{R}{r}\binom{S}{s}\bigg/\binom{N}{n}.$$
Now it is easily shown that
$$\frac{p(r\mid R,N,n)}{p(r\mid R,N-1,n)}=\frac{S(R+S-r-s)}{(S-s)(R+S)},$$
and this ratio exceeds unity, so that the likelihood increases as a function of S, until (r+s)S=(R+S)s, after which it decreases, so that the maximum likelihood estimator of S is
$$\hat{S}=Rs/r.$$
7.3.4 Numerical example
As a numerical example, suppose that the original catch was R = 41 fish and that the second sample results in r = 8 tagged and s = 24 untagged fish. Suppose further that the knowledge gained about the probability π of catching a fish is represented by a beta prior Be(p′, q′) of the form described earlier. Then the posterior mean of S works out as 199,
and hence that of N is 41+199=240. On the other hand, the same data with a reference prior for π (i.e. Be(0, 0)) results in a posterior mean for S of
$$24+25\times\tfrac{33}{6}=161.5,$$
and hence that of N is 41+161.5=202.5.
Either of these answers is notably different from the maximum likelihood answer that a classical statistician would be likely to quote, which is
$$\hat{S}=Rs/r=41\times 24/8=123,$$
resulting in $\hat{N}=41+123=164$. The conclusion is that an informative stopping rule can have a considerable impact on the conclusions, and (though this is scarcely surprising) that prior beliefs about the nuisance parameter π make a considerable difference.
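As an arithmetical check on the reference-prior answer, the posterior mean can also be found by summing the beta-Pascal distribution directly. The following Python sketch assumes the uniform improper prior on S and the Be(0, 0) reference prior for π used above; the truncation point of the infinite sum is an arbitrary (but safely large) choice.

```python
from math import lgamma, exp

R, r, s = 41, 8, 24     # R tagged fish; r tagged and s untagged in the second sample
pp, qq = 0.0, 0.0       # Be(0, 0) reference prior for pi (illustrative assumption)

def log_beta(a, b):
    return lgamma(a) + lgamma(b) - lgamma(a + b)

def log_post(S):
    # Unnormalised beta-Pascal posterior: p(S | data) proportional to C(S, s) B(p'+r+s, q'+R+S-r-s).
    log_choose = lgamma(S + 1) - lgamma(s + 1) - lgamma(S - s + 1)
    return log_choose + log_beta(pp + r + s, qq + R + S - r - s)

support = range(s, 200_000)            # truncate the infinite sum; the omitted tail is negligible
logs = [log_post(S) for S in support]
m = max(logs)
weights = [exp(l - m) for l in logs]   # subtract the maximum before exponentiating, for stability
post_mean_S = sum(S * w for S, w in zip(support, weights)) / sum(weights)

print(round(post_mean_S, 1))           # 161.5, so the posterior mean of N is about 202.5
print(R * s / r)                       # the classical maximum likelihood estimate of S: 123.0
```

The direct summation reproduces the value 161.5 obtained from the closed-form posterior mean, and the last line gives the maximum likelihood estimate for comparison.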
7.4 The likelihood principle and reference priors
7.4.1 The case of Bernoulli trials and its general implications