by Peter M Lee
2.4 Dominant likelihoods
2.4.1 Improper priors
We recall from the previous section that, when we have several normal observations with a normal prior and the variances are known, the posterior for the mean is

$$ \theta \mid \boldsymbol{x} \sim \mathrm{N}(\theta_1, \phi_1), $$

where $\theta_1$ and $\phi_1$ are given by the appropriate formulae, and that this approaches the standardized likelihood

$$ \theta \mid \boldsymbol{x} \sim \mathrm{N}(\bar{x}, \phi/n) $$

insofar as $\phi_0$ is large compared with $\phi/n$, although this result is only approximate unless $\phi_0$ is infinite. However, this would mean a prior density which, whatever $\theta_0$ were, would have to be uniform over the whole real line, and clearly could not be represented by any proper density function. It is basic to the concept of a probability density that it integrates to 1, so, for example,

$$ p(\theta) = \kappa \qquad (-\infty < \theta < \infty) $$

cannot possibly represent a probability density whatever $\kappa$ is, and in particular $\mathrm{N}(\theta_0, \infty)$, which results from substituting $\phi_0 = \infty$ into the normal density, cannot be a density. Nevertheless, we shall sometimes find it useful to extend the concept of a probability density to some cases like this where

$$ \int p(\theta)\,\mathrm{d}\theta = \infty, $$

which we shall call improper ‘densities’. The density $p(\theta) \propto 1$ can then be regarded as representing a normal density of infinite variance. Another example of an improper density we will have use for later on is

$$ p(\phi) \propto 1/\phi \qquad (0 < \phi < \infty). $$
It turns out that sometimes, when we take an improper prior density, it can combine with an ordinary likelihood to give a posterior which is proper. Thus, if we use the uniform distribution $p(\theta) \propto 1$ on the whole real line as a prior for the mean $\theta$ of a normal distribution, it is easy to see that it combines with a normal likelihood to give the standardized likelihood as posterior; it follows that the dominant feature of the posterior is the likelihood. The best way to think of an improper density is as an approximation which is valid for some large range of values, but is not to be regarded as truly valid throughout its range. In the case of a physical constant which you are about to measure, you may be very unclear what its value is likely to be, which would suggest the use of a prior that was uniform or nearly so over a large range, but it seems unlikely that you would regard values in the region of, say, $10^{100}$ as being as likely as, say, values in the region of $10^{-100}$. But if you have a prior which is approximately uniform over some (possibly very long) interval and is never very large outside it, then the posterior is close to the standardized likelihood, and so to the posterior which would have resulted from taking an improper prior uniform over the whole real line. [It is possible to formalize the notion of an improper density as part of probability theory – for details, see Rényi (1970).]
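To see the first claim explicitly in the normal case: with the $n$-sample likelihood of the previous section, the improper uniform prior gives

$$ p(\theta\mid\boldsymbol{x}) \propto p(\theta)\,l(\theta\mid\boldsymbol{x}) \propto \exp\{-n(\theta-\bar x)^2/2\phi\}, $$

which integrates to a finite quantity and, on normalizing, is exactly the $\mathrm{N}(\bar x, \phi/n)$ standardized likelihood.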
2.4.2 Approximation of proper priors by improper priors
This result can be made more precise. The following theorem is proved by Lindley (1965, Section 5.2); the proof is omitted.
Theorem 2.1 A random sample of size n is taken from $\mathrm{N}(\theta, \phi)$, where $\phi$ is known. Suppose that there exist positive constants α, ε, M and c depending on x (small values of α and ε are of interest), such that in the interval $I_\alpha$ defined by

$$ \bar x - \lambda_\alpha\sqrt{\phi/n} \;\leqslant\; \theta \;\leqslant\; \bar x + \lambda_\alpha\sqrt{\phi/n}, $$

where

$$ \int_{-\lambda_\alpha}^{\lambda_\alpha} (2\pi)^{-1/2}\exp(-t^2/2)\,\mathrm{d}t = 1 - \alpha, $$

the prior density of θ lies between $c(1-\varepsilon)$ and $c(1+\varepsilon)$, and outside it is bounded by Mc. Then the posterior density satisfies

$$ \frac{(1-\varepsilon)}{(1+\varepsilon)(1-\alpha)+M\alpha}\,(2\pi\phi/n)^{-1/2}\exp\{-n(\theta-\bar x)^2/2\phi\} \;\leqslant\; p(\theta\mid\boldsymbol{x}) \;\leqslant\; \frac{(1+\varepsilon)}{(1-\varepsilon)(1-\alpha)}\,(2\pi\phi/n)^{-1/2}\exp\{-n(\theta-\bar x)^2/2\phi\} $$

inside $I_\alpha$, and

$$ 0 \;\leqslant\; p(\theta\mid\boldsymbol{x}) \;\leqslant\; \frac{M}{(1-\varepsilon)(1-\alpha)}\,(2\pi\phi/n)^{-1/2}\exp\{-n(\theta-\bar x)^2/2\phi\} $$

outside it.
While we are not going to prove the theorem, it might be worthwhile to give some idea of the sorts of bounds which it implies. Anyone who has worked with the normal distribution is likely to remember that the 1% point is 2.58, that is, that $\lambda_{0.01} = 2.58$ in the notation above, so that $I_{0.01}$ extends 2.58 standard deviations [that is, $2.58\sqrt{\phi/n}$] on each side of the sample mean $\bar x$. Suppose then that before you had obtained any data you believed all values in some interval to be equally likely, and that there were no values that you believed to be more than three times as probable as the values in this interval. If it then turns out, when you get the data, that the range $I_{0.01}$ lies entirely in this interval, then you can apply the theorem with $\alpha = 0.01$, $\varepsilon = 0$ and $M = 3$ to deduce that within $I_{0.01}$ the true posterior density lies within multiples 0.98 and 1.01 of the $\mathrm{N}(\bar x, \phi/n)$ density. We can regard this theorem as demonstrating how robust the posterior is to changes in the prior. Similar results hold for distributions other than the normal.
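As a quick numerical check of these multiples (a minimal sketch; the function and variable names are ours, not the book’s):

```python
def posterior_bounds(alpha, eps, M):
    """Multipliers of the N(x_bar, phi/n) density that bracket the true
    posterior inside I_alpha, per Theorem 2.1."""
    lower = (1 - eps) / ((1 + eps) * (1 - alpha) + M * alpha)
    upper = (1 + eps) / ((1 - eps) * (1 - alpha))
    return lower, upper

# The worked example in the text: alpha = 0.01, eps = 0, M = 3
print(posterior_bounds(alpha=0.01, eps=0.0, M=3.0))  # about (0.98, 1.01)
```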
It is often sensible to analyze scientific data on the assumption that the likelihood dominates the prior. There are several reasons for this, of which two important ones are as follows. Firstly, even if you and I both have strong prior beliefs about the value of some unknown quantity, we might not agree, and it seems sensible to use a neutral reference prior which is dominated by the likelihood and could be said to represent the views of someone who (unlike ourselves) had no strong beliefs a priori. The difficulties of public discourse in a world where different individuals have different prior beliefs constitute one reason why a few people have argued that, in the absence of agreed prior information, we should simply quote the likelihood function [see Edwards, 1992], but there are considerable difficulties in the way of this (see also Section 7.1 on ‘The Likelihood Principle’). Secondly, in many scientific contexts, we would not bother to carry out an experiment unless we thought it was going to increase our knowledge significantly, and if that is the case then the likelihood will presumably dominate the prior.
2.5 Locally uniform priors
2.5.1 Bayes’ postulate
We have already said that it seems useful to have a reference prior to aid public discourse in situations where prior opinions differ or are not strong. A prior which does not change very much over the region in which the likelihood is appreciable and does not take very large values outside that region is said to be locally uniform. For such a prior,

$$ p(\theta\mid x) \propto p(\theta)\,p(x\mid\theta) \propto l(\theta\mid x), $$

so that on normalizing the posterior must equal the standardized likelihood.
Bayes himself appears to have thought that, at least in the case where θ is an unknown probability between 0 and 1, the situation where we ‘know nothing’ should be represented by taking a uniform prior and this is sometimes known as Bayes’ postulate (as distinct from his theorem).
However, it should be noted that if, for example,

$$ p(\theta) = 1 \qquad (0 \leqslant \theta \leqslant 1), $$

then on writing

$$ \psi = 1/\theta $$

we have, according to the usual change-of-variable rule,

$$ p(\psi) = p(\theta)\,|\mathrm{d}\theta/\mathrm{d}\psi| = \psi^{-2}, $$

or

$$ p(\psi) = \psi^{-2} \qquad (1 \leqslant \psi < \infty) $$

(as a check, this density does integrate to unity). Now it has been argued that if we ‘know nothing’ about θ then we equally ‘know nothing’ about ψ, which should surely be represented by the improper prior

$$ p(\psi) \propto 1 \qquad (1 \leqslant \psi < \infty) $$

[although one can also argue for a prior proportional to $\psi^{-1}$ or to $\psi^{-2}$], so that the idea that a uniform prior can be used to represent ignorance is not self-consistent. It cannot be denied that this is a serious objection, but it is perhaps not quite as serious as it seems at first sight. With most transformations, the density of the transformed variable will not change very fast over a reasonably short interval. For example, while $\psi^{-2}$ changes quite considerably over long intervals of ψ, it is sufficiently close to constancy over any moderately short interval that a posterior based on a uniform prior is unlikely to differ greatly from one based on the prior with density $\psi^{-2}$, provided that the amount of data available is not very small. This argument would not necessarily work if you were to consider a very extreme transformation, for example $\psi = \exp(\exp(\exp\theta))$, but it could be argued that the mere fact that such an extreme transformation even crossed your mind would suggest that you had really got some prior information which made it sensible, and you should accordingly make use of your prior information.
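An illustrative Monte Carlo check of this change of variable (a throwaway sketch; the sample size and the interval (2, 5) are arbitrary choices of ours):

```python
import numpy as np

rng = np.random.default_rng(0)

# If theta is uniform on (0, 1), then psi = 1/theta has density psi**(-2) on (1, inf)
theta = rng.uniform(0.0, 1.0, size=1_000_000)
psi = 1.0 / theta

a, b = 2.0, 5.0
empirical = np.mean((psi >= a) & (psi <= b))
exact = 1.0 / a - 1.0 / b  # integral of psi**(-2) from a to b
print(empirical, exact)  # both close to 0.3
```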
2.5.2 Data translated likelihoods
Even though it may not make a great deal of difference within broad limits what we treat as our reference prior, provided that it is reasonably flat, there is still a natural urge to look for the ‘right’ scale of measurement in which to have a uniform prior, from which the prior in any other scale of measurement can be deduced. One answer to this is to look for a scale of measurement in which the likelihood is data translated. The likelihood is said to be in such a form if

$$ l(\theta\mid x) = g(\theta - t(x)) $$

for some function t (which we will later note is a sufficient statistic). In looking to see whether the likelihood can be expressed in this way, you should bear in mind that the definition of the likelihood function allows you to multiply it by any function of the data x alone.
For example, if we have an n-sample from a normal distribution of unknown mean θ and known variance φ, we know that

$$ l(\theta\mid\boldsymbol{x}) \propto \exp\{-n(\theta-\bar x)^2/2\phi\}, $$

which is clearly of this form, with $t(\boldsymbol{x}) = \bar x$. On the other hand, if k has a binomial distribution of index n and parameter π, so that

$$ l(\pi\mid k) \propto \pi^k(1-\pi)^{n-k}, $$

then the likelihood cannot be put into the form $g(\pi - t(k))$.
If the likelihood is in data translated form, then different values of the data will give rise to the same functional form for the likelihood, except for a shift in location. Thus, in the case of the normal mean, if we consider two experiments, one of which results in a value of $\bar x$ which is, say, 5 larger than the other, then we get the same likelihood function in both cases except that corresponding values of θ differ by 5. This would seem to suggest that the main function of the data is to determine the location of the likelihood. Now, if a uniform prior is taken for θ, the posteriors are also the same except that corresponding values differ by 5, so that the inferences made do seem simply to represent a determination of location. It is because of this that it seems sensible to adopt a uniform prior when the likelihood is data translated.
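To make the shift concrete, here is a minimal sketch (the values of φ, n and the two sample means are invented for illustration):

```python
import numpy as np

phi, n = 4.0, 10  # known variance and sample size (illustrative values)

def likelihood(theta, x_bar):
    """Normal-mean likelihood up to a constant: a function of theta - x_bar."""
    return np.exp(-n * (theta - x_bar) ** 2 / (2 * phi))

theta = np.linspace(-10.0, 20.0, 301)
# Two experiments whose sample means differ by 5 give the same likelihood
# curve shifted by 5: l(theta | x_bar = 7) equals l(theta - 5 | x_bar = 2).
print(np.allclose(likelihood(theta, 7.0), likelihood(theta - 5.0, 2.0)))  # True
```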
2.5.3 Transformation of unknown parameters
The next question is what we should do when it is not. Sometimes it turns out that there is a function

$$ \psi = \psi(\theta) $$

which is such that we can write

$$ l(\theta\mid x) = g(\psi(\theta) - t(x)), $$

in which case the obvious thing to do is to take a prior uniform in ψ rather than in θ, implying a prior for the parameter θ given by the usual change-of-variable rule. If, for example, x has an exponential distribution, that is, $x \sim \mathrm{E}(\theta)$ with $p(x\mid\theta) = \theta^{-1}\exp(-x/\theta)$ (see under the Gamma distribution in Appendix A), then

$$ l(\theta\mid x) \propto \theta^{-1}\exp(-x/\theta), $$

so that (after multiplying by x as we are entitled to) we may write

$$ l(\theta\mid x) \propto (x/\theta)\exp(-x/\theta) = \exp\{(\log x - \log\theta) - \exp(\log x - \log\theta)\}. $$

This is in the above form with

$$ \psi(\theta) = \log\theta, \qquad t(x) = \log x, \qquad g(y) = \exp\{-y - \exp(-y)\}. $$
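A small numerical check that, after multiplying by x, the exponential likelihood depends on θ and x only through $\log x - \log\theta$ (a sketch under the mean parametrization used above; the numbers are arbitrary):

```python
import numpy as np

def adjusted_exp_likelihood(theta, x):
    """Exponential likelihood times x: (x/theta) * exp(-x/theta), a function
    of log(x) - log(theta) alone."""
    return (x / theta) * np.exp(-x / theta)

# Scaling x and theta together leaves log(x) - log(theta) unchanged:
print(adjusted_exp_likelihood(2.0, 3.0), adjusted_exp_likelihood(20.0, 30.0))
# the two printed values are equal
```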
Unfortunately, it is often difficult to see how to express a likelihood function in this form even when it is possible, and it is not always possible. We shall find another case of this when we come to investigate the normal variance in Section 2.7, and a further one when we try to find a reference prior for the uniform distribution in Section 3.6. Sometimes there is a function such that the likelihood is approximately of this form, for example when we have a binomial distribution of known index and unknown parameter (this case will be considered when the reference prior for the binomial parameter is discussed in Section 3.2).
For the moment, we can reflect that this argument strengthens the case for using a uniform (improper) prior for the mean of a normal distribution. One way of thinking of the uniform distribution is as a normal distribution of infinite variance or, equivalently, zero precision. The equations

$$ \phi_1 = (1/\phi_0 + n/\phi)^{-1}, \qquad \theta_1 = \phi_1(\theta_0/\phi_0 + n\bar x/\phi) $$

for the case of a normal mean with known variance then become

$$ \phi_1 = \phi/n, \qquad \theta_1 = \bar x, $$

which accords with the result

$$ p(\theta\mid\boldsymbol{x}) \propto l(\theta\mid\boldsymbol{x}) $$
for a locally uniform prior. An interesting defence of the notion of a uniform prior can be found in Savage et al. (1962, p. 20).
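As a minimal numerical sketch of this limit (the function and variable names, and the numbers, are ours):

```python
def normal_posterior(theta0, phi0, x_bar, phi, n):
    """Posterior mean and variance for a normal mean: prior N(theta0, phi0),
    known data variance phi, and sample mean x_bar of n observations."""
    phi1 = 1.0 / (1.0 / phi0 + n / phi)
    theta1 = phi1 * (theta0 / phi0 + n * x_bar / phi)
    return theta1, phi1

# As the prior variance phi0 grows, the posterior tends to N(x_bar, phi/n):
for phi0 in (1.0, 100.0, 1e8):
    print(normal_posterior(theta0=0.0, phi0=phi0, x_bar=5.0, phi=4.0, n=10))
```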
2.6 Highest density regions
2.6.1 Need for summaries of posterior information
In the case of our example on the Ennerdale granophyre, all the information available after the experiment is contained in the posterior distribution. One of the best ways of conveying this information would be to sketch the posterior density (though this procedure is more difficult in cases where we have several parameters to estimate, so that θ is multi-dimensional). It is less trouble to the statistician to say simply that

$$ \theta \mid x \sim \mathrm{N}(413, 51), $$

although those without experience may need tables to appreciate what this assertion means.
Sometimes the probability that the parameter lies in a particular interval may be of interest. Thus, there might be geological reasons why, in the above example, we wanted to know the chance that the rocks were less than 400 million years old. If this is the case, the probability required is easily found by use of tables of the normal distribution. More commonly, there are no limits of any special interest, but it seems reasonable to specify an interval in which ‘most of the distribution’ lies. It would appear sensible to look for an interval which is such that the density at any point inside it is greater than the density at any point outside it, and it would also appear sensible to seek (for a given probability level) an interval that is as short as possible (in several dimensions, this means that it should occupy as small a volume as possible). Fortunately, it is clear that these conditions are equivalent. In most common cases, there is one such interval for each probability level.
We shall refer to such an interval as a highest (posterior) density region or an HDR. Although this terminology is used by several authors, there are other terms in use, for example, Bayesian confidence interval (cf. Lindley, 1965, Section 5.2) and credible interval (cf. Edwards et al., 1963, Section 5). In the particular example referred to above, we can use the well-known fact that 95% of the area of a normal distribution lies within $\pm 1.96$ standard deviations of the mean to say that $\theta = 413 \pm 1.96\sqrt{51}$, that is, that (399, 427) is a 95% HDR for the age θ given the data.
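Since a normal posterior is symmetric and unimodal, its 95% HDR is simply the mean plus or minus 1.96 standard deviations, so the computation is a one-liner (a sketch using SciPy, with the numbers of this example):

```python
from scipy.stats import norm

mean, var = 413.0, 51.0  # posterior for the age, in millions of years
lo, hi = norm.interval(0.95, loc=mean, scale=var ** 0.5)
print(round(lo), round(hi))  # 399 427
```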
2.6.2 Relation to classical statistics
The traditional approach, sometimes called classical statistics or sampling theory statistics, would lead to similar conclusions in this case. From either standpoint

$$ \theta - x \sim \mathrm{N}(0, 51), $$

and in either case the interval (399, 427) is used at a 95% level. However, in the classical approach, it is x that is regarded as random, giving rise to a random interval which has a probability 0.95 of containing the fixed (but unknown) value θ. By contrast, the Bayesian approach regards θ as random in the sense that we have certain beliefs about its value, and thinks of the interval as fixed once the datum x is available. Perhaps the tilde notation for random variables helps. With this, the classical approach amounts to saying that

$$ \theta \in (\tilde x - 14,\ \tilde x + 14) $$

with probability 0.95, while the Bayesian approach amounts to saying that

$$ \tilde\theta \in (x - 14,\ x + 14) $$

with probability 0.95.
Although there is a simple relationship between the conclusions that classical and Bayesian statisticians would arrive at in this case, there will be cases later on in which there is no great similarity between the conclusions arrived at.
2.7 Normal variance
2.7.1 A suitable prior for the normal variance
Suppose that we have an n-sample $\boldsymbol{x} = (x_1, \dots, x_n)$ from $\mathrm{N}(\mu, \phi)$, where the variance φ is unknown but the mean μ is known. Then clearly

$$ p(\boldsymbol{x}\mid\phi) \propto \phi^{-n/2}\exp\Bigl\{-\sum(x_i-\mu)^2/2\phi\Bigr\}. $$

On writing

$$ S = \sum(x_i - \mu)^2 $$

(remember that μ is known; we shall use a slightly different notation when it is not), we see that

$$ l(\phi\mid\boldsymbol{x}) \propto \phi^{-n/2}\exp(-S/2\phi). $$

In principle, we might have any form of prior distribution for the variance φ. However, if we are to be able to deal easily with the posterior distribution (and, e.g. to be able to find HDRs easily from tables), it helps if the posterior distribution is of a ‘nice’ form. This will certainly happen if the prior is of a similar form to the likelihood, namely,

$$ p(\phi) \propto \phi^{-\kappa}\exp(-S_0/2\phi), $$

where κ and $S_0$ are suitable constants. For reasons which will emerge, it is convenient to write $\kappa = \nu/2 + 1$, so that

$$ p(\phi) \propto \phi^{-\nu/2-1}\exp(-S_0/2\phi), $$

leading to the posterior distribution

$$ p(\phi\mid\boldsymbol{x}) \propto \phi^{-(\nu+n)/2-1}\exp\{-(S_0+S)/2\phi\}. $$
Although it is unlikely that our true prior beliefs are exactly represented by such a density, it is quite often the case that they can be reasonably well approximated by something of this form, and when this is so the calculations become notably simpler.
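In this conjugate setup, updating amounts to adding the sum of squares about μ to $S_0$ and the sample size to ν. A minimal sketch (the function name and data are invented):

```python
import numpy as np

def update_inverse_chisq(nu0, S0, x, mu):
    """Prior S0 * inv-chi^2 on nu0 d.f. -> posterior (S0 + S) on nu0 + n d.f.,
    where S is the sum of squared deviations of the data about the known mean."""
    x = np.asarray(x, dtype=float)
    S = float(np.sum((x - mu) ** 2))
    return nu0 + x.size, S0 + S

print(update_inverse_chisq(nu0=4, S0=10.0, x=[4.2, 5.1, 3.8, 4.9], mu=4.5))
# roughly (8, 11.1)
```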
This distribution is, in fact, closely related to one of the best known continuous distributions in statistics (after the normal distribution), namely, the chi-squared distribution or $\chi^2$ distribution. This is seen more clearly if we work in terms of the precision $\lambda = 1/\phi$ instead of in terms of the variance φ, since, using the change of variable rule introduced in Section 1.3 on ‘Random Variables’,

$$ p(\lambda\mid\boldsymbol{x}) = p(\phi\mid\boldsymbol{x})\,\bigl|\mathrm{d}\phi/\mathrm{d}\lambda\bigr| \propto \lambda^{(\nu+n)/2+1}\exp\{-(S_0+S)\lambda/2\}\cdot\lambda^{-2} = \lambda^{(\nu+n)/2-1}\exp\{-(S_0+S)\lambda/2\}. $$
Figure 2.1 Examples of inverse chi-squared densities.
This is now very close to the form of the chi-squared distribution (see Appendix A or indeed any elementary textbook on statistics). It is, in fact, a very simple further step (left as an exercise!) to check that (for given x)

$$ (S_0 + S)/\phi \sim \chi^2_{\nu+n}, $$

that is, $(S_0 + S)/\phi$ has a $\chi^2$ distribution on ν + n degrees of freedom. [The term ‘degrees of freedom’ is hallowed by tradition, but is just a name for a parameter.] We usually indicate this by writing

$$ \phi \sim (S_0 + S)\chi^{-2}_{\nu+n} $$

and saying that φ has (a multiple of) an inverse chi-squared distribution.
Clearly the same argument can be applied to the prior distribution, so our prior assumption is that

$$ \phi \sim S_0 \chi^{-2}_\nu. $$
It may be that you cannot find suitable values of the parameters ν and $S_0$ such that a distribution of this type represents your prior beliefs, but clearly if values can be chosen so that they are reasonably well approximated, it is convenient. Usually, the approximation need not be too close since, after all, the chances are that the likelihood will dominate the prior. In fitting a plausible prior, one possible approach is to consider the mean and variance of your prior distribution and then choose ν and $S_0$ so that

$$ \mathrm{E}\phi = \frac{S_0}{\nu - 2}, \qquad \mathcal{V}\phi = \frac{2S_0^2}{(\nu - 2)^2(\nu - 4)} $$

match them.
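Inverting these two relations gives $\nu = 4 + 2(\mathrm{E}\phi)^2/\mathcal{V}\phi$ and then $S_0 = (\nu - 2)\,\mathrm{E}\phi$; a throwaway sketch (the names and numbers are ours):

```python
def fit_inverse_chisq(prior_mean, prior_var):
    """Choose nu and S0 so that S0 * inv-chi^2_nu has the stated mean and
    variance, using E(phi) = S0/(nu - 2), V(phi) = 2*S0**2/((nu-2)**2*(nu-4))."""
    nu = 4.0 + 2.0 * prior_mean ** 2 / prior_var
    S0 = prior_mean * (nu - 2.0)
    return nu, S0

print(fit_inverse_chisq(prior_mean=2.0, prior_var=1.0))  # (12.0, 20.0)
```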
Inverse chi-squared distributions (and variables which have such a distribution apart from a constant multiple) often occur in Bayesian statistics, although the inverse chi-squared (as opposed to the chi-squared) distribution rarely occurs in classical statistics. Some of its properties are described in Appendix A, and its density for typical values of ν (and $S_0 = 1$) is illustrated in Figure 2.1.
2.7.2 Reference prior for the normal variance
The next thing to investigate is whether there is something which we can regard as a reference prior, by finding a scale of measurement in which the likelihood is data translated. For this purpose (and others), it is convenient to define the sample standard deviation s by

$$ s^2 = S/n = \sum(x_i - \mu)^2/n $$

(again, we shall use a different definition when μ is unknown). Then, multiplying by $s^n$ (a function of the data alone, as we are entitled to do),

$$ l(\phi\mid\boldsymbol{x}) \propto \phi^{-n/2}\exp(-ns^2/2\phi) \propto \exp\bigl[-\tfrac{1}{2}n(\log\phi - \log s^2) - \tfrac{1}{2}n\exp\{-(\log\phi - \log s^2)\}\bigr]. $$

This is of data translated form $g(\psi(\phi) - t(\boldsymbol{x}))$ with $\psi(\phi) = \log\phi$, $t(\boldsymbol{x}) = \log s^2$ and $g(y) = \exp\{-\tfrac{1}{2}ny - \tfrac{1}{2}n\exp(-y)\}$, which suggests taking a prior uniform in $\log\phi$, that is, the improper prior $p(\phi) \propto 1/\phi$ mentioned earlier.
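A quick numerical check that the adjusted likelihood depends on φ and the data only through $\log\phi - \log s^2$ (a sketch; the numbers are arbitrary):

```python
import numpy as np

def adjusted_likelihood(phi, s2, n):
    """Normal-variance likelihood multiplied by s**n: a function of
    log(phi) - log(s2) alone."""
    return (s2 / phi) ** (n / 2) * np.exp(-n * s2 / (2 * phi))

# Scaling s2 and phi by the same factor leaves log(phi) - log(s2) unchanged,
# so the adjusted likelihood is unchanged: data translated in log(phi).
print(adjusted_likelihood(2.0, 1.5, n=10), adjusted_likelihood(4.0, 3.0, n=10))
# the two printed values are equal
```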