by Peter M Lee
2.1.1 Preliminary remarks
In this section, a general framework for Bayesian statistical inference will be provided. In broad outline, we take prior beliefs about various possible hypotheses and then modify these prior beliefs in the light of relevant data which we have collected in order to arrive at posterior beliefs. (The reader may prefer to return to this section after reading Section 2.2, which deals with one of the simplest special cases of Bayesian inference.)
2.1.2 Post is prior times likelihood
Almost all of the situations we will think of in this book fit into the following pattern. Suppose that you are interested in the values of k unknown quantities

θ = (θ₁, θ₂, …, θₖ)

(where k can be one or more than one) and that you have some a priori beliefs about their values which you can express in terms of the pdf

p(θ).

Now suppose that you then obtain some data relevant to their values. More precisely, suppose that we have n observations

X = (x₁, x₂, …, xₙ)

which have a probability distribution that depends on these k unknown quantities as parameters, so that the pdf (continuous or discrete) of the vector X depends on the vector θ in a known way. Usually the components of θ and X will be integers or real numbers, so that the components of X are random variables, and the dependence of X on θ can be expressed in terms of a pdf

p(X | θ).
You then want to find a way of expressing your beliefs about θ taking into account both your prior beliefs and the data. Of course, it is possible that your prior beliefs about θ may differ from mine, but very often we will agree on the way in which the data are related to θ [i.e. on the form of p(X | θ)]. If this is so, we will differ in our posterior beliefs (i.e. in our beliefs after we have obtained the data), but it will turn out that if we can collect enough data, then our posterior beliefs will usually become very close.
The basic tool we need is Bayes’ Theorem for random variables (generalized to deal with random vectors). From this theorem, we know that

p(θ | X) ∝ p(θ) p(X | θ).

Now we know that p(X | θ) considered as a function of X for fixed θ is a density, but we will find that we often want to think of it as a function of θ for fixed X. When we think of it in that way it does not have quite the same properties – for example, there is no reason why it should sum (or integrate) to unity. Thus, in the extreme case where p(X | θ) turns out not to depend on θ, it can quite well sum (or integrate) to ∞. When we are thinking of p(X | θ) as a function of θ we call it the likelihood function. We sometimes write

l(θ | X) = p(X | θ).
Just as we sometimes write p_x̃(x) rather than p(x) to avoid ambiguity, if we really need to avoid ambiguity we write

l_θ̃|x̃(θ | X),

but this will not usually be necessary. Sometimes it is more natural to consider the log-likelihood function

L(θ | X) = log l(θ | X).
With this definition and the definition of p(θ) as the prior pdf for θ and of p(θ | X) as the posterior pdf for θ given X, we may think of Bayes’ Theorem in the more memorable form

posterior ∝ prior × likelihood.
This relationship summarizes the way in which we should modify our beliefs in order to take into account the data we have available.
2.1.3 Likelihood can be multiplied by any constant
Note that because of the way we write Bayes’ Theorem with a proportionality sign, it does not alter the result if we multiply the likelihood by any constant or indeed more generally by anything which is a function of X alone. Accordingly, we can regard the definition of the likelihood as being any constant multiple of p(X | θ) rather than necessarily equalling p(X | θ) (and similarly the log-likelihood is undetermined up to an additive constant). Sometimes the integral

∫ l(θ | X) dθ

(interpreted as a multiple integral if k > 1 and as a summation or multiple summation in the discrete case), taken over the admissible range of θ, is finite, although we have already noted that this is not always the case. When it is, it is occasionally convenient to refer to the quantity

l(θ | X) / ∫ l(θ | X) dθ.
We shall call this the standardized likelihood, that is, the likelihood scaled so that it integrates to unity and can thus be thought of as a density.
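As a concrete illustration of the relation ‘posterior is proportional to prior times likelihood’ and of the standardized likelihood, here is a minimal numerical sketch; the grid of parameter values, the flat prior and the binomial likelihood are illustrative assumptions of mine rather than anything taken from the text.

```python
# A minimal sketch of Bayesian updating on a discrete grid of parameter values.
import numpy as np
from scipy.stats import binom

theta = np.linspace(0.01, 0.99, 99)        # grid of candidate values of theta
prior = np.ones_like(theta) / len(theta)   # a flat prior over the grid

# Suppose we observe 7 successes in 10 trials, so p(X | theta) is binomial.
likelihood = binom.pmf(7, 10, theta)

# Posterior is proportional to prior times likelihood; any constant multiple of
# the likelihood gives the same posterior after renormalization.
unnormalized = prior * likelihood
posterior = unnormalized / unnormalized.sum()

# The standardized likelihood: the likelihood rescaled to sum to one over theta,
# so that it can be treated as a density for theta.
standardized_likelihood = likelihood / likelihood.sum()
```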
2.1.4 Sequential use of Bayes’ Theorem
It should also be noted that the method can be applied sequentially. Thus, if you have an initial sample of observations X, you have

p(θ | X) ∝ p(θ) p(X | θ).

Now suppose that you have a second set of observations Y distributed independently of the first sample. Then

p(θ | X, Y) ∝ p(θ) p(X, Y | θ).

But independence implies

p(X, Y | θ) = p(X | θ) p(Y | θ),

from which it is obvious that

p(θ | X, Y) ∝ p(θ) p(X | θ) p(Y | θ)

and hence

p(θ | X, Y) ∝ p(θ | X) p(Y | θ).

So we can find your posterior for θ given X and Y by treating your posterior given X as the prior for the observation Y. This formula will work irrespective of the temporal order in which X and Y are observed, and this fact is one of the advantages of the Bayesian approach.
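The sequential property is easy to check numerically. The sketch below (an illustrative grid, flat prior and binomial samples of my own choosing, not the book’s) verifies that updating on a first sample and then on a second, independent one gives the same posterior as updating on both at once.

```python
# Sequential versus all-at-once Bayesian updating on a grid.
import numpy as np
from scipy.stats import binom

theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)

def update(prior, successes, trials):
    """One application of Bayes' Theorem on the grid: posterior is prior times likelihood."""
    post = prior * binom.pmf(successes, trials, theta)
    return post / post.sum()

# Sequential: first sample X (3 successes in 5 trials), then sample Y (6 in 10).
post_sequential = update(update(prior, 3, 5), 6, 10)

# All at once: 9 successes in 15 trials.
post_batch = update(prior, 9, 15)

print(np.allclose(post_sequential, post_batch))   # True
```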
2.1.5 The predictive distribution
Occasionally (e.g. when we come to consider Bayesian decision theory and empirical Bayes methods), we need to consider the marginal distribution

p(X) = ∫ p(X | θ) p(θ) dθ,
which is called the predictive distribution of X, since it represents our current predictions of the value of X taking into account both the uncertainty about the value of θ and the residual uncertainty about X when θ is known.
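On a discrete grid the predictive distribution is simply the prior-weighted average of the sampling distributions; a short sketch (with the same illustrative setup as the earlier sketches):

```python
# p(X) = sum over theta of p(X | theta) p(theta); an integral in the continuous case.
import numpy as np
from scipy.stats import binom

theta = np.linspace(0.01, 0.99, 99)
prior = np.ones_like(theta) / len(theta)

# Predictive probability of seeing 7 successes in 10 trials, before any data are observed.
predictive_7 = float(np.sum(binom.pmf(7, 10, theta) * prior))
```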
One valuable use of the predictive distribution is in checking your underlying assumptions. If, for example, p(X) turns out to be small (in some sense) for the observed value of X, it might suggest that the form of the likelihood you have adopted was suspect. Some people have suggested that another thing you might re-examine in such a case is the prior distribution you have adopted, although there are logical difficulties about this if p(θ) just represents your prior beliefs. It might, however, be the case that seeing an observation the possibility of which you had rather neglected causes you to think more fully and thus bring out beliefs which were previously lurking below the surface.
There are actually two cases in which we might wish to consider a distribution for X that takes into account both the uncertainty about the value of θ and the residual uncertainty about X when θ is known: the distribution for θ under consideration may or may not itself take into account some current observations. Some authors reserve the term ‘predictive distribution’ for the former case and use the term preposterior distribution when we do not yet have any observations to take into account. In this book, the term ‘predictive distribution’ is used in both cases.
2.1.6 A warning
The theory described earlier relies on the possibility of specifying the likelihood as a function, or equivalently on being able to specify the density p(X|θ) of the observations X save for the fact that the k parameters are unknown. It should be borne in mind that these assumptions about the form of the likelihood may be unjustified, and a blind following of the procedure described earlier can never lead to their being challenged (although the point made earlier in connection with the predictive distribution can be of help). It is all too easy to adopt a model because of its convenience and to neglect the absence of evidence for it.
2.2 Normal prior and likelihood
2.2.1 Posterior from a normal prior and likelihood
We say that x is normal of mean θ and variance φ, and write

x ~ N(θ, φ),

when

p(x) = (2πφ)^(−1/2) exp{−½(x − θ)²/φ}.
Suppose that you have an unknown parameter θ for which your prior beliefs can be expressed in terms of a normal distribution, so that

θ ~ N(θ₀, φ₀),

and suppose also that you have an observation x which is normally distributed with mean equal to the parameter of interest, that is

x ~ N(θ, φ),

where θ₀, φ₀ and φ are known. As mentioned in Section 1.3, there are often grounds for suspecting that an observation might be normally distributed, usually related to the Central Limit Theorem, so this assumption is not implausible. If these assumptions are valid, then

p(θ | x) ∝ p(θ) p(x | θ)

and hence

p(θ | x) ∝ exp{−½(θ − θ₀)²/φ₀} exp{−½(x − θ)²/φ}
        ∝ exp{−½θ²(1/φ₀ + 1/φ) + θ(θ₀/φ₀ + x/φ)},

regarding this as a function of θ.
It is now convenient to write

φ₁ = (1/φ₀ + 1/φ)⁻¹   and   θ₁ = φ₁(θ₀/φ₀ + x/φ),

so that

1/φ₀ + 1/φ = 1/φ₁   and   θ₀/φ₀ + x/φ = θ₁/φ₁,

and hence

p(θ | x) ∝ exp{−½(θ² − 2θθ₁)/φ₁}.

Adding into the exponent the term

−½θ₁²/φ₁,

which is constant as far as θ is concerned, we see that

p(θ | x) ∝ exp{−½(θ − θ₁)²/φ₁},

from which it follows (since a density must integrate to unity) that

p(θ | x) = (2πφ₁)^(−1/2) exp{−½(θ − θ₁)²/φ₁},

that is, that the posterior distribution is

(θ | x) ~ N(θ₁, φ₁).
In terms of the precision, which we recall can be defined as the reciprocal of the variance, the relationship

1/φ₁ = 1/φ₀ + 1/φ

can be remembered as

posterior precision = prior precision + datum precision.
(It should be noted that this relationship has been derived assuming a normal prior and a normal likelihood.)
The relation for the posterior mean, θ₁ = φ₁(θ₀/φ₀ + x/φ), is only slightly more complicated. We have

θ₁ = (θ₀/φ₀ + x/φ) / (1/φ₀ + 1/φ),

which can be remembered as

posterior mean = weighted mean of prior mean and datum value, the weights being proportional to their respective precisions.
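These two rules translate directly into code. The helper below is my own sketch (the function and variable names are mine), and it assumes exactly the situation of this section: a normal prior, a single normal observation and a known variance.

```python
# Posterior for a normal prior N(theta0, phi0) and one observation x ~ N(theta, phi):
# precisions add, and the posterior mean is a precision-weighted average of the
# prior mean and the datum.
def normal_update(theta0, phi0, x, phi):
    prior_precision = 1.0 / phi0
    datum_precision = 1.0 / phi
    phi1 = 1.0 / (prior_precision + datum_precision)                   # posterior variance
    theta1 = phi1 * (theta0 * prior_precision + x * datum_precision)   # posterior mean
    return theta1, phi1
```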
2.2.2 Example
According to Kennett and Ross (1983), the first apparently reliable datings for the age of Ennerdale granophyre were obtained from the K/Ar method (which depends on observing the relative proportions of potassium 40 and argon 40 in the rock) in the 1960s and early 1970s, and these resulted in an estimate of 370 ± 20 million years. Later in the 1970s, measurements based on the Rb/Sr method (depending on the relative proportions of rubidium 87 and strontium 87) gave an age of 421 ± 8 million years. It appears that the errors marked are meant to be standard deviations, and it seems plausible that the errors are normally distributed. If a scientist S had the K/Ar measurements available in the early 1970s, then it could be said that (before the Rb/Sr measurements came in) S’s prior beliefs about the age θ of these rocks were represented by

θ ~ N(370, 20²).

We could then suppose that the investigations using the Rb/Sr method result in a measurement

x ~ N(θ, 8²).
We shall suppose for simplicity that the precisions of these measurements are known to be exactly those quoted, although this is not quite true (methods which take more of the uncertainty into account will be discussed later in the book). If we now use the above method, then, noting that the observation x turned out to be 421, we see that S’s posterior beliefs about θ should be represented by

(θ | x) ~ N(θ₁, φ₁),

where (rounding the figures)

φ₁ = (20⁻² + 8⁻²)⁻¹ ≈ 55,
θ₁ = φ₁ × (370/20² + 421/8²) ≈ 414.

Thus the posterior for the age of the rocks is

(θ | x) ~ N(414, 55),

that is, about 414 ± 7 million years.
Of course, all this assumes that the K/Ar measurements were available. If the Rb/Sr measurements were considered by another scientist T who had no knowledge of these, but had a vague idea (in the light of knowledge of similar rocks) that their age was likely to be about 400 ± 50 million years, that is

θ ~ N(400, 50²),

then T would have a posterior variance

φ₁ = (50⁻² + 8⁻²)⁻¹ ≈ 62

and a posterior mean of

θ₁ = φ₁ × (400/50² + 421/8²) ≈ 420,

so that T’s posterior distribution is

(θ | x) ~ N(420, 62),

that is, about 420 ± 8 million years. We note that this calculation has been carried out assuming that the prior information available is rather vague, and this is reflected in the fact that the posterior is almost entirely determined by the data.
The situation can be summarized as follows:

            Prior               Likelihood             Posterior
S     θ ~ N(370, 20²)     x = 421 ~ N(θ, 8²)     (θ | x) ~ N(414, 55)
T     θ ~ N(400, 50²)     x = 421 ~ N(θ, 8²)     (θ | x) ~ N(420, 62)
We note that in numerical work it is usually more meaningful to think in terms of the standard deviation √φ, whereas in theoretical work it is usually easier to work in terms of the variance φ itself.
We see that after this single observation the ideas of S and T about θ, as represented by their posterior distributions, are much closer than before, although they still differ considerably.
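The arithmetic of this example is easily checked; the following sketch simply re-computes the figures quoted above (and, like the example itself, takes the quoted standard deviations at face value).

```python
# Scientist S: prior N(370, 20^2), Rb/Sr observation 421 with variance 8^2.
theta0, phi0 = 370.0, 20.0**2
x, phi = 421.0, 8.0**2
phi1 = 1.0 / (1.0/phi0 + 1.0/phi)        # about 55
theta1 = phi1 * (theta0/phi0 + x/phi)    # about 414

# Scientist T: the vaguer prior N(400, 50^2) combined with the same observation.
phi1_T = 1.0 / (1.0/50.0**2 + 1.0/phi)        # about 62
theta1_T = phi1_T * (400.0/50.0**2 + x/phi)   # about 420

print(round(theta1), round(phi1), round(theta1_T), round(phi1_T))   # 414 55 420 62
```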
2.2.3 Predictive distribution
In the case discussed in this section, it is easy to find the predictive distribution, since we can write

x = θ + (x − θ)

and, independently of one another,

θ ~ N(θ₀, φ₀)   and   (x − θ) ~ N(0, φ),

from which it follows that

x ~ N(θ₀, φ₀ + φ),

using the standard fact that the sum of independent normal variates has a normal distribution. (The fact that the mean is the sum of the means and the variance the sum of the variances is of course true more generally, as proved in Section 1.4 on ‘Several Random Variables’.)
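A short simulation makes the same point; it is a sketch of my own, re-using the figures of the example purely for illustration.

```python
# Draw theta from the prior and then x given theta: the resulting draws of x have
# mean theta0 and variance close to phi0 + phi.
import numpy as np

rng = np.random.default_rng(0)
theta0, phi0, phi = 370.0, 20.0**2, 8.0**2
theta = rng.normal(theta0, np.sqrt(phi0), size=100_000)
x = rng.normal(theta, np.sqrt(phi))      # x = theta + (x - theta)

print(x.mean(), x.var())                 # close to 370 and to 400 + 64 = 464
```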
2.2.4 The nature of the assumptions made
Although this example is very simple, it does exhibit the main features of Bayesian inference as outlined in the previous section. We have assumed that the distribution of the observation x is known to be normal but that there is an unknown parameter θ, in this case the mean of the normal distribution. The assumption that the variance is known is unlikely to be fully justified in a practical example, but it may provide a reasonable approximation. You should, however, beware that it is all too easy to concentrate on the parameters of a well-known family, in this case the normal family, and to forget that the assumption that the density is in that family for any values of the parameters may not be valid. The fact that the normal distribution is easy to handle, as witnessed by the way that normal prior and normal likelihood combine to give normal posterior, is a good reason for looking for a normal model when it does provide a fair approximation, but there can easily be cases where it does not.
2.3 Several normal observations with a normal prior
2.3.1 Posterior distribution
We can generalize the situation in the previous section by supposing that a priori

θ ~ N(θ₀, φ₀),

but that instead of having just one observation we have n independent observations X = (x₁, x₂, …, xₙ) such that

xᵢ ~ N(θ, φ)   (i = 1, 2, …, n).

We sometimes refer to X as an n-sample from N(θ, φ). Then

p(θ | X) ∝ p(θ) p(X | θ)
        ∝ exp{−½(θ − θ₀)²/φ₀} × exp{−½ Σᵢ (xᵢ − θ)²/φ}
        ∝ exp{−½θ²(1/φ₀ + n/φ) + θ(θ₀/φ₀ + Σᵢ xᵢ/φ)}.
Proceeding just as we did in Section 2.2 when we had only one observation, we see that the posterior distribution is

(θ | X) ~ N(θ₁, φ₁),

where

φ₁ = (1/φ₀ + n/φ)⁻¹   and   θ₁ = φ₁(θ₀/φ₀ + Σᵢ xᵢ/φ).
We could alternatively write these formulae as

φ₁ = (1/φ₀ + (φ/n)⁻¹)⁻¹   and   θ₁ = φ₁(θ₀/φ₀ + x̄/(φ/n)),

which shows that, assuming a normal prior and likelihood, the result is just the same as the posterior distribution obtained from the single observation of the mean x̄, since we know that

x̄ ~ N(θ, φ/n),

and the above formulae are the ones we had before with φ replaced by φ/n and x by x̄. (Note that the use of a bar over the x here to denote a mean is unrelated to the use of a tilde over x to denote a random variable.)
We would of course obtain the same result by proceeding sequentially from p(θ) to p(θ | x₁), and then treating p(θ | x₁) as prior and x₂ as data to obtain p(θ | x₁, x₂), and so on. This is in accordance with the general result mentioned in Section 2.1 on ‘Nature of Bayesian Inference’.
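The equivalence between processing the observations one at a time and using the sample mean in a single step can be confirmed numerically; the prior, variance and data below are invented purely for illustration.

```python
# Sequential updating observation by observation versus one update with the sample mean.
import numpy as np

def normal_update(theta0, phi0, x, phi):
    """Normal prior N(theta0, phi0), observation x with known variance phi."""
    phi1 = 1.0 / (1.0/phi0 + 1.0/phi)
    theta1 = phi1 * (theta0/phi0 + x/phi)
    return theta1, phi1

theta0, phi0, phi = 0.0, 10.0, 4.0               # illustrative prior and known variance
x = np.array([1.2, 0.7, 2.1, 1.5, 0.9])          # n = 5 illustrative observations

m, v = theta0, phi0
for xi in x:                                     # one observation at a time
    m, v = normal_update(m, v, xi, phi)

m_bar, v_bar = normal_update(theta0, phi0, x.mean(), phi/len(x))   # xbar with variance phi/n

print(np.isclose(m, m_bar), np.isclose(v, v_bar))   # True True
```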
2.3.2 Example
We now consider a numerical example. The basic assumption in this section is that the variance is known, even though in most practical cases, it has to be estimated. There are a few circumstances in which the variance could be known, for example when we are using a measuring instrument which has been used so often that its measurement errors are well known, but there are not many. Later in this book, we will discover two things which mitigate this assumption – firstly, that the numerical results are not much different when we do take into account the uncertainty about the variance, and, secondly, that the larger the sample size is, the less difference it makes.
The data we will consider are quoted by Whittaker and Robinson (1940, Section 97). They consider chest measurements of 10 000 men. Now, based on memories of my experience as an assistant in a gentlemen’s outfitters in my university vacations, I would suggest a prior

θ ~ N(38, 9).

Of course, it is open to question whether these men form a random sample from the whole population, but unless I am given information to the contrary I would stick to the prior I have just quoted, except that I might be inclined to increase the variance. Whittaker and Robinson’s data show that the mean turned out to be 39.8 with a standard deviation of 2.0 for their sample of 10 000. If we put the two together, we find that the posterior for the mean chest measurement θ of men in this population is normal with variance

(1/9 + 10 000/2²)⁻¹ ≈ 0.0004

and mean

0.0004 × (38/9 + 10 000 × 39.8/2²) ≈ 39.8.

Thus, for all practical purposes, we have ended up with the distribution

N(39.8, 0.0004)
suggested by the data. You should note that this distribution is

N(x̄, φ/n),

the distribution we referred to in Section 2.1 on ‘Nature of Bayesian Inference’ as the standardized likelihood. Naturally, the closeness of the posterior to the standardized likelihood results from the large sample size, and whatever my prior had been, unless it were very, very extreme, I would have got very much the same result. More formally, the posterior will be close to the standardized likelihood insofar as the weight 1/φ₀ associated with the prior mean is small, that is insofar as φ₀ is large compared with φ/n. This is reassuring in cases where the prior is not very easy to specify, although of course there are cases where the amount of data available is not enough to get to this comforting position.
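Re-doing the arithmetic of this example (the prior N(38, 9) used above should be treated as indicative; the sample statistics are as quoted):

```python
# With n = 10 000 the data precision n/phi swamps the prior precision 1/phi0, so the
# posterior is essentially the standardized likelihood N(xbar, phi/n).
theta0, phi0 = 38.0, 9.0            # prior mean and variance (indicative values)
xbar, sd, n = 39.8, 2.0, 10_000     # sample mean, standard deviation and size as quoted

phi1 = 1.0 / (1.0/phi0 + n/sd**2)               # about 0.0004
theta1 = phi1 * (theta0/phi0 + n*xbar/sd**2)    # about 39.8
print(theta1, phi1)
```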
2.3.3 Predictive distribution
If we consider taking one more observation xₙ₊₁, then the predictive distribution can be found just as in Section 2.2 by writing

xₙ₊₁ = θ + (xₙ₊₁ − θ)

and noting that, independently of one another,

θ ~ N(θ₁, φ₁)   and   (xₙ₊₁ − θ) ~ N(0, φ),

so that

xₙ₊₁ ~ N(θ₁, φ₁ + φ).
It is easy enough to adapt this argument to find the predictive distribution of an m-vector x* = (xₙ₊₁, xₙ₊₂, …, xₙ₊ₘ), where

xₙ₊ᵢ ~ N(θ, φ)   (i = 1, 2, …, m),

by writing

x* = θ·1 + (x* − θ·1),

where 1 is the constant vector

1 = (1, 1, …, 1).

Then θ has its posterior distribution N(θ₁, φ₁), and the components of the vector x* − θ·1 are N(0, φ) variates independent of θ and of one another, so that x* has a multivariate normal distribution, although its components are not independent of one another.
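To spell out the dependence just mentioned (this covariance structure is implicit in the argument above rather than stated explicitly): since the same θ appears in every component of x*, for i ≠ j

Cov(xₙ₊ᵢ, xₙ₊ⱼ) = Var(θ) = φ₁,   while   Var(xₙ₊ᵢ) = φ₁ + φ,

so that each component has mean θ₁ and any two distinct components have correlation φ₁/(φ₁ + φ).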
2.3.4 Robustness
It should be noted that any statement of a posterior distribution, and indeed any inference, is conditional not merely on the data, but also on the assumptions made about the likelihood. So, in this section, the posterior distribution ends up being normal as a consequence partly of the prior but also of the assumption that the data were distributed normally, albeit with an unknown mean. We say that an inference is robust if it is not seriously affected by changes in the assumptions on which it is based. The notion of robustness is not one which can be pinned down into a more precise definition, and its meaning depends on the context, but nevertheless the concept is of great importance, and increasing attention is paid in statistics to investigations of the robustness of various techniques. We can immediately say that the conclusion that the posterior is robust against changes in the prior is valid provided that the sample size is large and the prior is a not-too-extreme normal distribution, or nearly so. Some detailed exploration of the notion of robustness (or sensitivity analysis) can be found in Kadane (1984).