2.10.3 Mixtures of conjugate densities
Suppose we have a likelihood $l(x\,|\,\theta)$ and that $p_0(\theta)$ and $p_1(\theta)$ are both densities in a conjugate family $\Pi$ which give rise to posteriors $p_0(\theta\,|\,x)$ and $p_1(\theta\,|\,x)$, respectively. Let $\alpha$ and $\beta$ be any non-negative real numbers summing to unity, and write

$$p(\theta) = \alpha p_0(\theta) + \beta p_1(\theta).$$

Then (taking a little more care with constants of proportionality than usual) it is easily seen that the posterior corresponding to the prior $p(\theta)$ is

$$p(\theta\,|\,x) = \alpha' p_0(\theta\,|\,x) + \beta' p_1(\theta\,|\,x),$$

where

$$\alpha' \propto \alpha \int p_0(\theta)\,l(x\,|\,\theta)\,d\theta, \qquad \beta' \propto \beta \int p_1(\theta)\,l(x\,|\,\theta)\,d\theta,$$

with the constant of proportionality being such that $\alpha' + \beta' = 1$.
More generally, it is clearly possible to take any convex combination of more than two priors in Π and get a corresponding convex combination of the respective posteriors. Strictly in accordance with the definition given, this would allow us to extend the definition of Π to include all such convex combinations, but this would not retain the ‘naturalness’ of families such as the normal or the inverse chi-squared.
The idea can, however, be useful if, for example, you have a bimodal prior distribution. An example quoted by Diaconis and Ylvisaker (1985) is as follows. To follow this example, it may help to refer to Section 3.1 on ‘The binomial distribution’, or to return to it after you have read that section. Diaconis and Ylvisaker observe that there is a big difference between spinning a coin on a table and tossing it in the air. While tossing often leads to about an even proportion of ‘heads’ and ‘tails’, spinning often leads to proportions like $\tfrac13$ or $\tfrac23$; we shall write $\pi$ for the proportion of heads. They say that the reasons for this bias are not hard to infer, since the shape of the edge will be a strong determining factor – indeed magicians have coins that are slightly shaved; the eye cannot detect the shaving, but the spun coin always comes up ‘heads’. Assuming that they were not dealing with one of the said magicians’ coins, they thought that a fifty–fifty mixture (i.e. $\alpha = \beta = \tfrac12$) of two beta densities, namely, $\mathrm{Be}(10, 20)$ (proportional to $\pi^9(1-\pi)^{19}$) and $\mathrm{Be}(20, 10)$ (proportional to $\pi^{19}(1-\pi)^9$), would seem a reasonable prior (actually they consider other possibilities as well). This is a bimodal distribution, which of course no beta density is, having modes, that is, maxima of the density, near to the modes at $9/28$ and at $19/28$ of the components.
They then spun a coin ten times, getting ‘heads’ three times. This gives a likelihood proportional to $\pi^3(1-\pi)^7$, and so

$$p(\pi\,|\,x) \propto \pi^3(1-\pi)^7\bigl\{\pi^9(1-\pi)^{19} + \pi^{19}(1-\pi)^9\bigr\},$$

that is,

$$p(\pi\,|\,x) \propto \pi^{12}(1-\pi)^{26} + \pi^{22}(1-\pi)^{16},$$

or, since $13 + 27 = 23 + 17 = 40$,

$$p(\pi\,|\,x) \propto B(13, 27)\,\mathrm{Be}(13, 27) + B(23, 17)\,\mathrm{Be}(23, 17),$$

where $B(a, b) = \Gamma(a)\Gamma(b)/\Gamma(a+b)$ is the beta function and $\mathrm{Be}(a, b)$ stands for the corresponding beta density. From the fact that $\Gamma(n+1) = n!$, it is easily deduced that

$$p(\pi\,|\,x) = \tfrac{115}{129}\,\mathrm{Be}(13, 27) + \tfrac{14}{129}\,\mathrm{Be}(23, 17).$$

We can deduce some properties of this posterior from those of the component betas. For example, the probability that $\pi$ is greater than 0.5 is the sum of $\tfrac{115}{129}$ times the probability that a $\mathrm{Be}(13, 27)$ variable is greater than 0.5 and $\tfrac{14}{129}$ times the probability that a $\mathrm{Be}(23, 17)$ variable is greater than 0.5; and similarly the mean is an appropriately weighted average of the component means.
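These calculations are easy to check numerically; the following is a minimal sketch in Python (assuming scipy is available; the variable names are ours, while the weights and probabilities are exactly those discussed above):

```python
import numpy as np
from scipy.special import betaln
from scipy.stats import beta

# Posterior mixture components after observing 3 heads in 10 spins:
# the prior 0.5 Be(10,20) + 0.5 Be(20,10) updates to a mixture of
# Be(13,27) and Be(23,17) with weights proportional to the beta
# functions B(13,27) and B(23,17); work on the log scale for safety.
a = np.array([[13, 27], [23, 17]])
logw = betaln(a[:, 0], a[:, 1])
w = np.exp(logw - logw.max())
w /= w.sum()
print(w)                        # [0.8915, 0.1085] = [115/129, 14/129]

# Mixture properties follow by weighting the component betas
p_gt_half = w[0] * beta.sf(0.5, 13, 27) + w[1] * beta.sf(0.5, 23, 17)
post_mean = w[0] * 13 / 40 + w[1] * 23 / 40
print(p_gt_half, post_mean)     # P(pi > 0.5) and the posterior mean
```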
These ideas are worth bearing in mind if you have a complicated prior which is not fully dominated by the data, and yet want to obtain a posterior about which at least something can be said without complicated numerical integration.
2.10.4 Is your prior really conjugate?
The answer to this question is, almost certainly, ‘No’. Nevertheless, it is often the case that the family of conjugate priors is large enough that there is one that is sufficiently close to your real prior beliefs that the resulting posterior is barely distinguishable from the posterior that comes from using your real prior. When this is so, there are clear advantages in using a conjugate prior because of the greater simplicity of the computations. You should, however, be aware that cases can arise when no member of the conjugate family is, in the aforementioned sense, close enough, and then you may well have to proceed using numerical integration if you want to investigate the properties of the posterior.
2.11 The exponential family
2.11.1 Definition
It turns out that many of the common statistical distributions have a similar form. This leads to the definition that a density $p(x\,|\,\theta)$ is from the one-parameter exponential family if it can be put into the form

$$p(x\,|\,\theta) = g(x)\,h(\theta)\exp\{t(x)\,\psi(\theta)\},$$

or equivalently if the likelihood of $n$ independent observations $\boldsymbol{x} = (x_1, x_2, \dots, x_n)$ from this distribution is

$$l(\theta\,|\,\boldsymbol{x}) \propto h(\theta)^n \exp\Bigl\{\sum t(x_i)\,\psi(\theta)\Bigr\}.$$

It follows immediately from Neyman’s Factorization Theorem that $\sum t(x_i)$ is sufficient for $\theta$ given $\boldsymbol{x}$.
2.11.2 Examples
Normal mean. If $x \sim \mathrm{N}(\theta, \phi)$ with $\phi$ known, then

$$p(x\,|\,\theta) = (2\pi\phi)^{-1/2}\exp(-\tfrac12 x^2/\phi)\,\exp(-\tfrac12\theta^2/\phi)\,\exp(x\theta/\phi),$$
which is clearly of the above form, with $t(x) = x$ and $\psi(\theta) = \theta/\phi$.
Normal variance. If $x \sim \mathrm{N}(\theta, \phi)$ with $\theta$ known, then we can express the density in the appropriate form by writing

$$p(x\,|\,\phi) = (2\pi)^{-1/2}\,\phi^{-1/2}\exp\{-\tfrac12(x - \theta)^2/\phi\},$$

so that $t(x) = (x - \theta)^2$ and $\psi(\phi) = -\tfrac12\phi^{-1}$.
Poisson distribution. In the Poisson case, we can write

$$p(x\,|\,\lambda) = (x!)^{-1}\,\mathrm{e}^{-\lambda}\exp(x\log\lambda),$$

so that $t(x) = x$ and $\psi(\lambda) = \log\lambda$.
Binomial distribution. In the binomial case, we can write

$$p(x\,|\,\pi) = \binom{n}{x}(1 - \pi)^n\exp\bigl[x\log\{\pi/(1 - \pi)\}\bigr],$$

so that $t(x) = x$ and $\psi(\pi) = \log\{\pi/(1 - \pi)\}$.
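Each of these factorizations can be verified numerically. The sketch below (our own illustration, assuming scipy is available) checks the Poisson and binomial cases by comparing the standard probability mass functions with the products $g(x)\,h(\theta)\exp\{t(x)\,\psi(\theta)\}$ identified above:

```python
import numpy as np
from scipy.stats import poisson, binom
from scipy.special import gammaln, comb

# Poisson: g(x) = 1/x!, h(lam) = exp(-lam), t(x) = x, psi(lam) = log(lam)
lam = 2.5
for x in range(6):
    rhs = np.exp(-gammaln(x + 1)) * np.exp(-lam) * np.exp(x * np.log(lam))
    assert np.isclose(poisson.pmf(x, lam), rhs)

# Binomial: g(x) = C(n,x), h(pi) = (1-pi)^n, t(x) = x,
# psi(pi) = log(pi/(1-pi))
n, pi = 10, 0.3
for x in range(n + 1):
    rhs = comb(n, x) * (1 - pi) ** n * np.exp(x * np.log(pi / (1 - pi)))
    assert np.isclose(binom.pmf(x, n, pi), rhs)

print("both factorizations agree with the standard pmfs")
```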
2.11.3 Conjugate densities
When a likelihood function comes from the exponential family, so that

$$l(\theta\,|\,\boldsymbol{x}) \propto h(\theta)^n \exp\Bigl\{\sum t(x_i)\,\psi(\theta)\Bigr\},$$

there is an unambiguous definition of a conjugate family – it is defined to be the family $\Pi$ of densities such that

$$p(\theta) \propto h(\theta)^{\nu}\exp\{\tau\,\psi(\theta)\}$$

for some $\nu$ and $\tau$.
This definition does fit in with the particular cases we have discussed before. For example, if $x$ has a normal distribution with unknown mean $\theta$ but known variance $\phi$, the conjugate family as defined here consists of densities such that

$$p(\theta) \propto \exp(-\tfrac12\nu\theta^2/\phi)\exp(\tau\theta/\phi).$$

If we set $\theta_0 = \tau/\nu$ and $\phi_0 = \phi/\nu$, we see that

$$p(\theta) \propto \exp\{-\tfrac12(\theta - \theta_0)^2/\phi_0\},$$
which is a normal density. Although the notation is slightly different, the end result is the same as the one we obtained earlier.
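A useful practical consequence of this definition is that updating a conjugate prior amounts to no more than adding the sufficient statistics to the hyperparameters: $\nu$ becomes $\nu + n$ and $\tau$ becomes $\tau + \sum t(x_i)$. A minimal sketch for the normal-mean case just discussed (the function name and the illustrative numbers are ours):

```python
import numpy as np

def conjugate_update(nu, tau, t_of_x):
    """For a likelihood h(theta)^n exp{sum t(x_i) psi(theta)} and a prior
    proportional to h(theta)^nu exp{tau psi(theta)}, the posterior has
    nu -> nu + n and tau -> tau + sum t(x_i)."""
    t_of_x = np.asarray(t_of_x)
    return nu + t_of_x.size, tau + t_of_x.sum()

# Normal mean, known variance: t(x) = x, prior mean theta0 = tau/nu.
# A prior worth 4 observations centred at 100:
nu1, tau1 = conjugate_update(4, 4 * 100.0, [103.2, 98.7, 101.5])
print(tau1 / nu1)   # posterior mean tau1/nu1 (about 100.5)
```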
2.11.4 Two-parameter exponential family
The one-parameter exponential family, as its name implies, only includes densities with one unknown parameter (and not even all of those which we shall encounter). There are a few cases in which we have two unknown parameters, most notably when the mean and variance of a normal distribution are both unknown, which will be considered in detail in Section 2.12. It is this situation which prompts us to consider a generalization. A density is from the two-parameter exponential family if it is of the form

$$p(x\,|\,\theta, \phi) = g(x)\,h(\theta, \phi)\exp\{t(x)\,\psi(\theta, \phi) + u(x)\,\chi(\theta, \phi)\},$$

or equivalently if, given $n$ independent observations $\boldsymbol{x} = (x_1, x_2, \dots, x_n)$, the likelihood takes the form

$$l(\theta, \phi\,|\,\boldsymbol{x}) \propto h(\theta, \phi)^n \exp\Bigl\{\sum t(x_i)\,\psi(\theta, \phi) + \sum u(x_i)\,\chi(\theta, \phi)\Bigr\}.$$

Evidently the two-dimensional vector $\bigl(\sum t(x_i), \sum u(x_i)\bigr)$ is sufficient for the two-dimensional vector of parameters $(\theta, \phi)$ given $\boldsymbol{x}$. The family of densities conjugate to such a likelihood takes the form

$$p(\theta, \phi) \propto h(\theta, \phi)^{\nu}\exp\{\tau\,\psi(\theta, \phi) + \upsilon\,\chi(\theta, \phi)\}.$$
While the case of the normal distribution with both parameters unknown is of considerable theoretical and practical importance, we shall not encounter many other two-parameter families. The idea of the exponential family can easily be extended to a k-parameter exponential family in an obvious way, but there will be no need for more than two parameters in this book.
2.12 Normal mean and variance both unknown
2.12.1 Formulation of the problem
It is much more realistic to suppose that both parameters of a normal distribution are unknown rather than just one. So we consider the case where we have a set of observations $x_1, x_2, \dots, x_n$ which are $\mathrm{N}(\theta, \phi)$ with $\theta$ and $\phi$ both unknown. Clearly,

$$p(x\,|\,\theta, \phi) = (2\pi)^{-1/2}\,\phi^{-1/2}\exp\{-\tfrac12(x - \theta)^2/\phi\} = (2\pi)^{-1/2}\,\phi^{-1/2}\exp(-\tfrac12\theta^2/\phi)\exp\{(\theta/\phi)x - \tfrac12 x^2/\phi\},$$

from which it follows that the density is in the two-parameter exponential family as defined above, with $t(x) = x$ and $u(x) = x^2$. Further,

$$l(\theta, \phi\,|\,\boldsymbol{x}) \propto \phi^{-n/2}\exp\Bigl\{-\tfrac12\sum(x_i - \theta)^2/\phi\Bigr\} = \phi^{-n/2}\exp[-\tfrac12\{S + n(\bar{x} - \theta)^2\}/\phi],$$

wherein we define

$$S = \sum(x_i - \bar{x})^2$$

(rather than as $\sum(x_i - \mu)^2$ as in the case where the mean is known to be equal to $\mu$). It is also convenient to define

$$s^2 = S/(n - 1)$$

(rather than $s^2 = S/n$ as in the case where the mean is known).

It is worth noting that the two-dimensional vector $\bigl(\sum x_i, \sum x_i^2\bigr)$, or equivalently $(\bar{x}, S)$, is clearly sufficient for $(\theta, \phi)$ given $\boldsymbol{x}$.
Because this case can get quite complicated, we shall first consider the case of an indifference or ‘reference’ prior. It is usual to take

$$p(\theta, \phi) \propto 1/\phi,$$

which is the product of the reference prior $p(\theta) \propto 1$ for $\theta$ and the reference prior $p(\phi) \propto 1/\phi$ for $\phi$. The justification for this is that it seems unlikely that if you knew very little about either the mean or the variance, then being given information about the one would affect your judgements about the other. (Other possible priors will be discussed later.) If we do take this reference prior, then

$$p(\theta, \phi\,|\,\boldsymbol{x}) \propto p(\theta, \phi)\,l(\theta, \phi\,|\,\boldsymbol{x}) \propto \phi^{-n/2 - 1}\exp[-\tfrac12\{S + n(\bar{x} - \theta)^2\}/\phi].$$

For reasons which will appear later, it is convenient to set

$$\nu = n - 1$$

in the power of $\phi$, but not in the exponential, so that

$$p(\theta, \phi\,|\,\boldsymbol{x}) \propto \phi^{-(\nu + 3)/2}\exp[-\tfrac12\{S + n(\bar{x} - \theta)^2\}/\phi].$$
2.12.2 Marginal distribution of the mean
Now in many real problems what interests us is the mean $\theta$, and $\phi$ is what is referred to as a nuisance parameter. In classical (sampling theory) statistics, nuisance parameters can be a real nuisance, but there is (at least in principle) no problem from a Bayesian viewpoint. All we need to do is to find the marginal (posterior) distribution of $\theta$, and you should recall from Section 1.4 on ‘Several Random Variables’ that

$$p(\theta\,|\,\boldsymbol{x}) = \int p(\theta, \phi\,|\,\boldsymbol{x})\,d\phi.$$
This integral is not too bad – all you need to do is to substitute

$$z = \frac{A}{2\phi},$$

where

$$A = S + n(\bar{x} - \theta)^2,$$

and it reduces to a standard gamma function integral

$$p(\theta\,|\,\boldsymbol{x}) \propto \int_0^\infty \phi^{-(\nu + 3)/2}\exp(-\tfrac12 A/\phi)\,d\phi \propto A^{-(\nu + 1)/2}\int_0^\infty z^{(\nu + 1)/2 - 1}\mathrm{e}^{-z}\,dz \propto A^{-(\nu + 1)/2}.$$

It follows that

$$p(\theta\,|\,\boldsymbol{x}) \propto \{S + n(\bar{x} - \theta)^2\}^{-(\nu + 1)/2} \propto \{1 + n(\bar{x} - \theta)^2/S\}^{-(\nu + 1)/2},$$
which is the required posterior distribution of $\theta$. However, this is not the most convenient way to express the result. It is usual to define

$$t = \frac{\theta - \bar{x}}{s/\sqrt{n}},$$

where (as defined earlier) $s^2 = S/\nu$. Because the Jacobian of the transformation from $\theta$ to $t$ is a constant, the posterior density of $t$ is given by

$$p(t\,|\,\boldsymbol{x}) \propto (1 + t^2/\nu)^{-(\nu + 1)/2}.$$

A glance at Appendix A will show that this is the density of a random variable with Student’s $t$ distribution on $\nu$ degrees of freedom, so that we can write $t \sim \mathrm{t}_\nu$. The fact that the distribution of $t$ depends on the single parameter $\nu$ makes it sensible to express the result in terms of this distribution rather than that of $\theta$ itself, which depends on $\bar{x}$ and $S$ as well as on $\nu$, and is consequently more complicated to tabulate. Note that as $\nu \to \infty$ the standard exponential limit shows that the density of $t$ is ultimately proportional to $\exp(-\tfrac12 t^2)$, which is the standard normal form. On the other hand, if $\nu = 1$ we see that $t$ has a standard Cauchy distribution $\mathrm{C}(0, 1)$, or equivalently that $\theta \sim \mathrm{C}(\bar{x}, s^2/n)$.
Because the density of Student’s t is symmetric about the origin, an HDR is also symmetric about the origin, and so can be found simply from a table of percentage points.
2.12.3 Example of the posterior density for the mean
Consider the data on uterine weight of rats introduced earlier in Section 2.8 on ‘HDRs for the Normal Variance’. With those data, $n = 20$, $\bar{x} = 21.0$ and $S = 664$, so that $\nu = n - 1 = 19$ and

$$s/\sqrt{n} = \sqrt{S/(\nu n)} = \sqrt{664/(19 \times 20)} = 1.32.$$

We can deduce that the posterior distribution of the true mean $\theta$ is given by

$$\frac{\theta - 21.0}{1.32} \sim \mathrm{t}_{19}.$$

In principle, this tells us all we can deduce from the data if we have no very definite prior knowledge. It can help to understand what this means by looking for highest density regions. From tables of the $t$ distribution, the value of $\mathrm{t}_{19}$ exceeded with probability 0.025 is 2.093. It follows that a 95% HDR for $\theta$ is $21.0 \pm 2.093 \times 1.32$, that is, the interval (18, 24).
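The same figures can be reproduced in a few lines of code; a minimal sketch using scipy’s $t$ distribution with the summaries $n = 20$, $\bar{x} = 21.0$ and $S = 664$ quoted above:

```python
import numpy as np
from scipy.stats import t

n, xbar, S = 20, 21.0, 664.0
nu = n - 1                       # 19 degrees of freedom
scale = np.sqrt(S / (nu * n))    # s/sqrt(n), about 1.32

tcrit = t.ppf(0.975, nu)         # 2.093, as in the tables
print(xbar - tcrit * scale, xbar + tcrit * scale)
# about (18.2, 23.8), i.e. the interval (18, 24)
```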
2.12.4 Marginal distribution of the variance
If we require knowledge about $\phi$ rather than $\theta$, we use

$$p(\phi\,|\,\boldsymbol{x}) = \int p(\theta, \phi\,|\,\boldsymbol{x})\,d\theta \propto \phi^{-(\nu + 3)/2}\exp(-\tfrac12 S/\phi)\int \exp\{-\tfrac12 n(\theta - \bar{x})^2/\phi\}\,d\theta \propto \phi^{-\nu/2 - 1}\exp(-\tfrac12 S/\phi),$$

as the last integral is that of a normal density and so contributes only a factor proportional to $\phi^{1/2}$. It follows that the posterior distribution of the variance is $\phi \sim S\chi_\nu^{-2}$. Except that $n$ is replaced by $\nu = n - 1$, the conclusion is the same as in the case where the mean is known. Similar considerations to those which arose when the mean was known make it preferable to use HDRs based on log chi-squared, though with a different number of degrees of freedom.
2.12.5 Example of the posterior density of the variance
With the same data as before, if the mean is not known (which in real life it almost certainly would not be), the posterior distribution of the variance is given by $\phi \sim 664\chi_{19}^{-2}$. Some idea of the meaning of this can be got from looking for a 95% HDR. Because the values of $\chi^2$ corresponding to an HDR for $\log\chi_{19}^2$ are found from the tables in the Appendix to be 9.267 and 33.921, a 95% HDR lies between 664/33.921 and 664/9.267, that is, the interval (20, 72). It may be worth noting that this does not differ all that much from the interval (19, 67) which we found on the assumption that the mean was known.
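Again the arithmetic is easily checked; the endpoints 9.267 and 33.921 below are the tabulated values for an HDR based on $\log\chi_{19}^2$ quoted above (they come from the tables, not from scipy):

```python
S = 664.0
chi2_lo, chi2_hi = 9.267, 33.921   # tabulated HDR endpoints for chi^2_19
print(S / chi2_hi, S / chi2_lo)    # about (19.6, 71.7), i.e. (20, 72)
```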
2.12.6 Conditional density of the mean for given variance
We will find it useful in Section 2.13 to write the posterior in the form

$$p(\theta, \phi\,|\,\boldsymbol{x}) = p(\phi\,|\,\boldsymbol{x})\,p(\theta\,|\,\phi, \boldsymbol{x}).$$

Since

$$p(\theta, \phi\,|\,\boldsymbol{x}) \propto \phi^{-(\nu + 3)/2}\exp[-\tfrac12\{S + n(\theta - \bar{x})^2\}/\phi], \qquad p(\phi\,|\,\boldsymbol{x}) \propto \phi^{-\nu/2 - 1}\exp(-\tfrac12 S/\phi),$$

this implies that

$$p(\theta\,|\,\phi, \boldsymbol{x}) \propto \phi^{-1/2}\exp\{-\tfrac12 n(\theta - \bar{x})^2/\phi\},$$

which as the density integrates to unity implies that

$$p(\theta\,|\,\phi, \boldsymbol{x}) = (2\pi\phi/n)^{-1/2}\exp\{-\tfrac12 n(\theta - \bar{x})^2/\phi\},$$

that is, that for given $\phi$ and $\boldsymbol{x}$, the distribution of the mean $\theta$ is $\mathrm{N}(\bar{x}, \phi/n)$. This is the result we might have expected from our investigations of the case where the variance is known, although this time we have arrived at the result by conditioning on the variance in the case where neither parameter is truly known.
A distribution for the two-dimensional vector $(\theta, \phi)$ of this form, in which $\phi$ has (a multiple of an) inverse chi-squared distribution and, for given $\phi$, $\theta$ has a normal distribution, will be referred to as a normal/chi-squared distribution, although it is more commonly referred to as normal gamma or normal inverse gamma. (The chi-squared distribution is used here to avoid unnecessary complications.)
It is possible to try to look at the joint posterior density of θ and , but two-dimensional distributions can be hard to visualize in the absence of independence, although numerical techniques can help. Some idea of an approach to this can be got from Box and Tiao (1992, Section 2.4).
2.13 Conjugate joint prior for the normal distribution
2.13.1 The form of the conjugate prior
In Section 2.12, we considered a reference prior for a normal distribution with both parameters unknown, whereas in this section we shall consider a conjugate prior for this situation. It is, in fact, rather difficult to determine which member of the conjugate family to use when substantial prior information is available, and hence in practice the reference prior is often used in the hope that the likelihood dominates the prior. It is also the case that the manipulations necessary to deal with the conjugate prior are a bit involved, although the end results are, of course, similar to those when we use a reference prior, with some of the parameters altered slightly. Part of the problem is the unavoidable notational complexity. Further, there is no notation agreed among the different writers on the subject. A new notation is introduced below.
We first recall that the likelihood is

$$l(\theta, \phi\,|\,\boldsymbol{x}) \propto \phi^{-n/2}\exp[-\tfrac12\{S + n(\bar{x} - \theta)^2\}/\phi].$$

Now suppose that your prior distribution of $\phi$ is (a multiple of) an inverse chi-squared on $\nu_0$ degrees of freedom, that is, $\phi \sim S_0\chi_{\nu_0}^{-2}$. It may be convenient to think of $\nu_0$ as $l_0 - 1$, so that your prior knowledge about $\phi$ is in some sense worth $l_0$ observations. Thus,

$$p(\phi) \propto \phi^{-\nu_0/2 - 1}\exp(-\tfrac12 S_0/\phi).$$

Now suppose that, conditional on $\phi$, your prior distribution for $\theta$ is normal of mean $\theta_0$ and variance $\phi/n_0$, so that your prior knowledge is worth $n_0$ observations of variance $\phi$ or their equivalent. It is not necessarily the case that $n_0 = l_0$. Then

$$p(\theta\,|\,\phi) \propto \phi^{-1/2}\exp\{-\tfrac12 n_0(\theta - \theta_0)^2/\phi\}.$$

Thus, the joint prior is a case of a normal/chi-squared distribution, which was referred to briefly at the end of the last section. Its joint density is

$$p(\theta, \phi) = p(\phi)\,p(\theta\,|\,\phi) \propto \phi^{-(\nu_0 + 3)/2}\exp[-\tfrac12\{S_0 + n_0(\theta - \theta_0)^2\}/\phi] = \phi^{-(\nu_0 + 3)/2}\exp\{-\tfrac12 Q_0(\theta)/\phi\},$$

where $Q_0(\theta)$ is the quadratic

$$Q_0(\theta) = n_0\theta^2 - 2n_0\theta_0\theta + (n_0\theta_0^2 + S_0).$$
It should be clear that by suitable choice of the parameters n0, θ0 and S0, the quadratic can be made into any non-negative definite quadratic form.
We note that the marginal prior density for $\theta$ is

$$p(\theta) = \int p(\theta, \phi)\,d\phi \propto \{S_0 + n_0(\theta - \theta_0)^2\}^{-(\nu_0 + 1)/2}$$

(cf. Section 2.12), so that

$$\frac{\theta - \theta_0}{\sqrt{S_0/(n_0\nu_0)}} \sim \mathrm{t}_{\nu_0},$$

and it follows from Appendix A that the unconditional prior mean and variance of $\theta$ are $\theta_0$ and $S_0/\{n_0(\nu_0 - 2)\}$.
By taking

$$n_0 = 0, \qquad S_0 = 0, \qquad \nu_0 = -1$$

(so that the quadratic vanishes identically, that is, $Q_0(\theta) \equiv 0$) we get the reference prior

$$p(\theta, \phi) \propto \phi^{-(\nu_0 + 3)/2} = \phi^{-1}.$$
It should be noted that if $n_0 \neq 0$ then $p(\theta, \phi)$ is not a product of a function of $\theta$ and a function of $\phi$, so that $\theta$ and $\phi$ are not independent a priori. This does not mean that it is impossible to use priors other than the reference prior in which the mean and the variance are independent a priori, but that such a prior will not be in the conjugate family, so that the posterior distribution will be complicated and it may need a lot of numerical investigation to find its properties.
We shall deal with priors for θ and which are independent when we come to consider numerical methods in Chapter 9.
2.13.2 Derivation of the posterior
If the prior is of the normal/chi-squared form, then the posterior is

$$p(\theta, \phi\,|\,\boldsymbol{x}) \propto p(\theta, \phi)\,l(\theta, \phi\,|\,\boldsymbol{x}) \propto \phi^{-(\nu_0 + n + 3)/2}\exp\bigl(-\tfrac12[Q_0(\theta) + S + n(\bar{x} - \theta)^2]/\phi\bigr) = \phi^{-(\nu_1 + 3)/2}\exp\{-\tfrac12 Q_1(\theta)/\phi\},$$

where

$$\nu_1 = \nu_0 + n$$

and $Q_1(\theta)$ is another quadratic in $\theta$, namely,

$$Q_1(\theta) = Q_0(\theta) + S + n(\bar{x} - \theta)^2 = (n_0 + n)\theta^2 - 2(n_0\theta_0 + n\bar{x})\theta + (n_0\theta_0^2 + n\bar{x}^2 + S_0 + S),$$

which is in the form in which the prior was expressed, that is,

$$Q_1(\theta) = n_1(\theta - \theta_1)^2 + S_1,$$

if we define

$$\begin{aligned}
n_1 &= n_0 + n,\\
\theta_1 &= (n_0\theta_0 + n\bar{x})/n_1,\\
S_1 &= S_0 + S + n_0\theta_0^2 + n\bar{x}^2 - n_1\theta_1^2\\
    &= S_0 + S + (n_0^{-1} + n^{-1})^{-1}(\theta_0 - \bar{x})^2.
\end{aligned}$$

(The second formula for $S_1$ follows from the first after a little manipulation – its importance is that it is less subject to rounding errors.) This result has finally vindicated the claim that if the prior is normal/chi-squared, then so is the posterior, so that the normal/chi-squared family is conjugate to the normal likelihood with both mean and variance unknown. Thus, the posterior for $\phi$ is

$$\phi\,|\,\boldsymbol{x} \sim S_1\chi_{\nu_1}^{-2},$$

and that for $\theta$ given $\phi$ is

$$\theta\,|\,\phi, \boldsymbol{x} \sim \mathrm{N}(\theta_1, \phi/n_1).$$
Clearly, we can adapt the argument used when we considered the reference prior to find the marginal distribution for $\theta$. Thus, the posterior distribution of

$$t = \frac{\theta - \theta_1}{s_1/\sqrt{n_1}},$$

where

$$s_1^2 = S_1/\nu_1,$$

is a Student’s t distribution on $\nu_1$ degrees of freedom, that is, $t \sim \mathrm{t}_{\nu_1}$.

It follows that if you use a conjugate prior, then your inferences should proceed as with the reference prior except that you have to replace $\nu$ by $\nu_1$, $S$ by $S_1$, $n$ by $n_1$ and $\bar{x}$ by $\theta_1$.
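The whole updating step is easily coded. The sketch below (the function and variable names are ours) implements the formulae for $n_1$, $\theta_1$, $\nu_1$ and $S_1$ given above, using the second, rounding-friendly, form for $S_1$:

```python
import numpy as np

def normal_chi2_update(n0, theta0, nu0, S0, x):
    """Combine a normal/chi-squared prior with N(theta, phi) data;
    returns the posterior hyperparameters (n1, theta1, nu1, S1)."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    S = np.sum((x - xbar) ** 2)
    n1 = n0 + n
    nu1 = nu0 + n
    theta1 = (n0 * theta0 + n * xbar) / n1
    # less subject to rounding errors than the first form for S1
    S1 = S0 + S + (theta0 - xbar) ** 2 / (1 / n0 + 1 / n)
    return n1, theta1, nu1, S1
```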
2.13.3 Example
An experimental station has had experience with growing wheat which leads it to believe that the yield per plot is more or less normally distributed with mean 100 and standard deviation 10. The station then wished to investigate the effect of a growth hormone on the yield per plot. In the absence of any other information, the prior distribution for the variance on the plots might be taken to have mean 300 and standard deviation 160. As for the mean, it is expected to be about 110, and this information is thought to be worth about 15 observations. To fit a prior of the normal/chi-squared form, first equate the mean and variance of (a multiple of) an inverse chi-squared distribution to 300 and $160^2$, so that

$$\frac{S_0}{\nu_0 - 2} = 300, \qquad \frac{2S_0^2}{(\nu_0 - 2)^2(\nu_0 - 4)} = 160^2,$$
from which $\nu_0 - 4 = 2(300/160)^2 \approx 7$, so that $\nu_0 = 11$ and hence $l_0 = \nu_0 + 1 = 12$ and $S_0 = 300(\nu_0 - 2) = 2700$. The other information gives $\theta_0 = 110$ and $n_0 = 15$.
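This fitting step, too, can be done in a line or two; using the mean $S_0/(\nu_0 - 2)$ and variance $2S_0^2/\{(\nu_0 - 2)^2(\nu_0 - 4)\}$ of an $S_0\chi_{\nu_0}^{-2}$ distribution (see Appendix A), the ratio of the variance to the squared mean gives $\nu_0$ directly:

```python
m, sd = 300.0, 160.0
nu0 = 4 + 2 * (m / sd) ** 2   # from var/mean^2 = 2/(nu0 - 4)
S0 = m * (nu0 - 2)
print(nu0, S0)                # about 11.03 and 2709; rounded to 11 and 2700
```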
Twelve plots treated with the hormone gave the following yields:
141, 102, 73, 171, 137, 91, 81, 157, 146, 69, 121, 134,
so that $n = 12$, $\sum x_i = 1423$ and $\sum x_i^2 = 181{,}789$, and so $\bar{x} = 118.58$ and

$$S = \sum x_i^2 - n\bar{x}^2 = 13{,}045.$$
Using the rounded values found earlier, the parameters of the posterior come to

$$\begin{aligned}
n_1 &= n_0 + n = 15 + 12 = 27,\\
\nu_1 &= \nu_0 + n = 11 + 12 = 23,\\
\theta_1 &= (n_0\theta_0 + n\bar{x})/n_1 = (15 \times 110 + 12 \times 118.58)/27 = 113.8,\\
S_1 &= S_0 + S + (n_0^{-1} + n^{-1})^{-1}(\theta_0 - \bar{x})^2 = 2700 + 13{,}045 + 491 = 16{,}236.
\end{aligned}$$

It follows that a posteriori

$$\phi\,|\,\boldsymbol{x} \sim 16{,}236\,\chi_{23}^{-2}, \qquad \theta\,|\,\phi, \boldsymbol{x} \sim \mathrm{N}(113.8, \phi/27).$$

In particular, $\phi$ is somewhere near $s_1^2 = S_1/\nu_1 = 706$ [actually the exact mean of $\phi$ is $S_1/(\nu_1 - 2) = 773$]. Using tables of $t$ on 23 degrees of freedom, a 95% HDR for $\theta$ is $113.8 \pm 2.069\sqrt{706/27}$, that is, the interval (103, 125). The chi-squared distribution can also be approximated by a normal distribution (see Appendix A).
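Putting all of this together numerically (a self-contained check of the figures above, using the twelve yields and the prior just fitted):

```python
import numpy as np
from scipy.stats import t

x = np.array([141, 102, 73, 171, 137, 91, 81, 157, 146, 69, 121, 134.0])
n0, theta0, nu0, S0 = 15, 110.0, 11, 2700.0

n, xbar = x.size, x.mean()                              # 12, 118.58
S = np.sum((x - xbar) ** 2)                             # about 13,045
n1, nu1 = n0 + n, nu0 + n                               # 27, 23
theta1 = (n0 * theta0 + n * xbar) / n1                  # about 113.8
S1 = S0 + S + (theta0 - xbar) ** 2 / (1 / n0 + 1 / n)   # about 16,236

scale = np.sqrt(S1 / (nu1 * n1))                        # s1/sqrt(n1)
tcrit = t.ppf(0.975, nu1)                               # about 2.069
print(theta1 - tcrit * scale, theta1 + tcrit * scale)
# about (103.2, 124.4), i.e. the interval (103, 125) rounding outwards
```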