Bayesian Statistics (4th ed)
It transpires that some densities which we are interested in turn up as stationary distributions for Markov chains.
9.2 The EM algorithm
9.2.1 The idea of the EM algorithm
A useful numerical technique which finds the posterior mode, that is, the value at which p(η | x), or equivalently log p(η | x), is a maximum, but does not provide full information about the posterior distribution, is the EM algorithm. We can exemplify this by an example on genetic linkage due to Rao (1973, Section 5g) quoted by Dempster et al. (1977), by Gelfand and Smith (1990) and by Tanner (1996, Section 4.1). We have observations x = (x1, x2, x3, x4) with cell probabilities

(1/2 + η/4, (1 − η)/4, (1 − η)/4, η/4)
and we want to estimate η. The likelihood is then

p(x | η) ∝ (2 + η)^(x1) (1 − η)^(x2+x3) η^(x4)
What we do is to augment the data x by adding further data to produce augmented data y = (y0, y1, x2, x3, x4), where y0 + y1 = x1. It should be noted that to a considerable extent the distinction between parameters of a model and augmentations to the data is artificial (i.e. fictitious augmentations to the data can be regarded as parameters of a model). In the example under consideration, the augmentation consists simply in splitting the first cell into two cells with probabilities 1/2 and η/4. The advantage of this is that the likelihood then takes the much simpler form

p(y | η) ∝ η^(y1+x4) (1 − η)^(x2+x3)
and if we take the standard reference prior Be(0, 0) for η, then the posterior is a beta distribution

η | y ~ Be(y1 + x4, x2 + x3).
Not much in life is free, and in this case we have to pay for the greater simplicity of the model by estimating the split of x1 into y0 and y1. The EM algorithm for finding the posterior mode is an iterative method starting from some plausible guess η^(0) for the value of η. At stage t, we suppose that the current guess is η^(t). Each stage has two steps. In the first, the E-step (expectation step), we compute

Q(η, η^(t)) = E{log p(η | y) | x, η^(t)}

that is, the expectation of the log-likelihood function, the expectation being computed at η = η^(t), so that

Q(η, η^(t)) = ∫ log p(η | y) p(y | x, η^(t)) dy.
In the second step, the M-step (the maximization step), we find the value η^(t+1) of η which maximizes Q(η, η^(t)). In this particular example, as yi = xi for i > 1, we have (up to a constant)

Q(η, η^(t)) = {E(y1 | x, η^(t)) + x4 − 1} log η + (x2 + x3 − 1) log(1 − η).

For the M-step, we note that

∂Q/∂η = {E(y1 | x, η^(t)) + x4 − 1}/η − (x2 + x3 − 1)/(1 − η)

and equating this to zero we deduce that

η^(t+1) = {E(y1 | x, η^(t)) + x4 − 1} / {E(y1 | x, η^(t)) + x2 + x3 + x4 − 2}.
Since y1 has a binomial distribution of index x1 and parameter (η/4)/{(2 + η)/4} = η/(η + 2), we have

E(y1 | x, η^(t)) = x1 η^(t)/(η^(t) + 2)

so that the iteration becomes

η^(t+1) = {x1 η^(t)/(η^(t) + 2) + x4 − 1} / {x1 η^(t)/(η^(t) + 2) + x2 + x3 + x4 − 2}.
The values actually observed were x1 = 125, x2 = 18, x3 = 20, x4 = 34. We can then estimate η by iteration starting from, for example, η^(0) = 0.5 and using the formula above.
In fact, the iteration will converge to the positive root of 195η^2 − 18η − 66 = 0, which is 0.630.
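The iteration is easily programmed; a minimal sketch in Python (the book's own examples use R):

```python
# EM iteration for the genetic linkage example with the Be(0, 0) prior.
# Data are the observed counts given in the text.
x1, x2, x3, x4 = 125, 18, 20, 34

eta = 0.5  # starting guess eta^(0)
for t in range(100):
    ey1 = x1 * eta / (eta + 2)                       # E-step: E(y1 | x, eta)
    eta = (ey1 + x4 - 1) / (ey1 + x2 + x3 + x4 - 2)  # M-step
# eta is now approximately 0.630
```

The fixed point of the iteration satisfies the quadratic quoted in the text.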
9.2.2 Why the EM algorithm works
We will first show that p(η^(t) | x) increases with t and, presuming that η^(t) converges to some limit η*, then show that ∂ log p(η | x)/∂η = 0 at η = η*, so that η* will normally be the posterior mode.
From the obvious equation p(η | x) = p(η, y | x)/p(y | η, x) considered conditional on x, it follows, on noting that y includes x and hence p(η | y, x) = p(η | y), that

log p(η | x) = log p(η | y) + log p(y | x) − log p(y | η, x).
Multiplying both sides by the density p(y | η^(t), x) and integrating with respect to y (noting that the left-hand side does not depend on y), we get, for any fixed η^(t),

log p(η | x) = Q(η, η^(t)) + K − H(η, η^(t))

where

K = ∫ log p(y | x) p(y | η^(t), x) dy,  H(η, η^(t)) = ∫ log p(y | η, x) p(y | η^(t), x) dy

(we note that K clearly does not depend on η). Taking η = η^(t+1) and η = η^(t) in turn and subtracting, we get

log p(η^(t+1) | x) − log p(η^(t) | x) = {Q(η^(t+1), η^(t)) − Q(η^(t), η^(t))} + {H(η^(t), η^(t)) − H(η^(t+1), η^(t))}.
Now Q(η^(t+1), η^(t)) ≥ Q(η^(t), η^(t)) because of the way in which we chose η^(t+1). Moreover,

H(η^(t), η^(t)) − H(η^(t+1), η^(t)) = ∫ (u − 1 − log u) p(y | η^(t), x) dy

where

u = p(y | η^(t+1), x)/p(y | η^(t), x)

(here we use the fact that ∫ (u − 1) p(y | η^(t), x) dy = 1 − 1 = 0). Because (as is easily seen by differentiation) u − 1 − log u takes a minimum value of 0 when u = 1, we see that H(η^(t), η^(t)) ≥ H(η^(t+1), η^(t)) for all t. Consequently,

log p(η^(t+1) | x) ≥ log p(η^(t) | x)

so that

p(η^(t+1) | x) ≥ p(η^(t) | x).
We now show that the limit η* is a stationary point. As η^(t+1) maximizes Q(η, η^(t)), we see that, provided Q is a reasonably smooth function,

∂Q(η, η^(t))/∂η = 0 at η = η^(t+1), and so, letting t → ∞, ∂Q(η, η*)/∂η = 0 at η = η*.
Further,

∂ log p(η | x)/∂η = ∂Q(η, η^(t))/∂η − ∂H(η, η^(t))/∂η

and in particular taking η^(t) = η*,

∂ log p(η | x)/∂η = ∂Q(η, η*)/∂η − ∂H(η, η*)/∂η.
Since H(η, η^(t)) ≤ H(η^(t), η^(t)) for any fixed η^(t), and in particular for η^(t) = η*, the function H(η, η*) has a maximum at η = η*, so that ∂H(η, η*)/∂η = 0 there. It now follows that

∂ log p(η | x)/∂η = 0 at η = η*,

so that η* is a stationary point, which will usually be the posterior mode.
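This monotonicity is easy to verify numerically in the linkage example, where (with the Be(0, 0) prior) log p(η | x) = (x4 − 1) log η + x1 log(2 + η) + (x2 + x3 − 1) log(1 − η) up to a constant; a short sketch in Python:

```python
import math

x1, x2, x3, x4 = 125, 18, 20, 34

def log_post(eta):
    # log p(eta | x) up to an additive constant, Be(0, 0) prior
    return ((x4 - 1) * math.log(eta) + x1 * math.log(2 + eta)
            + (x2 + x3 - 1) * math.log(1 - eta))

eta = 0.1  # deliberately poor starting value
values = []
for t in range(50):
    values.append(log_post(eta))
    ey1 = x1 * eta / (eta + 2)                       # E-step
    eta = (ey1 + x4 - 1) / (ey1 + x2 + x3 + x4 - 2)  # M-step

# the sequence log p(eta^(t) | x) never decreases
assert all(b >= a - 1e-9 for a, b in zip(values, values[1:]))
```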
More information about the EM algorithm can be found in Dempster et al. (1977) or Tanner (1996).
9.2.3 Semi-conjugate prior with a normal likelihood
Suppose that we have independent normally distributed observations x1, x2, …, xn ~ N(θ, φ). We use the notation

x̄ = Σ xi / n,  S = Σ (xi − x̄)².
In Section 2.13, we noted that the conjugate joint prior for the normal mean and variance is not a product of a function of θ and a function of φ, so that θ and φ are not independent a priori. Nevertheless, an assumption of prior independence would seem appropriate for many problems, and so we are sometimes led to consider a situation in which

θ ~ N(θ0, φ0),  φ ~ S0 χ_ν0^(−2)

independently of one another. Such a prior is described as semi-conjugate by Gelman et al. (2004, Section 3.4) and as conditionally conjugate by O’Hagan (2004, Section 6.33). The latter term arises because, conditional on knowledge of θ, it is a conjugate prior for φ, and vice versa.
With such a prior, we know that, conditional on knowledge of φ, the posterior of θ is

θ | x, φ ~ N(θ1, φ1)

where

φ1 = (1/φ0 + n/φ)^(−1),  θ1 = φ1 (θ0/φ0 + n x̄/φ)

(cf. Section 2.3), whereas, conditional on knowledge of θ, the posterior of φ is

φ | x, θ ~ S1 χ_n1^(−2)

where

n1 = ν0 + n,  S1 = S0 + Σ (xi − θ)²

(cf. Section 2.7).
In such a case, we can use the EM algorithm to estimate the posterior mode of θ, effectively augmenting the data with the variance φ. We begin by observing that (ignoring terms which do not depend on θ) the log posterior density is

log p(θ, φ | x) = −(θ − θ0)²/(2φ0) − Σ (xi − θ)²/(2φ) + constant.
To carry out the E-step we first note that since

S1/φ ~ χ²_n1  (conditional on x and θ = θ^(t))

and the expectation of a chi-squared variate is equal to its number of degrees of freedom, we know that

E(1/φ | x, θ^(t)) = n1/S1

where n1 = ν0 + n and S1 = S0 + Σ (xi − θ^(t))². For the M-step we must find the value θ^(t+1) which maximizes

Q(θ, θ^(t)) = −(θ − θ0)²/(2φ0) − (n1/S1) Σ (xi − θ)²/2.

But Q is of the form of the log of the posterior density we would have got for θ had we had the same prior N(θ0, φ0) and n observations with mean x̄ and known variance S1/n1. Now it follows from Section 2.3 on ‘Several normal observations with a normal prior’ that the posterior mean of θ in this case is at

θ^(t+1) = (θ0/φ0 + x̄ n n1/S1) / (1/φ0 + n n1/S1)

(the right-hand side depends on θ^(t) through S1), and so we see that this is the value of θ giving the required maximum.
To illustrate this case, we return to the data on wheat yield considered in the example towards the end of Section 2.13, in which n = 12 and S = 13 045. We will take the prior for the variance considered in that section, namely φ ~ S0 χ_ν0^(−2) with S0 = 2700. For the mean, we will take a prior θ ~ N(θ0, φ0), with θ0 and φ0 chosen to approximate well the values for the marginal distribution in that section. The resulting marginal densities are nearly the same, but because we now assume prior independence, the joint distribution is different.
It seems that a reasonable starting point for the iteration is θ^(0) = x̄. The successive iterates then settle down to a limiting value within a few iterations.
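In outline the iteration runs as follows. This Python sketch uses the data summaries quoted above, but the sample mean x̄ and the prior constants θ0, φ0 and ν0 are illustrative placeholders, not the values used in the book; note also the identity Σ(xi − θ)² = S + n(x̄ − θ)², which is how S1 is computed below.

```python
# EM for the posterior mode of theta under the semi-conjugate prior.
# n, S and S0 are the values quoted in the text; xbar, theta0, phi0 and
# nu0 are illustrative placeholders, NOT the values used in the book.
n, S = 12, 13045
S0 = 2700
nu0 = 11                      # hypothetical prior degrees of freedom
xbar = 119.0                  # hypothetical sample mean
theta0, phi0 = 110.0, 100.0   # hypothetical prior mean and variance

n1 = nu0 + n
theta = xbar                  # starting value theta^(0)
for t in range(200):
    # E-step: S1 = S0 + sum (x_i - theta)^2 = S0 + S + n (xbar - theta)^2
    S1 = S0 + S + n * (xbar - theta) ** 2
    w = n * n1 / S1           # data precision, n / (S1 / n1)
    # M-step: posterior mean with prior N(theta0, phi0), variance S1/n1
    theta = (theta0 / phi0 + xbar * w) / (1 / phi0 + w)
```

Whatever proper values are chosen for the placeholders, the limit is a weighted compromise between the prior mean and the sample mean.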
9.2.4 The EM algorithm for the hierarchical normal model
The EM algorithm is well suited to the analysis of hierarchical models, and we can see this by examining the important case of the hierarchical normal model. Although the formulae we shall derive do not look pretty, they are very easy to use. Suppose that

xij ~ N(θi, φ)   (i = 1, 2, …, r; j = 1, 2, …, ni)

with N = Σ ni, where

θi ~ N(μ, ψ)

as in Section 8.5, and that we wish to estimate the hyperparameters μ, φ and ψ. In this case (μ, φ, ψ) takes the place of η in the example on genetic linkage, while we augment the data by θ = (θ1, θ2, …, θr) to give augmented data (x, θ). We can use a reference prior for μ and φ, but (as mentioned in Section 8.5) this will not work for ψ. Accordingly, we will adopt the prior

ψ ~ S0 χ_ν0^(−2).
It is then easily seen that, taking p(μ, φ) ∝ 1/φ, we have (up to an additive constant)

log p(μ, φ, ψ | x, θ) = −(N/2 + 1) log φ − Σ (xij − θi)²/(2φ) − {(r + ν0)/2 + 1} log ψ − {S0 + Σ (θi − μ)²}/(2ψ).

To carry out the E-step, observe that, conditional on the current values of the hyperparameters, θi | x ~ N(wi, Vi) where

Vi = (1/ψ + ni/φ)^(−1),  wi = Vi (μ/ψ + ni x̄i./φ)

using the notation introduced in Section 6.5 for averaging over a suffix (cf. Section 2.3). We can now see that

E{(xij − θi)² | x} = (xij − wi)² + Vi

and similarly

E{(θi − μ)² | x} = (wi − μ)² + Vi

so that

Q = −(N/2 + 1) log φ − {Σ (xij − wi)² + Σ ni Vi}/(2φ) − {(r + ν0)/2 + 1} log ψ − [S0 + Σ {(wi − μ)² + Vi}]/(2ψ).

It is now clear that the M-step gives

μ^(t+1) = (1/r) Σ wi,
φ^(t+1) = {Σ (xij − wi)² + Σ ni Vi}/(N + 2),
ψ^(t+1) = [S0 + Σ {(wi − μ^(t+1))² + Vi}]/(r + ν0 + 2).
9.2.5 A particular case of the hierarchical normal model
Gelman et al. (2004, Table 11.2) quote the following data from Box et al. (1978, Table 6.1) on the coagulation time in seconds for blood drawn from N = 24 animals randomly allocated to four different diets:

Diet A: 62, 60, 63, 59
Diet B: 63, 67, 71, 64, 65, 66
Diet C: 68, 66, 71, 67, 68, 68
Diet D: 56, 62, 60, 61, 63, 64, 63, 59

The data has been slightly adjusted, so that the averages (61, 66, 68 and 61, respectively) come out to be whole numbers. It should perhaps be noted that the prior adopted in this and the preceding subsections differs slightly from that adopted by Gelman et al. The within-groups sum of squares is Σ (xij − x̄i.)² = 112 and the overall mean is 64. It therefore seems reasonable to start iterations from μ^(0) = 64 and φ^(0) = 112/20 = 5.6. As for ψ, we can take as a starting value the sample variance of the four group means.
This results in values of Vi and wi for each of the four groups. We can now feed these values in and commence the iteration, recomputing the hyperparameters at each stage; continuing the iteration, we get rapid convergence of μ, φ, ψ and the wi to their limiting values.
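Using the formulae of the preceding subsection, the iteration for the coagulation data can be sketched in Python as follows; the prior constants S0 and ν0 are illustrative placeholders, not the values used in the book.

```python
# EM for the hyperparameters of the hierarchical normal model, applied
# to the coagulation data. S0 and nu0 are illustrative placeholders.
data = [
    [62, 60, 63, 59],                  # diet A
    [63, 67, 71, 64, 65, 66],          # diet B
    [68, 66, 71, 67, 68, 68],          # diet C
    [56, 62, 60, 61, 63, 64, 63, 59],  # diet D
]
r = len(data)
ni = [len(g) for g in data]
N = sum(ni)
xbar = [sum(g) / len(g) for g in data]  # group means: 61, 66, 68, 61

S0, nu0 = 10.0, 2.0  # hypothetical prior constants for psi

# starting values: overall mean, within-groups mean square, and the
# sample variance of the four group means
mu = sum(sum(g) for g in data) / N
phi = sum((x - xbar[i]) ** 2 for i, g in enumerate(data) for x in g) / (N - r)
psi = sum((m - mu) ** 2 for m in xbar) / (r - 1)

for t in range(200):
    # E-step: conditional mean w_i and variance V_i of each theta_i
    V = [1 / (1 / psi + n / phi) for n in ni]
    w = [V[i] * (mu / psi + ni[i] * xbar[i] / phi) for i in range(r)]
    # M-step
    mu = sum(w) / r
    phi = (sum((x - w[i]) ** 2 for i, g in enumerate(data) for x in g)
           + sum(ni[i] * V[i] for i in range(r))) / (N + 2)
    psi = (S0 + sum((w[i] - mu) ** 2 + V[i] for i in range(r))) / (r + nu0 + 2)
```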
9.3 Data augmentation by Monte Carlo
9.3.1 The genetic linkage example revisited
We can illustrate this technique with the example on genetic linkage we considered in connection with the EM algorithm. Recall that we found that the likelihood of the augmented data was

p(y | η) ∝ η^(y1+x4) (1 − η)^(x2+x3).
In this method, we suppose that at each stage we have a ‘current’ distribution for η, which initially is the prior distribution. At all stages, this has to be a proper distribution, so we may as well take our prior as the uniform distribution Be(1, 1), which in any case differs little from the reference prior Be(0, 0). At the tth stage in the imputation step, we pick m possible values η(1), …, η(m) of η by some (pseudo-)random mechanism with the current density, and then for each of these values η(i) we generate a value for the augmented data y(i), which in the particular example simply means picking a value y1(i) with a binomial distribution of index x1 and parameter η(i)/(η(i) + 2). Since we had a Be(1, 1) prior, this gives a posterior

η | y(i) ~ Be(y1(i) + x4 + 1, x2 + x3 + 1).
In the posterior step we now take the new ‘current’ distribution of η as a mixture of the m beta distributions so generated, all values of i being treated as equally likely (cf. the end of Section 2.10). We could then construct an estimate for the posterior from a histogram of the values of η found at each iteration, but a better and smoother estimate results from taking the ‘current’ distribution at the end of the iteration.
It is worth noting that, in practice, it is inefficient to take m large during the first few iterations, when the ‘current’ distribution is still far from the true posterior. Rather, it is suggested that m initially be small and then increased with successive iterations (Tanner and Wong, 1987).
9.3.2 Use of R
A number of examples in this chapter and the next are set out as programs in R. The R project for statistical computing is described on the web page
http://www.r-project.org/
and useful books covering it are Dalgaard (2008), Krause and Olsen (2000), Fox (2002) and Venables and Ripley (2002). A very good book on Bayesian statistics with examples in R is Albert (2009). At a lower level, Gill (2002) is useful (although it may be noted that his highest posterior densities (HPDs) are only approximately highest density regions). Another book with much useful material is Robert and Casella (2010). For the ordinary user, R is virtually indistinguishable from S-PLUS, but has the advantage that it is free.
Even if you do not know R, these programs should be reasonably easy to follow once you realize that <- (a form of arrow) is an assignment operator. In fact, R programs are provided on the web site associated with the book for all the numerical examples in this book, making use of the programs for finding HDRs and various functions associated with Behrens’ distribution which can be found in Appendix C.
9.3.3 The genetic linkage example in R
A program in R for the genetic linkage example is as follows:
niter <- 50
minitial <- 20
etadivs <- 1000
mag <- 10
scale <- 8
x <- c(125,18,20,34)
pagethrow <- 12
m <- minitial
eta <- runif(m) # random from U(0,1)
y <- rbinom(m,x[1],eta/(eta+2)) # random binomial
for (t in 1:niter)
{ mold <- m
  if (t > 30)
    m <- 200
  if (t > 40)
    m <- 1000
  i0 <- floor(runif(m,0,mold)) + 1 # random indices in 1:mold
  eta <- rbeta(m,y[i0]+x[4]+1,x[2]+x[3]+1) # random beta
  y <- rbinom(m,x[1],eta/(eta+2))
}
p <- rep(0,etadivs) # vector of etadivs zeroes
for (etastep in 1:(etadivs-1)){
  eta <- etastep/etadivs
  term <- exp((y+x[4])*log(eta) + (x[2]+x[3])*log(1-eta)
    +lgamma(y+x[2]+x[3]+x[4]+2)-lgamma(y+x[4]+1)
    -lgamma(x[2]+x[3]+1)) # lgamma is log gamma fn
  p[etastep] <- p[etastep] + sum(term)/m
}
plot(1:etadivs/etadivs,p,pch=".",xlab="eta")
The resulting plot is shown as Figure 9.1. The function is thus evaluated for η = 1/n, 2/n, …, (n − 1)/n, where n = etadivs, and can now be used to estimate the posterior density of η. A simulation in R with T = 50, m = m(t) = 20 for t ≤ 30, m = 200 for 30 < t ≤ 40, m = 1000 for t > 40, and n = 1000 showed a posterior mode close to the value found earlier using the EM algorithm. However, it should be noted that this method gives an approximation to the whole posterior distribution as opposed to the EM algorithm, which will only give the mode.
Figure 9.1 Plot of the estimated posterior density of η for the linkage example.
Programs and algorithms for generating pseudo-random numbers with many common distributions can be found in Press et al. (1986–1993) or Kennedy and Gentle (1980); it may help to note that discrete uniform variates are easily constructed by seeing which of m equal intervals a U(0, 1) variate falls in, and a beta variate w ~ Be(α, β) can be constructed as w = u/(u + v), where u ~ G(α) and v ~ G(β) are independent gamma variates.
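The gamma construction of a beta variate is easily checked by simulation; a minimal sketch in Python using only the standard library (the choice α = 2, β = 4 is arbitrary):

```python
import random

random.seed(1)

def rbeta(alpha, beta):
    # w = u/(u+v) with independent u ~ G(alpha), v ~ G(beta)
    # has a Be(alpha, beta) distribution
    u = random.gammavariate(alpha, 1.0)
    v = random.gammavariate(beta, 1.0)
    return u / (u + v)

samples = [rbeta(2, 4) for _ in range(20000)]
mean = sum(samples) / len(samples)
# E w = alpha/(alpha + beta) = 1/3 for a Be(2, 4) variate
```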
The data augmentation technique was originated by Tanner and Wong (1987) and is discussed in detail in Tanner (1996).
9.3.4 Other possible uses for data augmentation
Suppose that x1, x2, …, xn are independently N(θ, φ), where θ is unknown but φ is known, but that instead of a normal prior for θ you have a prior such that

(θ − μ)/s

has a Student’s t distribution on ν degrees of freedom (where μ and s are known). The posterior for θ in this case is of a complicated form, but if we augment the data with a parameter ψ such that

θ | ψ ~ N(μ, ψ),  ψ ~ νs² χ_ν^(−2),

so that the unconditional distribution of θ is as above (cf. Section 2.12; n in that section is here replaced by unity), then the conditional distribution of θ given ψ and x is normal, and we can now use the algorithm to find the posterior distribution of θ.
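The two conditional distributions needed are easily derived: a short calculation gives ψ | θ ~ (νs² + (θ − μ)²) χ_(ν+1)^(−2), while given ψ the posterior of θ is normal, as in the semi-conjugate case above. A minimal alternating sketch in Python, with all data values and constants invented purely for illustration:

```python
import random

random.seed(2)

# Hypothetical data and constants, chosen purely for illustration
n, xbar, phi = 10, 3.0, 4.0   # n observations, sample mean, known variance
mu, s, nu = 0.0, 1.0, 4       # t prior: (theta - mu)/s ~ t_nu

def draw_inv_chi2(df, scale):
    # a scale * chi^{-2}_df variate: scale over a chi-squared variate
    w = sum(random.gauss(0, 1) ** 2 for _ in range(df))
    return scale / w

theta = xbar                  # initialize at the sample mean
thetas = []
for i in range(5000):
    # psi given theta: (nu s^2 + (theta - mu)^2) chi^{-2}_{nu+1}
    psi = draw_inv_chi2(nu + 1, nu * s ** 2 + (theta - mu) ** 2)
    # theta given psi and x: normal, as with a N(mu, psi) prior
    phi1 = 1 / (1 / psi + n / phi)
    theta1 = phi1 * (mu / psi + n * xbar / phi)
    theta = random.gauss(theta1, phi1 ** 0.5)
    thetas.append(theta)

post_mean = sum(thetas[1000:]) / len(thetas[1000:])
```

The estimated posterior mean lies between the prior location μ and the sample mean, as one would expect.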
Further examples can be found in Tanner (1996).
9.4 The Gibbs sampler
9.4.1 Chained data augmentation
We will now restrict our attention to cases where m = 1 and the augmented data y consists of the original data x augmented by a single scalar z (as in the linkage example). The algorithm can then be expressed as follows: start from a value η(0) generated from the prior distribution for η and a value z(0) generated from the density p(z | η(0), x), and then iterate as follows:
(a1) Choose a value η(i+1) of η from the density p(η | z(i), x);
(a2) Choose a value z(i+1) of z from the density p(z | η(i+1), x).
(There is, of course, a symmetry between η and z, and the notation is used simply because it arose in connection with the first example we considered in connection with the data augmentation algorithm.) This version of the algorithm can be referred to as chained data augmentation, since it is easy to see that the distribution of the next pair of values (η(i+1), z(i+1)) given the values up to now depends only on the present pair (η(i), z(i)), and so these pairs move as a Markov chain. It is a particular case of a numerical method we shall refer to as the Gibbs sampler. As a result of the properties of Markov chains, after a reasonably large number T of iterations, the resulting values of η and z have a joint density which is close to p(η, z | x), irrespective of how the chain started.
Successive observations of the pair will not, in general, be independent, so in order to obtain a set of observations which are, to all intents and purposes, independently identically distributed (i.i.d.), we can run the aforementioned process through T successive iterations, retaining only the final value obtained, on k different replications.
We shall illustrate the procedure by the following example due to Casella and George (1992). We suppose that π and y have the following joint distribution:

p(y, π) ∝ C(n, y) π^(y+α−1) (1 − π)^(n−y+β−1)   (y = 0, 1, …, n; 0 ≤ π ≤ 1)

and that we are interested in the marginal distribution of y. Rather than integrating with respect to π (as we did at the end of Section 3.1), which would show that y has a beta-binomial distribution, we proceed to find the required distribution from the two conditional distributions:

y | π ~ B(n, π),  π | y ~ Be(y + α, n − y + β).

This is a simple case in which there is no observed data x. We need to initialize the process somewhere, and so we may as well begin with a value of π which is chosen from a U(0, 1) distribution. Again the procedure is probably best expressed as an R program, which in this case constructs a histogram (a better and smoother estimate would result from averaging the ‘current’ Be(y + α, n − y + β) densities at the end of each iteration):
nr <- 50
m <- 500
k <- 10
n <- 16
alpha <- 2.0
beta <- 4.0
lambda <- 16.0
maxn <- 24
h <- rep(0,n+1)
for (i in 1:m)
{
  pi <- runif(1)
  for (j in 1:k)
  {
    y <- rbinom(1,n,pi)
    newalpha <- y + alpha
    newbeta <- n - y + beta
    pi <- rbeta(1,newalpha,newbeta)
  }
  for (t in 0:n)
  {
    if (t == y)
      h[t+1] <- h[t+1] + 1
  }
}
barplot(h)
The resultant histogram is given as Figure 9.2. References for the generation of ‘pseudo-random’ numbers can be found in the previous section. An implementation in R with T = 10 iterations replicated m = 500 times for the case n = 16, α = 2 and β = 4 will, as Casella and George remark, give a very good approximation to the beta-binomial density.
Figure 9.2 Barplot of values of y between 0 and 16.
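As a check, the beta-binomial density being approximated can be written down exactly. The sketch below (in Python rather than R) runs the same chained scheme, drawing the beta variates via the gamma-ratio construction mentioned in the previous section, and compares the resulting frequencies with the exact density; the number of replications is chosen arbitrarily.

```python
import math
import random

random.seed(3)

n, alpha, beta = 16, 2.0, 4.0

def beta_binomial_pmf(y):
    # p(y) = C(n, y) B(y + alpha, n - y + beta) / B(alpha, beta)
    return math.exp(math.lgamma(n + 1) - math.lgamma(y + 1)
                    - math.lgamma(n - y + 1)
                    + math.lgamma(y + alpha) + math.lgamma(n - y + beta)
                    - math.lgamma(n + alpha + beta)
                    + math.lgamma(alpha + beta)
                    - math.lgamma(alpha) - math.lgamma(beta))

exact = [beta_binomial_pmf(y) for y in range(n + 1)]

# chained data augmentation: 10 steps per replication, 5000 replications
counts = [0] * (n + 1)
for i in range(5000):
    pi = random.random()                                 # pi ~ U(0, 1)
    for j in range(10):
        y = sum(random.random() < pi for _ in range(n))  # y | pi ~ B(n, pi)
        u = random.gammavariate(y + alpha, 1.0)
        v = random.gammavariate(n - y + beta, 1.0)
        pi = u / (u + v)             # pi | y ~ Be(y + alpha, n - y + beta)
    counts[y] += 1                   # retain only the final value of y
estimate = [c / 5000 for c in counts]
```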
9.4.2 An example with observed data
Gaver and O’Muircheartaigh (1987) quote the data given below on rates of loss of feedwater flow for a collection of nuclear power systems. In the following table, a small set of data representing failures of pumps in several systems of the nuclear plant Farley 1 is given. The apparent variation in failure rates has several sources.