Another artificial example quoted by O'Hagan (1994, Example 8.11) arises if $\theta = (\theta^{(1)}, \theta^{(2)})$ where $\theta^{(1)}$ and $\theta^{(2)}$ take only the values 0 and 1 and have joint density

$$p(0, 0) = p(1, 1) = \tfrac{1}{2}\pi, \qquad p(0, 1) = p(1, 0) = \tfrac{1}{2}(1 - \pi)$$

for some $\pi$ ($0 < \pi < 1$), so that

$$p(\theta^{(1)} = 1 \mid \theta^{(2)} = 1) = \pi, \qquad p(\theta^{(1)} = 1 \mid \theta^{(2)} = 0) = 1 - \pi$$

and similarly for $\theta^{(2)}$ given $\theta^{(1)}$. It is not hard to see that if we operate a Gibbs sampler with these conditional distributions, then the probability $p_t$ that the component updated at stage $t$ takes the value 1 satisfies

$$p_t = \pi p_{t-1} + (1 - \pi)(1 - p_{t-1})$$

from which it is easy to deduce that $p_t$ satisfies the difference equation $p_t = \alpha p_{t-1} + \beta$, where $\alpha = 2\pi - 1$ and $\beta = 1 - \pi$. It is easily checked (using the fact that $\alpha + 2\beta = 1$) that the solution of this equation for a given value of $p_0$ is

$$p_t = \alpha^t\left(p_0 - \tfrac{1}{2}\right) + \tfrac{1}{2}$$

and clearly $p_t \to \tfrac{1}{2}$ as $t \to \infty$, but if $\pi$ is close to 0 or 1, then $\alpha$ is close to $\pm 1$ and convergence to this limit can be very slow. If, for example, $\pi = 0.998$, so that $\alpha = 0.996$, then $p_{128} = 0.2$ if $p_0 = 0$ but $p_{128} = 0.8$ if $p_0 = 1$. A further problem is that the series may appear to have converged even though it has not – if the process starts from (0, 0), then the number of iterations until $\theta$ is not equal to (0, 0) is geometrically distributed with mean $1/(1 - \pi) = 500$.
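The behaviour is easily checked by simulation. The following minimal R sketch (ours, not from the text) takes $\pi = 0.998$ and starts all chains from (0, 0); since each single-site Gibbs update equals the other component with probability $\pi$, tracking the most recently updated component reproduces the difference equation above exactly:

# Simulate M replicate chains; cur holds the component just updated
set.seed(1)
pr <- 0.998
M <- 10000
cur <- numeric(M)   # all chains start from (0, 0)
for (t in 1:128)    # one single-site Gibbs update per step
  cur <- ifelse(runif(M) < pr, cur, 1 - cur)
mean(cur)           # about 0.2, still far from the limiting value 0.5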
Problems of convergence for Markov chain Monte Carlo methods such as the Gibbs sampler are very difficult and by no means fully solved, and there is not room here for a full discussion of the problems involved. A fuller discussion can be found in O'Hagan (1994, Section 8.65 et seq.), Gilks et al. (1996, especially Chapter 7), Robert (1995), Raftery and Lewis (1996) and Cowles and Carlin (1996). Evidently, it is necessary to look at distributions obtained at various stages and see whether it appears that convergence to an equilibrium distribution is taking place, but even in cases where this is apparently so, one should be aware that it could be an illusion. In the examples in this chapter, we have stressed the derivation of the algorithms rather than diagnostics as to whether we really have achieved equilibrium.
9.5 Rejection sampling
9.5.1 Description
An important method which does not make use of Markov chain methods, but which helps to introduce the Metropolis–Hastings algorithm, is rejection sampling or acceptance-rejection sampling. This is a method for use in connection with a density $p(\theta) = f(\theta)/K$ in the case where the normalizing constant $K$ is quite possibly unknown, which, as remarked at the beginning of this chapter, is a typical situation occurring in connection with posterior distributions in Bayesian statistics. To use this method, we need to assume that there is a candidate density $h(\theta)$ from which we can simulate samples and a constant $c$ such that $f(\theta) \leqslant c\,h(\theta)$ for all $\theta$. Then, to obtain a random variable with density $p(\theta)$, we proceed as follows:
1. Generate a variate $Y$ from the density $h$;
2. Generate a value $U$ which is uniformly distributed on (0, 1);
3. Then if $U \leqslant f(Y)/\{c\,h(Y)\}$ we define $\theta = Y$; otherwise go back to step 1.
In the discrete case, we can argue as follows. In $N$ trials we get the value $\theta$ on about $N h(\theta)$ occasions, of which we retain it on about $N h(\theta) \times f(\theta)/\{c\,h(\theta)\} = N f(\theta)/c$ occasions, a number which is proportional to $f(\theta)$. Similar considerations apply in the continuous case; a formal proof can be found in Ripley (1987, Section 3.2) or Gamerman and Lopes (2011, Section 5.1).
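In R, the accept-reject step can be written generically as follows (a sketch of ours, not from the text: f is the un-normalized target, h the candidate density, rh a function generating samples from h, and const the constant $c$):

# Generic rejection sampler: returns n draws from the density f/K,
# assuming f(y) <= const*h(y) for all y
rejection.sample <- function(n, f, h, rh, const) {
  out <- numeric(0)
  while (length(out) < n) {
    y <- rh(n)                                 # step 1: candidates from h
    u <- runif(n)                              # step 2: U(0, 1) variates
    out <- c(out, y[u <= f(y)/(const*h(y))])   # step 3: accept or retry
  }
  out[1:n]
}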
9.5.2 Example
As a simple example, we can use this method to generate values $X$ from a beta distribution with density $f(x)/K$, where $f(x) = x^{\alpha-1}(1-x)^{\beta-1}$, in the case where $\alpha > 1$ and $\beta > 1$ (in fact, we know that in this case $K = B(\alpha, \beta)$). We simply note that the density has a maximum at $x = (\alpha-1)/(\alpha+\beta-2)$ (see Appendix A), so that we can take $c = f\big((\alpha-1)/(\alpha+\beta-2)\big)$ and $h(x) = 1$, that is, $h$ the density of a uniform distribution on (0, 1), so that $Y \sim \mathrm{U}(0, 1)$.
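With the sketch above, the example can be run as follows (taking $\alpha = \beta = 2$ for definiteness, so that $f(x) = x(1-x)$ and $c = f(1/2) = 1/4$):

alpha <- 2; beta <- 2
f <- function(x) x^(alpha-1)*(1-x)^(beta-1)
const <- f((alpha-1)/(alpha+beta-2))   # maximum of f, here 1/4
x <- rejection.sample(5000, f, h=function(x) 1, rh=runif, const=const)
# The histogram of x should match the Be(2, 2) density dbeta(x, 2, 2)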
9.5.3 Rejection sampling for log-concave distributions
It is often found that inference problems simplify when a distribution is log-concave, that is, when the logarithm of the density function is concave (see Walther, 2009). In such a case, we can fit a piecewise linear curve over the log density as depicted in Figure 9.4a. In that particular case, there are three pieces to the curve, but in general there could be more. Exponentiating, we find an upper bound for the density itself which consists of curves of the form exp(a+bx), as depicted in Figure 9.4b. As these curves are easily integrated, the proportions of the area under the upper bound which fall under each of the curves are easily calculated. We can then generate a random variable with a density proportional to the upper bound in two stages. We first choose a constituent curve $\exp(a + bx)$ between $x = t$ and $x = u$ with a probability proportional to

$$\int_t^u \exp(a + bx)\,\mathrm{d}x = \{\exp(a + bu) - \exp(a + bt)\}/b.$$
Figure 9.4 Rejection sampling: (a) log density, (b) un-normalized density, (c) empirical density before rejection and (d) empirical density of v.
After this, a point between $t$ and $u$ is chosen as a random variable with a density proportional to $\exp(a + bx)$.

For the latter step, we note that such a random variable has distribution function

$$F(x) = \frac{\exp(bx) - \exp(bt)}{\exp(bu) - \exp(bt)} \qquad (t \leqslant x \leqslant u).$$

The inverse function of this is easily shown to be

$$F^{-1}(v) = \frac{1}{b}\log\{\exp(bt) + v[\exp(bu) - \exp(bt)]\}.$$

If $V$ is uniformly distributed over [0, 1] and we write $X = F^{-1}(V)$, it follows that

$$P(X \leqslant x) = P(F^{-1}(V) \leqslant x) = P(V \leqslant F(x)) = F(x),$$

so that $X$ has the distribution function $F(x)$.
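In R, the inversion step for a single piece of the envelope might be written as follows (a sketch of ours; note that $a$ cancels on normalization, so only $b$, $t$ and $u$ are needed):

# Draw n variates from the density proportional to exp(a + b*x) on [t, u];
# for b < 0 the formula remains valid with u = Inf (exponential tail)
rpiece <- function(n, b, t, u) {
  v <- runif(n)
  if (b == 0) t + v*(u - t)   # flat piece: uniform on [t, u]
  else (1/b)*log(exp(b*t) + v*(exp(b*u) - exp(b*t)))
}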
Figure 9.4c shows that simulation using this method produces an empirical distribution which fits closely to the upper envelope of curves we have chosen. We can now use the method of rejection sampling as described earlier to find a sample from the original log-concave density. The empirical distribution thus produced is seen to fit closely to this density in Figure 9.4d.
The choice of the upper envelope can be made to evolve as the sampling proceeds; for details, see Gilks and Wild (1992).
9.5.4 A practical example
In Subsection 9.4.2 headed 'An example with observed data', which concerned pump failures, we assumed that the parameter ν was known to be equal to 1.4. It would clearly be better to treat ν as a parameter to be estimated in the same way that we estimated $S_0$. Now

$$p(\nu \mid \boldsymbol\theta, S_0, \boldsymbol{x}) = p(\nu \mid \boldsymbol\theta, S_0)$$

since the distribution of $\boldsymbol{x}$ given $\boldsymbol\theta$ does not depend on ν or $S_0$. Consequently

$$p(\nu \mid \boldsymbol\theta, S_0) \propto p(\nu)\, p(\boldsymbol\theta \mid \nu, S_0) = p(\nu) \prod_{i=1}^{k} p(\theta_i \mid \nu, S_0).$$

Unfortunately, the latter expression is not proportional to any standard family for any choice of hyperprior $p(\nu)$. If, however,

$$p(\nu) \propto \exp(-\nu/\mu)$$

has the form of an exponential distribution of mean μ, we find that

$$p(\theta_i \mid \nu, S_0) = \frac{(S_0/2)^{\nu/2}}{\Gamma(\nu/2)}\,\theta_i^{\nu/2 - 1} \exp(-S_0\theta_i/2)$$

so that

$$p(\nu \mid \boldsymbol\theta, S_0) \propto (S_0/2)^{k\nu/2}\,\Gamma(\nu/2)^{-k}\,\Big(\prod_i \theta_i\Big)^{\nu/2 - 1} \exp(-\nu/\mu).$$
The logarithm of this density is concave. A proof of this follows, but the result can be taken for granted. We observe that the first derivative of the log density is

$$\frac{k}{2}\log(S_0/2) - \frac{k}{2}\psi(\nu/2) + \frac{1}{2}\sum_i \log\theta_i - \frac{1}{\mu}$$

where

$$\psi(z) = \frac{\mathrm{d}\log\Gamma(z)}{\mathrm{d}z}$$

is the digamma function (cf. Appendix A.7), while the second derivative is

$$-\frac{k}{4}\psi'(\nu/2)$$

where $\psi'$ is the so-called trigamma function. Since the trigamma function satisfies

$$\psi'(z) = \sum_{n=0}^{\infty} \frac{1}{(z + n)^2} > 0$$

(see Whittaker and Watson, 1927, Section 12.16), it is clear that this second derivative is negative for all $\nu > 0$, from which concavity follows.
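Since R provides the digamma and trigamma functions directly, the sign of the second derivative is easily checked numerically (a quick sanity check of ours, not part of the program below):

k <- 10
d2logf <- function(nu) -(k/4)*trigamma(nu/2)   # second derivative above
all(d2logf(seq(0.1, 50, by=0.1)) < 0)          # TRUE: log density is concave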
We now find a piecewise linear function which is an upper bound of the log density in the manner described earlier. In this case, it suffices to use a three-piece function, as illustrated in Figure 9.4a. This function is constructed by taking a horizontal line tangent at the zero n2 of the derivative of the log density, where this achieves its maximum h2, and then taking tangents at two nearby points n1 and n3. The joins between the lines occur at the points t1 and t3 where these three lines intersect. Exponentiating, this gives an upper bound for the density itself, consisting of a number of functions of the form exp(a+bx) (the central one being a constant), as illustrated in Figure 9.4b.
In this case, the procedure is very efficient and some 87–88% of candidate points are accepted. The easiest way to understand the details of this procedure is to consider the program:
windows(record=T)
par(mfrow=c(2,2))
N <- 5000
mu <- 1
S <- 1.850
theta <- c(0.0626,0.1181,0.0937,0.1176,0.6115,
0.6130,0.8664,0.8661,1.4958,1.9416)
k <- length(theta)
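# logf is the log of the un-normalized density of nu found above;
# dlogf is its first derivative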
logf <- function(nu) {
k*(nu/2)*log(S)-k*(nu/2)*log(2)-k*log(gamma(nu/2))+
(nu/2-1)*sum(log(theta))-nu/mu
}
dlogf <- function(nu) {
(k/2)*log(S)-(k/2)*log(2)-0.5*k*digamma(nu/2)+
0.5*sum(log(theta))-1/mu
}
scale <- 3/2
f <- function(nu) exp(logf(nu))
n2 <- uniroot(dlogf,lower=0.01,upper=100)$root
h2 <- logf(n2)
n1 <- n2/scale
h1 <- logf(n1)
b1 <- dlogf(n1)
a1 <- h1-n1*b1
n3 <- n2*scale
h3 <- logf(n3)
b3 <- dlogf(n3)
a3 <- h3-n3*b3
d <- exp(h2)
# First plot; Log density
curve(logf,0.01,3.5,ylim=c(-8,-1),xlab="a. Log density")
abline(h=h2)
abline(a1,b1)
abline(a3,b3)
text(0.25,h2+0.3,expression(italic(h)[2]))
text(0.3,-6,
expression(italic(a)[1]+italic(b)[1]*italic(x)))
text(3.25,-7,
expression(italic(a)[3]+italic(b)[3]*italic(x)))
# End of first plot
f1 <- function(x) exp(a1+b1*x)
f3 <- function(x) exp(a3+b3*x)
t1 <- (h2-a1)/b1
t3 <- (h2-a3)/b3
# Second plot; Un-normalized density
curve(f,0.01,3.5,ylim=c(0,0.225),
xlab="b. Un-normalized density")
abline(h=d)
par(new=T)
curve(f1,0.01,3.5,ylim=c(0,0.225),ylab="",xlab="")
par(new=T)
curve(f3,0.01,3.5,ylim=c(0,0.225),ylab="",xlab="")
abline(v=t1)
abline(v=t3)
text(t1-0.1,0.005,expression(italic(t)[1]))
text(t3-0.1,0.005,expression(italic(t)[3]))
text(0.25,0.2,expression(italic(d)))
text(0.5,0.025,expression(italic(f)[1]))
text(2.5,0.025,expression(italic(f)[3]))
# End of second plot
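# q1, q2 and q3 are the areas under the three envelope pieces: the
# left exponential piece on (0, t1], the flat piece on [t1, t3] and
# the right exponential piece on [t3, infinity)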
q1 <- exp(a1)*(exp(b1*t1)-1)/b1
q2 <- exp(h2)*(t3-t1)
q3 <- -exp(a3+b3*t3)/b3
const <- q1 + q2 + q3
p <- c(q1,q2,q3)/const
cat("p =",p,"\n")
case <- sample(1:3,N,replace=T,prob=p)
v <- runif(N)
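# Invert the distribution function of the selected piece (see the
# inversion formula in Section 9.5.3) to turn v into a draw w from
# the density proportional to the envelope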
w <- (case==1)*(1/b1)*log(q1*b1*exp(-a1)*v+1)+
(case==2)*(t1+v*(t3-t1))+
(case==3)*(1/b3)*log(q3*b3*exp(-a3)*v+exp(b3*t3))
dq <- d/const
f1q <- function(x) f1(x)/const
f3q <- function(x) f3(x)/const
h <- function(x) {
dq +
(x<t1)*(f1q(x)-dq) +
(x>t3)*(f3q(x)-dq)
}
# Third plot; Empirical density
plot(density(w),xlim=c(0.01,4),ylim=c(0,1.1),
xlab="c. Empirical density before rejection",
main="")
par(new=T)
curve(h,0.01,4,ylim=c(0,1.1),ylab="",xlab="")
# End of third plot
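# Rejection step: accept each w with probability f(w)/(const*h(w)),
# const*h being the un-normalized envelope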
u <- runif(N)
nu <- w
nu[u>f(nu)/(const*h(nu))] <- NA
nu <- nu[!is.na(nu)]
cat("Acceptances",100*length(nu)/length(w),"\n")
int <- integrate(f,0.001,100)$value
fnorm <- function(x) f(x)/int
# Fourth plot
plot(density(nu),xlim=c(0.01,4),ylim=c(0,1.1),
xlab=bquote(paste("d. Empirical density of ",nu)),
main="")
par(new=T)
curve(fnorm,0.01,4,ylim=c(0,1.1),ylab="",xlab="")
# End of fourth plot
It is possible to combine this procedure with the method used earlier to estimate the $\theta_i$ and $S_0$, and an R program to do this can be found on the web. We shall not discuss this matter further here, since in this case it is considerably simpler to use WinBUGS, as noted in Section 9.7.
9.6 The Metropolis–Hastings algorithm
9.6.1 Finding an invariant distribution
This whole section is largely based on the clear expository article by Chib and Greenberg (1995). A useful general review which covers Bayesian integration problems more generally can be found in Evans and Swartz (1995–1996; 2000).
A topic of major importance when studying the theory of Markov chains is the determination of conditions under which there exists a stationary distribution (sometimes called an invariant distribution), that is, a distribution $p(\theta)$ such that

$$p(\phi) = \sum_\theta p(\theta)\, p(\theta, \phi),$$

where $p(\theta, \phi)$ denotes the probability of a transition from state $\theta$ to state $\phi$. In the case where the state space is continuous, the sum is, of course, replaced by an integral.
In Markov Chain Monte Carlo methods, often abbreviated as MCMC methods, the opposite problem is encountered. In cases where such methods are to be used, we have a target distribution in mind from which we want to take samples, and we seek transition probabilities for which this target density is the invariant density. If we can find such probabilities, then we can start a Markov chain at an arbitrary starting point and let it run for a long time, after which the probability of being in state θ will be given by the target density .
It turns out to be useful to let $s(\theta)$ be the probability that the chain remains in the same state and then to write

$$p(\theta, \phi) = p^*(\theta, \phi) + s(\theta)\,[\phi = \theta]$$

where $[\phi = \theta]$ is 1 if $\phi = \theta$ and 0 otherwise, while $p^*(\theta, \phi) = p(\theta, \phi)$ for $\phi \neq \theta$ and $p^*(\theta, \theta) = 0$. Note that because the chain must go somewhere from its present position θ we have

$$\sum_\phi p^*(\theta, \phi) + s(\theta) = 1$$

for all θ. If the equation

$$p(\theta)\, p^*(\theta, \phi) = p(\phi)\, p^*(\phi, \theta)$$

is satisfied, we say that the transition probabilities satisfy time reversibility or detailed balance. This condition means that the probability of starting at θ and ending at φ when the initial probabilities are given by $p(\theta)$ is the same as that of starting at φ and ending at θ. It turns out that when it is satisfied, then $p(\theta, \phi)$ gives a set of transition probabilities with $p(\theta)$ as an invariant density. For

$$\sum_\theta p(\theta)\, p(\theta, \phi) = \sum_\theta p(\theta)\, p^*(\theta, \phi) + p(\phi)\, s(\phi) = \sum_\theta p(\phi)\, p^*(\phi, \theta) + p(\phi)\, s(\phi) = p(\phi)\Big\{\sum_\theta p^*(\phi, \theta) + s(\phi)\Big\} = p(\phi).$$

The result follows very similarly in cases where the state space is continuous.

The aforementioned proof shows that all we need to find the required transition probabilities is to find probabilities which are time reversible.
9.6.2 The Metropolis–Hastings algorithm
The Metropolis–Hastings algorithm begins by drawing from a candidate density as in rejection sampling but, because we are considering Markov chains, the density depends on the current state of the process. We denote the candidate density by $q(\theta, \phi)$ and we suppose that $\sum_\phi q(\theta, \phi) = 1$. If it turns out that the density $q(\theta, \phi)$ is itself time reversible, so that $p(\theta)q(\theta, \phi) = p(\phi)q(\phi, \theta)$, then we need look no further. If, however, we find that

$$p(\theta)\, q(\theta, \phi) > p(\phi)\, q(\phi, \theta),$$

then it appears that the process moves from θ to φ too often and from φ to θ too rarely. We can reduce the number of moves from θ to φ by introducing a probability $\alpha(\theta, \phi)$, called the probability of moving, that the move is made. In order to achieve time reversibility, we take $\alpha(\theta, \phi)$ to be such that the detailed balance equation

$$p(\theta)\, q(\theta, \phi)\, \alpha(\theta, \phi) = p(\phi)\, q(\phi, \theta)$$

holds, and consequently

$$\alpha(\theta, \phi) = \frac{p(\phi)\, q(\phi, \theta)}{p(\theta)\, q(\theta, \phi)}.$$

We do not want to reduce the number of moves from φ to θ in such a case, so we take $\alpha(\phi, \theta) = 1$; and similarly in the case where the inequality is reversed, so that we have

$$p(\theta)\, q(\theta, \phi) < p(\phi)\, q(\phi, \theta).$$

It is clear that a general formula is

$$\alpha(\theta, \phi) = \min\left\{\frac{p(\phi)\, q(\phi, \theta)}{p(\theta)\, q(\theta, \phi)},\ 1\right\}$$

so that the probability of going from state θ to state $\phi \neq \theta$ is $q(\theta, \phi)\,\alpha(\theta, \phi)$, while the probability that the chain remains in its present state θ is

$$s(\theta) = 1 - \sum_{\phi \neq \theta} q(\theta, \phi)\, \alpha(\theta, \phi).$$

The matrix of transition probabilities is thus given by

$$p(\theta, \phi) = q(\theta, \phi)\, \alpha(\theta, \phi) + s(\theta)\,[\phi = \theta].$$

The aforementioned argument assumes that the state space is discrete, but this is just to make the formulae easier to take in – if the state space is continuous, the same argument works with suitable adjustments, such as replacing sums by integrals.

Note that it suffices to know the target density $p(\theta)$ up to a constant multiple, because it appears both in the numerator and in the denominator of the expression for $\alpha(\theta, \phi)$. Further, if the candidate-generating density is symmetric, so that $q(\theta, \phi) = q(\phi, \theta)$, this reduces to

$$\alpha(\theta, \phi) = \min\left\{\frac{p(\phi)}{p(\theta)},\ 1\right\}.$$
This is the form which Metropolis et al. (1953) originally proposed, while the general form was proposed by Hastings (1970).
We can summarize the Metropolis–Hastings algorithm as follows:
1. Sample a candidate point $\phi$ from a proposal distribution $q(\theta^{(t)}, \cdot)$.
2. Calculate

$$\alpha = \min\left\{\frac{p(\phi)\, q(\phi, \theta^{(t)})}{p(\theta^{(t)})\, q(\theta^{(t)}, \phi)},\ 1\right\}.$$

3. Generate a value $U$ which is uniformly distributed on (0, 1).
4. Then if $U \leqslant \alpha$ we define $\theta^{(t+1)} = \phi$; otherwise we define $\theta^{(t+1)} = \theta^{(t)}$.
5. Return the sequence $\theta^{(1)}, \theta^{(2)}, \ldots, \theta^{(n)}$.
Of course, we ignore the values of $\theta^{(t)}$ until the chain has converged to equilibrium. Unfortunately, it is a difficult matter to assess how many observations need to be ignored. One suggestion is that chains should be started from various widely separated starting points and the variation within and between the sampled draws compared; see Gelman and Rubin (1992). We shall not go further into this question; a useful reference is Gamerman and Lopes (2011, Section 5.4).
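For concreteness, here is a minimal R sketch of the algorithm (ours, not from the text) for a univariate target p known only up to a constant multiple, using a symmetric normal random-walk proposal so that q cancels in α:

# Metropolis-Hastings with a symmetric random-walk candidate density
metropolis <- function(p, theta0, n, sd=1) {
  theta <- numeric(n)
  theta[1] <- theta0
  for (t in 2:n) {
    phi <- theta[t-1] + rnorm(1, 0, sd)     # step 1: candidate point
    alpha <- min(p(phi)/p(theta[t-1]), 1)   # step 2: probability of moving
    u <- runif(1)                           # step 3: U(0, 1) variate
    theta[t] <- if (u <= alpha) phi else theta[t-1]   # step 4
  }
  theta                                     # step 5: return the sequence
}
# Example: sample from a density proportional to exp(-|x|^3)
draws <- metropolis(function(x) exp(-abs(x)^3), theta0=0, n=10000)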
9.6.3 Choice of a candidate density
We need to specify a candidate density before running a Markov chain to implement the Metropolis–Hastings algorithm. Quite a number of suggestions have been made, among them:
Random walk chains in which $q(\theta, \phi) = q_1(\phi - \theta)$, so that $\phi$ takes the form $\phi = \theta + z$, where $z$ has the density $q_1$. The density $q_1$ is usually symmetric, so that $q_1(-z) = q_1(z)$, and in such cases the formula for the probability of moving reduces to

$$\alpha(\theta, \phi) = \min\left\{\frac{p(\phi)}{p(\theta)},\ 1\right\}.$$
Independence chains in which $q(\theta, \phi) = q_2(\phi)$ does not depend on the current state θ. Although it sounds at first as if the transition probability does not depend on the current state θ, in fact it does through the probability of moving, which is

$$\alpha(\theta, \phi) = \min\left\{\frac{p(\phi)\, q_2(\theta)}{p(\theta)\, q_2(\phi)},\ 1\right\}.$$

In cases where we are trying to sample from a posterior distribution, one possibility is to take $q_2$ to be the prior density, so that the ratio becomes a ratio of likelihoods

$$\alpha(\theta, \phi) = \min\left\{\frac{l(\phi \mid \boldsymbol{x})}{l(\theta \mid \boldsymbol{x})},\ 1\right\}.$$

Note that there is no need to find the constant of proportionality in the posterior distribution.
Autoregressive chains in which $\phi = a + b(\theta - a) + \lambda$, where λ is random with density $q_3$. It follows that $q(\theta, \phi) = q_3(\phi - a - b(\theta - a))$. Setting $b = -1$ produces chains that are reflected about the point $a$ and ensures negative correlation between successive elements of the chain.
A Metropolis–Hastings acceptance–rejection algorithm due to Tierney (1994) is described by Chib and Greenberg (1995). We will not go into this algorithm here.
9.6.4 Example
Following Chib and Greenberg (1995), we illustrate the method by considering ways of simulating the bivariate normal distribution $\mathrm{N}(\boldsymbol\mu, \boldsymbol\Sigma)$, where we take

$$\boldsymbol\mu = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad \boldsymbol\Sigma = \begin{pmatrix} 1 & 0.9 \\ 0.9 & 1 \end{pmatrix}.$$

If we define

$$\mathbf{P} = \begin{pmatrix} 1 & 0 \\ 0.9 & \sqrt{1 - 0.9^2} \end{pmatrix}$$

so that $\boldsymbol\Sigma = \mathbf{P}\mathbf{P}^{\mathsf{T}}$ (giving the Choleski factorization of $\boldsymbol\Sigma$), it is easily seen that if we take

$$\boldsymbol\theta = \boldsymbol\mu + \mathbf{P}\boldsymbol{z},$$

where the components of $\boldsymbol{z}$ are independent standard normal variates, then $\boldsymbol\theta$ has the required distribution, so that the distribution is easily simulated without using Markov chain Monte Carlo methods. Nevertheless, a simple illustration of the use of such methods is provided by this distribution. One possible set-up is provided by a random walk candidate $\boldsymbol\phi = \boldsymbol\theta + \boldsymbol{z}$, where the components $Z_1$ and $Z_2$ of $\boldsymbol{z}$ are taken as independent uniformly distributed random variables with $Z_i \sim \mathrm{U}(-\delta_i, \delta_i)$, and taking the probability of a move as

$$\alpha(\boldsymbol\theta, \boldsymbol\phi) = \min\left\{\frac{p(\boldsymbol\phi)}{p(\boldsymbol\theta)},\ 1\right\}.$$
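A short R sketch of this set-up follows (ours, not from the text; the half-widths 0.75 and 1 of the uniform increments are illustrative choices, not values fixed by the text):

# Random-walk Metropolis for the bivariate normal above; the
# normalizing constant of the density cancels in alpha
mu <- c(1, 2)
Sigma <- matrix(c(1, 0.9, 0.9, 1), 2, 2)
Sinv <- solve(Sigma)
p <- function(th) exp(-0.5*c(t(th - mu) %*% Sinv %*% (th - mu)))
del <- c(0.75, 1)          # illustrative half-widths delta1, delta2
n <- 10000
theta <- matrix(0, n, 2)   # arbitrary start at (0, 0)
for (t in 2:n) {
  phi <- theta[t-1, ] + runif(2, -del, del)   # candidate phi = theta + z
  alpha <- min(p(phi)/p(theta[t-1, ]), 1)     # probability of a move
  theta[t, ] <- if (runif(1) <= alpha) phi else theta[t-1, ]
}
colMeans(theta)            # roughly (1, 2) once burn-in is discarded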