
  Another artificial example quoted by O’Hagan (1994, Example 8.11) arises if $x$ and $y$ take only the values 0 and 1 and have joint density

  $$p(x, y) = \begin{cases} \pi/2 & (x = y) \\ (1-\pi)/2 & (x \neq y) \end{cases}$$

  for some $0 < \pi < 1$, so that

  $$p(x\,|\,y) = \begin{cases} \pi & (x = y) \\ 1-\pi & (x \neq y) \end{cases}$$

  and similarly for $y$ given $x$. It is not hard to see that if we operate a Gibbs sampler with these conditional distributions then

  $$P(x_{n+1} = 1) = (2\pi - 1)^2\,P(x_n = 1) + 2\pi(1-\pi),$$

  from which it is easy to deduce that $p_n = P(x_n = 1)$ satisfies the difference equation $p_{n+1} = \alpha p_n + \beta$, where $\alpha = (2\pi-1)^2$ and $\beta = 2\pi(1-\pi)$. It is easily checked (using the fact that $\alpha + 2\beta = 1$) that the solution of this equation for a given value of $p_0$ is

  $$p_n = \tfrac{1}{2} + \alpha^n\left(p_0 - \tfrac{1}{2}\right),$$

  and clearly $p_n \to \tfrac{1}{2}$ as $n \to \infty$, but if π is close to 0 or 1, then α is close to 1 and convergence to this limit can be very slow. If, for example, $\pi = 0.999$ so that $\alpha = 0.996$, then $p_{128} = 0.2$ if $p_0 = 0$ but $p_{128} = 0.8$ if $p_0 = 1$. A further problem is that the series may have appeared to have converged even though it has not – if the process starts from (0, 0), then the number of iterations until $(x_n, y_n)$ is not equal to (0, 0) is geometrically distributed with mean $1/(1 - \pi^2)$.
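  These formulae are easy to check by simulation. The following minimal R sketch (an illustration, not part of the original text; the value $\pi = 0.999$ matches the example above, while the seed and the number of replicate chains are arbitrary choices) runs many copies of the two-state Gibbs sampler from (0, 0) and compares the empirical frequency of $x_{128} = 1$ with the solution of the difference equation:

  # Two-state Gibbs sampler: estimate P(x_128 = 1) over replicate chains
  # started from (0, 0) and compare with the difference equation.
  set.seed(1)
  pr <- 0.999                    # the probability pi of the text
  M <- 1000                      # number of replicate chains
  n <- 128                       # iterations per chain
  x <- numeric(M)
  for (j in 1:M) {
    xj <- 0; yj <- 0             # start from (0, 0)
    for (i in 1:n) {
      xj <- rbinom(1, 1, if (yj == 1) pr else 1 - pr)
      yj <- rbinom(1, 1, if (xj == 1) pr else 1 - pr)
    }
    x[j] <- xj
  }
  mean(x)                        # empirical P(x_128 = 1), about 0.2
  alpha <- (2*pr - 1)^2
  0.5 + alpha^n * (0 - 0.5)      # theoretical value, about 0.2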

  Problems of convergence for Markov chain Monte Carlo methods such as the Gibbs sampler are very difficult and by no means fully solved, and there is not room here for a full discussion of the problems involved. A fuller discussion can be found in O’Hagan (1994, Section 8.65 et seq.), Gilks et al. (1996, especially Chapter 7), Robert (1995), Raftery and Lewis (1996) and Cowles and Carlin (1996). Evidently, it is necessary to look at distributions obtained at various stages and see whether it appears that convergence to an equilibrium distribution is taking place, but even in cases where this is apparently the case, one should be aware that this could be an illusion. In the examples in this chapter, we have stressed the derivation of the algorithms rather than diagnostics as to whether we really have achieved equilibrium.

  9.5 Rejection sampling

  9.5.1 Description

  An important method which does not make use of Markov chain methods but which helps to introduce the Metropolis–Hastings algorithm is rejection sampling or acceptance–rejection sampling. This is a method for use in connection with a density $f(\theta)/K$ in the case where the normalizing constant $K$ is quite possibly unknown, which, as remarked at the beginning of this chapter, is a typical situation occurring in connection with posterior distributions in Bayesian statistics. To use this method, we need to assume that there is a candidate density $h(\theta)$ from which we can simulate samples and a constant $c$ such that $f(\theta) \leq c\,h(\theta)$ for all θ. Then, to obtain a random variable $X$ with density $f(\theta)/K$ we proceed as follows:

  1. Generate a variate $Y$ from the density $h$;

  2. Generate a value $U$ which is uniformly distributed on (0, 1);

  3. Then if $U \leq f(Y)/\{c\,h(Y)\}$ we define $X = Y$; otherwise go back to step 1.

  In the discrete case we can argue as follows. In $N$ trials we get the value θ on about $N h(\theta)$ occasions, of which we retain it about $N h(\theta) \times f(\theta)/\{c\,h(\theta)\} = N f(\theta)/c$ times, which is proportional to $f(\theta)$. Similar considerations apply in the continuous case; a formal proof can be found in Ripley (1987, Section 3.2) or Gamerman and Lopes (2011, Section 5.1).

  9.5.2 Example

  As a simple example, we can use this method to generate values $X$ from a beta distribution with density $f(x)/K$, where $f(x) = x^{\alpha-1}(1-x)^{\beta-1}$ (in fact, we know that in this case $K = B(\alpha, \beta)$, the beta function). We simply note that the density has a maximum at $x = (\alpha-1)/(\alpha+\beta-2)$ (see Appendix A), so that we can take $c = f\left((\alpha-1)/(\alpha+\beta-2)\right)$ and $h(x) = 1$, that is, $h$ the density of a uniform distribution on (0, 1).
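  The method takes only a few lines of R. In the sketch below (an illustration, not from the original text) the choice $\alpha = \beta = 2$ is purely illustrative, so that $f(x) = x(1-x)$ has maximum $1/4$ at $x = 1/2$ and we may take $h(x) = 1$ and $c = 1/4$:

  # Rejection sampling from a Be(2,2) density (illustrative choice of
  # alpha and beta): f(x) = x(1-x) <= 1/4 = c h(x) with h uniform.
  set.seed(1)
  N <- 10000
  f <- function(x) x * (1 - x)
  c0 <- 1/4                    # the constant c of the text
  y <- runif(N)                # step 1: candidates from h
  u <- runif(N)                # step 2: uniform variates
  x <- y[u <= f(y) / c0]       # step 3: accept when u <= f(y)/(c h(y))
  mean(x)                      # close to 1/2, the Be(2,2) mean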

  9.5.3 Rejection sampling for log-concave distributions

  It is often found that inference problems simplify when a distribution is log-concave, that is, when the logarithm of the density function is concave (see Walther, 2009). In such a case, we can fit a piecewise linear upper bound over the logarithm of the density as depicted in Figure 9.4a. In that particular case, there are three pieces to the curve, but in general there could be more. Exponentiating, we find an upper bound for the density itself which consists of curves of the form exp(a+bx), as depicted in Figure 9.4b. As these curves are easily integrated, the proportions of the area under the upper bound which fall under each of the curves are easily calculated. We can then generate a random variable with a density proportional to the upper bound in two stages. We first choose a constituent curve exp(a+bx) between $x = t$ and $x = t'$ (the next join point) with a probability proportional to

  $$\int_t^{t'} \exp(a + bx)\,dx = \frac{\exp(a + bt') - \exp(a + bt)}{b}.$$

  Figure 9.4 Rejection sampling: (a) log density, (b) un-normalized density, (c) empirical density before rejection and (d) empirical density of ν.

  After this, a point between $t$ and $t'$ is chosen as a random variable with a density proportional to exp(a+bx).

  For the latter step, we note that such a random variable has distribution function

  $$F(x) = \frac{e^{bx} - e^{bt}}{e^{bt'} - e^{bt}} \qquad (t \leq x \leq t').$$

  The inverse function of this is easily shown to be

  $$F^{-1}(v) = \frac{1}{b}\log\left\{e^{bt} + v\left(e^{bt'} - e^{bt}\right)\right\}.$$

  If $V$ is uniformly distributed over [0, 1] and we write $X = F^{-1}(V)$, it follows that

  $$P(X \leq x) = P(F^{-1}(V) \leq x) = P(V \leq F(x)) = F(x),$$

  so that $X$ has the distribution function $F(x)$.
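  In R this inversion is a one-liner. The following minimal sketch (not from the original text; the values of $b$, $t$ and $t'$ are illustrative, and $a$ cancels in the normalization) draws from the density proportional to exp(a+bx) on $[t, t']$:

  # Inverse-CDF sampling from a density proportional to exp(a + b*x)
  # on [t, t']; the constant a cancels, so only b, t and t' are needed.
  set.seed(1)
  b <- -2; t0 <- 0; t1 <- 1    # illustrative values; t1 plays the role of t'
  v <- runif(10000)
  x <- (1/b) * log(exp(b*t0) + v * (exp(b*t1) - exp(b*t0)))
  range(x)                     # all sampled values lie between t and t'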

  Figure 9.4c shows that simulation using this method produces an empirical distribution which fits closely to the upper envelope of curves we have chosen. We can now use the method of rejection sampling as described earlier to find a sample from the original log-concave density. The empirical distribution thus produced is seen to fit closely to this density in Figure 9.4d.

  The choice of the upper envelope can be made to evolve as the sampling proceeds; for details, see Gilks and Wild (1992).

  9.5.4 A practical example

  In Subsection 9.4.2 headed ‘An example with observed data’ which concerned pump failures, we assumed that the parameter ν was known to be equal to 1.4. It would clearly be better to treat ν as a parameter to be estimated in the same way that we estimated $S_0$. Now

  $$p(\nu\,|\,\boldsymbol\theta, S_0, \boldsymbol x) = p(\nu\,|\,\boldsymbol\theta, S_0)$$

  since the distribution of $\boldsymbol x$ given $\boldsymbol\theta$ does not depend on ν or $S_0$. Consequently

  $$p(\nu\,|\,\boldsymbol\theta, S_0) \propto p(\nu) \prod_{i=1}^k p(\theta_i\,|\,\nu, S_0) \propto p(\nu)\,\frac{(S_0/2)^{k\nu/2}}{\Gamma(\nu/2)^k}\left(\prod_i \theta_i\right)^{\nu/2-1}\exp\left(-S_0\sum_i\theta_i/2\right).$$

  Unfortunately, the latter expression is not proportional to any standard family for any choice of hyperprior $p(\nu)$. If, however,

  $$p(\nu) \propto \exp(-\nu/\mu)$$

  has the form of an exponential distribution of mean μ, we find that

  $$\log p(\nu\,|\,\boldsymbol\theta, S_0) = \text{constant} + \frac{k\nu}{2}\log\frac{S_0}{2} - k\log\Gamma(\nu/2) + \left(\frac{\nu}{2}-1\right)\sum_i\log\theta_i - \frac{\nu}{\mu},$$

  so that

  $$p(\nu\,|\,\boldsymbol\theta, S_0) \propto \frac{S_0^{k\nu/2}}{2^{k\nu/2}\,\Gamma(\nu/2)^k}\left(\prod_i\theta_i\right)^{\nu/2-1}e^{-\nu/\mu}.$$
  The logarithm of this density is concave. A proof of this follows, but the result can be taken for granted. We observe that the first derivative of the log of this density is

  $$\frac{k}{2}\log\frac{S_0}{2} - \frac{k}{2}\,\psi(\nu/2) + \frac{1}{2}\sum_i\log\theta_i - \frac{1}{\mu},$$

  where

  $$\psi(z) = \frac{d}{dz}\log\Gamma(z)$$

  is the digamma function (cf. Appendix A.7), while the second derivative is

  $$-\frac{k}{4}\,\psi'(\nu/2),$$

  where $\psi'$ is the so-called trigamma function. Since the trigamma function satisfies

  $$\psi'(z) = \sum_{n=0}^{\infty}\frac{1}{(z+n)^2} > 0$$

  (see Whittaker and Watson, 1927, Section 12.16), it is clear that this second derivative is negative for all $\nu > 0$, from which concavity follows.
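  The sign of the second derivative is also easy to check numerically in R, which supplies the trigamma function directly (a small illustrative verification, not part of the book's program):

  # The second derivative of the log-density is -(k/4)*trigamma(nu/2),
  # which is negative wherever trigamma is positive.
  k <- 10
  nu <- seq(0.1, 10, by = 0.1)
  all(-(k/4) * trigamma(nu/2) < 0)   # TRUE for every nu in the grid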

  We now find a piecewise linear function which is an upper bound of the logarithm of the density in the manner described earlier. In this case, it suffices to approximate it by a three-piece function, as illustrated in Figure 9.4a. This function is constructed by taking a horizontal line tangent at the zero n2 of the derivative of the log-density, where the log-density achieves its maximum h2, and then taking tangents at two nearby points n1 and n3. The joins between the lines occur at the points t1 and t3 where these three lines intersect. Exponentiating, this gives an upper bound for the density itself, consisting of a number of functions of the form exp(a+bx) (the central one being a constant), as illustrated in Figure 9.4b.

  In this case, the procedure is very efficient and some 87–88% of candidate points are accepted. The easiest way to understand the details of this procedure is to consider the program:

  windows(record=T)            # open a recording graphics window (Windows)
  par(mfrow=c(2,2))
  N <- 5000
  mu <- 1
  S <- 1.850
  theta <- c(0.0626,0.1181,0.0937,0.1176,0.6115,
             0.6130,0.8664,0.8661,1.4958,1.9416)
  k <- length(theta)
  # log of the un-normalized density of nu, and its derivative
  logf <- function(nu) {
    k*(nu/2)*log(S)-k*(nu/2)*log(2)-k*log(gamma(nu/2))+
      (nu/2-1)*sum(log(theta))-nu/mu
  }
  dlogf <- function(nu) {
    (k/2)*log(S)-(k/2)*log(2)-0.5*k*digamma(nu/2)+
      0.5*sum(log(theta))-1/mu
  }
  scale <- 3/2
  f <- function(nu) exp(logf(nu))
  # tangents at n1, n2, n3; the log-density has its maximum h2 at n2
  n2 <- uniroot(dlogf,lower=0.01,upper=100)$root
  h2 <- logf(n2)
  n1 <- n2/scale
  h1 <- logf(n1)
  b1 <- dlogf(n1)
  a1 <- h1-n1*b1
  n3 <- n2*scale
  h3 <- logf(n3)
  b3 <- dlogf(n3)
  a3 <- h3-n3*b3
  d <- exp(h2)
  # First plot; log density
  curve(logf,0.01,3.5,ylim=c(-8,-1),xlab="a. Log density")
  abline(h=h2)
  abline(a1,b1)
  abline(a3,b3)
  text(0.25,h2+0.3,expression(italic(h)[2]))
  text(0.3,-6,
       expression(italic(a)[1]+italic(b)[1]*italic(x)))
  text(3.25,-7,
       expression(italic(a)[3]+italic(b)[3]*italic(x)))
  # End of first plot
  f1 <- function(x) exp(a1+b1*x)
  f3 <- function(x) exp(a3+b3*x)
  # joins t1, t3 where the tangent lines meet the horizontal line h2
  t1 <- (h2-a1)/b1
  t3 <- (h2-a3)/b3
  # Second plot; un-normalized density
  curve(f,0.01,3.5,ylim=c(0,0.225),
        xlab="b. Un-normalized density")
  abline(h=d)
  par(new=T)
  curve(f1,0.01,3.5,ylim=c(0,0.225),ylab="",xlab="")
  par(new=T)
  curve(f3,0.01,3.5,ylim=c(0,0.225),ylab="",xlab="")
  abline(v=t1)
  abline(v=t3)
  text(t1-0.1,0.005,expression(italic(t)[1]))
  text(t3-0.1,0.005,expression(italic(t)[3]))
  text(0.25,0.2,expression(italic(d)))
  text(0.5,0.025,expression(italic(f)[1]))
  text(2.5,0.025,expression(italic(f)[3]))
  # End of second plot
  # areas under the three pieces of the envelope
  q1 <- exp(a1)*(exp(b1*t1)-1)/b1
  q2 <- exp(h2)*(t3-t1)
  q3 <- -exp(a3+b3*t3)/b3
  const <- q1 + q2 + q3
  p <- c(q1,q2,q3)/const
  cat("p =",p,"\n")
  # choose a piece, then sample from it by inversion
  case <- sample(1:3,N,replace=T,prob=p)
  v <- runif(N)
  w <- (case==1)*(1/b1)*log(q1*b1*exp(-a1)*v+1)+
       (case==2)*(t1+v*(t3-t1))+
       (case==3)*(1/b3)*log(q3*b3*exp(-a3)*v+exp(b3*t3))
  dq <- d/const
  f1q <- function(x) f1(x)/const
  f3q <- function(x) f3(x)/const
  # normalized envelope density
  h <- function(x) {
    dq +
      (x<t1)*(f1q(x)-dq) +
      (x>t3)*(f3q(x)-dq)
  }
  # Third plot; empirical density
  plot(density(w),xlim=c(0.01,4),ylim=c(0,1.1),
       xlab="c. Empirical density before rejection",
       main="")
  par(new=T)
  curve(h,0.01,4,ylim=c(0,1.1),ylab="",xlab="")
  # End of third plot
  # rejection step
  u <- runif(N)
  nu <- w
  nu[u>f(nu)/(const*h(nu))] <- NA
  nu <- nu[!is.na(nu)]
  cat("Acceptances",100*length(nu)/length(w),"\n")
  int <- integrate(f,0.001,100)$value
  fnorm <- function(x) f(x)/int
  # Fourth plot
  plot(density(nu),xlim=c(0.01,4),ylim=c(0,1.1),
       xlab=bquote(paste("d. Empirical density of ",nu)),
       main="")
  par(new=T)
  curve(fnorm,0.01,4,ylim=c(0,1.1),ylab="",xlab="")
  # End of fourth plot

  It is possible to combine this procedure with the method used earlier to estimate the $\theta_i$ and $S_0$, and an R program to do this can be found on the web. We shall not discuss this matter further here, since in this case it is considerably simpler to use WinBUGS, as noted in Section 9.7.

  9.6 The Metropolis–Hastings algorithm

  9.6.1 Finding an invariant distribution

  This whole section is largely based on the clear expository article by Chib and Greenberg (1995). A useful general review which covers Bayesian integration problems more generally can be found in Evans and Swartz (1995–1996; 2000).

  A topic of major importance when studying the theory of Markov chains is the determination of conditions under which there exists a stationary distribution (sometimes called an invariant distribution), that is, a distribution $p(\theta)$ such that

  $$p(\phi) = \sum_\theta p(\theta)\,p(\theta, \phi),$$

  where $p(\theta, \phi)$ denotes the probability of a transition from θ to φ. In the case where the state space is continuous, the sum is, of course, replaced by an integral.
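  For instance (an illustrative R check, not from the original text, with arbitrary transition probabilities), we can verify numerically that a given distribution is stationary for a simple two-state chain:

  # A two-state transition matrix (rows sum to 1; values illustrative)
  P <- matrix(c(0.7, 0.3,
                0.4, 0.6), nrow = 2, byrow = TRUE)
  p <- c(4/7, 3/7)   # candidate stationary distribution
  p %*% P            # returns (4/7, 3/7): p is invariant under P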

  In Markov chain Monte Carlo methods, often abbreviated as MCMC methods, the opposite problem is encountered. In cases where such methods are to be used, we have a target distribution $p(\theta)$ in mind from which we want to take samples, and we seek transition probabilities $p(\theta, \phi)$ for which this target density is the invariant density. If we can find such probabilities, then we can start a Markov chain at an arbitrary starting point and let it run for a long time, after which the probability of being in state θ will be given by the target density $p(\theta)$.

  It turns out to be useful to let $r(\theta)$ be the probability that the chain remains in the same state and then to write

  $$p(\theta, \phi) = q(\theta, \phi)\,\alpha(\theta, \phi) \qquad (\phi \neq \theta),$$

  where $q(\theta, \phi)$ is a density from which we can sample, while $\alpha(\theta, \phi)$ is a probability of moving which we shall determine below.

  Note that because the chain must go somewhere from its present position θ we have

  $$r(\theta) + \sum_{\phi \neq \theta} q(\theta, \phi)\,\alpha(\theta, \phi) = 1$$

  for all θ. If the equation

  $$p(\theta)\,p(\theta, \phi) = p(\phi)\,p(\phi, \theta)$$

  is satisfied, we say that the probabilities satisfy time reversibility or detailed balance. This condition means that the probability of starting at θ and ending at φ when the initial probabilities are given by $p(\theta)$ is the same as that of starting at φ and ending at θ. It turns out that when it is satisfied then $p(\theta, \phi)$ gives a set of transition probabilities with $p(\theta)$ as an invariant density. For

  $$\sum_\theta p(\theta)\,p(\theta, \phi) = \sum_{\theta \neq \phi} p(\phi)\,p(\phi, \theta) + p(\phi)\,r(\phi) = p(\phi)\left\{\sum_{\theta \neq \phi} p(\phi, \theta) + r(\phi)\right\} = p(\phi).$$

  The result follows very similarly in cases where the state space is continuous.

  The aforementioned proof shows that all we need to do to find the required transition probabilities is to find probabilities which are time reversible.

  9.6.2 The Metropolis–Hastings algorithm

  The Metropolis–Hastings algorithm begins by drawing from a candidate density as in rejection sampling, but, because we are considering Markov chains, the density depends on the current state of the process. We denote the candidate density by $q(\theta, \phi)$ and we suppose that $\sum_\phi q(\theta, \phi) = 1$. If it turns out that the density $q$ is itself time reversible, then we need look no further. If, however, we find that

  $$p(\theta)\,q(\theta, \phi) > p(\phi)\,q(\phi, \theta),$$

  then it appears that the process moves from θ to φ too often and from φ to θ too rarely. We can reduce the number of moves from θ to φ by introducing a probability $\alpha(\theta, \phi)$, called the probability of moving, that the move is made. In order to achieve time reversibility, we take $\alpha(\theta, \phi)$ to be such that the detailed balance equation

  $$p(\theta)\,q(\theta, \phi)\,\alpha(\theta, \phi) = p(\phi)\,q(\phi, \theta)$$

  holds and consequently

  $$\alpha(\theta, \phi) = \frac{p(\phi)\,q(\phi, \theta)}{p(\theta)\,q(\theta, \phi)}.$$

  We do not want to reduce the number of moves from φ to θ in such a case, so we take $\alpha(\phi, \theta) = 1$, and similarly in the case where the inequality is reversed and we have

  $$p(\theta)\,q(\theta, \phi) < p(\phi)\,q(\phi, \theta).$$

  It is clear that a general formula is

  $$\alpha(\theta, \phi) = \min\left\{\frac{p(\phi)\,q(\phi, \theta)}{p(\theta)\,q(\theta, \phi)},\ 1\right\},$$

  so that the probability of going from state θ to state $\phi \neq \theta$ is $q(\theta, \phi)\,\alpha(\theta, \phi)$, while the probability that the chain remains in its present state θ is

  $$r(\theta) = 1 - \sum_{\phi \neq \theta} q(\theta, \phi)\,\alpha(\theta, \phi).$$

  The matrix of transition probabilities is thus given by

  $$p(\theta, \phi) = \begin{cases} q(\theta, \phi)\,\alpha(\theta, \phi) & (\phi \neq \theta) \\ 1 - \sum_{\tilde\phi \neq \theta} q(\theta, \tilde\phi)\,\alpha(\theta, \tilde\phi) & (\phi = \theta). \end{cases}$$

  The aforementioned argument assumes that the state space is discrete, but this is just to make the formulae easier to take in – if the state space is continuous the same argument works with suitable adjustments, such as replacing sums by integrals.

  Note that it suffices to know the target density $p(\theta)$ up to a constant multiple, because it appears both in the numerator and in the denominator of the expression for $\alpha(\theta, \phi)$. Further, if the candidate-generating density is symmetric, so that $q(\theta, \phi) = q(\phi, \theta)$, the formula reduces to

  $$\alpha(\theta, \phi) = \min\left\{\frac{p(\phi)}{p(\theta)},\ 1\right\}.$$

  This is the form which Metropolis et al. (1953) originally proposed, while the general form was proposed by Hastings (1970).

  We can summarize the Metropolis–Hastings algorithm as follows:

  1. Sample a candidate point φ from a proposal distribution $q(\theta^{(t)}, \cdot)$;

  2. Calculate

  $$\alpha = \min\left\{\frac{p(\phi)\,q(\phi, \theta^{(t)})}{p(\theta^{(t)})\,q(\theta^{(t)}, \phi)},\ 1\right\};$$

  3. Generate a value $U$ which is uniformly distributed on (0, 1);

  4. Then if $U \leq \alpha$ we define $\theta^{(t+1)} = \phi$; otherwise we define $\theta^{(t+1)} = \theta^{(t)}$;

  5. Return the sequence $\theta^{(1)}, \theta^{(2)}, \dots, \theta^{(T)}$.

  Of course we ignore the values of $\theta^{(t)}$ until the chain has converged to equilibrium. Unfortunately, it is a difficult matter to assess how many observations need to be ignored. One suggestion is that chains should be started from various widely separated starting points and the variation within and between the sampled draws compared; see Gelman and Rubin (1992). We shall not go further into this question; a useful reference is Gamerman and Lopes (2011, Section 5.4).
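  As a concrete illustration (a minimal R sketch, not from the original text; the un-normalized target $f(\theta) = e^{-\theta^4/4}$ and the normal random walk proposal are purely illustrative choices), the whole algorithm takes only a few lines:

  # Metropolis-Hastings with a symmetric (random walk) proposal, for
  # which alpha reduces to min{f(phi)/f(theta), 1}; note that f need
  # only be known up to a constant.
  set.seed(1)
  f <- function(theta) exp(-theta^4/4)   # un-normalized target density
  N <- 10000
  theta <- numeric(N)
  theta[1] <- 0                          # arbitrary starting point
  for (t in 1:(N-1)) {
    phi <- theta[t] + rnorm(1)           # candidate from the proposal
    alpha <- min(f(phi) / f(theta[t]), 1)
    theta[t+1] <- if (runif(1) <= alpha) phi else theta[t]
  }
  theta <- theta[-(1:1000)]              # discard a burn-in period
  mean(theta)                            # near 0 for this symmetric target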

  9.6.3 Choice of a candidate density

  We need to specify a candidate density before we can construct a Markov chain to implement the Metropolis–Hastings algorithm. Quite a number of suggestions have been made, among them:

  Random walk chains in which $\phi = \theta + z$, so that the candidate density takes the form $q(\theta, \phi) = q_1(\phi - \theta)$, where $z$ has the density $q_1$. The density $q_1$ is usually symmetric, so that $q_1(-z) = q_1(z)$, and in such cases the formula for the probability of moving reduces to

  $$\alpha(\theta, \phi) = \min\left\{\frac{p(\phi)}{p(\theta)},\ 1\right\}.$$

  Independence chains in which $q(\theta, \phi) = q_2(\phi)$ does not depend on the current state θ. Although it sounds at first as if the transition probability does not depend on the current state θ, in fact it does through the probability of moving, which is

  $$\alpha(\theta, \phi) = \min\left\{\frac{p(\phi)\,q_2(\theta)}{p(\theta)\,q_2(\phi)},\ 1\right\}.$$

  In cases where we are trying to sample from a posterior distribution $p(\theta\,|\,\boldsymbol x) \propto p(\theta)\,l(\theta\,|\,\boldsymbol x)$, one possibility is to take $q_2$ to be the prior $p(\theta)$, so that the ratio becomes a ratio of likelihoods

  $$\alpha(\theta, \phi) = \min\left\{\frac{l(\phi\,|\,\boldsymbol x)}{l(\theta\,|\,\boldsymbol x)},\ 1\right\}.$$

  Note that there is no need to find the constant of proportionality in the posterior distribution.

  Autoregressive chains in which $\phi = a + b(\theta - a) + \lambda$, where λ is random with density $q_3$. It follows that $q(\theta, \phi) = q_3(\phi - a - b(\theta - a))$. Setting $b = -1$ produces chains that are reflected about the point $a$ and ensures negative correlation between successive elements of the chain.
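  The three types of chain differ only in how a candidate is generated. A small illustrative R fragment (with $q_1$, $q_2$ and $q_3$ all taken to be standard normal densities purely for illustration) makes the distinction explicit:

  theta <- 0.5                               # current state of the chain
  phi_rw  <- theta + rnorm(1)                # random walk chain
  phi_ind <- rnorm(1)                        # independence chain
  a <- 0; b <- -1
  phi_ar  <- a + b*(theta - a) + rnorm(1)    # autoregressive chain (b = -1)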

  A Metropolis–Hastings acceptance–rejection algorithm due to Tierney (1994) is described by Chib and Greenberg (1995). We will not go into this algorithm here.

  9.6.4 Example

  Following Chib and Greenberg (1995), we illustrate the method by considering ways of simulating the bivariate normal distribution $N(\boldsymbol\mu, \boldsymbol\Sigma)$, where (as in that paper) we take

  $$\boldsymbol\mu = \begin{pmatrix} 1 \\ 2 \end{pmatrix}, \qquad \boldsymbol\Sigma = \begin{pmatrix} 1 & 0.90 \\ 0.90 & 1 \end{pmatrix}.$$
  If we define

  $$\boldsymbol P = \begin{pmatrix} 1 & 0 \\ 0.90 & \sqrt{1 - 0.90^2} \end{pmatrix}$$

  so that $\boldsymbol P\boldsymbol P^{\mathsf T} = \boldsymbol\Sigma$ (giving the Choleski factorization of $\boldsymbol\Sigma$), it is easily seen that if we take

  $$\boldsymbol\theta = \boldsymbol\mu + \boldsymbol P\boldsymbol z,$$

  where the components of $\boldsymbol z$ are independent standard normal variates, then $\boldsymbol\theta$ has the required distribution, so that the distribution is easily simulated without using Markov chain Monte Carlo methods. Nevertheless, a simple illustration of the use of such methods is provided by this distribution. One possible set-up is provided by a random walk distribution $\boldsymbol\phi = \boldsymbol\theta + \boldsymbol z$, where the components $Z_1$ and $Z_2$ of $\boldsymbol z$ are taken as independent uniformly distributed random variables with $Z_i \sim \mathrm U(-\delta_i, \delta_i)$ for suitable half-widths $\delta_i$, and taking the probability of a move as

  $$\alpha(\boldsymbol\theta, \boldsymbol\phi) = \min\left\{\frac{p(\boldsymbol\phi)}{p(\boldsymbol\theta)},\ 1\right\}.$$
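  A minimal R sketch of this random walk set-up follows (not from the original text; the half-widths $\delta_1 = 0.75$ and $\delta_2 = 1$ and the seed are illustrative choices):

  # Random walk Metropolis for the bivariate normal target; since the
  # proposal is symmetric, alpha = min{p(phi)/p(theta), 1} and p need
  # only be known up to a constant (we work with the log density).
  set.seed(1)
  mu <- c(1, 2)
  Sigma <- matrix(c(1, 0.9, 0.9, 1), 2, 2)
  SigInv <- solve(Sigma)
  logp <- function(th) { d <- th - mu; -0.5 * sum(d * (SigInv %*% d)) }
  delta <- c(0.75, 1)                # illustrative half-widths
  N <- 10000
  theta <- matrix(0, N, 2)
  for (t in 1:(N-1)) {
    phi <- theta[t, ] + runif(2, -delta, delta)
    if (log(runif(1)) <= logp(phi) - logp(theta[t, ]))
      theta[t+1, ] <- phi
    else
      theta[t+1, ] <- theta[t, ]
  }
  colMeans(theta[-(1:1000), ])       # close to mu = (1, 2)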

 
