3.10 Approximations based on the likelihood
3.10.1 Maximum likelihood
Suppose, as usual, that we have independent observations whose distribution depends on an unknown parameter θ about which we want to make inferences. Sometimes it is useful to quote the posterior mode, that is, that value of θ at which the posterior density is a maximum, as a single number giving some idea of the location of the posterior distribution of θ; it could be regarded as the ultimate limit of the idea of an HDR. However, some Bayesians are opposed to the use of any single number in this way [see Box and Tiao (1992, Section A5.6)].
If the likelihood dominates the prior, the posterior mode will occur very close to the point $\hat\theta$ at which the likelihood is a maximum. Use of $\hat\theta$ as an estimator of $\theta$ is known as the method of maximum likelihood and is originally due to Fisher (1922). One notable point about maximum likelihood estimators is that if $\psi = \psi(\theta)$ is any function of $\theta$ then it is easily seen that
$$ \hat\psi = \psi(\hat\theta) $$
because the point at which the likelihood is a maximum is not affected by how it is labelled. This invariance is not true of the exact position of the maximum of the posterior, nor indeed of HDRs, because these are affected by the factor $\mathrm{d}\theta/\mathrm{d}\psi$ that arises in transforming the prior.
You should note that the maximum likelihood estimator is often found by the Newton–Raphson method. Suppose that the likelihood is $l(\theta \mid x)$ and that its logarithm (in which it is often easier to work) is $L(\theta \mid x) = \log l(\theta \mid x)$. In order to simplify the notation, we may sometimes omit explicit reference to the data and write $L(\theta)$ for $L(\theta \mid x)$. We seek $\hat\theta$ such that $L(\hat\theta) \geqslant L(\theta)$ for all $\theta$, or equivalently that it satisfies the so-called likelihood equation
$$ L'(\hat\theta) = 0, $$
so that the score $L'(\theta)$ vanishes at $\hat\theta$.
3.10.2 Iterative methods
If $\theta_0$ is an approximation to $\hat\theta$ then using Taylor's Theorem
$$ 0 = L'(\hat\theta) = L'(\theta_0) + (\hat\theta - \theta_0)L''(\theta^*) $$
where $\theta^*$ is between $\theta_0$ and $\hat\theta$. In most cases, $L''(\theta^*)$ will not differ much from $L''(\theta_0)$ and neither will differ much from its expectation over all possible values of the data. However,
$$ \mathsf{E}\,L''(\theta \mid x) = -I(\theta \mid x), $$
where $I(\theta \mid x)$ is Fisher's information which was introduced earlier in Section 3.3 in connection with Jeffreys' rule. We note that, although $L(\theta \mid x)$ does depend on the value of $x$ observed, the information $I(\theta \mid x)$ depends on the distribution of the random variable $\tilde x$ rather than on the value observed on this particular occasion, and to this extent the notation, good though it is for other purposes, is misleading. However, the value of $I(\theta \mid x)$ does depend on the sample size $n$, because the distribution of $\tilde x$ does.
It follows that as $n \to \infty$ the value of $L''(\theta^*)$ tends to $-I(\theta \mid x)$, so that a better approximation than $\theta_0$ will usually be provided by either of
$$ \theta_1 = \theta_0 - \frac{L'(\theta_0)}{L''(\theta_0)}, $$
the Newton–Raphson method, or by
$$ \theta_1 = \theta_0 + \frac{L'(\theta_0)}{I(\theta_0 \mid x)}, $$
the method of scoring for parameters. The latter method was first published in a paper by Fisher (1925a).
It has been shown by Kale (1961) that the method of scoring will usually be the quicker process for large n unless high accuracy is ultimately required. In perverse cases both methods can fail to converge or can converge to a root which does not give the absolute maximum.
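The two updates are easy to program. The following is a minimal Python sketch of both iterations; the functions lprime, ldoubleprime and info stand for $L'$, $L''$ and $I$ and must be supplied for the model at hand, and the demonstration values use the normal-variance likelihood worked out in Section 3.10.4 below (so the exact answer is $\hat\phi = S/n = 33.2$).

    def newton_raphson(lprime, ldoubleprime, theta0, tol=1e-10, max_iter=100):
        """Iterate theta <- theta - L'(theta)/L''(theta) until the step is small."""
        theta = theta0
        for _ in range(max_iter):
            step = lprime(theta) / ldoubleprime(theta)
            theta = theta - step
            if abs(step) < tol:
                break
        return theta

    def scoring(lprime, info, theta0, tol=1e-10, max_iter=100):
        """Iterate theta <- theta + L'(theta)/I(theta) until the step is small."""
        theta = theta0
        for _ in range(max_iter):
            step = lprime(theta) / info(theta)
            theta = theta + step
            if abs(step) < tol:
                break
        return theta

    # Demonstration with the normal-variance likelihood of Section 3.10.4:
    # L'(phi) = -n/(2 phi) + S/(2 phi^2), L''(phi) = n/(2 phi^2) - S/phi^3,
    # I(phi) = n/(2 phi^2); both iterations approach phi-hat = S/n = 33.2.
    n, S = 20, 664
    print(newton_raphson(lambda p: -n/(2*p) + S/(2*p*p),
                         lambda p: n/(2*p*p) - S/p**3, 25.0))
    print(scoring(lambda p: -n/(2*p) + S/(2*p*p),
                  lambda p: n/(2*p*p), 25.0))

For this particular likelihood the scoring step lands on $S/n$ exactly in one iteration, while Newton–Raphson takes several steps; this foreshadows the comparison of the two formulae in the example below.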
3.10.3 Approximation to the posterior density
We can also observe that, since $L'(\hat\theta) = 0$, in the neighbourhood of $\hat\theta$
$$ L(\theta) \cong L(\hat\theta) + \tfrac12(\theta - \hat\theta)^2 L''(\hat\theta), $$
so that approximately
$$ l(\theta \mid x) \propto \exp\left\{-\tfrac12(\theta - \hat\theta)^2\left[-L''(\hat\theta)\right]\right\}. $$
Hence, the likelihood is approximately proportional to an $\mathrm{N}(\hat\theta,\,[-L''(\hat\theta)]^{-1})$ density, and so approximately to an $\mathrm{N}(\hat\theta,\,I(\hat\theta \mid x)^{-1})$ density. We can thus construct approximate HDRs by using this approximation to the likelihood and assuming that the likelihood dominates the prior.
3.10.4 Examples
Normal variance. For the normal variance $\phi$ (with known mean $\theta$)
$$ l(\phi \mid x) \propto \phi^{-n/2}\exp\left(-\tfrac12 S/\phi\right) $$
where $S = \sum(x_i - \theta)^2$, so that
$$ L(\phi) = \text{constant} - \tfrac12 n\log\phi - \tfrac12 S/\phi \qquad\text{and}\qquad L'(\phi) = -\frac{n}{2\phi} + \frac{S}{2\phi^2}. $$
In this case, the likelihood equation is solved without recourse to iteration to give
$$ \hat\phi = S/n. $$
Further,
$$ L''(\phi) = \frac{n}{2\phi^2} - \frac{S}{\phi^3}, \qquad\text{so that}\qquad -L''(\hat\phi) = \frac{n^3}{2S^2}. $$
Alternatively,
$$ I(\phi \mid x) = -\mathsf{E}\,L''(\phi) = -\frac{n}{2\phi^2} + \frac{\mathsf{E}S}{\phi^3}, $$
and as $\mathsf{E}(x_i - \theta)^2 = \phi$, so that $\mathsf{E}S = n\phi$, we have
$$ I(\phi \mid x) = \frac{n}{2\phi^2}, \qquad\text{so that}\qquad I(\hat\phi \mid x) = \frac{n^3}{2S^2}. $$
Of course, there is no need to use an iterative method to find $\hat\phi$ in this case, but the difference between the formulae for $-L''(\phi_0)$ and $I(\phi_0 \mid x)$ is illustrative of the extent to which the Newton–Raphson method and the method of scoring differ from one another. The results suggest that we approximate the posterior distribution of $\phi$ [which we found in Section 2.8 to be $S\chi^{-2}_n$] by
$$ \mathrm{N}(\hat\phi,\ 2\hat\phi^2/n) = \mathrm{N}(S/n,\ 2S^2/n^3). $$
With the data we considered in Section 2.8 on HDRs for the normal variance, we had n = 20 and S = 664, so that $\hat\phi = 33.2$ and $2S^2/n^3 = 110.224$. The approximation would suggest a 95% HDR between $33.2 - 1.96\sqrt{110.224}$ and $33.2 + 1.96\sqrt{110.224}$, that is, the interval (13, 54), as opposed to the interval (19, 67) which was found in Section 2.8.
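These figures are easily checked numerically; the following minimal Python sketch (assuming SciPy is available) reproduces the approximate HDR:

    from scipy.stats import norm

    n, S = 20, 664
    phi_hat = S / n                      # 33.2, the maximum likelihood estimate
    var_approx = 2 * S**2 / n**3         # 110.224, the approximate posterior variance
    lo, hi = norm.interval(0.95, loc=phi_hat, scale=var_approx ** 0.5)
    print(round(lo), round(hi))          # prints 13 54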
This example is deceptively simple – the method is of greatest use when analytic solutions are difficult or impossible. Further, the accuracy is greater when sample sizes are larger.
Poisson distribution. We can get another deceptively simple example by supposing that $x_1, x_2, \dots, x_n$ is an n-sample from $\mathrm{P}(\lambda)$ and that $T = \sum x_i$, so that (as shown in Section 3.4)
$$ L(\lambda) = T\log\lambda - n\lambda + \text{constant}, \qquad L'(\lambda) = \frac{T}{\lambda} - n, $$
and the likelihood equation is again solved without iteration, this time giving $\hat\lambda = T/n = \bar x$. Further
$$ L''(\lambda) = -\frac{T}{\lambda^2} $$
and, since $\mathsf{E}T = n\lambda$, we have $I(\lambda \mid x) = n/\lambda$, while $-L''(\hat\lambda) = n^2/T = n/\hat\lambda = I(\hat\lambda \mid x)$. This suggests that we can approximate the posterior of λ (which we found to be a gamma distribution if we took a conjugate prior) by
$$ \mathrm{N}(\hat\lambda,\ \hat\lambda/n) = \mathrm{N}(T/n,\ T/n^2). $$
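The quality of this approximation can be checked against an exact conjugate analysis. The sketch below uses made-up values of n and T (not from the text) and the improper limit $a = b = 0$ of a Ga(a, b) prior, purely for illustration:

    from scipy.stats import gamma, norm

    n, T = 25, 100                        # hypothetical sample size and total count
    lam_hat = T / n
    approx = norm(loc=lam_hat, scale=(lam_hat / n) ** 0.5)
    # A Ga(a, b) prior gives a Ga(a + T, b + n) posterior; with a = b = 0
    # this is Ga(T, n), i.e. shape T and scale 1/n.
    exact = gamma(T, scale=1 / n)
    print(approx.interval(0.95))          # normal approximation
    print(exact.interval(0.95))           # exact interval, close for this n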
Cauchy distribution. Suppose $x_1, x_2, \dots, x_n$ is an n-sample from $\mathrm{C}(\theta, 1)$, so that
$$ L(\theta) = \text{constant} - \sum_i \log\left[1 + (x_i - \theta)^2\right]. $$
It is easily seen that
$$ L'(\theta) = \sum_i \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2}, \qquad L''(\theta) = \sum_i \frac{2\left[(x_i - \theta)^2 - 1\right]}{\left[1 + (x_i - \theta)^2\right]^2}. $$
On substituting $x - \theta = \tan\psi$ and using standard reduction formulae, it follows that
$$ I(\theta \mid x) = n/2, $$
from which it can be seen that successive approximations to $\hat\theta$ can be found using the method of scoring by setting
$$ \theta_{r+1} = \theta_r + \frac{2}{n}L'(\theta_r) = \theta_r + \frac{4}{n}\sum_i \frac{x_i - \theta_r}{1 + (x_i - \theta_r)^2}. $$
The iteration could, for example, be started from the sample median, that is, the observation which is in the middle when they are arranged in increasing order. For small n the iteration may not converge, or may converge to the wrong answer (see Barnett, 1966), but the process usually behaves satisfactorily.
Real-life data from a Cauchy distribution are rarely encountered, but the following values were simulated from a $\mathrm{C}(\theta, 1)$ distribution (the value of θ being, in fact, 0):
The sample median of the n = 9 values is 0.397. If we take this as our first approximation $\theta_0$ to $\hat\theta$, then the iteration
$$ \theta_{r+1} = \theta_r + \frac{2}{n}L'(\theta_r) $$
quickly settles down, and all subsequent $\theta_r$ equal 0.179 which is, in fact, the correct value of $\hat\theta$. Since $I(\hat\theta \mid x) = n/2$, an approximate 95% HDR for θ is $0.179 \pm 1.96\sqrt{2/n}$, that is, the interval (–0.74, 1.10). This does include the true value, which we happen to know is 0, but of course the value of n has been chosen unrealistically small in order to illustrate the method without too much calculation.
It would also be possible in this case to carry out an iteration based on the Newton–Raphson method
$$ \theta_{r+1} = \theta_r - \frac{L'(\theta_r)}{L''(\theta_r)} $$
using the above formula for $L''(\theta)$, but as explained earlier, it is in general better to use the method of scoring.
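Both iterations are straightforward to program. In the Python sketch below the nine values used in the text are not reproduced, so a fresh $\mathrm{C}(0, 1)$ sample is drawn as a stand-in; the scoring step uses $I(\theta \mid x) = n/2$ exactly as above, and (as noted) either iteration may misbehave for small n.

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_cauchy(9)            # stand-in for the data in the text
    n = len(x)

    def lprime(t):
        u = x - t
        return np.sum(2 * u / (1 + u**2))

    def ldoubleprime(t):
        u = x - t
        return np.sum(2 * (u**2 - 1) / (1 + u**2) ** 2)

    theta = np.median(x)                  # start from the sample median
    for _ in range(100):
        theta = theta + (2 / n) * lprime(theta)    # scoring, I = n/2
    print(theta)                          # the maximum likelihood estimate

    theta_nr = np.median(x)               # Newton-Raphson, for comparison;
    for _ in range(100):                  # may fail for small n (Barnett, 1966)
        theta_nr = theta_nr - lprime(theta_nr) / ldoubleprime(theta_nr)
    print(theta_nr)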
3.10.5 Extension to more than one parameter
If we have two parameters, say θ and φ, which are both unknown, a similar argument shows that the maximum likelihood occurs at $(\hat\theta, \hat\phi)$, where
$$ \frac{\partial L}{\partial\theta} = \frac{\partial L}{\partial\phi} = 0. $$
Similarly, if $(\theta_0, \phi_0)$ is an approximation, a better one is $(\theta_1, \phi_1)$, where
$$ \begin{pmatrix} \theta_1 \\ \phi_1 \end{pmatrix} = \begin{pmatrix} \theta_0 \\ \phi_0 \end{pmatrix} - \begin{pmatrix} \partial^2 L/\partial\theta^2 & \partial^2 L/\partial\theta\,\partial\phi \\ \partial^2 L/\partial\theta\,\partial\phi & \partial^2 L/\partial\phi^2 \end{pmatrix}^{-1} \begin{pmatrix} \partial L/\partial\theta \\ \partial L/\partial\phi \end{pmatrix}, $$
where the derivatives are evaluated at $(\theta_0, \phi_0)$ and the matrix of second derivatives can be replaced by its expectation, which is minus the information matrix as defined in Section 3.3 on Jeffreys' rule.
Further, the likelihood and hence the posterior can be approximated by a bivariate normal distribution of mean $(\hat\theta, \hat\phi)$ and variance–covariance matrix whose inverse is equal to minus the matrix of second derivatives (or the information matrix) evaluated at $(\hat\theta, \hat\phi)$.
All of this extends in an obvious way to the case of more than two unknown parameters.
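In matrix form, one step of the iteration is a single linear solve. The Python sketch below applies it to the normal likelihood with unknown mean and variance treated in the next subsection, using made-up data and a starting point near the maximum (as noted above, a poor start can make the iteration fail):

    import numpy as np

    data = np.array([9.0, 10.5, 8.7, 11.2, 10.1])   # made-up observations
    n, xbar = len(data), data.mean()                # xbar = 9.9

    def grad(p):
        theta, phi = p
        S = np.sum((data - theta) ** 2)
        return np.array([n * (xbar - theta) / phi,
                         -n / (2 * phi) + S / (2 * phi**2)])

    def hess(p):
        theta, phi = p
        S = np.sum((data - theta) ** 2)
        return np.array([[-n / phi, -n * (xbar - theta) / phi**2],
                         [-n * (xbar - theta) / phi**2,
                          n / (2 * phi**2) - S / phi**3]])

    p = np.array([9.8, 0.9])              # a start near the maximum
    for _ in range(10):
        p = p - np.linalg.solve(hess(p), grad(p))
    print(p)                              # approaches (9.9, 0.868)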
3.10.6 Example
We shall consider only one, very simple, case, that of a normal distribution of unknown mean θ and variance φ. In this case,
$$ L(\theta, \phi) = \text{constant} - \tfrac12 n\log\phi - \tfrac12 S(\theta)/\phi $$
where $S(\theta) = \sum(x_i - \theta)^2$, so that
$$ \frac{\partial L}{\partial\theta} = \frac{n(\bar x - \theta)}{\phi}, \qquad \frac{\partial L}{\partial\phi} = -\frac{n}{2\phi} + \frac{S(\theta)}{2\phi^2}, $$
giving $\hat\theta = \bar x$ and $\hat\phi = S(\bar x)/n$. Further, it is easily seen that the matrix of second derivatives is
$$ \begin{pmatrix} -n/\phi & -n(\bar x - \theta)/\phi^2 \\ -n(\bar x - \theta)/\phi^2 & n/(2\phi^2) - S(\theta)/\phi^3 \end{pmatrix}, $$
which at $(\hat\theta, \hat\phi)$ reduces to
$$ \begin{pmatrix} -n/\hat\phi & 0 \\ 0 & -n/(2\hat\phi^2) \end{pmatrix}. $$
Because the off-diagonal elements vanish, the posteriors for θ and φ are approximately independent. Further, we see that, approximately,
$$ \theta \sim \mathrm{N}(\hat\theta,\ \hat\phi/n) \qquad\text{and}\qquad \phi \sim \mathrm{N}(\hat\phi,\ 2\hat\phi^2/n). $$
In fact, we found in Section 2.12 on the normal mean and variance both unknown that with standard reference priors the posterior for θ and φ is a normal/chi-squared distribution, and the marginals are such that
$$ \frac{\theta - \bar x}{\sqrt{S/\{n(n-1)\}}} \sim t_{n-1} \qquad\text{and}\qquad \phi \sim S\chi^{-2}_{n-1} $$
(where here $S = \sum(x_i - \bar x)^2$), which implies that the means and variances are
$$ \mathsf{E}\theta = \bar x, \quad \mathcal{V}\theta = \frac{S}{n(n-3)}, \qquad \mathsf{E}\phi = \frac{S}{n-3}, \quad \mathcal{V}\phi = \frac{2S^2}{(n-3)^2(n-5)}. $$
This shows that for large n the approximation is indeed valid.
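The limits are easily checked numerically. In the sketch below the value of S is made up (it cancels from the ratios), and both ratios of exact to approximate variance tend to 1:

    for n in [10, 100, 10000]:
        S = float(n)                       # arbitrary; S cancels in the ratios
        ratio_theta = (S / (n * (n - 3))) / (S / n**2)            # = n/(n-3)
        ratio_phi = (2 * S**2 / ((n - 3)**2 * (n - 5))) / (2 * S**2 / n**3)
        print(n, ratio_theta, ratio_phi)   # both ratios approach 1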
3.11 Reference posterior distributions
3.11.1 The information provided by an experiment
Bernardo (1979) suggested another way of arriving at a reference standard for Bayesian theory. The starting point for this is that the log-likelihood ratio $\log\{p_0(x)/p_1(x)\}$ can be regarded as the information provided by the observation x for discrimination in favour of $p_0$ against $p_1$ (cf. Good, 1950, Section 6.1). This led Kullback and Leibler (1951) to define the mean information in such data to be
$$ I = \int p_0(x)\log\frac{p_0(x)}{p_1(x)}\,\mathrm{d}x $$
(cf. Kullback, 1968, and Barnett, 1999, Section 8.6). Note that although there is a relationship between information as defined here and Fisher's information I as defined in Section 3.3 earlier (see Kullback, 1968, Chapter 2, Section 6), you are best advised to think of this as a quite separate notion. It has in common with Fisher's information the property that it depends on the distribution of the data rather than on any particular value of it.
Following this, Lindley (1956) defined the amount of information that the observation x provides about an unknown parameter θ, when the prior density for θ is $p(\theta)$, to be
$$ I(x) = \int p(\theta \mid x)\log\frac{p(\theta \mid x)}{p(\theta)}\,\mathrm{d}\theta. $$
The observation x is, of course, random, and hence we can define the expected information that the observation x will provide to be
$$ I = \int p(x)\left\{\int p(\theta \mid x)\log\frac{p(\theta \mid x)}{p(\theta)}\,\mathrm{d}\theta\right\}\mathrm{d}x $$
(a similar expression occurs in Shannon, 1948, Section 24). Two obviously equivalent expressions are
$$ I = \iint p(x, \theta)\log\frac{p(x, \theta)}{p(x)\,p(\theta)}\,\mathrm{d}x\,\mathrm{d}\theta = \iint p(x, \theta)\log\frac{p(x \mid \theta)}{p(x)}\,\mathrm{d}x\,\mathrm{d}\theta. $$
It is easily seen using the usual change-of-variable rule that the information defined by this expression is invariant under a one-to-one transformation of the parameter. It can be used as a basis for the Bayesian design of experiments. It has various appealing properties, notably that $I \geqslant 0$ with equality if and only if $p(x \mid \theta)$ does not depend on θ (see Hardy, Littlewood and Pólya, 1952, Theorem 205). Further, it turns out that if an experiment consists of two observations, then the total information it provides is the information provided by one observation plus the mean amount provided by the second given the first (as shown by Lindley, 1956).
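The property $I \geqslant 0$, with equality when $p(x \mid \theta)$ is free of θ, can be seen in a toy computation such as the following Python sketch; the two-point parameter and Bernoulli likelihood are illustrative assumptions, not from the text:

    import numpy as np

    def expected_information(prior, lik):
        """prior[j] = p(theta_j); lik[i, j] = p(x_i | theta_j)."""
        joint = lik * prior                    # p(x_i, theta_j)
        px = joint.sum(axis=1, keepdims=True)  # marginal p(x_i)
        mask = joint > 0
        return np.sum(joint[mask] * np.log(joint[mask] / (px * prior)[mask]))

    prior = np.array([0.5, 0.5])
    for p0, p1 in [(0.5, 0.5), (0.4, 0.6), (0.1, 0.9)]:
        lik = np.array([[p0, p1],              # p(x = 1 | theta)
                        [1 - p0, 1 - p1]])     # p(x = 0 | theta)
        print(p0, p1, expected_information(prior, lik))
    # prints 0 when p(x | theta) is free of theta, then increasing values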
We now define $I_n$ to be the amount of information about θ to be expected from n independent observations with the same distribution as x. By making an infinite number of observations one would get to know the precise value of θ, and consequently $I_\infty = \lim_{n\to\infty} I_n$ measures the amount of missing information about θ when the prior is $p(\theta)$. It seems natural to define 'vague initial knowledge' about θ as that described by the density which maximizes the missing information.
In the continuous case, we usually find that $I_\infty = \infty$ for all priors $p(\theta)$, and hence we need to use a limiting process. This is to be expected, since an infinite amount of information would be required to know a real number exactly. We define $p_n(\theta \mid x)$ to be the posterior density corresponding to that prior $p_n(\theta)$ which maximizes $I_n$ (it can be shown that in reasonable cases a unique maximizing function exists). Then the reference posterior $p(\theta \mid x)$ is defined as the limit of the densities $p_n(\theta \mid x)$. Functions can converge in various senses, and so we need to say what we mean by the limit of these densities. In fact, we take convergence to mean convergence of the distribution functions at all points at which the limiting distribution function is continuous.
We can then define a reference prior as any prior $p(\theta)$ which yields the reference posterior, that is, which satisfies $p(\theta \mid x) \propto p(x \mid \theta)\,p(\theta)$. This rather indirect definition is necessary because convergence of a sequence of posteriors does not necessarily imply that the corresponding priors converge in the same sense. To see this, consider a case where the observations consist of a single binomial variable $x \sim \mathrm{B}(k, \pi)$ and the sequence of priors is Be(1/n, 1/n). Then the posteriors are Be(x + 1/n, k – x + 1/n), which clearly converge to Be(x, k – x), the posterior corresponding to the Haldane prior Be(0, 0). However, the priors themselves have distribution functions which approach a step function with steps of $\frac12$ at 0 and 1, and that corresponds to a discrete prior distribution which gives probability $\frac12$ each to the values 0 and 1.
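The binomial illustration can be checked directly (assuming SciPy; the values of x and k below are made up): the posteriors settle down while the prior piles up mass near 0 and 1.

    from scipy.stats import beta

    x, k = 3, 10                          # made-up data: x successes in k trials
    for n in [1, 10, 100, 10000]:
        post = beta(x + 1 / n, k - x + 1 / n)
        prior = beta(1 / n, 1 / n)
        print(n, post.interval(0.95), prior.cdf(0.01))
    # the posterior intervals approach those of Be(x, k - x), while
    # prior.cdf(0.01) -> 1/2: half the prior mass collapses onto 0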
To proceed further, we suppose that $\boldsymbol x = (x_1, x_2, \dots, x_n)$ is the result of our n independent observations of x, and we define the entropy of a density by
$$ H\{p(\theta)\} = -\int p(\theta)\log p(\theta)\,\mathrm{d}\theta $$
(this is a function of a distribution for θ and is not a function of any particular value of θ).
(this is a function of a distribution for θ and is not a function of any particular value of θ). Then using and we see that
(the last equation results from simple manipulations as exp and log are inverse functions). It follows that we can write In in the form
It can be shown using the calculus of variations that the information $I_n$ is maximized when $p(\theta) \propto f_n(\theta)$ (see Bernardo and Smith, 1994, Section 5.4.2). It follows (provided the functions involved are well behaved) that the sequence of densities
$$ p_n(\theta) = \frac{f_n(\theta)}{\int f_n(\theta)\,\mathrm{d}\theta} $$
approaches the reference prior. There is a slight difficulty in that the posterior density $p(\theta \mid \boldsymbol x)$ which figures in the above expression depends on the prior, but we know that this dependence dies away as $n \to \infty$.
3.11.2 Reference priors under asymptotic normality
In cases where the approximations derived in Section 3.10 are valid, the posterior distribution is $\mathrm{N}(\hat\theta,\,I(\hat\theta \mid \boldsymbol x)^{-1})$, which by the additive property of Fisher's information is $\mathrm{N}(\hat\theta,\,\{n\,i(\hat\theta)\}^{-1})$, where $i(\theta)$ denotes the information from a single observation. Now it is easily seen that the entropy of an $\mathrm{N}(\theta, \phi)$ density is
$$ H = \tfrac12\left\{1 + \log(2\pi\phi)\right\} $$
(writing $\phi$ for $\{n\,i(\hat\theta)\}^{-1}$), from which it follows that
$$ \int p(\boldsymbol x \mid \theta)\log p(\theta \mid \boldsymbol x)\,\mathrm{d}\boldsymbol x \cong -\tfrac12\left\{1 + \log\frac{2\pi}{n\,i(\theta)}\right\} $$
to the extent to which the approximation established in the last section is correct. Thus, we have
$$ f_n(\theta) \cong \left\{\frac{n\,i(\theta)}{2\pi e}\right\}^{1/2}, $$
since the approximation in the previous section shows that $p(\theta \mid \boldsymbol x)$ is negligible except where $\hat\theta$ is close to θ. It follows on dropping a constant that
$$ p_n(\theta) \propto i(\theta)^{1/2}, $$
and so we have another justification for Jeffreys’ prior which we first introduced in Section 3.3.
If this were all that this method could achieve, it would not be worth the discussion above. Its importance lies in the fact that it can be used for a wider class of problems and, further, that it gives sensible answers when we have nuisance parameters.
3.11.3 Uniform distribution of unit length
To see the first point, consider the case of a uniform distribution over an interval of unit length with unknown centre, so that we have observations $x_i \sim \mathrm{U}(\theta - \tfrac12,\,\theta + \tfrac12)$, and as usual let $\boldsymbol x = (x_1, x_2, \dots, x_n)$ be the result of our n independent observations of x. Much as in Section 3.5, we find that if $m = \min x_i$ and $M = \max x_i$ then the posterior is
$$ p(\theta \mid \boldsymbol x) \propto p(\theta) \qquad\text{for } M - \tfrac12 \leqslant \theta \leqslant m + \tfrac12. $$
For a large sample, the interval in which this is nonzero will be small and (assuming suitable regularity) $p(\theta)$ will not vary much in it, so that asymptotically $p(\theta \mid \boldsymbol x) = 1/(m - M + 1)$ on that interval. It follows that
$$ f_n(\theta) = \exp\left\{\int p(\boldsymbol x \mid \theta)\log p(\theta \mid \boldsymbol x)\,\mathrm{d}\boldsymbol x\right\} = \exp\left\{-\mathsf{E}\log(m - M + 1)\right\}, $$
which is asymptotically equal to
$$ 1/\mathsf{E}(m - M + 1). $$
Since $u = M - \theta + \tfrac12$ is the maximum of n observations uniformly distributed on [0, 1] we have
$$ \mathrm{P}(u \leqslant u_0) = u_0^n, $$
from which it follows that the density of u is proportional to $u^{n-1}$, so that $\mathsf{E}u = n/(n+1)$ and hence $\mathsf{E}M = \theta + \tfrac12 - 1/(n+1)$. Similarly we find that $\mathsf{E}m = \theta - \tfrac12 + 1/(n+1)$, so that
$$ \mathsf{E}(m - M + 1) = \frac{2}{n+1} \qquad\text{and hence}\qquad f_n(\theta) \cong \frac{n+1}{2}. $$
Because this does not depend on θ, it follows that $p_n(\theta)$ does not depend on θ and so is uniform. Taking limits, the reference prior is also uniform.
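The order-statistic facts used here are easily confirmed by simulation (a sketch using NumPy):

    import numpy as np

    rng = np.random.default_rng(1)
    theta, n, reps = 0.0, 50, 200_000
    x = rng.uniform(theta - 0.5, theta + 0.5, size=(reps, n))
    m, M = x.min(axis=1), x.max(axis=1)
    print(np.mean(m - M + 1), 2 / (n + 1))   # both about 0.0392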
Note that in this case the posterior is very far from normality, so that the theory cannot be applied as in Subsection 3.11.2, headed ‘Reference priors under asymptotic normality’, but nevertheless a satisfactory reference prior can be devised.
3.11.4 Normal mean and variance
When we have two parameters, as in the case of the mean and variance of an $\mathrm{N}(\theta, \phi)$ distribution, we often want to make inferences about the mean θ, so that φ is a nuisance parameter. In such a case, we have to choose a conditional prior $p(\phi \mid \theta)$ for the nuisance parameter which describes personal opinions, previous observations, or else is 'diffuse' in the sense of the priors we have been talking about.
When we want to describe diffuse opinions about φ given θ, we would expect, for the aforementioned reasons, to maximize the missing information about φ given θ. This results in the sequence
$$ p_n(\phi \mid \theta) \propto f_n(\phi \mid \theta), \qquad f_n(\phi \mid \theta) = \exp\left\{\int p(\boldsymbol x \mid \theta, \phi)\log p(\phi \mid \theta, \boldsymbol x)\,\mathrm{d}\boldsymbol x\right\}. $$
Now we found in Section 3.10 that in the case where we have a sample of size n from a normal distribution, the asymptotic posterior distribution of φ is $\mathrm{N}(S/n,\ 2S^2/n^3)$, which we may write as $\mathrm{N}(\hat\phi,\ 2\hat\phi^2/n)$, and consequently (using the form derived at the start of the subsection on 'Reference priors under asymptotic normality') its entropy is
$$ \tfrac12\left\{1 + \log(4\pi\hat\phi^2/n)\right\}. $$
It follows that
$$ f_n(\phi \mid \theta) \cong \exp\left[-\tfrac12\left\{1 + \log(4\pi\phi^2/n)\right\}\right] \propto 1/\phi. $$
In the limit we get that
$$ p(\phi \mid \theta) \propto 1/\phi. $$
In this case, the posterior for the mean θ is well approximated by $\mathrm{N}(\hat\theta,\ \hat\phi/n)$, where $\hat\theta = \bar x$ and $\hat\phi = S/n$, so that the entropy is
$$ \tfrac12\left\{1 + \log(2\pi\hat\phi/n)\right\}. $$
We thus get
$$ f_n(\theta) \cong \exp\left[-\tfrac12\left\{1 + \log(2\pi\phi(\theta)/n)\right\}\right] \propto \phi(\theta)^{-1/2}, $$
using the facts that $p(\theta \mid \boldsymbol x)$ is negligible except where $\hat\theta$ is close to θ and that, of course, $\mathsf{E}\hat\phi$ must equal $\phi(\theta)$, the true variance when the mean is θ. We note that here $\phi(\theta)$ does not depend on θ, and so, in particular, in the case where the $x_i$ are $\mathrm{N}(\theta, \phi)$, the density $f_n(\theta)$ is a constant and in the limit the reference prior is also constant, so giving the usual reference prior $p(\theta) \propto 1$. It then follows that the joint reference prior is
$$ p(\theta, \phi) = p(\phi \mid \theta)\,p(\theta) \propto 1/\phi. $$
This, as we noted at the end of Section 3.3, is not the same as the prior given by the two-parameter version of Jeffreys' rule. If we want to make inferences about φ with θ being the nuisance parameter, we obtain the same reference prior.
There is a temptation to think that whatever parametrization we adopt we will get the same reference prior, but this is not the case. If we define the standard deviation as $\sigma = \sqrt\phi$ and the coefficient of variation or standardized mean as $\lambda = \theta/\sigma$, then we find
$$ p(\lambda, \sigma) \propto \left(1 + \tfrac12\lambda^2\right)^{-1/2}\sigma^{-1} $$
(see Bernardo and Smith, 1994, Examples 5.17 and 5.26) which corresponds to
$$ p(\theta, \phi) \propto \left(1 + \frac{\theta^2}{2\phi}\right)^{-1/2}\phi^{-3/2}. $$
3.11.5 Technical complications
There are actually some considerable technical complications about the process of obtaining reference posteriors and priors in the presence of nuisance parameters, since some of the integrals may be infinite. It is usually possible to deal with this difficulty by restricting the parameter of interest to a finite range and then increasing this range sequentially, so that in the limit all possible values are included. For details, see Bernardo and Smith (1994, Section 5.4.4).
3.12 Exercises on Chapter 3
1. Laplace claimed that the probability that an event which has occurred n times, and has not hitherto failed, will occur again is (n+1)/(n+2) [see Laplace (1774)], which is sometimes known as Laplace’s rule of succession. Suggest grounds for this assertion.