
Bayesian Statistics (4th ed)


by Peter M Lee


  Indeed it turns out that

  z = sin⁻¹√(x/n) ≈ N(sin⁻¹√π, 1/4n),

  where the symbol ≈ means ‘is approximately distributed as’ (see Section 3.10 on ‘Approximations based on the Likelihood’). To the extent that this is so, it follows that the transformation ψ = sin⁻¹√π puts the likelihood in data translated form, and hence that a uniform prior in ψ, that is, an arc-sine Be(½, ½) prior for π, is an appropriate reference prior.
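  A small simulation illustrates the approximation; the sketch below is mine, not the book’s, and the values of n and π are arbitrary. Whatever the value of π, the transformed variable has mean close to sin⁻¹√π and variance close to 1/4n.

    import numpy as np

    rng = np.random.default_rng(1)
    n, pi = 100, 0.3                          # arbitrary illustrative values

    x = rng.binomial(n, pi, size=100_000)
    z = np.arcsin(np.sqrt(x / n))             # the variance-stabilising transformation

    print("mean of z    :", z.mean(), "  arcsin(sqrt(pi)) =", np.arcsin(np.sqrt(pi)))
    print("variance of z:", z.var(),  "  1/(4n)           =", 1 / (4 * n))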

  3.2.4 Conclusion

  The three aforementioned possibilities are not the only ones that have been suggested. For example, Zellner (1977) suggested the use of the prior

  p(π) ∝ π^π (1 − π)^(1−π)

  [see also the references in Berger (1985, Section 3.3.4)]. However, this is difficult to work with because it is not in the conjugate family.

  In fact, the three suggested conjugate priors Be(0, 0), Be(½, ½) and Be(1, 1) (and for that matter Zellner’s prior) do not differ enough to make much difference with even a fairly small amount of data, and the aforementioned discussion of the problem of a suitable reference prior may seem disproportionately lengthy, except that it does underline the difficulty of giving a precise meaning to the notion of a prior distribution that represents ‘knowing nothing’. It may be worth your while trying a few examples to see how little difference there is between the possible priors in particular cases.
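  As a quick numerical illustration of that last remark, the following Python sketch (mine, not the book’s, using hypothetical counts) compares the posteriors arising from the candidate reference priors:

    from scipy.stats import beta

    # Hypothetical data, purely for illustration: x successes in n trials
    x, n = 7, 20

    # Candidate reference priors Be(a, b); Be(0, 0) is improper, but it still
    # gives a proper posterior Be(x, n - x) provided 0 < x < n
    priors = {"Be(0, 0)": (0.0, 0.0),
              "Be(1/2, 1/2)": (0.5, 0.5),
              "Be(1, 1)": (1.0, 1.0)}

    for name, (a, b) in priors.items():
        post = beta(a + x, b + n - x)      # conjugate update
        lo, hi = post.interval(0.95)       # central 95% posterior interval
        print(f"{name:12s}  mean {post.mean():.3f}  95% interval ({lo:.3f}, {hi:.3f})")

  Even with only 20 hypothetical observations the posterior means and intervals are very close to one another.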

  In practice, the use of Be(0, 0) is favoured here, although it must be admitted that one reason for this is that it ties in with the use of HDRs found from tables of values of F based on HDRs for log F and hence obviates the need for a separate set of tables for the beta distribution. But in any case, we could use the method based on these tables and the results would not be very different from those based on any other appropriate tables.

  3.3 Jeffreys’ rule

  3.3.1 Fisher’s information

  In Section 2.1 on the nature of Bayesian inference, the log-likelihood function was defined as

  L(θ | x) = log l(θ | x).

  In this section, we shall sometimes write l for l(θ | x), L for L(θ | x) and p for the probability density function p(x | θ). The fact that the likelihood can be multiplied by any constant implies that the log-likelihood contains an arbitrary additive constant.

  An important concept in classical statistics, which arises, for example, in connection with the Cramér–Rao bound for the variance of an unbiased estimator, is that of the information provided by an experiment, which was defined by Fisher (1925a) as

  I(θ | x) = E{(∂L(θ | x)/∂θ)²},

  the expectation being taken over all possible values of x for fixed θ. It is important to note that the information depends on the distribution of the data rather than on any particular value of it, so that if we carry out an experiment, the information is no different whichever value of x we happen to observe; basically it is to do with what can be expected from an experiment before rather than after it has been performed. It may be helpful to note that, strictly speaking, the x in the notation I(θ | x) refers to the random variable (that is, to the experiment as a whole) rather than to any particular observed value.

  Because the log-likelihood L(θ | x) differs from log p(x | θ) by a constant, all their derivatives are equal, and we can equally well define the information by

  I(θ | x) = E{(∂ log p(x | θ)/∂θ)²}.

  It is useful to prove two lemmas. In talking about these, you may find it useful to use a terminology frequently employed by classical statisticians. The first derivative of the log-likelihood is sometimes called the score; see Lindgren (1993, Section 4.5.4).

  Lemma 3.1

  E(∂L/∂θ) = 0.

  Proof. From the definition,

  E(∂L/∂θ) = ∫ (∂ log p(x | θ)/∂θ) p(x | θ) dx = ∫ ∂p(x | θ)/∂θ dx = (∂/∂θ) ∫ p(x | θ) dx = ∂(1)/∂θ = 0,

  since in any reasonable case it makes no difference whether differentiation with respect to θ is carried out inside or outside the integral with respect to x.

  Lemma 3.2

  I(θ | x) = −E(∂²L/∂θ²).

  Proof. Again differentiating under the integral sign, this time in the relation ∫ ∂p(x | θ)/∂θ dx = 0 established in the proof of Lemma 3.1, we find

  0 = ∫ ∂²p(x | θ)/∂θ² dx = ∫ {∂²L/∂θ² + (∂L/∂θ)²} p(x | θ) dx = E(∂²L/∂θ²) + I(θ | x),

  as required.
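  Both lemmas are easy to check numerically. The sketch below (mine, not the book’s) does so for a single Poisson(λ) observation, for which the score is x/λ − 1 and the information works out to 1/λ; the expectations are computed by summing over a generously truncated support.

    import numpy as np
    from scipy.stats import poisson

    lam = 2.7                                 # an arbitrary parameter value
    xs = np.arange(0, 200)                    # truncated support; the omitted tail mass is negligible
    p = poisson.pmf(xs, lam)

    score = xs / lam - 1.0                    # dL/d(lambda) for a single Poisson observation
    second = -xs / lam ** 2                   # d^2 L / d(lambda)^2

    print("E(score)               ", np.sum(p * score))        # Lemma 3.1: essentially zero
    print("E(score^2)             ", np.sum(p * score ** 2))   # the information
    print("-E(second derivative)  ", -np.sum(p * second))      # Lemma 3.2: the same value
    print("analytic value 1/lambda", 1 / lam)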

  3.3.2 The information from several observations

  If we have n independent observations x = (x₁, x₂, …, xₙ), then the probability densities multiply, so the log-likelihoods add. Consequently, if we define

  I(θ | x₁, x₂, …, xₙ) = E{(∂L(θ | x₁, x₂, …, xₙ)/∂θ)²},

  then, using Lemma 3.2 and the linearity of expectation,

  I(θ | x₁, x₂, …, xₙ) = n I(θ | x),

  where x is any one of the xi. This accords with the intuitive idea that n times as many observations should give us n times as much information about the value of an unknown parameter.

  3.3.3 Jeffreys’ prior

  In a Bayesian context, the important thing to note is that if we transform the unknown parameter θ to ψ, then

  ∂L/∂ψ = (∂L/∂θ)(dθ/dψ).

  Squaring and taking expectations over values of x (and noting that dθ/dψ does not depend on x), it follows that

  I(ψ | x) = I(θ | x)(dθ/dψ)².

  It follows from this that if a prior density

  p(θ) ∝ √I(θ | x)

  is used, then by the usual change-of-variable rule

  p(ψ) = p(θ) |dθ/dψ| ∝ √{I(θ | x)(dθ/dψ)²} = √I(ψ | x).

  It is because of this property that Jeffreys (1961, Section 3.10) suggested that the density

  p(θ) ∝ √I(θ | x)

  provided a suitable reference prior (the use of this prior is sometimes called Jeffreys’ rule). This rule has the valuable property that the prior is invariant, in the sense that whatever scale we choose to measure the unknown parameter in, applying the rule on the transformed scale leads to the same prior. This seems a highly desirable property of a reference prior. In Jeffreys’ words, ‘any arbitrariness in the choice of parameters could make no difference to the results’.

  3.3.4 Examples

  Normal mean. For the normal mean θ with known variance φ, the log-likelihood is

  L(θ | x) = −(x − θ)²/2φ + constant,

  so that

  ∂²L/∂θ² = −1/φ,

  which does not depend on x, so that

  I(θ | x) = −E(∂²L/∂θ²) = 1/φ,

  implying that we should take a prior

  p(θ) ∝ √I(θ | x) ∝ 1, that is, a prior uniform in θ,

  which is the rule suggested earlier for a reference prior.

  Normal variance. In the case of the normal variance φ (the mean μ being known), the log-likelihood is

  L(φ | x) = −½ log φ − (x − μ)²/2φ + constant,

  so that

  ∂²L/∂φ² = 1/2φ² − (x − μ)²/φ³.

  Because E(x − μ)² = φ, it follows that

  I(φ | x) = −E(∂²L/∂φ²) = −1/2φ² + 1/φ² = 1/2φ²,

  implying that we should take a prior

  p(φ) ∝ √I(φ | x) ∝ 1/φ,

  which again is the rule suggested earlier for a reference prior.

  Binomial parameter. In this case, the log-likelihood is

  L(π | x) = x log π + (n − x) log(1 − π) + constant,

  so that

  ∂²L/∂π² = −x/π² − (n − x)/(1 − π)².

  Because Ex = nπ, it follows that

  I(π | x) = −E(∂²L/∂π²) = n/π + n/(1 − π) = n/{π(1 − π)},

  implying that we should take a prior

  p(π) ∝ √I(π | x) ∝ π^(−1/2)(1 − π)^(−1/2),

  that is Be(½, ½), that is, π has an arc-sine distribution, which is one of the rules suggested earlier as possible choices for the reference prior in this case.
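  A symbolic check with sympy (my own sketch, not part of the text) reproduces this calculation and at the same time illustrates the invariance property of Section 3.3.3: in the parametrization ψ = sin⁻¹√π of Section 3.2, the information is constant, so the Jeffreys prior is uniform in ψ.

    import sympy as sp

    pi, psi, n, x = sp.symbols('pi psi n x', positive=True)

    # Log-likelihood for x successes in n binomial trials (additive constant dropped)
    L = x * sp.log(pi) + (n - x) * sp.log(1 - pi)

    # Information: minus the expected second derivative, using E(x) = n*pi
    I_pi = sp.simplify(-sp.diff(L, pi, 2).subs(x, n * pi))
    print(I_pi)                                   # n/(pi*(1 - pi)), possibly in an equivalent form

    # Jeffreys prior in pi is proportional to pi^(-1/2) * (1 - pi)^(-1/2), i.e. Be(1/2, 1/2)
    print(sp.simplify(sp.sqrt(I_pi / n)))

    # Reparametrise: pi = sin(psi)^2, so I(psi) = I(pi) * (d pi/d psi)^2
    I_psi = sp.simplify(I_pi.subs(pi, sp.sin(psi) ** 2) * sp.diff(sp.sin(psi) ** 2, psi) ** 2)
    print(I_psi)                                  # simplifies to 4*n: constant, so uniform in psi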

  3.3.5 Warning

  While Jeffreys’ rule is suggestive, it cannot be applied blindly. Apart from anything else, the integral defining the information can diverge; it is easily seen to do so for the Cauchy distribution C(), for example. It should be thought of as a guideline that is well worth considering, particularly if there is no other obvious way of finding a prior distribution. Generally speaking, it is less useful if there are more unknown parameters than one, although an outline of the generalization to that case is given later for reference.

  3.3.6 Several unknown parameters

  If there are several unknown parameters θ = (θ₁, θ₂, …, θₖ), the information provided by a single observation is defined as a matrix, the element in row i, column j, of which is

  I_ij(θ | x) = E{(∂L/∂θᵢ)(∂L/∂θⱼ)} = −E(∂²L/∂θᵢ ∂θⱼ).

  As in the one-parameter case, if there are several observations x₁, x₂, …, xₙ, we get

  I(θ | x₁, x₂, …, xₙ) = n I(θ | x).

  If we transform to new parameters ψ = (ψ₁, ψ₂, …, ψₖ), where θ = θ(ψ), we see that if J is the matrix the element in row i, column j of which is

  J_ij = ∂θᵢ/∂ψⱼ,

  then it is quite easy to see that

  I(ψ | x) = Jᵀ I(θ | x) J,

  where Jᵀ is the transpose of J, and hence that the determinant of the information matrix satisfies

  det I(ψ | x) = det I(θ | x) (det J)².

  Because det J is the Jacobian determinant of the transformation, it follows that

  p(θ) ∝ √{det I(θ | x)}

  provides an invariant prior for the multi-parameter case.

  3.3.7 Example

  Normal mean and variance both unknown. In this case, the log-likelihood is

  L(θ, φ | x) = −½ log φ − (x − θ)²/2φ + constant,

  so that

  ∂L/∂θ = (x − θ)/φ,   ∂L/∂φ = −1/2φ + (x − θ)²/2φ²,

  and hence

  ∂²L/∂θ² = −1/φ,   ∂²L/∂θ ∂φ = −(x − θ)/φ²,   ∂²L/∂φ² = 1/2φ² − (x − θ)²/φ³.

  Because E(x − θ) = 0 and E(x − θ)² = φ, it follows that

  I(θ, φ | x) = diag{1/φ, 1/2φ²}

  and det I(θ, φ | x) = 1/2φ³, so that

  √{det I(θ, φ | x)} ∝ φ^(−3/2).

  This implies that we should use the reference prior

  p(θ, φ) ∝ φ^(−3/2).

  It should be noted that this is not the same as the reference prior recommended earlier for use in this case, namely,

  p(θ, φ) ∝ 1/φ.

  However, I would still prefer to use the prior recommended earlier. The invariance argument does not take into account the fact that in most such problems your judgement about the mean would not be affected by anything you were told about the variance or vice versa, and on those grounds it seems reasonable to take a prior which is the product of the reference priors for the mean and the variance separately.
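  The calculation above is easily reproduced symbolically; the sketch below is mine (it uses sympy.stats to take the expectations) and simply confirms that the information matrix is diag(1/φ, 1/2φ²), so that √det I ∝ φ^(−3/2).

    import sympy as sp
    from sympy.stats import Normal, E

    theta = sp.Symbol('theta', real=True)
    phi = sp.Symbol('phi', positive=True)
    x = Normal('x', theta, sp.sqrt(phi))          # a single N(theta, phi) observation (phi = variance)

    # Log-likelihood (additive constant dropped)
    L = -sp.log(phi) / 2 - (x - theta) ** 2 / (2 * phi)

    params = (theta, phi)
    info = sp.Matrix(2, 2, lambda i, j: sp.simplify(-E(sp.diff(L, params[i], params[j]))))
    print(info)                                   # diagonal matrix diag(1/phi, 1/(2*phi**2))
    print(sp.simplify(sp.sqrt(info.det())))       # proportional to phi**(-3/2)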

  The example underlines the fact that we have to be rather careful about the choice of a prior in multi-parameter cases. It is also worth mentioning that it is very often the case that when there are parameters which can be thought of as representing ‘location’ and ‘scale’, respectively, then it would usually be reasonable to think of these parameters as being independent a priori, just as suggested earlier in the normal case.

  3.4 The Poisson distribution

  3.4.1 Conjugate prior

  A discrete random variable x is said to have a Poisson distribution of mean λ if it has the density

  p(x | λ) = (λ^x / x!) exp(−λ)   (x = 0, 1, 2, …).

  This distribution often occurs as a limiting case of the binomial distribution as the index n → ∞ and the parameter π → 0 in such a way that their product nπ → λ (see Exercise 6 in Chapter 1). It is thus a useful model for rare events, such as the number of radioactive decays in a fixed time interval, when we can split the interval into an arbitrarily large number of sub-intervals in any of which a particle might decay, although the probability of a decay in any particular sub-interval is small (though constant).

  Suppose that you have n observations x = (x₁, x₂, …, xₙ) from such a distribution, so that the likelihood is

  l(λ | x) ∝ λ^T exp(−nλ),

  where T is the sufficient statistic

  T = Σ xᵢ.

  We have already seen in Section 2.10 on ‘Conjugate Prior Distributions’ that the appropriate conjugate density is

  p(λ) ∝ λ^(ν/2 − 1) exp(−S₀λ/2),

  that is, S₀λ ~ χ²_ν, so that λ is a multiple of a chi-squared random variable. Then the posterior density is

  p(λ | x) ∝ λ^(ν/2 + T − 1) exp{−(S₀ + 2n)λ/2},

  that is

  S₁λ ~ χ²_{ν₁},

  where

  ν₁ = ν + 2T,   S₁ = S₀ + 2n.
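  Equivalently, the prior is a gamma distribution with shape ν/2 and rate S₀/2, and the update simply adds 2T to ν and 2n to S₀. A minimal sketch of the bookkeeping (my own helper, with made-up counts) is:

    def poisson_update(nu, S0, xs):
        """Conjugate update for Poisson counts under the prior S0*lambda ~ chi-squared on nu d.f."""
        T = sum(xs)                                   # the sufficient statistic
        n = len(xs)
        nu1, S1 = nu + 2 * T, S0 + 2 * n
        mean, variance = nu1 / S1, 2 * nu1 / S1 ** 2  # posterior mean and variance of lambda
        return nu1, S1, mean, variance

    # Hypothetical counts, purely for illustration (the book's own data are not reproduced here)
    print(poisson_update(nu=1, S0=0, xs=[2, 0, 3, 1, 2]))   # reference prior uniform in sqrt(lambda)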

  3.4.2 Reference prior

  This is a case where we can try to use Jeffreys’ rule. The log-likelihood resulting from a single observation x is

  L(λ | x) = x log λ − λ + constant,

  so that

  ∂²L/∂λ² = −x/λ²,

  and hence

  I(λ | x) = −E(∂²L/∂λ²) = Ex/λ² = 1/λ.

  Consequently Jeffreys’ rule suggests the prior

  p(λ) ∝ √I(λ | x) ∝ λ^(−1/2),

  which corresponds to ν = 1, S₀ = 0 in the conjugate family, and is easily seen to be equivalent to a prior uniform in √λ. It may be noted that there is a sense in which this is intermediate between a prior uniform in log λ and one uniform in λ itself, since as k → 0

  (λᵏ − 1)/k → log λ,

  so that there is a sense in which the transformation from λ to log λ can be regarded as a ‘zeroth power’ transformation (cf. Box and Cox, 1964).

  On the other hand, it could be argued that λ is a scale parameter between 0 and ∞ and that the right reference prior should therefore be

  p(λ) ∝ 1/λ,

  which is uniform in log λ and corresponds to ν = 0, S₀ = 0 in the conjugate family. However, the difference this would make in practice would almost always be negligible.

  3.4.3 Example

  The numbers of misprints spotted on the first few pages of an early draft of this book were

  It seems reasonable that these numbers should constitute a sample from a Poisson distribution of unknown mean λ. If you had no knowledge of my skill as a typist, you might adopt the reference prior uniform in √λ, for which ν = 1, S₀ = 0. Since ν₁ = ν + 2T and S₁ = S₀ + 2n, your posterior for λ would then be such that S₁λ ~ χ²_{ν₁} with ν₁ = 1 + 2T and S₁ = 2n, that is, λ ~ χ²_{1+2T}/2n. This distribution has mean ν₁/S₁ and variance 2ν₁/S₁².

  Of course, I have some experience of my own skill as a typist, so if I considered these figures, I would have used a prior with a mean of about 3 and variance about 4. (As a matter of fact, subsequent re-readings have caused me to adjust my prior beliefs about λ in an upwards direction!) If then I seek a prior in the conjugate family, I need

  ν/S₀ ≅ 3,   2ν/S₀² ≅ 4,

  which implies ν = 4.5 and S₀ = 1.5. This means that my posterior has ν₁ = 4.5 + 2T, S₁ = 13.5 and so has mean ν₁/S₁ and variance 2ν₁/S₁².
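  The elicitation step is simple arithmetic: since the prior mean is ν/S₀ and the prior variance is 2ν/S₀², a stated mean of 3 and variance of 4 force S₀ = 2 × 3/4 = 1.5 and ν = 3 × 1.5 = 4.5. A trivial helper of my own devising makes the calculation explicit:

    def chi_squared_prior_from_moments(mean, variance):
        """Choose nu and S0 so that the prior S0*lambda ~ chi-squared on nu d.f.
        has the stated mean (nu/S0) and variance (2*nu/S0**2)."""
        S0 = 2 * mean / variance
        nu = mean * S0
        return nu, S0

    print(chi_squared_prior_from_moments(3, 4))   # (4.5, 1.5), the values used in the text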

  The difference between the two posteriors is not great and of course would become less and less as more data were included in the analysis. It would be easy enough to give HDRs. According to arguments presented in other cases, it would be appropriate to use HDRs for the chi (rather than the chi-squared distribution), but it really would not make much difference if the regions were based on HDRs for chi-squared or on values of chi-squared corresponding to HDRs for log chi-squared.

  3.4.4 Predictive distribution

  Once we know that λ has a posterior distribution

  p(λ | x) ∝ λ^(ν₁/2 − 1) exp(−S₁λ/2),

  then since

  p(x̃ | λ) = (λ^x̃ / x̃!) exp(−λ),

  it follows that the predictive distribution

  p(x̃ | x) ∝ ∫ λ^(x̃ + ν₁/2 − 1) exp{−(S₁ + 2)λ/2} dλ / x̃! ∝ {Γ(x̃ + ν₁/2)/x̃!} {2/(S₁ + 2)}^x̃

  (dropping factors which depend on the old data x alone). Setting π = S₁/(S₁ + 2), you can find the constant by reference to Appendix A. In fact, at least when ν₁/2 is an integer, the predictive distribution is negative binomial, that is

  x̃ ~ NB(ν₁/2, π).

  Further, although this point is not very important, it is not difficult to see that the negative binomial distribution can be generalized to the case where n is not an integer. All we need to do is to replace some factorials by corresponding gamma functions and note that (using the functional equation for the gamma function)

  Γ(n + x)/Γ(n) = (n + x − 1)(n + x − 2) … n,

  so that you can write the general binomial coefficient as

  C(n + x − 1, x) = Γ(n + x)/{Γ(n) x!}.

  The negative binomial distribution is usually defined in terms of a sequence of independent trials, each of which results in success or failure with the same probabilities π and 1 − π (such trials are often called Bernoulli trials), and considering the number x of failures before the nth success. We will not have much more use for this distribution in this book, but it is interesting to see it turning up here in a rather different context.
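  The negative binomial form is easy to confirm numerically by mixing a Poisson distribution over the scaled chi-squared (that is, gamma) posterior for λ and comparing with the NB(ν₁/2, S₁/(S₁ + 2)) probabilities. The sketch below is mine; the values of ν₁ and S₁ are purely illustrative, and ν₁/2 is deliberately not an integer.

    import numpy as np
    from scipy.stats import gamma, nbinom

    nu1, S1 = 25.0, 13.5                    # purely illustrative posterior parameters

    # Posterior for lambda: S1*lambda ~ chi-squared on nu1 d.f., i.e. Gamma(shape nu1/2, rate S1/2)
    posterior = gamma(a=nu1 / 2, scale=2 / S1)

    # Monte Carlo version of the predictive: draw lambda, then draw a new Poisson count
    rng = np.random.default_rng(0)
    lams = posterior.rvs(size=200_000, random_state=rng)
    draws = rng.poisson(lams)

    # Analytic predictive: negative binomial NB(nu1/2, pi) with pi = S1/(S1 + 2)
    predictive = nbinom(nu1 / 2, S1 / (S1 + 2))

    for k in range(6):
        print(k, round(float(np.mean(draws == k)), 4), round(float(predictive.pmf(k)), 4))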

  3.5 The uniform distribution

  3.5.1 Preliminary definitions

  The support of a density is defined as the set of values of x for which it is non-zero. A simple example of a family of densities in which the support depends on the unknown parameter is the family of uniform distributions (defined later). While problems involving the uniform distribution do not arise all that often in practice, it is worth while seeing what complications can arise in cases where the support does depend on the unknown parameter.

  It is useful to begin with a few definitions. The indicator function of any set A is defined by

  I_A(x) = 1 if x ∈ A and I_A(x) = 0 otherwise.

  This is sometimes called the characteristic function of the set A in some other branches of mathematics, but not in probability and statistics (where the term characteristic function is applied to the Fourier–Stieltjes transform of the distribution function).

  We say that y has a Pareto distribution with parameters ξ and γ and write

  y ~ Pa(ξ, γ)

  if it has density

  p(y) = γ ξ^γ y^(−γ−1) I_{(ξ,∞)}(y).

  This distribution is often used as a model for distributions of income. A survey of its properties and applications can be found in Arnold (1983).

  We say that x has a uniform distribution (or a rectangular distribution) on (α, β) and write

  x ~ U(α, β)

  if it has density

  p(x) = (β − α)^(−1) I_{(α,β)}(x),

  so that all values in the interval (α, β) are equally likely.

  3.5.2 Uniform distribution with a fixed lower endpoint

  Now suppose we have n independent observations x = (x₁, x₂, …, xₙ) such that

  xᵢ ~ U(0, θ)

  for each i, where θ is a single unknown parameter. Then

  p(x | θ) = ∏ᵢ θ^(−1) I_{(0,θ)}(xᵢ).

  It is now easy to see that we can write the likelihood as

  l(θ | x) = θ^(−n) ∏ᵢ I_{(0,θ)}(xᵢ).

  Defining

  M = max{x₁, x₂, …, xₙ},

  it is clear that

  l(θ | x) = θ^(−n) I_{(M,∞)}(θ).

  Because the likelihood depends on the data through M alone, it follows that M is sufficient for θ given x.

  It is now possible to see that the Pareto distribution provides the conjugate prior for the above likelihood. For if θ has prior

  p(θ) ∝ θ^(−γ−1) I_{(ξ,∞)}(θ),   that is, θ ~ Pa(ξ, γ),

  then the posterior is

  p(θ | x) ∝ θ^(−γ−1) I_{(ξ,∞)}(θ) × θ^(−n) I_{(M,∞)}(θ) = θ^(−(γ+n)−1) I_{(ξ,∞)}(θ) I_{(M,∞)}(θ).

  If now we write

  ξ′ = max{ξ, M},   γ′ = γ + n,

  so that θ > ξ′ if and only if θ > ξ and θ > M, and hence

  I_{(ξ′,∞)}(θ) = I_{(ξ,∞)}(θ) I_{(M,∞)}(θ),

  we see that

  p(θ | x) ∝ θ^(−γ′−1) I_{(ξ′,∞)}(θ).

  It follows that if the prior is Pa(ξ, γ) then the posterior is Pa(ξ′, γ′), and hence that the Pareto distribution does indeed provide the conjugate family. We should note that neither the uniform nor the Pareto distribution falls into the exponential family, so that we are not here employing the unambiguous definition of conjugacy given in Section 2.11 on ‘The exponential family’. Although this means that the cautionary remarks of Diaconis and Ylvisaker (1979 and 1985) (quoted in Section 2.10 on conjugate prior distributions) apply, there is no doubt of the ‘naturalness’ of the Pareto distribution in this context.
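  Computationally the update is trivial; the following sketch (my own helper, with hypothetical numbers) spells it out:

    def pareto_update(xi, gamma_, xs):
        """Posterior for theta when x_i ~ U(0, theta) and the prior is Pa(xi, gamma_):
        the posterior is Pa(max(xi, M), gamma_ + n), M being the largest observation."""
        M, n = max(xs), len(xs)
        return max(xi, M), gamma_ + n

    # Hypothetical prior and data, purely for illustration
    print(pareto_update(xi=1.0, gamma_=2.0, xs=[0.7, 2.3, 1.8, 0.2]))   # (2.3, 6.0)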

  3.5.3 The general uniform distribution

  The case where both parameters of a uniform distribution are unknown is less important, but it can be dealt with similarly. In this case, it turns out that an appropriate family of conjugate prior distributions is given by the bilateral bivariate Pareto distribution. We say that the ordered pair (y, z) has such a distribution and write

  (y, z) ~ Pabb(ξ, η, γ)

  if the joint density is

  p(y, z) = γ(γ + 1)(η − ξ)^γ (z − y)^(−γ−2) I_{(−∞,ξ)}(y) I_{(η,∞)}(z)   (where ξ < η).

  Now suppose we have n independent observations x = (x₁, x₂, …, xₙ) such that

  xᵢ ~ U(α, β),

  where α and β are unknown. Then

  p(x | α, β) = (β − α)^(−n) ∏ᵢ I_{(α,β)}(xᵢ).

  Defining

  m = min{x₁, x₂, …, xₙ},   M = max{x₁, x₂, …, xₙ},

  it is clear that the likelihood can be written as

  l(α, β | x) = (β − α)^(−n) I_{(−∞,m)}(α) I_{(M,∞)}(β).

  Because the likelihood depends on the data through m and M alone, it follows that (m, M) is sufficient for (α, β) given x.

  It is now possible to see that the bilateral bivariate Pareto distribution provides the conjugate prior for the aforementioned likelihood. For if (α, β) has prior

  p(α, β) ∝ (β − α)^(−γ−2) I_{(−∞,ξ)}(α) I_{(η,∞)}(β),   that is, (α, β) ~ Pabb(ξ, η, γ),

  then the posterior is

  p(α, β | x) ∝ (β − α)^(−(γ+n)−2) I_{(−∞,ξ)}(α) I_{(η,∞)}(β) I_{(−∞,m)}(α) I_{(M,∞)}(β).

  If now we write

  ξ′ = min{ξ, m},   η′ = max{η, M},   γ′ = γ + n,

  we see that

  p(α, β | x) ∝ (β − α)^(−γ′−2) I_{(−∞,ξ′)}(α) I_{(η′,∞)}(β).

  It follows that if the prior is Pabb(ξ, η, γ) then the posterior is Pabb(ξ′, η′, γ′), and hence that the bilateral bivariate Pareto distribution does indeed provide the conjugate prior.
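  Again the update amounts to two lines of bookkeeping; a sketch of my own, with hypothetical numbers:

    def pabb_update(xi, eta, gamma_, xs):
        """Posterior for (alpha, beta) when x_i ~ U(alpha, beta) under a Pabb(xi, eta, gamma_) prior:
        the posterior is Pabb(min(xi, m), max(eta, M), gamma_ + n)."""
        m, M, n = min(xs), max(xs), len(xs)
        return min(xi, m), max(eta, M), gamma_ + n

    # Hypothetical prior and data, purely for illustration
    print(pabb_update(xi=0.5, eta=2.0, gamma_=1.0, xs=[0.8, 1.4, 2.6]))   # (0.5, 2.6, 4.0)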

  The properties of this and of the ordinary Pareto distribution are, as usual, described in Appendix A.

  3.5.4 Examples

  I realize that the case of the uniform distribution, and in particular the case of a uniform distribution U(0, θ), must be of considerable importance, since it is considered in virtually all the standard text books on statistics. Strangely, however, none of the standard references seems to be able to find any reasonably plausible practical case in which it arises [with apologies to DeGroot (1970, Section 9.7) if his case really does arise]. In the circumstances, consideration of examples is deferred until Section 3.6, and even then the example considered will be artificial.

 
