This example will be considered further in Chapter 9.
2.13.4 Concluding remarks
While Bayesian techniques are, in principle, just as applicable when there are two or more unknown parameters as when there is only one, the practical problems are considerably increased. The computational problems can be quite severe if the prior is not from the conjugate family, but, even more importantly, it is difficult to convince yourself that you have specified the prior to your satisfaction. In the case of the normal distribution, if the prior is taken from the conjugate family, the mean and variance are not usually independent, which makes it quite difficult to understand the nature of the assumption you are making. Of course, the more data you have, the less the prior matters, and hence some of these difficulties become less important. Fuller consideration will be given to numerical methods in Chapter 9.
2.14 Exercises on Chapter 2
1. Suppose that k ~ B(n, π), so that k has a binomial distribution of index n and parameter π. Find the standardized likelihood as a function of π for given k. Which of the distributions listed in Appendix A does this represent?
2. Suppose we are given the 12 observations from a normal distribution:
15.644, 16.437, 17.287, 14.448, 15.308, 15.169,
18.123, 17.635, 17.259, 16.311, 15.390, 17.252,
and we are told that the variance is σ² = 1. Find a 90% HDR for the posterior distribution of the mean assuming the usual reference prior.
3. With the same data as in the previous question, what is the predictive distribution for a possible future observation x?
4. A random sample of size n is to be taken from an N(θ, φ) distribution, where φ is known. How large must n be to reduce the posterior variance of θ to the fraction 1/k of its original value (where k > 1)?
5. Your prior beliefs about a quantity θ are such that
p(θ) = 1 if θ ≥ 0, p(θ) = 0 if θ < 0.
A random sample of size 25 is taken from an N(θ, 1) distribution and the mean of the observations is observed to be 0.33. Find a 95% HDR for θ.
6. Suppose that you have prior beliefs about an unknown quantity θ which can be approximated by an N(λ, φ) distribution, while my beliefs can be approximated by an N(μ, ψ) distribution. Suppose further that the reasons that have led us to these conclusions do not overlap with one another. What distribution should represent our beliefs about θ when we take into account all the information available to both of us?
7. Prove the theorem quoted without proof in Section 2.4.
8. Under what circumstances can a likelihood arising from a distribution in the exponential family be expressed in data translated form?
9. Suppose that you are interested in investigating how variable the performance of schoolchildren is on a new mathematics test, and that you begin by trying this test out on children in 12 similar schools. It turns out that the average standard deviation is about 10 marks. You then want to try the test on a thirteenth school, which is fairly similar to those you have already investigated, and you reckon that the data on the other schools give you a prior for the variance in this new school which has a mean of 100 and is worth eight direct observations on the school. What is the posterior distribution for the variance if you then observe a sample of size 30 from the school, of which the standard deviation is 13.2? Give an interval in which the variance lies with 90% posterior probability.
10. The following are the dried weights of a number of plants (in g) from a batch of seeds:
4.17, 5.58, 5.18, 6.11, 4.50, 4.61, 5.17, 4.53, 5.33, 5.14.
Give 90% HDRs for the mean and variance of the population from which they come.
11. Find a sufficient statistic for μ given an n-sample x = (x₁, x₂, …, xₙ) from the exponential distribution
p(x | μ) = μ⁻¹ exp(−x/μ) (0 < x < ∞),
where the parameter μ can take any value in 0 < μ < ∞.
12. Find a (two-dimensional) sufficient statistic for (α, β) given an n-sample x = (x₁, x₂, …, xₙ) from the two-parameter gamma distribution
p(x | α, β) = {β^α Γ(α)}⁻¹ x^(α−1) exp(−x/β) (0 < x < ∞),
where the parameters α and β can take any values in 0 < α < ∞, 0 < β < ∞.
13. Find a family of conjugate priors for the likelihood l(β | x) = p(x | α, β), where p(x | α, β) is as in the previous question, but α is known.
14. Show that the tangent of a random angle (i.e. one which is uniformly distributed on (−½π, ½π)) has a Cauchy distribution C(0, 1).
15. Suppose that the vector x = (x, y, z) has a trinomial distribution depending on the index n and the parameter π = (π, ρ, σ), where π + ρ + σ = 1, that is,
p(x | π) = {n!/(x! y! z!)} π^x ρ^y σ^z (x + y + z = n).
Show that this distribution is in the two-parameter exponential family.
16. Suppose that the results of a certain test are known, on the basis of general theory, to be normally distributed about the same mean μ with the same variance φ, neither of which is known. Suppose further that your prior beliefs about (μ, φ) can be represented by a normal/chi-squared distribution with
ν₀ = 4, S₀ = 350, n₀ = 1, θ₀ = 85.
Now suppose that 100 observations are obtained from the population with mean 89 and sample variance s² = 30. Find the posterior distribution of (μ, φ). Compare 50% prior and posterior HDRs for μ.
17. Suppose that your prior for θ is a mixture of N(0, 1) and N(1, 1), and that a single observation x ~ N(θ, 1) turns out to equal 2. What is your posterior probability that θ ≥ 1?
18. Establish the formula
S₁ = S₀ + Σxᵢ² + n₀θ₀² − n₁θ₁²,
where n₁ = n₀ + n and θ₁ = (n₀θ₀ + n x̄)/n₁, which was quoted in Section 2.13 as providing a formula for the parameter S₁ of the posterior distribution in the case where both mean and variance are unknown which is less susceptible to rounding errors.
3
Some other common distributions
3.1 The binomial distribution
3.1.1 Conjugate prior
In this section, the parameter of interest is the probability π of success in a number of trials which can result in success (S) or failure (F), the trials being independent of one another and having the same probability of success. Suppose that there is a fixed number n of trials, so that you have an observation x (the number of successes) such that
x ~ B(n, π)
from a binomial distribution of index n and parameter π, and so
p(x | π) = (n choose x) π^x (1 − π)^(n−x) ∝ π^x (1 − π)^(n−x).
The binomial distribution was introduced in Section 1.3 on ‘Random Variables’ and its properties are of course summarized in Appendix A.
Figure 3.1 Examples of beta densities.
If your prior for π has the form
p(π) ∝ π^(α−1) (1 − π)^(β−1) (0 ≤ π ≤ 1),
that is, if
π ~ Be(α, β)
has a beta distribution (which is also described in the same Appendix), then the posterior evidently has the form
p(π | x) ∝ π^(α+x−1) (1 − π)^(β+n−x−1),
that is,
π | x ~ Be(α + x, β + n − x).
It is immediately clear that the family of beta distributions is conjugate to a binomial likelihood.
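As a minimal illustration of this conjugate updating (a sketch in Python using scipy, not part of the original text; the numerical values are arbitrary):

```python
# Conjugate beta-binomial updating: a Be(alpha0, beta0) prior combined with
# x successes in n trials gives a Be(alpha0 + x, beta0 + n - x) posterior.
from scipy import stats

alpha0, beta0 = 2.0, 7.0   # illustrative prior hyperparameters
n, x = 12, 2               # data: x successes in n trials

alpha1, beta1 = alpha0 + x, beta0 + (n - x)
posterior = stats.beta(alpha1, beta1)

print(f"posterior is Be({alpha1:g}, {beta1:g})")   # Be(4, 17)
print(f"posterior mean = {posterior.mean():.3f}")
```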
The family of beta distributions is illustrated in Figure 3.1. Basically, any reasonably smooth unimodal distribution on [0, 1] is likely to be reasonably well approximated by some beta distribution, so that it is very often possible to approximate your prior beliefs by a member of the conjugate family, with all the simplifications that this implies. In identifying an appropriate member of the family, it is often useful to equate the mean
α/(α + β)
of Be(α, β) to a value which represents your belief and to equate α + β to a number which in some sense represents the number of observations which you reckon your prior information to be worth. (It is arguable that α + β + 1 or α + β + 2 would be a slightly better measure of this number, but in practice this will make no real difference.) Alternatively, you could equate the mean to a value which represents your beliefs about the location of π and the variance
αβ / {(α + β)²(α + β + 1)}
of Be(α, β) to a value which represents how spread out your beliefs are.
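Both elicitation schemes just described are easy to mechanize. A hypothetical sketch (the function names are my own, not the book's):

```python
# Choose Be(alpha, beta) hyperparameters from elicited summaries.

def beta_from_mean_and_worth(mean, n0):
    """Equate alpha/(alpha + beta) to `mean` and alpha + beta to `n0`,
    the number of observations the prior is reckoned to be worth."""
    return mean * n0, (1.0 - mean) * n0

def beta_from_mean_and_variance(mean, var):
    """Equate the Be(alpha, beta) mean and variance to elicited values."""
    nu = mean * (1.0 - mean) / var - 1.0   # nu = alpha + beta
    return mean * nu, (1.0 - mean) * nu

print(beta_from_mean_and_worth(0.2, 9))         # (1.8, 7.2)
print(beta_from_mean_and_variance(0.2, 0.016))  # also (1.8, 7.2)
```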
3.1.2 Odds and log-odds
We sometimes find it convenient to work in terms of the odds on success against failure, defined by
λ = π/(1 − π),
so that π = λ/(1 + λ). One reason for this is the relationship mentioned in Appendix A that if π ~ Be(α, β) then
(β/α)λ = (β/α) π/(1 − π)
has Snedecor's F distribution F_{2α,2β}. Moreover, the log-odds
Λ = log λ = log{π/(1 − π)}
is close to having Fisher's z distribution; more precisely,
½{Λ + log(β/α)} ~ z_{2α,2β}.
It is then easy to deduce from the properties of the z distribution given in Appendix A that
EΛ ≅ log(α/β), 𝒱Λ ≅ α⁻¹ + β⁻¹.
One reason why it is useful to consider the odds and log-odds is that tables of the F and z distributions are more readily available than tables of the beta distribution.
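The relationship between the beta distribution and F quoted above is also easy to check by simulation (a sketch assuming numpy and scipy are available; the parameter values are arbitrary):

```python
# Check: if pi ~ Be(alpha, beta), then (beta/alpha) * pi/(1 - pi) should
# have an F distribution on (2*alpha, 2*beta) degrees of freedom.
import numpy as np
from scipy import stats

alpha, beta = 4.0, 17.0
rng = np.random.default_rng(0)

pi = stats.beta(alpha, beta).rvs(size=100_000, random_state=rng)
scaled_odds = (beta / alpha) * pi / (1.0 - pi)

f_dist = stats.f(2 * alpha, 2 * beta)
for q in (0.05, 0.5, 0.95):
    # Simulated quantiles should agree closely with the F quantiles.
    print(q, np.quantile(scaled_odds, q).round(3), f_dist.ppf(q).round(3))
```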
3.1.3 Highest density regions
Tables of HDRs of the beta distribution are available [see Novick and Jackson (1974, Table A.15) or Isaacs et al. (1974, Table 43)], but it is not necessary or particularly desirable to use them. (The reason is related to the reason for not using HDRs for the inverse chi-squared distribution as such.) In Section 3.2, we shall discuss the choice of a reference prior for the unknown parameter π. It turns out that there are several possible candidates for this honour, but there is at least a reasonably strong case for using a prior
p(π) ∝ π⁻¹(1 − π)⁻¹.
Using the usual change-of-variable rule, it is easily seen that this implies a uniform prior
p(Λ) ∝ 1
in the log-odds Λ. As argued in Section 2.8 on ‘HDRs for the normal variance’, this would seem to be an argument in favour of using an interval in which the posterior distribution of Λ is higher than anywhere outside. The Appendix includes tables of values of F corresponding to HDRs for log F, and the distribution of Λ as deduced earlier is clearly very nearly that of log F plus a constant. Hence, in seeking, for example, a 90% interval for π when π ~ Be(α, β), we should first look up values F̲ and F̄ corresponding to a 90% HDR for log F_{2α,2β}. Then a suitable interval for values of the odds λ is given by
(α/β)F̲ ≤ λ ≤ (α/β)F̄,
from which it follows that a suitable interval of values of π is
(α/β)F̲ / {1 + (α/β)F̲} ≤ π ≤ (α/β)F̄ / {1 + (α/β)F̄}.
If the tables were going to be used solely for this purpose, they could be better arranged to avoid some of the arithmetic involved at this stage, but as they are used for other purposes and do take a lot of space, the minimal extra arithmetic is justifiable.
Although this is not the reason for using these tables, a helpful thing about them is that we need not tabulate values of F̲ and F̄ for ν₁ < ν₂. This is because if F has an F_{ν₁,ν₂} distribution then F⁻¹ has an F_{ν₂,ν₁} distribution. It follows that if an HDR for log F_{ν₁,ν₂} is (log F̲, log F̄), then an HDR for log F_{ν₂,ν₁} is (−log F̄, −log F̲), and so if F_{ν₁,ν₂} is replaced by F_{ν₂,ν₁} then the interval (F̲, F̄) is simply replaced by (F̄⁻¹, F̲⁻¹). There is no such simple relationship in tables of HDRs for F itself or in tables of HDRs for the beta distribution.
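Where the tables are not to hand, the same interval can be found numerically. The following sketch is my own, not from the text: it grids the posterior density of the log-odds, accumulates the highest-density points until the required probability is reached, and maps the endpoints back to the π scale:

```python
# Grid-based HDR for Lambda = log(pi/(1 - pi)) when pi ~ Be(alpha, beta).
# The density of Lambda is proportional to
#   exp(alpha * L) / (1 + exp(L))**(alpha + beta).
import numpy as np

def logodds_hdr(alpha, beta, level=0.9, grid=100_001):
    lam = np.linspace(-15.0, 15.0, grid)
    dens = np.exp(alpha * lam - (alpha + beta) * np.log1p(np.exp(lam)))
    dens /= dens.sum()                    # normalize over the grid
    order = np.argsort(dens)[::-1]        # highest density first
    n_keep = np.searchsorted(np.cumsum(dens[order]), level) + 1
    keep = order[:n_keep]                 # contiguous for a unimodal density
    lo, hi = lam[keep.min()], lam[keep.max()]
    expit = lambda t: 1.0 / (1.0 + np.exp(-t))
    return expit(lo), expit(hi)           # back to the pi scale

print(logodds_hdr(4, 17))   # roughly (0.08, 0.36); cf. the example below
```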
3.1.4 Example
It is my guess that about 20% of the best known (printable) limericks have the same word at the end of the last line as at the end of the first. However, I am not very sure about this, so I would say that my prior information was only ‘worth’ some nine observations. If I seek a conjugate prior Be(α, β) to represent my beliefs, I need to take
α/(α + β) = 0.2, α + β = 9.
These equations imply that α = 1.8 and β = 7.2. There is no particular reason to restrict α and β to integer values, but on the other hand prior information is rarely very precise, so it seems simpler to take α = 2 and β = 7. Having made these conjectures, I then looked at one of my favourite books of light verse, Silcock (1952), and found that it included 12 limericks, of which two (both by Lear) have the same word at the ends of the first and last lines. This leads me to a posterior which is Be(4, 17). I can obtain some idea of what this distribution is like by looking for a 90% HDR. From interpolation in the tables in the Appendix, the values of F corresponding to a 90% HDR for log F_{34,8} are F̲ = 0.42 and F̄ = 2.86. It follows that an appropriate interval of values of F_{8,34} is (2.86⁻¹, 0.42⁻¹), that is (0.35, 2.38), so that an appropriate interval for π is
(4/17)(0.35) / {1 + (4/17)(0.35)} ≤ π ≤ (4/17)(2.38) / {1 + (4/17)(2.38)},
that is (0.08, 0.36).
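The final arithmetic of this example is easily reproduced (a two-line sketch using the interval (0.35, 2.38) quoted above):

```python
# Map the 90% interval for F(8, 34) to the pi scale via
# lambda = (alpha/beta) * F and pi = lambda/(1 + lambda).
alpha, beta = 4, 17
for f in (0.35, 2.38):
    lam = (alpha / beta) * f
    print(round(lam / (1 + lam), 2))   # prints 0.08 then 0.36
```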
If for some reason we want HDRs for π itself, instead of for the log-odds Λ, then we can use the tables quoted earlier [namely, Novick and Jackson (1974, Table A.15) or Isaacs et al. (1974, Table 43)]. Alternatively, Novick and Jackson (1974, Section 5.5) point out that a reasonable approximation can be obtained by finding the median of the posterior distribution and looking for a 90% interval such that the probability of being between the lower bound and the median is 45% and the probability of being between the median and the upper bound is 45%. The usefulness of this procedure lies in the ease with which it can be followed using tables of the percentage points of the beta distribution alone, should tables of HDRs be unavailable. It can even be used in connection with the nomogram which constitutes Table 17 of Pearson and Hartley (ed.) (1966), although the resulting accuracy leaves something to be desired. On the whole, the use of the tables of values of F corresponding to HDRs for log F, as described earlier, seems preferable.
3.1.5 Predictive distribution
The posterior distribution is clearly of the form Be(α, β) for some α and β (which, of course, include x and n – x, respectively), so that the predictive distribution of the number of successes x̃ in m further trials, after we have the single observation x on top of our previous background information, is
p(x̃ | x) = ∫ p(x̃ | π) p(π | x) dπ = (m choose x̃) B(α + x̃, β + m − x̃) / B(α, β) (x̃ = 0, 1, …, m),
where B(·, ·) denotes the beta function.
This distribution is known as the beta-binomial distribution, or sometimes as the Pólya distribution [see Calvin, 1984]. We shall not have a great deal of use for it in this book, although we will refer to it briefly in Chapter 9. It is considered, for example, in Raiffa and Schlaifer (1961, Section 7.11). We shall encounter a related distribution, the beta-Pascal distribution, in Section 7.3 when we consider informative stopping rules.
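For completeness, here is a sketch of the beta-binomial predictive using scipy.stats.betabinom (available in SciPy 1.4 and later); the choice of m = 12 further trials and the Be(4, 17) posterior are taken from the limerick example purely for illustration:

```python
# Predictive distribution of the number of successes y in m further trials,
# given a Be(a, b) posterior: y ~ BetaBinom(m, a, b).
from scipy import stats

m, a, b = 12, 4, 17
pred = stats.betabinom(m, a, b)
for y in range(4):
    print(y, round(pred.pmf(y), 3))
print("predictive mean:", round(pred.mean(), 3))   # equals m*a/(a + b)
```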
3.2 Reference prior for the binomial likelihood
3.2.1 Bayes’ postulate
The Rev. Thomas Bayes himself in Bayes (1763) put forward arguments in favour of a uniform prior
p(π) = 1 (0 ≤ π ≤ 1)
(which, unlike the choice of a prior uniform over the whole real line, is a proper density in that it integrates to unity) as the appropriate one to use when we are ‘completely ignorant’. This choice of prior has long been known as Bayes’ postulate, as distinguished from his theorem. The same prior was used by Laplace (1774). It is a member of the conjugate family, to wit Be(1, 1).
Bayes’ arguments are quite intricate, and still repay study. Nevertheless, he seems to have had some doubts about the validity of the postulate, and these doubts appear to have been partly responsible for the fact that his paper was not published in his lifetime, but rather communicated posthumously by his friend Richard Price.
The postulate seems intuitively reasonable, in that it seems to treat all values of π on a level and thus reflect the fact that you see no reason for preferring any one value to any other. However, you should not be too hasty in endorsing it, because ignorance about the value of π presumably implies ignorance about the value of any function of π, and yet when the change of variable rule is used a uniform prior for π will not usually imply a uniform prior for a function of π.
One possible argument for it is as follows. A ‘natural’ estimator for the parameter π of a binomial distribution of index n is the observed proportion x/n of successes, and it might seem a sensible estimator to use when we have no prior information. It is in fact the maximum likelihood estimator, that is, the value of π for which the likelihood
l(π | x) ∝ π^x (1 − π)^(n−x)
is a maximum. In classical or sampling theory statistics, it is also commended for various reasons which do not usually carry much weight with Bayesians, for example that it is unbiased, that is,
E(x/n) = π
(the expectation being taken over repeated sampling) whatever the value of π is. Indeed, it is not hard to show that it is a minimum variance unbiased estimator (MVUE).
Now if you have a prior Be(α, β) and so get a posterior which is Be(α + x, β + n − x), it might seem natural to say that a good estimator for π would be obtained by finding that value at which the posterior density is a maximum, that is, the posterior mode. This procedure is clearly related to the idea of maximum likelihood. Since the posterior mode occurs at
(α + x − 1)/(α + β + n − 2),
as is easily checked by differentiation, this posterior mode coincides with x/n if and only if α = β = 1, that is, the prior is uniform.
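A quick numeric check of this claim (a sketch; the parameter values are arbitrary):

```python
# The posterior mode (alpha + x - 1)/(alpha + beta + n - 2) coincides with
# the maximum likelihood estimate x/n for every x iff alpha = beta = 1.
n = 10
for alpha, beta in [(1.0, 1.0), (2.0, 7.0), (0.5, 0.5)]:
    agree = all(
        abs((alpha + x - 1) / (alpha + beta + n - 2) - x / n) < 1e-12
        for x in range(n + 1)
    )
    print(alpha, beta, agree)   # True only for the uniform prior (1, 1)
```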
Jeffreys (1961, Section 3.1) argued that ‘Again, is there not a preponderance at the extremes? Certainly if we take the Bayes–Laplace rule right up to the extremes we are led to results that do not correspond to anybody’s way of thinking.’
3.2.2 Haldane’s prior
Another suggestion, due to Haldane (1931), is to use a Be(0, 0) prior, which has density
p(π) ∝ π⁻¹(1 − π)⁻¹,
which is an improper density and is equivalent (by the usual change of variable argument) to a prior uniform in the log-odds
Λ = log{π/(1 − π)}.
An argument for this prior based on the ‘naturalness’ of the estimator x/n when the prior is Be(α, β) is that the mean of the posterior distribution Be(α + x, β + n − x) for π, namely,
(α + x)/(α + β + n),
coincides with x/n if and only if α = β = 0. (There is a connection here with the classical notion of the unbiasedness of x/n.)
Another argument that has been used for this prior is that since any observation always increases either α or β, it corresponds to the greatest possible ignorance to take α and β as small as possible. For a beta density to be proper (i.e. to have a finite integral and so be normalizable, so that its integral is unity) it is necessary and sufficient that α and
β should both be strictly greater than 0. This can then be taken as an indication that the right reference prior is Be(0, 0).
A point against this choice of prior is that if we have one observation with probability of success π, then use of this prior results in a posterior which is Be(1, 0) if that observation is a success and Be(0, 1) if it is a failure. However, a Be(1, 0) distribution gives infinitely more weight to values near 1 than to values away from 1, and so it would seem that a sample with just one success in it would lead us to conclude that all future observations will result in successes, which seems unreasonable on the basis of so small an amount of evidence.
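It may help to see how much the choice among these candidate reference priors actually matters for a small sample. A sketch (using the limerick data of Section 3.1 purely for illustration):

```python
# Posterior means (alpha + x)/(alpha + beta + n) for x = 2 successes in
# n = 12 trials under three candidate reference priors.
n, x = 12, 2
for name, a, b in [("Haldane Be(0, 0)", 0.0, 0.0),
                   ("arc-sine Be(1/2, 1/2)", 0.5, 0.5),
                   ("uniform Be(1, 1)", 1.0, 1.0)]:
    print(name, round((a + x) / (a + b + n), 3))
```

With only 12 observations the three answers (0.167, 0.192 and 0.214) differ noticeably; with substantial data the choice matters much less.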
3.2.3 The arc-sine distribution
A possible compromise between Be(1, 1) and Be(0, 0) is Be(½, ½), that is, the (proper) density
p(π) ∝ π^(−½)(1 − π)^(−½).
This distribution is sometimes called the arc-sine distribution (cf. Feller, 1968, vol. 1, Section III.4). In Section 3.3, we will see that a general principle known as Jeffreys’ rule suggests that this is the correct reference prior to use. However, Jeffreys’ rule is a guideline which cannot be followed blindly, so this in itself does not settle the matter.
The prior can easily be shown (by the usual change-of-variable rule) to imply a uniform prior for
sin⁻¹ √π.
This transformation is related to the transformation of the data when
x ~ B(n, π),
in which z is defined by
z = sin⁻¹ √(x/n).
This transformation was first introduced in Section 1.5 on ‘Means and Variances’, where we saw that it results in the approximations
Ez ≅ sin⁻¹ √π, 𝒱z ≅ 1/(4n).
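These approximations are easily checked by simulation (a sketch assuming numpy; the values of n and π are arbitrary):

```python
# For x ~ B(n, pi) and z = arcsin(sqrt(x/n)), check that E z is close to
# arcsin(sqrt(pi)) and that the variance of z is close to 1/(4n) whatever
# the value of pi.
import numpy as np

rng = np.random.default_rng(1)
n = 100
for pi in (0.2, 0.5, 0.8):
    x = rng.binomial(n, pi, size=200_000)
    z = np.arcsin(np.sqrt(x / n))
    print(pi, z.mean().round(4), np.arcsin(np.sqrt(pi)).round(4),
          z.var().round(5), 1 / (4 * n))
```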