(cf. Section 2.4 on ‘Dominant likelihoods’) with
ψ(φ) = log φ and t(x) = log S.
The general argument about data translated likelihoods now suggests that we take as reference prior an improper density which is locally uniform in ψ = log φ, that is p(ψ) ∝ 1, which, in terms of φ, corresponds to
p(φ) ∝ |dψ/dφ|
and so to
p(φ) ∝ 1/φ.
(Although the above argument is complicated, and a similarly complicated example will occur in the case of the uniform distribution in Section 3.6, there will be no other difficult arguments about data translated likelihoods.)
This prior (which was first mentioned in Section 2.4 on dominant likelihoods) is, in fact, a particular case of the priors of the form S0χ⁻²_ν which we were considering earlier, in which ν = 0 and S0 = 0. Use of the reference prior p(φ) ∝ 1/φ results in a posterior distribution
p(φ | x) ∝ φ^(−n/2−1) exp(−S/2φ),   that is,   φ ~ Sχ⁻²_n,
which again is a particular case of the distribution found before, and is quite easy to use.
You should perhaps be warned that inferences about variances are not as robust as inferences about means if the underlying distribution turns out to be only approximately normal, in the sense that they are more dependent on the precise choice of prior distribution.
2.8 HDRs for the normal variance
2.8.1 What distribution should we be considering?
It might be thought that as the normal variance has (under the assumptions we are making) a distribution which is a multiple of the inverse chi-squared distribution we should be using tables of HDRs for the inverse chi-squared distribution to give intervals in which most of the posterior distribution lies. This procedure is, indeed, recommended by, for example, Novick and Jackson (1974, Section 7.3) and Schmitt (1969, Section 6.3). However, there is another procedure which seems to be marginally preferable.
The point is that we chose a reference prior which was uniform in log φ, so that the density of log φ was constant and no value of log φ was more likely than any other a priori. Because of this, it seems natural to work with log φ in the posterior distribution and thus to look for an interval inside which the posterior density of log φ is higher than anywhere outside. It might seem that this implies the use of tables of HDRs of log chi-squared, but in practice it is more convenient to use tables of the corresponding values of chi-squared, and such tables can be found in the Appendix. In fact, it does not make much difference whether we look for regions of highest density of the inverse chi-squared distribution or of the log chi-squared distribution, but insofar as there is a difference it seems preferable to base inferences on the log chi-squared distribution.
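As a computational aside (not part of the original text), values like those tabulated in the Appendix can be reproduced numerically. The sketch below, in Python, assumes SciPy is available and uses a function name of our own choosing; it searches for chi-squared values c1 < c2 at which the density of log chi-squared is equal and between which the probability is 0.95.

```python
# A minimal sketch (not from the book) of how HDR values for log chi-squared
# can be computed.  For nu degrees of freedom we look for chi-squared values
# c1 < c2 such that the density of log chi-squared is equal at the endpoints
# and the probability between them is 0.95.
from math import log
from scipy.stats import chi2
from scipy.optimize import brentq

def log_chi2_hdr(nu, prob=0.95):
    # On the chi-squared scale, the density of log chi-squared is proportional
    # to x**(nu/2) * exp(-x/2), which has its mode at x = nu.
    def log_density(x):
        return (nu / 2.0) * log(x) - x / 2.0

    mode = float(nu)

    def matching_upper(c1):
        # Upper endpoint with the same log chi-squared density as c1
        return brentq(lambda x: log_density(x) - log_density(c1), mode, 100.0 * nu)

    def coverage_error(c1):
        c2 = matching_upper(c1)
        return chi2.cdf(c2, nu) - chi2.cdf(c1, nu) - prob

    c1 = brentq(coverage_error, 0.01, mode - 0.01)
    return c1, matching_upper(c1)

print(log_chi2_hdr(20))   # approximately (9.958, 35.227), cf. the Appendix
```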
2.8.2 Example
When we considered the normal distribution with unknown mean but known variance, we had to admit that this was a situation which rarely occurred in real-life examples. This is even more true when it comes to the case where the mean is known and the variance unknown, and it should really be thought of principally as a building block towards the structure we shall erect to deal with the more realistic case where both mean and variance are unknown.
We shall, therefore, consider an example in which the mean was in fact unknown, but treat it as if the mean were known. The following numbers give the uterine weight (in mg) of 20 rats drawn at random from a large stock:
It is easily checked that n = 20, that the sample mean is 21.0, and that the sum of squares about this value is S = Σ(xi − 21.0)² = 664.
In such a case, we do not know that the mean is 21.0 (or at least it is difficult to imagine circumstances in which we could have this information). However, we shall exemplify the methodology for the case where the mean is known by analyzing these data as if we knew that the mean were μ = 21.0. If this were so, then we would be able to assert that
S/φ ~ χ²_n,   that is,   φ ~ 664χ⁻²_20.
All the information we have about the variance φ is contained in this statement, but of course it is not necessarily easy to interpret from the point of view of someone inexperienced in the use of statistical methods (or even of someone who is experienced but does not know about the inverse chi-squared distribution). Accordingly, it may be useful to give some idea of the distribution by looking for an HDR. From the tables in the Appendix, we see that the values of chi-squared corresponding to a 95% HDR for log chi-squared on 20 degrees of freedom are 9.958 and 35.227, so that the interval for φ runs from 664/35.227 to 664/9.958, that is, φ lies in the interval (19, 67). (We note that it is foolish to quote too many significant figures in your conclusions, though it may be sensible to carry extra significant figures through intermediate calculations.) It may be worth comparing this with the results from looking at HDRs for the inverse chi-squared distribution itself. From the tables in the Appendix, a 95% HDR for the inverse chi-squared distribution on 20 degrees of freedom lies between 0.025 and 0.094, so that the interval for φ runs from 664 × 0.025 to 664 × 0.094, that is, φ lies in the interval (17, 62). It follows that the two methods do not give notably different answers.
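The arithmetic above is easily checked; the following short snippet (ours, not the book's) simply re-derives the two intervals from S = 664 and the tabulated values quoted in the text.

```python
# Re-deriving the intervals quoted above from S = 664 and the tabulated values
S = 664

# Based on the 95% HDR for log chi-squared on 20 d.f. (9.958, 35.227)
print(S / 35.227, S / 9.958)    # about 18.8 and 66.7, i.e. the interval (19, 67)

# Based on the 95% HDR for inverse chi-squared on 20 d.f. (0.025, 0.094)
print(S * 0.025, S * 0.094)     # about 16.6 and 62.4, i.e. the interval (17, 62)
```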
2.9 The role of sufficiency
2.9.1 Definition of sufficiency
When we considered the normal variance with known mean, we found that the posterior distribution depended on the data only through the single number S. It often turns out that the data can be reduced in a similar way to one or two numbers, and as long as we know them we can forget the rest of the data. It is this notion that underlies the formal definition of sufficiency.
Suppose observations x = (x1, x2, …, xn) are made with a view to gaining knowledge about a parameter θ, and that
t = t(x)
is a function of the observations. We call such a function a statistic. We often suppose that t is real valued, but it is sometimes vector valued. Using the formulae in Section 1.4 on ‘Several Random Variables’ and the fact that once we know x we automatically know the value of t, we see that for any statistic t
p(x | θ) = p(x, t | θ) = p(t | θ) p(x | t, θ).
However, it sometimes happens that
p(x | t, θ)
does not depend on θ, so that
p(x | t, θ) = p(x | t).
If this happens, we say that t is a sufficient statistic for θ given X, often abbreviated by saying that t is sufficient for θ. It is occasionally useful to have a further definition as follows: a statistic whose density does not depend on θ is said to be ancillary for θ.
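The definition can be illustrated with a toy numerical check (not from the book): for three independent Bernoulli(θ) trials, the conditional distribution of the individual observations given their sum is the same whatever θ is, so the sum is sufficient for θ.

```python
# Toy illustration of sufficiency: for three Bernoulli(theta) observations,
# p(x | t, theta) does not depend on theta when t = sum(x).
from itertools import product

def conditional_given_t(theta, t, n=3):
    seqs = [x for x in product([0, 1], repeat=n) if sum(x) == t]
    probs = [theta ** sum(x) * (1 - theta) ** (n - sum(x)) for x in seqs]
    total = sum(probs)
    return [p / total for p in probs]

print(conditional_given_t(0.3, t=2))   # [1/3, 1/3, 1/3]
print(conditional_given_t(0.8, t=2))   # the same, whatever theta is
```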
2.9.2 Neyman’s factorization theorem
The following theorem is frequently used in finding sufficient statistics:
Theorem 2.2 A statistic t is sufficient for θ given X if and only if there are functions f and g such that
p(x | θ) = f(t, θ) g(x)
where t = t(x).
Proof. If t is sufficient for θ given X we may take
f(t, θ) = p(t | θ)   and   g(x) = p(x | t).
Conversely, if the condition holds, then, in the discrete case, since once x is known then t is known,
p(t, x | θ) = p(x | θ) = f(t, θ) g(x).
We may then sum both sides of the equation over all values of X such that t(x) = t to get
p(t | θ) = f(t, θ) G(t),
where G(t) is obtained by summing g(x) over all these values of X, using the formula
G(t) = Σ_{x: t(x) = t} g(x).
In the continuous case, write
Then
so that on differentiating with respect to t we find that
Writing G(t) for the last integral, we get the same result as in the discrete case, viz.
p(t | θ) = f(t, θ) G(t).
From this it follows that
f(t, θ) = p(t | θ) / G(t).
Considering now any one value of X such that t(x) = t and substituting in the equation in the statement of the theorem, we obtain
p(x | θ) = f(t, θ) g(x) = p(t | θ) g(x) / G(t).
Since, whether t is sufficient or not,
p(x | θ) = p(t | θ) p(x | t, θ),
we see that
p(x | t, θ) = g(x) / G(t).
Since the right-hand side does not depend on θ, it follows that t is indeed sufficient, and the theorem is proved.
2.9.3 Sufficiency principle
Theorem 2.3 A statistic t is sufficient for θ given X if and only if
l(θ | x) ∝ l(θ | t)
whenever t = t(x) (where the constant of proportionality does not, of course, depend on θ).
Proof. If t is sufficient for θ given X then, by the Factorization Theorem (taking f(t, θ) = p(t | θ)),
l(θ | x) ∝ p(x | θ) = p(t | θ) p(x | t) ∝ p(t | θ) ∝ l(θ | t).
Conversely, if the condition holds, then
p(x | θ) ∝ l(θ | x) ∝ l(θ | t) ∝ p(t | θ),
so that for some function g (depending on x but not on θ) we have p(x | θ) = p(t | θ) g(x).
The theorem now follows from the Factorization Theorem.
Corollary 2.1 For any prior distribution, the posterior distribution of θ given X is the same as the posterior distribution of θ given a sufficient statistic t.
Proof. From Bayes’ Theorem, p(θ | x) ∝ p(θ) l(θ | x) ∝ p(θ) l(θ | t), so that p(θ | x) is proportional to p(θ | t); they must then be equal as they both integrate or sum to unity.
Corollary 2.2 If a statistic t = t(x) is such that p(x | θ) ∝ p(x′ | θ) whenever t(x) = t(x′), then it is sufficient for θ given X.
Proof. By summing or integrating over all X such that t(x′) = t(x), it follows that
p(t | θ) = Σ p(x′ | θ) ∝ p(x | θ),
the summations being over all x′ such that t(x′) = t(x). The result now follows from the theorem.
2.9.4 Examples
Normal variance. In the case where the xi are normal of known mean μ and unknown variance φ, we noted that
p(x | φ) ∝ φ^(−n/2) exp(−S/2φ),
where S = Σ(xi − μ)². It follows from the Factorization Theorem that S is sufficient for φ given X. Moreover, we can verify the Sufficiency Principle as follows. If we had simply been given the value of S without being told the values of the xi separately, we could have noted that for each i
(xi − μ)/√φ ~ N(0, 1),
so that S/φ is a sum of squares of n independent N(0, 1) variables. Now a χ²_n distribution is often defined as being the distribution of the sum of squares of n random variables with an N(0, 1) distribution, and the density of χ²_n can be deduced from this. It follows that S/φ ~ χ²_n and hence, if ψ = S/φ, then
p(ψ) ∝ ψ^(n/2−1) exp(−ψ/2).
Using the change of variable rule it is then easily seen that
p(S | φ) ∝ φ^(−n/2) S^(n/2−1) exp(−S/2φ).
We can thus verify the Sufficiency Principle in this particular case because, considered as functions of φ, p(x | φ) and p(S | φ) are proportional, so that l(φ | x) ∝ l(φ | S).
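The proportionality just asserted is easy to confirm numerically. The sketch below (ours; the data values are invented purely for illustration) evaluates p(x | φ) and p(S | φ), each up to a constant, over a range of values of φ and shows that their ratio does not change with φ.

```python
# Numerical check that p(x | phi) and p(S | phi) are proportional as
# functions of phi, using the densities quoted above (up to constants).
import numpy as np

mu = 0.0
x = np.array([1.2, -0.7, 0.4, 2.1, -1.5])        # invented observations
n = len(x)
S = np.sum((x - mu) ** 2)

def lik_full(phi):
    # p(x | phi) up to a constant
    return phi ** (-n / 2) * np.exp(-S / (2 * phi))

def lik_from_S(phi):
    # p(S | phi) up to a constant
    return phi ** (-n / 2) * S ** (n / 2 - 1) * np.exp(-S / (2 * phi))

for phi in (0.5, 1.0, 2.0, 5.0):
    print(lik_full(phi) / lik_from_S(phi))        # the same ratio every time
```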
Exponential distribution. Let us suppose that xi (i = 1, 2, …, n) are independently distributed with an exponential distribution (see under the gamma distribution in Appendix A), so that
p(xi | θ) = θ exp(−θ xi),
or equivalently xi ~ E(θ). Then
p(x | θ) = θ^n exp(−θS),
where S = Σ xi. It follows from the Factorization Theorem that S is sufficient for θ given X. It is also possible to verify the Sufficiency Principle in this case. In this case it is not hard to show that S has a gamma distribution, so that
p(S | θ) ∝ θ^n S^(n−1) exp(−θS),
and we find that, considered as functions of θ, p(x | θ) and p(S | θ) are proportional, in accordance with the Sufficiency Principle.
Poisson case. Recall that the integer-valued random variable x is said to have a Poisson distribution of mean λ [denoted x ~ P(λ)] if
p(x | λ) = λ^x exp(−λ)/x!      (x = 0, 1, 2, …).
We shall consider the Poisson distribution in more detail later in the book. For the moment, all that matters is that it often serves as a model for the number of occurrences of a rare event, for example for the number of times the King’s Arms on the riverbank at York is flooded in a year. Then if x1, x2, …, xn have independent Poisson distributions with the same mean λ (so they could, e.g. represent the numbers of floods in several successive years), it is easily seen that
p(x | λ) = λ^T exp(−nλ)/(x1! x2! … xn!),
where T = Σ xi.
It follows from the Factorization Theorem that T is sufficient for λ given X. Moreover, we can verify the Sufficiency Principle as follows. If we had simply been given the value of T without being given the values of the xi separately, we could have noted that a sum of independent Poisson random variables has a Poisson distribution with mean the sum of the means (see question 7 on Chapter 1), so that T ~ P(nλ)
and hence
p(T | λ) = (nλ)^T exp(−nλ)/T! ∝ p(x | λ),
considered as functions of λ, in accordance with the sufficiency principle.
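Again this is easy to check numerically; in the sketch below (ours, with invented counts) the likelihood of λ computed from the full data and the likelihood computed from T alone coincide once each is normalized over a grid of values of λ.

```python
# Numerical check of the sufficiency principle in the Poisson case:
# the normalized likelihoods of lambda from the full data and from T agree.
import numpy as np
from scipy.stats import poisson

x = np.array([0, 2, 1, 3, 0, 1])                 # invented counts
n, T = len(x), x.sum()

lam = np.linspace(0.01, 5.0, 500)
lik_full = np.array([poisson.pmf(x, l).prod() for l in lam])   # p(x | lambda)
lik_T = poisson.pmf(T, n * lam)                                # p(T | lambda), since T ~ P(n lambda)

lik_full /= lik_full.sum()
lik_T /= lik_T.sum()
print(np.abs(lik_full - lik_T).max())            # essentially zero
```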
2.9.5 Order statistics and minimal sufficient statistics
It may be noted that it is easy to see that whenever x = (x1, x2, …, xn) consists of independently identically distributed observations whose distribution depends on a parameter θ, then the order statistic
x_(·) = (x_(1), x_(2), …, x_(n)),
which consists of the values of the xi arranged in increasing order, so that
x_(1) ≤ x_(2) ≤ … ≤ x_(n),
is sufficient for θ given X.
This helps to underline the fact that there is, in general, no such thing as a unique sufficient statistic. Indeed, if t is sufficient for θ given X, then so is (t, u) for any statistic u = u(x). If t is a function of all other sufficient statistics that can be constructed, so that no further reduction is possible, then t is said to be minimal sufficient. Even a minimal sufficient statistic is not unique, since any one-one function of such a statistic is itself minimal sufficient.
It is not obvious that a minimal sufficient statistic always exists, but in fact it does. Although the result is more important than the proof, we shall now prove this. We define a statistic which is a set, rather than a real number or a vector, by
u = u(x) = {x′ : p(x′ | θ) ∝ p(x | θ), the constant of proportionality not depending on θ}.
Then it follows from Corollary 2.2 to the Sufficiency Principle that u is sufficient. Further, if v = v(x) is any other sufficient statistic, then by the same principle whenever v(x′) = v(x) we have
p(x′ | θ) ∝ p(x | θ)
and hence u(x′) = u(x), so that u is a function of v. It follows that u is minimal sufficient. We can now conclude that the condition that
t(x) = t(x′)
if and only if
p(x | θ) ∝ p(x′ | θ)
is equivalent to the condition that t is minimal sufficient.
2.9.6 Examples on minimal sufficiency
Normal variance. In the case where the xi are independently N(μ, φ) where μ is known but φ is unknown, then S is not merely sufficient but minimal sufficient.
Poisson case. In the case where the xi are independently P(λ), then T is not merely sufficient but minimal sufficient.
Cauchy distribution. We say that x has a Cauchy distribution with location parameter θ and scale parameter 1, denoted x ~ C(θ, 1), if it has density
p(x | θ) = 1/[π{1 + (x − θ)²}]      (−∞ < x < ∞).
It is hard to find examples of real data which follow a Cauchy distribution, but the distribution often turns up in counter-examples in theoretical statistics (e.g. the mean of n variables with a C(θ, 1) distribution has itself a C(θ, 1) distribution and does not tend to normality as n tends to infinity, in apparent contradiction of the Central Limit Theorem; a short simulation illustrating this appears after this example). Suppose that x1, x2, …, xn are independently C(θ, 1). Then if p(x′ | θ) ∝ p(x | θ) for all θ we must have
Π_k {1 + (x_k − θ)²} = c Π_k {1 + (x′_k − θ)²}
for some constant c. By comparison of the coefficients of θ^(2n) the constant of proportionality must be 1, and by comparison of the zeroes of both sides considered as polynomials in θ, namely θ = x_k ± i and θ = x′_k ± i, respectively, we see that the x′_k must be a permutation of the xk and hence the order statistics x_(·) and x′_(·) of x and x′ are equal. It follows that the order statistic is a minimal sufficient statistic, and in particular there is no one-dimensional sufficient statistic. This sort of situation is unusual with the commoner statistical distributions, but you should be aware that it can arise, even if you find the above proof confusing.
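The remark above that the mean of n Cauchy observations is again Cauchy, and so does not settle down as n grows, is easily illustrated by simulation; the sketch below (ours) is the simulation referred to in the parenthesis above.

```python
# The sample mean of Cauchy observations does not concentrate at the
# location parameter as the sample size grows.
import numpy as np

rng = np.random.default_rng(0)
theta = 0.0                                      # location parameter
for n in (10, 1_000, 100_000):
    x = theta + rng.standard_cauchy(size=n)
    print(n, x.mean())                           # no sign of settling down near theta
```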
A useful reference for advanced workers in this area is Huzurbazar (1976).
2.10 Conjugate prior distributions
2.10.1 Definition and difficulties
When the normal variance was first mentioned, it was stated that it helps if the prior is of such a form that the posterior is of a ‘nice’ form, and this led to the suggestion that if a reasonable approximation to your prior beliefs could be managed by using (a multiple of) an inverse chi-squared distribution, it would be sensible to employ this distribution. It is this thought which leads to the notion of conjugate families. The usual definition adopted is as follows:
Let l be a likelihood function l(θ | x). A class Π of prior distributions is said to form a conjugate family if the posterior density
p(θ | x) ∝ p(θ) l(θ | x)
is in the class Π for all X whenever the prior density is in Π.
There is actually a difficulty with this definition, as was pointed out by Diaconis and Ylvisaker (1979 and 1985). If Π is a conjugate family and h(θ) is any fixed function, then the family of densities proportional to h(θ)p(θ) for p in Π is also a conjugate family. While this is a logical difficulty, we are in practice only interested in ‘natural’ families of distributions which are at least simply related to the standard families that are tabulated. In fact, there is a more precise definition available when we restrict ourselves to the exponential family (discussed in Section 2.11), and there are not many cases discussed in this book that are not covered by that definition. Nevertheless, the usual definition gives the idea well enough.
2.10.2 Examples
Normal mean. In the case of several normal observations of known variance with a normal prior for the mean (discussed in Section 2.3), where
p(x | θ) ∝ exp{−Σ(xi − θ)²/2φ},
we showed that if the prior is N(θ0, φ0) then the posterior is N(θ1, φ1) for suitable θ1 and φ1. Consequently, if Π is the class of all normal distributions, then the posterior is in Π for all X whenever the prior is in Π. Note, however, that it would not do to let Π be the class of all normal distributions with any mean but fixed variance (at least unless we regard the sample size as fixed once and for all); Π must in some sense be ‘large enough.’
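For reference, the update can be written as a couple of lines of code. The sketch below (the function name is ours) uses the standard precision-weighting formulae as derived in Section 2.3: the posterior precision is the sum of the prior precision and the data precision n/φ, and the posterior mean is the corresponding weighted average.

```python
# Normal mean, known variance: prior N(theta0, phi0), n observations with
# mean xbar and known variance phi, posterior N(theta1, phi1).
def normal_mean_update(theta0, phi0, xbar, n, phi):
    phi1 = 1.0 / (1.0 / phi0 + n / phi)                  # posterior variance
    theta1 = phi1 * (theta0 / phi0 + n * xbar / phi)     # posterior mean
    return theta1, phi1

print(normal_mean_update(theta0=0.0, phi0=100.0, xbar=4.2, n=12, phi=9.0))
```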
Normal variance. In the case of the normal variance, where
p(x | φ) ∝ φ^(−n/2) exp(−S/2φ),
we showed that if the prior is S0χ⁻²_ν then the posterior is (S0 + S)χ⁻²_(ν+n). Consequently, if Π is the class of distributions of constant multiples of inverse chi-squared random variables, then the posterior is in Π whenever the prior is. Again, it is necessary to take Π as a two-parameter rather than a one-parameter family.
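The corresponding update for the variance is even simpler; the sketch below (ours) just records the rule stated above, that ν becomes ν + n and S0 becomes S0 + S.

```python
# Normal variance, known mean: prior S0 * chi^-2 on nu d.f., posterior
# (S0 + S) * chi^-2 on (nu + n) d.f., where S is the sum of squares about
# the known mean from n observations.
def normal_variance_update(nu, S0, S, n):
    return nu + n, S0 + S

print(normal_variance_update(nu=0, S0=0.0, S=664.0, n=20))   # reference prior: (20, 664.0)
```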
Poisson distribution. Suppose that x = (x1, x2, …, xn) is a sample from the Poisson distribution of mean λ, that is, xi ~ P(λ). Then as we noted in the last section
p(x | λ) ∝ λ^T exp(−nλ),
where T = Σ xi. If λ has a prior distribution of the form
p(λ) ∝ λ^(ν/2−1) exp(−S0λ/2),
that is S0λ ~ χ²_ν, so that λ is a multiple of a chi-squared random variable, then the posterior is
p(λ | x) ∝ λ^(ν/2+T−1) exp{−(S0 + 2n)λ/2},   that is,   (S0 + 2n)λ ~ χ²_(ν+2T).
Consequently, if Π is the class of distributions of constant multiples of chi-squared random variables, then the posterior is in Π whenever the prior is. There are three points to be drawn to your attention. Firstly, this family is closely related to, but different from, the conjugate family in the previous example. Secondly, the conjugate family consists of a family of continuous distributions although the observations are discrete; the point is that this discrete distribution depends on a continuous parameter. Thirdly, the conjugate family in this case is usually referred to in terms of the gamma distribution, but the chi-squared distribution is preferred here in order to minimize the number of distributions you need to know about and because, when you need to use tables, you are likely to refer to tables of chi-squared in any case; the two descriptions are of course equivalent.
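In the chi-squared parametrization used above, the Poisson update again reduces to two additions; the sketch below (ours) records it, together with the equivalent statement in terms of the gamma distribution.

```python
# Poisson mean: prior S0 * lambda ~ chi-squared on nu d.f.; after observing
# n counts with total T, posterior (S0 + 2n) * lambda ~ chi-squared on
# (nu + 2T) d.f.  Equivalently, Gamma(shape nu/2, rate S0/2) goes to
# Gamma(shape nu/2 + T, rate S0/2 + n).
def poisson_update(nu, S0, T, n):
    return nu + 2 * T, S0 + 2 * n

print(poisson_update(nu=2, S0=1.0, T=7, n=6))
```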
Binomial distribution. Suppose that k has a binomial distribution of index n and parameter π. Then
p(k | π) ∝ π^k (1 − π)^(n−k).
We say that π has a beta distribution with parameters α and β, denoted π ~ Be(α, β), if its density is of the form
p(π) ∝ π^(α−1) (1 − π)^(β−1)      (0 ≤ π ≤ 1)
(the fact that α − 1 and β − 1 appear in the indices rather than α and β is for technical reasons). The beta distribution is described in more detail in Appendix A. If, then, π has a beta prior density, it is clear that it has a beta posterior density, so that the family of beta densities forms a conjugate family. It is a simple extension that this family is still conjugate if we have a sample of several observations from a binomial distribution rather than just one.
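The beta-binomial update is equally terse; the sketch below (ours) records it for a single binomial observation of k successes out of n trials.

```python
# Binomial parameter: prior Be(alpha, beta), observation k successes out of
# n trials, posterior Be(alpha + k, beta + n - k).
def binomial_update(alpha, beta, k, n):
    return alpha + k, beta + n - k

print(binomial_update(alpha=1, beta=1, k=7, n=10))   # uniform prior -> Be(8, 4)
```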