8.1.2.3 Baseball statistics
The batting average of a baseball player is defined as the number of hits Si divided by the number of times at bat; it is always a number between 0 and 1. We will suppose that each of r players has been at bat n times and that the batting average of the ith player is Yi=Si/n, where Si has a binomial distribution, so that, using the inverse root-sine transformation (see Section 3.2 on ‘Reference prior for the binomial likelihood’), we see that if
then to a good approximation
We might then suppose that
Finally, we suppose that is known and that the prior knowledge of μ is weak, so that over the range over which the likelihood is appreciable, the prior density of μ is constant (cf. Section 2.5 on ‘Locally uniform priors’).
This example is considered by Efron and Morris (1975 and 1977); we give further consideration to it in Section 8.3. These authors also consider an example arising from data on the incidence of toxoplasmosis (a disease of the blood that is endemic in much of Central America) among samples of various sizes from 36 cities in El Salvador.
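To see the effect of such a variance-stabilising transformation, the following minimal Python sketch simulates hypothetical batting records and applies X = arcsin(√Y); the particular scaling, the seed and all variable names are illustrative assumptions rather than the text’s own formula, which is not reproduced here. For Y = S/n with S binomial, arcsin(√Y) is approximately normal with variance 1/(4n) whatever the underlying hitting probability.

import numpy as np

rng = np.random.default_rng(1)

n = 45                              # times at bat for each player
pi = rng.uniform(0.2, 0.35, 10000)  # many hypothetical hitting probabilities
S = rng.binomial(n, pi)             # simulated numbers of hits
Y = S / n                           # batting averages

X = np.arcsin(np.sqrt(Y))           # transformed averages, roughly normal
theta = np.arcsin(np.sqrt(pi))      # transformed true probabilities

print("theoretical s.d. 1/(2 sqrt(n)):", 1 / (2 * np.sqrt(n)))
print("empirical s.d. of X - theta   :", np.std(X - theta))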
8.1.2.4 Poisson process with a change point
Suppose that we have observations xi (i = 1, …, n) which represent the number of times a rare event has occurred in each of n equal time intervals, and that we have reason to believe that the frequency of this event has changed abruptly from one level to another at some intermediate value of i. We might then be interested in deciding whether there really is evidence of such an abrupt change and, if so, in investigating when it took place.
To model this situation, we suppose that the xi have a Poisson distribution with one mean for i up to some change point k and with a different mean for i beyond k. We then take independent priors for the parameters such that
a. k has a discrete uniform distribution on [1, n];
b. where U has an exponential distribution of mean 1 (or equivalently a one-parameter gamma distribution with parameter 1);
c. where V is independent of U and has the same distribution.
These distributions depend on the two parameters and , so that are hyperparameters.
Finally, we suppose that and have independent prior distributions which are multiples of chi-squared, so that for suitable values of the parameters , ζ, α and β we have and .
This situation is a slightly simplified version of one described by Carlin et al. (1992), Tanner (1996, Sections 6.2.2 and 6.2.3) and Carlin and Louis (2008, Chapter 5, Exercises 8–10) as a model for the numbers of coal-mining disasters (defined as accidents resulting in the deaths of ten or more miners) in Great Britain for the years from 1851 to 1962 inclusive. We shall consider this example in detail in Section 9.4 on ‘The Gibbs sampler’.
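Since the hierarchical prior above is only partly reproduced here, the following minimal Python sketch illustrates the structure of the change-point model with a deliberately simplified prior: fixed conjugate Gamma(a, b) priors on the two Poisson means in place of the exponential-with-hyperprior construction, and all symbols (x, k, lam1, lam2, a, b) are names assumed for the illustration. It runs a short Gibbs sampler on simulated counts and reports the posterior mode of the change point.

import numpy as np

rng = np.random.default_rng(0)

# Simulated counts: rate 3.0 for the first 40 intervals, rate 1.0 afterwards.
n, true_k = 112, 40
x = np.concatenate([rng.poisson(3.0, true_k), rng.poisson(1.0, n - true_k)])

a, b = 0.5, 0.1                   # assumed Gamma(a, b) prior for each rate
iterations, burn = 5000, 1000
k = n // 2                        # initial guess at the change point
k_samples = []

cum = np.concatenate([[0.0], np.cumsum(x)])   # cum[j] = sum of x[:j]
total = cum[-1]
ks = np.arange(1, n + 1)

for it in range(iterations):
    # Conditional draws of the two Poisson means given the change point k.
    lam1 = rng.gamma(a + cum[k], 1.0 / (b + k))
    lam2 = rng.gamma(a + total - cum[k], 1.0 / (b + n - k))

    # Conditional draw of the change point given the two means.
    logp = (cum[ks] * np.log(lam1) - ks * lam1
            + (total - cum[ks]) * np.log(lam2) - (n - ks) * lam2)
    p = np.exp(logp - logp.max())
    k = rng.choice(ks, p=p / p.sum())

    if it >= burn:
        k_samples.append(k)

print("posterior mode of the change point:", np.bincount(k_samples).argmax())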
8.1.2.5 Risk of tumour in a group of rats
In a study of tumours among laboratory rats of type ‘F344’, the probability of tumours in different groups of rats is believed to vary because of differences between rats and experimental conditions among the experiments. It may well be reasonable to suppose that the probabilities come from a beta distribution, but it is not clear a priori which prior beta distribution to take.
In this case, the number of rats yi that develop tumours in the ith group, which is of size ni, is such that yi has a binomial distribution of index ni and parameter πi, while the πi themselves have a common beta distribution with parameters α and β. We then take some appropriate hyperprior distribution for the hyperparameters α and β; it has been suggested that a suitable noninformative hyperprior is $p(\alpha, \beta) \propto (\alpha + \beta)^{-5/2}$.
This example is discussed in detail and the above hyperprior is derived in Gelman et al. (2004, Sections 5.1 and 5.3).
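As a small illustration of what such an analysis involves, the Python sketch below evaluates the log of the marginal posterior of (α, β) over a crude grid, using the hyperprior proportional to (α + β) to the power −5/2 mentioned above; the tumour counts and group sizes are placeholders rather than the study’s actual data, and the grid limits are arbitrary choices.

import numpy as np
from scipy.special import betaln

# Placeholder data: rats with tumours and group sizes (not the real data).
y = np.array([0, 1, 2, 2, 5, 4])
m = np.array([20, 19, 20, 25, 24, 23])

def log_posterior(alpha, beta):
    """Log marginal posterior of (alpha, beta), up to an additive constant."""
    loglik = np.sum(betaln(alpha + y, beta + m - y) - betaln(alpha, beta))
    return loglik - 2.5 * np.log(alpha + beta)   # hyperprior (alpha+beta)^(-5/2)

# Crude grid search for the mode (a finer grid or MCMC would be used in practice).
alphas = np.linspace(0.5, 10.0, 96)
betas = np.linspace(1.0, 60.0, 120)
grid = np.array([[log_posterior(a, b) for b in betas] for a in alphas])
i, j = np.unravel_index(np.argmax(grid), grid.shape)
print("approximate posterior mode: alpha =", alphas[i], ", beta =", betas[j])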
8.1.2.6 Vaccination against Hepatitis B
In a study of the effect of vaccination for Hepatitis B in the Gambia, it was supposed that yij, the log anti-HB titre (the amount of surface-antibody in blood samples) in the jth observation for the ith infant, could be modelled as follows:
(cf. Gilks et al., 1993, or Spiegelhalter et al., 1996, Section 2.2). Here, the parameters are and the hyperparameters are . The hyperprior distributions for α and β are independent normals, and we take a reference prior for . Further, we have hyper-hyperparameters for which we take reference priors, so that
Actually, Gilks et al. take proper priors which over reasonable values of and behave similarly, namely,
8.1.3 Objectives of a hierarchical analysis
The objectives of hierarchical analyses vary. We can see this by considering the way in which the examples described in Subsection 8.1.2, headed ‘Examples’, might be analysed.
In the case of the hierarchical Poisson model, the intention is to estimate the density function or equivalently the hyperparameters . Similarly, in the case of the example on rat tumours, the main interest lies in finding the joint posterior density of the hyperparameters α and β.
In the cases of the test example and the baseball example, the interest lies in estimating the parameters as well as possible, while the hyperparameters μ and are of interest mainly as tools for use in estimating .
The main interest in the case of the Poisson process with a change point could quite well lie in determining whether there really is a change point, and, assuming that there is, finding out where it occurs as closely as possible.
However, the models could be explored with other objectives. For example, in the intelligence test example we might be interested in the predictive distribution , which represents the overall distribution of ‘IQ’ in the population under consideration. Similarly, in the case of the Poisson process with a change point, we might be interested in the extent of the change (presuming that there is one), and hence in , that is, in the distribution of a function of the parameters.
8.1.4 More on empirical Bayes methods
The empirical Bayes method, of which a very short account was given in Section 7.8, is often employed in cases where we have a structural relationship as described at the start of this section. Suppose for definiteness that we have a straightforward two-stage model in which the density of the observations depends on parameters which themselves are independent observations from a density , so that we have a posterior density
We then estimate by the method of maximum likelihood or by some other method of classical statistics. Note that this method makes no use of a prior distribution for the hyperparameter .
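As a concrete illustration of this plug-in idea, here is a minimal Python sketch for the simplest normal-normal two-stage model; the particular densities, the zero prior mean and the symbol phi are assumptions made for the example rather than the general treatment of the text. The hyperparameter is estimated by maximum likelihood from the marginal distribution of the observations and then treated as if it were known.

import numpy as np

rng = np.random.default_rng(3)

phi_true = 2.0
theta = rng.normal(0.0, np.sqrt(phi_true), 50)   # parameters drawn from the prior
x = rng.normal(theta, 1.0)                       # one observation per parameter

# Maximum likelihood estimate of phi from the marginal X_i ~ N(0, 1 + phi).
phi_hat = max(np.mean(x ** 2) - 1.0, 0.0)

# Plug the estimate in as if it were the known prior variance.
theta_eb = (phi_hat / (1.0 + phi_hat)) * x

print("phi_hat:", round(float(phi_hat), 3))
print("mean squared error, empirical Bayes:", np.mean((theta_eb - theta) ** 2))
print("mean squared error, MLE            :", np.mean((x - theta) ** 2))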
8.2 The hierarchical normal model
8.2.1 The model
Suppose that
is a vector of fixed, unknown parameters and that
is a vector of independent observations such that
Of course, the Xi could each be means of a number of observations.
For the moment, we shall suppose that the variance of the Xi is known, so, after a suitable normalization, we can suppose that it is equal to unity.
It is useful to establish some notation for use later on. We shall consider a fixed origin
and we will write
and
for a vector of r elements all equal to unity.
We suppose that on the basis of our knowledge of the Xi we form estimates of the parameters and write
In general, our estimates will not be exactly right and we will adopt a decision theoretic approach as described in Section 7.5 on ‘Bayesian decision theory’. In particular, we shall suppose that by estimating the parameters we suffer a loss
We recall that the risk function is defined as
For our problem the ‘obvious’ estimator (ignoring the hierarchical structure which will be introduced later) is
and indeed since the log-likelihood is
it is the maximum likelihood estimator. It is clearly unbiased.
It is easy to find the risk of this obvious estimator – it is
8.2.2 The Bayesian analysis for known overall mean
To express this situation in terms of a hierarchical model, we need to suppose that the parameters come from some population, and the simplest possibility is to suppose that
in which case it is convenient to take . With the additional structure assumed for the means, the problem has the structure of a situation variously described as a random effects model, Model II or a components of variance model (cf. Eisenhart et al., 1947, or Scheffé, 1959, Section 7.2, n. 7). We are, however, primarily interested in the means and not in the variance components, at least for the moment.
It follows that the posterior distribution of given is
where (writing )
and
(cf. Section 2.2 on ‘Normal prior and likelihood’).
To minimize the expectation of the loss over the posterior distribution of , it is clearly necessary to use the Bayes estimator
where
the posterior mean of given (see the subsection of Section 7.5 on ‘Point estimators resulting from quadratic loss’). Further, if we do this, then the value of this posterior expected loss is
It follows that the Bayes risk
(the expectation being taken over values of ) is
We note that if instead we use the maximum likelihood estimator , then the posterior expected loss is increased by an amount
which is always positive, so that
Further, since the unconditional distribution of Xi is evidently , so that , its expectation over repeated sampling (the Bayes risk) is
This is, in fact, obvious since we can also write
where the expectation is over , and since for the maximum likelihood estimator =1 for all , we have
We can thus see that use of the Bayes estimator always diminishes the posterior loss, and that the amount ‘saved’ by its use averages out at λ over repeated sampling.
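The comparison just made can be seen numerically in the following minimal Python sketch, which assumes that Xi given θi is N(θi, 1), that θi is N(μi, φ) and that λ = 1/(1 + φ); these conventions are assumptions made for the sketch. The Bayes estimator then moves each Xi towards the known μi, retaining a fraction 1 − λ of the deviation Xi − μi.

import numpy as np

rng = np.random.default_rng(4)

r, phi = 20, 3.0
lam = 1.0 / (1.0 + phi)                 # assumed definition of lambda
mu = np.zeros(r)                        # known prior means
theta = rng.normal(mu, np.sqrt(phi))    # parameters drawn from the prior
x = rng.normal(theta, 1.0)              # observations

theta_bayes = mu + (1.0 - lam) * (x - mu)   # posterior means
theta_mle = x                               # the 'obvious' estimator

print("mean squared error, Bayes:", np.mean((theta_bayes - theta) ** 2))
print("mean squared error, MLE  :", np.mean((theta_mle - theta) ** 2))

Averaged over many such replications, the Bayes errors settle near 1 − λ per component and the maximum likelihood errors near 1, which is the saving of λ referred to above.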
8.2.3 The empirical Bayes approach
Typically, however, you will not know (or equivalently λ). In such a situation, you can attempt to estimate it from the data. Since the Xi have an unconditional distribution which is , it is clear that S1 is a sufficient statistic for , or equivalently for λ, which is such that or
so that if we define
then using the probability density of a chi-squared distribution (as given in Appendix A)
so that is an unbiased estimator of λ.
Now consider the effect of using the empirical Bayes estimator
which results from replacing λ by in the expression for . If we use this, then the value of the posterior expected loss exceeds that incurred by the Bayes rule by an amount
which is always positive, so that
Further, if we write (so that and ), then we see that the expectation of over repeated sampling is
It follows that the Bayes risk resulting from the use of the empirical Bayes estimator is
as opposed to for the Bayes estimator or 1 for the maximum likelihood estimator.
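The empirical Bayes step is illustrated below in the same assumed model and notation as the previous sketch: λ is replaced by the unbiased estimate (r − 2)/S1, where S1 = Σ(Xi − μi)².

import numpy as np

rng = np.random.default_rng(5)

r, phi = 20, 3.0
lam = 1.0 / (1.0 + phi)
mu = np.zeros(r)
theta = rng.normal(mu, np.sqrt(phi))
x = rng.normal(theta, 1.0)

S1 = np.sum((x - mu) ** 2)
lam_hat = (r - 2) / S1                        # unbiased estimator of lambda
theta_eb = mu + (1.0 - lam_hat) * (x - mu)    # empirical Bayes estimates

print("lambda =", round(lam, 3), " lambda_hat =", round(float(lam_hat), 3))
print("mean squared error, empirical Bayes:", np.mean((theta_eb - theta) ** 2))
print("mean squared error, MLE            :", np.mean((x - theta) ** 2))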
8.3 The baseball example
Efron and Morris’s example on baseball statistics was outlined in Section 8.1. As their primary data, they take the number of hits Si or equivalently the batting averages Yi=Si/n of r=18 major league players as they were recorded after n=45 times at bat in the 1970 season. These were, in fact, all the players who happened to have batted exactly 45 times on the day the data were tabulated. If Xi and are as in Section 8.1, so that approximately
then we have a case of the hierarchical normal model. With the actual data, we have
and so with
the empirical Bayes estimator for the takes the form
so giving estimates
We can test how well an estimator performs by comparing it with the observed batting averages. We suppose that the ith player had Ti hits and was at bat mi times, so that his batting average for the remainder of the season was pi=Ti/mi. If we write
we could consider a mean square error
or more directly
In either case, it turns out that the empirical Bayes estimator appears to be about three and a half times better than the ‘obvious’ (maximum likelihood) estimator which ignores the hierarchical model and just estimates each by the corresponding Xi. The original data and the resulting estimates are tabulated in Table 8.1.
Table 8.1 Data for the baseball example.
So there is evidence that in at least some practical case, use of the hierarchical model and a corresponding empirical Bayes estimator is genuinely worth while.
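The kind of comparison made in this section can be mimicked in a few lines of Python. The sketch below uses simulated batting records rather than the actual figures of Table 8.1, scales the arcsine transform so that the transformed values have roughly unit sampling variance, and shrinks towards the overall mean; all of these choices are assumptions made for illustration, and the details of the calculation reported above may differ.

import numpy as np

rng = np.random.default_rng(6)

r, n, m = 18, 45, 300                         # players, early at-bats, later at-bats
pi = rng.uniform(0.2, 0.35, r)                # hypothetical true hitting probabilities
Y = rng.binomial(n, pi) / n                   # early-season averages
p = rng.binomial(m, pi) / m                   # remainder-of-season averages

X = 2 * np.sqrt(n) * np.arcsin(np.sqrt(Y))    # transformed data, variance roughly 1
xbar = X.mean()
S = np.sum((X - xbar) ** 2)

theta_hat = xbar + (1.0 - (r - 3) / S) * (X - xbar)   # shrinkage towards the mean
Y_shrunk = np.sin(theta_hat / (2 * np.sqrt(n))) ** 2  # back on the probability scale

target = 2 * np.sqrt(n) * np.arcsin(np.sqrt(p))       # later performance, same scale
print("sum of squared errors, shrinkage:", np.sum((theta_hat - target) ** 2))
print("sum of squared errors, MLE      :", np.sum((X - target) ** 2))
print("raw-scale comparison            :",
      np.sum((Y_shrunk - p) ** 2), "vs", np.sum((Y - p) ** 2))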
8.4 The Stein estimator
This section is about an aspect of classical statistics which is related to the aforementioned discussion, but an understanding of it is by no means necessary for developing a knowledge of Bayesian statistics per se. The Bayesian analysis of the hierarchical normal model is continued in Section 8.5.
One of the most puzzling and provocative results in classical statistics in the past half century was Stein’s startling discovery (see Stein, 1956, and James and Stein, 1961) that the ‘obvious’ estimator of the multivariate normal mean is inadmissible if r ≥ 3. In fact, if c is any constant with
then
dominates . The best value of c is r–2, leading to the James–Stein estimator
Because it may be considered as a weighted mean of and , it is often called a shrinkage estimator which ‘shrinks’ the ordinary estimator towards , despite the fact that if S1< r–2 it ‘shrinks’ past . Note, incidentally, that points which are initially far from are little affected by this shrinkage. Of course, this ties in with the results of Section 8.3 because the James–Stein estimator has turned out to be just the same as the empirical Bayes estimator .
In fact, it can be shown that the risk of is
The expectation on the right-hand side depends on and , but as it must be non-negative, so that dominates , that is,
for all . It turns out that S1 has a distribution which depends solely on r and the quantity
a fact which can be proved by considering an orthogonal transformation of the variates to variates Wi such that
Evidently if then , and in general we say that S1 has a non-central chi-squared distribution on r degrees of freedom with non-centrality parameter . We denote this by
It is fairly obvious that as typical values of will tend to infinity and we will get
whereas when θ = μ the variate S1 has a central chi-squared distribution on r degrees of freedom (no parameters are estimated), so
and hence
which, particularly for large values of r, is notably less than the risk of the obvious estimator.
In the particular case where the arbitrary origin is taken at 0 the James–Stein estimator takes the form
but it is important to note that this is only a special case.
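To make the dominance result concrete, the following minimal Python sketch estimates the risk of the James–Stein estimator by simulation for a few distances of θ from the origin μ. Unit sampling variances are assumed, the loss is taken as the unnormalised sum of squared errors (the text’s normalisation may differ by a factor of r), and the function and variable names are introduced here for illustration only.

import numpy as np

rng = np.random.default_rng(7)

def james_stein(x, mu):
    """Shrink x towards the fixed origin mu by the factor 1 - (r - 2)/S1."""
    r = len(x)
    S1 = np.sum((x - mu) ** 2)
    return mu + (1.0 - (r - 2) / S1) * (x - mu)

r, reps = 10, 20000
mu = np.zeros(r)
for distance in [0.0, 2.0, 5.0]:                  # |theta - mu| in each case
    theta = np.full(r, distance / np.sqrt(r))
    x = rng.normal(theta, 1.0, size=(reps, r))
    estimates = np.array([james_stein(row, mu) for row in x])
    risk = np.mean(np.sum((estimates - theta) ** 2, axis=1))
    print(f"|theta - mu| = {distance}: estimated J-S risk {risk:.2f}  (MLE risk {r})")

At θ = μ the estimated risk should come out close to 2 on this loss scale, rising towards r (the risk of the obvious estimator) as θ moves far from the origin, in line with the limits discussed above.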
Variants of the James–Stein estimator have been derived. For example, if c is any constant with
then
dominates , this time provided r ≥ 4 (loss of one dimension as a result of estimating a mean is something we are used to in statistics). The best value of c in this case is r–3, leading to the Efron–Morris estimator
In this case the ‘shrinkage’ is towards the overall mean.
In the case of the Efron–Morris estimator, it can be shown (see Lehmann, 1983, Section 4.6) that the risk of is
Since S has a central chi-squared distribution on r–1 degrees of freedom,
and hence
which, particularly for large values of r, is again notably less than the risk of the obvious estimator.
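A corresponding minimal Python sketch for the Efron–Morris estimator, under the same unit-variance and unnormalised-loss assumptions, shrinks towards the sample mean with the factor 1 − (r − 3)/S, where S = Σ(Xi − X̄)².

import numpy as np

rng = np.random.default_rng(8)

def efron_morris(x):
    """Shrink the components of x towards their sample mean by 1 - (r - 3)/S."""
    r = len(x)
    xbar = x.mean()
    S = np.sum((x - xbar) ** 2)
    return xbar + (1.0 - (r - 3) / S) * (x - xbar)

r, reps = 12, 20000
theta = rng.normal(5.0, 1.0, r)          # true means, nowhere near the origin 0
x = rng.normal(theta, 1.0, size=(reps, r))
estimates = np.array([efron_morris(row) for row in x])
print("estimated E-M risk:", np.mean(np.sum((estimates - theta) ** 2, axis=1)))
print("MLE risk          :", r)

Because the shrinkage is towards the data’s own mean, the improvement does not depend on the θi lying close to any particular origin, only on how spread out they are.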
When we consider using such estimates in practice we encounter the ‘speed of light’ rhetorical question,
Do you mean that if I want to estimate tea consumption in Taiwan, I will do better to estimate simultaneously the speed of light and the weight of hogs in Montana?
The question then arises as to why this happens. Stein’s own explanation was that the sample distance squared of from , that is , overestimates the squared distance of from and hence that the estimator could be improved by bringing it nearer (whatever is). Following an idea due to Lawrence Brown, the effect was illustrated as shown in Figure 8.1 in a paper by Berger (1980, Figure 2, p. 736).
Figure 8.1 Shrinkage estimators.
The four points , , represent a spherical distribution centred at .
Consider the effect of shrinking these points as shown. The points and move, on average, slightly further away from , but the points and move slightly closer (while distant points hardly move at all). In three dimensions, there are a further two points (not on the line between and ) that are shrunk closer to .
Another explanation that has been offered is that can be viewed as a ‘pre-test’ estimator: if one performs a preliminary test of the hypothesis that and then uses or depending on the outcome of the test, then the resulting estimator is a weighted average of and of which is a smoothed version, although why this particular smoothing is to be used is not obvious from this chain of reasoning (cf. Lehmann, 1983, Section 4.5).
8.4.1 Evaluation of the risk of the James–Stein estimator
We can prove that the James–Stein estimator has the risk quoted earlier, namely
[An alternative approach can be found in Lehmann (1983, Sections 4.5 and 4.6).] We proceed by writing
where the expectations are over repeated sampling for fixed . The function g depends on alone by spherical symmetry about . Similarly, the function h depends on alone since . We note that because the unconditional distribution of S1 is , we have