
Bayesian Statistics (4th ed)

by Peter M Lee


  8.1.2.3 Baseball statistics

  The batting average of a baseball player is defined as the number of hits Si divided by the number of times at bat; it is always a number between 0 and 1. We will suppose that each of r players has been n times at bat and that the batting average of the ith player Yi=Si/n is such that Si ~ B(n, πi), so that using the inverse root-sine transformation (see Section 3.2 on ‘Reference prior for the binomial likelihood’), we see that if

    $X_i = \sqrt{n}\,\sin^{-1}(2Y_i - 1),$

  then to a good approximation

    $X_i \sim \mathrm{N}(\theta_i, 1), \qquad\text{where } \theta_i = \sqrt{n}\,\sin^{-1}(2\pi_i - 1).$

  We might then suppose that

    $\theta_i \sim \mathrm{N}(\mu, \phi).$

  Finally, we suppose that φ is known and that the prior knowledge of μ is weak, so that over the range over which the likelihood is appreciable, the prior density of μ is constant (cf. Section 2.5 on ‘Locally uniform priors’).
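  To make the transformation concrete, here is a minimal Python sketch (not from the book) that applies the inverse root-sine transformation to some invented hit counts; the numbers are purely illustrative.

```python
import numpy as np

# Invented data: hits for a few players, each at bat n = 45 times.
n = 45
hits = np.array([18, 17, 16, 15, 14, 13, 12, 11, 10])  # hypothetical S_i values
Y = hits / n                                            # batting averages Y_i = S_i / n

# Inverse root-sine (variance-stabilising) transformation:
# X_i = sqrt(n) * arcsin(2 Y_i - 1) is approximately N(theta_i, 1),
# where theta_i = sqrt(n) * arcsin(2 pi_i - 1).
X = np.sqrt(n) * np.arcsin(2 * Y - 1)
print(np.round(X, 3))
```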

  This example is considered by Efron and Morris (1975 and 1977); we give further consideration to it in Section 8.3. These authors also consider an example arising from data on the incidence of toxoplasmosis (a disease of the blood that is endemic in much of Central America) among samples of various sizes from 36 cities in El Salvador.

  8.1.2.4 Poisson process with a change point

  Suppose that we have observations xi for i = 1, 2, …, n which represent the number of times a rare event has occurred in each of n equal time intervals and that we have reason to believe that the frequency of this event has changed abruptly from one level to another at some intermediate value of i. We might then be interested in deciding whether there really is evidence of such an abrupt change and, if so, then investigating when it took place.

  To model this situation, we suppose that xi ~ P(λ) for i = 1, 2, …, k while xi ~ P(μ) for i = k+1, …, n. We then take independent priors for the parameters k, λ and μ such that

  a. p(k) = 1/n for k = 1, 2, …, n, that is, k has a discrete uniform distribution on [1, n];

  b. λ = U/γ, where U has an exponential distribution of mean 1 (or equivalently a one-parameter gamma distribution with parameter 1, so that U ~ G(1));

  c. μ = V/δ, where V is independent of U and has the same distribution.

  These distributions depend on the two parameters γ and δ, so that γ and δ are hyperparameters.

  Finally, we suppose that γ and δ have independent prior distributions which are multiples of chi-squared, so that for suitable values of the parameters ξ, ζ, α and β we have $\gamma \sim \xi^{-1}\chi^2_\alpha$ and $\delta \sim \zeta^{-1}\chi^2_\beta$.
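  To see how the levels of this hierarchy fit together, the following Python fragment simulates data from it top-down; the particular values chosen for the hyper-hyperparameters, and the exact scaling of the chi-squared priors, are illustrative assumptions rather than the values used for the coal-mining data.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 112                        # number of equal time intervals

# Hyper-hyperparameters (illustrative values only).
alpha, beta = 2.0, 2.0         # chi-squared degrees of freedom
xi, zeta = 1.0, 1.0            # scale factors for the chi-squared priors

# Top level: gamma and delta are multiples of chi-squared variates.
gamma = rng.chisquare(alpha) / xi
delta = rng.chisquare(beta) / zeta

# Middle level: lambda = U / gamma and mu = V / delta with U, V ~ E(1) = G(1).
lam = rng.exponential(1.0) / gamma
mu = rng.exponential(1.0) / delta

# Change point: discrete uniform on {1, ..., n}.
k = int(rng.integers(1, n + 1))

# Bottom level: Poisson counts before and after the change point.
x = np.concatenate([rng.poisson(lam, size=k), rng.poisson(mu, size=n - k)])
print(k, round(lam, 3), round(mu, 3), x[:10])
```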

  This situation is a slightly simplified version of one described by Carlin et al. (1992), Tanner (1996, Sections 6.2.2 and 6.2.3) and Carlin and Louis (2008, Chapter 5, Exercises 8–10) as a model for the numbers of coal-mining disasters (defined as accidents resulting in the deaths of ten or more miners) in Great Britain for the years from 1851 to 1962 inclusive. We shall consider this example in detail in Section 9.4 on ‘The Gibbs sampler’.

  8.1.2.5 Risk of tumour in a group of rats

  In a study of tumours among laboratory rats of type ‘F344’, the probability of tumours in different groups of rats is believed to vary because of differences between rats and experimental conditions among the experiments. It may well be reasonable to suppose that the probabilities come from a beta distribution, but it is not clear a priori which prior beta distribution to take.

  In this case, the number of rats yi that develop tumours in the ith group, which is of size ni, is such that yi ~ B(ni, πi), while πi ~ Be(α, β). We then take some appropriate hyperprior distribution for the hyperparameters α and β; it has been suggested that a suitable noninformative hyperprior is

    $p(\alpha, \beta) \propto (\alpha + \beta)^{-5/2}.$

  This example is discussed in detail and the above hyperprior is derived in Gelman et al. (2004, Sections 5.1 and 5.3).
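  As a sketch of how such an analysis might proceed (not the book’s computation), the fragment below evaluates the log posterior of (α, β) on a crude grid, using the beta-binomial marginal likelihood obtained by integrating out the πi and the hyperprior suggested above; the tumour counts are invented stand-ins for the real data in Gelman et al. (2004).

```python
import numpy as np
from scipy.special import betaln

# Invented tumour counts y_i out of group sizes n_i (illustration only).
y = np.array([0, 1, 2, 2, 4, 5])
n = np.array([20, 19, 20, 25, 19, 24])

def log_posterior(a, b):
    """Log posterior of (alpha, beta) up to a constant:
    y_i ~ B(n_i, pi_i) with pi_i ~ Be(alpha, beta) integrated out,
    and hyperprior p(alpha, beta) proportional to (alpha + beta)^(-5/2)."""
    loglik = np.sum(betaln(a + y, b + n - y) - betaln(a, b))
    return loglik - 2.5 * np.log(a + b)

alphas = np.linspace(0.5, 6.0, 60)
betas = np.linspace(2.0, 40.0, 60)
grid = np.array([[log_posterior(a, b) for b in betas] for a in alphas])
i, j = np.unravel_index(grid.argmax(), grid.shape)
print(alphas[i], betas[j])   # crude joint posterior mode of (alpha, beta)
```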

  8.1.2.6 Vaccination against Hepatitis B

  In a study of the effect of vaccination for Hepatitis B in the Gambia, it was supposed that yij, the log anti-HB titre (the amount of surface antibody in blood samples) in the jth observation for the ith infant, could be modelled as follows:

    $y_{ij} \sim \mathrm{N}(\alpha_i + \beta_i t_{ij},\ \sigma^2), \qquad \alpha_i \sim \mathrm{N}(\alpha, \sigma_\alpha^2), \qquad \beta_i \sim \mathrm{N}(\beta, \sigma_\beta^2)$

  (cf. Gilks et al., 1993, or Spiegelhalter et al., 1996, Section 2.2), where tij is the (log) time at which the jth observation on the ith infant was made. Here, the parameters are the αi and βi together with σ, and the hyperparameters are α and β. The hyperprior distributions for α and β are independent normals, and we take a reference prior for σ. Further, we have hyper-hyperparameters σα and σβ for which we take reference priors, so that

    $p(\sigma_\alpha) \propto 1/\sigma_\alpha \quad\text{and}\quad p(\sigma_\beta) \propto 1/\sigma_\beta.$

  Actually, Gilks et al. take proper priors which behave similarly over reasonable values of σα and σβ.
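  The following is a minimal simulation sketch of a random-intercept, random-slope model of the kind outlined above; every numerical value here is invented for illustration and none comes from the Gambia study.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented hyperparameter values (illustration only, not the Gambia estimates).
alpha0, beta0 = 6.0, -1.0           # population mean intercept and slope
sigma_alpha, sigma_beta = 0.8, 0.3  # between-infant standard deviations
sigma = 0.5                         # within-infant (residual) standard deviation

n_infants, n_obs = 20, 3
t = rng.uniform(1.0, 2.0, size=(n_infants, n_obs))  # (log) observation times

# Infant-specific intercepts and slopes drawn from the population model.
alpha_i = rng.normal(alpha0, sigma_alpha, size=n_infants)
beta_i = rng.normal(beta0, sigma_beta, size=n_infants)

# Simulated log anti-HB titres y_ij.
y = alpha_i[:, None] + beta_i[:, None] * t + rng.normal(0.0, sigma, size=t.shape)
print(np.round(y[:3], 2))
```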

  8.1.3 Objectives of a hierarchical analysis

  The objectives of hierarchical analyses vary. We can see this by considering the way in which the examples described in Subsection 8.1.2, headed ‘Examples’, might be analysed.

  In the case of the hierarchical Poisson model, the intention is to estimate the density function of the parameters, or equivalently the hyperparameters on which it depends. Similarly, in the case of the example on rat tumours, the main interest lies in finding the joint posterior density of the hyperparameters α and β.

  In the cases of the test example and the baseball example, the interest lies in estimating the parameters θi as well as possible, while the hyperparameters μ and φ are of interest mainly as tools for use in estimating the θi.

  The main interest in the case of the Poisson process with a change point could quite well lie in determining whether there really is a change point, and, assuming that there is, finding out where it occurs as closely as possible.

  However, the models could be explored with other objectives. For example, in the intelligence test example we might be interested in the predictive distribution of a further observation, which represents the overall distribution of ‘IQ’ in the population under consideration. Similarly, in the case of the Poisson process with a change point, we might be interested in the extent of the change (presuming that there is one), and hence in the posterior distribution of a quantity such as λ/μ, that is, in the distribution of a function of the parameters.

  8.1.4 More on empirical Bayes methods

  The empirical Bayes method, of which a very short account was given in Section 7.8, is often employed in cases where we have a structural relationship as described at the start of this section. Suppose for definiteness that we have a straightforward two-stage model in which the density p(x|θ) of the observations depends on parameters θ which themselves are independent observations from a density p(θ|η) depending on a hyperparameter η, so that we have a posterior density

    $p(\theta \mid x, \eta) \propto p(\theta \mid \eta)\,p(x \mid \theta).$

  We then estimate η by the method of maximum likelihood (that is, by maximizing the marginal density $p(x \mid \eta) = \int p(x \mid \theta)\,p(\theta \mid \eta)\,\mathrm{d}\theta$ as a function of η) or by some other method of classical statistics. Note that this method makes no use of a prior distribution for the hyperparameter η.
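  As an illustration of the idea (not taken from the book), the sketch below applies it to the normal set-up used later in this chapter: with Xi | θi ~ N(θi, 1) and θi ~ N(μ, φ), the Xi are marginally N(μ, 1 + φ), so μ and φ can be estimated by maximum likelihood from this marginal distribution and then plugged into the usual posterior-mean formula.

```python
import numpy as np

def empirical_bayes_normal(x):
    """Empirical Bayes for X_i | theta_i ~ N(theta_i, 1), theta_i ~ N(mu, phi).
    Marginally X_i ~ N(mu, 1 + phi), so maximum likelihood gives
    mu_hat = mean(x) and phi_hat = mean((x - mu_hat)^2) - 1 (truncated at 0)."""
    mu_hat = x.mean()
    phi_hat = max(np.mean((x - mu_hat) ** 2) - 1.0, 0.0)
    shrink = phi_hat / (1.0 + phi_hat)          # weight given to the data
    theta_hat = mu_hat + shrink * (x - mu_hat)  # plug-in posterior means
    return mu_hat, phi_hat, theta_hat

x = np.array([2.1, -0.3, 1.4, 0.8, 3.0, -1.2, 0.5, 1.9])
print(empirical_bayes_normal(x))
```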

  8.2 The hierarchical normal model

  8.2.1 The model

  Suppose that

    $\theta = (\theta_1, \theta_2, \dots, \theta_r)$

  is a vector of fixed, unknown parameters and that

    $X = (X_1, X_2, \dots, X_r)$

  is a vector of independent observations such that

    $X_i \sim \mathrm{N}(\theta_i, \sigma^2).$

  Of course, the Xi could each be means of a number of observations.

  For the moment, we shall suppose that $\sigma^2$ is known, so, after a suitable normalization, we can suppose that $\sigma^2 = 1$ and hence that $X_i \sim \mathrm{N}(\theta_i, 1)$.

  It is useful to establish some notation for use later on. We shall consider a fixed origin

    $\mu = (\mu_1, \mu_2, \dots, \mu_r)$

  and we will write

    $S_1 = \sum_i (X_i - \mu_i)^2, \qquad \overline{X} = \frac{1}{r}\sum_i X_i, \qquad S = \sum_i (X_i - \overline{X})^2,$

  and

    $\mathbf{1} = (1, 1, \dots, 1)$

  for a vector of r elements all equal to unity.

  We suppose that on the basis of our knowledge of the Xi we form estimates $\hat\theta_i$ of the θi and write

    $\hat\theta = (\hat\theta_1, \hat\theta_2, \dots, \hat\theta_r).$

  In general, our estimates will not be exactly right and we will adopt a decision theoretic approach as described in Section 7.5 on ‘Bayesian decision theory’. In particular, we shall suppose that by estimating the parameters we suffer a loss

    $L(\theta, \hat\theta) = \frac{1}{r}\sum_i \big(\hat\theta_i - \theta_i\big)^2.$

  We recall that the risk function is defined as

    $R(\theta, \hat\theta) = \mathrm{E}\,L(\theta, \hat\theta),$

  the expectation being taken over repeated sampling of the Xi for fixed θ.

  For our problem the ‘obvious’ estimator (ignoring the hierarchical structure which will be introduced later) is

    $\hat\theta = X, \qquad\text{that is,}\quad \hat\theta_i = X_i \text{ for each } i,$

  and indeed since the log-likelihood is

    $\text{constant} - \tfrac{1}{2}\sum_i (X_i - \theta_i)^2,$

  which is maximized by taking θi = Xi for each i, it is the maximum likelihood estimator. It is clearly unbiased.

  It is easy to find the risk of this obvious estimator – it is

    $R(\theta, X) = \mathrm{E}\,\frac{1}{r}\sum_i (X_i - \theta_i)^2 = 1,$

  since each Xi − θi is N(0, 1).
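  A quick Monte Carlo check of this calculation (a sketch, assuming the average-squared-error loss above): since each Xi − θi is N(0, 1), the average of the (Xi − θi)² has expectation 1 whatever θ may be.

```python
import numpy as np

rng = np.random.default_rng(2)
r, n_sims = 10, 100_000
theta = rng.uniform(-5, 5, size=r)             # an arbitrary fixed parameter vector

X = theta + rng.standard_normal((n_sims, r))   # X_i ~ N(theta_i, 1)
loss = np.mean((X - theta) ** 2, axis=1)       # L(theta, X) in each replication
print(loss.mean())                             # close to 1, the risk of the MLE
```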

  8.2.2 The Bayesian analysis for known overall mean

  To express this situation in terms of a hierarchical model, we need to suppose that the parameters θi come from some population, and the simplest possibility is to suppose that

    $\theta_i \sim \mathrm{N}(\mu, \phi),$

  in which case it is convenient to take the fixed origin as $\mu\mathbf{1}$, that is, to take μi = μ for all i. With the additional structure assumed for the means, the problem has the structure of a situation variously described as a random effects model, Model II or a components of variance model (cf. Eisenhart et al., 1947, or Scheffé, 1959, Section 7.2, n.7). We are, however, primarily interested in the means θi and not in the variance components φ and $\sigma^2$, at least for the moment.

  It follows that the posterior distribution of θi given Xi is

    $\theta_i \mid X_i \sim \mathrm{N}(\theta_i^{*},\ 1 - \lambda),$

  where (writing λ = 1/(1 + φ))

    $\theta_i^{*} = X_i - \lambda(X_i - \mu) = (1-\lambda)X_i + \lambda\mu$

  and

    $1 - \lambda = \frac{\phi}{1+\phi}$

  (cf. Section 2.2 on ‘Normal Prior and Likelihood’).

  To minimize the expectation of the loss over the posterior distribution of θ, it is clearly necessary to use the Bayes estimator

    $\theta^{*} = (\theta_1^{*}, \theta_2^{*}, \dots, \theta_r^{*}),$

  where

    $\theta_i^{*} = \mathrm{E}(\theta_i \mid X_i) = X_i - \lambda(X_i - \mu),$

  the posterior mean of θi given Xi (see the subsection of Section 7.5 on ‘Point estimators resulting from quadratic loss’). Further, if we do this, then the value of this posterior expected loss is

    $\frac{1}{r}\sum_i \mathrm{var}(\theta_i \mid X_i) = 1 - \lambda.$

  It follows that the Bayes risk

    $\rho(\theta^{*}) = \mathrm{E}\,R(\theta, \theta^{*})$

  (the expectation being taken over values of θ) is

    $\rho(\theta^{*}) = 1 - \lambda.$

  We note that if instead we use the maximum likelihood estimator $\hat\theta = X$, then the posterior expected loss is increased by an amount

    $\frac{1}{r}\sum_i (X_i - \theta_i^{*})^2 = \lambda^2\,\frac{1}{r}\sum_i (X_i - \mu)^2,$

  which is always positive, so that the posterior expected loss becomes

    $1 - \lambda + \lambda^2\,\frac{1}{r}\sum_i (X_i - \mu)^2.$

  Further, since the unconditional distribution of Xi is evidently N(μ, 1 + φ), so that $\mathrm{E}(X_i - \mu)^2 = 1 + \phi = 1/\lambda$, its expectation over repeated sampling (the Bayes risk) is

    $\rho(X) = 1 - \lambda + \lambda^2/\lambda = 1.$

  This is, in fact, obvious since we can also write

    $\rho(\hat\theta) = \mathrm{E}\,L(\theta, \hat\theta),$

  where the expectation is over both θ and X, and since for the maximum likelihood estimator $\mathrm{E}(X_i - \theta_i)^2 = 1$ for all i, we have

    $\rho(X) = 1.$

  We can thus see that use of the Bayes estimator always diminishes the posterior loss, and that the amount ‘saved’ by its use averages out at λ over repeated sampling.
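  The following sketch (illustrative, not from the book) draws θ and X from this hierarchical model with known μ and φ and confirms the calculations above: the Bayes estimator attains Bayes risk close to 1 − λ, while the maximum likelihood estimator incurs risk close to 1.

```python
import numpy as np

rng = np.random.default_rng(3)
r, n_sims = 10, 100_000
mu, phi = 0.0, 3.0
lam = 1.0 / (1.0 + phi)                                   # lambda = 1 / (1 + phi)

theta = rng.normal(mu, np.sqrt(phi), size=(n_sims, r))    # theta_i ~ N(mu, phi)
X = theta + rng.standard_normal((n_sims, r))              # X_i ~ N(theta_i, 1)

theta_star = X - lam * (X - mu)                           # Bayes estimator
print(np.mean((theta_star - theta) ** 2), 1 - lam)        # both close to 1 - lambda
print(np.mean((X - theta) ** 2))                          # close to 1 for the MLE
```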

  8.2.3 The empirical Bayes approach

  Typically, however, you will not know φ (or equivalently λ). In such a situation, you can attempt to estimate it from the data. Since the Xi have an unconditional distribution which is N(μ, 1 + φ), it is clear that S1 is a sufficient statistic for φ, or equivalently for λ, which is such that $S_1/(1+\phi) \sim \chi^2_r$ or, equivalently, $\lambda S_1 \sim \chi^2_r$,

  so that if we define

    $\hat\lambda = \frac{r-2}{S_1},$

  then using the probability density of a chi-squared distribution (as given in Appendix A)

    $\mathrm{E}\,\hat\lambda = (r-2)\,\mathrm{E}\,\frac{1}{S_1} = (r-2)\,\frac{\lambda}{r-2} = \lambda,$

  so that $\hat\lambda$ is an unbiased estimator of λ.

  Now consider the effect of using the empirical Bayes estimator

    $\hat\theta_i^{EB} = X_i - \hat\lambda(X_i - \mu) = X_i - \frac{r-2}{S_1}(X_i - \mu),$

  which results from replacing λ by $\hat\lambda$ in the expression for $\theta_i^{*}$. If we use this, then the value of the posterior expected loss exceeds that incurred by the Bayes rule by an amount

    $\frac{1}{r}\sum_i \big(\hat\theta_i^{EB} - \theta_i^{*}\big)^2 = (\hat\lambda - \lambda)^2\,\frac{S_1}{r},$

  which is always positive, so that the posterior expected loss becomes

    $1 - \lambda + (\hat\lambda - \lambda)^2\,\frac{S_1}{r}.$

  Further, if we write $S_1 = W/\lambda$ (so that $W \sim \chi^2_r$ and $\hat\lambda = (r-2)\lambda/W$), then we see that the expectation of $(\hat\lambda - \lambda)^2 S_1/r$ over repeated sampling is

    $\frac{2\lambda}{r}.$

  It follows that the Bayes risk resulting from the use of the empirical Bayes estimator is

    $1 - \lambda + \frac{2\lambda}{r} = 1 - \frac{r-2}{r}\,\lambda,$

  as opposed to 1 − λ for the Bayes estimator or 1 for the maximum likelihood estimator.
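  Continuing the simulation sketch above (again purely illustrative), replacing λ by its unbiased estimate (r − 2)/S1 gives the empirical Bayes estimator, whose simulated Bayes risk sits between those of the Bayes rule and the maximum likelihood estimator, close to 1 − (r − 2)λ/r.

```python
import numpy as np

rng = np.random.default_rng(4)
r, n_sims = 10, 200_000
mu, phi = 0.0, 3.0
lam = 1.0 / (1.0 + phi)

theta = rng.normal(mu, np.sqrt(phi), size=(n_sims, r))
X = theta + rng.standard_normal((n_sims, r))

S1 = np.sum((X - mu) ** 2, axis=1, keepdims=True)    # S_1 = sum_i (X_i - mu)^2
lam_hat = (r - 2) / S1                               # unbiased estimator of lambda
theta_eb = X - lam_hat * (X - mu)                    # empirical Bayes estimator

print(np.mean((theta_eb - theta) ** 2))              # simulated Bayes risk of EB rule
print(1 - (r - 2) * lam / r, 1 - lam)                # theoretical EB and Bayes risks
```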

  8.3 The baseball example

  Efron and Morris’s example on baseball statistics was outlined in Section 8.1. As their primary data, they take the numbers of hits Si or equivalently the batting averages Yi=Si/n of r=18 major league players as they were recorded after n=45 times at bat in the 1970 season. These were, in fact, all the players who happened to have batted exactly 45 times on the day the data were tabulated. If Xi and θi are as in Section 8.1, so that approximately Xi ~ N(θi, 1),

  then we have a case of the hierarchical normal model. With the actual data, we have

  and so with

  the empirical Bayes estimator for the θi takes the form

  so giving estimates

  We can test how well an estimator performs by comparing it with the observed batting averages. We suppose that the ith player had Ti hits and was at bat mi times, so that his batting average for the remainder of the season was pi=Ti/mi. If we write

  we could consider a mean square error

  or more directly

  In either case, it turns out that the empirical Bayes estimator appears to be about three and a half times better than the ‘obvious’ (maximum likelihood) estimator, which ignores the hierarchical model and just estimates each θi by the corresponding Xi. The original data and the resulting estimates are tabulated in Table 8.1.

  Table 8.1 Data for the baseball example.

  So there is evidence that, in at least some practical cases, use of the hierarchical model and a corresponding empirical Bayes estimator is genuinely worth while.
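  The whole calculation can be put together in a few lines of Python; the hit counts below are invented stand-ins for the real figures in Table 8.1, and the shrinkage used is towards the overall mean of the transformed values, one natural form of empirical Bayes estimator for data of this kind.

```python
import numpy as np

n = 45                                              # times at bat per player
hits = np.array([18, 17, 16, 15, 14, 14, 13, 12,    # invented S_i values standing
                 11, 11, 10, 10, 10, 10, 10,  9,    # in for the data of Table 8.1
                  8,  7])
r = len(hits)

Y = hits / n
X = np.sqrt(n) * np.arcsin(2 * Y - 1)               # transformed scale, X_i ~ N(theta_i, 1)

Xbar = X.mean()
S = np.sum((X - Xbar) ** 2)
theta_hat = Xbar + (1 - (r - 3) / S) * (X - Xbar)   # shrink towards the overall mean

# Back-transform the estimates to the batting-average scale.
pi_hat = (np.sin(theta_hat / np.sqrt(n)) + 1) / 2
print(np.round(pi_hat, 3))
```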

  8.4 The Stein estimator

  This section is about an aspect of classical statistics which is related to the aforementioned discussion, but an understanding of it is by no means necessary for developing a knowledge of Bayesian statistics per se. The Bayesian analysis of the hierarchical normal model is continued in Section 8.5.

  One of the most puzzling and provocative results in classical statistics in the past half century was Stein’s startling discovery (see Stein, 1956, and James and Stein, 1961) that the ‘obvious’ estimator $\hat\theta = X$ of the multivariate normal mean is inadmissible if r ≥ 3. In fact, if c is any constant with

    $0 < c < 2(r-2),$

  then

    $\mu + \Big(1 - \frac{c}{S_1}\Big)(X - \mu)$

  dominates $\hat\theta = X$. The best value of c is r–2, leading to the James–Stein estimator

    $\hat\theta^{JS} = \mu + \Big(1 - \frac{r-2}{S_1}\Big)(X - \mu).$

  Because it may be considered as a weighted mean of μ and X, it is often called a shrinkage estimator which ‘shrinks’ the ordinary estimator X towards μ, despite the fact that if S1< r–2 it ‘shrinks’ past μ. Note, incidentally, that points which are initially far from μ are little affected by this shrinkage. Of course, this ties in with the results of Section 8.3 because the James–Stein estimator has turned out to be just the same as the empirical Bayes estimator $\hat\theta^{EB}$.

  In fact, it can be shown that the risk of $\hat\theta^{JS}$ is

    $R(\theta, \hat\theta^{JS}) = 1 - \frac{r-2}{r}\,\mathrm{E}\,\frac{r-2}{S_1}.$

  The expectation on the right-hand side depends on θ and μ, but as it is necessarily positive, the risk is always less than 1, so that $\hat\theta^{JS}$ dominates X, that is,

    $R(\theta, \hat\theta^{JS}) < R(\theta, X)$

  for all θ. It turns out that S1 has a distribution which depends solely on r and the quantity

    $\delta = \sum_i (\theta_i - \mu_i)^2,$

  a fact which can be proved by considering an orthogonal transformation of the variates Xi − μi to variates Wi such that

    $W_1 = \frac{1}{\sqrt{\delta}}\sum_i (\theta_i - \mu_i)(X_i - \mu_i), \qquad S_1 = \sum_i W_i^2.$

  Evidently if θ = μ then $S_1 \sim \chi^2_r$, and in general we say that S1 has a non-central chi-squared distribution on r degrees of freedom with non-centrality parameter δ. We denote this by

    $S_1 \sim \chi^{\prime\,2}_r(\delta).$

  It is fairly obvious that as δ → ∞ typical values of S1 will tend to infinity and we will get

    $R(\theta, \hat\theta^{JS}) \to 1,$

  whereas when θ = μ the variate S1 has a central $\chi^2$ distribution on r degrees of freedom (no parameters are estimated), so

    $\mathrm{E}\,\frac{r-2}{S_1} = 1$

  and hence

    $R(\mu, \hat\theta^{JS}) = 1 - \frac{r-2}{r} = \frac{2}{r},$

  which, particularly for large values of r, is notably less than the risk of the obvious estimator.
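  A small simulation (a sketch under the set-up above) illustrates both limits: when θ = μ the risk of the James–Stein estimator is close to 2/r, and it rises towards 1 as θ moves far from μ, while never exceeding the risk of the obvious estimator.

```python
import numpy as np

rng = np.random.default_rng(5)
r, n_sims = 10, 200_000
mu = np.zeros(r)                                    # take the fixed origin at 0

def js_risk(theta):
    X = theta + rng.standard_normal((n_sims, r))
    S1 = np.sum((X - mu) ** 2, axis=1, keepdims=True)
    js = mu + (1 - (r - 2) / S1) * (X - mu)         # James-Stein estimator
    return np.mean((js - theta) ** 2)

print(js_risk(np.zeros(r)), 2 / r)                  # theta = mu: risk close to 2/r
print(js_risk(np.full(r, 10.0)))                    # theta far from mu: risk close to 1
```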

  In the particular case where the arbitrary origin μ is taken at 0, the James–Stein estimator takes the form

    $\hat\theta^{JS} = \Big(1 - \frac{r-2}{\sum_i X_i^2}\Big)X,$

  but it is important to note that this is only a special case.

  Variants of the James–Stein estimator have been derived. For example, if c is any constant with

    $0 < c < 2(r-3),$

  then

    $\overline{X}\,\mathbf{1} + \Big(1 - \frac{c}{S}\Big)\big(X - \overline{X}\,\mathbf{1}\big)$

  dominates X, this time provided r ≥ 4 (loss of one dimension as a result of estimating a mean is something we are used to in statistics). The best value of c in this case is r–3, leading to the Efron–Morris estimator

    $\hat\theta^{EM} = \overline{X}\,\mathbf{1} + \Big(1 - \frac{r-3}{S}\Big)\big(X - \overline{X}\,\mathbf{1}\big).$

  In this case the ‘shrinkage’ is towards the overall mean $\overline{X}$.

  In the case of the Efron–Morris estimator, it can be shown (see Lehmann, 1983, Section 4.6) that the risk of $\hat\theta^{EM}$ is

    $R(\theta, \hat\theta^{EM}) = 1 - \frac{r-3}{r}\,\mathrm{E}\,\frac{r-3}{S}.$

  When the θi are all equal, S has a central $\chi^2$ distribution on r–1 degrees of freedom, so that

    $\mathrm{E}\,\frac{r-3}{S} = 1$

  and hence

    $R(\theta, \hat\theta^{EM}) = 1 - \frac{r-3}{r} = \frac{3}{r},$

  which, particularly for large values of r, is again notably less than the risk of the obvious estimator.
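  A matching check for the Efron–Morris estimator (again only a sketch): with all the θi equal, its simulated risk is close to 3/r.

```python
import numpy as np

rng = np.random.default_rng(6)
r, n_sims = 10, 200_000
theta = np.full(r, 4.2)                             # all theta_i equal

X = theta + rng.standard_normal((n_sims, r))
Xbar = X.mean(axis=1, keepdims=True)
S = np.sum((X - Xbar) ** 2, axis=1, keepdims=True)
em = Xbar + (1 - (r - 3) / S) * (X - Xbar)          # Efron-Morris estimator

print(np.mean((em - theta) ** 2), 3 / r)            # simulated risk vs 3/r
```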

  When we consider using such estimates in practice we encounter the ‘speed of light’ rhetorical question,

  Do you mean that if I want to estimate tea consumption in Taiwan, I will do better to estimate simultaneously the speed of light and the weight of hogs in Montana?

  The question then arises as to why this happens. Stein’s own explanation was that the sample distance squared of X from μ, that is $\sum_i (X_i - \mu_i)^2$, overestimates the squared distance of θ from μ and hence that the estimator could be improved by bringing it nearer to μ (whatever μ is). Following an idea due to Lawrence Brown, the effect was illustrated as shown in Figure 8.1 in a paper by Berger (1980, Figure 2, p. 736).

  Figure 8.1 Shrinkage estimators.

  The four points shown represent a spherical distribution centred at θ.

  Consider the effect of shrinking these points towards μ as shown. The two points lying on the line through μ and θ move, on average, slightly further away from θ, but the other two points move slightly closer (while distant points hardly move at all). In three dimensions, there are a further two points (not on the line between μ and θ) that are shrunk closer to θ.

  Another explanation that has been offered is that $\hat\theta^{JS}$ can be viewed as a ‘pre-test’ estimator: if one performs a preliminary test of the hypothesis that θ = μ and then uses μ or X as the estimator depending on the outcome of the test, then the resulting estimator is a weighted average of μ and of X, of which $\hat\theta^{JS}$ is a smoothed version, although why this particular smoothing is to be used is not obvious from this chain of reasoning (cf. Lehmann, 1983, Section 4.5).

  8.4.1 Evaluation of the risk of the James–Stein estimator

  We can prove that the James–Stein estimator has the risk quoted earlier, namely

    $R(\theta, \hat\theta^{JS}) = 1 - \frac{r-2}{r}\,\mathrm{E}\,\frac{r-2}{S_1}.$
  [An alternative approach can be found in Lehmann (1983, Sections 4.5 and 4.6).] We proceed by writing

  where the expectations are over repeated sampling for fixed . The function g depends on alone by spherical symmetry about . Similarly, the function h depends on alone since . We note that because the unconditional distribution of S1 is , we have

 
