Care should be taken when using reference priors as a representation of prior ignorance. We have already seen in Section 2.4 on ‘Dominant likelihoods’ that the improper densities which often arise as reference priors should be regarded as approximations, reflecting the fact that our prior beliefs about an unknown parameter (or some function of it) are more or less uniform over a wide range. A different point to be aware of is that some ways of arriving at such priors, such as Jeffreys’ rule, depend on the experiment that is to be performed, and so on intentions. (The same objection applies, of course, to arguments based on data translated likelihoods.) Consequently, an analysis using such a prior is not in accordance with the likelihood principle.
To make this clearer, consider a sequence of independent trials, each of which results in success with probability $\pi$ or failure with probability $1-\pi$ (i.e. a sequence of Bernoulli trials). If we look at the number of successes x in a fixed number n of trials, so that
$$p(x\mid\pi)=\binom{n}{x}\pi^x(1-\pi)^{n-x},$$
then, as was shown in Section 3.3, Jeffreys' rule results in an arc-sine distribution
$$p(\pi)\propto\pi^{-1/2}(1-\pi)^{-1/2},$$
that is, $\pi\sim\mathrm{Be}(\tfrac12,\tfrac12)$, for the prior.
Now suppose that we decide to observe the number of failures y before the mth success. Evidently, there will be m successes and y failures, and the probability of any particular sequence with that number of successes and failures is $\pi^m(1-\pi)^y$. The number of such sequences is $\binom{m+y-1}{y}$, because the y failures and m−1 of the successes can occur in any order, but the sequence must conclude with a success. It follows that
$$p(y\mid\pi)=\binom{m+y-1}{y}\pi^m(1-\pi)^y,$$
that is, that y has a negative binomial distribution $\mathrm{NB}(m,\pi)$ (see Appendix A). For such a distribution
$$L(\pi\mid y)=\log p(y\mid\pi)=m\log\pi+y\log(1-\pi)+\text{constant},$$
so that
$$\frac{\partial^2 L}{\partial\pi^2}=-\frac{m}{\pi^2}-\frac{y}{(1-\pi)^2}.$$
Because $\mathrm{E}(y\mid\pi)=m(1-\pi)/\pi$, it follows that
$$I(\pi\mid y)=\frac{m}{\pi^2}+\frac{m(1-\pi)/\pi}{(1-\pi)^2}=\frac{m}{\pi^2(1-\pi)},$$
so that Jeffreys' rule implies that we should take a prior
$$p(\pi)\propto\sqrt{I(\pi\mid y)}\propto\pi^{-1}(1-\pi)^{-1/2},$$
that is, $\mathrm{Be}(0,\tfrac12)$ instead of $\mathrm{Be}(\tfrac12,\tfrac12)$.
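The dependence of Jeffreys' prior on the sampling rule can be confirmed symbolically. The following minimal Python sketch (using sympy; the variable names are of course ours) computes the Fisher information under binomial and negative binomial sampling and shows that the two implied priors differ:

```python
# Sketch: Jeffreys' prior is proportional to sqrt(Fisher information),
# which differs between binomial and negative binomial sampling.
import sympy as sp

pi, n, x, m, y = sp.symbols('pi n x m y', positive=True)

# Binomial: terms of the log-likelihood that depend on pi, with E(x) = n*pi
logL_bin = x*sp.log(pi) + (n - x)*sp.log(1 - pi)
I_bin = -(sp.diff(logL_bin, pi, 2)).subs(x, n*pi)
print(sp.simplify(I_bin))   # equals n/(pi*(1-pi)): prior ~ pi^(-1/2)*(1-pi)^(-1/2)

# Negative binomial: y failures before the m-th success, E(y) = m*(1-pi)/pi
logL_nb = m*sp.log(pi) + y*sp.log(1 - pi)
I_nb = -(sp.diff(logL_nb, pi, 2)).subs(y, m*(1 - pi)/pi)
print(sp.simplify(I_nb))    # equals m/(pi**2*(1-pi)): prior ~ pi^(-1)*(1-pi)^(-1/2)
```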
7.4.2 Conclusion
Consequently, on being told that an experiment resulted in, say, ten successes and ten failures, Jeffreys' rule does not allow us to decide which prior to use until we know whether the experimental design involved a fixed number of trials, or waiting until a fixed number of successes, or some other method. This clearly violates the likelihood principle (cf. Lindley, 1971a, Section 12.4); insofar as they appear to include Jeffreys' work, it is hard to see how Berger and Wolpert (1988, Section 4.1.2) come to the conclusion that ‘… use of noninformative priors, purposely not involving subjective prior opinions … is consistent with the LP [Likelihood Principle]’. Some further difficulties inherent in the notion of a uniform reference prior are discussed in Hill (1980) and in Berger and Wolpert (1988).
However, it has been argued that a reference prior should express ignorance relative to the information which can be supplied by a particular experiment; see Box and Tiao (1992, Section 1.3). In any case, provided they are used critically, reference priors can be very useful, and, of course, if there is a reasonable amount of data, the precise form of the prior adopted will not make a great deal of difference.
7.5 Bayesian decision theory
7.5.1 The elements of game theory
Only a very brief account of this important topic is included here; readers who want to know more should begin by consulting Berger (1985) and Ferguson (1967).
The elements of decision theory are very similar to those of the mathematical theory of games as developed by von Neumann and Morgenstern (1953), although for statistical purposes one of the players is nature (in some sense) rather than another player. Only those aspects of the theory of games which are strictly necessary are given here; an entertaining popular account is given by Williams (1966). A two-person zero-sum game has the following three basic elements:
1. A non-empty set Θ of possible states of nature θ, sometimes called the parameter space;
2. A non-empty set A of actions available to the statistician;
3. A loss function L, which defines the loss $L(\theta,a)$ which a statistician suffers if he takes action a when the true state of nature is θ (this loss being expressed as a real number).
A statistical decision problem or a statistical game is a game $(\Theta, A, L)$ coupled with an experiment whose result x lies in a sample space $\mathcal{X}$ and is randomly distributed with a density $p(x\mid\theta)$ which depends on the state $\theta\in\Theta$ ‘chosen’ by nature. The data x can be, and usually is, more than one-dimensional.
Now suppose that on the basis of the result x of the experiment, the statistician chooses an action $a=d(x)$, resulting in a random loss $L(\theta,d(x))$. Taking expectations over possible outcomes x of the experiment, we get a risk function
$$R(\theta,d)=\mathrm{E}\,L(\theta,d(x))=\int L(\theta,d(x))\,p(x\mid\theta)\,\mathrm{d}x,$$
which depends on the true state of nature θ and the form of the function d by which the action to be taken once the result of the experiment is known is determined. It is possible that this expectation may not exist, or may be infinite, but we shall exclude such cases and define a (nonrandomized) decision rule or a decision function as any function d for which $R(\theta,d)$ exists and is finite for all $\theta\in\Theta$.
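When no closed form is available, the risk function can be approximated by simulation. The following minimal Python sketch (the normal model and the rule d(x) = x are invented for illustration) estimates $R(\theta,d)$ under quadratic loss by Monte Carlo:

```python
# Sketch: approximate R(theta, d) = E[L(theta, d(x))] by Monte Carlo,
# here with quadratic loss and x ~ N(theta, 1).
import numpy as np

rng = np.random.default_rng(0)

def risk(theta, d, n_rep=100_000):
    """Approximate R(theta, d) = E[(theta - d(x))^2] for x ~ N(theta, 1)."""
    x = rng.normal(theta, 1.0, size=n_rep)
    return np.mean((theta - d(x))**2)

d = lambda x: x          # estimate theta by the observation itself
print(risk(0.0, d))      # close to 1, the sampling variance
```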
An important particular case of an action dependent on the outcome of an experiment is that of a point estimator of a parameter θ, that is, a rule for finding a single number from the data which in some way best represents the parameter under study.
For classical statisticians, an important notion is admissibility. An estimator $d'$ is said to dominate an estimator $d$ if
$$R(\theta,d')\leqslant R(\theta,d)$$
for all θ with strict inequality for at least one value of θ. An estimator $d$ is said to be inadmissible if it is dominated by some other estimator $d'$. The notion of admissibility is clearly related to that of Pareto optimality. [‘Pareto optimality’ is a condition where no one is worse off in one state than another but someone is better off, and there is no state ‘Pareto superior’ to it (i.e. in which more people would be better off without anyone being worse off).]
From the point of view of classical statisticians, it is very undesirable to use inadmissible actions. From a Bayesian point of view, admissibility is not generally very relevant. In the words of Press (1989, Section 2.3.1),
Admissibility requires that we average over all possible values of the observable random variables (the expectation is taken with respect to the observables). In experimental design situations, statisticians must be concerned with the performance of estimators for many possible situational repetitions and for many values of the observables, and then admissibility is a reasonable Bayesian performance criterion. In most other situations, however, statisticians are less concerned with performance of an estimator over many possible samples that have yet to be observed than they are with the performance of their estimator conditional upon having observed this particular data set and conditional upon all prior information available. For this reason, in non-experimental design situations, admissibility is generally not a compelling criterion for influencing our choice of estimator.
For the moment we shall, however, follow the investigation from a classical standpoint.
From a Bayesian viewpoint, we must suppose that we have prior beliefs about θ which can be expressed in terms of a prior density $p(\theta)$. The Bayes risk r(d) of the decision rule d can then be defined as the expectation of $R(\theta,d)$ over all possible values of θ, that is,
$$r(d)=\mathrm{E}\,R(\theta,d)=\int R(\theta,d)\,p(\theta)\,\mathrm{d}\theta.$$
It seems sensible to minimize one's losses, and accordingly a Bayes decision rule d is defined as one which minimizes the Bayes risk r(d). Now
$$r(d)=\int\!\!\int L(\theta,d(x))\,p(x\mid\theta)\,p(\theta)\,\mathrm{d}x\,\mathrm{d}\theta
=\int\left\{\int L(\theta,d(x))\,p(\theta\mid x)\,\mathrm{d}\theta\right\}p(x)\,\mathrm{d}x.$$
It follows that if the posterior expected loss of an action a is defined by
$$\rho(a,x)=\int L(\theta,a)\,p(\theta\mid x)\,\mathrm{d}\theta=\mathrm{E}\{L(\theta,a)\mid x\},$$
then $r(d)=\int\rho(d(x),x)\,p(x)\,\mathrm{d}x$, and the Bayes risk is minimized if the decision rule d is so chosen that $\rho(d(x),x)$ is a minimum for all x (technically, for those who know measure theory, it need only be a minimum for almost all x).
Raiffa and Schlaifer (1961, Sections 1.2–1.3) refer to the overall minimization of r(d) as the normal form of Bayesian analysis and to the minimization of $\rho(a,x)$ for each x as the extensive form; the aforementioned remark shows that the two are equivalent.
When a number of possible prior distributions are under consideration, one sometimes finds that the term Bayes rule as such is restricted to rules resulting from proper priors, while those resulting from improper priors are called generalized Bayes rules. Further extensions are mentioned in Ferguson (1967).
7.5.2 Point estimators resulting from quadratic loss
A Bayes decision rule in the case of point estimation is referred to as a Bayes estimator. In such problems, it is easiest to work with quadratic loss, that is, with a squared-error loss function
$$L(\theta,a)=(\theta-a)^2.$$
In this case, $\rho(d(x),x)$ is the mean square error, that is,
$$\rho(d(x),x)=\mathrm{E}\{(\theta-d(x))^2\mid x\}
=\mathcal{V}(\theta\mid x)
+2\{\mathrm{E}(\theta\mid x)-d(x)\}\,\mathrm{E}\{\theta-\mathrm{E}(\theta\mid x)\mid x\}
+\{\mathrm{E}(\theta\mid x)-d(x)\}^2.$$
The second term clearly vanishes, so that
$$\rho(d(x),x)=\mathcal{V}(\theta\mid x)+\{\mathrm{E}(\theta\mid x)-d(x)\}^2,$$
which is a minimum when $d(x)=\mathrm{E}(\theta\mid x)$, so that a Bayes estimator d(x) is the posterior mean of θ, and in this case $\rho(d(x),x)$ is the posterior variance of θ.
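This can be checked numerically on a grid. In the following sketch (the discretized normal posterior is invented for illustration), the posterior expected quadratic loss $\rho(a,x)$ is smallest at the posterior mean:

```python
# Sketch: for a discretized posterior, rho(a, x) = sum p(theta|x) (theta - a)^2
# is minimized at the posterior mean.
import numpy as np

theta = np.linspace(-5, 5, 2001)
post = np.exp(-0.5*(theta - 1.0)**2)       # unnormalized N(1, 1) posterior
post /= post.sum()

a_grid = np.linspace(-2, 4, 601)
rho = [(post*(theta - a)**2).sum() for a in a_grid]
print(a_grid[np.argmin(rho)])              # ~1.0, the minimizing action
print((post*theta).sum())                  # the posterior mean itself
```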
7.5.3 Particular cases of quadratic loss
As a particular case, if we have a single observation $x\sim\mathrm{N}(\theta,\phi)$ where φ is known and our prior for θ is $\mathrm{N}(\theta_0,\phi_0)$, so that our posterior is $\mathrm{N}(\theta_1,\phi_1)$ with $\phi_1=(1/\phi_0+1/\phi)^{-1}$ and $\theta_1=\phi_1(\theta_0/\phi_0+x/\phi)$ (cf. Section 2.2 on ‘Normal prior and likelihood’), then an estimate of θ that minimizes quadratic loss is the posterior mean $\theta_1$, and if that estimate is used the mean square error is the posterior variance $\phi_1$.
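For instance, with invented values $\phi=4$, $x=2.5$, $\theta_0=0$ and $\phi_0=1$, the Bayes estimate and its mean square error work out as follows:

```python
# Sketch of the normal/normal case: prior N(theta0, phi0), one x ~ N(theta, phi).
phi, x = 4.0, 2.5            # known data variance and observed value (invented)
theta0, phi0 = 0.0, 1.0      # prior mean and variance (invented)

phi1 = 1.0/(1.0/phi0 + 1.0/phi)          # posterior variance
theta1 = phi1*(theta0/phi0 + x/phi)      # posterior mean = Bayes estimate
print(theta1, phi1)                      # 0.5, 0.8
```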
For another example, suppose that $x\sim\mathrm{P}(\lambda)$, that is, that x has a Poisson distribution of mean λ, and that our prior density for λ is $p(\lambda)$. First note that the predictive density of x is
$$p(x)=\int p(x\mid\lambda)\,p(\lambda)\,\mathrm{d}\lambda
=\int\frac{\lambda^x\mathrm{e}^{-\lambda}}{x!}\,p(\lambda)\,\mathrm{d}\lambda.$$
To avoid ambiguity in what follows, $\widetilde{p}$ is used for this predictive distribution, so that $p$ on its own just denotes the prior density $p(\lambda)$. Then as
$$(x+1)\,\widetilde{p}(x+1)=(x+1)\int\frac{\lambda^{x+1}\mathrm{e}^{-\lambda}}{(x+1)!}\,p(\lambda)\,\mathrm{d}\lambda
=\int\lambda\,\frac{\lambda^{x}\mathrm{e}^{-\lambda}}{x!}\,p(\lambda)\,\mathrm{d}\lambda,$$
it follows that the posterior mean is
$$\mathrm{E}(\lambda\mid x)=\frac{\int\lambda\,p(x\mid\lambda)\,p(\lambda)\,\mathrm{d}\lambda}{\widetilde{p}(x)}
=(x+1)\,\frac{\widetilde{p}(x+1)}{\widetilde{p}(x)}.$$
We shall return to this example in Section 7.8 in connection with empirical Bayes methods.
Note incidentally that if the prior is $\lambda\sim S_0^{-1}\chi^2_\nu$, then the posterior is $\lambda\mid x\sim(S_0+2)^{-1}\chi^2_{\nu+2x}$ (as shown in Section 3.4 on ‘The Poisson distribution’), so that in this particular case
$$\mathrm{E}(\lambda\mid x)=\frac{\nu+2x}{S_0+2}.$$
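The predictive-ratio form of the posterior mean can be checked numerically for a conjugate prior. In the sketch below, a gamma prior $\mathrm{G}(\alpha,\beta)$ (parameter values invented) makes the predictive density of x negative binomial, and the two expressions for $\mathrm{E}(\lambda\mid x)$ agree:

```python
# Sketch: verify E(lambda | x) = (x+1) * ptilde(x+1) / ptilde(x) under a
# gamma prior, for which the predictive is NB(alpha, beta/(beta+1)).
from scipy.stats import nbinom

alpha, beta = 3.0, 2.0       # gamma prior G(alpha, beta) for lambda (invented)
ptilde = lambda x: nbinom.pmf(x, alpha, beta/(beta + 1.0))

x = 4
print((x + 1)*ptilde(x + 1)/ptilde(x))   # predictive-ratio form
print((alpha + x)/(beta + 1.0))          # conjugate posterior mean: agrees
```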
7.5.4 Weighted quadratic loss
However, you should not go away with the conclusion that the solution to all problems of point estimation from a Bayesian point of view is simply to quote the posterior mean – the answer depends on the loss function. Thus, if we take as loss function a weighted quadratic loss, that is,
$$L(\theta,a)=w(\theta)\,(\theta-a)^2,$$
then
$$\rho(d(x),x)=\mathrm{E}\{w(\theta)\,(\theta-d(x))^2\mid x\}.$$
If we now define
$$\mathrm{E}^w(\theta\mid x)=\frac{\mathrm{E}\{w(\theta)\,\theta\mid x\}}{\mathrm{E}\{w(\theta)\mid x\}},$$
then similar calculations to those above show that
$$\rho(d(x),x)=\mathrm{E}\{w(\theta)\mid x\}\,\{\mathrm{E}^w(\theta\mid x)-d(x)\}^2+\text{a term not depending on }d(x),$$
and hence that a Bayes decision results if
$$d(x)=\mathrm{E}^w(\theta\mid x),$$
that is, d(x) is a weighted posterior mean of θ.
7.5.5 Absolute error loss
A further answer results if we take as loss function the absolute error
$$L(\theta,a)=|\theta-a|,$$
in which case
$$\rho(a,x)=\mathrm{E}(|\theta-a|\mid x)$$
is sometimes referred to as the mean absolute deviation or MAD. In this case, any median m(x) of the posterior distribution given x, that is, any value such that
$$\mathrm{P}(\theta\leqslant m(x)\mid x)\geqslant\tfrac12
\quad\text{and}\quad
\mathrm{P}(\theta\geqslant m(x)\mid x)\geqslant\tfrac12,$$
is a Bayes rule. To show this suppose that d(x) is any other rule and, for definiteness, that d(x) > m(x) for some particular x (the proof is similar if d(x) < m(x)). Then for $\theta\leqslant m(x)$
$$L(\theta,d(x))-L(\theta,m(x))=d(x)-m(x),$$
while for $\theta>m(x)$
$$L(\theta,d(x))-L(\theta,m(x))\geqslant-\{d(x)-m(x)\},$$
so that
$$L(\theta,d(x))-L(\theta,m(x))\geqslant\{d(x)-m(x)\}\bigl[\mathrm{I}\{\theta\leqslant m(x)\}-\mathrm{I}\{\theta>m(x)\}\bigr],$$
and hence on taking expectations over θ
$$\rho(d(x),x)-\rho(m(x),x)\geqslant\{d(x)-m(x)\}\bigl[\mathrm{P}(\theta\leqslant m(x)\mid x)-\mathrm{P}(\theta>m(x)\mid x)\bigr]\geqslant 0,$$
from which it follows that taking the posterior median is indeed the appropriate Bayes rule for this loss function. More generally, a loss function which is $u(a-\theta)$ if $a\geqslant\theta$ but $v(\theta-a)$ if $a<\theta$ results in a Bayes estimator which is a $v/(u+v)$ fractile of the posterior distribution.
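A grid check confirms this for a skewed posterior. In the sketch below (the Gamma(2) posterior is invented for illustration), $\mathrm{E}(|\theta-a|\mid x)$ is minimized very close to the posterior median:

```python
# Sketch: the posterior expected absolute error is minimized near the median.
import numpy as np
from scipy.stats import gamma

theta = np.linspace(0.01, 20, 4000)
post = gamma.pdf(theta, a=2.0)           # plays the role of p(theta | x)
post /= post.sum()

a_grid = np.linspace(0.1, 6, 600)
mad = [(post*np.abs(theta - a)).sum() for a in a_grid]
print(a_grid[np.argmin(mad)])            # close to ...
print(gamma.ppf(0.5, a=2.0))             # ... the Gamma(2) median, ~1.678
```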
7.5.6 Zero-one loss
Yet another answer results from the loss function
$$L(\theta,a)=\begin{cases}0&\text{if }|\theta-a|\leqslant\varepsilon,\\[2pt] 1&\text{if }|\theta-a|>\varepsilon,\end{cases}$$
which results in
$$\rho(a,x)=\mathrm{P}(|\theta-a|>\varepsilon\mid x)=1-\mathrm{P}(a-\varepsilon\leqslant\theta\leqslant a+\varepsilon\mid x).$$
Consequently, if a modal interval of length $2\varepsilon$ is defined as an interval $(a-\varepsilon,\,a+\varepsilon)$ which has highest probability for given $\varepsilon$, then the midpoint a of this interval is a Bayes estimate for this loss function. [A modal interval is, of course, just another name for a highest density region (HDR) except for the presumption that $\varepsilon$ will usually be small.] If $\varepsilon$ is fairly small, this value is clearly very close to the posterior mode of the distribution, which in its turn will be close to the maximum likelihood estimator if the prior is reasonably smooth.
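The modal interval itself can be found by direct search. The sketch below (invented Gamma(2) posterior, $\varepsilon=0.5$) locates the interval of length $2\varepsilon$ with highest probability and compares its midpoint with the posterior mode:

```python
# Sketch: find the modal interval of length 2*eps on a grid and compare its
# midpoint with the posterior mode.
import numpy as np
from scipy.stats import gamma

theta = np.linspace(0.01, 20, 20000)
dens = gamma.pdf(theta, a=2.0)           # plays the role of p(theta | x)
eps = 0.5
dtheta = theta[1] - theta[0]
k = int(2*eps/dtheta)                    # grid steps spanning length 2*eps
cum = np.concatenate([[0.0], np.cumsum(dens)*dtheta])
probs = cum[k:] - cum[:-k]               # P(a - eps <= theta <= a + eps)
i = np.argmax(probs)
print(theta[i] + eps)                    # midpoint of the modal interval
print(theta[np.argmax(dens)])            # posterior mode; close for small eps
```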
Thus all three of mean, median and mode of the posterior distribution can arise as Bayes estimators for suitable loss functions (namely, quadratic error, absolute error and zero-one error, respectively).
7.5.7 General discussion of point estimation
Some Bayesian statisticians pour scorn on the whole idea of point estimators; see Box and Tiao (1992, Section A5.6). There are certainly doubtful points about the preceding analysis. It is difficult to be convinced that a particular loss function represents real economic penalties in a particular case, and in many scientific contexts, it is difficult to give any meaning at all to the notion of the loss suffered by making a wrong point estimate. Certainly, the same loss function will not be valid in all cases. Moreover, even with quadratic loss, which is often treated as a norm, there are problems which do not admit of an easy solution. If, for example, the posterior distribution is Cauchy, so that $p(\theta\mid x)\propto\{1+(\theta-x)^2\}^{-1}$, then it would seem reasonable to estimate θ by $d(x)=x$, and yet the mean square error is infinite. Of course, such decision functions are excluded by requiring that the risk function should be finite, but this is clearly a case of what Good (1965, Section 6.2) referred to as ‘adhockery’.
Even though there are cases (e.g. where the posterior distribution is bimodal) in which there is no sensible point estimator, I think there are cases where it is reasonable to ask for such a thing, though I have considerable doubts as to whether the ideas of decision theory add much to the appeal that quantities such as the posterior mean, median or mode have in themselves.
7.6 Bayes linear methods
7.6.1 Methodology
Bayes linear methods are closely related to point estimators resulting from quadratic loss. Suppose that we restrict attention to decision rules d(x) which are constrained to be a linear function $\alpha+\beta y$ of some known function y = y(x) of x and seek a rule which, subject to this constraint, has minimum Bayes risk r(d). The resulting rule will not usually be a Bayes rule, but will not, on the other hand, necessitate a complete specification of the prior distribution. As we have seen that it can be very difficult to provide such a specification, there are real advantages to Bayes linear methods. To find such an estimator we need to minimize
$$r(d)=\mathrm{E}(\theta-\alpha-\beta y)^2
=\mathcal{V}\theta-2\beta\,\mathcal{C}(\theta,y)+\beta^2\,\mathcal{V}y+(\mathrm{E}\theta-\alpha-\beta\,\mathrm{E}y)^2$$
(since cross terms involving $\mathrm{E}\theta-\alpha-\beta\,\mathrm{E}y$ clearly vanish). By setting $\partial r/\partial\beta=0$, we see that the values $\alpha$ and $\beta$ which minimize r satisfy
$$\beta=\mathcal{C}(\theta,y)/\mathcal{V}y,$$
and then setting $\partial r/\partial\alpha=0$ we see that
$$\alpha=\mathrm{E}\theta-\beta\,\mathrm{E}y,$$
so that the Bayes linear estimator is
$$d(x)=\mathrm{E}\theta+\frac{\mathcal{C}(\theta,y)}{\mathcal{V}y}\,(y-\mathrm{E}y).$$
It should be noted that, in contrast to Bayes decision rules, Bayes linear estimators do not depend solely on the observed data x – they also depend on the distribution of the data through $\mathrm{E}y$, $\mathcal{V}y$ and $\mathcal{C}(\theta,y)$. For that reason they violate the likelihood principle.
7.6.2 Some simple examples
7.6.2.1 Binomial mean
Suppose that $x\sim\mathrm{B}(n,\pi)$ and that y = x. Then using the results at the end of Section 1.5 on ‘Means and variances’
$$\mathrm{E}x=n\,\mathrm{E}\pi,\qquad
\mathcal{V}x=\mathrm{E}\,\mathcal{V}(x\mid\pi)+\mathcal{V}\,\mathrm{E}(x\mid\pi)=n\,\mathrm{E}\pi(1-\pi)+n^2\,\mathcal{V}\pi,\qquad
\mathcal{C}(\pi,x)=n\,\mathcal{V}\pi.$$
Since $\mathrm{E}\pi(1-\pi)=\mathrm{E}\pi(1-\mathrm{E}\pi)-\mathcal{V}\pi$, we see that
$$\mathcal{V}x=n\,\mathrm{E}\pi(1-\mathrm{E}\pi)+n(n-1)\,\mathcal{V}\pi,$$
so that
$$d(x)=\mathrm{E}\pi+\frac{n\,\mathcal{V}\pi}{\mathcal{V}x}\,(x-n\,\mathrm{E}\pi)=(1-w)\,\mathrm{E}\pi+w\,\frac{x}{n},$$
where
$$w=\frac{n\,\mathcal{V}\pi}{\mathrm{E}\pi(1-\mathrm{E}\pi)+(n-1)\,\mathcal{V}\pi}.$$
Note that the resulting posterior estimator for π depends only on $\mathrm{E}\pi$ and $\mathcal{V}\pi$. This is an advantage if you think you can be precise enough about your prior knowledge of π to specify these quantities but find difficulty in giving a full specification of your prior, which would be necessary for you to find the Bayes estimator for quadratic loss, namely the posterior mean $\mathrm{E}(\pi\mid x)$; the latter cannot be evaluated in terms of a few summary measures of the prior distribution. On the other hand, we have had to use, for example, the fact that $\mathrm{E}x=n\,\mathrm{E}\pi$, thus taking into account observations which might have been, but were not in fact, made, in contravention of the likelihood principle.
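A small sketch (the helper name and the numerical values are invented) implements $d(x)=\mathrm{E}\theta+\{\mathcal{C}(\theta,y)/\mathcal{V}y\}(y-\mathrm{E}y)$ and applies it to the binomial case just derived, checking it against the weighted form:

```python
# Sketch of the generic Bayes linear rule, applied to the binomial-mean case.
def bayes_linear(y, E_theta, E_y, V_y, C_theta_y):
    """d = E(theta) + Cov(theta, y)/V(y) * (y - E(y))."""
    return E_theta + (C_theta_y/V_y)*(y - E_y)

n, Epi, Vpi, x = 10, 0.5, 0.02, 7        # invented prior moments and data
Ex = n*Epi
Vx = n*(Epi*(1 - Epi) - Vpi) + n**2*Vpi  # n*E[pi(1-pi)] + n^2*V(pi)
Cpix = n*Vpi
print(bayes_linear(x, Epi, Ex, Vx, Cpix))          # ~0.593

w = n*Vpi/(Epi*(1 - Epi) + (n - 1)*Vpi)            # weighted form of the rule
print((1 - w)*Epi + w*x/n)                         # same value
```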
7.6.2.2 Negative binomial distribution
Suppose that $z\sim\mathrm{NB}(m,\pi)$ and that y = z. Then similar formulae are easily deduced from the results in Appendix A. The fact that different formulae from those in the binomial case result when m = x and z = n − x, so that in both cases we have observed x successes and n − x failures, reflects the fact that this method of inference does not obey the likelihood principle.
7.6.2.3 Estimation on the basis of the sample mean
Suppose that $x_1, x_2, \dots, x_n$ are such that $\mathrm{E}(x_i\mid\theta)=\theta$ and $\mathcal{V}(x_i\mid\theta)=\phi$ but that you know nothing more about the distribution of the $x_i$. Then with $y=\bar x$ we have
$$\mathrm{E}(\bar x\mid\theta)=\theta\quad\text{and}\quad\mathcal{V}(\bar x\mid\theta)=\phi/n,$$
so that using the results at the end of Section 1.5 on ‘Means and variances’
$$\mathrm{E}\bar x=\mathrm{E}\theta\quad\text{and}\quad\mathcal{V}\bar x=\phi/n+\mathcal{V}\theta.$$
Since $\mathcal{C}(\theta,\bar x)=\mathcal{V}\theta$, we see that
$$\beta=\frac{\mathcal{C}(\theta,\bar x)}{\mathcal{V}\bar x}=\frac{\mathcal{V}\theta}{\phi/n+\mathcal{V}\theta}.$$
It follows that
$$d(x)=(1-w)\,\mathrm{E}\theta+w\,\bar x,$$
where
$$w=\frac{n\,\mathcal{V}\theta}{\phi+n\,\mathcal{V}\theta}.$$
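With invented moments $\mathrm{E}\theta=10$, $\mathcal{V}\theta=4$ and $\phi=9$, the weight w and the resulting shrinkage estimate can be computed directly:

```python
# Sketch of the sample-mean Bayes linear rule d = (1-w)*E(theta) + w*xbar
# with w = n*V(theta)/(phi + n*V(theta)); all values invented.
import numpy as np

E_theta, V_theta, phi = 10.0, 4.0, 9.0   # prior moments and sampling variance
x = np.array([12.1, 9.8, 11.4, 10.7])    # invented data
n, xbar = len(x), x.mean()

w = n*V_theta/(phi + n*V_theta)          # w = 16/25 = 0.64
print((1 - w)*E_theta + w*xbar)          # shrinks xbar towards the prior mean
```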
7.6.3 Extensions
Bayes linear methods can be applied when there are several unknown parameters. A brief account can be found in O’Hagan (1994, Section 6.48 et seq.) and full coverage is given in Goldstein and Wooff (2007).
7.7 Decision theory and hypothesis testing
7.7.1 Relationship between decision theory and classical hypothesis testing
It is possible to reformulate hypothesis testing in the language of decision theory. If we want to test $H_0:\theta\in\Theta_0$ versus $H_1:\theta\in\Theta_1$, we have two actions open to us, namely,
$$a_0:\ \text{accept }H_0;\qquad a_1:\ \text{accept }H_1.$$
As before, we shall write $\pi_0$ and $\pi_1$ for the prior probabilities of $H_0$ and $H_1$ and $p_0$ and $p_1$ for their posterior probabilities and
$$B=\frac{p_0/p_1}{\pi_0/\pi_1}$$
for the Bayes factor. We also need the notation
$$\rho_0(\theta)=p(\theta)/\pi_0\quad(\theta\in\Theta_0)
\qquad\text{and}\qquad
\rho_1(\theta)=p(\theta)/\pi_1\quad(\theta\in\Theta_1),$$
where $p(\theta)$ is the prior density function.
Now let us suppose that there is a loss function $L(\theta,a)$ defined by
$$L(\theta,a_0)=\begin{cases}0&\theta\in\Theta_0\\ 1&\theta\in\Theta_1\end{cases}
\qquad
L(\theta,a_1)=\begin{cases}1&\theta\in\Theta_0\\ 0&\theta\in\Theta_1\end{cases}$$
so that the use of a decision rule d(x) results in a posterior expected loss function
$$\rho(a_0,x)=p_1\qquad\text{and}\qquad\rho(a_1,x)=p_0,$$
so that a decision d(x) which minimizes the posterior expected loss is just a decision to accept the hypothesis with the greater posterior probability, which is the way of choosing between hypotheses suggested when hypothesis testing was first introduced.
More generally, if there is a ‘0–$K_i$’ loss function, that is,
$$L(\theta,a_0)=\begin{cases}0&\theta\in\Theta_0\\ K_0&\theta\in\Theta_1\end{cases}
\qquad
L(\theta,a_1)=\begin{cases}K_1&\theta\in\Theta_0\\ 0&\theta\in\Theta_1\end{cases}$$
then the posterior expected losses of the two actions are
$$\rho(a_0,x)=K_0\,p_1\qquad\text{and}\qquad\rho(a_1,x)=K_1\,p_0,$$
so that a Bayes decision rule results in rejecting the null hypothesis, that is, in taking action $a_1$, if and only if $K_1p_0<K_0p_1$, that is,
$$p_0<\frac{K_0}{K_0+K_1}.$$
In the terminology of classical statistics, this corresponds to the use of a rejection region
$$R=\left\{x:\ p_0<\frac{K_0}{K_0+K_1}\right\}.$$
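The resulting rule is trivial to implement. In the sketch below (the posterior probability and the losses are invented), $K_0/(K_0+K_1)=0.05$ plays a role analogous to a significance level:

```python
# Sketch of the '0-K_i' decision rule: take a1 (reject H0) exactly when
# K1*p0 < K0*p1, i.e. when p0 < K0/(K0 + K1).
def decide(p0, K0=1.0, K1=19.0):
    """Return the action minimizing posterior expected loss."""
    return 'a1: reject H0' if p0 < K0/(K0 + K1) else 'a0: accept H0'

print(decide(0.03))   # 0.03 < 1/20 = 0.05, so reject
print(decide(0.10))   # accept
```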
When hypothesis testing was first introduced in Section 4.1, we noted that in the case where $H_0$ and $H_1$ are simple hypotheses, so that $\Theta_0=\{\theta_0\}$ and $\Theta_1=\{\theta_1\}$, Bayes' theorem implies that
$$\frac{p_0}{p_1}=\frac{\pi_0}{\pi_1}\cdot\frac{p(x\mid\theta_0)}{p(x\mid\theta_1)}.$$