Care should be taken when using reference priors as a representation of prior ignorance. We have already seen in Section 2.4 on ‘Dominant likelihoods’ that the improper densities which often arise as reference priors should be regarded as approximations, reflecting the fact that our prior beliefs about an unknown parameter (or some function of it) are more or less uniform over a wide range. A different point to be aware of is that some ways of arriving at such priors, such as Jeffreys’ rule, depend on the experiment that is to be performed, and so on intentions. (The same objection applies, of course, to arguments based on data translated likelihoods.) Consequently, an analysis using such a prior is not in accordance with the likelihood principle.
To make this clearer, consider a sequence of independent trials, each of which results in success with probability $\pi$ or failure with probability $1-\pi$ (i.e. a sequence of Bernoulli trials). If we look at the number of successes x in a fixed number n of trials, so that
$$p(x\mid\pi)=\binom{n}{x}\pi^x(1-\pi)^{n-x},$$
then, as was shown in Section 3.3, Jeffreys' rule results in an arc-sine distribution
$$p(\pi)\propto\pi^{-1/2}(1-\pi)^{-1/2},$$
that is, $\pi\sim\mathrm{Be}(\tfrac12,\tfrac12)$, for the prior.
Now suppose that we decide to observe the number of failures y before the mth success. Evidently, there will be m successes and y failures, and the probability of any particular sequence with that number of successes and failures is $\pi^m(1-\pi)^y$. The number of such sequences is $\binom{m+y-1}{y}$, because the y failures and m−1 of the successes can occur in any order, but the sequence must conclude with a success. It follows that
$$p(y\mid\pi)=\binom{m+y-1}{y}\pi^m(1-\pi)^y,$$
that is, that y has a negative binomial distribution $\mathrm{NB}(m,\pi)$ (see Appendix A). For such a distribution
$$L(\pi\mid y)=\log p(y\mid\pi)=m\log\pi+y\log(1-\pi)+\text{constant},$$
so that
$$\frac{\partial^2 L}{\partial\pi^2}=-\frac{m}{\pi^2}-\frac{y}{(1-\pi)^2}.$$
Because $\mathrm{E}(y\mid\pi)=m(1-\pi)/\pi$, it follows that
$$I(\pi\mid y)=\frac{m}{\pi^2}+\frac{m(1-\pi)/\pi}{(1-\pi)^2}=\frac{m}{\pi^2(1-\pi)},$$
so that Jeffreys' rule implies that we should take a prior
$$p(\pi)\propto\sqrt{I(\pi\mid y)}\propto\pi^{-1}(1-\pi)^{-1/2},$$
that is, $\mathrm{Be}(0,\tfrac12)$ instead of $\mathrm{Be}(\tfrac12,\tfrac12)$.
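The dependence of Jeffreys' prior on the sampling rule can be confirmed symbolically. The following minimal Python sketch (using sympy; the variable names are of course ours) computes the Fisher information under binomial and negative binomial sampling and shows that the two implied priors differ:

```python
# Sketch: Jeffreys' prior is proportional to sqrt(Fisher information),
# which differs between binomial and negative binomial sampling.
import sympy as sp

pi, n, x, m, y = sp.symbols('pi n x m y', positive=True)

# Binomial: terms of the log-likelihood that depend on pi, with E(x) = n*pi
logL_bin = x*sp.log(pi) + (n - x)*sp.log(1 - pi)
I_bin = -(sp.diff(logL_bin, pi, 2)).subs(x, n*pi)
print(sp.simplify(I_bin))   # equals n/(pi*(1-pi)): prior ~ pi^(-1/2)*(1-pi)^(-1/2)

# Negative binomial: y failures before the m-th success, E(y) = m*(1-pi)/pi
logL_nb = m*sp.log(pi) + y*sp.log(1 - pi)
I_nb = -(sp.diff(logL_nb, pi, 2)).subs(y, m*(1 - pi)/pi)
print(sp.simplify(I_nb))    # equals m/(pi**2*(1-pi)): prior ~ pi^(-1)*(1-pi)^(-1/2)
```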
7.4.2 Conclusion
Consequently, on being told that an experiment resulted in, say, ten successes and ten failures, Jeffreys' rule does not allow us to decide which prior to use until we know whether the experimental design involved a fixed number of trials, or waiting until a fixed number of successes, or some other method. This clearly violates the likelihood principle (cf. Lindley, 1971a, Section 12.4); insofar as they appear to include Jeffreys' work, it is hard to see how Berger and Wolpert (1988, Section 4.1.2) come to the conclusion that ‘… use of noninformative priors, purposely not involving subjective prior opinions … is consistent with the LP [Likelihood Principle]’. Some further difficulties inherent in the notion of a uniform reference prior are discussed in Hill (1980) and in Berger and Wolpert (1988).
However, it has been argued that a reference prior should express ignorance relative to the information which can be supplied by a particular experiment; see Box and Tiao (1992, Section 1.3). In any case, provided they are used critically, reference priors can be very useful, and, of course, if there is a reasonable amount of data, the precise form of the prior adopted will not make a great deal of difference.
7.5 Bayesian decision theory
7.5.1 The elements of game theory
Only a very brief account of this important topic is included here; readers who want to know more should begin by consulting Berger (1985) and Ferguson (1967).
The elements of decision theory are very similar to those of the mathematical theory of games as developed by von Neumann and Morgenstern (1953), although for statistical purposes one of the players is nature (in some sense) rather than another player. Only those aspects of the theory of games which are strictly necessary are given here; an entertaining popular account is given by Williams (1966). A two-person zero-sum game has the following three basic elements:
1. A non-empty set Θ of possible states of nature θ, sometimes called the parameter space;
2. A non-empty set A of actions available to the statistician;
3. A loss function L, which defines the loss $L(\theta,a)$ which a statistician suffers if he takes action a when the true state of nature is θ (this loss being expressed as a real number).
A statistical decision problem or a statistical game is a game $(\Theta, A, L)$ coupled with an experiment whose result x lies in a sample space $\mathcal{X}$ and is randomly distributed with a density $p(x\mid\theta)$ which depends on the state $\theta\in\Theta$ ‘chosen’ by nature. The data x can be, and usually is, more than one-dimensional.
Now suppose that on the basis of the result x of the experiment, the statistician chooses an action $a=d(x)$, resulting in a random loss $L(\theta,d(x))$. Taking expectations over possible outcomes x of the experiment, we get a risk function
$$R(\theta,d)=\mathrm{E}\,L(\theta,d(x))=\int L(\theta,d(x))\,p(x\mid\theta)\,\mathrm{d}x,$$
which depends on the true state of nature θ and the form of the function d by which the action to be taken once the result of the experiment is known is determined. It is possible that this expectation may not exist, or may be infinite, but we shall exclude such cases and define a (nonrandomized) decision rule or a decision function as any function d for which $R(\theta,d)$ exists and is finite for all $\theta\in\Theta$.
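When no closed form is available, the risk function can be approximated by simulation. The following minimal Python sketch (the normal model and the rule d(x) = x are invented for illustration) estimates $R(\theta,d)$ under quadratic loss by Monte Carlo:

```python
# Sketch: approximate R(theta, d) = E[L(theta, d(x))] by Monte Carlo,
# here with quadratic loss and x ~ N(theta, 1).
import numpy as np

rng = np.random.default_rng(0)

def risk(theta, d, n_rep=100_000):
    """Approximate R(theta, d) = E[(theta - d(x))^2] for x ~ N(theta, 1)."""
    x = rng.normal(theta, 1.0, size=n_rep)
    return np.mean((theta - d(x))**2)

d = lambda x: x          # estimate theta by the observation itself
print(risk(0.0, d))      # close to 1, the sampling variance
```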
An important particular case of an action dependent on the outcome of an experiment is that of a point estimator of a parameter θ, that is, a rule for finding a single number from the data which in some way best represents the parameter under study.
For classical statisticians, an important notion is admissibility. An estimator $d'$ is said to dominate an estimator $d$ if
$$R(\theta,d')\leqslant R(\theta,d)$$
for all θ with strict inequality for at least one value of θ. An estimator $d$ is said to be inadmissible if it is dominated by some other estimator $d'$. The notion of admissibility is clearly related to that of Pareto optimality. [‘Pareto optimality’ is a condition where no one is worse off in one state than another but someone is better off, and there is no state ‘Pareto superior’ to it (i.e. in which more people would be better off without anyone being worse off).]
From the point of view of classical statisticians, it is very undesirable to use inadmissible actions. From a Bayesian point of view, admissibility is not generally very relevant. In the words of Press (1989, Section 2.3.1),
Admissibility requires that we average over all possible values of the observable random variables (the expectation is taken with respect to the observables). In experimental design situations, statisticians must be concerned with the performance of estimators for many possible situational repetitions and for many values of the observables, and then admissibility is a reasonable Bayesian performance criterion. In most other situations, however, statisticians are less concerned with performance of an estimator over many possible samples that have yet to be observed than they are with the performance of their estimator conditional upon having observed this particular data set and conditional upon all prior information available. For this reason, in non-experimental design situations, admissibility is generally not a compelling criterion for influencing our choice of estimator.
For the moment we shall, however, follow the investigation from a classical standpoint.
From a Bayesian viewpoint, we must suppose that we have prior beliefs about θ which can be expressed in terms of a prior density $p(\theta)$. The Bayes risk r(d) of the decision rule d can then be defined as the expectation of $R(\theta,d)$ over all possible values of θ, that is,
$$r(d)=\mathrm{E}\,R(\theta,d)=\int R(\theta,d)\,p(\theta)\,\mathrm{d}\theta.$$
It seems sensible to minimize one's losses, and accordingly a Bayes decision rule d is defined as one which minimizes the Bayes risk r(d). Now
$$r(d)=\int\!\!\int L(\theta,d(x))\,p(x\mid\theta)\,p(\theta)\,\mathrm{d}x\,\mathrm{d}\theta
=\int\left\{\int L(\theta,d(x))\,p(\theta\mid x)\,\mathrm{d}\theta\right\}p(x)\,\mathrm{d}x.$$
It follows that if the posterior expected loss of an action a is defined by
$$\rho(a,x)=\int L(\theta,a)\,p(\theta\mid x)\,\mathrm{d}\theta=\mathrm{E}\{L(\theta,a)\mid x\},$$
then $r(d)=\int\rho(d(x),x)\,p(x)\,\mathrm{d}x$, and the Bayes risk is minimized if the decision rule d is so chosen that $\rho(d(x),x)$ is a minimum for all x (technically, for those who know measure theory, it need only be a minimum for almost all x).
Raiffa and Schlaifer (1961, Sections 1.2–1.3) refer to the overall minimization of r(d) as the normal form of Bayesian analysis and to the minimization of $\rho(a,x)$ for each x as the extensive form; the aforementioned remark shows that the two are equivalent.
When a number of possible prior distributions are under consideration, one sometimes finds that the term Bayes rule as such is restricted to rules resulting from proper priors, while those resulting from improper priors are called generalized Bayes rules. Further extensions are mentioned in Ferguson (1967).
7.5.2 Point estimators resulting from quadratic loss
A Bayes decision rule in the case of point estimation is referred to as a Bayes estimator. In such problems, it is easiest to work with quadratic loss, that is, with a squared-error loss function
$$L(\theta,a)=(\theta-a)^2.$$
In this case, $\rho(d(x),x)$ is the mean square error, that is,
$$\rho(d(x),x)=\mathrm{E}\{(\theta-d(x))^2\mid x\}
=\mathcal{V}(\theta\mid x)
+2\{\mathrm{E}(\theta\mid x)-d(x)\}\,\mathrm{E}\{\theta-\mathrm{E}(\theta\mid x)\mid x\}
+\{\mathrm{E}(\theta\mid x)-d(x)\}^2.$$
The second term clearly vanishes, so that
$$\rho(d(x),x)=\mathcal{V}(\theta\mid x)+\{\mathrm{E}(\theta\mid x)-d(x)\}^2,$$
which is a minimum when $d(x)=\mathrm{E}(\theta\mid x)$, so that a Bayes estimator d(x) is the posterior mean of θ, and in this case $\rho(d(x),x)$ is the posterior variance of θ.
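This can be checked numerically on a grid. In the following sketch (the discretized normal posterior is invented for illustration), the posterior expected quadratic loss $\rho(a,x)$ is smallest at the posterior mean:

```python
# Sketch: for a discretized posterior, rho(a, x) = sum p(theta|x) (theta - a)^2
# is minimized at the posterior mean.
import numpy as np

theta = np.linspace(-5, 5, 2001)
post = np.exp(-0.5*(theta - 1.0)**2)       # unnormalized N(1, 1) posterior
post /= post.sum()

a_grid = np.linspace(-2, 4, 601)
rho = [(post*(theta - a)**2).sum() for a in a_grid]
print(a_grid[np.argmin(rho)])              # ~1.0, the minimizing action
print((post*theta).sum())                  # the posterior mean itself
```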
7.5.3 Particular cases of quadratic loss
As a particular case, if we have a single observation $x\sim\mathrm{N}(\theta,\phi)$ where φ is known and our prior for θ is $\mathrm{N}(\theta_0,\phi_0)$, so that our posterior is $\mathrm{N}(\theta_1,\phi_1)$ with $\phi_1=(1/\phi_0+1/\phi)^{-1}$ and $\theta_1=\phi_1(\theta_0/\phi_0+x/\phi)$ (cf. Section 2.2 on ‘Normal prior and likelihood’), then an estimate of θ that minimizes quadratic loss is the posterior mean $\theta_1$, and if that estimate is used the mean square error is the posterior variance $\phi_1$.
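For instance, with invented values $\phi=4$, $x=2.5$, $\theta_0=0$ and $\phi_0=1$, the Bayes estimate and its mean square error work out as follows:

```python
# Sketch of the normal/normal case: prior N(theta0, phi0), one x ~ N(theta, phi).
phi, x = 4.0, 2.5            # known data variance and observed value (invented)
theta0, phi0 = 0.0, 1.0      # prior mean and variance (invented)

phi1 = 1.0/(1.0/phi0 + 1.0/phi)          # posterior variance
theta1 = phi1*(theta0/phi0 + x/phi)      # posterior mean = Bayes estimate
print(theta1, phi1)                      # 0.5, 0.8
```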
For another example, suppose that $x\sim\mathrm{P}(\lambda)$, that is, that x has a Poisson distribution of mean λ, and that our prior density for λ is $p(\lambda)$. First note that the predictive density of x is
$$p(x)=\int p(x\mid\lambda)\,p(\lambda)\,\mathrm{d}\lambda
=\int\frac{\lambda^x\mathrm{e}^{-\lambda}}{x!}\,p(\lambda)\,\mathrm{d}\lambda.$$
To avoid ambiguity in what follows, $\widetilde{p}$ is used for this predictive distribution, so that $p$ on its own just denotes the prior density $p(\lambda)$. Then as
$$(x+1)\,\widetilde{p}(x+1)=(x+1)\int\frac{\lambda^{x+1}\mathrm{e}^{-\lambda}}{(x+1)!}\,p(\lambda)\,\mathrm{d}\lambda
=\int\lambda\,\frac{\lambda^{x}\mathrm{e}^{-\lambda}}{x!}\,p(\lambda)\,\mathrm{d}\lambda,$$
it follows that the posterior mean is
$$\mathrm{E}(\lambda\mid x)=\frac{\int\lambda\,p(x\mid\lambda)\,p(\lambda)\,\mathrm{d}\lambda}{\widetilde{p}(x)}
=(x+1)\,\frac{\widetilde{p}(x+1)}{\widetilde{p}(x)}.$$
We shall return to this example in Section 7.8 in connection with empirical Bayes methods.
Note incidentally that if the prior is $\lambda\sim S_0^{-1}\chi^2_\nu$, then the posterior is $\lambda\mid x\sim(S_0+2)^{-1}\chi^2_{\nu+2x}$ (as shown in Section 3.4 on ‘The Poisson distribution’), so that in this particular case
$$\mathrm{E}(\lambda\mid x)=\frac{\nu+2x}{S_0+2}.$$
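The predictive-ratio form of the posterior mean can be checked numerically for a conjugate prior. In the sketch below, a gamma prior $\mathrm{G}(\alpha,\beta)$ (parameter values invented) makes the predictive density of x negative binomial, and the two expressions for $\mathrm{E}(\lambda\mid x)$ agree:

```python
# Sketch: verify E(lambda | x) = (x+1) * ptilde(x+1) / ptilde(x) under a
# gamma prior, for which the predictive is NB(alpha, beta/(beta+1)).
from scipy.stats import nbinom

alpha, beta = 3.0, 2.0       # gamma prior G(alpha, beta) for lambda (invented)
ptilde = lambda x: nbinom.pmf(x, alpha, beta/(beta + 1.0))

x = 4
print((x + 1)*ptilde(x + 1)/ptilde(x))   # predictive-ratio form
print((alpha + x)/(beta + 1.0))          # conjugate posterior mean: agrees
```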
7.5.4 Weighted quadratic loss
However, you should not go away with the conclusion that the solution to all problems of point estimation from a Bayesian point of view is simply to quote the posterior mean – the answer depends on the loss function. Thus, if we take as loss function a weighted quadratic loss, that is,
$$L(\theta,a)=w(\theta)\,(\theta-a)^2,$$
then
$$\rho(d(x),x)=\mathrm{E}\{w(\theta)\,(\theta-d(x))^2\mid x\}.$$
If we now define
$$\mathrm{E}^w(\theta\mid x)=\frac{\mathrm{E}\{w(\theta)\,\theta\mid x\}}{\mathrm{E}\{w(\theta)\mid x\}},$$
then similar calculations to those above show that
$$\rho(d(x),x)=\mathrm{E}\{w(\theta)\mid x\}\,\{\mathrm{E}^w(\theta\mid x)-d(x)\}^2+\text{a term not depending on }d(x),$$
and hence that a Bayes decision results if
$$d(x)=\mathrm{E}^w(\theta\mid x),$$
that is, d(x) is a weighted posterior mean of θ.
7.5.5 Absolute error loss
A further answer results if we take as loss function the absolute error
$$L(\theta,a)=|\theta-a|,$$
in which case
$$\rho(a,x)=\mathrm{E}(|\theta-a|\mid x)$$
is sometimes referred to as the mean absolute deviation or MAD. In this case, any median m(x) of the posterior distribution given x, that is, any value such that
$$\mathrm{P}(\theta\leqslant m(x)\mid x)\geqslant\tfrac12
\quad\text{and}\quad
\mathrm{P}(\theta\geqslant m(x)\mid x)\geqslant\tfrac12,$$
is a Bayes rule. To show this suppose that d(x) is any other rule and, for definiteness, that d(x) > m(x) for some particular x (the proof is similar if d(x) < m(x)). Then for $\theta\leqslant m(x)$
$$L(\theta,d(x))-L(\theta,m(x))=d(x)-m(x),$$
while for $\theta>m(x)$
$$L(\theta,d(x))-L(\theta,m(x))\geqslant-\{d(x)-m(x)\},$$
so that
$$L(\theta,d(x))-L(\theta,m(x))\geqslant\{d(x)-m(x)\}\bigl[\mathrm{I}\{\theta\leqslant m(x)\}-\mathrm{I}\{\theta>m(x)\}\bigr],$$
and hence on taking expectations over θ
$$\rho(d(x),x)-\rho(m(x),x)\geqslant\{d(x)-m(x)\}\bigl[\mathrm{P}(\theta\leqslant m(x)\mid x)-\mathrm{P}(\theta>m(x)\mid x)\bigr]\geqslant 0,$$
from which it follows that taking the posterior median is indeed the appropriate Bayes rule for this loss function. More generally, a loss function which is $u(a-\theta)$ if $a\geqslant\theta$ but $v(\theta-a)$ if $a<\theta$ results in a Bayes estimator which is a $v/(u+v)$ fractile of the posterior distribution.
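A grid check confirms this for a skewed posterior. In the sketch below (the Gamma(2) posterior is invented for illustration), $\mathrm{E}(|\theta-a|\mid x)$ is minimized very close to the posterior median:

```python
# Sketch: the posterior expected absolute error is minimized near the median.
import numpy as np
from scipy.stats import gamma

theta = np.linspace(0.01, 20, 4000)
post = gamma.pdf(theta, a=2.0)           # plays the role of p(theta | x)
post /= post.sum()

a_grid = np.linspace(0.1, 6, 600)
mad = [(post*np.abs(theta - a)).sum() for a in a_grid]
print(a_grid[np.argmin(mad)])            # close to ...
print(gamma.ppf(0.5, a=2.0))             # ... the Gamma(2) median, ~1.678
```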
7.5.6 Zero-one loss
Yet another answer results from the loss function
$$L(\theta,a)=\begin{cases}0&\text{if }|\theta-a|\leqslant\varepsilon,\\[2pt] 1&\text{if }|\theta-a|>\varepsilon,\end{cases}$$
which results in
$$\rho(a,x)=\mathrm{P}(|\theta-a|>\varepsilon\mid x)=1-\mathrm{P}(a-\varepsilon\leqslant\theta\leqslant a+\varepsilon\mid x).$$
Consequently, if a modal interval of length $2\varepsilon$ is defined as an interval $(a-\varepsilon,\,a+\varepsilon)$ which has highest probability for given $\varepsilon$, then the midpoint a of this interval is a Bayes estimate for this loss function. [A modal interval is, of course, just another name for a highest density region (HDR) except for the presumption that $\varepsilon$ will usually be small.] If $\varepsilon$ is fairly small, this value is clearly very close to the posterior mode of the distribution, which in its turn will be close to the maximum likelihood estimator if the prior is reasonably smooth.
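The modal interval itself can be found by direct search. The sketch below (invented Gamma(2) posterior, $\varepsilon=0.5$) locates the interval of length $2\varepsilon$ with highest probability and compares its midpoint with the posterior mode:

```python
# Sketch: find the modal interval of length 2*eps on a grid and compare its
# midpoint with the posterior mode.
import numpy as np
from scipy.stats import gamma

theta = np.linspace(0.01, 20, 20000)
dens = gamma.pdf(theta, a=2.0)           # plays the role of p(theta | x)
eps = 0.5
dtheta = theta[1] - theta[0]
k = int(2*eps/dtheta)                    # grid steps spanning length 2*eps
cum = np.concatenate([[0.0], np.cumsum(dens)*dtheta])
probs = cum[k:] - cum[:-k]               # P(a - eps <= theta <= a + eps)
i = np.argmax(probs)
print(theta[i] + eps)                    # midpoint of the modal interval
print(theta[np.argmax(dens)])            # posterior mode; close for small eps
```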
Thus all three of mean, median and mode of the posterior distribution can arise as Bayes estimators for suitable loss functions (namely, quadratic error, absolute error and zero-one error, respectively).
7.5.7 General discussion of point estimation
Some Bayesian statisticians pour scorn on the whole idea of point estimators; see Box and Tiao (1992, Section A5.6). There are certainly doubtful points about the preceding analysis. It is difficult to be convinced that a particular loss function represents real economic penalties in a particular case, and in many scientific contexts, it is difficult to give any meaning at all to the notion of the loss suffered by making a wrong point estimate. Certainly, the same loss function will not be valid in all cases. Moreover, even with quadratic loss, which is often treated as a norm, there are problems which do not admit of an easy solution. If, for example, the posterior distribution is Cauchy, so that $p(\theta\mid x)\propto\{1+(\theta-x)^2\}^{-1}$, then it would seem reasonable to estimate θ by $d(x)=x$, and yet the mean square error is infinite. Of course, such decision functions are excluded by requiring that the risk function should be finite, but this is clearly a case of what Good (1965, Section 6.2) referred to as ‘adhockery’.
Even though there are cases (e.g. where the posterior distribution is bimodal) in which there is no sensible point estimator, I think there are cases where it is reasonable to ask for such a thing, though I have considerable doubts as to whether the ideas of decision theory add much to the appeal that quantities such as the posterior mean, median or mode have in themselves.
7.6 Bayes linear methods
7.6.1 Methodology
Bayes linear methods are closely related to point estimators resulting from quadratic loss. Suppose that we restrict attention to decision rules d(x) which are constrained to be a linear function $\alpha+\beta y$ of some known function y = y(x) of x and seek a rule which, subject to this constraint, has minimum Bayes risk r(d). The resulting rule will not usually be a Bayes rule, but will not, on the other hand, necessitate a complete specification of the prior distribution. As we have seen that it can be very difficult to provide such a specification, there are real advantages to Bayes linear methods. To find such an estimator we need to minimize
$$r(d)=\mathrm{E}(\theta-\alpha-\beta y)^2
=\mathcal{V}\theta-2\beta\,\mathcal{C}(\theta,y)+\beta^2\,\mathcal{V}y+(\mathrm{E}\theta-\alpha-\beta\,\mathrm{E}y)^2$$
(since cross terms involving $\mathrm{E}\theta-\alpha-\beta\,\mathrm{E}y$ clearly vanish). By setting $\partial r/\partial\beta=0$, we see that the values $\alpha$ and $\beta$ which minimize r satisfy
$$\beta=\mathcal{C}(\theta,y)/\mathcal{V}y,$$
and then setting $\partial r/\partial\alpha=0$ we see that
$$\alpha=\mathrm{E}\theta-\beta\,\mathrm{E}y,$$
so that the Bayes linear estimator is
$$d(x)=\mathrm{E}\theta+\frac{\mathcal{C}(\theta,y)}{\mathcal{V}y}\,(y-\mathrm{E}y).$$
It should be noted that, in contrast to Bayes decision rules, Bayes linear estimators do not depend solely on the observed data x – they also depend on the distribution of the data through $\mathrm{E}y$, $\mathcal{V}y$ and $\mathcal{C}(\theta,y)$. For that reason they violate the likelihood principle.
7.6.2 Some simple examples
7.6.2.1 Binomial mean
Suppose that $x\sim\mathrm{B}(n,\pi)$ and that y = x. Then using the results at the end of Section 1.5 on ‘Means and variances’
$$\mathrm{E}x=n\,\mathrm{E}\pi,\qquad
\mathcal{V}x=\mathrm{E}\,\mathcal{V}(x\mid\pi)+\mathcal{V}\,\mathrm{E}(x\mid\pi)=n\,\mathrm{E}\pi(1-\pi)+n^2\,\mathcal{V}\pi,\qquad
\mathcal{C}(\pi,x)=n\,\mathcal{V}\pi.$$
Since $\mathrm{E}\pi(1-\pi)=\mathrm{E}\pi(1-\mathrm{E}\pi)-\mathcal{V}\pi$, we see that
$$\mathcal{V}x=n\,\mathrm{E}\pi(1-\mathrm{E}\pi)+n(n-1)\,\mathcal{V}\pi,$$
so that
$$d(x)=\mathrm{E}\pi+\frac{n\,\mathcal{V}\pi}{\mathcal{V}x}\,(x-n\,\mathrm{E}\pi)=(1-w)\,\mathrm{E}\pi+w\,\frac{x}{n},$$
where
$$w=\frac{n\,\mathcal{V}\pi}{\mathrm{E}\pi(1-\mathrm{E}\pi)+(n-1)\,\mathcal{V}\pi}.$$
Note that the resulting posterior estimator for π depends only on $\mathrm{E}\pi$ and $\mathcal{V}\pi$. This is an advantage if you think you can be precise enough about your prior knowledge of π to specify these quantities but find difficulty in giving a full specification of your prior, which would be necessary for you to find the Bayes estimator for quadratic loss, namely the posterior mean $\mathrm{E}(\pi\mid x)$; the latter cannot be evaluated in terms of a few summary measures of the prior distribution. On the other hand, we have had to use, for example, the fact that $\mathrm{E}x=n\,\mathrm{E}\pi$, thus taking into account observations which might have been, but were not in fact, made, in contravention of the likelihood principle.
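A small sketch (the helper name and the numerical values are invented) implements $d(x)=\mathrm{E}\theta+\{\mathcal{C}(\theta,y)/\mathcal{V}y\}(y-\mathrm{E}y)$ and applies it to the binomial case just derived, checking it against the weighted form:

```python
# Sketch of the generic Bayes linear rule, applied to the binomial-mean case.
def bayes_linear(y, E_theta, E_y, V_y, C_theta_y):
    """d = E(theta) + Cov(theta, y)/V(y) * (y - E(y))."""
    return E_theta + (C_theta_y/V_y)*(y - E_y)

n, Epi, Vpi, x = 10, 0.5, 0.02, 7        # invented prior moments and data
Ex = n*Epi
Vx = n*(Epi*(1 - Epi) - Vpi) + n**2*Vpi  # n*E[pi(1-pi)] + n^2*V(pi)
Cpix = n*Vpi
print(bayes_linear(x, Epi, Ex, Vx, Cpix))          # ~0.593

w = n*Vpi/(Epi*(1 - Epi) + (n - 1)*Vpi)            # weighted form of the rule
print((1 - w)*Epi + w*x/n)                         # same value
```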
7.6.2.2 Negative binomial distribution
Suppose that $z\sim\mathrm{NB}(m,\pi)$ and that y = z. Then similar formulae are easily deduced from the results in Appendix A. The fact that different formulae from those in the binomial case result when m = x and z = n − x, so that in both cases we have observed x successes and n − x failures, reflects the fact that this method of inference does not obey the likelihood principle.
7.6.2.3 Estimation on the basis of the sample mean
Suppose that $x_1, x_2, \dots, x_n$ are such that $\mathrm{E}(x_i\mid\theta)=\theta$ and $\mathcal{V}(x_i\mid\theta)=\phi$ but that you know nothing more about the distribution of the $x_i$. Then with $y=\bar x$ we have
$$\mathrm{E}(\bar x\mid\theta)=\theta\quad\text{and}\quad\mathcal{V}(\bar x\mid\theta)=\phi/n,$$
so that using the results at the end of Section 1.5 on ‘Means and variances’
$$\mathrm{E}\bar x=\mathrm{E}\theta\quad\text{and}\quad\mathcal{V}\bar x=\phi/n+\mathcal{V}\theta.$$
Since $\mathcal{C}(\theta,\bar x)=\mathcal{V}\theta$, we see that
$$\beta=\frac{\mathcal{C}(\theta,\bar x)}{\mathcal{V}\bar x}=\frac{\mathcal{V}\theta}{\phi/n+\mathcal{V}\theta}.$$
It follows that
$$d(x)=(1-w)\,\mathrm{E}\theta+w\,\bar x,$$
where
$$w=\frac{n\,\mathcal{V}\theta}{\phi+n\,\mathcal{V}\theta}.$$
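With invented moments $\mathrm{E}\theta=10$, $\mathcal{V}\theta=4$ and $\phi=9$, the weight w and the resulting shrinkage estimate can be computed directly:

```python
# Sketch of the sample-mean Bayes linear rule d = (1-w)*E(theta) + w*xbar
# with w = n*V(theta)/(phi + n*V(theta)); all values invented.
import numpy as np

E_theta, V_theta, phi = 10.0, 4.0, 9.0   # prior moments and sampling variance
x = np.array([12.1, 9.8, 11.4, 10.7])    # invented data
n, xbar = len(x), x.mean()

w = n*V_theta/(phi + n*V_theta)          # w = 16/25 = 0.64
print((1 - w)*E_theta + w*xbar)          # shrinks xbar towards the prior mean
```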
7.6.3 Extensions
Bayes linear methods can be applied when there are several unknown parameters. A brief account can be found in O’Hagan (1994, Section 6.48 et seq.) and full coverage is given in Goldstein and Wooff (2007).
7.7 Decision theory and hypothesis testing
7.7.1 Relationship between decision theory and classical hypothesis testing
It is possible to reformulate hypothesis testing in the language of decision theory. If we want to test $H_0:\theta\in\Theta_0$ versus $H_1:\theta\in\Theta_1$, we have two actions open to us, namely,
$$a_0:\ \text{accept }H_0;\qquad a_1:\ \text{accept }H_1.$$
As before, we shall write $\pi_0$ and $\pi_1$ for the prior probabilities of $H_0$ and $H_1$ and $p_0$ and $p_1$ for their posterior probabilities and
$$B=\frac{p_0/p_1}{\pi_0/\pi_1}$$
for the Bayes factor. We also need the notation
$$\rho_0(\theta)=p(\theta)/\pi_0\quad(\theta\in\Theta_0)
\qquad\text{and}\qquad
\rho_1(\theta)=p(\theta)/\pi_1\quad(\theta\in\Theta_1),$$
where $p(\theta)$ is the prior density function.
Now let us suppose that there is a loss function $L(\theta,a)$ defined by
$$L(\theta,a_0)=\begin{cases}0&\theta\in\Theta_0\\ 1&\theta\in\Theta_1\end{cases}
\qquad
L(\theta,a_1)=\begin{cases}1&\theta\in\Theta_0\\ 0&\theta\in\Theta_1\end{cases}$$
so that the use of a decision rule d(x) results in a posterior expected loss function
$$\rho(a_0,x)=p_1\qquad\text{and}\qquad\rho(a_1,x)=p_0,$$
so that a decision d(x) which minimizes the posterior expected loss is just a decision to accept the hypothesis with the greater posterior probability, which is the way of choosing between hypotheses suggested when hypothesis testing was first introduced.
More generally, if there is a ‘0–$K_i$’ loss function, that is,
$$L(\theta,a_0)=\begin{cases}0&\theta\in\Theta_0\\ K_0&\theta\in\Theta_1\end{cases}
\qquad
L(\theta,a_1)=\begin{cases}K_1&\theta\in\Theta_0\\ 0&\theta\in\Theta_1\end{cases}$$
then the posterior expected losses of the two actions are
$$\rho(a_0,x)=K_0\,p_1\qquad\text{and}\qquad\rho(a_1,x)=K_1\,p_0,$$
so that a Bayes decision rule results in rejecting the null hypothesis, that is, in taking action $a_1$, if and only if $K_1p_0<K_0p_1$, that is,
$$p_0<\frac{K_0}{K_0+K_1}.$$
In the terminology of classical statistics, this corresponds to the use of a rejection region
$$R=\left\{x:\ p_0<\frac{K_0}{K_0+K_1}\right\}.$$
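The resulting rule is trivial to implement. In the sketch below (the posterior probability and the losses are invented), $K_0/(K_0+K_1)=0.05$ plays a role analogous to a significance level:

```python
# Sketch of the '0-K_i' decision rule: take a1 (reject H0) exactly when
# K1*p0 < K0*p1, i.e. when p0 < K0/(K0 + K1).
def decide(p0, K0=1.0, K1=19.0):
    """Return the action minimizing posterior expected loss."""
    return 'a1: reject H0' if p0 < K0/(K0 + K1) else 'a0: accept H0'

print(decide(0.03))   # 0.03 < 1/20 = 0.05, so reject
print(decide(0.10))   # accept
```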
When hypothesis testing was first introduced in Section 4.1, we noted that in the case where $H_0$ and $H_1$ are simple hypotheses, so that $\Theta_0=\{\theta_0\}$ and $\Theta_1=\{\theta_1\}$, Bayes' theorem implies that
$$\frac{p_0}{p_1}=\frac{\pi_0}{\pi_1}\cdot\frac{p(x\mid\theta_0)}{p(x\mid\theta_1)}.$$