
Bayesian Statistics (4th ed)


by Peter M Lee


  2. Find a suitable interval of 90% posterior probability to quote in a case when your posterior distribution for an unknown parameter π is Be(20, 12), and compare this interval with similar intervals for the cases of Be(20.5, 12.5) and Be(21, 13) posteriors. Comment on the relevance of the results to the choice of a reference prior for the binomial distribution.

  3. Suppose that your prior beliefs about the probability π of success in Bernoulli trials have mean 1/3 and variance 1/32. Give a 95% posterior HDR for π given that you have observed 8 successes in 20 trials.

  4. Suppose that you have a prior distribution for the probability π of success in a certain kind of gambling game which has mean 0.4, and that you regard your prior information as equivalent to 12 trials. You then play the game 25 times and win 12 times. What is your posterior distribution for π?

  5. Suppose that you are interested in the proportion of females in a certain organization and that as a first step in your investigation you intend to find out the sex of the first 11 members on the membership list. Before doing so, you have prior beliefs which you regard as equivalent to 25% of this data, and your prior beliefs suggest that a third of the membership is female.

  Suggest a suitable prior distribution and find its standard deviation.

  Suppose that 3 of the first 11 members turn out to be female; find your posterior distribution and give a 50% posterior HDR for this distribution.

  Find the mean, median and mode of the posterior distribution.

  Would it surprise you to learn that in fact 86 of the total number of 433 members are female?

  6. Show that if then

  Deduce that if has a negative binomial distribution of index n and parameter π and z=g(x) then and . What does this suggest as a reference prior for π?

  7. The following data were collected by von Bortkiewicz (1898) on the number of men killed by horses in certain Prussian army corps in twenty years, the unit being one army corps for one year:

  Give an interval in which the mean number λ of such deaths in a particular army corps in a particular year lies with 95% probability.

  8. Recalculate the answer to the previous question assuming that you had a prior distribution for λ of mean 0.66 and standard deviation 0.115.

  9. Find the Jeffreys prior for the parameter α of the Maxwell distribution

  and find a transformation of this parameter in which the corresponding prior is uniform.

  10. Use the two-dimensional version of Jeffreys’ rule to determine a prior for the trinomial distribution

  (cf. Exercise 15 in Chapter 2).

  11. Suppose that x has a Pareto distribution, where is known but γ is unknown, that is,

  Use Jeffreys’ rule to find a suitable reference prior for γ.

  12. Consider a uniform distribution on the interval (α, β), where the values of α and β are unknown, and suppose that the joint distribution of α and β is a bilateral bivariate Pareto distribution with . How large a random sample must be taken from the uniform distribution in order that the coefficient of variation (that is, the standard deviation divided by the mean) of the length of the interval should be reduced to 0.01 or less?

  13. Suppose that observations are available from a density

  Explain how you would make inferences about the parameter θ using a conjugate prior.

  14. What could you conclude if you observed two tramcars numbered, say, 71 and 100?

  15. In Section 3.8, we discussed Newcomb’s observation that the front pages of a well-used table of logarithms tend to get dirtier than the back pages do. What if we had an antilogarithm table, that is, a table giving the value of x when log10 x is given? Which pages of such a table would be the dirtiest?

  16. We sometimes investigate distributions on a circle (e.g. von Mises’ distribution which is discussed in Section 3.9 on ‘The circular normal distribution’). Find a Haar prior for a location parameter on the circle (such as μ in the case of von Mises’ distribution).

  17. Suppose that the prior distribution for the parameters μ and σ of a Cauchy distribution

  is uniform in μ and σ, and that two observations x1=2 and x2=6 are available from this distribution. Calculate the value of the posterior density (ignoring the factor ) to two decimal places for and . Use Simpson’s rule to approximate the posterior marginal density of μ, and hence go on to find an approximation to the posterior probability that .

  18. Show that if the log-likelihood is a concave function of θ for each scalar x (that is, for all θ), then the likelihood function for θ given an n-sample has a unique maximum. Prove that this is the case if the observations xi come from a logistic density

  where θ is an unknown real parameter. Fill in the details of the Newton–Raphson method and the method of scoring for finding the position of the maximum, and suggest a suitable starting point for the algorithms.

  [In many applications of Gibbs sampling, which we consider later in Section 9.4, all full conditional densities are log-concave (see Gilks et al., 1996, Section 5.3.3), so the study of such densities is of real interest.]

  19. Show that if an experiment consists of two observations, then the total information it provides is the information provided by one observation plus the mean amount provided by the second given the first.

  20. Find the entropy of a (negative) exponential distribution with density .

  1. This problem reappeared as the German tank problem; see Spencer and Largey (1993).

  2. Sometimes denoted DKL(1||2) or KL(1||2).

  4

  Hypothesis testing

  4.1 Hypothesis testing

  4.1.1 Introduction

  If preferred, the reader may begin with the example at the end of this section, then return to the general theory at the beginning.

  4.1.2 Classical hypothesis testing

  Most simple problems in which tests of hypotheses arise are of the following general form. There is one unknown parameter θ which is known to belong to a set Θ, and you want to know whether θ ∈ Θ0 or θ ∈ Θ1, where

  Θ0 ∪ Θ1 = Θ and Θ0 ∩ Θ1 = ∅.

  Usually, you are able to make use of a set of observations whose density depends on θ. It is convenient to denote the set of all possible observations by .

  In the language of classical statistics, it is usual to refer to

  H0: θ ∈ Θ0 as the null hypothesis

  and to

  H1: θ ∈ Θ1 as the alternative hypothesis,

  and to say that if you decide to reject H0 when it is true then you have made a Type I error, while if you decide not to reject H0 when it is false then you have made a Type II error.

  A test is decided by a rejection region R where

  R = {x : observing x would lead you to reject H0}.

  Classical statisticians then say that decisions between tests should be based on the probabilities of Type I errors, that is,

  P(x ∈ R | θ) for θ ∈ Θ0,

  and of Type II errors, that is,

  P(x ∉ R | θ) for θ ∈ Θ1.

  In general, the smaller the probability of Type I error, the larger the probability of Type II error and vice versa. Consequently, classical statisticians recommend a choice of R which in some sense represents an optimal balance between the two types of errors. Very often R is chosen so that the probability of a Type II error is as small as possible subject to the requirement that the probability of a Type I error is always less than or equal to some fixed value α known as the size of the test. This theory, which is largely due to Neyman and Pearson, is to be found in most books on statistical inference and in its fullest form in Lehmann (1986).
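
  To make the classical trade-off concrete, here is a minimal sketch (not from the book) that computes the two error probabilities for a one-sided test of a normal mean with known variance; the particular hypotheses, critical value and single observation are illustrative assumptions only.

```python
# Illustrative sketch: error probabilities for a test of H0: theta = 0 against
# H1: theta = 1 based on a single observation x ~ N(theta, 1), with rejection
# region R = {x > c}.  All numerical values are hypothetical.
from scipy.stats import norm

c = 1.6449  # critical value chosen so that the size is about 0.05

alpha = 1 - norm.cdf(c, loc=0, scale=1)  # P(Type I error) = P(x in R | theta = 0)
beta = norm.cdf(c, loc=1, scale=1)       # P(Type II error) = P(x not in R | theta = 1)
print(f"size alpha = {alpha:.3f}, Type II error probability = {beta:.3f}")
```

  Raising c reduces the size but increases the probability of a Type II error, which is the trade-off described above.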

  4.1.3 Difficulties with the classical approach

  Other points will be made later about the comparison between the classical and the Bayesian approaches, but one thing to note at the outset is that, in the classical approach, we consider the probability (for various values of θ) of a set R to which the vector x of observations does, or does not, belong. Consequently, we are concerned not merely with the single vector of observations we actually made but also with others we might have made but did not. Thus, classically, if we suppose that x ~ N(θ, 1) and we wish to test whether θ = 0 or θ > 0 is true (negative values being supposed impossible), then we reject H0 on the basis of a single observation x = 3 because the probability that an N(0, 1) random variable is 3 or greater is 0.001 350, even though we certainly did not make an observation greater than 3. This aspect of the classical approach led Jeffreys (1961, Section 7.2) to remark:

  What the use of P implies, therefore, is that a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred.

  Note, however, that the form of the model, in this case the assumption of normally distributed observations of unit variance, does depend on an assumption about the whole distribution of all possible observations.

  4.1.4 The Bayesian approach

  The Bayesian approach is in many ways more straightforward. All we need to do is to calculate the posterior probabilities

  p0 = P(θ ∈ Θ0 | x) and p1 = P(θ ∈ Θ1 | x)

  and decide between H0 and H1 accordingly. (We note that p0 + p1 = 1 as Θ0 ∪ Θ1 = Θ and Θ0 ∩ Θ1 = ∅.)

  Although posterior probabilities of hypotheses are our ultimate goal, we also need prior probabilities

  π0 = P(θ ∈ Θ0) and π1 = P(θ ∈ Θ1)

  to find them. (We note that π0 + π1 = 1 just as p0 + p1 = 1.) It is also useful to consider the prior odds on H0 against H1, namely

  π0/π1,

  and the posterior odds on H0 against H1, namely

  p0/p1.

  (The notion of odds was originally introduced in the very first section of this book.) Observe that if your prior odds are close to 1, then you regard H0 as more or less as likely as H1 a priori, while if the ratio is large you regard H0 as relatively likely and when it is small you regard it as relatively unlikely. Similar remarks apply to the interpretation of the posterior odds.

  It is also useful to define the Bayes factor B in favour of H0 against H1 as the ratio of the posterior odds to the prior odds, that is,

  B = (p0/p1)/(π0/π1) = (p0 π1)/(p1 π0).

  The interest in the Bayes factor is that it can sometimes be interpreted as the ‘odds in favour of H0 against H1 that are given by the data’. It is worth noting that because π1 = 1 − π0 and p1 = 1 − p0, we can find the posterior probability p0 of H0 from its prior probability π0 and the Bayes factor by

  p0 = 1/{1 + (1 − π0)/(π0 B)}.

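  The relation between prior probability, Bayes factor and posterior probability can be checked with a small sketch; the helper below is not from the book, just a direct transcription of the odds calculation above.

```python
# Minimal sketch: obtain the posterior probability of H0 from its prior
# probability pi0 and a Bayes factor B, via posterior odds = B * prior odds.
def posterior_probability(pi0, bayes_factor):
    prior_odds = pi0 / (1 - pi0)
    posterior_odds = bayes_factor * prior_odds
    return posterior_odds / (1 + posterior_odds)

# Example: prior probability 1/2 (prior odds 1) and B = 3 give p0 = 0.75.
print(posterior_probability(0.5, 3.0))
```
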
  The aforementioned interpretation is clearly valid when the hypotheses are simple, that is,

  H0: θ = θ0 and H1: θ = θ1

  for some θ0 and θ1. For if so, then p0 ∝ π0 p(x|θ0) and p1 ∝ π1 p(x|θ1), so that

  p0/p1 = (π0/π1) p(x|θ0)/p(x|θ1),

  and hence, the Bayes factor is

  B = p(x|θ0)/p(x|θ1).

  It follows that B is the likelihood ratio of H0 against H1 which most statisticians (whether Bayesian or not) view as the odds in favour of H0 against H1 that are given by the data.
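
  As a small illustration (the numbers are hypothetical, not taken from the book), the Bayes factor for two simple hypotheses about a normal mean is just the ratio of the two likelihoods at the observed value:

```python
# Illustrative sketch: likelihood-ratio Bayes factor for two simple hypotheses
# H0: theta = theta0 and H1: theta = theta1, one observation x ~ N(theta, sigma^2).
from scipy.stats import norm

x = 1.0                      # the observed value (hypothetical)
theta0, theta1 = 0.0, 2.0    # the two simple hypotheses (hypothetical)
sigma = 1.0                  # known standard deviation

B = norm.pdf(x, loc=theta0, scale=sigma) / norm.pdf(x, loc=theta1, scale=sigma)
print(f"Bayes factor B = {B:.3f}")  # equals 1 here, as x is equidistant from both means
```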

  However, the interpretation is not quite as simple when H0 and H1 are composite, that is, contain more than one member. In such a case, it is convenient to write

  ρ0(θ) = p(θ)/π0 for θ ∈ Θ0

  and

  ρ1(θ) = p(θ)/π1 for θ ∈ Θ1,

  where p(θ) is the prior density of θ, so that ρ0 is the restriction of p(θ) to Θ0 renormalized to give a probability density over Θ0, and similarly for ρ1. We then have

  p0 = P(θ ∈ Θ0 | x) ∝ ∫_{Θ0} p(x|θ) p(θ) dθ = π0 ∫_{Θ0} p(x|θ) ρ0(θ) dθ,

  the constant of proportionality depending solely on x. Similarly,

  p1 ∝ π1 ∫_{Θ1} p(x|θ) ρ1(θ) dθ,

  and hence, the Bayes factor is

  B = [∫_{Θ0} p(x|θ) ρ0(θ) dθ] / [∫_{Θ1} p(x|θ) ρ1(θ) dθ],

  which is the ratio of ‘weighted’ (by ρ0 and ρ1) likelihoods of Θ0 and Θ1.

  Because this expression for the Bayes factor involves ρ0 and ρ1 as well as the likelihood function itself, the Bayes factor cannot be regarded as a measure of the relative support for the hypotheses provided solely by the data. Sometimes, however, B will be relatively little affected within reasonable limits by the choice of ρ0 and ρ1, and then we can regard B as a measure of relative support for the hypotheses provided by the data. When this is so, the Bayes factor is reasonably objective and might, for example, be included in a scientific report, so that different users of the data could determine their personal posterior odds by multiplying their personal prior odds by the factor.
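
  A short sketch may help to show how the weighted likelihoods can be evaluated in practice; the normal model, the N(0, 2²) prior and the single observation below are illustrative assumptions, not an example from the book.

```python
# Illustrative sketch: Bayes factor for the composite hypotheses
# H0: theta <= 0 versus H1: theta > 0 with one observation x ~ N(theta, 1),
# computed as the ratio of likelihoods weighted by rho0 and rho1 (the prior
# density restricted to each hypothesis and renormalized).
from scipy.integrate import quad
from scipy.stats import norm

x, sigma = 0.5, 1.0                 # hypothetical observation and known sd
prior = norm(loc=0.0, scale=2.0)    # hypothetical overall prior for theta

pi0 = prior.cdf(0.0)                # prior probability of H0
pi1 = 1.0 - pi0

def likelihood(theta):
    return norm.pdf(x, loc=theta, scale=sigma)

num, _ = quad(lambda t: likelihood(t) * prior.pdf(t) / pi0, -20.0, 0.0)  # weighted likelihood under H0
den, _ = quad(lambda t: likelihood(t) * prior.pdf(t) / pi1, 0.0, 20.0)   # weighted likelihood under H1
print(f"Bayes factor B = {num / den:.3f}")
```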

  It may be noted that the Bayes factor is referred to by a few authors simply as the factor. Jeffreys (1961) denoted it by K, but did not give it a name. A number of authors, most notably Peirce (1878) and (independently) Good (1950, 1983 and elsewhere), refer to the logarithm of the Bayes factor as the weight of evidence. The point of taking the logarithm is, of course, that if you have several experiments about two simple hypotheses, then the Bayes factors multiply, and so the weight of evidence adds.

  4.1.5 Example

  According to Watkins (1986, Section 13.3), the electroweak theory predicted the existence of a new particle, the W particle, of a mass m of GeV. Experimental results showed that such a particle existed and had a mass of GeV. If we take the mass to have a normal prior and likelihood and assume that the values after the ± signs represent known standard deviations, and if we are prepared to take both the theory and the experiment into account, then we can conclude that the posterior for the mass is where

  (following the procedure of Section 2.2 on ‘Normal Prior and Likelihood’). Suppose that for some reason it was important to know whether or not this mass was less than 83.0 GeV. Then, since the prior distribution is N(82.4, 1.1²), the prior probability of this hypothesis is given by

  P(m < 83.0) = Φ((83.0 − 82.4)/1.1) = Φ(0.545),

  where Φ is the distribution function of the standard normal distribution. From tables of the normal distribution, it follows that this probability is about 0.71, so that the prior odds are approximately

  0.71/(1 − 0.71) ≈ 2.4.

  Similarly, the posterior probability of the hypothesis that m < 83.0, and hence the posterior odds, can be found in the same way from the posterior distribution. The Bayes factor is then the ratio of the posterior odds to the prior odds.

  In this case, the experiment has not much altered beliefs about the hypothesis under discussion, and this is represented by the nearness of B to 1.
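
  The whole calculation can be reproduced along the following lines. The prior N(82.4, 1.1²) is the one stated above; the experimental mean and standard deviation are written as placeholders because the measured figures are not reproduced in this excerpt, so the printed numbers are purely illustrative.

```python
# Sketch of the normal prior / normal likelihood calculation of Section 2.2,
# followed by prior and posterior odds for the hypothesis m < 83.0.
from scipy.stats import norm

prior_mean, prior_sd = 82.4, 1.1    # from the text
exp_mean, exp_sd = 82.0, 1.5        # placeholders for the experimental result

# Posterior precision is the sum of the two precisions; the posterior mean is
# the precision-weighted average of the prior and experimental means.
post_prec = 1 / prior_sd**2 + 1 / exp_sd**2
post_mean = (prior_mean / prior_sd**2 + exp_mean / exp_sd**2) / post_prec
post_sd = post_prec ** -0.5

cut = 83.0
p_prior = norm.cdf(cut, prior_mean, prior_sd)   # prior probability that m < 83.0
p_post = norm.cdf(cut, post_mean, post_sd)      # posterior probability that m < 83.0

prior_odds = p_prior / (1 - p_prior)
posterior_odds = p_post / (1 - p_post)
print(f"Bayes factor B = {posterior_odds / prior_odds:.2f}")
```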

  4.1.6 Comment

  A point about hypothesis tests well worth making is that they ‘are traditionally used as a method for testing between two terminal acts [but that] in actual practice [they] are far more commonly used [when we are] given the outcome of a sample [to decide whether] any final or terminal decision [should] be reached or should judgement be suspended until more sample evidence is available’ (Schlaifer, 1961, Section 13.2).

  4.2 One-sided hypothesis tests

  4.2.1 Definition

  A hypothesis testing situation of the type described in Section 4.1 is said to be one-sided if the set Θ of possible values of the parameter θ is the set of real numbers or a subset of it and either

  Θ0 = {θ : θ ≤ θ0} and Θ1 = {θ : θ > θ0}

  or

  Θ0 = {θ : θ ≥ θ0} and Θ1 = {θ : θ < θ0}

  for some θ0.

  From the Bayesian point of view, there is nothing particularly special about this situation. The interesting point is that this is one of the few situations in which classical results, and in particular the use of P-values, have a Bayesian justification.

  4.2.2 P-values

  This is one of the places where it helps to use the ‘tilde’ notation to emphasize which quantities are random. If x̃ ~ N(θ, σ²) where σ² is known and the reference prior p(θ) ∝ 1 is used, then the posterior distribution of θ given x is N(x, σ²). Consider now the situation in which we wish to test H0: θ ≤ θ0 versus H1: θ > θ0. Then, if we observe that x̃ = x, we have a posterior probability

  p0 = P(θ ≤ θ0 | x) = Φ((θ0 − x)/σ).

  Now the classical P-value (sometimes called the exact significance level) against H0 is defined as the probability, when θ = θ0, of observing an x̃ ‘at least as extreme’ as the actual data x and so is

  P(x̃ ≥ x | θ = θ0) = 1 − Φ((x − θ0)/σ) = Φ((θ0 − x)/σ),

  which is the same as the posterior probability p0.

  For example, if we observe a value of x which is 1.5 standard deviations above θ0, then a Bayesian using the reference prior would conclude that the posterior probability of the null hypothesis is Φ(−1.5) = 0.0668, whereas a classical statistician would report a P-value of 0.0668. Of course p1 = 1 − p0 = 1 − P-value, so the posterior odds are

  p0/p1 = P-value/(1 − P-value).

  In such a case, the prior distribution could perhaps be said to imply prior odds of 1 (but beware! – this comes from the use of an improper reference prior), and so we get a Bayes factor of

  B = P-value/(1 − P-value),

  implying that

  p0 = B/(1 + B) = P-value.

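  The numerical agreement claimed above is easy to verify; the sketch below assumes the same reference-prior, known-variance normal setting and is not code from the book.

```python
# Sketch: with a uniform reference prior and x ~ N(theta, sigma^2), the posterior
# probability of H0: theta <= theta0 equals the classical one-sided P-value.
# Here x is taken to be 1.5 standard deviations above theta0.
from scipy.stats import norm

theta0, sigma = 0.0, 1.0
x = theta0 + 1.5 * sigma

p0 = norm.cdf((theta0 - x) / sigma)            # posterior P(theta <= theta0 | x)
p_value = 1 - norm.cdf((x - theta0) / sigma)   # classical one-sided P-value
print(f"p0 = {p0:.4f}, P-value = {p_value:.4f}")  # both are 0.0668
```
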
  On the other hand, the classical probabilities of Type I and Type II errors do not have any close correspondence to the probabilities of hypotheses, and to that extent the increasing tendency of classical statisticians to quote P-values rather than just the probabilities of Type I and Type II errors is to be welcomed, even though a full Bayesian analysis would be better.

  A partial interpretation of the traditional use of the probability of a Type I error (sometimes called a significance level) is as follows. A result is significant at level α if and only if the P-value is less than or equal to α, and hence if and only if the posterior probability

  p0 ≤ α,

  or equivalently

  p1 ≥ 1 − α.
  4.3 Lindley’s method

  4.3.1 A compromise with classical statistics

  The following method appears first to have been suggested by Lindley (1965, Section 5.6), and has since been advocated by a few other authors, for example, Zellner (1971, Section 10.2; 1974, Section 3.7).

  Suppose, as is common in classical statistics, that you wish to conduct a test of a point (or sharp) null hypothesis

  H0: θ = θ0 against the alternative H1: θ ≠ θ0.

  Suppose further that your prior knowledge is vague or diffuse, so that you have no particular reason to believe that θ = θ0 rather than that θ = θ1, where θ1 is any value in the neighbourhood of θ0.

  The suggested procedure depends on finding the posterior distribution of θ using a reference prior. To conduct a significance test at level α, it is then suggested that you find a 100(1 − α)% highest density region (HDR) from the posterior distribution and reject H0 if and only if θ0 is outside this HDR.
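
  A minimal sketch of the procedure is given below. It assumes, purely for illustration, a normal posterior for θ (the example that follows uses a posterior for a variance instead), for which the 100(1 − α)% HDR is the usual symmetric interval.

```python
# Sketch of Lindley's method for a normal posterior: reject H0: theta = theta0
# at level alpha if theta0 falls outside the 100(1 - alpha)% HDR.
from scipy.stats import norm

def lindley_test(theta0, post_mean, post_sd, alpha=0.05):
    lo, hi = norm.interval(1 - alpha, loc=post_mean, scale=post_sd)  # HDR of a normal posterior
    return not (lo <= theta0 <= hi)   # True means "reject H0 at level alpha"

# Hypothetical posterior N(1.0, 0.4**2): theta0 = 0 lies outside the 95% HDR.
print(lindley_test(0.0, post_mean=1.0, post_sd=0.4))
```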

  4.3.2 Example

  With the data on the uterine weight of rats which you met in Section 2.8 on ‘HDRs for the normal variance’, we found the posterior distribution of the variance to be

  so that an interval for the variance corresponding to a 95% HDR is (19, 67). Consequently, on the basis of the data, you should reject at the 5% level a null hypothesis that specifies a value of the variance outside this interval, but you should not reject a null hypothesis that specifies a value inside it.

  4.3.3 Discussion

  This procedure is appropriate only when prior information is vague or diffuse and even then it is not often the best way of summarizing posterior beliefs; clearly the significance level is a very incomplete expression of these beliefs. For many problems, including the one considered in the above example, I think that this method is to be seen as mainly of historical interest in that it gave a way of arriving at results related to those in classical statistics and thus helped to wean statisticians brought up on these methods towards the Bayesian approach as one which can get results like these as special cases, as well as having its own distinctive conclusions. However, it can have a use in situations where there are several unknown parameters and the complete posterior is difficult to describe or take in. Thus, when we come to consider the analysis of variance in Sections 6.5 and 6.6, we shall use the significance level as described in this section to give some idea of the size of the treatment effect.

 
