5.3.3 Example
Yet again we shall consider the data on the weight growth of rats as in Sections 5.1 and 5.2. Recall that m = 12, n = 7 (so νx = 11, νy = 6), x̄ = 120, ȳ = 101, Sx = 5032, Sy = 2552, and hence sx² = Sx/νx = 457, sy² = Sy/νy = 425. Therefore,

    sx²/m = 38.1,   sy²/n = 60.7,   tan θ = (sx/√m)/(sy/√n) = √(38.1/60.7) ≅ 0.8,

so that θ ≅ 39° = 0.68 radians using rounded values, and thus

    T = {δ − (x̄ − ȳ)}/√(sx²/m + sy²/n) ~ BF(11, 6, 39°).

From the tables in the Appendix the 95% point of the Behrens distribution at the neighbouring tabulated values of the parameters is 1.91 on one side and 1.88 on the other, so the 95% point of BF(11, 6, 39°) must be about 1.89. [The program in Appendix C gives hbehrens(0.9,11,6,39) as the interval (–1.882742, 1.882742).] Consequently a 90% HDR for δ is given by

    x̄ − ȳ ± 1.89√(sx²/m + sy²/n) = 19 ± 1.89 × 9.9

and so is 19 ± 19, that is, (0, 38). This is slightly wider than was obtained in the previous section, as is reasonable, because we have made fewer assumptions and can only expect to get less precise conclusions.
The same result can be obtained directly from Patil’s approximation. The required numbers turn out to be f1 = 1.39, f2 = 0.44, b = 8.39, a = 1.03, so that T is approximately distributed as a times a t variable on b degrees of freedom, that is, as 1.03 t8.39. Interpolating between the 95% percentage points for t8 and t9 (which are 1.860 and 1.833, respectively), the required percentage point for t8.39 is 1.849, and hence a 90% HDR for δ is approximately

    19 ± 1.03 × 1.849 × 9.9 = 19 ± 18.9,

giving a very similar answer to that obtained from the tables. [The program in Appendix C gives this probability as ….]
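These calculations are easy to reproduce. The R sketch below is ours rather than part of Appendix C; the moment formulas for f1, f2, b and a are the standard ones for Patil’s approximation (only their numerical values appear in the text), and small discrepancies from the rounded values quoted above come from rounding θ. Here 19 is x̄ − ȳ.

```r
# Patil's approximation: if T ~ BF(nu1, nu2, theta) then T is roughly
# a * t_b, with b and a chosen by matching moments.
patil <- function(nu1, nu2, theta) {
  s2 <- sin(theta)^2; c2 <- cos(theta)^2
  f1 <- nu1 * s2 / (nu1 - 2) + nu2 * c2 / (nu2 - 2)
  f2 <- nu1^2 * s2^2 / ((nu1 - 2)^2 * (nu1 - 4)) +
        nu2^2 * c2^2 / ((nu2 - 2)^2 * (nu2 - 4))
  b  <- 4 + f1^2 / f2
  a  <- sqrt(f1 * (b - 2) / b)
  c(b = b, a = a)
}

m <- 12; n <- 7
sx2 <- 5032 / 11; sy2 <- 2552 / 6              # 457 and 425
se <- sqrt(sx2 / m + sy2 / n)                  # about 9.9
theta <- atan(sqrt(sx2 / m) / sqrt(sy2 / n))   # theta in radians (about 39 degrees)
p <- patil(m - 1, n - 1, theta)                # b close to 8.4, a close to 1.03
19 + c(-1, 1) * p["a"] * qt(0.95, p["b"]) * se # 90% HDR for delta, roughly (0, 38)
```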
Of course, it would need more extensive tables to find, for example, the posterior probability that δ exceeds some specified value, but there is no difficulty in principle in doing so. On the other hand, it would be quite complicated to find the Bayes factor for a test of a point null hypothesis such as δ = 0, and since such tests are only to be used with caution in special cases, it would not be likely to be worthwhile.
5.3.4 Substantial prior information
If we do happen to have substantial prior information about the parameters which can reasonably well be approximated by independent normal/chi-squared distributions for (λ, φ) and (μ, ψ), then the method of this section can usually be extended to include it. All that will happen is that Tx and Ty will be replaced by slightly different quantities with independent t distributions, derived as in Section 2.12 on ‘Normal mean and variance both unknown’. It should be fairly clear how to carry out the details, so no more will be said about this case.
5.4 The Behrens–Fisher controversy
5.4.1 The Behrens–Fisher problem from a classical standpoint
As pointed out in Section 2.6 on ‘Highest Density Regions’, in the case of a single normal observation x of known variance φ there is a close relationship between classical results and Bayesian results using a reference prior, which can be summarized in terms of the ‘tilde’ notation by saying that, in classical statistics, results depend on saying that

    x̃ ~ N(θ, φ)

while Bayesian results depend on saying that

    θ̃ ~ N(x, φ).

As a result of this, if φ = 1 then the observation x = 5, say, leads to the same interval, 5 ± 1.96, which is regarded as a 95% confidence interval for θ by classical statisticians and as a 95% HDR for θ by Bayesians (at least if they are using a reference prior). It is not hard to see that very similar relationships exist if we have a sample of size n and replace x by x̄ (and φ by φ/n), and also when the variance is unknown (provided that the normal distribution is replaced by the t distribution).
There is also no great difficulty in dealing with the case of a two sample problem in which the variances are known. If they are unknown but equal (i.e. φ = ψ), it was shown that if

    t = {δ − (x̄ − ȳ)}/{s√(1/m + 1/n)},   where s² = (Sx + Sy)/(νx + νy),

then the posterior distribution of t is Student’s t on νx + νy = m + n − 2 degrees of freedom. A classical statistician would say that this ‘pivotal quantity’ has the same distribution whatever the values of λ, μ and the common variance φ are, and so would be able to give confidence intervals for δ which were exactly the same as HDRs derived by a Bayesian statistician (always assuming that the latter used a reference prior).
This seems to suggest that there is always likely to be a way of interpreting classical results in Bayesian terms and vice versa, provided that a suitable prior distribution is used. One of the interesting aspects of the Behrens–Fisher problem is that no such correspondence exists in this case. To see why, recall that the Bayesian analysis led us to conclude that

    T ~ BF(νx, νy, θ)

where

    T = {δ − (x̄ − ȳ)}/√(sx²/m + sy²/n),   tan θ = (sx/√m)/(sy/√n).

Moreover, changing the prior inside the conjugate family would only alter the parameters slightly, but would still give results of the same general character. So if there is to be a classical analogue to the Bayesian result, then if T is regarded as a function of the data for fixed values of the parameters (λ, μ, φ, ψ), it must have Behrens’ distribution over repeated samples x and y. There is an obvious difficulty in this, in that the parameter θ depends on the samples, whereas there is no such parameter in the normal or t distributions. However, it is still possible to investigate whether the sampling distribution of T depends on the parameters (λ, μ, φ, ψ).
It turns out that its distribution over repeated sampling does not just depend on the sample sizes m and n – it also depends on the ratio ψ/φ (which is not, in general, known to the statistician). It is easiest to see this when m = n and so νx = νy = ν (say). We first suppose that (unknown to the statistician) it is in fact the case that φ = ψ. Then the sampling distribution found in Section 5.2, for the case where the statistician did happen to know that φ = ψ, must still hold (his or her ignorance can scarcely affect what happens in repeated sampling). Because if m = n then

    √(sx²/m + sy²/n) = s√(1/m + 1/n)

in the notation of Section 5.2, it follows that T = t, and hence that over repeated samples T has a t distribution on 2ν degrees of freedom.
On the other hand, if ψ = 0, then necessarily Sy = 0 and so sy² = 0, and hence θ = 90°. Since it must also be the case that ȳ = μ and so δ − (x̄ − ȳ) = λ − x̄, the distribution of T is given by

    T = (λ − x̄)/(sx/√m) ~ tν,

that is, T has a t distribution on ν degrees of freedom. For intermediate values of ψ/φ the distribution over repeated samples is intermediate between these forms (but is not, in general, a t distribution).
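The dependence on ψ/φ is easy to see by simulation. The following R sketch (ours, not from Appendix C) draws repeated samples with m = n = 7 and compares the upper 5% point of T when ψ = φ with that obtained when ψ is negligible:

```r
# Sampling distribution of T = {delta - (xbar - ybar)}/sqrt(sx2/m + sy2/n);
# here the true delta is 0, so T reduces to -(xbar - ybar)/se.
sim.T <- function(psi, phi = 1, m = 7, n = 7, N = 100000) {
  replicate(N, {
    x <- rnorm(m, 0, sqrt(phi)); y <- rnorm(n, 0, sqrt(psi))
    -(mean(x) - mean(y)) / sqrt(var(x)/m + var(y)/n)
  })
}
set.seed(1)
quantile(sim.T(psi = 1), 0.95);    qt(0.95, 12)  # psi = phi: close to t on 2*nu df
quantile(sim.T(psi = 1e-8), 0.95); qt(0.95, 6)   # psi near 0: close to t on nu df
```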
5.4.2 Example
Bartlett (1936) quotes an experiment in which the yields xi (in pounds per acre) on m plots for early hay were compared with the yields yi for ordinary hay on another n plots. It turned out that m = n = 7 (so νx = νy = 6), x̄ − ȳ = 39.8, sx² = 308.6, sy² = 1251.3. It follows that √(sx²/m + sy²/n) = 14.9 and tan θ = (sx/√m)/(sy/√n) = 0.50, so that θ ≅ 26° = 0.46 radians. The Bayesian analysis now proceeds by saying that T ~ BF(6, 6, 26°). By interpolation in tables of Behrens’ distribution a 50% HDR for δ is 39.8 ± 0.75 × 14.9, that is, (28.6, 51.0). [Using the program in Appendix C we find that hbehrens(0.5,6,6,26) is the interval (–0.7452094, 0.7452094).]
A classical statistician who was willing to assume that φ = ψ would use tables of t12 to conclude that a 50% confidence interval was 39.8 ± 0.70 × 14.9, that is, (29.4, 50.2); note that the pooled standard error coincides with √(sx²/m + sy²/n) because m = n. This interval is different, although not very much so, from the Bayesian’s HDR. Without some assumption such as φ = ψ he or she would not be able to give any exact answer.
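Both intervals are easily checked in R; hbehrens is the function from Appendix C, and everything else is base R:

```r
sx2 <- 308.6; sy2 <- 1251.3; m <- 7; n <- 7; d <- 39.8
se <- sqrt(sx2/m + sy2/n)                 # 14.9; equals the pooled se here since m = n
atan(sqrt((sx2/m)/(sy2/n))) * 180/pi      # theta, about 26 degrees
# Bayesian 50% HDR, if the Appendix C functions have been loaded:
# d + hbehrens(0.5, 6, 6, 26) * se        # about (28.6, 51.0)
d + c(-1, 1) * qt(0.75, 12) * se          # classical interval, about (29.4, 50.2)
```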
5.4.3 The controversy
The Bayesian solution was championed by Fisher (1935, 1937, 1939), although Fisher had his own theory of fiducial inference, which does not have many adherents nowadays, and he did not in fact support the Bayesian arguments put forward here. In an introduction to Fisher (1939) reprinted in his Collected Papers, Fisher said that
Pearson and Neyman have laid it down axiomatically that the level of significance of a test must be equal to the frequency of a wrong decision ‘in repeated samples from the same population’. The idea was foreign to the development of tests of significance given by the author in [Statistical Methods for Research Workers], for the experimenter’s experience does not consist in repeated samples from the same population, although in simple cases the numerical values are often the same; and it was, I believe, this coincidence which misled Pearson and Neyman, who were not very familiar with the ideas of ‘Student’ and the author.
Although Fisher was not a Bayesian, the above quotation does express one of the objections which any Bayesian must have to classical tests of significance.
In practice, classical statisticians can at least give intervals which, while they may not have an exact significance level, have a significance level between two reasonably close bounds. A review of the problem is given by Robinson (1976, 1982).
5.5 Inferences concerning a variance ratio
5.5.1 Statement of the problem
In this section, we are concerned with data of the same form as we met in the Behrens–Fisher problem. Thus, we have independent vectors x = (x1, x2, …, xm) and y = (y1, y2, …, yn) such that

    xi ~ N(λ, φ),   yj ~ N(μ, ψ),

where all of λ, μ, φ and ψ are unknown. The difference is that in this case the quantity of interest is the ratio

    κ = φ/ψ

of the two unknown variances, so that the intention is to discover how much more (or less) variable the one population is than the other. We shall use the same notation as before, and in addition we will find it useful to define

    k = (Sx/νx)/(Sy/νy) = sx²/sy²   and   η = κ/k.
Again, we shall begin by assuming a reference prior

    p(λ, μ, φ, ψ) ∝ 1/φψ.
As was shown in Section 2.12 on ‘Normal mean and variance both unknown’, the posterior distributions of φ and ψ are independent and such that φ ~ Sx χνx⁻² and ψ ~ Sy χνy⁻², so that

    p(φ, ψ | x, y) ∝ φ^−(νx/2+1) exp(−Sx/2φ) ψ^−(νy/2+1) exp(−Sy/2ψ).

It turns out that η = κ/k has (Snedecor’s) F distribution on νy and νx degrees of freedom (or equivalently that its reciprocal k/κ has an F distribution on νx and νy degrees of freedom). The proof of this fact, which is not of great importance and can be omitted if you are prepared to take it for granted, is in Section 5.5.2.
The result is of the same type, although naturally the parameters are slightly different, if the priors for φ and ψ are from the conjugate family. Even if, by a fluke, we happened to know the means but not the variances, the only change would be an increase of 1 in each of the degrees of freedom.
5.5.2 Derivation of the F distribution
In order to find the distribution of κ, we need first to change variables from (φ, ψ) to (κ, ψ), noting that φ = κψ, so that the Jacobian is ∂φ/∂κ = ψ. It follows that

    p(κ, ψ | x, y) ∝ κ^−(νx/2+1) ψ^−{(νx+νy)/2+1} exp{−(Sx/κ + Sy)/2ψ}.

It is now easy enough to integrate ψ out by substituting z = (Sx/κ + Sy)/2ψ and thus reducing the integral to a standard gamma function integral (cf. Section 2.12 on ‘Normal mean and variance both unknown’). Hence,

    p(κ | x, y) ∝ κ^−(νx/2+1) (Sx/κ + Sy)^−(νx+νy)/2 ∝ κ^(νy/2−1) (Sx + Sy κ)^−(νx+νy)/2.

Defining k and η as above, and noting that k is constant, this density can be transformed to give

    p(η | x, y) ∝ η^(νy/2−1) (νx + νy η)^−(νx+νy)/2.

From Appendix A, it can be seen that this is an F distribution on νy and νx degrees of freedom, so that

    η = κ/k ~ Fνy,νx.

Note that by symmetry

    k/κ = 1/η ~ Fνx,νy.
For most purposes, it suffices to think of an F distribution as being, by definition, the distribution of the ratio of two chi-squared (or inverse chi-squared) variables divided by their respective degrees of freedom.
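The result is also easy to check by simulation. The short R sketch below (ours) draws φ and ψ from their inverse chi-squared posteriors, using for definiteness the values of νx, νy and k from the example that follows, and compares the simulated quantiles of η = κ/k with those of the F distribution:

```r
# Posterior check: phi ~ Sx inv-chi^2(nu.x), psi ~ Sy inv-chi^2(nu.y),
# so eta = (phi/psi)/k should be F(nu.y, nu.x), where k = (Sx/nu.x)/(Sy/nu.y).
set.seed(1)
nu.x <- 11; nu.y <- 7
Sx <- 19 * nu.x; Sy <- 1902 * nu.y          # scaled so that k = 19/1902
k <- (Sx/nu.x) / (Sy/nu.y)
phi <- Sx / rchisq(100000, nu.x)            # draws from Sx inv-chi^2(nu.x)
psi <- Sy / rchisq(100000, nu.y)
eta <- (phi/psi) / k
quantile(eta, c(0.05, 0.5, 0.95))           # close to ...
qf(c(0.05, 0.5, 0.95), nu.y, nu.x)          # ... the F(7, 11) quantiles
```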
5.5.3 Example
Jeffreys (1961, Section 5.4) quotes the following data (due to Lord Rayleigh) on the masses xi in grammes of m = 12 samples of nitrogen obtained from air (A) and the masses yi of n = 8 samples obtained by a chemical method (C) within a given container at standard temperature and pressure.
It turns out that, in suitable units, sx² = 19 and sy² = 1902 (the means are not needed for the variance ratio), so that k = 19/1902 = 0.010. Hence, the posterior of κ is such that

    κ/k ~ F7,11

or equivalently

    k/κ ~ F11,7.

This makes it possible to give an interval in which we can be reasonably sure that the ratio κ of the variances lies. For reasons similar to those for which we chose to use intervals corresponding to HDRs for log χ² in Section 2.8 on ‘HDRs for the normal variance’, it seems sensible to use intervals corresponding to HDRs for log F. From the tables in the Appendix, such an interval of probability 90% for F11,7 is (0.32, 3.46), so that κ lies in the interval from 0.010/3.46 to 0.010/0.32, that is (0.003, 0.031), with a posterior probability of 90%. Because the distribution is markedly asymmetric, it may also be worth finding the mode of κ, which (from the mode of an F distribution as given in Appendix A) is

    k {(νy − 2)/νy}{νx/(νx + 2)} = 0.010 × (5/7) × (11/13) ≅ 0.006.
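These numbers are easy to check in base R (the 90% interval computed below is the equal-tailed one, which is close to, though not identical with, the log-F HDR read from the tables):

```r
k <- 19/1902                        # k/kappa ~ F(11, 7), kappa/k ~ F(7, 11)
qf(c(0.05, 0.95), 11, 7)            # roughly (0.33, 3.60); the tabled HDR is (0.32, 3.46)
k / rev(qf(c(0.05, 0.95), 11, 7))   # interval for kappa, roughly (0.003, 0.03)
(5/7) * (11/13) * k                 # posterior mode of kappa, about 0.006
```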
5.6 Comparison of two proportions; the 2 × 2 table
5.6.1 Methods based on the log-odds ratio
In this section, we are concerned with another two sample problem, but this time one arising from the binomial rather than the normal distribution. Suppose that

    x ~ B(m, π)   and   y ~ B(n, ρ)

are independent, and that we are interested in the relationship between π and ρ. Another way of describing this situation is in terms of a 2 × 2 table (sometimes called a 2 × 2 contingency table):

                    Successes   Failures
    Population I        a           b
    Population II       c           d

so that a = x, b = m − x, c = y and d = n − y.
We shall suppose that the priors for π and ρ are such that π ~ Be(α0, β0) and ρ ~ Be(γ0, δ0), independently of one another. It follows that the posteriors are also beta distributions, and more precisely if

    α = α0 + x,   β = β0 + m − x,   γ = γ0 + y,   δ = δ0 + n − y,

then

    π ~ Be(α, β)   and   ρ ~ Be(γ, δ).
We recall from Section 3.1 on the binomial distribution that if

    π ~ Be(α, β)

then Λ = log{π/(1 − π)} is, apart from an additive constant, distributed as twice Fisher’s z on 2α and 2β degrees of freedom, so that

    EΛ ≅ log{(α − ½)/(β − ½)},   VΛ ≅ α⁻¹ + β⁻¹,

and similarly for Λ′ = log{ρ/(1 − ρ)}. Now the z distribution is approximately normal (this is the reason that Fisher preferred to use the z distribution rather than the F distribution, which is not so near to normality), and so Λ and Λ′ are approximately normal with these means and variances. Hence the log-odds ratio

    Λ − Λ′ = log{π/(1 − π)} − log{ρ/(1 − ρ)}

is also approximately normal, that is,

    Λ − Λ′ ~ N(log{(α − ½)/(β − ½)} − log{(γ − ½)/(δ − ½)}, α⁻¹ + β⁻¹ + γ⁻¹ + δ⁻¹),

or more approximately

    Λ − Λ′ ~ N(log(αδ/βγ), α⁻¹ + β⁻¹ + γ⁻¹ + δ⁻¹).
If the Haldane reference priors are used, so that π ~ Be(0, 0) and ρ ~ Be(0, 0), then α = x = a, β = m − x = b, γ = y = c, and δ = n − y = d, and so

    Λ − Λ′ ~ N(log(ad/bc), a⁻¹ + b⁻¹ + c⁻¹ + d⁻¹).
The quantity ad/bc is sometimes called the cross-ratio, and there are good grounds for saying that any measure of association in the 2 × 2 table should be a function of the cross-ratio (cf. Edwards, 1963).
The log-odds ratio is a sensible measure of the degree to which the two populations are identical, and in particular Λ − Λ′ = 0 if and only if π = ρ. On the other hand, knowledge of the posterior distribution of the log-odds ratio does not in itself imply knowledge of the posterior distribution of the difference π − ρ or the ratio π/ρ. The approximation is likely to be reasonable provided that all of the entries in the table are at least 5.
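The approximation takes only a few lines of R to code; the function name logodds.approx below is ours:

```r
# Approximate posterior of the log-odds ratio under Haldane priors:
# Lambda - Lambda' ~ N(log(ad/bc), 1/a + 1/b + 1/c + 1/d).
logodds.approx <- function(a, b, c, d, exact.mean = FALSE) {
  mu <- if (exact.mean)                       # the form with the 1/2s, or ...
    log((a - 0.5)/(b - 0.5)) - log((c - 0.5)/(d - 0.5))
  else log(a*d/(b*c))                         # ... the cruder log cross-ratio
  v <- 1/a + 1/b + 1/c + 1/d
  c(mean = mu, var = v,
    prob.pi.ge.rho = pnorm(0, mu, sqrt(v), lower.tail = FALSE))
}
```

For any particular table, logodds.approx(a, b, c, d) then returns the approximate posterior mean and variance of Λ − Λ′ together with the probability that π ⩾ ρ, as used in the example that follows.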
5.6.2 Example
The following table [quoted from Di Raimondo (1951)] relates to the effect on mice of a bacterial inoculum (Staphylococcus aureus). Two different types of injection were tried, a standard one and one with 0.15 U of penicillin per millilitre.
The cross-ratio is ad/bc = 0.86, so its logarithm is −0.150, and a⁻¹ + b⁻¹ + c⁻¹ + d⁻¹ = 0.245, and so the posterior distribution of the log-odds ratio is approximately

    Λ − Λ′ ~ N(−0.150, 0.245).

Allowing for the ½s in the more exact form for the mean does not make much difference; in fact −0.150 becomes −0.169. The posterior probability that π ⩾ ρ, that is, that the log-odds ratio is positive, is approximately

    Φ(−0.150/√0.245) = Φ(−0.30) = 0.38.

The data thus show no great difference between the injections with and without the penicillin.
5.6.3 The inverse root-sine transformation
In Section 1.5 on ‘Means and variances’, we saw that if x ~ B(m, π), then the transformation z = sin⁻¹√(x/m) resulted in Ez ≅ sin⁻¹√π and Vz ≅ 1/4m, and in fact it is approximately true that z ~ N(sin⁻¹√π, 1/4m). This transformation was also mentioned in Section 3.2 on ‘Reference prior for the binomial likelihood’, where it was pointed out that one of the possible reference priors for π was p(π) ∝ {π(1 − π)}^−½, and that this prior was equivalent to a uniform prior in ψ = sin⁻¹√π. Now if we use such a prior, then clearly the posterior for ψ is approximately N(z, 1/4m), that is,

    sin⁻¹√π ~ N(sin⁻¹√(x/m), 1/4m).

This is of no great use if there is only a single binomial variable, but when there are two it can be used to conclude that approximately

    sin⁻¹√π − sin⁻¹√ρ ~ N(sin⁻¹√(x/m) − sin⁻¹√(y/n), 1/4m + 1/4n)
and so to give another approximation to the probability that π ⩾ ρ. Thus, with the same data as above, we compute sin⁻¹√(x/m) and sin⁻¹√(y/n) in radians and note that 1/4m + 1/4n = 0.0148, so that the posterior probability that π ⩾ ρ follows by referring the standardized difference to tables of the normal distribution. The two methods do not give precisely the same answer, but it should be borne in mind that the numbers are not very large, so the approximations involved are not very good, and also that we have assumed slightly different reference priors in deriving the two answers.
If there is non-trivial prior information, it can be incorporated in this method as well as in the previous method. The approximations involved are reasonably accurate provided that x(1–x/m) and y(1–y/n) are both at least 5.
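As with the log-odds method, the calculation takes only a few lines of R; the function name arcsine.approx is ours:

```r
# Approximate posterior of sin^-1 sqrt(pi) - sin^-1 sqrt(rho) under
# priors uniform on the transformed scale (Be(1/2, 1/2) for pi and rho).
arcsine.approx <- function(x, m, y, n) {
  mu <- asin(sqrt(x/m)) - asin(sqrt(y/n))   # difference in radians
  v  <- 1/(4*m) + 1/(4*n)
  c(mean = mu, var = v,
    prob.pi.ge.rho = pnorm(0, mu, sqrt(v), lower.tail = FALSE))
}
```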
5.6.4 Other methods
If all the entries in the 2 × 2 table are at least 10, then the posterior beta distributions are reasonably well approximated by normal distributions with the same means and variances. This is quite useful in that it gives rise to an approximation to the distribution of π − ρ, which is much more likely to be of interest than some function of π minus the same function of ρ. It will therefore allow us to give an approximate HDR for π − ρ or to approximate the probability that π − ρ lies in a particular interval.
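A sketch of this normal approximation in R (the function name is ours), using the standard means and variances of the Be(α, β) and Be(γ, δ) posteriors:

```r
# Normal approximation to the posterior of pi - rho from the beta posteriors
diff.approx <- function(alpha, beta, gamma, delta) {
  m1 <- alpha/(alpha + beta)                                  # E(pi)
  v1 <- alpha*beta/((alpha + beta)^2 * (alpha + beta + 1))    # V(pi)
  m2 <- gamma/(gamma + delta)                                 # E(rho)
  v2 <- gamma*delta/((gamma + delta)^2 * (gamma + delta + 1)) # V(rho)
  c(mean = m1 - m2, var = v1 + v2)   # pi - rho is roughly N(m1 - m2, v1 + v2)
}
```

An approximate 90% HDR for π − ρ is then mean ± 1.645√var, and so on.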
In quite a different case, where the values of π and ρ are small, which will be reflected in small values of x/m and y/n, the binomial distributions can be reasonably well approximated by Poisson distributions, which means that the posteriors of π and ρ are multiples of chi-squared distributions (cf. Section 3.4 on ‘The Poisson distribution’). It follows from this that the posterior of π/ρ is a multiple of an F distribution (cf. Section 5.5). Again, this is quite useful because π/ρ is a quantity of interest in itself. The Poisson approximation to the binomial is likely to be reasonable if m > 10 and either x/m < 0.05 or x/m > 0.95 (in the latter case, π has to be replaced by 1 − π), and similarly for the second sample.
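To make this concrete, here is one way the calculation can go in R; the choice of Jeffreys-type priors (so that mπ and nρ have posteriors which are halves of chi-squared variables on 2x + 1 and 2y + 1 degrees of freedom) is our assumption for the sketch:

```r
# Poisson/F approximation for the ratio pi/rho: with m*pi ~ chi^2(2x+1)/2 and
# n*rho ~ chi^2(2y+1)/2, the quantity (pi/rho)(m/n)(2y+1)/(2x+1) is roughly F.
ratio.interval <- function(x, m, y, n, level = 0.90) {
  scale <- (n/m) * (2*x + 1)/(2*y + 1)
  p <- c((1 - level)/2, (1 + level)/2)
  scale * qf(p, 2*x + 1, 2*y + 1)    # approximate interval for pi/rho
}
```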
The exact probability that π ⩽ ρ can be worked out in terms of hypergeometric probabilities (cf. Altham, 1969), although the resulting expression is not usually useful for hand computation. It is even possible to give an exact expression for the posterior distribution of the ratio π/ρ (cf. Weisberg, 1972), but this is even more unwieldy.
5.7 Exercises on Chapter 5
1. Two analysts measure the percentage of ammonia in a chemical process over 9 days and find the following discrepancies between their results:
Investigate the mean discrepancy θ between their results and in particular give an interval in which you are 90% sure that it lies.
2. With the same data as in the previous question, test the hypothesis that there is no discrepancy between the two analysts.
3. Suppose that you have grounds for believing that observations xi, yi for i = 1, 2, …, n are such that xi ~ N(θ, φi) and also yi ~ N(θ, φi), but that you are not prepared to assume that the φi are equal. What statistic would you expect to base inferences about θ on?
4. How much difference would it make to the analysis of the data in Section 5.1 on rat diet if we took … instead of …?
5. Two analysts in the same laboratory made repeated determinations of the percentage of fibre in soya cotton cake, the results being as shown:
Investigate the mean discrepancy θ between their mean determinations and in particular give an interval in which you are 90% sure that it lies
(a) assuming that it is known from past experience that the standard deviation of both sets of observations is 0.1, and
(b) assuming simply that it is known that the standard deviations of the two sets of observations are equal.