6.3.3 Example
This example goes to show that what I naïvely thought to be true of York’s weather is, in fact, false. I guessed that if November was wet, the same thing would be true in December, and so I thought I would try and see how far this December’s rainfall could be predicted in terms of November’s. It turns out that the two are in fact negatively correlated, so that if November is very wet there is a slight indication that December will be on the dry side. However, the data (given in mm) serves quite as well to indicate the method.
It turns out that $n = 10$, $\bar x = 57.9$, $\bar y = 40.8$, $S_{xx} = 13{,}539$, $S_{yy} = 1889$ and $S_{xy} = -2178$, so that

$$a = \bar y = 40.8, \qquad b = \frac{S_{xy}}{S_{xx}} = \frac{-2178}{13{,}539} = -0.1609, \qquad S_{ee} = S_{yy} - \frac{S_{xy}^2}{S_{xx}} = 1538.$$

It follows that

$$s^2 = \frac{S_{ee}}{n-2} = \frac{1538}{8} = 192.3, \qquad \frac{\alpha - a}{s/\sqrt{n}} \sim t_{n-2}, \qquad \frac{\beta - b}{s/\sqrt{S_{xx}}} \sim t_{n-2}.$$

Since the 75th percentile of $t_8$ is 0.706, it follows that a 50% HDR for the intercept α is $a \pm 0.706\,s/\sqrt{n}$, that is, (37.7, 43.9). Similarly, a 50% HDR for the slope β is $b \pm 0.706\,s/\sqrt{S_{xx}}$, that is, (–0.245, –0.077). Further, from tables of values of $\chi^2$ corresponding to HDRs for $\log \chi^2_8$, an interval of posterior probability 50% for the variance $\phi$ is from 1538/11.079 to 1538/5.552, that is, (139, 277).
Very often the slope β is of more importance than the intercept α. Thus, in the above example, the fact that the slope is negative with high probability corresponds to the conclusion that high rainfall in November indicates that there is less likely to be high rainfall in December, as was mentioned earlier.
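The arithmetic above is easily checked by machine. The following minimal Python sketch (not from the book) reproduces the HDRs from the quoted summary statistics, with scipy quantiles standing in for the tables referred to in the text.

```python
# A sketch (not from the book): reproducing the Section 6.3.3 results
# from the quoted summary statistics.
from scipy import stats

n, ybar = 10, 40.8
Sxx, Syy, Sxy = 13539.0, 1889.0, -2178.0

a = ybar                          # posterior mode of the intercept alpha
b = Sxy / Sxx                     # posterior mode of the slope beta, about -0.161
See = Syy - Sxy**2 / Sxx          # residual sum of squares, about 1538
s = (See / (n - 2)) ** 0.5        # s^2 = S_ee / (n - 2)

t75 = stats.t.ppf(0.75, df=n - 2)                        # 75th percentile of t_8, about 0.706
print(a - t75 * s / n**0.5, a + t75 * s / n**0.5)        # 50% HDR for alpha: (37.7, 43.9)
print(b - t75 * s / Sxx**0.5, b + t75 * s / Sxx**0.5)    # 50% HDR for beta: (-0.245, -0.077)

# The book's 50% interval for the variance phi uses an HDR for log chi-squared
# (tabled values 11.079 and 5.552); an equal-tailed interval from scipy's
# chi-squared quantiles gives a similar but not identical answer.
print(See / stats.chi2.ppf(0.75, df=n - 2), See / stats.chi2.ppf(0.25, df=n - 2))
```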
6.3.4 Case of known variance
If, which is not very likely, you should happen to know the variance $\phi$, the problem is even simpler. In this case, it is easy to deduce that (with the same priors for α and β)

$$p(\alpha, \beta \mid \boldsymbol{x}, \boldsymbol{y}) \propto \exp\left\{-\left[n(\alpha - a)^2 + S_{xx}(\beta - b)^2\right]/2\phi\right\}.$$

It is clear that in this case the posteriors for α and β are independent and such that

$$\alpha \sim N(a, \phi/n) \qquad\text{and}\qquad \beta \sim N(b, \phi/S_{xx}).$$
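As a quick illustration (a sketch, not from the book), with a known variance the HDRs can be read straight off these normal distributions; the value φ = 190 below is purely illustrative, roughly the s² found in the example.

```python
# A sketch (assuming a hypothetical known variance phi): the posteriors
# for alpha and beta are then exactly normal.
from scipy import stats

n, Sxx = 10, 13539.0
a, b = 40.8, -0.1609
phi = 190.0                                              # illustrative value only

print(stats.norm(a, (phi / n) ** 0.5).interval(0.5))     # 50% HDR for alpha
print(stats.norm(b, (phi / Sxx) ** 0.5).interval(0.5))   # 50% HDR for beta
```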
6.3.5 The mean value at a given value of the explanatory variable
Sometimes there are other quantities of interest than α, β and $\phi$. For example, you might want to know what the expected value of y is at a given value $x_0$ of x. A particular case would arise if you wanted to estimate the average weight of women of a certain height on the basis of data on the heights and weights of n individuals. Similarly, you might want to know about the value of the parameter $\alpha_0 = \alpha - \beta\bar x$ in the original formulation (which corresponds to the particular value x = 0). Suppose that the parameter of interest is

$$\eta_0 = \alpha + \beta(x_0 - \bar x).$$
Now we know that for given $\phi$, $\bar x$, $x_0$ and the data,

$$\alpha \sim N(a, \phi/n) \qquad\text{and}\qquad \beta \sim N(b, \phi/S_{xx}),$$

independently of one another. It follows that, given the same values,

$$\eta_0 \sim N\!\left(a + b(x_0 - \bar x),\ \phi\left[\frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right]\right).$$

It is now easy to deduce the unconditional distribution of $\eta_0$ from the fact that $\phi$ has a (multiple of an) inverse chi-squared distribution. The same arguments used in Section 2.12 on ‘Normal mean and variance both unknown’ can be used to deduce that

$$\frac{\eta_0 - \left[a + b(x_0 - \bar x)\right]}{s\sqrt{1/n + (x_0 - \bar x)^2/S_{xx}}} \sim t_{n-2}.$$

In particular, setting $x_0 = 0$ and writing $\alpha_0 = \alpha - \beta\bar x$, we get

$$\frac{\alpha_0 - (a - b\bar x)}{s\sqrt{1/n + \bar x^2/S_{xx}}} \sim t_{n-2}.$$
6.3.6 Prediction of observations at a given value of the explanatory variable
It should be noted that if you are interested in the distribution of a potential observation $y_0$ at a value $x = x_0$, that is, the predictive distribution, then the result is slightly different. The mean of such observations conditional on α, β and $x_0$ is still $\eta_0 = \alpha + \beta(x_0 - \bar x)$, but since

$$y_0 \mid \eta_0, \phi \sim N(\eta_0, \phi)$$

in addition to the above distribution for $\eta_0$, it follows that

$$y_0 \mid \phi \sim N\!\left(a + b(x_0 - \bar x),\ \phi\left[1 + \frac{1}{n} + \frac{(x_0 - \bar x)^2}{S_{xx}}\right]\right),$$

and so on integrating out $\phi$,

$$\frac{y_0 - \left[a + b(x_0 - \bar x)\right]}{s\sqrt{1 + 1/n + (x_0 - \bar x)^2/S_{xx}}} \sim t_{n-2}.$$
6.3.7 Continuation of the example
To find the mean rainfall to be expected in December in a year when there are x0=46.1 mm in November, we first find $x_0 - \bar x = 46.1 - 57.9 = -11.8$ and $b(x_0 - \bar x) = 1.9$, and hence $a + b(x_0 - \bar x) = 42.7$. Then the distribution of the expected value at x=x0 is N(42.7, 4.60²). On the other hand, in single years in which the rainfall in November is 46.1, there is a greater variation in the December rainfall than the variance for the mean of 4.60² = 21.2 implies – in fact $s\sqrt{1 + 1/n + (x_0 - \bar x)^2/S_{xx}} = 14.6$ and the corresponding variance is 14.6² = 213.2.
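These numbers are easy to verify; the following sketch (not from the book) recomputes the posterior mean at $x_0$, its standard deviation, and the larger predictive standard deviation from the summary statistics.

```python
# A sketch (not from the book): checking the numbers in Section 6.3.7.
from scipy import stats

n, xbar = 10, 57.9
Sxx, See = 13539.0, 1538.0
a, b = 40.8, -0.1609
s = (See / (n - 2)) ** 0.5

x0 = 46.1
mean0 = a + b * (x0 - xbar)                                # about 42.7

sd_mean = s * (1 / n + (x0 - xbar) ** 2 / Sxx) ** 0.5      # about 4.60
sd_pred = s * (1 + 1 / n + (x0 - xbar) ** 2 / Sxx) ** 0.5  # about 14.6
print(mean0, sd_mean**2, sd_pred**2)                       # ~42.7, ~21.2, ~213

# e.g. a 50% HDR for the mean December rainfall when November gives 46.1 mm
t75 = stats.t.ppf(0.75, df=n - 2)
print(mean0 - t75 * sd_mean, mean0 + t75 * sd_mean)
```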
6.3.8 Multiple regression
Very often there is more than one explanatory variable, and we want to predict the value of y using the values of two or more variables x(1), x(2), etc. It is not difficult to adapt the method described earlier to estimate the parameters in a model such as

$$y_i = \alpha + \beta_1\left(x_i^{(1)} - \bar x^{(1)}\right) + \beta_2\left(x_i^{(2)} - \bar x^{(2)}\right) + \varepsilon_i,$$

although you will find some complications unless it happens that

$$\sum_i \left(x_i^{(1)} - \bar x^{(1)}\right)\left(x_i^{(2)} - \bar x^{(2)}\right) = 0.$$
For this reason, it is best to deal with such multiple regression problems by using matrix analysis. Readers who are interested will find a brief introduction to this topic in Section 6.7, while a full account can be found in Box and Tiao (1992).
6.3.9 Polynomial regression
A difficult problem which will not be discussed in any detail is that of polynomial regression, that is, of fitting a model

$$y_i = \alpha + \beta_1 x_i + \beta_2 x_i^2 + \cdots + \beta_r x_i^r + \varepsilon_i,$$
where all the parameters, including the degree r of the polynomial are unknown a priori. Some relevant references are Jeffreys (1961, Sections 5.9–5.92) and Sprent (1969, Sections 5.4, 5.5). There is also an interesting discussion in Meyer and Collier (1970, p. 114 et seq.) in which Lindley starts by remarking:
I agree the problem of fitting a polynomial to the data is one that at the moment I can’t fit very conveniently to the Bayesian analysis. I have prior beliefs in the smoothness of the polynomial. We need to express this idea quantitatively, but I don’t know how to do it. We could bring in our prior opinion that some of the regression coefficients are very small.
Subsequently, a Bayesian approach to this problem has been developed by Young (1977).
6.4 Conjugate prior for the bivariate regression model
6.4.1 The problem of updating a regression line
In Section 6.3, we saw that with the regression line in the standard form

$$y_i = \alpha + \beta(x_i - \bar x) + \varepsilon_i,$$

the joint posterior is

$$p(\alpha, \beta, \phi \mid \boldsymbol{x}, \boldsymbol{y}) \propto \phi^{-n/2-1}\exp\left\{-\left[S_{ee} + n(\alpha - a)^2 + S_{xx}(\beta - b)^2\right]/2\phi\right\}.$$

For reasons that will soon emerge, we denote the quantities derived from the data with a prime as $n'$, $\bar x'$, $\bar y'$, $a'$, $b'$, $S'_{xx}$, $S'_{ee}$, etc. In the example on rainfall, we found that $n' = 10$, $\bar x' = 57.9$, $\bar y' = a' = 40.8$ and

$$S'_{xx} = 13{,}539, \qquad b' = -0.1609, \qquad S'_{ee} = 1538.$$
Now suppose that we collect further data, thus
If this had been all the data available, we would have constructed a regression line based on data for $n'' = 6$ years, with corresponding quantities $\bar x''$, $\bar y''$, $a''$, $b''$, $S''_{xx}$, $S''_{ee}$, etc., computed from these six years alone.
If, however, we had all 16 years’ data, then the regression line would have been based on data for n = 16 years, resulting in combined values of $\bar x$, $\bar y$, a, b, $S_{xx}$ and $S_{ee}$.
6.4.2 Formulae for recursive construction of a regression line
By the sufficiency principle it must be possible to find $n$, $\bar x$, $\bar y$, $a$, $b$, $S_{xx}$, $S_{ee}$, etc., from $n'$, $\bar x'$, $a'$, $b'$, etc., and $n''$, $\bar x''$, $a''$, $b''$, etc. It is in fact not too difficult to show that n, $\bar x$ and $\bar y$ are given by

$$n = n' + n'', \qquad \bar x = \frac{n'\bar x' + n''\bar x''}{n}, \qquad \bar y = \frac{n'\bar y' + n''\bar y''}{n},$$

and that if we define

$$S^c_{xx} = \frac{n'n''}{n}\left(\bar x' - \bar x''\right)^2, \qquad b^c = \frac{\bar y' - \bar y''}{\bar x' - \bar x''},$$

then $S_{xx}$, b and $S_{ee}$ are given by

$$S_{xx} = S'_{xx} + S''_{xx} + S^c_{xx},$$
$$b = \frac{S'_{xx}b' + S''_{xx}b'' + S^c_{xx}b^c}{S_{xx}},$$
$$S_{ee} = S'_{ee} + S''_{ee} + \left(S'_{xx}b'^2 + S''_{xx}b''^2 + S^c_{xx}(b^c)^2 - S_{xx}b^2\right).$$
Of these formulae, the only one that is at all difficult to deduce is the last, which is established thus:

$$S_{ee} = S_{yy} - b^2 S_{xx} = S'_{yy} + S''_{yy} + S^c_{yy} - b^2 S_{xx}$$

(it is easily checked that there is no term $S^c_{ee}$). However,

$$S'_{yy} = S'_{ee} + S'_{xx}b'^2, \qquad S''_{yy} = S''_{ee} + S''_{xx}b''^2, \qquad S^c_{yy} = S^c_{xx}(b^c)^2,$$

giving the result.
With the data in the example, it turns out that n = 16 and that $\bar x$ (a weighted mean of $\bar x'$ and $\bar x''$) is 60.3, and similarly $\bar y$ is a weighted mean of $\bar y'$ and $\bar y''$. Moreover, substituting into the formulae above gives values of $S_{xx}$, b and $S_{ee}$ in accordance with the results quoted earlier obtained by considering all 16 years together.
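The recursive formulae are easy to implement and to check against a direct fit on the pooled data; here is a minimal sketch (not from the book; the simulated data are purely illustrative).

```python
# A sketch (not from the book): the recursive formulae of Section 6.4.2,
# checked against a direct fit on simulated data.
import numpy as np

def summaries(x, y):
    n, xbar, ybar = len(x), x.mean(), y.mean()
    Sxx = ((x - xbar) ** 2).sum()
    Sxy = ((x - xbar) * (y - ybar)).sum()
    Syy = ((y - ybar) ** 2).sum()
    b = Sxy / Sxx
    See = Syy - Sxy ** 2 / Sxx
    return n, xbar, ybar, Sxx, b, See

def combine(s1, s2):
    n1, x1, y1, Sxx1, b1, See1 = s1
    n2, x2, y2, Sxx2, b2, See2 = s2
    n = n1 + n2
    xbar = (n1 * x1 + n2 * x2) / n
    ybar = (n1 * y1 + n2 * y2) / n
    Sxxc = n1 * n2 / n * (x1 - x2) ** 2          # S^c_xx
    bc = (y1 - y2) / (x1 - x2)                   # b^c
    Sxx = Sxx1 + Sxx2 + Sxxc
    b = (Sxx1 * b1 + Sxx2 * b2 + Sxxc * bc) / Sxx
    See = (See1 + See2
           + Sxx1 * b1**2 + Sxx2 * b2**2 + Sxxc * bc**2 - Sxx * b**2)
    return n, xbar, ybar, Sxx, b, See

rng = np.random.default_rng(0)
x = rng.normal(60, 20, 16)
y = 40 - 0.16 * (x - x.mean()) + rng.normal(0, 14, 16)

direct = summaries(x, y)
recursive = combine(summaries(x[:10], y[:10]), summaries(x[10:], y[10:]))
print(np.allclose(direct, recursive))   # True
```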
6.4.3 Finding an appropriate prior
In the aforementioned analysis, our prior knowledge could be summarized by saying that if the regression line is put in the form

$$y = \alpha + \beta(x - \bar x'),$$

then

$$\frac{\alpha - a'}{s'/\sqrt{n'}} \sim t_{n'-2}, \qquad \frac{\beta - b'}{s'/\sqrt{S'_{xx}}} \sim t_{n'-2}, \qquad \frac{S'_{ee}}{\phi} \sim \chi^2_{n'-2}.$$

We then had observations (denoted $\boldsymbol x''$ and $\boldsymbol y''$) that resulted in a posterior which is such that if the regression line is put in the form

$$y = \alpha + \beta(x - \bar x),$$

then

$$\frac{\alpha - a}{s/\sqrt{n}} \sim t_{n-2}, \qquad \frac{\beta - b}{s/\sqrt{S_{xx}}} \sim t_{n-2}, \qquad \frac{S_{ee}}{\phi} \sim \chi^2_{n-2}.$$
This, of course, gives a way of incorporating prior information into a regression model provided that it can be put into the form which occurs above. It is, however, often quite difficult to specify prior knowledge about a regression line unless, as in the case above, it is explicitly the result of previous data. Appropriate questions to ask to fix which prior of the class to use are as follows:
1. What number of observations is my present knowledge worth? Write the answer as $n'$.
2. What single point is the regression line most likely to go through? Write the answer as $(\bar x', \bar y')$.
3. What is the best guess as to the slope of the regression line? Write the answer as $b'$.
4. What is the best guess as to the variance of the observations $y_i$ about the regression line? Write the answer as $s'^2$ and find $S'_{ee}$ as $S'_{ee} = (n' - 2)s'^2$.
5. Finally make $S'_{xx}$ such that the estimated variances $s'^2/S'_{xx}$ for the slope β and $s'^2/n'$ for the intercept α are in the ratio $n'$ to $S'_{xx}$.
As noted above, it is difficult to believe that this process can be carried out in a very convincing manner, although the first three steps do not present as much difficulty as the last two. However, the case where information is received and then more information of the same type is used to update the regression line can be useful.
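To make the elicitation steps concrete, here is a hypothetical helper (not from the book) that converts the five answers into the prior quantities $n'$, $(\bar x', \bar y')$, $b'$, $S'_{ee}$ and $S'_{xx}$; the function name and argument names are my own.

```python
# A sketch (hypothetical helper, not from the book): turning the five
# elicitation answers of Section 6.4.3 into prior quantities.
def regression_prior(n_equiv, point, slope, var_about_line, var_ratio):
    """n_equiv       : answer 1, prior worth in observations (n')
    point            : answer 2, point the line most likely goes through (xbar', ybar')
    slope            : answer 3, best guess at the slope (b')
    var_about_line   : answer 4, best guess s'^2 at the variance about the line
    var_ratio        : answer 5, ratio var(beta)/var(alpha), which equals n'/Sxx'
    """
    xbar, ybar = point
    See = (n_equiv - 2) * var_about_line   # since s'^2 = S'_ee / (n' - 2)
    Sxx = n_equiv / var_ratio              # var(beta)/var(alpha) = (s^2/Sxx)/(s^2/n) = n/Sxx
    return dict(n=n_equiv, xbar=xbar, ybar=ybar, a=ybar, b=slope, See=See, Sxx=Sxx)

# e.g. a prior equivalent to the ten years of rainfall data:
print(regression_prior(10, (57.9, 40.8), -0.16, 190.0, 10 / 13539))
```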
> It is of course possible (and indeed simpler) to do similar things with the correlation coefficient.
6.5 Comparison of several means – the one way model
6.5.1 Description of the one way layout
Sometimes we want to compare more than two samples. We might, for example, wish to compare the performance of children from a number of schools at a standard test. The usual model for such a situation is as follows. We suppose that $\boldsymbol\theta = (\theta_1, \theta_2, \dots, \theta_I)$ is a vector of unknown parameters and that there are independent observations

$$x_{ik} \sim N(\theta_i, \phi) \qquad (i = 1, 2, \dots, I;\ k = 1, 2, \dots, K_i)$$

from I independent populations with, however, a common variance $\phi$. For simplicity, we shall assume independent reference priors uniform in $\theta_1, \dots, \theta_I$ and $\log\phi$, that is,

$$p(\boldsymbol\theta, \phi) \propto 1/\phi.$$

The likelihood is

$$(2\pi\phi)^{-N/2}\exp(-S/2\phi),$$

where

$$N = \sum_i K_i, \qquad S = \sum_i \sum_k (x_{ik} - \theta_i)^2,$$

and so the posterior is

$$p(\boldsymbol\theta, \phi \mid \boldsymbol x) \propto \phi^{-N/2-1}\exp(-S/2\phi).$$
It is useful to define the following notation:

$$\bar x_{i\cdot} = \frac{1}{K_i}\sum_k x_{ik}, \qquad \bar x_{\cdot\cdot} = \frac{1}{N}\sum_i\sum_k x_{ik},$$
$$\lambda = \frac{1}{N}\sum_i K_i\theta_i, \qquad t_i = \theta_i - \lambda,$$
$$\hat\lambda = \bar x_{\cdot\cdot}, \qquad \hat t_i = \bar x_{i\cdot} - \bar x_{\cdot\cdot}.$$

The reason for thinking of the $t_i$ is that we are often concerned as to whether all the $\theta_i$ are equal. If, for example, the $x_{ik}$ represent yields of wheat on fields on which I different fertilizers have been used, then we are likely to be interested in whether the yields are on average all equal (or nearly so), that is, whether

$$\theta_1 = \theta_2 = \dots = \theta_I,$$

or equivalently whether or not

$$t_1 = t_2 = \dots = t_I = 0.$$

The $t_i$ satisfy the condition

$$\sum_i K_i t_i = 0,$$

so that if we know the values of $t_1, t_2, \dots, t_{I-1}$ we automatically know $t_I$. Similarly the $\hat t_i$ satisfy $\sum_i K_i \hat t_i = 0$.
6.5.2 Integration over the nuisance parameters
Since the Jacobian determinant of the transformation which takes $(\theta_1, \dots, \theta_I, \phi)$ to $(\lambda, t_1, \dots, t_{I-1}, \phi)$ consists of entries all of which are constants, its value is a constant, and so

$$p(\lambda, t_1, \dots, t_{I-1}, \phi \mid \boldsymbol x) \propto \phi^{-N/2-1}\exp(-S/2\phi).$$

The thing to do now is to re-express S in terms of $\lambda$ and the $t_i$. Since $\theta_i = \lambda + t_i$ and $\bar x_{i\cdot} = \hat\lambda + \hat t_i$, it follows that

$$x_{ik} - \theta_i = (x_{ik} - \bar x_{i\cdot}) + (\hat t_i - t_i) + (\hat\lambda - \lambda).$$

It is easily checked that sums of products of terms on the right vanish, and so it easily follows that

$$S = S_e + S_t + N(\lambda - \hat\lambda)^2,$$

where

$$S_e = \sum_i\sum_k (x_{ik} - \bar x_{i\cdot})^2, \qquad S_t = \sum_i K_i (t_i - \hat t_i)^2.$$

It is also useful to define

$$\nu = N - I, \qquad s^2 = S_e/\nu.$$

It follows that the posterior may be written in the form

$$p(\lambda, t_1, \dots, t_{I-1}, \phi \mid \boldsymbol x) \propto \phi^{-N/2-1}\exp\left\{-\left[S_e + S_t + N(\lambda - \hat\lambda)^2\right]/2\phi\right\}.$$
As explained earlier, the value of λ is not usually of any great interest, and it is easily integrated out to give

$$p(t_1, \dots, t_{I-1}, \phi \mid \boldsymbol x) \propto \phi^{-(N-1)/2-1}\exp\left\{-\left(S_e + S_t\right)/2\phi\right\}.$$

The variance can now be integrated out in just the same way as it was in Section 2.12 on ‘Normal mean and variance both unknown’ by reducing to a standard gamma function integral. The result is that

$$p(t_1, \dots, t_{I-1} \mid \boldsymbol x) \propto \left(S_e + S_t\right)^{-(N-1)/2} \propto \left[1 + \frac{S_t}{\nu s^2}\right]^{-(\nu + I - 1)/2},$$

where, as before, $\nu = N - I$ and $s^2 = S_e/\nu$, so that $\nu + I - 1 = N - 1$. This is similar to a result obtained in one dimension (see Section 2.12 again; the situation there is not quite that we get by setting I = 1 here because here λ has been integrated out). In that case we deduced that

$$p(\mu \mid \boldsymbol x) \propto \left[1 + \frac{t^2}{\nu}\right]^{-(\nu+1)/2},$$

where

$$t = \frac{\mu - \bar x}{s/\sqrt{n}}.$$
By analogy with that situation, the posterior distribution for is called the multivariate t distribution. It was discovered independently by Cornish (1954 and 1955) and by Dunnett and Sobel (1954). The constant of proportionality can be evaluated, but we will not need to use it.
It should be clear that the density is a maximum when $t_i = \hat t_i$ for all i and decreases as the distance from $(t_1, \dots, t_{I-1})$ to $(\hat t_1, \dots, \hat t_{I-1})$ increases, and indeed an HDR for $(t_1, \dots, t_{I-1})$ is clearly a hyperellipsoid centred on $(\hat t_1, \dots, \hat t_{I-1})$, that is, it is of the form

$$E(F) = \left\{(t_1, \dots, t_{I-1});\ S_t \leqslant (I-1)s^2F\right\},$$

in which the length of each of the axes is in a constant ratio to $\sqrt F$.

To find an HDR of any particular probability it therefore suffices to find the distribution of

$$F = \frac{S_t/(I-1)}{S_e/\nu} = \frac{S_t}{(I-1)s^2},$$

and since F is a ratio of sums of squares divided by appropriate numbers of degrees of freedom it seems reasonable to conjecture that

$$F \sim F_{I-1,\nu},$$

which is indeed so.
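The conjecture can also be checked numerically. The following sketch (not from the book; the group sizes and summary values are hypothetical) draws from the joint posterior, with φ as a multiple of an inverse chi-squared variable and the $\theta_i$ as independent normals given φ, and compares the simulated quantiles of F with those of an $F_{I-1,\nu}$ distribution.

```python
# A sketch (not from the book): Monte Carlo check that
# F = S_t / ((I-1) s^2) behaves like an F(I-1, nu) variable a posteriori.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
K = np.array([8, 4, 4, 4, 4, 4, 4])            # hypothetical group sizes
xbar_i = rng.normal(15, 5, size=K.size)        # hypothetical group means
I, N = K.size, K.sum()
nu = N - I
Se = 1000.0                                    # hypothetical error sum of squares
s2 = Se / nu

grand = (K * xbar_i).sum() / N
t_hat = xbar_i - grand                         # the \hat t_i

M = 100_000
phi = Se / rng.chisquare(nu, M)                        # S_e/phi ~ chi^2_nu
theta = rng.normal(xbar_i, np.sqrt(phi[:, None] / K))  # theta_i | phi ~ N(xbar_i, phi/K_i)
lam = (theta * K).sum(axis=1) / N
t = theta - lam[:, None]
St = (K * (t - t_hat) ** 2).sum(axis=1)
F = St / ((I - 1) * s2)

# Simulated quantiles should match the F(I-1, nu) quantiles closely.
for q in (0.5, 0.9, 0.99):
    print(np.quantile(F, q), stats.f.ppf(q, I - 1, nu))
```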
6.5.3 Derivation of the F distribution
It is not really necessary to follow this proof that F really has got an F distribution, but it is included for completeness. We have

$$P(F \leqslant f) \propto \int_0^f \left[1 + \frac{(I-1)F}{\nu}\right]^{-(\nu+I-1)/2} dV(F),$$

where V(F) is the volume of the hyperellipsoid E(F). At first sight it appears that this is I-dimensional, but because of the constraint $\sum_i K_i t_i = 0$ it represents the intersection of a hyperellipsoid in I dimensions with a hyperplane through its centre, which is a hyperellipsoid in (I–1) dimensions. If this is not clear, it may help to note that an ordinary sphere in three-dimensional space cuts a plane in a circle, that is, a sphere in 3–1=2 dimensions. It follows that

$$V(F) \propto F^{(I-1)/2},$$

and hence

$$dV(F) \propto F^{(I-3)/2}\,dF.$$

It follows that the density of F is proportional to

$$F^{(I-3)/2}\left[1 + \frac{(I-1)F}{\nu}\right]^{-(\nu+I-1)/2}.$$

Comparing this with the standard form in Appendix A and noting that $(I-3)/2 = (I-1)/2 - 1$, it can be seen that indeed $F \sim F_{I-1,\nu}$, as asserted.
6.5.4 Relationship to the analysis of variance
This relates to the classical approach to the one-way layout. Note that if we regard F as a function of $(t_1, \dots, t_{I-1})$, so that

$$F(\boldsymbol t) = \frac{\sum_i K_i(t_i - \hat t_i)^2}{(I-1)s^2},$$

then

$$F(\boldsymbol 0) = \frac{\sum_i K_i \hat t_i^2}{(I-1)s^2}$$

at the point $\boldsymbol t = \boldsymbol 0$ which represents no treatment effect. Consequently if

$$P\left(F_{I-1,\nu} \leqslant F(\boldsymbol 0)\right) = p,$$

then p is the probability of an HDR which just includes $\boldsymbol t = \boldsymbol 0$. It is thus possible to carry out a significance test at level α of the hypothesis that $\boldsymbol t = \boldsymbol 0$ in the sense of Section 4.3 on ‘Lindley’s method’ by rejecting if and only if

$$F(\boldsymbol 0) > F^{(\alpha)}_{I-1,\nu},$$

where $F^{(\alpha)}_{I-1,\nu}$ is the upper α point of the $F_{I-1,\nu}$ distribution.
This procedure corresponds exactly to the classical analysis of variance (ANOVA) procedure in which you construct a table as follows. First find

$$S_T = \sum_i\sum_k (x_{ik} - \bar x_{\cdot\cdot})^2.$$

It is convenient to write $S_t$ for $\sum_i K_i \hat t_i^2$, the value taken by $\sum_i K_i(t_i - \hat t_i)^2$ at $\boldsymbol t = \boldsymbol 0$. Then find $S_e$ by subtraction, as it is easily shown that

$$S_T = S_t + S_e.$$

In computing, it should be noted that it makes no difference if a constant is subtracted from each of the $x_{ik}$ and that $S_T$ and $S_t$ can be found by

$$S_T = \sum_i\sum_k x_{ik}^2 - C, \qquad S_t = \sum_i \frac{T_i^2}{K_i} - C,$$

where $T_i = \sum_k x_{ik}$ is the total for treatment i, $G = \sum_i T_i$ is the grand total, and $C = G^2/N$ is the ‘correction for error’. (Note that these formulae are subject to rounding error if used incautiously.)
The value of $F(\boldsymbol 0)$ is then found easily by setting out a table as follows:

ANOVA Table

  Source        Sum of squares   Degrees of freedom   Mean square          Ratio
  Treatments    S_t              I − 1                S_t/(I − 1)          F(0) = [S_t/(I − 1)]/s²
  Error         S_e              N − I                s² = S_e/(N − I)
  Total         S_T              N − 1
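In practice, the table is easily produced by machine. The sketch below (not from the book) implements the computing scheme just described, using the totals $T_i$, the grand total G and the correction $C = G^2/N$.

```python
# A sketch (not from the book): the ANOVA computing scheme just described,
# for data supplied as one sequence of observations per treatment.
import numpy as np
from scipy import stats

def anova_table(groups):
    K = np.array([len(g) for g in groups])        # group sizes K_i
    T = np.array([np.sum(g) for g in groups])     # treatment totals T_i
    G, N, I = T.sum(), K.sum(), len(groups)
    C = G**2 / N                                  # 'correction for error'
    ST = sum((np.asarray(g, float) ** 2).sum() for g in groups) - C
    St = (T**2 / K).sum() - C
    Se = ST - St                                  # found by subtraction
    s2 = Se / (N - I)
    F0 = (St / (I - 1)) / s2                      # the ratio F(0)
    return {'S_t': St, 'S_e': Se, 'S_T': ST,
            'df': (I - 1, N - I), 'F(0)': F0,
            'P(F > F(0))': stats.f.sf(F0, I - 1, N - I)}

# Illustrative (made-up) data for three treatments:
print(anova_table([[5, 7, 6], [9, 8, 10], [4, 6, 5]]))
```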
We will now consider an example.
6.5.5 Example
Cochran and Cox (1957, Section 4.13) quote the following data from an experiment on the effect of sulphur in reducing scab disease in potatoes. In addition to untreated plots which serve as a control, three amounts of dressing were compared: 300, 600 and 1200 pounds per acre. Both an autumn and a spring application of each treatment were tried, so that in all there were seven distinct treatments. The effectiveness of the treatments was measured by the ‘scab index’, which is (roughly speaking) the average percentage of the area of 100 potatoes taken at random from each plot that is affected with scab. The data are as follows:
There are I = 7 treatments and N = 32 observations, the grand total being G = 501 (and the grand average being 15.66), and the correction for error being $C = G^2/N = 7844$. Further, $S_T$ and $S_t$ (and hence $S_e = S_T - S_t$) follow from the crude sum of squares $\sum_i\sum_k x_{ik}^2$ and the treatment totals by the computing formulae above, and hence the analysis of variance table is as follows:
ANOVA Table
From tables of the F distribution an $F_{6,25}$ variable exceeds 3.63 with probability 0.01. Consequently a 99% HDR is the hyperellipsoid E(3.63), and the observed ratio $F(\boldsymbol 0)$ falls only just short of this critical value, so that, according to the methodology of Lindley’s method, as described in Section 4.3, the data is very nearly enough to cause the null hypothesis of no treatment effect to be rejected at the 1% level.
The 99% HDR can be re-expressed by noting that $(t_1, \dots, t_{I-1})$ is in it if and only if $F(\boldsymbol t) \leqslant 3.63$, or

$$\frac{\sum_i K_i(t_i - \hat t_i)^2}{(I-1)s^2} \leqslant 3.63,$$

that is, if and only if

$$\sum_i K_i(t_i - \hat t_i)^2 \leqslant (I-1)s^2 \times 3.63.$$
It is of course difficult to visualize such sets, which is one reason why the significance test mentioned earlier is helpful in giving some ideas as to what is going on. However, as was explained when significance tests were first introduced, they should not be taken too seriously – in most cases, you would expect to see a treatment effect, even if only a small one. One point is that you can get some idea of the size of the treatment effect from the significance level.
6.5.6 Relationship to a simple linear regression model
A way of visualizing the analysis of variance in terms of the simple linear regression model was pointed out by Kelley (1927, p. 178); see also Novick and Jackson (1974, Section 4–7).
Kelley’s work is relevant to a random effects model (sometimes known as a components of variance model or Model II for the analysis of variance). An idea of what this is can be gained by considering an example quoted by Scheffé (1959, Section 7.2). Suppose a machine is used by different workers on different days, being used by worker i on $K_i$ days for $i = 1, 2, \dots, I$, and that the output when worker i uses it on day k is $x_{ik}$. Then it might be reasonable to suppose that

$$x_{ik} = m_i + e_{ik},$$

where $m_i$ is the ‘true’ mean for the ith worker and $e_{ik}$ is his ‘error’ on the kth day. We could then assume that the I workers are a random sample from a large labour pool, instead of contributing fixed if unknown effects. In such a case, all of our knowledge of the $x_{ik}$ contributes to knowledge of the distribution of the $m_i$, and so if we want to estimate a particular $m_i$ we should take into account the observations $x_{jk}$ for $j \neq i$ as well as the observations $x_{ik}$. Kelley’s suggestion is that we treat the individual measurements $x_{ik}$ as the explanatory variable and the treatment means $\bar x_{i\cdot}$ as the dependent variable, so that the model to be fitted is

$$\bar x_{i\cdot} = \alpha + \beta x_{ik} + \varepsilon_{ik},$$

where the $\varepsilon_{ik}$ are error terms of mean zero, or equivalently

$$\bar x_{i\cdot} = \alpha + \beta(x_{ik} - \bar x_{\cdot\cdot}) + \varepsilon_{ik}.$$
In terms of the notation we used in connection with simple linear regression,

$$S_{xx} = \sum_i\sum_k (x_{ik} - \bar x_{\cdot\cdot})^2 = S_T, \qquad S_{xy} = \sum_i\sum_k (x_{ik} - \bar x_{\cdot\cdot})(\bar x_{i\cdot} - \bar x_{\cdot\cdot}) = \sum_i K_i \hat t_i^2 = S_t.$$

In accordance with the theory of simple linear regression, we estimate α and β by, respectively,

$$a = \bar x_{\cdot\cdot} \qquad\text{and}\qquad b = \frac{S_{xy}}{S_{xx}} = \frac{S_t}{S_T},$$

so that the regression line takes the form

$$\widehat{\bar x}_{i\cdot} = \bar x_{\cdot\cdot} + \frac{S_t}{S_T}\,(x_{ik} - \bar x_{\cdot\cdot}).$$
The point of this formula is that if you were to try one single replicate with another broadly similar treatment to those already tried, you could estimate the overall mean for that treatment not simply by the one observation you have for that treatment, but by a weighted mean of that observation and the overall mean of all observations available to date.
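A minimal sketch (not from the book) of Kelley’s suggestion: the shrinkage factor is $b = S_t/S_T$, and the estimate for a new single observation is the weighted mean just described.

```python
# A sketch (not from the book) of Kelley's regression: a single new
# observation is shrunk toward the grand mean by the factor b = S_t/S_T.
import numpy as np

def kelley_estimate(groups, x_new):
    K = np.array([len(g) for g in groups])
    means = np.array([np.mean(g) for g in groups])
    N = K.sum()
    grand = (K * means).sum() / N                       # grand mean
    St = (K * (means - grand) ** 2).sum()
    ST = sum(((np.asarray(g, float) - grand) ** 2).sum() for g in groups)
    b = St / ST
    return grand + b * (x_new - grand)   # weighted mean of x_new and the grand mean

# Illustrative (made-up) data:
print(kelley_estimate([[5, 7, 6], [9, 8, 10], [4, 6, 5]], x_new=11.0))
```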
6.5.7 Investigation of contrasts