Often, in circumstances where the treatment effect does not appear substantial, you may want to make further investigations. Thus, in the aforementioned example about sulphur treatment for potatoes, you might want to see how the effect of any sulphur compares with none, that is, you might like an idea of the size of
$$d = \tfrac{1}{6}(\theta_2 + \theta_3 + \dots + \theta_7) - \theta_1.$$
More generally, it may be of interest to investigate any contrast, that is, any linear combination
$$d = \sum c_i \theta_i \qquad\text{where}\qquad \sum c_i = 0.$$
If we then write
$$\hat d = \sum c_i \overline x_{i\cdot} \qquad\text{and}\qquad K_d = \Bigl(\sum c_i^2/K_i\Bigr)^{-1},$$
then it is not difficult to show that we can write
$$S_t = K_d(d - \hat d)^2 + S_t',$$
where $S_t'$ is a quadratic much like $S_t$ except that it has one less dimension and consists of linear combinations of the $\theta_i$. It follows that the joint posterior density of $d$, the remaining linear combinations and $\phi$ is proportional to
$$\phi^{-(N-1)/2-1}\exp[-\{S_e + K_d(d - \hat d)^2 + S_t'\}/2\phi].$$
It is then possible to integrate over the $I-2$ linearly independent components of $S_t'$ to get
$$p(d, \phi \mid x) \propto \phi^{-(\nu+1)/2-1}\exp[-\{S_e + K_d(d - \hat d)^2\}/2\phi],$$
and then to integrate $\phi$ out to give
$$p(d \mid x) \propto \{1 + K_d(d - \hat d)^2/\nu s^2\}^{-(\nu+1)/2}$$
(remember that $\nu = N - I$), where
$$s^2 = S_e/\nu.$$
It follows that $(d - \hat d)\big/\sqrt{s^2/K_d} \sim \mathrm t_\nu$.
For example, in the case of the contrast concerned with the main effect of sulphur, $\hat d = -14/6 - 7 = -9.3$ and $K_d = \{6(1/6)^2/4 + 1^2/8\}^{-1} = 6$, so that
$$(d - \hat d)\big/\sqrt{s^2/K_d} \sim \mathrm t_{25},$$
so that, for example, as a $\mathrm t_{25}$ random variable is less than 2.060 in modulus with probability 0.95, a 95% HDR for $d$ is $\hat d \pm 2.060\sqrt{s^2/K_d}$, that is, $(-15.0, -3.7)$.
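As a numerical check on these formulae, here is a minimal Python sketch (numpy and scipy assumed; the treatment means and the value of $s^2$ are illustrative stand-ins, not the actual scab-disease figures, though the numbers of plots mirror the text):

```python
# Posterior t-interval for a contrast in the one way layout (Section 6.5).
# All data values below are hypothetical placeholders.
import numpy as np
from scipy import stats

xbar = np.array([22.6, 15.0, 14.1, 13.3, 12.2, 11.4, 10.6])  # treatment means (illustrative)
K = np.array([8, 4, 4, 4, 4, 4, 4])                          # plots per treatment
c = np.array([-1, 1/6, 1/6, 1/6, 1/6, 1/6, 1/6])             # contrast: sulphur versus control
nu = K.sum() - len(K)                                        # nu = N - I = 25
s2 = 45.0                                                    # illustrative value of s^2 = S_e/nu

d_hat = c @ xbar                        # posterior mean of the contrast d
K_d = 1.0 / np.sum(c**2 / K)            # K_d = (sum c_i^2/K_i)^(-1), here 6
t = stats.t.ppf(0.975, df=nu)           # 2.060 for nu = 25
hdr = (d_hat - t * np.sqrt(s2 / K_d), d_hat + t * np.sqrt(s2 / K_d))
print(d_hat, K_d, hdr)
```

With these sample sizes $K_d$ comes out as 6, exactly as in the calculation above.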
6.6 The two way layout
6.6.1 Notation
Sometimes we come across observations which can be classified in two ways. We shall consider a situation in which each of a number of treatments is applied to a number of plots in each of a number of blocks. However, the terminology need not be taken literally; the terms treatments and blocks are purely conventional and could be interchanged in what follows. A possible application is to the yields on plots on which different varieties of wheat have been sown and to which different fertilizers are applied, and in this case either the varieties or the fertilizers could be termed treatments or blocks. Another is to an analysis of rainfall per hour, in which case the months of the year might be treated as the blocks and the hours of the day as the treatments (or vice versa).
We consider the simplest situation in which we have $N = IJK$ observations
$$x_{ijk} \sim \mathrm N(\theta_{ij}, \phi) \qquad (i = 1, \dots, I;\ j = 1, \dots, J;\ k = 1, \dots, K)$$
to which $I$ treatments have been applied in $J$ blocks, the observations having a common variance $\phi$. For simplicity, we assume independent reference priors uniform in the $\theta_{ij}$ and $\log\phi$, that is,
$$p(\theta, \phi) \propto 1/\phi.$$
The likelihood is
$$(2\pi\phi)^{-N/2}\exp(-S/2\phi),$$
where
$$S = \sum_i \sum_j \sum_k (x_{ijk} - \theta_{ij})^2,$$
and so the posterior is
$$p(\theta, \phi \mid x) \propto \phi^{-N/2-1}\exp(-S/2\phi).$$
As in Section 6.5, we shall use dots to indicate averaging over suffices, so, for example, $\overline x_{\cdot j\cdot}$ is the average of $x_{ijk}$ over $i$ and $k$ for fixed $j$. We write
$$\lambda = \overline\theta_{\cdot\cdot},\qquad \alpha_i = \overline\theta_{i\cdot} - \overline\theta_{\cdot\cdot},\qquad \beta_j = \overline\theta_{\cdot j} - \overline\theta_{\cdot\cdot},\qquad \gamma_{ij} = \theta_{ij} - \overline\theta_{i\cdot} - \overline\theta_{\cdot j} + \overline\theta_{\cdot\cdot},$$
so that $\theta_{ij} = \lambda + \alpha_i + \beta_j + \gamma_{ij}$.
It is conventional to refer to $\alpha_i$ as the main effect of treatment $i$ and to $\beta_j$ as the main effect of block $j$. If $\alpha_i = 0$ for all $i$ there is said to be no main effect due to treatments; similarly, if $\beta_j = 0$ for all $j$ there is said to be no main effect due to blocks. Further, $\gamma_{ij}$ is referred to as the interaction of the $i$th treatment with the $j$th block, and if $\gamma_{ij} = 0$ for all $i$ and $j$ there is said to be no interaction between treatments and blocks. Note that the parameters satisfy the conditions
$$\sum_i \alpha_i = 0 \qquad\text{and}\qquad \sum_j \beta_j = 0,$$
so that $\alpha$ is $(I-1)$-dimensional and $\beta$ is $(J-1)$-dimensional. Similarly, for all $j$
$$\sum_i \gamma_{ij} = 0$$
and for all $i$
$$\sum_j \gamma_{ij} = 0.$$
Because both of these imply $\sum_i\sum_j \gamma_{ij} = 0$, there are only $I + J - 1$ linearly independent constraints, so that $\gamma$ is $IJ - I - J + 1 = (I-1)(J-1)$-dimensional.
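The constraints are easy to verify numerically. The following minimal Python sketch (with an arbitrary, purely illustrative matrix of cell means) forms $\lambda$, the $\alpha_i$, the $\beta_j$ and the $\gamma_{ij}$ and confirms the conditions just stated:

```python
# Decomposition theta_ij = lambda + alpha_i + beta_j + gamma_ij (Section 6.6.1).
import numpy as np

theta = np.array([[3.0, 5.0, 4.0],
                  [6.0, 8.0, 7.0]])          # theta_ij for I = 2, J = 3 (illustrative)
lam = theta.mean()                           # lambda = mean of all theta_ij
alpha = theta.mean(axis=1) - lam             # alpha_i: row means minus grand mean
beta = theta.mean(axis=0) - lam              # beta_j: column means minus grand mean
gamma = theta - lam - alpha[:, None] - beta[None, :]

print(alpha.sum(), beta.sum())               # both zero
print(gamma.sum(axis=0), gamma.sum(axis=1))  # all essentially zero: I + J - 1 constraints
```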
6.6.2 Marginal posterior distributions
Because
$$x_{ijk} - \theta_{ij} = (x_{ijk} - \overline x_{ij\cdot}) + (\overline x_{\cdot\cdot\cdot} - \lambda) + (\overline x_{i\cdot\cdot} - \overline x_{\cdot\cdot\cdot} - \alpha_i) + (\overline x_{\cdot j\cdot} - \overline x_{\cdot\cdot\cdot} - \beta_j) + (\overline x_{ij\cdot} - \overline x_{i\cdot\cdot} - \overline x_{\cdot j\cdot} + \overline x_{\cdot\cdot\cdot} - \gamma_{ij}),$$
the sum of squares $S$ can be split up in a slightly more complicated way than in Section 6.5 as
$$S = S_e + N(\lambda - \overline x_{\cdot\cdot\cdot})^2 + S_t + S_b + S_{tb},$$
where
$$S_e = \sum_i\sum_j\sum_k(x_{ijk} - \overline x_{ij\cdot})^2,\quad S_t = JK\sum_i(\alpha_i - a_i)^2,\quad S_b = IK\sum_j(\beta_j - b_j)^2,\quad S_{tb} = K\sum_i\sum_j(\gamma_{ij} - g_{ij})^2,$$
in which $a_i = \overline x_{i\cdot\cdot} - \overline x_{\cdot\cdot\cdot}$, $b_j = \overline x_{\cdot j\cdot} - \overline x_{\cdot\cdot\cdot}$ and $g_{ij} = \overline x_{ij\cdot} - \overline x_{i\cdot\cdot} - \overline x_{\cdot j\cdot} + \overline x_{\cdot\cdot\cdot}$. It is also useful to define
$$\nu = N - IJ = IJ(K-1) \qquad\text{and}\qquad s^2 = S_e/\nu.$$
After a change of variable as in Section 6.5, the posterior can be written as
$$p(\lambda, \alpha, \beta, \gamma, \phi \mid x) \propto \phi^{-N/2-1}\exp[-\{S_e + N(\lambda - \overline x_{\cdot\cdot\cdot})^2 + S_t + S_b + S_{tb}\}/2\phi],$$
so that on integrating $\lambda$ out
$$p(\alpha, \beta, \gamma, \phi \mid x) \propto \phi^{-(N-1)/2-1}\exp\{-(S_e + S_t + S_b + S_{tb})/2\phi\}.$$
To investigate the effects of the treatments, we can now integrate over the $\{(J-1)+(I-1)(J-1)\}$-dimensional space of values of $(\beta, \gamma)$ to get
$$p(\alpha, \phi \mid x) \propto \phi^{-(\nu+I-1)/2-1}\exp\{-(S_e + S_t)/2\phi\}.$$
We can now integrate $\phi$ out just as in Section 6.5 to get
$$p(\alpha \mid x) \propto \left\{1 + \frac{S_t}{\nu s^2}\right\}^{-(\nu+I-1)/2},$$
where, as before, $s^2 = S_e/\nu$.
Again as in Section 6.5, it can be shown that
$$F_t = \frac{S_t/(I-1)}{S_e/\nu}$$
is distributed as $\mathrm F_{I-1,\nu}$, and so it is possible to conduct significance tests by Lindley's method and to find HDRs for $\alpha$.
Similarly, it can be shown that $F_b$ (defined in the obvious way) is distributed as $\mathrm F_{J-1,\nu}$. Moreover, if
$$F_{tb} = \frac{S_{tb}/(I-1)(J-1)}{S_e/\nu},$$
then $F_{tb} \sim \mathrm F_{(I-1)(J-1),\nu}$. Thus, it is also possible to investigate the block effect and the interaction. It should be noted that it would rarely make sense to believe in the existence of an interaction unless both main effects were there.
6.6.3 Analysis of variance
The analysis is helped by defining
$$S_T = \sum_i T_i^2/JK - C,\qquad S_B = \sum_j B_j^2/IK - C,\qquad S_{TB} = \sum_i\sum_j C_{ij}^2/K - C - S_T - S_B,\qquad S_e = \sum_i\sum_j\sum_k x_{ijk}^2 - \sum_i\sum_j C_{ij}^2/K,$$
in which $T_i$ is the treatment total $\sum_j\sum_k x_{ijk}$, while $B_j$ is the block total $\sum_i\sum_k x_{ijk}$, and $C_{ij}$ is the cell total $\sum_k x_{ijk}$. Finally, $G$ is the grand total $\sum_i\sum_j\sum_k x_{ijk}$, and $C$ is the correction factor $G^2/N$. The resulting ANOVA table is as follows:
ANOVA Table

Source          Degrees of freedom      Sum of squares
Treatments      I − 1                   S_T
Blocks          J − 1                   S_B
Interaction     (I − 1)(J − 1)          S_TB
Error           IJ(K − 1)               S_e
TOTAL           N − 1                   Σ Σ Σ x²_ijk − C
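As an illustration of how the totals and the correction factor combine, here is a short Python sketch (numpy and scipy assumed; the observations are randomly generated rather than taken from any experiment in the text) which computes each row of the table and the F ratio for treatments:

```python
# Computational formulae for the two way ANOVA table (Section 6.6.3).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
I, J, K = 3, 4, 2
x = rng.normal(10.0, 1.0, size=(I, J, K))    # x_ijk, simulated for illustration
N = I * J * K

T = x.sum(axis=(1, 2))                       # treatment totals T_i
B = x.sum(axis=(0, 2))                       # block totals B_j
Cij = x.sum(axis=2)                          # cell totals C_ij
G = x.sum()                                  # grand total G
C = G**2 / N                                 # correction factor C = G^2/N

S_T = np.sum(T**2) / (J * K) - C             # treatments, I - 1 d.f.
S_B = np.sum(B**2) / (I * K) - C             # blocks, J - 1 d.f.
S_TB = np.sum(Cij**2) / K - C - S_T - S_B    # interaction, (I-1)(J-1) d.f.
S_e = np.sum(x**2) - np.sum(Cij**2) / K      # error, IJ(K-1) d.f.
nu = I * J * (K - 1)

F_t = (S_T / (I - 1)) / (S_e / nu)           # to be compared with F(I-1, nu)
print(F_t, stats.f.sf(F_t, I - 1, nu))
```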
Other matters, such as the exploration of treatment or block contrasts, can be pursued as in the case of the one way model.
6.7 The general linear model
6.7.1 Formulation of the general linear model
All of the last few sections have been concerned with particular cases of the so-called general linear model. It is possible to treat them all at once in an approach using matrix theory. In most of this book, substantial use of matrix theory has been avoided, but if the reader has some knowledge of matrices this section may be helpful, in that the intention here is to put some of the models already considered into the form of the general linear model. An understanding of how these models can be put into such a framework will put the reader in a good position to approach the theory in its full generality, as it is dealt with in such works as Box and Tiao (1992).
It is important to distinguish row vectors from column vectors. We write $x$ for a column vector and $x^{\mathrm T}$ for its transpose; similarly, if $\mathsf A$ is an $m \times n$ matrix then $\mathsf A^{\mathrm T}$ is its transpose. Consider a situation in which we have a column vector $x$ of $n$ observations, so that $x = (x_1, x_2, \dots, x_n)^{\mathrm T}$ (the equation is written thus to save the excessive space taken up by column vectors). We suppose that the $x_i$ are independently normally distributed with common variance $\phi$ and a vector of means $\mu = \mathrm E x$ satisfying
$$\mu = \mathsf A\theta,$$
where $\theta = (\theta_1, \theta_2, \dots, \theta_r)^{\mathrm T}$ is a vector of $r$ unknown parameters and $\mathsf A$ is a known $n \times r$ matrix.
In the case of the original formulation of the bivariate linear regression model in which, conditional on $x_i$, we have $y_i \sim \mathrm N(\alpha_0 + \beta x_i, \phi)$, then $y$ takes the part of $x$, $r = 2$, $(\alpha_0, \beta)^{\mathrm T}$ takes the part of $\theta$ and $\mathsf A_0$ takes the part of $\mathsf A$, where
$$\mathsf A_0 = \begin{pmatrix} 1 & x_1 \\ 1 & x_2 \\ \vdots & \vdots \\ 1 & x_n \end{pmatrix}.$$
This model is reformulated in terms of $\alpha = \alpha_0 + \beta\overline x$ and $\beta$, so that $\mathrm E y_i = \alpha + \beta(x_i - \overline x)$ and
$$\mathsf A = \begin{pmatrix} 1 & x_1 - \overline x \\ 1 & x_2 - \overline x \\ \vdots & \vdots \\ 1 & x_n - \overline x \end{pmatrix}.$$
In the case of the one way model (where, for simplicity, we shall restrict ourselves to the case where $K_i = K$ for all $i$), $\theta = (\theta_1, \theta_2, \dots, \theta_I)^{\mathrm T}$ and
$$\mathsf A = \begin{pmatrix} \mathbf 1 & \mathbf 0 & \cdots & \mathbf 0 \\ \mathbf 0 & \mathbf 1 & \cdots & \mathbf 0 \\ \vdots & \vdots & & \vdots \\ \mathbf 0 & \mathbf 0 & \cdots & \mathbf 1 \end{pmatrix},$$
in which each $\mathbf 1$ stands for a column of $K$ ones and each $\mathbf 0$ for a column of $K$ zeros, so that $\mathsf A$ is an $N \times I$ matrix.
The two way layout can be expressed similarly using a matrix of 0s and 1s. It is also possible to write the multiple regression model
$$\mathrm E y_i = \theta_1 + \theta_2 x_{i2} + \theta_3 x_{i3} + \dots + \theta_r x_{ir}$$
(the $x_{ij}$ being treated as known) as a case of the general linear model.
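To make the correspondence concrete, the small Python sketch below (purely illustrative numbers) assembles the known matrix $\mathsf A$ both for the original regression formulation and for a one way layout with $I = 3$ and $K = 2$:

```python
# Design matrices for two special cases of the general linear model.
import numpy as np

# Bivariate regression, original formulation: E y_i = alpha_0 + beta x_i,
# so the columns of A are a column of ones and the x_i themselves.
x = np.array([1.0, 2.0, 4.0, 5.0])              # illustrative regressor values
A_reg = np.column_stack([np.ones_like(x), x])   # an n x 2 matrix

# One way layout with I = 3 treatments and K = 2 replicates: column i
# of A indicates which observations received treatment i.
I_trt, K_rep = 3, 2
A_oneway = np.kron(np.eye(I_trt), np.ones((K_rep, 1)))  # an N x I matrix of 0s and 1s
print(A_reg, A_oneway, sep="\n")
```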
6.7.2 Derivation of the posterior
Noting that for any vector $z$ we have $\sum z_i^2 = z^{\mathrm T}z$, we can write the likelihood function for the general linear model in the form
$$l(\theta, \phi \mid x) \propto \phi^{-n/2}\exp\{-(x - \mathsf A\theta)^{\mathrm T}(x - \mathsf A\theta)/2\phi\}.$$
Taking standard reference priors, that is,
$$p(\theta, \phi) \propto 1/\phi,$$
the posterior is
$$p(\theta, \phi \mid x) \propto \phi^{-n/2-1}\exp\{-(x - \mathsf A\theta)^{\mathrm T}(x - \mathsf A\theta)/2\phi\}.$$
Now as $x^{\mathrm T}\mathsf A\theta = (\mathsf A\theta)^{\mathrm T}x$ and scalars equal their own transposes,
$$(x - \mathsf A\theta)^{\mathrm T}(x - \mathsf A\theta) = x^{\mathrm T}x - 2\theta^{\mathrm T}\mathsf A^{\mathrm T}x + \theta^{\mathrm T}\mathsf A^{\mathrm T}\mathsf A\theta,$$
so that if $\widehat\theta$ is such that
$$\mathsf A^{\mathrm T}\mathsf A\widehat\theta = \mathsf A^{\mathrm T}x$$
(so that $\widehat\theta = (\mathsf A^{\mathrm T}\mathsf A)^{-1}\mathsf A^{\mathrm T}x$), that is, assuming $\mathsf A^{\mathrm T}\mathsf A$ is non-singular,
we have
$$(x - \mathsf A\theta)^{\mathrm T}(x - \mathsf A\theta) = S_e + (\theta - \widehat\theta)^{\mathrm T}\mathsf A^{\mathrm T}\mathsf A(\theta - \widehat\theta),$$
where
$$S_e = (x - \mathsf A\widehat\theta)^{\mathrm T}(x - \mathsf A\widehat\theta).$$
It is also useful to define
$$\nu = n - r \qquad\text{and}\qquad s^2 = S_e/\nu.$$
Because $(\theta - \widehat\theta)^{\mathrm T}\mathsf A^{\mathrm T}\mathsf A(\theta - \widehat\theta)$ is of the form $z^{\mathrm T}z$ with $z = \mathsf A(\theta - \widehat\theta)$, it is always non-negative, and it clearly vanishes if $\theta = \widehat\theta$. Further, $S_e$ is the minimum value of the sum of squares $S$ and so is positive. It is sometimes worth noting that
$$S_e = x^{\mathrm T}x - \widehat\theta^{\mathrm T}\mathsf A^{\mathrm T}x,$$
as is easily shown.
It follows that the posterior can be written as
$$p(\theta, \phi \mid x) \propto \phi^{-n/2-1}\exp[-\{S_e + (\theta - \widehat\theta)^{\mathrm T}\mathsf A^{\mathrm T}\mathsf A(\theta - \widehat\theta)\}/2\phi].$$
In fact, this means that for given $\phi$ the vector $\theta$ has a multivariate normal distribution of mean $\widehat\theta$ and variance–covariance matrix $\phi(\mathsf A^{\mathrm T}\mathsf A)^{-1}$.
If you are now interested in $\theta$ as a whole, you can integrate with respect to $\phi$ to get
$$p(\theta \mid x) \propto \{1 + (\theta - \widehat\theta)^{\mathrm T}\mathsf A^{\mathrm T}\mathsf A(\theta - \widehat\theta)/\nu s^2\}^{-n/2},$$
where, as before, $\nu = n - r$ and $s^2 = S_e/\nu$. It may also be noted that the set
$$E(F) = \{\theta;\ (\theta - \widehat\theta)^{\mathrm T}\mathsf A^{\mathrm T}\mathsf A(\theta - \widehat\theta) \leqslant rs^2F\}$$
is a hyperellipsoid in $r$-dimensional space in which the length of each of the axes is in a constant ratio to $\sqrt F$. The argument of Section 6.5 on the one way layout can now be adapted to show that
$$(\theta - \widehat\theta)^{\mathrm T}\mathsf A^{\mathrm T}\mathsf A(\theta - \widehat\theta)/rs^2 \sim \mathrm F_{r,\nu},$$
so that $E(F)$ is an HDR for $\theta$ of probability $p$ if $F$ is the appropriate percentage point of $\mathrm F_{r,\nu}$.
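The whole derivation is easy to mirror numerically. In the Python sketch below (simulated data; numpy and scipy assumed), $\widehat\theta$ is found from the normal equations, $S_e$ from the identity noted above, and a candidate value of $\theta$ is tested for membership of the 95% HDR $E(F)$:

```python
# Posterior quantities for the general linear model (Section 6.7.2).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, r = 30, 3
A = np.column_stack([np.ones(n), rng.normal(size=(n, r - 1))])  # known n x r matrix
x = A @ np.array([1.0, 2.0, -1.0]) + rng.normal(size=n)         # simulated observations

AtA = A.T @ A
theta_hat = np.linalg.solve(AtA, A.T @ x)    # solves A'A theta_hat = A'x
S_e = x @ x - theta_hat @ (A.T @ x)          # S_e = x'x - theta_hat' A'x
nu = n - r
s2 = S_e / nu

theta = np.zeros(r)                          # any candidate value of theta
Q = (theta - theta_hat) @ AtA @ (theta - theta_hat)
F95 = stats.f.ppf(0.95, r, nu)
print(Q / (r * s2) <= F95)                   # True iff theta lies inside E(F) at 95%
```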
6.7.3 Inference for a subset of the parameters
However, it is often the case that most of the interest centres on a subset of the parameters, say on the first $k$ components $\theta_{(1)} = (\theta_1, \dots, \theta_k)^{\mathrm T}$. If so, then it is convenient to write $\theta = (\theta_{(1)}^{\mathrm T}, \theta_{(2)}^{\mathrm T})^{\mathrm T}$ and $\mathsf A = (\mathsf A_1\ \ \mathsf A_2)$, so that $\mathsf A\theta = \mathsf A_1\theta_{(1)} + \mathsf A_2\theta_{(2)}$. If it happens that the quadratic form splits into a sum
$$(\theta - \widehat\theta)^{\mathrm T}\mathsf A^{\mathrm T}\mathsf A(\theta - \widehat\theta) = Q_1 + Q_2, \qquad\text{where}\qquad Q_i = (\theta_{(i)} - \widehat\theta_{(i)})^{\mathrm T}\mathsf A_i^{\mathrm T}\mathsf A_i(\theta_{(i)} - \widehat\theta_{(i)}),$$
then it is easy to integrate $\theta_{(2)}$ out to get
$$p(\theta_{(1)}, \phi \mid x) \propto \phi^{-(\nu+k)/2-1}\exp\{-(S_e + Q_1)/2\phi\},$$
and thus, arguing as before,
$$p(\theta_{(1)} \mid x) \propto \{1 + Q_1/\nu s^2\}^{-(\nu+k)/2},$$
where, as usual, $\nu = n - r$ and $s^2 = S_e/\nu$. It is now easy to show that
$$Q_1/ks^2 \sim \mathrm F_{k,\nu}$$
and hence to make inferences for $\theta_{(1)}$.
Unfortunately, the quadratic form does in general contain cross-product terms in $(\theta_j - \widehat\theta_j)(\theta_l - \widehat\theta_l)$ with $j \leqslant k$ but $l > k$, and hence it does not in general split into such a sum $Q_1 + Q_2$. We will not discuss such cases further; useful references are Box and Tiao (1992), Lindley and Smith (1972) and Seber (2003).
6.7.4 Application to bivariate linear regression
The theory can be illustrated by considering the simple linear regression model. Consider first the reformulated version in terms of $\alpha$ and $\beta$. In this case
$$\mathsf A^{\mathrm T}\mathsf A = \begin{pmatrix} n & 0 \\ 0 & S_{xx} \end{pmatrix}$$
(where $S_{xx} = \sum(x_i - \overline x)^2$), and the fact that this matrix is easy to invert is one of the underlying reasons why this reformulation was sensible. Also,
$$S_e = \sum(y_i - \overline y)^2 - S_{xy}^2/S_{xx},$$
so that $S_e$ is what was denoted $S_{ee}$ in Section 6.3 on bivariate linear regression.
If you are particularly interested in $\alpha$, then in this case the thing to do is to note that the quadratic form splits with $Q_1 = n(\alpha - \overline y)^2$ and $Q_2 = S_{xx}(\beta - \widehat\beta)^2$, where $\widehat\beta = S_{xy}/S_{xx}$. Consequently, the posterior distribution of $\alpha$ is given by
$$n(\alpha - \overline y)^2/s^2 \sim \mathrm F_{1,n-2}.$$
Since the square of a $\mathrm t_{n-2}$ variable can easily be shown to have an $\mathrm F_{1,n-2}$ distribution, this conclusion is equivalent to that of Section 6.3.
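A short Python sketch (with made-up data) shows the diagonal form of $\mathsf A^{\mathrm T}\mathsf A$ in the reformulated model and the resulting $t$ interval for $\alpha$:

```python
# The reformulated bivariate regression as a general linear model (Section 6.7.4).
import numpy as np
from scipy import stats

x = np.array([1.0, 2.0, 3.0, 4.0, 6.0])      # illustrative data
y = np.array([2.1, 3.9, 6.2, 8.1, 11.8])
n = len(x)
u = x - x.mean()                             # centred regressor

A = np.column_stack([np.ones(n), u])
print(A.T @ A)                               # diag(n, S_xx): easy to invert

alpha_hat = y.mean()                         # posterior mean of alpha in the centred model
S_e = y @ y - n * alpha_hat**2 - (u @ y)**2 / (u @ u)  # equals S_ee of Section 6.3
s2 = S_e / (n - 2)

t975 = stats.t.ppf(0.975, n - 2)
print(alpha_hat - t975 * np.sqrt(s2 / n),
      alpha_hat + t975 * np.sqrt(s2 / n))    # 95% HDR for alpha
```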
The greater difficulties that arise when $\mathsf A^{\mathrm T}\mathsf A$ is non-diagonal can be seen by following the same process through for the original formulation of the bivariate linear regression model in terms of $\alpha_0$ and $\beta$. In this case, it is easy enough to find the posterior distribution of $\beta$, but it involves some rearrangement to get that of $\alpha_0$.
6.8 Exercises on Chapter 6
1. The sample correlation coefficient between length and weight of a species of frog was determined at each of a number of sites. The results were as follows:
Find an interval in which you are 95% sure that the correlation coefficient lies.
2. Three groups of children were given two tests. The numbers of children and the sample correlation coefficients between the two test scores in each group were as follows:
Is there any evidence that the association between the two tests differs in the three groups?
3. Suppose you have sample correlation coefficients $r_1, r_2, \dots, r_k$ on the basis of sample sizes $n_1, n_2, \dots, n_k$. Give a 95% posterior confidence interval for $\zeta = \tanh^{-1}\rho$.
4. From the approximation
$$p(r \mid \rho) \propto (1 - \rho^2)^{n/2}(1 - \rho r)^{-n},$$
which holds for large $n$, deduce an expression for the log-likelihood $L(\rho \mid r)$ and hence show that the maximum likelihood occurs when $\rho = r$. An approximation to the information can now be made by replacing $r$ by $\rho$ in the second derivative of the likelihood, since $\rho$ is near $r$ with high probability. Show that this approximation suggests a prior density of the form
$$p(\rho) \propto (1 - \rho^2)^{-1}.$$
5. Use the fact that
$$\int_0^\infty \frac{\mathrm d\theta}{\cosh\theta + \cos\phi} = \frac{\phi}{\sin\phi}$$
(cf. Edwards, 1921, art. 180) to show that
$$\int_0^\infty \frac{\mathrm dt}{\cosh t - \rho r} = \frac{\cos^{-1}(-\rho r)}{\sqrt{1 - \rho^2 r^2}}.$$
6. Show that in the special case where the sample correlation coefficient $r = 0$ and the prior takes the special form $p(\rho) \propto (1 - \rho^2)^{k/2}$, the variable
$$t = \frac{\rho\sqrt{k + n + 1}}{\sqrt{1 - \rho^2}}$$
has a Student's t distribution on $k + n + 1$ degrees of freedom.
7. By writing
and using repeated integration by parts, show that the posterior distribution of ρ can be expressed as a finite series involving powers of
and Student’s t integrals.
8. By substituting
in the form
for the posterior density of the correlation coefficient and then expanding
as a power series in u, show that the integral can be expressed as a series of beta functions. Hence, deduce that
where
9. Fill in the details of the derivation of the prior
$$p(\rho) \propto (1 - \rho^2)^{-3/2}$$
from Jeffreys' rule as outlined at the end of Section 6.1.
10. The following data consist of the estimated gestational ages (in weeks) and weights (in grammes) of 12 female babies:
Give an interval in which you are 90% sure that the gestational age of a particular such baby will lie if its weight is 3000 g, and give a similar interval in which the mean weight of all such babies lies.
11. Show directly from the definitions that, in the notation of Section 6.3,
$$S_{ee} = S_{yy} - S_{xy}^2/S_{xx} = S_{yy}(1 - r^2).$$
12. Observations $y_i$ for $i = -m, -m+1, \dots, m$ are available which satisfy the regression model
$$y_i = \alpha + \beta u_i + \gamma(u_i^2 - \overline{u^2}) + \varepsilon_i,$$
where $u_i = i$, $\overline{u^2} = \sum u_i^2/n$ and $\varepsilon_i \sim \mathrm N(0, \phi)$. Adopting the reference prior $p(\alpha, \beta, \gamma, \phi) \propto 1/\phi$, show that the posterior distribution of $\alpha$ is such that
$$\frac{\alpha - \overline y}{\sqrt{s^2/n}} \sim \mathrm t_{n-3},$$
where $n = 2m + 1$, $s^2 = S_{ee}/(n - 3)$ and
$$S_{ee} = S_{yy} - S_{uy}^2/S_{uu} - S_{vy}^2/S_{vv}, \qquad v_i = u_i^2 - \overline{u^2},$$
in which $S_{yy}$, $S_{uy}$, etc., are defined by
$$S_{yy} = \sum(y_i - \overline y)^2, \qquad S_{uy} = \sum(u_i - \overline u)(y_i - \overline y),$$
and so on. [Hint: Note that $\sum u_i = \sum u_i^3 = 0$, and hence $\overline u = 0$ and $S_{uv} = 0$.]
13. Fisher (1925b, Section 41) quotes an experiment on the accuracy of counting soil bacteria. In it, a soil sample was divided into four parallel samples, and from each of these after dilution seven plates were inoculated. The number of colonies on each plate is shown below. Do the results from the four samples agree within the limits of random sampling?
14. In the case of the data on scab disease quoted in Section 6.5, find a contrast measuring the effect of the season in which sulphur is applied and give an appropriate HDR for this contrast.
15. The data below [from Wishart and Sanders (1955, Table 5.6)] represent the weight of green produce in pounds made on an old pasture. There were three main treatments, including a control (O) consisting of the untreated land. In the other cases, the effect of a grass-land rejuvenator (R) was compared with the use of the harrow (H). The blocks were, therefore, composed of three plots each, and the experiment consisted of six randomized blocks placed side by side. The plan and yields were as follows:
Derive an appropriate two-way analysis of variance.
16. Express the two-way layout as a particular case of the general linear model.
17. Show that the matrix $\mathsf A^+ = (\mathsf A^{\mathrm T}\mathsf A)^{-1}\mathsf A^{\mathrm T}$ which arises in the theory of the general linear model is a generalized inverse of the (usually non-square) matrix $\mathsf A$ in that
a. $\mathsf A\mathsf A^+\mathsf A = \mathsf A$;
b. $\mathsf A^+\mathsf A\mathsf A^+ = \mathsf A^+$;
c. $(\mathsf A\mathsf A^+)^{\mathrm T} = \mathsf A\mathsf A^+$;
d. $(\mathsf A^+\mathsf A)^{\mathrm T} = \mathsf A^+\mathsf A$.
18. Express the bivariate linear regression model in terms of the original parameters $\alpha_0$ and $\beta$ and the matrix $\mathsf A_0$, and use the general linear model to find the posterior distribution of $\alpha_0$.
7
Other topics
7.1 The likelihood principle
7.1.1 Introduction
This section would logically come much earlier in the book than it is placed, but it is important to have some examples of Bayesian procedures firmly in place before considering this material. The basic result is due to Birnbaum (1962), and a more detailed consideration of these issues can be found in Berger and Wolpert (1988).
The nub of the argument here is that in drawing any conclusion from an experiment only the actual observation x made (and not the other possible outcomes that might have occurred) is relevant. This is in contrast to methods by which, for example, a null hypothesis is rejected because the probability of a value as large as or larger than that actually observed is small, an approach which leads to Jeffreys’ criticism that was mentioned in Section 4.1 when we first considered hypothesis tests, namely, that ‘a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred’. Virtually all of the ideas discussed in this book abide by this principle, which is known as the likelihood principle (there are some exceptions, for example Jeffreys’ rule is not in accordance with it). We shall show that it follows from two other principles, called the conditionality principle and the sufficiency principle, both of which are hard to argue against.
In this section, we shall write $x$ for a particular piece of data, not necessarily one-dimensional, the density $p(x \mid \theta)$ of which depends on an unknown parameter θ. For simplicity, we shall suppose that $x$ and θ are discrete (although they may be more than one-dimensional). The triple $E = \{\tilde x, \theta, p(x \mid \theta)\}$ represents the essential features of an experiment to gain information about θ, and accordingly we shall refer to $E$ as an experiment. Note that the random variable $\tilde x$ is a feature of the experiment, not the particular value $x$ that may be observed when the experiment is carried out on a particular occasion. If such an experiment is carried out and the value $x$ is observed, then we shall write $\mathrm{Ev}(E, x)$ for the evidence provided about the value of θ by carrying out experiment $E$ and observing the value $x$. This 'evidence' is not presumed to be in any particular form. To a Bayesian, it would normally be the posterior distribution of θ or some feature of it, but for the moment we are not restricting ourselves to Bayesian inference, and a classical statistician might consider evidence to be made up of significance levels and confidence intervals, etc., while the notation does not rule out some form of evidence that would be new to both.
For example, you might be interested in the proportion θ of defective articles coming from a factory. A possible experiment $E$ would consist in observing a fixed number $n$ of articles chosen at random and observing the number $x$ defective, so that
$$p(x \mid \theta) = \binom nx \theta^x(1 - \theta)^{n-x}$$
is a family of binomial densities. To have a definite experiment, it is necessary to give $n$ a specific value, for example, $n = 100$; once $n$ is known, $E$ is fully determined. If we then observe that $x = 3$, then $\mathrm{Ev}(E, 3)$ denotes the conclusions we arrive at about the value of θ.
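For a Bayesian, $\mathrm{Ev}(E, x)$ in this example might simply be the posterior distribution of θ. The following minimal Python sketch computes it for $n = 100$ and $x = 3$, assuming scipy and, purely for illustration, a uniform prior on θ:

```python
# A Bayesian version of Ev(E, x) for the binomial example: the posterior
# depends on the data only through the likelihood theta^3 (1 - theta)^97.
from scipy import stats

n, x = 100, 3
posterior = stats.beta(1 + x, 1 + n - x)   # Be(4, 98) under a uniform prior (an assumption)
print(posterior.mean())                    # about 0.039
print(posterior.interval(0.95))            # a 95% credible interval for theta
```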
7.1.2 The conditionality principle
The conditionality principle can be explained as the assertion that if you have decided which of two experiments you performed by tossing a coin, then once you tell me the end result of the experiment, it will not make any difference to any inferences I make about an unknown parameter θ whether or not I know which way the coin landed and hence which experiment was actually performed (assuming that the probability of the coin's landing 'heads' does not in any way depend on θ). For example, if we are told that an analyst has reported on the chemical composition of a sample, then it is irrelevant whether we had always intended to ask him or her to analyze the sample or had tossed a coin to decide whether to ask that scientist or the one in the laboratory next door to analyze it. Put this way, the principle should seem plausible, and we shall now try to formalize it.