4.4 Point (or sharp) null hypotheses with prior information
4.4.1 When are point null hypotheses reasonable?
As was mentioned in Section 4.3, it is very common in classical statistics to conduct a test of a point (or sharp) null hypothesis H₀: θ = θ₀ against the alternative hypothesis H₁: θ ≠ θ₀.
In such a case, the full-scale Bayesian approach (as opposed to the compromise described in the previous section) gives rise to conclusions which differ radically from the classical answers.
Before getting on to the answers, a few basic comments about the whole problem are in order. First, tests of point null hypotheses are often performed in inappropriate circumstances. It will virtually never be the case that one seriously entertains the hypothesis that θ = θ₀ exactly, a point which classical statisticians fully admit (cf. Lehmann, 1986, Sections 4.5, 5.2). More reasonable would be the null hypothesis

H₀: θ ∈ (θ₀ − ε, θ₀ + ε),
where ε is so chosen that all θ in (θ₀ − ε, θ₀ + ε) can be considered ‘indistinguishable’ from θ₀. An example in which this might arise would be an attempt to analyze a chemical by observing some aspect, described by a parameter θ, of its reaction with a known chemical. If it were desired to test whether or not the unknown chemical was a specific compound, with a reaction strength θ₀ known to an accuracy of ε, it would be reasonable to test

H₀: θ ∈ (θ₀ − ε, θ₀ + ε)   against   H₁: θ ∉ (θ₀ − ε, θ₀ + ε).
An example where ε might be extremely close to zero is a test for extra-sensory perception (ESP), with θ = θ₀ representing the hypothesis of no ESP. (The only reason that ε would probably not be zero here is that an experiment designed to test for ESP probably would not lead to a perfectly well-defined value θ₀.) Of course, there are also many decision problems that would lead to a null hypothesis of the aforementioned form with a large ε, but such problems will rarely be well approximated by testing a point null hypothesis.
The question arises, if the realistic null hypothesis is H₀: θ ∈ (θ₀ − ε, θ₀ + ε), when is it reasonable to approximate it by H₀: θ = θ₀? From a Bayesian viewpoint, it will be reasonable if and only if, when we spread the quantity π₀ of prior probability over (θ₀ − ε, θ₀ + ε), the posterior probability of the null hypothesis is close to that obtained when a lump of prior probability π₀ is concentrated on the single value θ = θ₀. This will certainly happen if the likelihood function is approximately constant on (θ₀ − ε, θ₀ + ε), but this is a very strong condition, and one can often get away with less.
4.4.2 A case of nearly constant likelihood
Suppose that x₁, x₂, …, xₙ are independently N(θ, φ) where φ is known. Then we know from Section 2.3 on ‘Several normal observations with a normal prior’ that the likelihood is proportional to an N(x̄, φ/n) density for θ. Now over the interval (θ₀ − ε, θ₀ + ε) this likelihood varies by a factor of

exp[−½(θ₀ + ε − x̄)²/(φ/n)] / exp[−½(θ₀ − ε − x̄)²/(φ/n)] = exp[2εn(x̄ − θ₀)/φ].
It follows that if we define z to be the statistic

z = (x̄ − θ₀)/√(φ/n)

used in classical tests of significance, and

k = ε|z|√(n/φ),

then the likelihood varies over (θ₀ − ε, θ₀ + ε) by a factor which is at most exp(2k). Hence, provided that ε is reasonably small, there is a useful bound on the variation of the likelihood.
For example, if ε can be taken to be 0.0025 and φ = 1, then the likelihood varies by at most exp(2k) over (θ₀ − 0.0025, θ₀ + 0.0025). More specifically, if z = 2 and n = 25, then k becomes

k = 0.0025 × 2 × √(25/1) = 0.025
and exp(2k) = 1.05 = 1/0.95. In summary, if all values within ε = 0.0025 of θ₀ are regarded as indistinguishable from θ₀, then we can feel reassured that the likelihood function does not vary by more than 5% over this range of indistinguishable values, and if the interval can be made even smaller, then the likelihood is even nearer to being constant.
Note that the bound depends on z and n as well as on ε.
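As a quick numerical check of this bound, here is a minimal sketch (the function name is mine; the values ε = 0.0025, z = 2, n = 25 and φ = 1 are those of the example above):

```python
from math import exp, sqrt

def likelihood_variation_bound(eps, z, n, phi):
    """Bound exp(2k), with k = eps*|z|*sqrt(n/phi), on the factor by which an
    N(xbar, phi/n) likelihood for theta varies over (theta0 - eps, theta0 + eps)."""
    k = eps * abs(z) * sqrt(n / phi)
    return exp(2 * k)

print(likelihood_variation_bound(0.0025, 2, 25, 1))   # approximately 1.05
```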
4.4.3 The Bayesian method for point null hypotheses
We shall now develop a theory for testing point null hypotheses, which can then be compared with the classical theory. If there is doubt as to the adequacy of the point null hypothesis as a representation of the real null hypothesis, it is always possible to test an interval null hypothesis directly by Bayesian methods and compare the results (and this will generally be easier than checking the constancy of the likelihood function).
You cannot use a continuous prior density to conduct a test of H₀: θ = θ₀, because that would of necessity give θ = θ₀ a prior probability of zero and hence a posterior probability of zero. A reasonable way of proceeding is to give θ = θ₀ a prior probability of π₀ and to assign a probability density π₁ρ₁(θ) to values θ ≠ θ₀, where π₁ = 1 − π₀ and ρ₁ integrates to unity. If you are thinking of the hypothesis θ = θ₀ as an approximation to a hypothesis θ ∈ (θ₀ − ε, θ₀ + ε), then π₀ is really your prior probability for the whole interval (θ₀ − ε, θ₀ + ε).
You can then derive the predictive density of a vector x of observations in the form

p(x) = π₀ p(x | θ₀) + π₁ ∫ p(x | θ) ρ₁(θ) dθ.
Writing

ρ₁(x) = ∫ p(x | θ) ρ₁(θ) dθ

for what might be called the predictive distribution under the alternative hypothesis, we see that

p(x) = π₀ p(x | θ₀) + π₁ ρ₁(x).
It follows that the posterior probabilities are

p₀ = π₀ p(x | θ₀)/p(x)   and   p₁ = π₁ ρ₁(x)/p(x),
and so the Bayes factor is

B = (p₀/p₁)/(π₀/π₁) = p(x | θ₀)/ρ₁(x).
Of course, it is possible to find the posterior probabilities p₀ and p₁ in terms of the Bayes factor B and the prior probability π₀, as noted in Section 4.1 when hypothesis testing in general was discussed.
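To make the mechanics concrete, here is a minimal numerical sketch (not from the text; the choice of ρ₁ and all numerical values are illustrative assumptions) which computes the predictive density under the alternative, the Bayes factor B and the posterior probability p₀ for a single normal observation:

```python
import numpy as np
from scipy import integrate, stats

def point_null_test(x, theta0, phi, rho1, pi0=0.5):
    """Test of theta = theta0 for a single observation x ~ N(theta, phi).

    rho1 : prior density of theta under the alternative (integrates to one).
    Returns the Bayes factor B = p(x|theta0)/rho1(x) and the posterior p0.
    """
    like = lambda theta: stats.norm.pdf(x, loc=theta, scale=np.sqrt(phi))
    # predictive density under the alternative: integral of p(x|theta) * rho1(theta)
    pred_alt, _ = integrate.quad(lambda th: like(th) * rho1(th), -np.inf, np.inf)
    B = like(theta0) / pred_alt
    p0 = 1.0 / (1.0 + (1 - pi0) / (pi0 * B))
    return B, p0

# illustrative values: theta0 = 0, phi = 1, rho1 an N(0, 1) density
B, p0 = point_null_test(x=1.96, theta0=0.0, phi=1.0,
                        rho1=lambda th: stats.norm.pdf(th, 0.0, 1.0))
print(B, p0)
```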
4.4.4 Sufficient statistics
Sometimes we have a sufficient statistic t = t(x) for x given θ, so that

p(x | θ) = p(t | θ) p(x | t),

where p(x | t) is not a function of θ. Clearly in such a case,

ρ₁(x) = ∫ p(x | θ) ρ₁(θ) dθ = p(x | t) ∫ p(t | θ) ρ₁(θ) dθ = p(x | t) ρ₁(t),

so that we can cancel the common factor p(x | t) to get

p₀ = π₀ p(t | θ₀)/p(t)   and   p₁ = π₁ ρ₁(t)/p(t),   where   p(t) = π₀ p(t | θ₀) + π₁ ρ₁(t),

and the Bayes factor is

B = p(t | θ₀)/ρ₁(t).
In short, x can be replaced by t in the formulas for p₀, p₁ and the Bayes factor B.
Many of the ideas in this section should become clearer when you come to look at Section 4.5, in which the particular case of the normal mean is explored in detail.
4.5 Point null hypotheses for the normal distribution
4.5.1 Calculation of the Bayes’ factor
Suppose that x = (x₁, x₂, …, xₙ) is a vector of independently N(θ, φ) random variables, and that the variance φ is known. Because of the remarks at the end of the last section, we can work entirely in terms of the sufficient statistic

x̄ = (x₁ + x₂ + ⋯ + xₙ)/n.
We have to make some assumption about the density ρ₁(θ) of θ under the alternative hypothesis, and clearly one of the most natural things to do is to suppose that this density is normal, say N(μ, ψ). Strictly, this should be regarded as a density on values of θ other than θ₀, but when probabilities are found by integration of this density, the odd point will make no difference. It will usually seem sensible to take μ = θ₀ as, presumably, values near to θ₀ are more likely than those far away, and this assumption will accordingly be made from now on. We note that the standard deviation √ψ of the density of θ under the alternative hypothesis is supposed to be considerably greater than the width of the interval (θ₀ − ε, θ₀ + ε) of values of θ considered ‘indistinguishable’ from θ₀.
It is quite easy to find the predictive distribution ρ₁(x̄) of x̄ under the alternative hypothesis, namely an N(θ₀, ψ + φ/n) density, by writing

x̄ = θ + (x̄ − θ)

as in Section 2.2 on ‘Normal prior and likelihood’. Then, because, independently of one another, θ ~ N(θ₀, ψ) and x̄ − θ ~ N(0, φ/n), the required density of x̄ is N(θ₀, ψ + φ/n).
It follows that the Bayes factor B is

B = p(x̄ | θ₀)/ρ₁(x̄)
  = {(2πφ/n)^(−1/2) exp[−½(x̄ − θ₀)²/(φ/n)]} / {(2π(ψ + φ/n))^(−1/2) exp[−½(x̄ − θ₀)²/(ψ + φ/n)]}
  = (1 + nψ/φ)^(1/2) exp{−½(x̄ − θ₀)²[(φ/n)⁻¹ − (ψ + φ/n)⁻¹]}.

It is now useful to write

z = (x̄ − θ₀)/√(φ/n)

for the statistic used in classical tests of significance. With this definition,

B = (1 + nψ/φ)^(1/2) exp[−½ z²/(1 + φ/(nψ))].
The posterior probability p₀ can now be found in terms of the prior probability π₀ and the Bayes factor B by the usual formula

p₀ = [1 + (π₁/π₀) B⁻¹]⁻¹

derived in Section 4.1 when we first met hypothesis tests.
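The following minimal sketch (the function names are mine; it simply implements the two formulas above) computes B and p₀ from z, n, φ, ψ and π₀; the values used at the end are those of the numerical example in the next subsection:

```python
from math import exp, sqrt

def bayes_factor_normal(z, n, phi, psi):
    """B = (1 + n*psi/phi)**0.5 * exp(-0.5 * z**2 / (1 + phi/(n*psi)))."""
    return sqrt(1 + n * psi / phi) * exp(-0.5 * z**2 / (1 + phi / (n * psi)))

def posterior_null_prob(B, pi0=0.5):
    """p0 = [1 + (pi1/pi0) * B**(-1)]**(-1) with pi1 = 1 - pi0."""
    return 1.0 / (1.0 + (1 - pi0) / (pi0 * B))

B = bayes_factor_normal(z=1.96, n=15, phi=1.0, psi=1.0)
print(B, posterior_null_prob(B))   # roughly 0.66 and 0.40
```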
4.5.2 Numerical examples
For example, if π₀ = ½ and ψ = φ, then the values z = 1.96 and n = 15 give rise to a Bayes factor

B = (1 + n)^(1/2) exp[−½ z²/(1 + 1/n)] = 4 exp(−1.80) = 0.66

and hence to a posterior probability

p₀ = [1 + (π₁/π₀) B⁻¹]⁻¹ = (1 + 1/0.66)⁻¹ = 0.40.
This result is quite extraordinarily different from the conclusion that a classical statistician would arrive at with the same data. Such a person would say that, since z has a sampling distribution that is N(0, 1) under the null hypothesis, a value of z that is, in modulus, 1.96 or greater would arise with probability only 5% (i.e. the two-tailed P-value of z = 1.96 is 0.05), and consequently would reject the null hypothesis that θ = θ₀ at the 5% level. With the above assumptions about prior beliefs, we have, on the contrary, arrived at a posterior probability of 40% that the null hypothesis is true! Some further sample values for other choices of z and n are given by Berger (1985, Section 4.3).
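Such values are easily computed; the following minimal sketch (my own illustration, fixing z = 1.96, π₀ = ½ and ψ = φ, with an arbitrary grid of sample sizes) shows how the posterior probability of the null hypothesis behaves as n grows:

```python
from math import exp, sqrt

def p0_normal(z, n, pi0=0.5, psi_over_phi=1.0):
    """Posterior probability of H0: theta = theta0 when xbar is based on n
    N(theta, phi) observations and the prior under H1 is N(theta0, psi)."""
    r = n * psi_over_phi
    B = sqrt(1 + r) * exp(-0.5 * z**2 / (1 + 1 / r))
    return 1.0 / (1.0 + (1 - pi0) / (pi0 * B))

for n in (1, 10, 100, 1000, 10000):
    print(n, round(p0_normal(1.96, n), 3))   # p0 increases towards 1 as n grows
```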
The results of classical and Bayesian analyses differ more and more as the sample size n increases. For fixed z, it is easy to see that B is asymptotically

B ≅ (nψ/φ)^(1/2) exp(−½ z²),

and hence B → ∞ as n → ∞. Consequently, 1 − p₀ is of order n^(−1/2), and thus p₀ → 1. So, with the specified prior, the result that z = 1.96, which a classical statistician would regard as just sufficient to result in rejection of the null hypothesis at the 5% level irrespective of the value of n, can result in an arbitrarily high posterior probability p₀ of the null hypothesis. Despite this, beginning users of statistical techniques often get the impression that if some data are significant at the 5% level, then in some sense the null hypothesis has a probability after the event of at most 5%.
A specific example of a problem with a large sample size arises in connection with Weldon’s dice data, quoted by Fisher (1925b, Section 18 and Section 23). It transpired that when 12 dice were thrown 26 306 times, the mean and variance of the number of dice showing more than 4 were 4.0524 and 2.6983, as compared with a theoretical mean of 4 (= 12 × 1/3) for fair dice. Approximating the binomial distribution by a normal distribution leads to a z statistic of

z = (4.0524 − 4)/√(2.6983/26 306) = 5.17.
The corresponding two-tailed P-value is approximately 2Z(z)/z, where Z(z) = (2π)^(−1/2) exp(−½ z²) is the density function of the standard normal distribution (cf. Abramowitz and Stegun, 1965, equation 26.2.12), so about 1 in 4 000 000. However, a Bayesian analysis (assuming π₀ = ½ and ψ = φ as usual) leads to a Bayes factor of

B = (1 + n)^(1/2) exp[−½ z²/(1 + 1/n)] ≈ 0.00025 ≈ 1/4000,
and so to a posterior probability of 1 in 4000 that the dice were fair. This is small, but nevertheless the conclusion is not as startling as that which the classical analysis leads to.
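As a check on these figures, here is a minimal sketch of the calculation (my own restatement of the numbers quoted above):

```python
from math import exp, sqrt, pi

n, xbar, var = 26306, 4.0524, 2.6983   # Weldon's data: 12 dice thrown 26 306 times
theta0 = 4.0                           # theoretical mean for fair dice (12 * 1/3)

z = (xbar - theta0) / sqrt(var / n)                   # about 5.17
p_value = 2 * exp(-0.5 * z**2) / (z * sqrt(2 * pi))   # about 1 in 4 000 000
B = sqrt(1 + n) * exp(-0.5 * z**2 / (1 + 1 / n))      # assuming psi = phi
p0 = 1 / (1 + 1 / B)                                  # pi0 = 1/2; about 1 in 4000

print(z, p_value, B, p0)
```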
4.5.3 Lindley’s paradox
This result is sometimes known as Lindley’s paradox (cf. Bartlett, 1957; Lindley, 1957; Shafer, 1982) and sometimes as Jeffreys’ paradox, because it was in essence known to Jeffreys (see Jeffreys, 1961, Section 5.2), although he did not refer to it as a paradox. A useful recent reference is Berger and Delampady (1987).
It does relate to something which has been noted by users of statistics. Lindley once pointed out (see Zellner, 1974, Section 3.7) that experienced workers often lament that for large sample sizes, say 5000, as encountered in survey data, use of the usual t-statistic and the 5% significance level shows that the values of parameters are usually different from zero, and that many of them sense that with such a large sample the 5% level is not the right thing to use but do not know what else to use (see also Jeffreys, 1961, Appendix B). On the other hand, in many scientific contexts, it is unrealistic to use a very large sample because systematic bias may vitiate it or the observer may tire [see Wilson (1952, Section 9.6) or Baird (1962, Section 2.8)].
Since the result is so different from that found by so many statisticians, it is important to check that it does not depend very precisely on the nature of the prior distribution which led to it.
We assumed that the prior probability π₀ of the null hypothesis was ½, and this assumption does seem ‘natural’ and could be said to be ‘objective’; in any case, a slight change in the value of π₀ would not make much difference to the qualitative feel of the results.
We also assumed that the prior density ρ₁(θ) of θ under the alternative hypothesis was normal of mean θ₀ with some variance ψ. In fact, the precise functional form of ρ₁(θ) does not make a great deal of difference unless |z| is large. Lindley (1957) took ρ₁(θ) to be a uniform distribution over an interval centred on θ₀, while Jeffreys (1961, Section 5.2) argues that it should be a Cauchy distribution, that is,

ρ₁(θ) = √ψ / {π[ψ + (θ − θ₀)²]},
although his arguments are far from overwhelming and do not seem to have convinced anyone else. An examination of their work will show that in general terms they arrive at similar conclusions to those derived earlier.
There is also a scale parameter ψ in the distribution ρ₁(θ) to be decided on (and this is true whether this distribution is normal, uniform or Cauchy). Although it seems reasonable that ψ should be chosen proportional to φ, there does not seem to be any convincing argument for choosing this to have any particular value (although Jeffreys tries to give a rational argument for the Cauchy form in general, he seems to have no argument for the choice of ψ beyond saying that it should be proportional to φ). But it is easily seen that the effect of taking

ψ = kφ

on B and p₀ is just the same as taking ψ = φ if n is multiplied by a factor k. It should be noted that it will not do to let ψ → ∞ and thus to take ρ₁(θ) as a uniform distribution on the whole real line, because this is equivalent to multiplying n by a factor which tends to infinity, and so leads to B → ∞ and p₀ → 1. It would clearly not be sensible to use a procedure which always gave the null hypothesis a posterior probability of unity. In any case, as Jeffreys points out (1961, Section 5.0), ‘the mere fact that it has been suggested that [θ] is zero corresponds to some presumption that it is fairly small’.
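A quick numerical check of this equivalence (a sketch of my own, reusing the formula for B; the values k = 4, n = 15 and z = 1.96 are arbitrary):

```python
from math import exp, sqrt

def B(z, n, psi_over_phi):
    return sqrt(1 + n * psi_over_phi) * exp(-0.5 * z**2 / (1 + 1 / (n * psi_over_phi)))

# taking psi = k*phi with sample size n gives the same B as psi = phi with sample size k*n
k, n, z = 4, 15, 1.96
print(B(z, n, k), B(z, k * n, 1.0))   # the two values agree
```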
4.5.4 A bound which does not depend on the prior distribution
In fact, it is possible to give a bound on B which does not depend on any assumptions about ρ₁(θ). We know that

ρ₁(x̄) = ∫ p(x̄ | θ) ρ₁(θ) dθ ≤ p(x̄ | θ̂),

where θ̂ is the maximum likelihood estimator of θ, that is, the value of θ for which p(x̄ | θ) is a maximum.
In the case being considered, x̄ has a normal distribution of mean θ and variance φ/n, and hence θ̂ = x̄, so that

p(x̄ | θ̂) = (2πφ/n)^(−1/2).
It follows that the Bayes factor satisfies

B = p(x̄ | θ₀)/ρ₁(x̄) ≥ p(x̄ | θ₀)/p(x̄ | θ̂) = exp[−½ n(x̄ − θ₀)²/φ],

so, writing z = (x̄ − θ₀)/√(φ/n) as before, we see that

B ≥ exp(−½ z²),
implying a corresponding lower bound

p₀ ≥ [1 + (π₁/π₀) exp(½ z²)]⁻¹

on p₀. Some sample values (assuming that π₀ = ½) are easily computed; see the sketch at the end of this subsection.
[cf. Berger, 1985, Section 4.3; Berger further claims that, if π₀ = ½ and z > 1.68, a somewhat stronger bound on p₀ can be given.] Note that this bound does not depend on the sample size n and so does not demonstrate Lindley’s paradox.
As an example, if z = 1.96 then the Bayes factor B is at least 0.146 and hence the posterior probability of the null hypothesis is at least 0.128. Unlike the results derived earlier assuming a more precise form for ρ₁(θ), the bounds no longer depend on the sample size, but it should be noted that the conclusion still does not accord at all well with the classical result of significance at the 5% level.
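A minimal sketch of these bound calculations (my own, assuming π₀ = ½ as above; the grid of z values is an arbitrary choice):

```python
from math import exp

def bound_B(z):
    """Lower bound exp(-z**2/2) on the Bayes factor, valid whatever rho1 may be."""
    return exp(-0.5 * z**2)

def bound_p0(z, pi0=0.5):
    """Corresponding lower bound on the posterior probability of the null hypothesis."""
    B = bound_B(z)
    return 1.0 / (1.0 + (1 - pi0) / (pi0 * B))

for z in (1.645, 1.960, 2.576, 3.291):
    print(z, round(bound_B(z), 3), round(bound_p0(z), 3))
# for z = 1.96 this gives bounds of about 0.146 and 0.128, as in the example above
```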
4.5.5 The case of an unknown variance
In the case where φ is unknown, similar conclusions follow, although there are a few more complications. It will do no harm if the rest of this section is ignored at a first reading (or even at a second).
We need first to find the density p(x | θ₀). If φ is unknown, then, as was shown in Section 2.12 on ‘Normal mean and variance both unknown’,

p(x | θ, φ) ∝ φ^(−n/2) exp[−½{S + n(x̄ − θ)²}/φ],

where S = Σ(xᵢ − x̄)². Using a reference prior p(φ) ∝ 1/φ for φ, it is easy to integrate φ out much as was done there to get

p(x | θ₀) ∝ {1 + t²/(n − 1)}^(−n/2),

where t = (x̄ − θ₀)/(s/√n) and s² = S/(n − 1).
It is now necessary to find the predictive density under the alternative hypothesis. To do this, first return to

p(x | θ, φ) ∝ φ^(−n/2) exp[−½{S + n(x̄ − θ)²}/φ].

Assuming, as before, a prior ρ₁(θ) which is N(θ₀, ψ), we can integrate θ out; thus,

∫ p(x | θ, φ) ρ₁(θ) dθ ∝ φ^(−n/2) exp(−½ S/φ) ∫ (2πψ)^(−1/2) exp[−½ n(x̄ − θ)²/φ − ½(θ − θ₀)²/ψ] dθ.

The last integral is of course proportional to the density at x̄ of an N(θ₀, ψ + φ/n) distribution; more precisely, it equals

(1 + nψ/φ)^(−1/2) exp[−½(x̄ − θ₀)²/(ψ + φ/n)],

while a little manipulation shows that

(x̄ − θ₀)²/(ψ + φ/n) = n(x̄ − θ₀)²/{φ(1 + nψ/φ)}.

It follows that

∫ p(x | θ, φ) ρ₁(θ) dθ ∝ φ^(−n/2) (1 + nψ/φ)^(−1/2) exp[−½{S + n(x̄ − θ₀)²/(1 + nψ/φ)}/φ].
To go any further, it is necessary to make some assumption about the relationship between φ and ψ. If it is assumed that

ψ = φ

and a reference prior p(φ) ∝ 1/φ is used, then the predictive distribution under the alternative hypothesis becomes

ρ₁(x) ∝ (n + 1)^(−1/2) {1 + t²/(n² − 1)}^(−n/2),

where t is the same statistic encountered in the case of the null hypothesis. It follows that the Bayes factor is

B = (n + 1)^(1/2) [{1 + t²/(n² − 1)}/{1 + t²/(n − 1)}]^(n/2),
and hence it is possible to find p₀ and p₁.
It should be noted that as n → ∞ the exponential limit (1 + a/m)^m → exp(a) shows that the Bayes factor is asymptotically

B ≅ (n + 1)^(1/2) exp(−½ t²),

which as n → ∞ is the same as in the known variance case.
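A minimal sketch of this calculation (assuming the form of B just derived, with ψ = φ, π₀ = ½ and a reference prior for φ; the illustrative values of t and n are my own):

```python
from math import exp, sqrt

def bayes_factor_unknown_variance(t, n):
    """B = (n+1)**0.5 * [(1 + t**2/(n**2 - 1)) / (1 + t**2/(n - 1))]**(n/2)."""
    return sqrt(n + 1) * ((1 + t**2 / (n**2 - 1)) / (1 + t**2 / (n - 1))) ** (n / 2)

def asymptotic_bayes_factor(t, n):
    """Large-n approximation (n+1)**0.5 * exp(-t**2/2)."""
    return sqrt(n + 1) * exp(-0.5 * t**2)

for n in (15, 100, 1000):
    print(n, bayes_factor_unknown_variance(2.13, n), asymptotic_bayes_factor(2.13, n))
```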
4.6 The Doogian philosophy
4.6.1 Description of the method
Good (1983, Chapter 4 and elsewhere) has argued in favour of a compromise between Bayesian and non-Bayesian approaches to hypothesis testing. His technique can be summarized as follows (in his own words):
The Bayes/non-Bayes synthesis is the following technique for synthesizing subjective and objective techniques in statistics. (i) We use the neo-Bayes/Laplace philosophy (i.e. the techniques described in Section 4.4 on point null hypotheses with prior information) in order to arrive at a factor F (which is 1/B in the notation used here) in favour of the non-null hypothesis. For the particular case of discrimination between two simple statistical hypotheses, the factor in favour is equal to the likelihood ratio [as was shown in Section 4.1 when hypothesis testing was first considered], but not in general. The neo-Bayes/Laplace philosophy usually works with inequalities between probabilities, but for definiteness we here assume that the initial distributions are taken as precise, though not necessarily uniform. (ii) We then use F as a statistic and try to obtain its distribution on the null hypothesis, and work out its tail area, P. (iii) Finally, we look to see if F lies in the range (1/(30P), 3/(10P)). If it does not lie in this range, we think again. (Note that F is here the factor against H.)
This is certainly not unworkable. For example, in Section 4.5 we found that

B = (1 + nψ/φ)^(1/2) exp[−½ z²/(1 + φ/(nψ))],

so that B is a monotonic function of z², and hence the probability, under the null hypothesis, that F = 1/B is at least as large as its observed value equals the (two-tailed) P-value corresponding to the observed value of z, which is easily found as z has a standard normal distribution.
4.6.2 Numerical example
Thus if, as in an example discussed in Section 4.5, n = 15 and the density of θ under the alternative hypothesis has mean θ₀ and variance ψ = φ (and so is N(θ₀, φ)), then for z = 1.96 the P-value is P = 0.05 and the Bayes factor is B = 0.66 = 1/1.5. Good’s method then asks us to check whether F = 1.5 lies in the range (1/(30P), 3/(10P)), that is, (0.67, 6.0). Consequently, we do not in this case need to ‘think again’.
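A minimal sketch of this check (my own; it combines the Bayes factor formula from Section 4.5 with a normal two-tailed P-value):

```python
from math import erf, exp, sqrt

def two_tailed_p(z):
    """Two-tailed P-value for a standard normal statistic z."""
    return 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))

def doogian_check(z, n, psi_over_phi=1.0):
    """Return F = 1/B, the P-value and whether F lies in (1/(30P), 3/(10P))."""
    B = sqrt(1 + n * psi_over_phi) * exp(-0.5 * z**2 / (1 + 1 / (n * psi_over_phi)))
    F = 1 / B
    P = two_tailed_p(z)
    return F, P, 1 / (30 * P) < F < 3 / (10 * P)

print(doogian_check(1.96, 15))   # F about 1.5, P about 0.05, in range: no need to 'think again'
```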
Good attributes this approach to ‘the Tibetan lama K. Caj Doog’, but it does not appear that the lama has made many converts apart from Good himself.