by Peter M Lee
3.6 Reference prior for the uniform distribution
3.6.1 Lower limit of the interval fixed
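If $x = (x_1, x_2, \dots, x_n)$ consists of independent random variables with $\mathrm{U}(0, \theta)$ distributions, then

$$p(x \mid \theta) = \begin{cases} \theta^{-n} & (\theta > M) \\ 0 & \text{(otherwise)} \end{cases}$$

where $M = \max\{x_1, x_2, \dots, x_n\}$, so that the likelihood can be written in the form

$$l(\theta \mid x) = (M/\theta)^n \, I_{(M,\infty)}(\theta)$$

after multiplying by a constant (as far as θ is concerned). Hence,

$$l(\theta \mid x) = g(\psi(\theta) - t(x))$$

with

$$g(y) = \exp(-ny)\, I_{(0,\infty)}(y), \qquad t(x) = \log M, \qquad \psi(\theta) = \log\theta.$$

It follows that the likelihood is data translated, and the general argument about data translated likelihoods in Section 2.5 now suggests that we take a prior which is at least locally uniform in $\psi = \log\theta$, that is $p(\psi) \propto 1$. In terms of the parameter θ, the usual change of variable rule shows that this means

$$p(\theta) \propto 1/\theta$$

which is the same prior that is conventionally used for variances. This is not a coincidence, but represents the fact that both are measures of spread (strictly, the standard deviation is more closely analogous to θ in this case, but a prior for the variance proportional to the reciprocal of the variance corresponds to a prior for the standard deviation proportional to the reciprocal of the standard deviation). As the density of $\mathrm{Pa}(\xi, \gamma)$ is proportional to $\theta^{-\gamma-1} I_{(\xi,\infty)}(\theta)$, the density proportional to $1/\theta$ can be regarded as the limit Pa(0, 0) of $\mathrm{Pa}(\xi, \gamma)$ as $\xi \to 0$ and $\gamma \to 0$. Certainly, if the likelihood is $\theta^{-n} I_{(M,\infty)}(\theta)$, then the posterior is Pa(M, n), which is what would be expected when the general rule is applied to the particular case of a Pa(0, 0) prior.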
3.6.2 Example
A very artificial example can be obtained by taking groups of random digits from Neave (1978, Table 7.1) ignoring all values greater than some value θ [an alternative source of random digits is Lindley and Scott (1995, Table 27)]. A sample of 10 such values is:
This sample was constructed using a known value of θ, but we want to investigate how far this method succeeds in giving information about θ, so we note that the posterior is Pa(0.620 58, 10). Since the density function of a Pareto distribution $\mathrm{Pa}(\xi, \gamma)$ decreases monotonically beyond ξ, an HDR must be of the form (ξ, x) for some x, and since the distribution function is (see Appendix A)

$$F(x) = 1 - (\xi/x)^{\gamma} \qquad (x > \xi)$$

a 90% HDR for θ is (0.620 58, x) where x is such that

$$0.90 = 1 - (0.620\,58/x)^{10}$$

and so is 0.781 26. Thus, a 90% HDR for θ is the interval (0.62, 0.78). We can see that the true value of θ in this artificial example does turn out to lie in the 90% HDR.
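The arithmetic behind this interval can be checked with a few lines of Python. This is only a minimal sketch: it assumes the standard Pareto distribution function $F(x) = 1 - (\xi/x)^{\gamma}$ for $x > \xi$, and the function name `pareto_hdr_upper` is introduced here for illustration.

```python
def pareto_hdr_upper(xi, gamma, prob):
    """Upper end x of the HDR (xi, x) for a Pa(xi, gamma) distribution.

    Because the Pareto density decreases beyond xi, the HDR starts at xi,
    and x solves prob = 1 - (xi / x) ** gamma.
    """
    if not 0.0 < prob < 1.0:
        raise ValueError("prob must lie strictly between 0 and 1")
    return xi * (1.0 - prob) ** (-1.0 / gamma)


# Posterior Pa(M, n) from the example: M is the sample maximum, n the sample size.
M, n = 0.62058, 10
upper = pareto_hdr_upper(M, n, 0.90)
print(f"90% HDR for theta: ({M:.5f}, {upper:.5f})")
```

Changing `prob` gives HDRs of other sizes from the same posterior.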
3.6.3 Both limits unknown
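In the two parameter case, when $x \sim \mathrm{U}(\alpha, \beta)$ where $\alpha < \beta$ are both unknown and x is any one of the observations, note that it is easily shown that

$$\theta = \mathrm{E}x = \tfrac{1}{2}(\alpha + \beta), \qquad \phi = \mathrm{Var}\,x = (\beta - \alpha)^2/12.$$

Very similar arguments to those used in the case of the normal distribution with mean and variance both unknown in Section 2.12 can now be deployed to suggest independent priors uniform in θ and $\log\phi$, so that

$$p(\theta, \phi) \propto 1/\phi.$$

But

$$\frac{\partial(\theta, \phi)}{\partial(\alpha, \beta)} = \begin{vmatrix} \tfrac{1}{2} & \tfrac{1}{2} \\ -(\beta-\alpha)/6 & (\beta-\alpha)/6 \end{vmatrix} \propto (\beta - \alpha)$$

so that this corresponds to

$$p(\alpha, \beta) = p(\theta, \phi) \left| \frac{\partial(\theta, \phi)}{\partial(\alpha, \beta)} \right| \propto \frac{1}{(\beta-\alpha)^2}\,(\beta - \alpha) \propto \frac{1}{\beta - \alpha}.$$

It may be noted that the density of $\mathrm{Pabb}(\xi, \eta, \gamma)$ is proportional to

$$(\beta - \alpha)^{-\gamma-2}\, I_{(\xi,\infty)}(\beta)\, I_{(-\infty,\eta)}(\alpha)$$

so that in some sense a density

$$p(\alpha, \beta) \propto (\beta - \alpha)^{-2}\, I_{(\xi,\infty)}(\beta)\, I_{(-\infty,\xi)}(\alpha)$$

might be regarded as a limit of $\mathrm{Pabb}(\xi, \eta, \gamma)$ as $\eta \to \xi$ and $\gamma \to 0$. Integrating over a uniform prior for ξ, which might well seem reasonable, gives

$$p(\alpha, \beta) \propto \int (\beta - \alpha)^{-2}\, I_{(\xi,\infty)}(\beta)\, I_{(-\infty,\xi)}(\alpha)\, d\xi = \int_{\alpha}^{\beta} (\beta - \alpha)^{-2}\, d\xi = \frac{1}{\beta - \alpha}.$$

If the likelihood takes the form

$$l(\alpha, \beta \mid x) \propto (\beta - \alpha)^{-n}\, I_{(M,\infty)}(\beta)\, I_{(-\infty,m)}(\alpha)$$

then the posterior from this prior is Pabb(M, m, n–1). Thus, our reference prior could be regarded as a $\mathrm{Pabb}(-\infty, \infty, -1)$ distribution, and if we think of it as such, the same formulae as before can be used.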
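The rule $p(\alpha, \beta) \propto (\beta - \alpha)^{-2}$, or $\mathrm{Pabb}(-\infty, \infty, 0)$, corresponds to $p(\theta, \phi) \propto \phi^{-3/2}$, which is the prior Jeffreys’ rule gave us in the normal case with both parameters unknown (see Section 3.3 on Jeffreys’ rule).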
3.7 The tramcar problem
3.7.1 The discrete uniform distribution
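Occasionally, we encounter problems to do with the discrete uniform distribution. We say that x has a discrete uniform distribution on $[\alpha, \beta]$ and write

$$x \sim \mathrm{UD}(\alpha, \beta)$$

if

$$p(x) = (\beta - \alpha + 1)^{-1} \qquad (x = \alpha, \alpha + 1, \dots, \beta).$$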
One context in which it arises was cited by Jeffreys (1961, Section 4.8). He says,
The following problem was suggested to me several years ago by Professor M. H. A. Newman. A man travelling in a foreign country has to change trains at a junction and goes into the town, of the existence of which he has only just heard. He has no idea of its size. The first thing that he sees is a tramcar numbered 100. What can he infer about the number of tramcars in the town? It may be assumed for the purpose that they are numbered consecutively from 1 upwards.
Clearly, if there are N tramcars in the town and you are equally likely to see any one of the tramcars, then the number n of the car you observe has a discrete uniform distribution $\mathrm{UD}(1, N)$. Jeffreys suggests that (assuming N is not too small) we can deal with this problem by analogy with problems involving a continuous distribution $\mathrm{U}(0, N)$. In the absence of prior information, the arguments of Section 3.6 suggest a reference prior $p(N) \propto 1/N$ in the latter case, so his suggestion is that the prior for N in a problem involving a discrete uniform distribution should be, at least approximately, proportional to $1/N$. But if

$$p(N) \propto 1/N \quad (N = 1, 2, 3, \dots), \qquad p(n \mid N) = 1/N \quad (n = 1, 2, \dots, N)$$

then by Bayes’ Theorem

$$p(N \mid n) \propto N^{-2} \qquad (N = n, n+1, n+2, \dots).$$

It follows that

$$p(N \mid n) = N^{-2} \Big/ \sum_{m \geq n} m^{-2}.$$

In particular, the posterior probability that $N \geq \lambda n$ is approximately

$$\sum_{m \geq \lambda n} m^{-2} \Big/ \sum_{m \geq n} m^{-2}.$$

Approximating the sums by integrals and noting that $\int_M^{\infty} m^{-2}\, dm = M^{-1}$, this is approximately $\lambda^{-1}$. Consequently, the posterior median is 2n, and so 200 if you observed tramcar number 100.
The argument seems rather unconvincing, because it puts quite a lot of weight on the prior as opposed to the likelihood and yet the arguments for the prior are not all that strong, but we may agree with Jeffreys that it may be ‘worth recording’. It is hard to take the reference prior suggested terribly seriously, although if you had a lot more data, then it would not matter what prior you took.
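The integral approximation can be compared with a direct summation. The sketch below assumes the posterior $p(N \mid n) \propto N^{-2}$ for $N \geq n$ that follows from a prior proportional to $1/N$; the function name, the truncation point `n_max`, and the integral correction for the neglected tail are choices made for this illustration, not part of Jeffreys’ argument.

```python
def tramcar_posterior_median(n, n_max=200_000):
    """Posterior median of N when p(N | n) is proportional to N**-2 for N >= n.

    The infinite sum is truncated at n_max; the mass beyond n_max is
    approximated by the integral of m**-2 from n_max + 0.5 to infinity.
    """
    weights = [1.0 / (m * m) for m in range(n, n_max + 1)]
    tail = 1.0 / (n_max + 0.5)  # approximate sum of m**-2 over m > n_max
    total = sum(weights) + tail
    cumulative = 0.0
    for m, w in zip(range(n, n_max + 1), weights):
        cumulative += w
        if cumulative >= 0.5 * total:
            return m  # smallest N with posterior P(N <= m) >= 1/2
    raise ValueError("n_max is too small to reach the median")


print(tramcar_posterior_median(100))
```

For n = 100 the exact discrete median comes out very close to the value 2n = 200 given by the integral approximation.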
3.8 The first digit problem; invariant priors
3.8.1 A prior in search of an explanation
The problem we are going to consider in this section is not really one of statistical inference as such. What is introduced here is another argument that can sometimes be taken into account in deriving a prior distribution – that of invariance. To introduce the notion, we consider a population which appears to be invariant in a particular sense.
3.8.2 The problem
The problem we are going to consider in this section has a long history going back to Newcomb (1881). Recent references include Knuth (1969, Section 4.2.4B), Raimi (1976) and Turner (1987).
Newcomb’s basic observation, in the days when large tables of logarithms were in frequent use, was that the early pages of such tables tended to look dirtier and more worn than the later ones. This appears to suggest that numbers whose logarithms we need to find are more likely to have 1 as their first digit than 9. If you then look up a few tables of physical constants, you can get some idea as to whether this is borne out. For example, Whitaker’s Almanack (1988, p. 202) quotes the areas of 40 European countries (in square kilometres) as
28 778; 453; 83 849; 30 513; 110 912; 9251; 127 869; 43 069; 1399; 337 032; 547 026; 108 178; 248 577; 6; 131 944; 93 030; 103 000; 70 283; 301 225; 157; 2586; 316; 1; 40 844; 324 219; 312 677; 92 082; 237 500; 61; 504 782; 449 964; 41 293; 23 623; 130 439; 20 768; 78 772; 14 121; 5 571 000; 0.44; 255 804.
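The first significant digits of these are distributed as follows:

| Digit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Frequency | 10 | 7 | 6 | 6 | 3 | 2 | 2 | 1 | 3 | 40 |
| Percentage | 25 | 17.5 | 15 | 15 | 7.5 | 5 | 5 | 2.5 | 7.5 | 100 |

We will see that there are grounds for thinking that the distribution should be approximately as follows:

| Digit | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Total |
|---|---|---|---|---|---|---|---|---|---|---|
| Percentage | 30.1 | 17.6 | 12.5 | 9.7 | 7.9 | 6.7 | 5.8 | 5.1 | 4.6 | 100 |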
3.8.3 A solution
An argument for this distribution runs as follows. The quantities we measure are generally measured in an arbitrary scale, and we would expect that if we measured them in another scale (thus in the case of the aforementioned example, we might measure areas in square miles instead of square kilometres), then the population of values (or at least of their first significant figures) would look much the same, although individual values would of course change. This implies that if θ is a randomly chosen constant, then for any fixed c the transformation

$$\theta \to \psi = c\theta$$

should leave the probability distribution of values of constants alone. This means that if the functional form of the density of values of θ is

$$p(\theta) = f(\theta)$$

then the corresponding density of values of ψ will be

$$p(\psi) = f(\psi).$$

Using the usual change-of-variable rule, we know that $p(\psi) = p(\theta)\,|d\theta/d\psi| = c^{-1} p(\theta)$, so that we are entitled to deduce that

$$f(c\theta) = f(\psi) = c^{-1} f(\theta).$$

But if f is any function such that $f(\theta) = c f(c\theta)$ for all c and θ, then we may take $c = 1/\theta$ to see that $f(\theta) = (1/\theta) f(1)$, so that $f(\theta) \propto 1/\theta$. It seems, therefore, that the distribution of constants that are likely to arise in a scientific context should, at least approximately, satisfy

$$p(\theta) \propto 1/\theta.$$
Naturally, the reservations expressed in Section 2.5 on locally uniform priors about the use of improper priors as representing genuine prior beliefs over a whole infinite range still apply. But it is possible to regard the prior $p(\theta) \propto 1/\theta$ for such constants as valid over any interval (a, b) where $0 < a < b < \infty$ which is not too large. So consider those constants between

$$10^k = a \quad \text{and} \quad 10^{k+1} = b.$$

Because

$$\int_a^b \frac{d\theta}{\theta} = \log_e(b/a) = \log_e 10$$

the prior density for constants θ between a and b is

$$\frac{1}{\theta \log_e 10}$$

and so the probability that such a constant has first digit d, that is, that it lies between da and (d+1)a, is

$$\int_{da}^{(d+1)a} \frac{d\theta}{\theta \log_e 10} = \frac{\log_e(1 + d^{-1})}{\log_e 10} = \log_{10}(1 + d^{-1}).$$

Since this is true for all values of k, and any constant lies between $10^k$ and $10^{k+1}$ for some k, it seems reasonable to conclude that the probability that a physical constant has first digit d is approximately

$$\log_{10}(1 + d^{-1})$$

which is the distribution tabulated earlier. This is sometimes known as Benford’s Law because of the work of Benford (1938) on this problem.
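The comparison between the observed first digits and the probabilities $\log_{10}(1 + d^{-1})$ is easy to reproduce. The following sketch uses the 40 areas quoted above from Whitaker’s Almanack; the helper name `first_digit` is introduced here for illustration and reads the leading digit off the number’s scientific-notation representation.

```python
import math
from collections import Counter

# Areas of 40 European countries (square kilometres), as quoted in the text.
areas = [
    28778, 453, 83849, 30513, 110912, 9251, 127869, 43069, 1399, 337032,
    547026, 108178, 248577, 6, 131944, 93030, 103000, 70283, 301225, 157,
    2586, 316, 1, 40844, 324219, 312677, 92082, 237500, 61, 504782,
    449964, 41293, 23623, 130439, 20768, 78772, 14121, 5571000, 0.44, 255804,
]


def first_digit(x):
    """First significant digit of a positive number."""
    return int(f"{x:e}"[0])  # e.g. 0.44 -> '4.400000e-01' -> 4


counts = Counter(first_digit(a) for a in areas)
n = len(areas)

print("digit  freq  observed%  benford%")
for d in range(1, 10):
    observed = 100.0 * counts[d] / n
    benford = 100.0 * math.log10(1.0 + 1.0 / d)
    print(f"{d:5d}  {counts[d]:4d}  {observed:9.1f}  {benford:8.1f}")
```

With only 40 values the agreement can be rough at best, but the general decline in frequency from digit 1 to digit 9 is visible.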
This sub-section was headed ‘A solution’ rather than ‘The solution’ because a number of other reasons for this density have been adduced. Nevertheless, it is quite an interesting solution. It also leads us into the whole notion of invariant priors.
It has been noted that falsified data are rarely adjusted so as to comply with Benford’s Law, and this has been proposed as a method of detecting such data in, for example, clinical trials (see Weir and Murray, 2011). Rauch et al. (2011) pointed out that, among the deficit data reported to Eurostat, the Greek data relevant to the euro deficit criteria showed the greatest deviation from Benford’s Law, and that this fact should have given rise to suspicion.
Benford’s Law can be related to another empirical law, Zipf’s Law, which was originally proposed by Zipf (1935) and states that the relative frequency of the kth most common word in a list of n words is approximately proportional to 1/k, so that the most frequent word will occur approximately twice as often as the second most frequent word, three times as often as the third most frequent word, and so on. The relationship between this law and Benford’s Law is explored in Pietronero et al. (2001).
3.8.4 Haar priors
It is sometimes the case that your prior beliefs about a parameter θ are in some sense symmetric. Now when a mathematician hears of symmetry, he or she tends immediately to think of groups, and the notions described earlier generalize very easily to general symmetry groups. If the parameter values θ can be thought of as members of an abstract group Θ, then the fact that your prior beliefs about θ are not altered when the values of θ are all multiplied by the same value c can be expressed by saying that the transformation

$$\theta \to c\theta$$

should leave the probability distribution of values of the parameter alone. A density which is unaltered by this operation for arbitrary values of c is known as a Haar measure or, in this context, as a Haar prior or an invariant prior. Such priors are, in general, unique (at least up to multiplicative constants about which there is an arbitrariness if the priors are improper). This is just the condition used earlier to deduce Benford’s Law, except that $c\theta$ is now to be interpreted in terms of the multiplicative operation of the symmetry group, which will not, in general, be ordinary multiplication.
This gives another argument for a uniform prior for the mean θ of a normal distribution of known variance, since it might well seem that adding the same constant to all possible values of the mean would leave your prior beliefs unaltered – there seems to be a symmetry under additive operations. If this is so, then the transformation

$$\theta \to c + \theta$$

should leave the functional form of the prior density for θ unchanged, and it is easy to see that this is the case if and only if $p(\theta)$ is constant. A similar argument about the multiplicative group might be used about an unknown variance φ when the mean is known to produce the usual reference prior $p(\phi) \propto 1/\phi$. A good discussion of this approach and some references can be found in Berger (1985, Section 3.3.2).
3.9 The circular normal distribution
3.9.1 Distributions on the circle
In this section, the variable is an angle running from 0° to 360°, that is, from 0 to 2π radians. Such variables occur in a number of contexts, for example in connection with the homing ability of birds and in various problems in astronomy and crystallography. Useful references for such problems are Mardia (1972), Mardia and Jupp (2001), and Batschelet (1981). The method used here is a naïve numerical integration technique; for a modern approach using Markov chain Monte Carlo (MCMC) methods, see Damian and Walker (1999).
The only distribution for such angles which will be considered is the so-called circular normal or von Mises’ distribution. An angle η is said to have such a distribution with mean direction μ and concentration parameter κ if

$$p(\eta \mid \mu, \kappa) = \{2\pi I_0(\kappa)\}^{-1} \exp\{\kappa \cos(\eta - \mu)\} \qquad (0 \leq \eta < 2\pi)$$

and when this is so we write

$$\eta \sim \mathrm{M}(\mu, \kappa).$$

The function $I_0(\kappa)$ is the modified Bessel function of the first kind and order zero, but as far as we are concerned it may as well be regarded as defined by

$$I_0(\kappa) = (2\pi)^{-1} \int_0^{2\pi} \exp\{\kappa \cos\eta\}\, d\eta.$$

It is tabulated in many standard tables, for example, British Association (1937) or Abramowitz and Stegun (1965, Section 9.7.1). It can be shown that

$$I_0(\kappa) = \sum_{r=0}^{\infty} \frac{(\tfrac{1}{2}\kappa)^{2r}}{(r!)^2}.$$

The circular normal distribution was originally introduced by von Mises (1918). It plays a prominent role in statistical inference on the circle and in that context its importance is almost the same as that of the normal distribution on the line. There is a relationship with the normal distribution, since as $\kappa \to \infty$ the distribution of

$$\sqrt{\kappa}\,(\eta - \mu)$$

approaches the standard normal form N(0, 1) and hence $\mathrm{M}(\mu, \kappa)$ is approximately $\mathrm{N}(\mu, 1/\kappa)$. It follows that the concentration parameter is analogous to the precision of a normal distribution. This is related to the fact that asymptotically for large κ

$$I_0(\kappa) \sim (2\pi\kappa)^{-1/2} \exp(\kappa).$$

However, the equivalent of the Central Limit Theorem does not result in convergence to the circular normal distribution. Further, the circular normal distribution is not in the exponential family. It should not be confused with the so-called wrapped normal distribution.
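The two standard descriptions of $I_0(\kappa)$, as the integral $(2\pi)^{-1}\int_0^{2\pi}\exp(\kappa\cos\eta)\,d\eta$ and as the power series $\sum_r (\kappa/2)^{2r}/(r!)^2$, can be checked against each other numerically. The sketch below is illustrative only: the function names, the number of series terms and the midpoint-rule step count are choices made here.

```python
import math


def i0_series(kappa, terms=30):
    """I_0(kappa) from the power series sum_r (kappa/2)**(2r) / (r!)**2."""
    half_sq = (kappa / 2.0) ** 2
    total, term = 0.0, 1.0  # term holds (kappa/2)**(2r) / (r!)**2, starting at r = 0
    for r in range(terms):
        total += term
        term *= half_sq / ((r + 1) ** 2)
    return total


def i0_integral(kappa, steps=2000):
    """I_0(kappa) from its integral definition, using the midpoint rule."""
    h = 2.0 * math.pi / steps
    total = sum(math.exp(kappa * math.cos((j + 0.5) * h)) for j in range(steps))
    return total * h / (2.0 * math.pi)


for kappa in (0.5, 1.0, 2.0, 10.0):
    asymptotic = math.exp(kappa) / math.sqrt(2.0 * math.pi * kappa)
    print(kappa, i0_series(kappa), i0_integral(kappa), asymptotic)
```

The last column shows the large-κ approximation $(2\pi\kappa)^{-1/2}\exp(\kappa)$, which is poor for small κ and improves as κ grows.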
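The likelihood of n observations $\eta = (\eta_1, \eta_2, \dots, \eta_n)$ from an $\mathrm{M}(\mu, \kappa)$ distribution is

$$l(\mu, \kappa \mid \eta) \propto \{I_0(\kappa)\}^{-n} \exp\Big\{\kappa \sum \cos(\eta_i - \mu)\Big\} = \{I_0(\kappa)\}^{-n} \exp\Big\{\kappa \cos\mu \sum \cos\eta_i + \kappa \sin\mu \sum \sin\eta_i\Big\}$$

so that if we define

$$c = n^{-1} \sum \cos\eta_i, \qquad s = n^{-1} \sum \sin\eta_i$$

then (c, s) is sufficient for $(\mu, \kappa)$ given η, and indeed

$$l(\mu, \kappa \mid \eta) \propto \{I_0(\kappa)\}^{-n} \exp\{n\kappa c \cos\mu + n\kappa s \sin\mu\}.$$

If we define

$$\rho = \sqrt{c^2 + s^2}, \qquad \hat\mu = \tan^{-1}(s/c)$$

then we get

$$c = \rho \cos\hat\mu, \qquad s = \rho \sin\hat\mu$$

and hence

$$l(\mu, \kappa \mid \eta) \propto \{I_0(\kappa)\}^{-n} \exp\{n\kappa\rho \cos(\mu - \hat\mu)\}.$$

(It may be worth noting that it can be shown by differentiating $\rho^2$ with respect to the $\eta_i$ that ρ is a maximum when all the observations are equal and that it then equals unity.) It is easy enough now to construct a family of conjugate priors, but for simplicity let us consider a reference prior

$$p(\mu, \kappa) \propto 1.$$

It seems reasonable enough to take a uniform prior in μ and to take independent priors for μ and κ, but it is not so clear that a uniform prior in κ is sensible. Schmitt (1969, Section 10.2) argues that a uniform prior in κ is a sensible compromise and notes that there are difficulties in using a prior proportional to $1/\kappa$ since, unlike the precision of a normal variable, the concentration parameter of a circular normal distribution can actually equal zero. If this is taken as the prior, then of course

$$p(\mu, \kappa \mid \eta) \propto \{I_0(\kappa)\}^{-n} \exp\{n\kappa\rho \cos(\mu - \hat\mu)\}.$$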
3.9.2 Example
Batschelet (1981, Example 4.3.1) quotes data on the time of day of major traffic accidents in a major city. In an obvious sense, the time of day can be regarded as a circular measure, and it is meaningful to ask what is the mean time of day at which accidents occur and how tightly clustered about this time these times are. Writing

$$\eta = 360\{h + (m/60)\}/24$$

for the angle in degrees corresponding to a time of h hours and m minutes, the n = 21 observations are as follows:
This results in values of c and s, and hence of ρ and (allowing for the signs of c and s) of $\hat\mu$, the latter being about 248.5° (or in terms of a time scale 16h 34m), and so the posterior density takes the form

$$p(\mu, \kappa \mid \eta) \propto \{I_0(\kappa)\}^{-21} \exp\{21\kappa\rho \cos(\mu - \hat\mu)\}$$

where ρ and $\hat\mu$ take these values. It is, however, difficult to understand what this means without experience of this distribution, and yet there is no simple way of finding HDRs. This, indeed, is one reason why a consideration of the circular normal distribution has been included, since it serves to emphasize that there are cases where it is difficult if not impossible to avoid numerical integration.
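The summary statistics used in this example can be sketched in code. The times listed below are hypothetical and are not Batschelet’s data; the function names are likewise introduced only for illustration. Using `atan2` rather than a plain inverse tangent is one way of "allowing for the signs of c and s", since it places the mean direction in the correct quadrant.

```python
import math


def time_to_angle(hours, minutes):
    """Convert a time of day to an angle in degrees (24 hours = 360 degrees)."""
    return 360.0 * (hours + minutes / 60.0) / 24.0


def circular_summary(angles_deg):
    """Return (c, s, rho, mu_hat) for angles given in degrees.

    c and s are the mean cosine and mean sine, rho = sqrt(c**2 + s**2),
    and mu_hat is the mean direction in degrees, reduced to [0, 360).
    """
    n = len(angles_deg)
    c = sum(math.cos(math.radians(a)) for a in angles_deg) / n
    s = sum(math.sin(math.radians(a)) for a in angles_deg) / n
    rho = math.hypot(c, s)
    mu_hat = math.degrees(math.atan2(s, c)) % 360.0
    return c, s, rho, mu_hat


# Hypothetical accident times (hours, minutes), for illustration only.
times = [(14, 30), (16, 0), (17, 45), (18, 10), (15, 20), (9, 5)]
angles = [time_to_angle(h, m) for h, m in times]
c, s, rho, mu_hat = circular_summary(angles)
print(f"c = {c:.4f}, s = {s:.4f}, rho = {rho:.4f}, mu_hat = {mu_hat:.1f} degrees")
```

A value of ρ near 1 indicates tightly clustered times, while a value near 0 indicates times spread around the whole day.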
3.9.3 Construction of an HDR by numerical integration
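By writing $\lambda = (\kappa/2)^2$ and taking the first few terms in the power series quoted earlier for $I_0(\kappa)$, we see that for $0 \leq \kappa \leq 2.0$

$$I_0(\kappa) = 1 + \lambda + \lambda^2/4 + \lambda^3/36$$

to within 0.002. We can thus deduce some values for $I_0(\kappa)$ and $\kappa\rho$, namely,
As n = 21, this implies that (ignoring the constant) the posterior density for κ = 0.0, 0.5, 1.0, 1.5, and 2.0 and for values of μ at 45° intervals from $\hat\mu$ is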
In order to say anything about the marginal density of μ, we need to integrate out κ. In order to do this, we can use Simpson’s Rule. Using this rule, the integral of a function between a and b can be approximated by the sum

$$\int_a^b f(x)\, dx \approx \frac{b - a}{12} \{f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + f(x_4)\}$$

where the $x_i$ are equally spaced with $x_0 = a$ and $x_4 = b$. Applying it to the aforementioned figures, we can say that very roughly the density of μ is proportional to the following values:
Integrating over intervals of values of μ using the (even more crude) approximation
(and taking the densities below 158 and above 338 to be negligible) the probabilities that μ lies in intervals centred on various values are proportional to the values stated:
It follows that the probability that μ lies in the range (203, 293) is about 256/552=0.46, and thus this interval is close to being a 45% HDR.
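The procedure of this section, integrating κ out of the joint posterior with a five-point Simpson’s Rule, can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of the calculation above: the posterior is taken to be proportional to $\{I_0(\kappa)\}^{-n}\exp\{n\kappa\rho\cos(\mu - \hat\mu)\}$ under a uniform prior, κ is restricted to the grid 0.0, 0.5, ..., 2.0, the value `rho = 0.8` is hypothetical, and `mu_hat = 248` is chosen only to be consistent with the interval end points quoted in the text.

```python
import math


def i0(kappa, terms=30):
    """Modified Bessel function I_0 via its power series."""
    half_sq = (kappa / 2.0) ** 2
    total, term = 0.0, 1.0
    for r in range(terms):
        total += term
        term *= half_sq / ((r + 1) ** 2)
    return total


def simpson5(values, a, b):
    """Simpson's Rule for five equally spaced ordinates running from a to b."""
    f0, f1, f2, f3, f4 = values
    return (b - a) / 12.0 * (f0 + 4.0 * f1 + 2.0 * f2 + 4.0 * f3 + f4)


def unnormalised_posterior(mu_deg, kappa, n, rho, mu_hat_deg):
    """Joint posterior of (mu, kappa), up to a constant, under a uniform prior."""
    angle = math.radians(mu_deg - mu_hat_deg)
    return i0(kappa) ** (-n) * math.exp(n * kappa * rho * math.cos(angle))


n = 21          # sample size from the example
rho = 0.8       # hypothetical value, for illustration only
mu_hat = 248.0  # assumed, consistent with the interval (203, 293) in the text
kappas = [0.0, 0.5, 1.0, 1.5, 2.0]
mus = [mu_hat + 45.0 * k for k in range(-2, 3)]  # 158, 203, 248, 293, 338

# Marginal density of mu (up to a constant): integrate kappa out over [0, 2].
marginal = {
    mu: simpson5(
        [unnormalised_posterior(mu, k, n, rho, mu_hat) for k in kappas], 0.0, 2.0
    )
    for mu in mus
}
for mu, value in marginal.items():
    print(f"mu = {mu:5.1f}  marginal density proportional to {value:.4g}")
```

Truncating the κ range at 2.0 and using only five ordinates is very crude; a finer grid, or a library quadrature routine, would be a natural refinement.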
3.9.4 Remarks
The main purpose of this section is to show in some detail, albeit with very crude numerical methods, how a Bayesian approach can deal with a problem which does not lead to a neat posterior distribution values of which are tabulated and readily available. In practice, if you need to approach such a problem, you would have to have recourse to numerical integration techniques on a computer, probably using MCMC as mentioned at the start of this section, but the basic ideas would be much the same.