so that
$$p(\pi \mid k) \propto \pi^k (1-\pi)^{n-k}.$$
The constant can be found by integration if it is required. Alternatively, a glance at Appendix A will show that, given k, π has a beta distribution
$$\pi \mid k \sim \mathrm{Be}(k+1,\ n-k+1)$$
and that the constant of proportionality is the reciprocal of the beta function $\mathrm{B}(k+1,\ n-k+1)$. Thus, this beta distribution should represent your beliefs about π after you have observed k successes in n trials. This example has a special importance in that it is the one which Bayes himself discussed.
1.4.6 Independent random variables
The idea of independence extends from independence of events to independence of random variables. The basic idea is that y is independent of x if being told that x has any particular value does not affect your beliefs about the value of y. Because of complications involving events of probability zero, it is best to adopt the formal definition that x and y are independent if
$$p(x, y) = p(x)\,p(y)$$
for all values x and y. This definition works equally well in the discrete and the continuous cases (and indeed in the case where one random variable is continuous and the other is discrete). It trivially suffices that p(x, y) be a product of a function of x and a function of y.
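The factorization is easy to check numerically for a small discrete example. The following Python sketch (an illustration of ours, with arbitrarily chosen variables) verifies that the score on a red die is independent of the score on a blue die, but not of their sum.

```python
# Sketch: testing p(u, v) = p(u) p(v) over a finite joint distribution.
from itertools import product
from collections import Counter

outcomes = list(product(range(1, 7), repeat=2))           # 36 equally likely (red, blue) pairs
joint_red_blue = Counter((r, b) for r, b in outcomes)     # independent pair
joint_red_sum = Counter((r, r + b) for r, b in outcomes)  # dependent pair

def is_independent(joint, n):
    """Check p(u, v) = p(u) p(v) for every u, v in the marginal supports."""
    pu, pv = Counter(), Counter()
    for (u, v), c in joint.items():
        pu[u] += c
        pv[v] += c
    return all(abs(joint.get((u, v), 0) / n - (pu[u] / n) * (pv[v] / n)) < 1e-12
               for u in pu for v in pv)

print(is_independent(joint_red_blue, 36))  # True
print(is_independent(joint_red_sum, 36))   # False
```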
All the above generalizes in a fairly obvious way to the case of more than two random variables, and the notions of pairwise and mutual independence go through from events to random variables easily enough. However, we will find that we do not often need such generalizations.
1.5 Means and variances
1.5.1 Expectations
Suppose that m is a discrete random variable and that the series
$$\sum_m m\,p(m)$$
is absolutely convergent, that is, such that
$$\sum_m |m|\,p(m) < \infty.$$
Then the sum of the original series is called the mean or expectation of the random variable, and we denote it
$$\mathrm{E}m = \sum_m m\,p(m).$$
A motivation for this definition is as follows. In a large number N of trials, we would expect the value m to occur about p(m)N times, so that the sum total of the values that would occur in these N trials (counted according to their multiplicity) would be about
$$\sum_m m\,p(m)N,$$
so that the average value should be about
$$\sum_m m\,p(m) = \mathrm{E}m.$$
Thus, we can think of expectation as being, at least in some circumstances, a form of very long term average. On the other hand, there are circumstances in which it is difficult to believe in the possibility of arbitrarily large numbers of trials, so this interpretation is not always available. It can also be thought of as giving the position of the ‘centre of gravity’ of the distribution imagined as a distribution of mass spread along the x-axis.
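The long-term-average interpretation is easy to see in a simulation. In the Python sketch below (an illustration of ours, with an arbitrarily chosen binomial example), the average of a large number of simulated values settles close to the expectation.

```python
# Sketch: the sample mean of many simulated values approaches the expectation.
import random

random.seed(1)
n, p = 10, 0.3                       # k ~ B(10, 0.3), so E k = n * p = 3
N = 100_000                          # number of simulated trials
values = [sum(random.random() < p for _ in range(n)) for _ in range(N)]
print(sum(values) / N)               # close to 3.0
```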
More generally, if g(m) is a function of the random variable and the series $\sum_m g(m)\,p(m)$ is absolutely convergent, then its sum is the expectation $\mathrm{E}g(m)$ of g(m). Similarly, if h(m, n) is a function of two random variables m and n and the series $\sum_m \sum_n h(m, n)\,p(m, n)$ is absolutely convergent, then its sum is the expectation $\mathrm{E}h(m, n)$ of h(m, n). These definitions are consistent in that if we consider g(m) and h(m, n) as random variables with densities of their own, then it is easily shown that we get these values for their expectations.
In the continuous case, we define the expectation of a random variable x by
$$\mathrm{E}x = \int x\,p(x)\,\mathrm{d}x$$
provided that the integral is absolutely convergent, and more generally define the expectation of a function g(x) of x by
$$\mathrm{E}g(x) = \int g(x)\,p(x)\,\mathrm{d}x$$
provided that the integral is absolutely convergent, and similarly for the expectation of a function h(x, y) of two random variables. Note that the formulae in the discrete and continuous cases are, as usual, identical except for the use of summation in the one case and integration in the other.
1.5.2 The expectation of a sum and of a product
If x and y are any two random variables, independent or not, and a, b and c are constants, then in the continuous case
$$\mathrm{E}(ax + by + c) = \iint (ax + by + c)\,p(x, y)\,\mathrm{d}x\,\mathrm{d}y = a\,\mathrm{E}x + b\,\mathrm{E}y + c,$$
and similarly in the discrete case. Yet more generally, if g(x) is a function of x and h(y) a function of y, then
$$\mathrm{E}\{a\,g(x) + b\,h(y) + c\} = a\,\mathrm{E}g(x) + b\,\mathrm{E}h(y) + c.$$
We have already noted that the idea of independence is closely tied up with multiplication, and this is true when it comes to expectations as well. Thus, if x and y are independent, then
$$\mathrm{E}(xy) = \iint xy\,p(x)\,p(y)\,\mathrm{d}x\,\mathrm{d}y = \left\{\int x\,p(x)\,\mathrm{d}x\right\}\left\{\int y\,p(y)\,\mathrm{d}y\right\} = (\mathrm{E}x)(\mathrm{E}y),$$
and more generally if g(x) and h(y) are functions of independent random variables x and y, then
$$\mathrm{E}\{g(x)\,h(y)\} = \{\mathrm{E}g(x)\}\{\mathrm{E}h(y)\}.$$
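Both facts are easy to check by simulation. The Python sketch below (an illustration of ours, with arbitrarily chosen normal variables) shows that the expectation of a linear combination is always additive, while the product rule holds for an independent pair but fails for a strongly dependent one.

```python
# Sketch: E(ax + by + c) = a Ex + b Ey + c always; E(xy) = Ex * Ey needs independence.
import random

random.seed(2)
N = 200_000
xs = [random.gauss(1.0, 1.0) for _ in range(N)]
ys = [random.gauss(2.0, 1.0) for _ in range(N)]      # independent of xs
zs = [x + random.gauss(0.0, 0.1) for x in xs]        # strongly dependent on xs

def mean(v):
    return sum(v) / len(v)

print(mean([2 * x + 3 * y + 1 for x, y in zip(xs, ys)]))           # about 2*1 + 3*2 + 1 = 9
print(mean([x * y for x, y in zip(xs, ys)]), mean(xs) * mean(ys))  # both about 2
print(mean([x * z for x, z in zip(xs, zs)]), mean(xs) * mean(zs))  # differ: about 2 versus 1
```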
1.5.3 Variance, precision and standard deviation
We often need a measure of how spread out a distribution is, and for most purposes the most useful such measure is the variance of x, defined by
$$\mathcal{V}x = \mathrm{E}(x - \mathrm{E}x)^2.$$
Clearly if the distribution is very little spread out, then most values are close to one another and so close to their mean, so that $(x - \mathrm{E}x)^2$ is small with high probability and hence $\mathcal{V}x$ is small. Conversely, if the distribution is well spread out then $\mathcal{V}x$ is large. It is sometimes useful to refer to the reciprocal of the variance, which is called the precision. Further, because the variance is essentially quadratic, we sometimes work in terms of its positive square root, the standard deviation, especially in numerical work. It is often useful that
$$\mathcal{V}x = \mathrm{E}x^2 - (\mathrm{E}x)^2.$$
The notion of a variance is analogous to that of a moment of inertia in mechanics, and this formula corresponds to the parallel axes theorem in mechanics. This analogy seldom carries much weight nowadays, because so many of those studying statistics took it up with the purpose of avoiding mechanics.
In discrete cases, it is sometimes useful that
$$\mathcal{V}m = \mathrm{E}m(m - 1) + \mathrm{E}m - (\mathrm{E}m)^2.$$
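These identities are simple to verify numerically. The Python sketch below (our own illustration, using an arbitrary small discrete distribution) computes the variance in three equivalent ways.

```python
# Sketch: V m = E(m - Em)^2 = E m^2 - (E m)^2 = E m(m-1) + E m - (E m)^2,
# checked exactly for a small, arbitrarily chosen discrete distribution.
pmf = {0: 0.1, 1: 0.3, 2: 0.4, 5: 0.2}

def E(g):
    """Expectation of g(m) under the pmf above."""
    return sum(g(m) * p for m, p in pmf.items())

mean = E(lambda m: m)
var_definition = E(lambda m: (m - mean) ** 2)
var_via_squares = E(lambda m: m ** 2) - mean ** 2
var_via_factorial = E(lambda m: m * (m - 1)) + mean - mean ** 2
print(var_definition, var_via_squares, var_via_factorial)   # all three agree
```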
1.5.4 Examples
As an example, suppose that $k \sim \mathrm{B}(n, \pi)$. Then
$$\mathrm{E}k = \sum_{k=0}^{n} k \binom{n}{k} \pi^k (1-\pi)^{n-k}.$$
After a little manipulation, this can be expressed as
$$n\pi \sum_{j=0}^{n-1} \binom{n-1}{j} \pi^{j} (1-\pi)^{(n-1)-j}$$
(on writing j = k − 1). Because the sum is a sum of binomial $\mathrm{B}(n-1, \pi)$ probabilities, this expression reduces to $n\pi$, and so
$$\mathrm{E}k = n\pi.$$
Similarly,
$$\mathrm{E}k(k-1) = n(n-1)\pi^2,$$
and so
$$\mathcal{V}k = \mathrm{E}k(k-1) + \mathrm{E}k - (\mathrm{E}k)^2 = n(n-1)\pi^2 + n\pi - (n\pi)^2 = n\pi(1-\pi).$$
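These binomial results can be confirmed directly by summing the exact probabilities; the short Python sketch below (ours, with one arbitrary choice of n and π) does exactly that.

```python
# Sketch: exact check that E k = n*pi and V k = n*pi*(1 - pi) for k ~ B(n, pi).
from math import comb

n, p = 12, 0.35
pmf = {k: comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)}

Ek = sum(k * prob for k, prob in pmf.items())
Vk = sum((k - Ek) ** 2 * prob for k, prob in pmf.items())
print(Ek, n * p)               # both 4.2
print(Vk, n * p * (1 - p))     # both 2.73
```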
For a second example, suppose $x \sim \mathrm{N}(\mu, \phi)$. Then
$$\mathrm{E}x = \int_{-\infty}^{\infty} x\,(2\pi\phi)^{-1/2}\exp\{-\tfrac{1}{2}(x-\mu)^2/\phi\}\,\mathrm{d}x
= \mu + \int_{-\infty}^{\infty} (x-\mu)\,(2\pi\phi)^{-1/2}\exp\{-\tfrac{1}{2}(x-\mu)^2/\phi\}\,\mathrm{d}x.$$
The integrand in the last expression is an odd function of $x - \mu$ and so vanishes, so that
$$\mathrm{E}x = \mu.$$
Moreover,
$$\mathcal{V}x = \mathrm{E}(x-\mu)^2 = \int_{-\infty}^{\infty} (x-\mu)^2\,(2\pi\phi)^{-1/2}\exp\{-\tfrac{1}{2}(x-\mu)^2/\phi\}\,\mathrm{d}x,$$
so that on writing $z = (x-\mu)/\sqrt{\phi}$,
$$\mathcal{V}x = \phi \int_{-\infty}^{\infty} z^2\,(2\pi)^{-1/2}\exp(-\tfrac{1}{2}z^2)\,\mathrm{d}z.$$
Integrating by parts (using z as the part to differentiate), we get
$$\mathcal{V}x = \phi \int_{-\infty}^{\infty} (2\pi)^{-1/2}\exp(-\tfrac{1}{2}z^2)\,\mathrm{d}z = \phi.$$
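The same two results can be checked by crude numerical integration, as in the Python sketch below (our illustration; the midpoint rule and the particular values of μ and φ are arbitrary choices).

```python
# Sketch: numerical check that E x = mu and V x = phi for x ~ N(mu, phi),
# using a midpoint rule over mu +/- 10 standard deviations.
from math import exp, pi, sqrt

mu, phi = 1.5, 4.0
sd = sqrt(phi)

def density(x):
    return exp(-0.5 * (x - mu) ** 2 / phi) / sqrt(2 * pi * phi)

lo, hi, steps = mu - 10 * sd, mu + 10 * sd, 200_000
h = (hi - lo) / steps
grid = [lo + (i + 0.5) * h for i in range(steps)]

Ex = sum(x * density(x) for x in grid) * h
Vx = sum((x - Ex) ** 2 * density(x) for x in grid) * h
print(Ex, Vx)   # approximately 1.5 and 4.0
```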
1.5.5 Variance of a sum; covariance and correlation
Sometimes we need to find the variance of a sum of random variables. To do this, note that
$$\mathcal{V}(x + y) = \mathrm{E}\{(x + y) - \mathrm{E}(x + y)\}^2 = \mathrm{E}\{(x - \mathrm{E}x) + (y - \mathrm{E}y)\}^2 = \mathcal{V}x + 2\,\mathcal{C}(x, y) + \mathcal{V}y,$$
where the covariance of x and y is defined by
$$\mathcal{C}(x, y) = \mathrm{E}(x - \mathrm{E}x)(y - \mathrm{E}y) = \mathrm{E}(xy) - (\mathrm{E}x)(\mathrm{E}y).$$
More generally,
$$\mathcal{V}(ax + by + c) = a^2\,\mathcal{V}x + 2ab\,\mathcal{C}(x, y) + b^2\,\mathcal{V}y$$
for any constants a, b and c. By considering this expression as a quadratic in a for fixed b or vice versa and noting that (because its value is always non-negative) this quadratic cannot have two unequal real roots, we see that
$$\{\mathcal{C}(x, y)\}^2 \leqslant (\mathcal{V}x)(\mathcal{V}y).$$
We define the correlation coefficient between x and y by
$$\rho(x, y) = \frac{\mathcal{C}(x, y)}{\sqrt{(\mathcal{V}x)(\mathcal{V}y)}}.$$
It follows that
$$-1 \leqslant \rho(x, y) \leqslant 1,$$
and indeed a little further thought shows that $\rho(x, y) = 1$ if and only if
$$ax + by + c = 0$$
with probability 1 for some constants a, b and c with a and b having opposite signs, while $\rho(x, y) = -1$ if and only if the same thing happens except that a and b have the same sign. If $\rho(x, y) = 0$ we say that x and y are uncorrelated.
It is easily seen that if x and y are independent then
$$\mathcal{C}(x, y) = \mathrm{E}(xy) - (\mathrm{E}x)(\mathrm{E}y) = 0,$$
from which it follows that independent random variables are uncorrelated.
The converse is not in general true, but it can be shown that if x and y have a bivariate normal distribution (as described in Appendix A), then they are independent if and only if they are uncorrelated.
It should be noted that if x and y are uncorrelated, and in particular if they are independent, then
$$\mathcal{V}(x + y) = \mathcal{V}(x - y) = \mathcal{V}x + \mathcal{V}y$$
(observe that there is a plus sign on the right-hand side even if there is a minus sign on the left).
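A simulation makes these covariance and correlation facts concrete. The Python sketch below (our illustration, with an arbitrarily constructed correlated pair) estimates C(x, y) and ρ(x, y) and checks the formula for the variance of a sum in both the correlated and the independent case.

```python
# Sketch: checking V(x + y) = V x + V y + 2 C(x, y) and estimating the correlation.
import random

random.seed(3)
N = 200_000
xs = [random.gauss(0, 1) for _ in range(N)]
es = [random.gauss(0, 1) for _ in range(N)]                 # independent of xs
ys = [0.6 * x + 0.8 * e for x, e in zip(xs, es)]            # correlated with xs, variance 1

def mean(v):
    return sum(v) / len(v)

def var(v):
    m = mean(v)
    return mean([(u - m) ** 2 for u in v])

def cov(u, v):
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])

s = [x + y for x, y in zip(xs, ys)]
print(var(s), var(xs) + var(ys) + 2 * cov(xs, ys))              # agree, about 3.2
print(cov(xs, ys) / (var(xs) * var(ys)) ** 0.5)                 # correlation, about 0.6
print(var([x + e for x, e in zip(xs, es)]), var(xs) + var(es))  # independent case, about 2
```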
1.5.6 Approximations to the mean and variance of a function of a random variable
Very occasionally, it will be useful to have an approximation to the mean and variance of a function of a random variable. Suppose that
$$z = g(x).$$
Then if g is a reasonably smooth function and x is not too far from its expectation $\mathrm{E}x$, Taylor’s theorem implies that
$$z \cong g(\mathrm{E}x) + (x - \mathrm{E}x)\,g'(\mathrm{E}x) + \tfrac{1}{2}(x - \mathrm{E}x)^2\,g''(\mathrm{E}x).$$
It, therefore, seems reasonable that a fair approximation to the expectation of z is given by
$$\mathrm{E}z \cong g(\mathrm{E}x) + \tfrac{1}{2}\,\mathcal{V}x\,g''(\mathrm{E}x),$$
and if this is so, then a reasonable approximation to $\mathcal{V}z$ may well be given by
$$\mathcal{V}z \cong \mathcal{V}x\,\{g'(\mathrm{E}x)\}^2.$$
As an example, suppose that
$$x \sim \mathrm{B}(n, \pi)$$
and that z=g(x), where
$$g(x) = \sin^{-1}\sqrt{x/n},$$
so that
$$g'(x) = \frac{1}{2\sqrt{x(n - x)}},$$
and thus $g'(\mathrm{E}x) = g'(n\pi) = \{2n\sqrt{\pi(1-\pi)}\}^{-1}$. The aforementioned argument then implies that
$$\mathrm{E}z \cong \sin^{-1}\sqrt{\pi}, \qquad \mathcal{V}z \cong \frac{1}{4n}.$$
The interesting thing about this transformation, which has a long history [see Eisenhart et al. (1947, Chapter 16) and Fisher (1954)], is that, to the extent to which the approximation is valid, the variance of z does not depend on the parameter π. It is accordingly known as a variance-stabilizing transformation. We will return to this transformation in Section 3.2 on
the ‘Reference Prior for the Binomial Distribution’.
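The variance-stabilizing effect is easy to see in a simulation. In the Python sketch below (our own illustration, with arbitrary choices of n and of the values of π), the sample variance of z stays close to 1/(4n) as π changes.

```python
# Sketch: for x ~ B(n, pi), the variance of z = arcsin(sqrt(x/n)) stays near 1/(4n)
# over a range of values of pi.
import random
from math import asin, sqrt

random.seed(4)
n, N = 100, 20_000
print("1/(4n) =", 1 / (4 * n))
for p in (0.2, 0.5, 0.8):
    zs = [asin(sqrt(sum(random.random() < p for _ in range(n)) / n)) for _ in range(N)]
    m = sum(zs) / N
    print(p, sum((z - m) ** 2 for z in zs) / N)   # each close to 0.0025
```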
1.5.7 Conditional expectations and variances
If the reader wishes, the following may be omitted on a first reading and then returned to as needed.
We define the conditional expectation of y given x by
$$\mathrm{E}(y \mid x) = \int y\,p(y \mid x)\,\mathrm{d}y$$
in the continuous case and by the corresponding sum in the discrete case. If we wish to be pedantic, it can occasionally be useful to indicate what we are averaging over by writing
$$\mathrm{E}_{y \mid x}(y \mid x),$$
just as we can write $p_{y \mid x}(y \mid x)$, but this is rarely necessary (though it can slightly clarify a proof on occasion). More generally, the conditional expectation of a function g(y) of y given x is
$$\mathrm{E}\{g(y) \mid x\} = \int g(y)\,p(y \mid x)\,\mathrm{d}y.$$
We can also define a conditional variance as
$$\mathcal{V}(y \mid x) = \mathrm{E}[\{y - \mathrm{E}(y \mid x)\}^2 \mid x] = \mathrm{E}(y^2 \mid x) - \{\mathrm{E}(y \mid x)\}^2.$$
Despite some notational complexity, this is easy enough to find since after all a conditional distribution is just a particular case of a probability distribution. If we are really pedantic, then $\mathrm{E}(y \mid x)$ is a real number which is a function of the real number x, while $\mathrm{E}(y \mid \tilde{x})$ is a random variable which is a function of the random variable $\tilde{x}$ (that is, of x regarded as a random quantity rather than as a particular value), which takes the value $\mathrm{E}(y \mid x)$ when $\tilde{x}$ takes the value x. However, the distinction, which is hard to grasp in the first place, is usually unimportant.
We may note that the formula
$$p(y) = \int p(y \mid x)\,p(x)\,\mathrm{d}x$$
could be written as
$$p(y) = \mathrm{E}\,p(y \mid x),$$
but we must be careful that it is an expectation over values of $\tilde{x}$ (i.e. $\mathrm{E}_{\tilde{x}}$) that occurs here.
Very occasionally we make use of results like
$$\mathrm{E}y = \mathrm{E}\,\mathrm{E}(y \mid x), \qquad \mathcal{V}y = \mathrm{E}\,\mathcal{V}(y \mid x) + \mathcal{V}\,\mathrm{E}(y \mid x).$$
The proofs are possibly more confusing than helpful. They run as follows:
$$\mathrm{E}\,\mathrm{E}(y \mid x) = \int \left\{\int y\,p(y \mid x)\,\mathrm{d}y\right\} p(x)\,\mathrm{d}x
= \iint y\,p(x, y)\,\mathrm{d}y\,\mathrm{d}x = \int y\,p(y)\,\mathrm{d}y = \mathrm{E}y.$$
Similarly, we get the generalization
$$\mathrm{E}\,g(y) = \mathrm{E}\,\mathrm{E}\{g(y) \mid x\},$$
and in particular
$$\mathrm{E}y^2 = \mathrm{E}\,\mathrm{E}(y^2 \mid x) = \mathrm{E}\left[\mathcal{V}(y \mid x) + \{\mathrm{E}(y \mid x)\}^2\right],$$
hence
$$\mathrm{E}y^2 = \mathrm{E}\,\mathcal{V}(y \mid x) + \mathrm{E}\{\mathrm{E}(y \mid x)\}^2,$$
while
$$(\mathrm{E}y)^2 = \{\mathrm{E}\,\mathrm{E}(y \mid x)\}^2,$$
from which it follows that
$$\mathcal{V}y = \mathrm{E}y^2 - (\mathrm{E}y)^2 = \mathrm{E}\,\mathcal{V}(y \mid x) + \left[\mathrm{E}\{\mathrm{E}(y \mid x)\}^2 - \{\mathrm{E}\,\mathrm{E}(y \mid x)\}^2\right] = \mathrm{E}\,\mathcal{V}(y \mid x) + \mathcal{V}\,\mathrm{E}(y \mid x).$$
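Both identities can be verified exactly for a small hierarchical example. The Python sketch below (our illustration; the choice of a die for x and a binomial for y given x is arbitrary) computes every quantity from the exact probabilities.

```python
# Sketch: checking E y = E E(y|x) and V y = E V(y|x) + V E(y|x)
# for x uniform on {1,...,6} and y | x ~ B(x, 1/2), all computed exactly.
from math import comb

p = 0.5
p_x = {x: 1 / 6 for x in range(1, 7)}
cond_mean = {x: x * p for x in p_x}            # E(y | x) for a binomial B(x, p)
cond_var = {x: x * p * (1 - p) for x in p_x}   # V(y | x)

# Exact marginal distribution of y
p_y = {}
for x, px in p_x.items():
    for y in range(x + 1):
        p_y[y] = p_y.get(y, 0) + px * comb(x, y) * p**y * (1 - p)**(x - y)

Ey = sum(y * q for y, q in p_y.items())
Vy = sum((y - Ey) ** 2 * q for y, q in p_y.items())

EEy = sum(cond_mean[x] * px for x, px in p_x.items())
EV = sum(cond_var[x] * px for x, px in p_x.items())
VE = sum((cond_mean[x] - EEy) ** 2 * px for x, px in p_x.items())

print(Ey, EEy)         # both 1.75
print(Vy, EV + VE)     # both about 1.604
```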
1.5.8 Medians and modes
The mean is not the only measure of the centre of a distribution. We also need to consider the median from time to time, which is defined as any value x0 such that
$$\mathrm{P}(x \leqslant x_0) \geqslant \tfrac{1}{2} \quad\text{and}\quad \mathrm{P}(x \geqslant x_0) \geqslant \tfrac{1}{2}.$$
In the case of most continuous random variables there is a unique median such that
$$\mathrm{P}(x \leqslant x_0) = \mathrm{P}(x \geqslant x_0) = \tfrac{1}{2}.$$
We occasionally refer also to the mode, defined as that value at which the pdf is a maximum. One important use we shall have for the mode will be in methods for finding the median based on the approximation
$$\text{mean} - \text{mode} \cong 3\,(\text{mean} - \text{median}),$$
or equivalently
$$\text{median} \cong \frac{2 \times \text{mean} + \text{mode}}{3}$$
(see the preliminary remarks in Appendix A).
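As a rough numerical illustration of this approximation (ours, using a moderately skew gamma density as the example), the Python sketch below compares a numerically computed median with the value suggested by the formula.

```python
# Sketch: median ~ (2*mean + mode)/3, illustrated for a gamma density with
# shape parameter 3, whose mean is 3 and mode is 2.
from math import exp

def density(x):
    return 0.5 * x**2 * exp(-x)      # gamma(3) density: x^2 e^{-x} / 2!

# Crude numerical median: accumulate probability mass until it reaches 1/2.
h, mass, x = 1e-4, 0.0, 0.0
while mass < 0.5:
    x += h
    mass += density(x - h / 2) * h

print("numerical median:", x)                  # about 2.674
print("approximation   :", (2 * 3 + 2) / 3)    # about 2.667
```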
1.6 Exercises on Chapter 1
1. A card game is played with 52 cards divided equally between four players, North, South, East and West, all arrangements being equally likely. Thirteen of the cards are referred to as trumps. If you know that North and South have ten trumps between them, what is the probability that all three remaining trumps are in the same hand? If it is known that the king of trumps is included among the other three, what is the probability that one player has the king and the other the remaining two trumps?
2. a. Under what circumstances is an event A independent of itself?
b. By considering events concerned with independent tosses of a red die and a blue die, or otherwise, give examples of events A, B and C which are not independent, but nevertheless are such that every pair of them is independent.
c. By considering events concerned with three independent tosses of a coin and supposing that A and B both represent tossing a head on the first trial, give examples of events A, B and C which are such that $\mathrm{P}(A \cap B \cap C) = \mathrm{P}(A)\,\mathrm{P}(B)\,\mathrm{P}(C)$ although no pair of them is independent.
3. Whether certain mice are black or brown depends on a pair of genes, each of which is either B or b. If both members of the pair are alike, the mouse is said to be homozygous, and if they are different it is said to be heterozygous. The mouse is brown only if it is homozygous bb. The offspring of a pair of mice have two such genes, one from each parent, and if the parent is heterozygous, the inherited gene is equally likely to be B or b. Suppose that a black mouse results from a mating between two heterozygotes. a. What are the probabilities that this mouse is homozygous and that it is heterozygous?
Now suppose that this mouse is mated with a brown mouse, resulting in seven offspring, all of which turn out to be black.
b. Use Bayes’ Theorem to find the probability that the black mouse was homozygous BB.
c. Recalculate the same probability by regarding the seven offspring as seven observations made sequentially, treating the posterior after each observation as the prior for the next (cf. Fisher, 1959, Section II.2).
4. The example on Bayes’ Theorem in Section 1.2 concerning the biology of twins was based on the assumption that births of boys and girls occur equally frequently, and yet it has been known for a very long time that fewer girls are born than boys (cf. Arbuthnot, 1710). Suppose that the probability of a girl is p, so that
$$\mathrm{P}(GG \mid M) = p, \quad \mathrm{P}(BB \mid M) = 1 - p, \quad \mathrm{P}(GB \mid M) = 0,$$
$$\mathrm{P}(GG \mid D) = p^2, \quad \mathrm{P}(BB \mid D) = (1 - p)^2, \quad \mathrm{P}(GB \mid D) = 2p(1 - p).$$
Find the proportion of monozygotic twins in the whole population of twins in terms of p and the sex distribution among all twins.
5. Suppose a red and a blue die are tossed. Let x be the sum of the number showing on the red die and twice the number showing on the blue die. Find the density function and the distribution function of x.
6. Suppose that $k \sim \mathrm{B}(n, \pi)$ where n is large and π is small but $n\pi = \lambda$ has an intermediate value. Use the exponential limit $(1 + x/n)^n \to \mathrm{e}^x$ to show that $\mathrm{P}(k = 0) \cong \mathrm{e}^{-\lambda}$ and $\mathrm{P}(k = 1) \cong \lambda\mathrm{e}^{-\lambda}$. Extend this result to show that k is such that
$$p(k) \cong \frac{\lambda^k}{k!}\exp(-\lambda),$$
that is, k is approximately distributed as a Poisson variable of mean λ (cf. Appendix A).
7. Suppose that m and n have independent Poisson distributions of means λ and μ respectively (see question 6) and that k = m + n. a. Show that $\mathrm{P}(k = 0) = \mathrm{e}^{-(\lambda+\mu)}$ and $\mathrm{P}(k = 1) = (\lambda + \mu)\,\mathrm{e}^{-(\lambda+\mu)}$.
b. Generalize by showing that k has a Poisson distribution of mean $\lambda + \mu$.
c. Show that conditional on k, the distribution of m is binomial of index k and parameter $\lambda/(\lambda + \mu)$.
8. Modify the formula for the density of a one-to-one function g(x) of a random variable x to find an expression for the density of $x^2$ in terms of that of x, in both the continuous and discrete case. Hence, show that the square of a standard normal variable has a chi-squared distribution on one degree of freedom as defined in Appendix A.
9. Suppose that $x_1, x_2, \dots, x_n$ are independently distributed and all have the same continuous distribution, with density f(x) and distribution function F(x). Find the distribution functions of
$$M = \max(x_1, x_2, \dots, x_n) \quad\text{and}\quad m = \min(x_1, x_2, \dots, x_n)$$
in terms of F(x), and so find expressions for the density functions of M and m.
10. Suppose that u and v are independently uniformly distributed on the interval [0, 1], so that they divide the interval into three sub-intervals. Find the joint density function of the lengths of the first two sub-intervals.
11. Show that two continuous random variables x and y are independent (i.e. p(x, y)=p(x)p(y) for all x and y) if and only if their joint distribution function F(x, y) satisfies F(x, y)=F(x)F(y) for all x and y. Prove that the same thing is true for discrete random variables. [This is an example of a result which is easier to prove in the continuous case.]
12. Suppose that the random variable x has a negative binomial distribution of index n and parameter π, so that
$$p(x) = \binom{n + x - 1}{x}\,\pi^n (1 - \pi)^x \qquad (x = 0, 1, 2, \dots).$$
Find the mean and variance of x and check that your answer agrees with that given in Appendix A.
13. A random variable X is said to have a chi-squared distribution on ν degrees of freedom if it has the same distribution as
$$Z_1^2 + Z_2^2 + \dots + Z_\nu^2,$$
where $Z_1, Z_2, \dots, Z_\nu$ are independent standard normal variates. Use the facts that $\mathrm{E}Z_i = 0$, $\mathrm{E}Z_i^2 = 1$ and $\mathrm{E}Z_i^4 = 3$ to find the mean and variance of X. Confirm these values using the probability density of X, which is
$$p(X) = \frac{1}{2^{\nu/2}\,\Gamma(\nu/2)}\,X^{\nu/2 - 1}\exp(-\tfrac{1}{2}X) \qquad (X > 0)$$
(see Appendix A).
14. The skewness of a random variable x is defined as $\gamma_1 = \mu_3 / \mu_2^{3/2}$ where
$$\mu_k = \mathrm{E}(x - \mathrm{E}x)^k$$
(but note that some authors work in terms of $\beta_1 = \gamma_1^2$). Find the skewness of a random variable X with a binomial distribution of index n and parameter π.
15. Suppose that a continuous random variable X has mean μ and variance φ. By writing
$$\phi = \int (x - \mu)^2 p(x)\,\mathrm{d}x = \int_{\{x:\,|x-\mu| < c\}} (x - \mu)^2 p(x)\,\mathrm{d}x + \int_{\{x:\,|x-\mu| \geqslant c\}} (x - \mu)^2 p(x)\,\mathrm{d}x$$
and using a lower bound for the integrand in the latter integral, prove that
$$\mathrm{P}(|X - \mu| \geqslant c) \leqslant \frac{\phi}{c^2}.$$
Show that the result also holds for discrete random variables. [This result is known as Čebyšev’s Inequality (the name is spelt in many other ways, including Chebyshev and Tchebycheff).]
16. Suppose that x and y are such that
$$\mathrm{P}(x = 0,\ y = 1) = \mathrm{P}(x = 0,\ y = -1) = \mathrm{P}(x = 1,\ y = 0) = \mathrm{P}(x = -1,\ y = 0) = \tfrac{1}{4}.$$
Show that x and y are uncorrelated but that they are not independent.
17. Let x and y have a bivariate normal distribution and suppose that x and y both have mean 0 and variance 1, so that their marginal distributions are standard
normal and their joint density is
$$p(x, y) = \frac{1}{2\pi\sqrt{1 - \rho^2}}\exp\left\{-\frac{x^2 - 2\rho xy + y^2}{2(1 - \rho^2)}\right\}.$$
Show that if the correlation coefficient between x and y is ρ, then that between $x^2$ and $y^2$ is $\rho^2$.
18. Suppose that x has a Poisson distribution (see question 6) P(λ) of mean λ and that, for given x, y has a binomial distribution B(x, π) of index x and parameter π. a. Show that the unconditional distribution of y is Poisson of mean λπ.
b. Verify that the formula
$$\mathcal{V}y = \mathrm{E}\,\mathcal{V}(y \mid x) + \mathcal{V}\,\mathrm{E}(y \mid x)$$
derived in Section 1.5 holds in this case.
19. Define
$$I = \int_0^\infty \exp(-\tfrac{1}{2}z^2)\,\mathrm{d}z$$
and show (by setting $z = xy$ and then substituting z for y) that
$$I = \int_0^\infty y \exp\{-\tfrac{1}{2}(xy)^2\}\,\mathrm{d}x = \int_0^\infty z \exp\{-\tfrac{1}{2}(xz)^2\}\,\mathrm{d}x.$$
Deduce that
$$I^2 = \int_0^\infty\!\!\int_0^\infty z \exp\{-\tfrac{1}{2}(1 + x^2)z^2\}\,\mathrm{d}z\,\mathrm{d}x.$$
By substituting $(1 + x^2)z^2 = 2t$, so that $z\,\mathrm{d}z = \mathrm{d}t/(1 + x^2)$, show that $I = \sqrt{\pi/2}$, so that the density of the standard normal distribution as defined in Section 1.3 does integrate to unity and so is indeed a density. (This method is due to Laplace, 1812, Section 24.)
2
Bayesian inference for the normal distribution
2.1 Nature of Bayesian inference