Bayesian Statistics (4th ed)
1.2.3 A warning
Connecticut v. Teal [see DeGroot et al. (1986, p. 9)] was a case of alleged discrimination on the basis of a test to determine eligibility for promotion. It turned out that of those taking the test 48 were black (B) and 259 were white (W), so that if we consider a random person taking the test
$$P(B) = 48/307 = 0.16, \qquad P(W) = 259/307 = 0.84.$$
Of the blacks taking the test, 26 passed (P) and the rest failed (F), whereas of the whites, 206 passed and the rest failed, so that altogether 232 people passed. Hence,
$$P(B \mid P) = 26/232 = 0.11, \qquad P(W \mid P) = 206/232 = 0.89.$$
There is a temptation to think that these are the figures which indicate the possibility of discrimination. Now there certainly is a case for saying that there was discrimination in this case, but the figures that should be considered are
$$P(P \mid B) = 26/48 = 0.54, \qquad P(P \mid W) = 206/259 = 0.80.$$
It is easily checked that the probabilities here are related by Bayes’ Theorem. It is worthwhile spending some time playing with hypothetical figures to convince yourself that the fact that $P(B \mid P)$ is less than $P(W \mid P)$ is irrelevant to the real question as to whether $P(P \mid B)$ is less than $P(P \mid W)$; it might or might not be, depending on the rest of the relevant information, that is on $P(B)$ and $P(W)$. The fallacy involved arises as the first of two well-known fallacies in criminal law which are both well summarized by Aitken (1996) (see also Aitken and Taroni, 2004, and Dawid, 1994) as follows:
Suppose a crime has been committed. Blood is found at the scene for which there is no innocent explanation. It is of a type which is present in 1% of the population. The prosecutor may then state:
‘There is a 1% chance that the defendant would have the crime blood type if he were innocent. Thus, there is a 99% chance that he is guilty’.
Alternatively, the defender may state:
‘This crime occurred in a city of 800,000 people. This blood type would be found in approximately 8000 people. The evidence has provided a probability of 1 in 8000 that the defendant is guilty and thus has no relevance.’
The first of these is known as the prosecutor’s fallacy or the fallacy of the transposed conditional and, as pointed out above, in essence it consists in quoting the probability $P(E \mid I)$ of the evidence $E$ given innocence $I$ instead of $P(I \mid E)$. The two are, however, equal if and only if the prior probability $P(I)$ happens to equal $P(E)$, which will only rarely be the case.
The second is the defender’s fallacy which consists in quoting $P(G \mid E)$, the probability of guilt $G$ given the evidence, without regard to the prior probability $P(G)$. In the case considered by Aitken, the prior odds in favour of guilt are
$$P(G)/P(I) = 1/799\,999$$
while the posterior odds are
$$P(G \mid E)/P(I \mid E) = 1/7999.$$
Such a large change in the odds is, in Aitken’s words ‘surely of relevance’. But, again in Aitken’s words, ‘Of course, it may not be enough to find the suspect guilty’.
As a matter of fact, Bayesian statistical methods are increasingly used in a legal context. Useful references are Balding and Donnelly (1995), Foreman, Smith and Evett (1997), Gastwirth (1988) and Fienberg (1989).
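The two calculations in this section can be checked numerically. The following Python sketch (not part of the original text) recomputes the Connecticut v. Teal probabilities from the quoted counts, confirms that they satisfy Bayes’ Theorem, and then works out the prior and posterior odds in Aitken’s example. It assumes that exactly one of the 800,000 inhabitants is guilty and that exactly 8000 of them share the blood type; all variable names are illustrative.

```python
from fractions import Fraction

# Counts quoted in the text for Connecticut v. Teal.
black_total, white_total = 48, 259
black_pass, white_pass = 26, 206
total = black_total + white_total            # 307 candidates
passed = black_pass + white_pass             # 232 passes

p_B = Fraction(black_total, total)           # P(B)
p_W = Fraction(white_total, total)           # P(W)
p_P = Fraction(passed, total)                # P(P)
p_P_given_B = Fraction(black_pass, black_total)
p_P_given_W = Fraction(white_pass, white_total)
p_B_given_P = Fraction(black_pass, passed)
p_W_given_P = Fraction(white_pass, passed)

# Bayes' Theorem: P(B|P) = P(B) P(P|B) / P(P), and likewise for W.
assert p_B_given_P == p_B * p_P_given_B / p_P
assert p_W_given_P == p_W * p_P_given_W / p_P

# Aitken's blood-type example (assumed: one culprit, 1% carry the type).
population = 800_000
carriers = population // 100                 # 8000 people with the type
prior_odds = Fraction(1, population - 1)     # 1 to 799,999
posterior_odds = Fraction(1, carriers - 1)   # 1 to 7999
odds_ratio = posterior_odds / prior_odds     # factor by which the odds change

print(float(p_P_given_B), float(p_P_given_W))
print(float(odds_ratio))
```

The odds change by a factor of roughly 100, which is the reciprocal of the 1% frequency of the blood type; this is the sense in which the evidence is relevant even though the posterior odds remain small.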
1.3 Random variables
1.3.1 Discrete random variables
As explained in Section 1.1, there is usually a set Ω representing the possibilities consistent with the sum total of data available to the individual or individuals concerned. Now suppose that with each elementary event ω in Ω, there is an integer $\tilde m(\omega)$ which may be positive, negative or zero. In the jargon of mathematics, we have a function $\tilde m$ mapping Ω to the set $\mathbb{Z}$ of all (signed) integers. We refer to the function as a random variable or an r.v.
A case arising in the context of the very first example we discussed, which was about tossing a red die and a blue die, is the integer representing the sum of the spots showing. In this case, ω might be ‘red three, blue two’ and then $\tilde m(\omega)$ would be 5. Another case arising in the context of the second (political) example is the Labour majority (represented as a negative integer should the Conservatives happen to win), and here ω might be ‘Labour 350, Conservative 250’ in which case $\tilde m(\omega)$ would be 100.
Rather naughtily, probabilists and statisticians tend not to mention the elementary event ω of which $\tilde m(\omega)$ is a function and instead just write $\tilde m$ for $\tilde m(\omega)$. The reason is that what usually matters is the value of $\tilde m$ rather than the nature of the elementary event ω, the definition of which is in any case dependent on the context, as noted earlier in the discussion of elementary events. Thus, we write
$$P(\tilde m = m) = P(\{\omega;\ \tilde m(\omega) = m\})$$
for the probability that the random variable $\tilde m$ takes the particular value m. It is a useful convention to use the same lower-case letter and drop the tilde (~) to denote a particular value of a random variable. An alternative convention used by some statisticians is to use capital letters for random variables and corresponding lower case letters for typical values of these random variables, but in a Bayesian context we have so many quantities that are regarded as random variables that this convention is too restrictive. Even worse than the habit of dropping mention of ω is the tendency to omit the tilde and so use the same notation for a random variable and for a typical value of it. While failure to mention ω rarely causes any confusion, the failure to distinguish between random variables and typical values of these random variables can, on occasion, result in real confusion. When there is any possibility of confusion, the tilde will be used in the text, but otherwise it will be omitted. Also, we will use
$$p(m) = P(\tilde m = m)$$
for the probability that the random variable $\tilde m$ takes the value m. When there is only one random variable we are talking about, this abbreviation presents few problems, but when we have a second random variable $\tilde n$ and write
$$p(n) = P(\tilde n = n)$$
then ambiguity can result. It is not clear in such a case what p(5) would mean, or indeed what p(i) would mean (unless it refers to $P(\tilde i = i)$ where $\tilde i$ is yet a third random variable). When it is necessary to resolve such an ambiguity, we will use
$$p_{\tilde m}(m) = P(\tilde m = m), \qquad p_{\tilde n}(n) = P(\tilde n = n)$$
so that, for example, $p_{\tilde m}(5)$ is the probability that $\tilde m$ is 5 and $p_{\tilde n}(i)$ is the probability that $\tilde n$ equals i. Again, all of this seems very much more confusing than it really is; it is usually possible to conduct arguments quite happily in terms of p(m) and p(n) and substitute numerical values at the end if and when necessary.
You could well object that you would prefer a notation that was free of ambiguity, and if you were to do so, I should have a lot of sympathy. But the fact is that constant references to $\tilde m(\omega)$ and $p_{\tilde m}(m)$ rather than to m and p(m) would clutter the page and be unhelpful in another way.
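We refer to the sequence (p(m)) as the (probability) density (function) or pdf of the random variable m (strictly $\tilde m$). The random variable is said to have a distribution (of probability) and one way of describing a distribution is by its pdf. Another is by its (cumulative) distribution function, or cdf or df, defined by
$$F(m) = P(\tilde m \le m) = \sum_{k \le m} p(k).$$
Because the pdf has the obvious properties
$$p(m) \ge 0, \qquad \sum_m p(m) = 1$$
the df is (weakly) increasing, that is
$$F(m) \le F(m') \quad \text{if } m \le m'$$
and moreover
$$\lim_{m \to -\infty} F(m) = 0, \qquad \lim_{m \to \infty} F(m) = 1.$$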
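As an illustration of these definitions, the following Python sketch (an addition, not taken from the text) builds the pdf and df of the sum of spots on a red and a blue die, assuming both dice are fair so that each of the 36 elementary events has probability 1/36. The names `omega`, `m_tilde`, `p` and `F` are chosen here to mirror the notation.

```python
from fractions import Fraction
from itertools import product

# Elementary events: ordered pairs (red, blue), each with probability 1/36.
omega = list(product(range(1, 7), repeat=2))

def m_tilde(w):
    """The random variable: total number of spots showing."""
    red, blue = w
    return red + blue

# p(m) = P({omega : m_tilde(omega) = m})
p = {}
for w in omega:
    p[m_tilde(w)] = p.get(m_tilde(w), 0) + Fraction(1, len(omega))

def F(m):
    """Distribution function F(m) = P(m_tilde <= m)."""
    return sum(prob for value, prob in p.items() if value <= m)

print(p[5])   # 1/9
print(F(5))   # 5/18
```

Exact fractions are used so that the properties of the pdf and df (non-negativity, total probability one, monotonicity) can be checked without rounding error.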
1.3.2 The binomial distribution
A simple example of such a distribution is the binomial distribution (see Appendix A). Suppose we have a sequence of trials each of which, independently of the others, results in success (S) or failure (F), the probability of success being a constant π (such trials are sometimes called Bernoulli trials). Then the probability of any particular sequence of n trials in which k result in success is
$$\pi^k (1 - \pi)^{n-k}$$
so that allowing for the $\binom{n}{k}$ ways in which k successes and n−k failures can be ordered, the probability that a sequence of n trials results in k successes is
$$\binom{n}{k} \pi^k (1 - \pi)^{n-k}.$$
If, then, k (strictly $\tilde k$) is a random variable defined as the number of successes in n trials, then
$$p(k) = \binom{n}{k} \pi^k (1 - \pi)^{n-k} \qquad (k = 0, 1, \dots, n).$$
Such a distribution is said to be binomial of index n and parameter π, and we write
$$k \sim \mathrm{B}(n, \pi)$$
[or strictly $\tilde k \sim \mathrm{B}(n, \pi)$].
We note that it follows immediately from the definition that if x and y are independent and $x \sim \mathrm{B}(m, \pi)$ and $y \sim \mathrm{B}(n, \pi)$ then $x + y \sim \mathrm{B}(m + n, \pi)$.
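The binomial density and the additivity property just noted can be checked numerically. The sketch below is an addition to the text; the function names and the values of m, n and π are illustrative choices. It evaluates the binomial pdf directly and compares the convolution of two independent binomials having a common π with a single binomial of index m + n.

```python
from math import comb, isclose

def binomial_pmf(k, n, pi):
    """p(k) = C(n, k) pi^k (1 - pi)^(n - k) for k = 0, ..., n, else 0."""
    if not 0 <= k <= n:
        return 0.0
    return comb(n, k) * pi**k * (1 - pi) ** (n - k)

def pmf_of_sum(j, m, n, pi):
    """P(x + y = j) for independent x ~ B(m, pi) and y ~ B(n, pi)."""
    return sum(binomial_pmf(i, m, pi) * binomial_pmf(j - i, n, pi)
               for i in range(m + 1))

m, n, pi = 3, 5, 0.3   # illustrative values, not from the text

# The probabilities p(0), ..., p(n) should add up to one.
total = sum(binomial_pmf(k, n, pi) for k in range(n + 1))

# The convolution should agree with B(m + n, pi) at every point.
matches = all(
    isclose(pmf_of_sum(j, m, n, pi), binomial_pmf(j, m + n, pi))
    for j in range(m + n + 1)
)
print(total, matches)
```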
1.3.3 Continuous random variables
So far, we have restricted ourselves to random variables which take only integer values. These are particular cases of discrete random variables. Other discrete random variables occur, for example a measurement to the nearest quarter-inch which is subject to a distribution of error, but these can nearly always be changed to integer-valued random variables (in the given example simply by multiplying by 4). More generally, we can suppose that with each elementary event ω in Ω there is a real number $\tilde x(\omega)$. We can define the (cumulative) distribution function, cdf or df of $\tilde x$ by
$$F(x) = P(\tilde x \le x) = P(\{\omega;\ \tilde x(\omega) \le x\}).$$
As in the discrete case the df is (weakly) increasing, that is
$$F(x) \le F(x') \quad \text{if } x \le x'$$
and moreover
$$\lim_{x \to -\infty} F(x) = 0, \qquad \lim_{x \to \infty} F(x) = 1.$$
It is usually the case that when $\tilde x$ is not discrete there exists a function p(x), or more strictly $p_{\tilde x}(x)$, such that
$$F(x) = \int_{-\infty}^{x} p(\xi)\, d\xi$$
in which case p(x) is called a (probability) density (function) or pdf. When this is so, x (strictly $\tilde x$) is said to have a continuous distribution (or more strictly an absolutely continuous distribution). Of course, in the continuous case p(x) is not itself interpretable directly as a probability, but for small $\delta x$
$$P(x < \tilde x \le x + \delta x) \cong p(x)\, \delta x.$$
The quantity $p(x)\,\delta x$ is sometimes referred to as the probability element. Note that letting $\delta x \to 0$ this implies that
$$P(\tilde x = x) = 0$$
for every particular value x, in sharp contrast to the discrete case. We can also use the above approximation if y is some one-to-one function of x, for example
$$y = g(x).$$
Then if values correspond in an obvious way
$$P(y < \tilde y \le y + \delta y) = P(x < \tilde x \le x + \delta x)$$
which on substituting in the above relationship gives in the limit
$$p(y)\, |dy| = p(x)\, |dx|$$
which is the rule for change of variable in probability densities. (It is not difficult to see that, because the modulus signs are there, the same result is true if g is a strictly decreasing function of x.) Another way of getting at this rule is by differentiating the obvious equation
$$F(y) = F(x)$$
[strictly $F_{\tilde y}(y) = F_{\tilde x}(x)$], which holds whenever y and x are corresponding values, that is y=g(x). We should, however, beware that these results need modification if g is not a one-to-one function. In the continuous case, we can find the density from the df by differentiation, namely
$$p(x) = F'(x) = dF(x)/dx.$$
Although there are differences, there are many similarities between the discrete and the continuous cases, which we try to emphasize by using the same notation in both cases. We note that
$$\sum_x p(x) = 1$$
in the discrete case, but
$$\int p(x)\, dx = 1$$
in the continuous case. The discrete case is slightly simpler in one way in that no complications arise over change of variable, so that
$$p(y) = p(x)$$
if y and x are corresponding values, that is y=g(x).
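The contrast between the two cases can be seen in a small numerical experiment. In the sketch below, which is an illustration added here and not an example from the text, x is assumed to be uniformly distributed on (0, 1) and y = g(x) = x², which is one-to-one on that interval. The density of y is obtained from the continuous change-of-variable rule, with the factor |dx/dy| that has no counterpart in the discrete case, and the probability of an interval computed from p(y) is compared with the probability of the corresponding interval for x.

```python
from math import sqrt, isclose

def p_x(x):
    """Density of x, assumed uniform on (0, 1) for illustration."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

def p_y(y):
    """Density of y = x**2 via p(y)|dy| = p(x)|dx|, with x = sqrt(y)."""
    if not 0.0 < y < 1.0:
        return 0.0
    x = sqrt(y)
    return p_x(x) * abs(1.0 / (2.0 * x))    # |dx/dy| = 1 / (2 sqrt(y))

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

a, b = 0.25, 0.64
prob_via_y = integrate(p_y, a, b)           # P(a < y <= b) from p(y)
prob_via_x = sqrt(b) - sqrt(a)              # P(sqrt(a) < x <= sqrt(b))
print(prob_via_y, prob_via_x)
```

The two probabilities agree to within the accuracy of the numerical integration, whereas simply setting p(y) = p(x) = 1 would give 0.39 for the same interval.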
1.3.4 The normal distribution
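The most important example of a continuous distribution is the so-called normal or Gaussian distribution. We say that z has a standard normal distribution if
$$p(z) = (2\pi)^{-1/2} \exp\left(-\tfrac{1}{2} z^2\right)$$
and when this is so we write
$$z \sim \mathrm{N}(0, 1).$$
The density of this distribution is the familiar bell-shaped curve, with about two-thirds of the area between –1 and 1, 95% of the area between –2 and 2 and almost all of it between –3 and 3. Its distribution function is
$$\Phi(z) = \int_{-\infty}^{z} (2\pi)^{-1/2} \exp\left(-\tfrac{1}{2} \zeta^2\right) d\zeta.$$
More generally, we say that x has a normal distribution with mean $\mu$ and variance $\phi$, denoted
$$x \sim \mathrm{N}(\mu, \phi)$$
if
$$x = \mu + z\sqrt{\phi}$$
where z is as aforementioned, or equivalently if
$$p(x) = (2\pi\phi)^{-1/2} \exp\left\{-\tfrac{1}{2}(x - \mu)^2/\phi\right\}.$$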
The normal distribution is encountered almost at every turn in statistics. Partly this is because (despite the fact that its density may seem somewhat barbaric at first sight) it is in many contexts the easiest distribution to work with, but this is not the whole story. The Central Limit Theorem says (roughly) that if a random variable can be expressed as a sum of a large number of components no one of which is likely to be much bigger than the others, these components being approximately independent, then this sum will be approximately normally distributed. Because of this theorem, an observation which has an error contributed to by many minor causes is likely to be normally distributed. Similar reasons can be found for thinking that in many circumstances we would expect observations to be approximately normally distributed, and this turns out to be the case, although there are exceptions. This is especially useful in cases where we want to make inferences about a population mean.
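The areas quoted above for the standard normal density are easy to verify, since the standard normal distribution function can be written in terms of the error function as Φ(z) = ½{1 + erf(z/√2)}. The following sketch is an addition to the text; `normal_cdf` assumes that the distribution is parametrized by its mean and its variance, and all function names are illustrative.

```python
from math import erf, exp, pi, sqrt

def standard_normal_pdf(z):
    """Density of the standard normal distribution."""
    return exp(-0.5 * z * z) / sqrt(2.0 * pi)

def Phi(z):
    """Standard normal distribution function, via the error function."""
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def normal_cdf(x, mean, variance):
    """P(x_tilde <= x) when x_tilde = mean + z * sqrt(variance)."""
    return Phi((x - mean) / sqrt(variance))

# Area under the standard normal density between -c and c.
areas = {c: Phi(c) - Phi(-c) for c in (1, 2, 3)}
for c, area in areas.items():
    print(c, round(area, 4))
```

The three areas come out at roughly 0.683, 0.954 and 0.997, in line with the ‘two-thirds’, ‘95%’ and ‘almost all’ of the text.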
1.3.5 Mixed random variables
While most commonly occurring random variables are discrete or continuous, there are exceptions, for example the time you have to wait until you are served in a queue, which is zero with a positive probability (if the queue is empty when you arrive), but otherwise is spread over a continuous range of values. Such a random variable is said to have a mixed distribution.
1.4 Several random variables
1.4.1 Two discrete random variables
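Suppose that with each elementary event ω in Ω, we can associate a pair of integers $(\tilde m(\omega), \tilde n(\omega))$. We write
$$p(m, n) = P(\tilde m = m,\ \tilde n = n).$$
Strictly speaking, p(m, n) should be written as $p_{\tilde m, \tilde n}(m, n)$ for reasons discussed earlier, but this degree of pedantry in the notation is rarely necessary. Clearly
$$p(m, n) \ge 0, \qquad \sum_m \sum_n p(m, n) = 1.$$
The sequence (p(m, n)) is said to be a bivariate (probability) density (function) or bivariate pdf and is called the joint pdf of the random variables m and n (strictly $\tilde m$ and $\tilde n$). The corresponding joint distribution function, joint cdf or joint df is
$$F(m, n) = \sum_{k \le m} \sum_{l \le n} p(k, l).$$
Clearly, the density of m (called its marginal density) is
$$p(m) = \sum_n p(m, n).$$
We can also define a conditional distribution for n given m (strictly for $\tilde n$ given $\tilde m = m$) by allowing
$$p(n \mid m) = p(m, n)/p(m)$$
to define the conditional (probability) density (function) or conditional pdf. This represents our judgement as to the chance that $\tilde n$ takes the value n given that $\tilde m$ is known to have the value m. If it is necessary to make our notation absolutely precise, we can always write
$$p_{\tilde m \mid \tilde n}(m \mid n)$$
so, for example, $p_{\tilde m \mid \tilde n}(4 \mid 3)$ is the probability that $\tilde m$ is 4 given that $\tilde n$ is 3, whereas $p_{\tilde n \mid \tilde m}(4 \mid 3)$ is the probability that $\tilde n$ is 4 given that $\tilde m$ takes the value 3. It should be emphasized, however, that we will not often need to use the subscripts. Evidently
$$p(n \mid m) \ge 0$$
and
$$\sum_n p(n \mid m) = 1.$$
We can also define a conditional distribution function or conditional df by
$$F(n \mid m) = \sum_{k \le n} p(k \mid m).$$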
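To make the joint, marginal and conditional densities concrete, the following sketch (added here for illustration) returns to the red and blue dice and takes m to be the number on the red die and n to be the total of the two dice. The choice of this pair of variables, the assumption of fair dice and the function names are not from the text.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Joint pdf p(m, n): m = red die, n = total of the red and blue dice.
joint = defaultdict(Fraction)
for red, blue in product(range(1, 7), repeat=2):
    joint[(red, red + blue)] += Fraction(1, 36)

def marginal_m(m):
    """p(m) = sum over n of p(m, n)."""
    return sum(prob for (mm, _), prob in joint.items() if mm == m)

def conditional_n_given_m(n, m):
    """p(n | m) = p(m, n) / p(m)."""
    return joint.get((m, n), Fraction(0)) / marginal_m(m)

def conditional_df(n, m):
    """F(n | m) = sum over k <= n of p(k | m); totals start at 2."""
    return sum(conditional_n_given_m(k, m) for k in range(2, n + 1))

print(marginal_m(3))                 # 1/6
print(conditional_n_given_m(7, 3))   # 1/6
print(conditional_df(5, 3))          # 1/3
```

Given that the red die shows 3, the total is equally likely to be any of 4, 5, ..., 9, so the conditional pdf is 1/6 on those values and the conditional df at 5 is 2/6.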
1.4.2 Two continuous random variables
As in Section 1.3, we have begun by restricting ourselves to integer values, which is more or less enough to deal with any discrete cases that arise. More generally, we can suppose that with each elementary event ω in Ω, we can associate a pair $(\tilde x(\omega), \tilde y(\omega))$ of real numbers. In this case, we define the joint distribution function or joint df as
$$F(x, y) = P(\tilde x \le x,\ \tilde y \le y).$$
Clearly the df of x is
$$F(x, +\infty)$$
and that of y is
$$F(+\infty, y).$$
It is usually the case that when neither x nor y is discrete there is a function p(x, y) such that
$$F(x, y) = \int_{-\infty}^{x} \int_{-\infty}^{y} p(\xi, \eta)\, d\eta\, d\xi$$
in which case p(x, y) is called a joint (probability) density (function) or joint pdf. When this is so, the joint distribution is said to be continuous (or more strictly to be absolutely continuous). We can find the density from the df by
$$p(x, y) = \partial^2 F(x, y)/\partial x\, \partial y.$$
Clearly,
$$\iint p(x, y)\, dx\, dy = 1$$
and
$$p(x) = \int p(x, y)\, dy.$$
The last formula is the continuous analogue of
$$p(m) = \sum_n p(m, n)$$
in the discrete case.
By analogy with the discrete case, we define the conditional density of y given x (strictly of $\tilde y$ given $\tilde x = x$) as
$$p(y \mid x) = p(x, y)/p(x)$$
provided $p(x) \ne 0$. We can then define the conditional distribution function by
$$F(y \mid x) = \int_{-\infty}^{y} p(\eta \mid x)\, d\eta.$$
There are difficulties in the notion of conditioning on the event that $\tilde x = x$ because this event has probability zero for every x in the continuous case, and it can help to regard the above distribution as the limit of the distribution which results from conditioning on the event that $\tilde x$ is between x and $x + \delta x$, that is
$$P(\tilde y \le y \mid x \le \tilde x \le x + \delta x)$$
as $\delta x \to 0$.
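The limiting interpretation of the conditional distribution can be illustrated numerically. The sketch below is an addition; it assumes, purely for illustration, the joint density p(x, y) = x + y on the unit square (which integrates to one), and compares the conditional df obtained from p(x, y)/p(x) with the probability obtained by conditioning on a narrow strip between x and x + δx.

```python
def joint_density(x, y):
    """Illustrative joint pdf p(x, y) = x + y on the unit square (assumed)."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0

def integrate(f, a, b, steps=200):
    """Midpoint-rule approximation to the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def marginal_x(x):
    """p(x) = integral of p(x, y) dy, which is x + 1/2 here."""
    return integrate(lambda y: joint_density(x, y), 0.0, 1.0)

def conditional_df(y, x):
    """F(y | x) = integral up to y of p(x, eta) / p(x) d eta."""
    return integrate(lambda eta: joint_density(x, eta), 0.0, y) / marginal_x(x)

def conditional_df_strip(y, x, dx):
    """P(y_tilde <= y | x <= x_tilde <= x + dx), conditioning on a strip."""
    numerator = integrate(
        lambda xi: integrate(lambda eta: joint_density(xi, eta), 0.0, y),
        x, x + dx)
    denominator = integrate(marginal_x, x, x + dx)
    return numerator / denominator

x, y = 0.3, 0.5
exact = conditional_df(y, x)
approximations = [conditional_df_strip(y, x, dx) for dx in (0.1, 0.01, 0.001)]
print(exact, approximations)
```

As δx shrinks, the strip-based probabilities approach the value given by the conditional density, which is the sense in which conditioning on an event of probability zero is to be understood.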
1.4.3 Bayes’ Theorem for random variables
It is worth noting that conditioning the random variable y by the value of x does not change the relative sizes of the probabilities of those pairs (x, y) that can still occur. That is to say, the probability p(y|x) is proportional to p(x, y) and the constant of proportionality is just what is needed, so that the conditional probabilities integrate to unity. Thus,
$$p(y \mid x) \propto p(x, y).$$
Moreover,
$$p(x) = \int p(x, y)\, dy = \int p(y)\, p(x \mid y)\, dy.$$
It is clear that
$$p(y \mid x) = p(x, y)/p(x) = p(y)\, p(x \mid y)/p(x)$$
so that
$$p(y \mid x) \propto p(y)\, p(x \mid y).$$
This is, of course, a form of Bayes’ Theorem, and is in fact the commonest way in which it occurs in this book. Note that it applies equally well if the variables x and y are continuous or if they are discrete. The constant of proportionality is
$$1/p(x) = 1 \Big/ \int p(y)\, p(x \mid y)\, dy$$
in the continuous case or
$$1/p(x) = 1 \Big/ \sum_y p(y)\, p(x \mid y)$$
in the discrete case.
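In the discrete case this form of Bayes’ Theorem amounts to multiplying the prior by the likelihood and rescaling so that the result sums to one. A minimal sketch follows; the example is an illustrative assumption and not one from the text. Here y is the number on a fair red die, x is the total of the red and a fair blue die, and we observe only x.

```python
from fractions import Fraction

def prior(y):
    """p(y): the red die is assumed fair."""
    return Fraction(1, 6) if 1 <= y <= 6 else Fraction(0)

def likelihood(x, y):
    """p(x | y): the total is x exactly when the blue die shows x - y."""
    return Fraction(1, 6) if 1 <= x - y <= 6 else Fraction(0)

def posterior(x):
    """p(y | x) for y = 1, ..., 6, proportional to p(y) p(x | y)."""
    unnormalised = {y: prior(y) * likelihood(x, y) for y in range(1, 7)}
    p_x = sum(unnormalised.values())     # p(x) = sum over y of p(y) p(x | y)
    return {y: u / p_x for y, u in unnormalised.items()}

print(posterior(4))   # mass 1/3 on each of y = 1, 2, 3
```

A total of 4 rules out a red die showing 4 or more, and the remaining values keep their relative prior probabilities, as the remark about relative sizes at the start of this section says they must.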
1.4.4 Example
A somewhat artificial example of the use of this formula in the continuous case is as follows. Suppose y is the time before the first occurrence of a radioactive decay which is measured by an instrument, but that, because there is a delay built into the mechanism, the decay is recorded as having taken place at a time x > y. We actually have the value of x, but would like to say what we can about the value of y on the basis of this knowledge. We might, for example, have
Then
Often we will find that it is enough to get a result up to a constant of proportionality, but if we need the constant, it is very easy to find it because we know that the integral (or the sum in the discrete case) must be one. Thus, in this case
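The remark about finding the constant by normalization can be illustrated numerically. The sketch below is an addition and rests on assumptions stated here rather than taken from the surviving text: the decay time is given the prior density p(y) = exp(−y) for y > 0, the recording delay is exponential so that p(x|y) = K exp{−K(x − y)} for x > y, and K = 2 is a hypothetical value. The posterior is computed on a grid as prior times likelihood, normalized so that it integrates to one, and compared with the closed form that these assumptions imply.

```python
from math import exp, isclose

K = 2.0   # assumed rate of the recording delay (hypothetical value)

def prior(y):
    """Assumed prior p(y) = exp(-y) for y > 0."""
    return exp(-y) if y > 0 else 0.0

def likelihood(x, y):
    """Assumed p(x | y) = K exp(-K (x - y)) for x > y."""
    return K * exp(-K * (x - y)) if x > y else 0.0

def posterior_on_grid(x, steps=20_000):
    """Midpoint-grid approximation to p(y | x) on 0 < y < x."""
    h = x / steps
    ys = [(i + 0.5) * h for i in range(steps)]
    unnormalised = [prior(y) * likelihood(x, y) for y in ys]
    constant = h * sum(unnormalised)          # approximates p(x)
    return ys, [u / constant for u in unnormalised]

def posterior_closed_form(y, x):
    """Under these assumptions p(y | x) is proportional to exp((K - 1) y)."""
    return (K - 1) * exp((K - 1) * y) / (exp((K - 1) * x) - 1)

x_observed = 1.5
ys, post = posterior_on_grid(x_observed)
agrees = all(
    isclose(post[i], posterior_closed_form(ys[i], x_observed), rel_tol=1e-6)
    for i in range(0, len(ys), 1000)
)
print(agrees)
```

Only the shape of prior times likelihood is needed to write the posterior down; the constant follows from the requirement that the density integrate to one over 0 < y < x.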
1.4.5 One discrete variable and one continuous variable
We also encounter cases where we have two random variables, one of which is continuous and one of which is discrete. All the aforementioned definitions and formulae extend in an obvious way to such a case provided we are careful, for example, to use integration for continuous variables but summation for discrete variables. In particular, the formulation
$$p(y \mid x) \propto p(y)\, p(x \mid y)$$
for Bayes’ Theorem is valid in such a case.
It may help to consider an example (again a somewhat artificial one). Suppose k is the number of successes in n Bernoulli trials, so $k \sim \mathrm{B}(n, \pi)$, but that the value of π is unknown, your beliefs about it being uniformly distributed over the interval [0, 1] of possible values. Then