Statistical Inference as Severe Testing


by Deborah G. Mayo


  (Popper 1959, p. 396; substituting Pr for P)

  If your suitcase rings the alarm at an airport, this might slightly increase the probability of its containing a weapon, and slightly decrease the probability that it’s clean. But the prior probability that it contains a weapon is so small that the probability it’s clean remains high, even after it makes the alarm go off. These facts illustrate a tension between two ways a probabilist might use probability to measure confirmation. A test of a philosophical confirmation theory is whether it elucidates, or is even in sync with, intuitive methodological principles about evidence or testing. Which, if either, fits with intuitions?
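The tension can be checked with a quick calculation. The numbers below are my own illustrative assumptions, not from the text: a one-in-a-million prior that the bag contains a weapon, an alarm that fires with probability 0.99 on a weapon and 0.05 on a clean bag.

```python
# Hypothetical numbers for the suitcase example (my assumptions, for illustration).
pr_weapon = 1e-6                    # Pr(weapon): prior, very small
pr_clean = 1 - pr_weapon            # Pr(clean) = Pr(~weapon)
pr_alarm_weapon = 0.99              # Pr(alarm | weapon)
pr_alarm_clean = 0.05               # Pr(alarm | clean): false-alarm rate

# Total probability of the alarm, then Bayes' theorem for Pr(clean | alarm).
pr_alarm = pr_alarm_weapon * pr_weapon + pr_alarm_clean * pr_clean
pr_clean_alarm = pr_alarm_clean * pr_clean / pr_alarm

# Incremental sense: the alarm *lowers* the probability that the bag is clean ...
assert pr_clean_alarm < pr_clean
# ... absolute sense: yet "clean" remains overwhelmingly probable.
assert pr_clean_alarm > 0.5
print(pr_clean_alarm)   # ≈ 0.99998
```

So on the incremental reading the alarm disconfirms "clean," while on the absolute reading "clean" is still confirmed: exactly the conflict described above.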

  The most familiar interpretation is that H is confirmed by x if x gives a boost to the probability of H: incremental confirmation. The components of C(H, x) are allowed to be any statements, and, in identifying C with Pr, no reference to a probability model is required. There is typically a background variable k, so that x confirms H relative to k to the extent that Pr(H|x and k) > Pr(H|k). However, for readability, I will drop the explicit inclusion of k. More generally, if H entails x, then assuming Pr(x) ≠ 1 and Pr(H) ≠ 0, we have Pr(H|x) > Pr(H). This is an instance of probabilistic affirming the consequent. (Note: if Pr(H|x) > Pr(H), then Pr(x|H) > Pr(x).)

  (1) Incremental (B-boost): H is confirmed by x iff Pr(H|x) > Pr(H);

  H is disconfirmed iff Pr(H|x) < Pr(H).

  (“iff” denotes if and only if.) Also plausible is an absolute interpretation:

  (2) Absolute: H is confirmed by x iff Pr(H|x) is high, at least greater than Pr(~H|x).

  Since Pr(~H|x) = 1 − Pr(H|x), (2) is the same as defining “x confirms H” as: Pr(H|x) > 0.5.

  From (1), x (the alarm) disconfirms the hypothesis H: the bag is clean, because its probability has gone down, however slightly. Yet from (2), x confirms H: the bag is clean, because Pr(H) is high to begin with.

  There’s a conflict. Thus, if (1) seems plausible, then probability, Pr(H|x), isn’t a satisfactory way to define confirmation. At the very least, we must distinguish between an incremental and an absolute measure of confirmation for H. No surprise there. From the start Carnap recognized that “the verb ‘to confirm’ is ambiguous”; Carnap and most others choose the “making firmer” or incremental connotation as better capturing what is meant than that of “making firm” (Carnap 1962, p. xviii). Incremental confirmation is generally used in current Bayesian epistemology. Confirmation is a B-boost.

  The first point Popper’s making in the epigraph is this: to identify confirmation and probability (“C = Pr”) leads to this type of conflict. His example is a single toss of a homogeneous die. The data x: an even number occurs; hypothesis H: a 6 will occur. It’s given that Pr(H) = 1/6, Pr(x) = 1/2. The probability of H is increased by data x, while ~H is undermined by x (its probability goes from 5/6 to 4/6). If we identify probability with degree of confirmation, x confirms H and disconfirms ~H. However, Pr(H|x) < Pr(~H|x). So H is less well confirmed given x than is ~H, in the sense of (2). Here’s how Popper puts it, addressing Carnap: how can we say H is confirmed by x, while ~H is not, but at the same time ~H is confirmed to a higher degree by x than is H? (Popper 1959, p. 390).2

  Moreover, Popper continues, confirmation theorists don’t use Pr(H|x) alone (as they would if C = Pr), but myriad functions of probability to capture how much x has firmed up H. A number of measures offer themselves for the job. A simple B-boost would report the ratio R: Pr(x|H)/Pr(x), which in Popper’s example is 2. Or we can use the likelihood ratio of H compared to ~H. Since I used LR in Excursion 1, where the two hypotheses are not exhaustive, let’s write [LR] to denote Pr(x|H)/Pr(x|~H).

  Many other ways of measuring the increase in confirmation that x affords H could do as well. (For some excellent lists, see Popper 1959 and Fitelson 2002.)
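The numbers just cited can be verified directly. A minimal sketch of Popper’s die example in exact fractions (the use of Python’s `fractions` module is my choice):

```python
from fractions import Fraction as F

# Popper's die example: x = "an even number occurs", H = "a 6 occurs".
pr_H = F(1, 6)                    # Pr(H)
pr_x = F(1, 2)                    # Pr(x)
pr_x_H = F(1)                     # Pr(x|H): a 6 is even, so H entails x

pr_H_x = pr_x_H * pr_H / pr_x     # Bayes: Pr(H|x) = 1/3
assert pr_H_x > pr_H              # x boosts H, so (1) says "confirmed" ...
assert pr_H_x < 1 - pr_H_x        # ... yet Pr(H|x) < Pr(~H|x), so (2) says not

# The two boost measures:
R = pr_x_H / pr_x                 # Pr(x|H)/Pr(x) = 2
pr_x_notH = F(2, 6) / F(5, 6)     # Pr(x|~H): 2 of the 5 non-6 faces are even
LR = pr_x_H / pr_x_notH           # Pr(x|H)/Pr(x|~H) = 5/2

assert R == 2 and LR == F(5, 2)   # the 2 and 2.5 discussed in the text
```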

  What shall we say about numbers like 2 and 2.5? Do they mean the same thing in different contexts? Then there’s the question of computing Pr(x|~H), the catchall factor. It doesn’t pose problems in this case because ~H, the catchall hypothesis, is just an event statement. It’s far more problematic once we move to genuine statistical hypotheses. Recall how Royall’s Likelihoodist avoids the composite catchall factor by restricting his likelihood ratios to two simple statistical hypotheses.

  Popper’s second point is that “the probability of a statement … simply does not express an appraisal of the severity of the tests a theory has passed, or of the manner in which it has passed these tests” (pp. 394–5). Ultimately, Popper denies that severity can be completely formalized by any C-function. Is there nothing in between a purely formal-syntactical approach and leaving terms at a vague level? I say there is.

  Consider for a moment the philosopher Peter Achinstein, a Carnap student. Achinstein (2000, 2001) declared that scientists should not take seriously philosophical accounts of confirmation because they make it too easy to confirm. Furthermore, scientists look to empirical grounds for confirmation, whereas philosophical accounts give us formal (non-empirical) a priori measures. (I call it Achinstein’s “Dean’s problem” because he made the confession to a Dean asking about the relevance of philosophy – not usually the best way to keep funding for philosophy.) Achinstein rejects confirmation as increased firmness, denying it is either necessary or sufficient for evidence (he rejects (1)).3 He requires, for H to be confirmed by x, that the posterior of H given x be rather high, a version of (2): Pr(H|x) ≫ Pr(~H|x); but that’s not all. He requires that, before we apply confirmation measures, the components have an appropriate explanatory relationship to each other. Yet this requires an adequate way to make explanatory inferences before getting started. It’s not clear how the formalism helped. He still considers himself a Bayesian epistemologist – a term that has replaced “confirmation theorist” – but the probabilistic representation threatens to be mostly a kind of bookkeeping for inferential work done in some other way.

  Achinstein is right to object that (1), incremental confirmation, makes it too easy to have evidence. After all, J: Mike drowns in the Pacific Ocean entails x: there is a Pacific Ocean; yet x does not seem to be evidence for J. Still, the generally favored position is to view confirmation as (1), a B-boost.

  Exhibit (iv): Paradox of Irrelevant Conjunctions.

  Consider a famous argument due to Glymour (1980). If we allow that x confirms H so long as Pr(H|x) > Pr(H), it seems everything confirms everything, so long as one thing is confirmed!

  The first piece of the argument is the problem of irrelevant conjunctions – also called the “tacking paradox.” If x confirms H, then x also confirms (H & J), even if hypothesis J is just “tacked on” to H. As with most of these chestnuts, there is a long history (e.g., Earman 1992, Rosenkrantz 1977), but I consider a leading contemporary representative, Branden Fitelson. Fitelson (2002) and Hawthorne and Fitelson (2004) define the statement “J is an irrelevant conjunct to H, with respect to evidence x” as meaning Pr(x|J) = Pr(x|J & H). For instance, x might be radioastronomic data in support of

  H: the General Theory of Relativity (GTR) deflection of light effect is 1.75″, and

  J: the radioactivity of the Fukushima water being dumped in the Pacific Ocean is within acceptable levels.

  (A) If x confirms H, then x confirms (H & J), where Pr(x|H & J) = Pr(x|H), for any J consistent with H.

  The reasoning is as follows:

  (i) Pr(x|H)/Pr(x) > 1 (x Bayesian-confirms H).

  (ii) Pr(x|H & J) = Pr(x|H) (J’s irrelevance is given).

  Substituting (ii) into (i) gives Pr(x|H & J)/Pr(x) > 1.

  Therefore x Bayesian-confirms (H & J).4
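The substitution can be checked numerically. The probabilities below are hypothetical (my illustration): x boosts H, and J is made irrelevant by drawing it independently of H and x.

```python
from fractions import Fraction as F

# Hypothetical setup: x Bayesian-confirms H, and J is statistically
# independent of both H and x, so Pr(x | H & J) = Pr(x | H).
pr_H = F(1, 5)                     # Pr(H)
pr_x_H = F(9, 10)                  # Pr(x | H)
pr_x_notH = F(3, 10)               # Pr(x | ~H)

pr_x = pr_x_H * pr_H + pr_x_notH * (1 - pr_H)   # total probability: 21/50

# (i) x Bayesian-confirms H.
assert pr_x_H / pr_x > 1
# (ii) J's irrelevance: independence gives Pr(x | H & J) = Pr(x | H).
pr_x_HJ = pr_x_H
# Substituting (ii) into (i): x Bayesian-confirms the conjunction (H & J).
assert pr_x_HJ / pr_x > 1
```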

  However, it is also plausible to hold what philosophers call the “special consequence” condition: if x confirms a claim W, and W entails J, then x confirms J. In particular:

  (B) If x confirms (H & J), then x confirms J.

  (B) gives the second piece of the argument. From (A) and (B) we have: if x confirms H, then x confirms J, for any irrelevant J consistent with H (neither H nor J has probability 0 or 1).

  It follows that if x confirms any H, then x confirms any J.

  This absurd result, however, assumed (B) (special consequence), and most Bayesian epistemologists reject it. This is the gist of Fitelson’s solution to tacking, updated in Hawthorne and Fitelson (2004). It is granted that x confirms the conjunction (H & J), while denying that x confirms the irrelevant conjunct J. Aren’t they uncomfortable with (A), allowing (H & J) to be confirmed by x?

  I’m inclined to agree with Glymour that we are not too happy with an account of evidence that tells us the deflection of light data confirm the conjunction – the GTR deflection effect holds and the radioactivity of the Fukushima water is within acceptable levels – while assuring us that x does not confirm the conjunct that the Fukushima water has acceptable levels of radiation (1980, p. 31). Moreover, suppose we measure the confirmation boost by

  R: Pr(x|H)/Pr(x).

  Then, Fitelson points out, the conjunction (H & J) is just as well confirmed by x as is H!

  However, granting that confirmation is an incremental B-boost doesn’t commit you to measuring it by R. The conjunction (H & J) gets less of a confirmation boost than does H if we use, instead of R, the likelihood ratio [LR] of H against ~H:

  [LR]: Pr(x|H)/Pr(x|~H).5

  This avoids the counterintuitive result, or so it is claimed. (Note: Pr(H|x) > Pr(H) iff Pr(x|H) > Pr(x), but measuring the boost by R differs from measuring it with [LR].)
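Both claims can be checked with hypothetical numbers (again my illustration, with J drawn independently of H and x): R gives the conjunction exactly H’s boost, while [LR] gives it strictly less.

```python
from fractions import Fraction as F

pr_H, pr_J = F(1, 5), F(1, 2)      # J independent: the irrelevant conjunct
pr_x_H, pr_x_notH = F(9, 10), F(3, 10)
pr_x = pr_x_H * pr_H + pr_x_notH * (1 - pr_H)        # 21/50

pr_HJ = pr_H * pr_J                # Pr(H & J) = 1/10
pr_x_HJ = pr_x_H                   # irrelevance: Pr(x|H & J) = Pr(x|H)
# Pr(x | ~(H & J)) recovered from the law of total probability:
pr_x_notHJ = (pr_x - pr_x_HJ * pr_HJ) / (1 - pr_HJ)  # 11/30

# R treats H and (H & J) identically ...
assert pr_x_HJ / pr_x == pr_x_H / pr_x
# ... but [LR] gives the conjunction a strictly smaller boost.
LR_H = pr_x_H / pr_x_notH          # 3
LR_HJ = pr_x_HJ / pr_x_notHJ       # 27/11 < 3
assert LR_HJ < LR_H
```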

  What Does the Severity Account Say?

  Our account of inference disembarked way back at (1): that x confirms H so long as Pr(H|x) > Pr(H). That is, we reject probabilistic affirming the consequent. In the simplest case, H entails x, and x is observed. (We assume the probabilities are well defined, and H doesn’t already have probability 1.) H gets a B-boost, but there are many other “explanations” of x. It’s the same reason we reject the Law of Likelihood (LL). Unless stringent probing has occurred, finding an H that fits x is not difficult to achieve even if H is false. H hasn’t passed severely. Now passing severely is obviously stronger than merely finding some evidence for H, and the confirmation theorist is only saying a B-boost suffices for some evidence. To us, having any evidence, or even the weaker notion of an “indication,” requires that a minimal threshold of severity be met.

  How about tacking? As always, the error statistician needs to know the relevant properties of the test procedure or rule; just handing me the H’s, x’s, and relative probabilities will not suffice. The process of tacking, in at least one form, is this: once you have an incrementally confirmed H with data x, tack on any consistent J and announce “x confirms (H & J).” Let’s allow that (H & J) fits or accords with x (since GTR entails or renders probable the deflection data x). However, the very claim “(H & J) is confirmed by x” has been subjected to a radically non-risky test. Nothing has been done to measure the radioactivity of the Fukushima water being dumped into the ocean. B-boosters might reply, “We’re admitting J is irrelevant and gets no confirmation,” but our testing intuitions tell us it’s then crazy to regard (H & J) as confirmed. They will point out other examples where this doesn’t seem crazy. But what matters is that it’s being permitted in general.

  We should punish a claim to have evidence for H with a tacked-on J when nothing has been done to refute J. Imagine the chaos. Are we to allow positive trial data on diabetes patients given drug D to confirm the claim that D improves survival of diabetes patients and Roche’s artificial knee is effective, when there’s only evidence for one? If the confirmation theorist simply stipulates that (1) defines confirmation, then it’s within your rights to deny it captures ordinary notions of evidence. On the other hand, if you do accept (1), then why are you bothered at all by tacking? Many are not.

  Patrick Maher (2004) argues that if B-boosting is confirmation, then there is nothing counterintuitive about data confirming irrelevant conjunctions; Fitelson should not even be conceding that “he bites the bullet.” It makes sense that (H & J) increases the probability assignment to x just as much as does H, for J the irrelevant conjunct. The supposition that this is problematic, and that therefore one must move away from R: Pr(x|H)/Pr(x), sits uneasily with the fact that R > 1 is just what a confirmation boost means. Rather than “solve” the problem by saying we can measure boost so that (H & J) gets less confirmation than H, using [LR], why not see this as what’s meant by an irrelevant conjunct J: J doesn’t improve the ability to predict x? Other philosophers working in this arena, Crupi and Tentori (2010), notice that [LR] is not without problems. In particular, if x disconfirms hypothesis Q, then (Q & J) isn’t as badly disconfirmed as Q is, for irrelevant conjunct J. Just as (H & J) gets less of a B-boost than does H, (Q & J) gets less disconfirmation in the case where x disconfirms Q. This too makes sense on the [LR] measure, though I will spare the details. Their intuition is that this is worse than the irrelevant conjunction case, and is not solved by the use of [LR]. Interesting new measures are offered. Again, this seems to our tester to reflect the tension between Bayes boosts and good tests.
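Crupi and Tentori’s point can be illustrated with hypothetical numbers of the sort used above (my choices, with J independent of Q and x): on [LR], the conjunction (Q & J) comes out less disconfirmed than Q itself.

```python
from fractions import Fraction as F

pr_Q, pr_J = F(1, 5), F(1, 2)      # J independent: the irrelevant conjunct
pr_x_Q, pr_x_notQ = F(1, 10), F(1, 2)   # x disconfirms Q: Pr(x|Q) < Pr(x|~Q)
pr_x = pr_x_Q * pr_Q + pr_x_notQ * (1 - pr_Q)        # 21/50

pr_QJ = pr_Q * pr_J                # Pr(Q & J) = 1/10
pr_x_QJ = pr_x_Q                   # irrelevance: Pr(x|Q & J) = Pr(x|Q)
pr_x_notQJ = (pr_x - pr_x_QJ * pr_QJ) / (1 - pr_QJ)  # 41/90

LR_Q = pr_x_Q / pr_x_notQ          # 1/5: strong disconfirmation
LR_QJ = pr_x_QJ / pr_x_notQJ       # 9/41: closer to 1, milder disconfirmation
assert LR_Q < LR_QJ < 1
```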

  What They Call Confirmation We Call Mere “Fit” or “Accordance”

  In opposition to [the] inductivist attitude, I assert that C(H, x) must not be interpreted as the degree of corroboration of H by x, unless x reports the results of our sincere efforts to overthrow H. The requirement of sincerity cannot be formalized – no more than the inductivist requirement that x must represent our total observational knowledge.

  (Popper 1959, p. 418, substituting H for h and x for e)

  Sincerity! Popper never held that severe tests turned on a psychological notion, but he was at a loss to formalize severity. A fuller passage from Popper (1959) is worth reading if you get a chance.6 All the measures of confirmation, be it R or [LR] or one of the others, count merely as “fit” or “accordance” measures to Popper and to the severe tester. They may each be relevant for different problems – that there are different dimensions of fit is to be expected. These measures do not capture what’s needed to determine whether much (or anything) has been done to find H flawed. What we need to add are the associated error probabilities. Error probabilities do not enter into these standard confirmation theories – which isn’t to say they couldn’t. If R is used and observed to be r, we want to compute Pr(R > r; ~(H & J)). Here, the probability of getting R > 1 is maximal (since (H & J) entails x), even if ~(H & J) is true. So x is “bad evidence, no test” (BENT) for the conjunction.7 It’s not a psychological “sincerity” being captured; nor is it purely context free. Popper couldn’t capture it, as he never made the error probability turn.

  Time prevents us from entering multiple other rooms displaying paradoxes of confirmation theory, where we’d meet up with such wonderful zombies as the white shoe confirming “all ravens are black,” and the “grue” paradox, which my editor banished from my 1996 book. (See Skyrms 1986.) Enough tears have been shed. Yet they shouldn’t be dismissed too readily; they very often contain a puzzle of deep relevance for statistical practice. There are two reasons the tacking paradox above is of relevance to us. The first concerns a problem that arises for both Popperians and Bayesians. There is a large-scale theory T that predicts x, and we want to discern which portion of T to credit. Severity says: do not credit those portions that could not have been found false, even if they’re false. They are poorly tested. This may not be evident until long after the experiment. We don’t want to say there is evidence for a large-scale theory such as GTR just because one part was well tested. On the other hand, it may well be that all relativistic theories with certain properties have passed severely.

  Second, the question of whether to measure support with a Bayes boost or with a posterior probability arises in Bayesian statistical inference as well. When you hear that what you want is some version of probabilism, be sure to ask whether it’s a boost (and if so, which kind), a posterior probability, a likelihood ratio, or something else. Now statisticians might rightly say, we don’t go around tacking on hypotheses like this. True; the Bayesian epistemologist invites trouble by not clearly spelling out corresponding statistical models. They seek a formal logic, holding for statements about radiation, deflection, fish, or whatnot. I think this is a mistake. That doesn’t preclude a general account for statistical inference; it just won’t be purely formal.

  Statistical Foundations Need Philosophers of Statistics

  The idea of putting probabilities over hypotheses delivered to philosophy a godsend, an entire package of superficiality.

  (Glymour 2010, p. 334)

  Given a formal epistemology, the next step is to use it to represent or justify intuitive principles of evidence. The problem to which Glymour is alluding is this: you can start with the principle you want your confirmation logic to reflect, and then reconstruct it using probability. The task, for the formal epistemologist, becomes the problem of assigning priors and likelihoods that mesh with the principle you want to defend. Here’s an example. Some think that GTR got more confirmation than a rival theory (e.g., Brans-Dicke theory) because the latter was made to fit the data thanks to adjustable parameters (Jefferys and Berger 1992). Others think the fact that it had adjustable parameters does not alter the confirmation (Earman 1992). They too can reconstruct the episode so that Brans-Dicke pays no penalty. The historical episode can be “rationally reconstructed” to accord with either philosophical standpoint.

 
