
Statistical Inference as Severe Testing


by Deborah G. Mayo


  Falsification Is Rarely Deductive.

  It is rare for interesting scientific hypotheses to be logically falsifiable. This might seem surprising given all the applause heaped on falsifiability. For a scientific hypothesis H to be deductively falsified, some observable result, taken together with H, would have to yield a logical contradiction (A & ~A). But the only theories that deductively prohibit observations are of the sort one mainly finds in philosophy books: “All swans are white” is falsified by a single non-white swan. There are some statistical claims and contexts, I will argue, where it’s possible to achieve or come close to deductive falsification: claims such as “these data are independent and identically distributed (IID).” Going beyond a mere denial to replacing such claims requires more work.
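
  To make the IID example concrete, here is a minimal sketch in Python; the runs-test check and the simulated data are illustrative additions, not from the text. Under independence, the number of runs above and below the sample median has a known approximate distribution (the Wald–Wolfowitz runs test), so strong serial dependence yields a reproducible, near-deductive violation:

```python
# Illustrative sketch: probing the "independent" part of an IID claim
# with a runs test. Under independence, the run count above/below the
# median has a known approximate distribution.
import math
import random

def runs_test_z(data):
    """Wald-Wolfowitz runs test: z-statistic for runs above/below the median."""
    med = sorted(data)[len(data) // 2]
    signs = [x > med for x in data if x != med]  # drop ties at the median
    n1 = sum(signs)           # observations above the median
    n2 = len(signs) - n1      # observations below
    runs = 1 + sum(a != b for a, b in zip(signs, signs[1:]))
    mean = 2 * n1 * n2 / (n1 + n2) + 1
    var = (2 * n1 * n2 * (2 * n1 * n2 - n1 - n2)) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    return (runs - mean) / math.sqrt(var)

random.seed(1)
iid = [random.gauss(0, 1) for _ in range(500)]
trending = [i / 100 + random.gauss(0, 1) for i in range(500)]  # drifting mean: not IID

print(runs_test_z(iid))       # near 0: consistent with independence
print(runs_test_z(trending))  # strongly negative: too few runs, dependence flagged
```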

  However, interesting claims about mechanisms and causal generalizations require numerous assumptions (substantive and statistical) and are rarely open to deductive falsification. How then can good science be all about falsifiability? The answer is that we can erect reliable rules for falsifying claims with severity. We corroborate their denials. If your statistical account denies we can reliably falsify interesting theories, it is irrelevant to real-world knowledge. Let me draw your attention to an exhibit on a strange disease, kuru, and how it falsified a fundamental dogma of biology.

  Exhibit (v): Kuru.

  Kuru (which means “shaking”) was widespread among the Fore people of New Guinea in the 1960s. In around 3–6 months, kuru victims go from having difficulty walking, to outbursts of laughter, to inability to swallow and death. Kuru, and (what we now know to be) related diseases, e.g., mad cow, Creutzfeldt–Jakob, and scrapie, are “spongiform” diseases, causing brains to appear spongy. Kuru clustered in families, in particular among Fore women and their children or elderly parents. Researchers began to suspect transmission was through mortuary cannibalism. Consuming the brains of loved ones, a way of honoring the dead, was also a main source of meat permitted to women. Some say men got first dibs on the muscle; others deny men partook in these funerary practices. What we know is that ending these cannibalistic practices all but eradicated the disease. No one expected at the time that understanding kuru’s cause would falsify an established theory that only viruses and bacteria could be infectious. This “central dogma of biology” says:

  H: All infectious agents have nucleic acid.

  Any infectious agent free of nucleic acid would be anomalous for H – meaning it goes against what H claims. A separate step is required to decide when H’s anomalies should count as falsifying H. There needn’t be a cut-off so much as a standpoint as to when continuing to defend H becomes bad science. Prion researchers weren’t looking to test the central dogma of biology, but to understand kuru and related diseases. The anomaly erupted only because kuru appeared to be transmitted by a protein alone, by changing a normal protein shape into an abnormal fold. Stanley Prusiner called the infectious protein a prion – for which he received much grief. He thought, at first, he’d made a mistake “and was puzzled when the data kept telling me that our preparations contained protein but not nucleic acid” (Prusiner 1997). The anomalous results would not go away and, eventually, were demonstrated via experimental transmission to animals. The discovery of prions led to a “revolution” in molecular biology, and Prusiner received a Nobel Prize in 1997. It is logically possible that nucleic acid is somehow involved. But continuing to block the falsification of H (i.e., to block the “protein only” hypothesis) precludes learning more about prion diseases, which now include Alzheimer’s. (See Mayo 2014a.)

  Insofar as we falsify general scientific claims, we are all methodological falsificationists. Some people say, “I know my models are false, so I’m done with the job of falsifying before I even begin.” Really? That’s not falsifying. Let’s look at your method: always infer that H is false, that it fails to solve its intended problem. Then you’re bound to infer this even when it is erroneous. Your method fails the minimal severity requirement.

  Do Probabilists Falsify?

  It isn’t obvious a probabilist desires to falsify, rather than supply a probability measure indicating disconfirmation, the opposite of a B-boost (a B-bust?), or a low posterior. Members of some probabilist tribes propose that Popper is subsumed under a Bayesian account by taking a low value of Pr(x|H) to falsify H. That could not work. Individual outcomes described in detail will easily have very small probabilities under H without being genuine anomalies for H. To the severe tester, this is an attempt to distract from the inability of probabilists to falsify, insofar as they remain probabilists. What about comparative accounts (Likelihoodist or Bayes factor accounts), which I also place under probabilism? Reporting that one hypothesis is more likely than the other is not to falsify anything. Royall is clear that it’s wrong even to take the comparative report as evidence against one of the two hypotheses: they are not exhaustive. (Nothing turns on whether you prefer to put Likelihoodism under its own category.) Must all such accounts abandon the ability to falsify? No, they can indirectly falsify hypotheses by adding a methodological falsification rule. A natural candidate is to falsify H if its posterior probability is sufficiently low (or, perhaps, if it is sufficiently disconfirmed). Of course, they’d need to justify the rule, ensuring it wasn’t often mistaken.
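
  A toy computation (an illustrative addition, with made-up numbers) shows why the low Pr(x|H) rule could not work. Under H: “the coin is fair,” every fully specified sequence of 100 flips has probability 2^−100, so the rule would falsify H on any data whatsoever; a genuine anomaly is instead judged by where a test statistic, such as the number of heads, falls in its distribution under H:

```python
# Illustrative: low Pr(x|H) alone cannot falsify. Every specific 100-flip
# sequence is "improbable" under a fair coin; only a test statistic's
# position in its distribution under H marks a genuine anomaly.
from math import comb

n = 100
print(0.5 ** n)  # probability of ANY fully specified sequence: ~7.9e-31

def two_sided_p(heads, n=100):
    """P(|#heads - n/2| >= |observed - n/2|) under a fair coin."""
    dev = abs(heads - n / 2)
    return sum(comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= dev) * 0.5 ** n

print(two_sided_p(52))  # ~0.76: 52 heads is utterly typical, no anomaly
print(two_sided_p(85))  # ~5e-13: far out in the tail, a genuine anomaly
```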

  The Popperian (Methodological) Falsificationist Is an Error Statistician

  When is a statistical hypothesis to count as falsified? Although extremely rare events may occur, Popper notes:

  such occurrences would not be physical effects, because, on account of their immense improbability, they are not reproducible at will … If, however, we find reproducible deviations from a macro effect … deduced from a probability estimate … then we must assume that the probability estimate is falsified.

  (Popper 1959, p. 203)

  In the same vein, we heard Fisher deny that an “isolated record” of statistically significant results suffices to warrant a reproducible or genuine effect (Fisher 1935a, p. 14). Early on, Popper (1959) bases his statistical falsifying rules on Fisher, though citations are rare. Even where a scientific hypothesis is thought to be deterministic, inaccuracies and knowledge gaps involve error-laden predictions; so our methodological rules typically involve inferring a statistical hypothesis. Popper calls it a falsifying hypothesis. It’s a hypothesis inferred in order to falsify some other claim. A first step is often to infer that an anomaly is real, by falsifying a “due to chance” hypothesis.
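
  One standard way to cash out “reproducible deviations” is to combine p-values from independent replications. The sketch below is an illustrative addition (the text does not present it), using Fisher’s combination method; it shows how an isolated modest result differs from the same result reproduced:

```python
# Illustrative: Fisher's method for combining p-values from independent
# replications. An isolated p = 0.04 is weak; the same modest deviation
# reproduced across studies jointly falsifies the "due to chance" hypothesis.
import math

def fisher_combined_p(pvalues):
    """Fisher's method: -2*sum(log p_i) ~ chi-square with 2k df under chance."""
    stat = -2 * sum(math.log(p) for p in pvalues)
    k = len(pvalues)  # chi-square with 2k df has a closed-form survival function
    return math.exp(-stat / 2) * sum((stat / 2) ** i / math.factorial(i) for i in range(k))

print(fisher_combined_p([0.04]))                    # 0.04: an isolated record
print(fisher_combined_p([0.04, 0.03, 0.05, 0.04]))  # ~0.001: a reproducible effect
```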

  The recognition that we need methodological rules to warrant falsification led Popperian Imre Lakatos to dub Popper’s philosophy “methodological falsificationism” (Lakatos 1970, p. 106). In a footnote, where Lakatos often buried gems, you read about “the philosophical basis of some of the most interesting developments in modern statistics. The Neyman–Pearson approach rests completely on methodological falsificationism” (ibid., p. 109, note 6). Still, neither he nor Popper made explicit use of N-P tests. Statistical hypotheses are the perfect tool for “falsifying hypotheses.” However, this means you can’t be a falsificationist and remain a strict deductivist. When statisticians (e.g., Gelman 2011) claim they are deductivists like Popper, I take it they mean they favor a testing account like Popper’s, rather than inductively building up probabilities. The falsifying hypotheses that are integral to Popper’s account also necessitate an evidence-transcending (inductive) statistical inference.

  This is hugely problematic for Popper because being a strict Popperian means never having to justify a claim as true or a method as reliable. After all, this was part of Popper’s escape from induction. The problem is this: Popper’s account rests on severe tests, tests that would probably falsify claims if they are false, but he cannot warrant saying a method is probative or severe, because that would mean it was reliable, which makes Popperians squeamish. It would appear to concede to his critics that Popper has a “whiff of induction” after all. But it’s not inductive enumeration. Error statistical methods (whether formally statistical or informal) can supply the severe tests Popper sought. This leads us to Pierre Duhem, physicist and philosopher of science.

  Duhemian Problems of Falsification

  Consider the simplest form of deductive falsification: If H entails observation O, and we observe ~O, then we infer ~H. To infer ~H is to infer H is false, or that there is some discrepancy in what H claims about the phenomenon in question. As with any argument, in order to detach its conclusion (without which there is no inference), the premises must be true or approximately true. But O is derived only with the help of various additional claims. In statistical contexts, we may group these under two umbrellas: auxiliary factors linking substantive and statistical claims, A1 & … & An, and assumptions of the statistical model, E1 & … & Ek. You are to imagine a great big long conjunction of factors in the following argument:

  1. If H & (A1 & … & An) & (E1 & … & Ek), then O.

  2. ~O.

  3. Therefore, either ~H or ~A1 or … or ~An or ~E1 or … or ~Ek.

  This is an instance of deductively valid modus tollens. The catchall ~H is itself an exhaustive list of alternatives. This is too ugly for words. Philosophers, ever appealing to logic, often take this as the entity facing scientists, who are left to fight their way through a great big disjunction: either H or one (or more) of the assumptions used in deriving observation claim O is to blame for anomaly ~O.

  When we are faced with an anomaly for H, Duhem argues, “The only thing the experiment teaches us is … there is at least one error; but where this error lies is just what it does not tell us” (Duhem 1954, p. 185). Duhem’s problem is the problem of pinpointing what one is warranted in blaming for an observed anomaly with a claim H.

  Bayesian philosophers deal with Duhem’s problem by assigning each of the elements used to derive a prediction a prior probability. Whether H itself, or one of the Ai or Ek, is blamed is a matter of their posterior probabilities. Even if a failed prediction lowers the probability of hypothesis H, its posterior probability may still remain high, while the probability of A16, say, drops down. The trouble is that one is free to tinker around with these assignments so that an auxiliary is blamed and a main hypothesis H retained, or the other way around. Duhem’s problem is what’s really responsible for the anomaly (Mayo 1997a) – what’s warranted to blame. On the other hand, the Bayesian approach is an excellent way to formally reconstruct Duhem’s position. In his view, different researchers may choose to restore consistency according to their beliefs or to what Duhem called good sense, “bon sens.” Popper was allergic to such a thing.
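
  A toy model (an illustrative addition; the priors and likelihoods are invented) exhibits the tinkering just described. Give H and a single auxiliary A independent priors, let the anomaly ~O be nearly impossible when both hold and unsurprising otherwise, and the choice of prior for A then decides where the blame lands:

```python
# Illustrative: a Bayesian reconstruction of Duhem's problem with one
# hypothesis H and one auxiliary A. Observing the anomaly ~O redistributes
# blame via posteriors; tinkering with the prior on A flips the verdict.
def posteriors(prior_H, prior_A):
    # ~O is nearly impossible if H & A both hold, unsurprising otherwise
    lik = lambda h, a: 0.01 if (h and a) else 0.7
    joint = {(h, a): lik(h, a) * (prior_H if h else 1 - prior_H)
                               * (prior_A if a else 1 - prior_A)
             for h in (True, False) for a in (True, False)}
    z = sum(joint.values())
    post_H = (joint[(True, True)] + joint[(True, False)]) / z
    post_A = (joint[(True, True)] + joint[(False, True)]) / z
    return post_H, post_A

print(posteriors(prior_H=0.9, prior_A=0.6))   # H survives (~0.79); A drops 0.6 -> ~0.14
print(posteriors(prior_H=0.9, prior_A=0.99))  # same anomaly: A survives (~0.92); H drops to ~0.18
```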

  How can Popper, if he is really a deductivist, solve Duhem in order to falsify? At best he’d subject each of the conjuncts to as stringent a test as possible, and falsify accordingly. This still leaves, Popper admits, a disjunction of non-falsified hypotheses (he thought infinitely many)! Popperian philosophers of science advise you to choose a suitable overall package of hypotheses, assumptions, and auxiliaries based on criteria such as simplicity, explanatory power, and unification. There’s no agreement on which criteria to use, nor how to define them. On this view, you can’t really solve Duhem; you accept or “prefer” (as Popper said) the large-scale research program or paradigm as a whole. It’s intended to be an advance over bon sens in blocking certain types of tinkering (see Section 2.4). There’s a remark in the Popper museum display I only recently came across:

  [W]e can be reasonably successful in attributing our refutations to definite portions of the theoretical maze. (For we are reasonably successful in this – a fact which must remain inexplicable for one who adopts Duhem’ s and Quine’ s view on the matter.)

  (1962, p. 243)

  That doesn’t mean he supplied an account for such attributions. He should have, but did not. There is a tendency to suppose Duhem’s problem, like demarcation and induction, is insoluble, and that it’s taboo to claim to solve it. Our journey breaks with these taboos.

  We should reject these formulations of Duhem’s problem, starting with the great big conjunction in the antecedent of the conditional. It is vintage “rational reconstruction” of science, a very linear but straitjacketed way to view the problem. Falsifying the central dogma of biology (infection requires nucleic acid) involved no series of conjunctions from H down to observations, but moving from the bottom up, as it were. The first clues that no nucleic acids were involved came from the fact that prions are not eradicated by techniques known to kill viruses and bacteria (e.g., UV irradiation, boiling, hospital disinfectants, hydrogen peroxide, and much else). If it were a mistake to regard prions as having no nucleic acid, then at least one of these known agents would have eradicated them. Further, prions are deactivated by substances known to kill proteins. Many post-positivist philosophers of science are right to say philosophers need to pay more attention to experiments (a trend I call the New Experimentalism), but this must be combined with an account of statistical inference.

  Frequentist statistics “allows interesting parts of a complicated problem to be broken off and solved separately” (Efron 1986, p. 4). We invent methods that take account of the effect of as many unknowns as possible, perhaps randomizing the rest. I never had to affirm that each and every one of my scales worked in my weighing example; the strong argument from coincidence lets me rule out, with severity, the possibility that accidental errors were producing precisely the same artifact in each case. Duhem famously compared the physicist to the doctor, as opposed to the watchmaker who can pull things apart. But the doctor may determine what it would be like if such and such were operative, and so distinguish the effects of different sources. The effect of violating an assumption of a constant mean looks very different from that of a changing variance; despite all the causes of a sore throat, strep tests are quite reliable. Good researchers should at least be able to embark on inquiries to solve their Duhemian problems.
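
  A minimal simulation (an illustrative addition; the half-split diagnostic is one simple choice among many) shows why the two violations look different: a drifting mean registers in a first-half versus second-half comparison of means, while a variance change registers in the corresponding variance ratio:

```python
# Illustrative: different assumption violations leave different signatures,
# so they can be probed separately. A drifting mean shifts the half-split
# mean difference; a variance change shifts the half-split variance ratio.
import random
import statistics as st

def half_split_diagnostics(xs):
    a, b = xs[:len(xs) // 2], xs[len(xs) // 2:]
    return (st.mean(b) - st.mean(a),          # far from 0 => mean not constant
            st.variance(b) / st.variance(a))  # far from 1 => variance not constant

random.seed(7)
drift = [i / 100 + random.gauss(0, 1) for i in range(400)]          # mean drifts upward
hetero = [random.gauss(0, 1 + 2 * (i >= 200)) for i in range(400)]  # variance jumps midway

print(half_split_diagnostics(drift))   # big mean shift (~2), variance ratio ~ 1
print(half_split_diagnostics(hetero))  # mean shift ~ 0, variance ratio ~ 9
```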

  Popper Comes Up Short.

  Popper’s account rests on severe tests, tests that would probably have falsified a claim if false, but he cannot warrant saying any such thing. High corroboration, Popper freely admits, is at most a report on past successes, with little warrant for future reliability.

  Although Popper’s work is full of exhortations to put hypotheses through the wringer, to make them “suffer in our stead in the struggle for the survival of the fittest” (Popper 1962, p. 52), the tests Popper sets out are white-glove affairs of logical analysis … it is little wonder that they seem to tell us only that there is an error somewhere and that they are silent about its source. We have to become shrewd inquisitors of errors, interact with them, simulate them (with models and computers), amplify them: we have to learn to make them talk.

  (Mayo 1996, p. 4)

  Even to falsify non-trivial claims – as Popper grants – requires grounds for inferring a reliable effect. Singular observation statements will not do. We need “lift-off.” Popper never saw how to solve the problem of “drag down,” wherein empirical claims are only as reliable as the data involved in reaching them (Excursion 1). We cannot just pick up his or any other past account. Yet there’s no reason to be hamstrung by the limits of the logical positivist or empiricist era. Scattered measurements are not of much use, but with adequate data massaging and averaging we can estimate a quantity of interest far more accurately than with individual measurements. Recall Fisher’s “it should never be true” in Exhibit (iii), Section 2.1. Fisher and Neyman–Pearson were ahead of Popper here (as was Peirce). When Popper wrote me “I regret not studying statistics,” my thought was “not as much as I do.”

  Souvenir E: An Array of Questions, Problems, Models

  It is a fundamental contribution of modern mathematical statistics to have recognized the explicit need of a model in analyzing the significance of experimental data.

  (Suppes 1969, p. 33)

  Our framework cannot abide oversimplifications of accounts that blur statistical hypotheses and research claims, that ignore assumptions of the data, or that limit the entry of background information to any one portal or any one form. So what do we do if we’re trying to set out the problems of statistical inference? I appeal to a general account (Mayo 1996) that builds on Patrick Suppes’ (1969) idea of a hierarchy of models between models of data, experiment, and theory. Trying to cash out a full-blown picture of inquiry that purports to represent all contexts of inquiry is a fool’s errand. Or so I discovered after many years of trying. If one is not to land in a Rube Goldberg mess of arrows and boxes, only to discover it’s not pertinent to every inquiry, it’s best to settle for pigeonholes roomy enough to organize the interconnected pieces of a given inquiry, as in Figure 2.1.

  Figure 2.1 Array of questions, problems, models.

  Loosely, there’s an inferential move from the data model to the primary claim or question via the statistical test or inference model. Secondary questions include a variety of inferences involved in generating and probing conjectured answers to the primary question. A sample: How might we break down a problem into one or more local questions that can be probed with reasonable severity? How should we generate and model raw data, put them in canonical form, and check their assumptions? Remember, we are using “tests” to encompass probing any claim, including estimates. It’s standard to distinguish “confirmatory” and “exploratory” contexts, but each is still an inferential or learning problem, although criteria for judging the solutions differ. In explorations, we may simply wish to infer that a model is worth developing further, or that another is wildly off target.

 
