Statistical Inference as Severe Testing
Data from test T are an indication of, or evidence for, H just to the extent that H has severely passed test T.
“Just to the extent that” indicates the “if then” goes in both directions: a claim that passes with low severity is unwarranted; one that passes with high severity is warranted. The phrase “to the extent that” refers to degrees of severity. That said, evidence requires that a decent threshold be met: low severity is lousy evidence. It’s still useful to point out in our travels when only weak severity suffices.
2. What Enables Induction (as Severe Testing) to Work: Informal and Quasi-formal Severity.
You briefly visited Exhibit (v) on prions and the deadly brain disease kuru. I’m going to use it as an example of a quasi-formal inquiry. Prions were found to contain a single protein dubbed PrP. Much to their surprise, researchers found PrP in normal cells too – it doesn’t always cause disease. Our hero, Prusiner, again worries he’d “made a terrible mistake” (and prions had nothing to do with it). There are four strategies:
(a) Can we trick the phenomenon into telling us what it would be like if it really were a mere artifact (H0)? Transgenic mice are bred with PrP deliberately knocked out. Were H0 true, they’d be expected to be infected as much as normal mice – the test hypothesis H0 would not be rejected. H0 asserts an implicationary assumption – one assumed just for testing. Abbreviate it as an i-assumption. It turns out that without PrP, none could be infected. Once PrP is replaced, they can again be infected. They argue there’s evidence to reject the artifact error H0 because a procedure that would have revealed it fails to do so, and instead consistently finds departures from H0.
(b) Over a period of more than 30 years, Prusiner and other researchers probed a series of local hypotheses. The levels of our hierarchy of models distinguish the various questions – even though I sketch it horizontally to save space (Figure 2.1). Comparativists deny we can proceed with a single hypothesis, but we do. Each question may be regarded as asking: would such and such be an erroneous interpretation of the data? Say the primary question is protein only or not. The alternatives do not include, for the moment, other “higher level” explanations about the mechanism of prion infectivity or the like. Given this localization, if H has been severely tested – by which I mean it has passed a severe test – then its denial has passed with low severity. That follows by definition of severity.
(c) Another surprise: the disease-causing form, call it pD, has exactly the same amino acids as the normal type, call it pN. What’s going on? Notice that a method that precluded exceptions to the central dogma (only nucleic acid directs replication of pathogens) would be incapable of identifying the culprit of prion transmission: the misfolding protein. Prusiner’s prion hypothesis H* is that prions target normal PrP, pinning and flattening their spirals to flip from the usual pN shape into pD, akin to a “deadly Virginia reel in the brain,” adding newly formed pDs to the ends each time (Prusiner Labs 2004). When the helix is long enough, it ruptures, sending more pD seeds to convert normal prions. Here is another i-assumption to subject to the test of experiment. The trouble is, the process is so slow it can take years to develop. Not long ago, they found a way to deceive the natural state of affairs, while not altering what they want to learn: artificially rupture (with ultrasound or other means) the pathogenic prion. The technique is called protein misfolding cyclic amplification, PMCA. They get huge amounts of pD starting from a minute quantity, even a single molecule, so long as there’s plenty of normal PrP ready to be infected. All normal prions are converted into diseased prions in vitro. They could infer, with severity, that H* gives a correct understanding of prion propagation, as well as corroborate the new research tool: they corroborated both at once, not instantly of course, but over a period of a few years.
(d) Knowing the exponential rates of amplification associated with a method, researchers can infer, statistically, back to the amount of initial infectivity present – something they couldn’t discern before, given the very low concentration of pD in accessible bodily fluids. With the method constantly improved and even automated, pD can now be detected in living animals for the first time.
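The back-calculation in (d) can be pictured with a toy model. Here is a minimal sketch, assuming (purely for illustration) that each amplification cycle multiplies the amount of misfolded protein by a fixed factor of 2; the rate and the numbers are hypothetical, not figures from the prion research:

```python
# Toy back-calculation of initial infectivity from exponential amplification.
# Hypothetical assumption: each amplification cycle doubles the amount of
# misfolded protein; a real per-cycle rate would be calibrated experimentally.

def initial_amount(final_amount, cycles, rate_per_cycle=2.0):
    """Infer the starting quantity from the amplified quantity."""
    return final_amount / (rate_per_cycle ** cycles)

# Example: 1e9 units detected after 30 cycles implies roughly 1 unit to start.
print(initial_amount(1e9, 30))
```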
What are some key elements? Honest self-criticism of how one may be wrong, deliberate deception to get counterfactual knowledge, conjecturing i-assumptions whose rejection leads to finding something out, and so on. Even researchers who hold different theories about the mechanism of transmission do not dispute PMCA – they can’t if they want to learn more in the domain. I’m leaving out the political and personality feuds, but there’s a good story there (see Prusiner 2014). I also didn’t discuss statistically modeled aspects of prion research, but controlling the mean number of days for incubation allowed a stringent causal argument. I want to turn to statistical induction at a more rudimentary entry point.
3. Neyman’s Quarrel with Carnap.
Statistics is the sine qua non for extending our powers to severely probe. Jerzy Neyman, with his penchant for inductive behavior and performance rather than inductive inference, is often seen as a villain in the statistics battles. So take a look at a paper of his with the tantalizing title “The Problem of Inductive Inference” (Neyman 1955). Neyman takes umbrage at the way confirmation philosophers, in particular Carnap, view frequentist inference:
… when Professor Carnap criticizes some attitudes which he represents as consistent with my (“frequentist”) point of view, I readily join him in his criticism without, however, accepting the responsibility for the criticized paragraphs.
(p. 13)
In effect, Neyman says: I’d never infer from observing that 150 out of 1000 throws with this die landed on six, “nothing else being known,” that future throws will result in around 0.15 sixes, as Carnap alleges I would. This is a version of enumerative induction (or Carnap’s straight rule). You need a statistical model! Carnap should view “Statistics as the Frequentist Theory of Induction,” says Neyman in a section with this title – here, the Binomial model. The Binomial distribution builds on n Bernoulli trials, the success–failure trials (visited in Section 1.4). It just adds up all the ways that number of successes could occur: the probability of k successes in n Bernoulli(θ) trials is
C(n, k) θ^k (1 − θ)^(n − k),
where θ is the probability of success at each trial and C(n, k) counts the ways k successes can be arranged among the n trials.
Carnapians could have formed the straight rule for the Binomial experiment, and argued:
If an experiment can be generated and modeled Binomially, then sample means can be used to reliably estimate population means.
An experiment can be modeled Binomially.
Therefore, we can reliably estimate population means in those contexts.
The reliability comes from controlling the method’s error probabilities.
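To make “reliability via error probabilities” concrete, here is a minimal simulation sketch; θ = 0.15 echoes the 150-out-of-1000 die example above, while the sample size, margin, and number of repetitions are my illustrative choices:

```python
import numpy as np

# Minimal sketch: under a Binomial model the sample proportion reliably
# estimates theta, and the method's error probability (missing by more than
# a chosen margin) can be computed and controlled in advance.
rng = np.random.default_rng(0)

theta = 0.15      # probability of a six, as in the die example above
n = 1000          # throws per experiment
margin = 0.03     # how close we demand the estimate to be
reps = 10_000     # repetitions of the whole experiment

estimates = rng.binomial(n, theta, size=reps) / n
coverage = np.mean(np.abs(estimates - theta) <= margin)
print(f"P(|estimate - theta| <= {margin}) is about {coverage:.3f}")
# With n = 1000 this is roughly 0.99: the method rarely errs by more than 0.03.
```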
4. What Is Neyman’s Empirical Justification for Using Statistical Models?
Neyman pays a lot of attention to the empirical justification for using statistical models. Take his introductory text (Neyman 1952). The models are not invented from thin air. In the beginning there are records of different results and stable relative frequencies with which they occurred. These may be called empirical frequency distributions. There are real experiments that “even if carried out repeatedly with the utmost care to keep conditions constant, yield varying results” (ibid., p. 25). These are real, not hypothetical, experiments, he stresses. Examples he gives are roulette wheels (electrically regulated), tossing coins with a special machine (that gives a constant initial velocity to the coin), the number of disintegrations per minute in a quantity of radioactive matter, and the tendency for properties of organisms to vary despite homogeneous breeding. Even though we are unable to predict the outcome of such experiments, a certain stable pattern of regularity emerges rather quickly, even in moderately long series of trials; usually around 30 or 40 trials suffice. The pattern of regularity is in the relative frequency with which specified results occur.
Neyman takes a toy example: toss a die twice and record the number of sixes: 0, 1, or 2. Call this a paired trial. Now do this 1000 times. You’ll have 1000 paired trials. Put these to one side for a moment. Just consider the entire set of 2000 tosses – first order trials, Neyman calls these. Compute the relative frequency of sixes out of 2000. It may not be 1/6, due to the structure of the die or the throwing. Whatever it is, call it f. Now go back to the paired trials. Record the relative frequency of six found in paired trial 1 (maybe it’s 0), the relative frequency of six found in paired trial 2, and so on all the way through your 1000 paired trials. We can then ask: what proportion of the 1000 paired trials had no sixes, what proportion had 1 six, what proportion 2 sixes? We find “the proportions of pairs with 0, 1 and 2 sixes will be, approximately,
(1 − f)^2, 2f(1 − f), and f^2.”
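Neyman’s paired-trial demonstration is easy to reproduce by simulation. Here is a minimal sketch, assuming a die with a 1/6 chance of six purely for illustration (Neyman stresses that f need not be 1/6):

```python
import numpy as np

# Reproduce Neyman's paired-trial demonstration: the observed proportions of
# pairs with 0, 1, or 2 sixes approximate (1 - f)^2, 2f(1 - f), and f^2,
# where f is the relative frequency of six in all 2000 first order trials.
rng = np.random.default_rng(1)

pairs = rng.random((1000, 2)) < 1 / 6      # 1000 paired trials (True = six)
f = pairs.mean()                           # relative frequency over 2000 tosses
sixes_per_pair = pairs.sum(axis=1)         # 0, 1, or 2 sixes in each pair

observed = [np.mean(sixes_per_pair == k) for k in (0, 1, 2)]
predicted = [(1 - f) ** 2, 2 * f * (1 - f), f ** 2]
print("observed :", np.round(observed, 3))
print("predicted:", np.round(predicted, 3))
```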
Instead of pairs of trials, consider n-fold trials: each trial has n throws of the die. Compute f as before: it is the relative frequency of six in the 1000n first order trials. Then turn to the 1000 n-fold trials, and compute the proportion in which six occurs k times (for k ≤ n). It will be very nearly equal to
C(n, k) f^k (1 − f)^(n − k).
“In other words, the relative frequency” of k out of n successes in the n-fold trials “is connected with the relative frequency of the first order experiments in very nearly the same way as the probability” of k out of n successes in a Binomial trial is related to the probability of success at each trial, θ (Neyman 1952, p. 26).
The above fact, which has been found empirically many times … may be called the empirical law of large numbers. I want to emphasize that this law applies not only to the simple case connected with the binomial formula … but also to other cases. In fact, this law seems to be perfectly general … Whenever the law fails, we explain the failure by suspecting a “lack of randomness” in the first order trials.
(ibid., p. 27)
Now consider, not just 1000 repetitions of the n-fold trial, but all possible repetitions. Here f, the relative frequency of sixes, is θ in the Binomial probability model with n trials. It is this universe of hypothetical repetitions that our one n-fold sample is a random member of. Figure 2.2 shows the frequency distribution if we choose n = 100 and θ = 1/6.
Figure 2.2 Binomial distribution for n = 100, θ = 1/6.
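The distribution in Figure 2.2 can be computed directly from the Binomial formula. Here is a minimal sketch using scipy; the particular summaries printed are my illustrative choices:

```python
import numpy as np
from scipy.stats import binom

# The frequency distribution of Figure 2.2: Binomial with n = 100, theta = 1/6.
n, theta = 100, 1 / 6
k = np.arange(0, n + 1)
pmf = binom.pmf(k, n, theta)

print("most probable number of sixes:", k[np.argmax(pmf)])
print("P(10 <= sixes <= 23):", binom.cdf(23, n, theta) - binom.cdf(9, n, theta))
# Plotting pmf against k (e.g., with matplotlib's bar) reproduces the figure.
```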
The Law of Large Numbers (LLN) shows we can use the probability derived from the probability model of the experiment to approximate the relative frequencies of outcomes in a series of n-fold trials. The LLN is both an empirical law and a mathematical law. The proofs are based on idealized random samples, but there are certain actual experiments that are well approximated by the mathematical law – something we can empirically test (von Mises 1957).
You may bristle at this talk of random experiments, but, as Neyman repeatedly reminds us, these are merely “picturesque” shorthands for results squarely linked up with empirical tests (Neyman 1952, p. 23). We keep to them in order to explicate the issues at the focus of our journey. The justification for applying what is strictly an abstraction is no different from other cases of applied mathematics. We are not barred from fruitfully applying geometry because a geometric point is an abstraction.
“Whenever we succeed in arranging” the data generation such that the relative frequencies adequately approximate the mathematical probabilities in the sense of the LLN, we can say that the probabilistic model “adequately represents the method of carrying out the experiment” (ibid., p. 19). In those cases we are warranted in describing the results of real experiments as random samples from the population given by the probability model. You can reverse direction and ask about f or θ when unknown. Notice that we are modeling something we do; we may do it well or badly. To take one of Peirce’s brilliant insights, all we need in order to justify induction is that mysterious supernatural powers keep their hands off our attempts to carry out inquiry properly: “the supernal powers withhold their hands and let me alone, and that no mysterious uniformity … interferes with the action of chance” (2.749). End of talk.
I wonder if Carnap ever responded to Neyman’s grumblings. Why didn’t philosophers replace a vague phrase like “if these k out of n successes are all I know about the die” with a reference to the Binomial model, I asked Wesley Salmon in the 1980s. Because, he said, we didn’t think the Binomial model could be justified without getting into a circle. But it can be tested empirically. By deliberately varying a known Binomial process to violate one of its assumptions, we develop tests that would very probably detect such violations should they occur. This is the key to justifying induction as severe testing: it corrects its own assumptions. Testing the assumption of randomness is independent of estimating θ given that the process is random. Salmon and I met weekly to discuss statistical tests of assumptions when I visited the Center for Philosophy of Science at Pittsburgh in 1989. I think I convinced him of this much (or so he said): the confirmation theorists were too hasty in discounting the possibility of warranting statistical model assumptions.
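What can a test of the randomness assumption, separate from estimating θ, look like? Here is a minimal sketch using a Wald–Wolfowitz runs test, offered purely as an illustration of one standard check for randomness; the sequences and their clumping are my contrivances:

```python
import numpy as np
from scipy.stats import norm

def runs_test(x):
    """Wald-Wolfowitz runs test on a binary sequence: returns (z, p-value).
    A small p-value indicates an ordering that a random (i.i.d.) process
    would very probably not produce."""
    x = np.asarray(x, dtype=int)
    n1, n2 = np.sum(x == 1), np.sum(x == 0)
    runs = 1 + np.sum(x[1:] != x[:-1])
    mu = 1 + 2 * n1 * n2 / (n1 + n2)
    var = 2 * n1 * n2 * (2 * n1 * n2 - n1 - n2) / ((n1 + n2) ** 2 * (n1 + n2 - 1))
    z = (runs - mu) / np.sqrt(var)
    return z, 2 * norm.sf(abs(z))

rng = np.random.default_rng(2)
random_seq = rng.binomial(1, 1 / 6, size=1000)           # genuinely random trials
clumped_seq = np.repeat(rng.binomial(1, 1 / 6, 200), 5)  # deliberate violation
print("random :", runs_test(random_seq))
print("clumped:", runs_test(clumped_seq))
# Both sequences have about the same frequency of successes, yet the clumping
# is very probably detected: randomness is probed separately from theta.
```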
Souvenir H: Solving Induction Is Showing Methods with Error Control
How is the problem of induction transformed if induction is viewed as severe testing? Essentially, it becomes a matter of showing that there exist methods with good error probabilities. The specific task becomes examining the fields or inquiries that are – and are not – capable of assessing and controlling severity. Nowadays many people abjure teaching the different distributions, preferring instead to generate frequency distributions by resampling a given random sample (Section 4.6). This vividly demonstrates what really matters in appealing to probability models for inference, as distinct from modeling phenomena more generally: frequentist error probabilities are of relevance when frequencies represent the capabilities of inquiries to discern and discriminate various flaws and biases. Where Popper couldn’t say that methods probably would have found H false, if it is false, error statistical methods let us go further. (See Mayo 2005a.)
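The resampling idea just mentioned can be sketched minimally: generate a frequency distribution for a statistic by repeatedly redrawing, with replacement, from the one observed sample. The sample and the statistic (the mean) below are illustrative choices, not an example from Section 4.6:

```python
import numpy as np

# Minimal bootstrap sketch: approximate the frequency distribution of the
# sample mean by resampling the one observed sample with replacement.
rng = np.random.default_rng(3)

sample = rng.binomial(1, 1 / 6, size=100)   # stand-in for an observed sample
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])
print("observed mean:    ", sample.mean())
print("bootstrap std err:", boot_means.std(ddof=1))
# The spread of boot_means plays the role that a theoretical sampling
# distribution (e.g., the Binomial) would otherwise supply.
```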
The severity account puts forward a statistical philosophy associated with statistical methods. To see what I mean, recall the Likelihoodist. It’s reasonable to suppose that we favor, among pairs of hypotheses, the one that predicts or makes probable the data – so proposes the Likelihoodist. The formal Law of Likelihood (LL) is intended to capture this, and we appraise it according to how well it succeeds, and how well it satisfies the goals of statistical practice. Likewise, the severe tester proposes, there is a pre-statistical plausibility to inferring hypotheses to the extent that they have passed stringent tests. The error statistical methodology is the frequentist theory of induction. Here too the statistical philosophy is to be appraised according to how well it captures and supplies rationales for inductive-statistical inference. The rest of our journey will bear this out. Enjoy the concert in the Captain’s Central Limit Lounge while the breezes are still gentle; we set out on Excursion 3 in the morn.
1 For example, astronomy, but not astrology, can reliably solve its Duhemian puzzles. Chapter 2, Mayo (1996), following my reading of Kuhn (1970) on “normal science.”
2 As noted earlier, I follow Ioannidis in using bias this way, in speaking of selections.
3 For a discussion of novelty and severity in philosophy of science, see Chapter 8 of Mayo (1996). Worrall and I have engaged in a battle over this in numerous places (Mayo 2010d, Worrall 1989, 2010). Related exchanges include Mayo 2008, Hitchcock and Sober 2004.
4 This is the term used by Andrew Gelman.
5 Gigerenzer calls such a “no increase” hypothesis the substantive null hypothesis.
6 This is a “direct replication,” whereas a “conceptual replication” probes the same hypothesis but through a different phenomenon.
7 The experimental philosophy movement should be distinguished from the New Experimentalism in philosophy.
8 There are some fairly strong statistics, too, of correlations between wives earning more than their husbands and divorce or marital dissatisfaction – although it is likely the disgruntlement comes from both sides.
9 One of the failed replications was the finding that reading a passage against free will contributes to a proclivity for cheating. Both the manipulation and the measured effects are shaky – never mind any statistical issues.
10 “Last week CNN pulled a story about a study purporting to demonstrate a link between a woman’s ovulation and how she votes… The story was savaged online as ‘silly,’ ‘stupid,’ ‘sexist,’ and ‘offensive.’” (Bartlett 2012b)
Excursion 3
Statistical Tests and Scientific Inference
Itinerary
Tour I Ingenious and Severe Tests
3.1 Statistical Inference and Sexy Science: The 1919 Eclipse Test
3.2 N-P Tests: An Episode in Anglo-Polish Collaboration
3.3 How to Do All N-P Tests Do (and More) While a Member of the Fisherian Tribe
Tour II It’s the Methods, Stupid
3.4 Some Howlers and Chestnuts of Statistical Tests
3.5 P-Values Aren’t Error Probabilities Because Fisher Rejected Neyman’s Performance Philosophy
3.6 Hocus-Pocus: P-Values Are Not Error Probabilities, Are Not Even Frequentist!
Tour III Capability and Severity: Deeper Concepts
3.7 Severity, Capability, and Confidence Intervals (CIs)