(Pearson 1962, p. 276)
This goes for Fisherian contributions as well. Unlike museums, we won't remain static.
The lesson from Tour I of this Excursion is that Fisherian and Neyman–Pearsonian tests may be seen as offering clusters of methods appropriate for different contexts within the large taxonomy of statistical inquiries. There is an overarching pattern:
Just as with the use of measuring instruments, applied to the specific case, we employ the performance features to make inferences about aspects of the particular thing that is measured, aspects that the measuring tool is appropriately capable of revealing.
(Mayo and Cox 2006, p. 84)
This information is used to ascertain what claims have, and have not, passed severely, post-data. Any such proposed inferential use of error probabilities gives considerable fodder for criticism from various tribes of Fisherians, Neyman–Pearsonians, and Bayesians. We can hear them now:
N-P theorists can only report the preset error probabilities, and can't use P-values post-data.
A Fisherian wouldn't dream of using something that skirts so close to power as does the “sensitivity function” Π(γ).
Your account cannot be evidential because it doesn't supply posterior probabilities to hypotheses.
N-P and Fisherian methods preclude any kind of inference since they use “the sample space” (violating the LP).
How can we reply? To begin, we need to uncover how the charges originate in traditional philosophies long associated with error statistical tools. That's the focus of Tour II.
Only then do we have a shot at decoupling traditional philosophies from those tools in order to use them appropriately today. This is especially so when the traditional foundations stand on such wobbly grounds, grounds largely rejected by founders of the tools. There is a philosophical disagreement between Fisher and Neyman, but it differs importantly from the ones that you're presented with and which are widely accepted and repeated in scholarly and popular treatises on significance tests. Neo-Fisherians and N-P theorists, keeping to their tribes, forfeit notions that would improve their methods (e.g., for Fisherians: explicit alternatives, with corresponding notions of sensitivity, and distinguishing statistical and substantive hypotheses; for N-P theorists, making error probabilities relevant for inference in the case at hand).
The spadework on this tour will be almost entirely conceptual: we won't be arguing for or against any one view. We begin in Section 3.4 by unearthing the basis for some classic counterintuitive inferences thought to be licensed by either Fisherian or N-P tests. That many are humorous doesn't mean disentangling their puzzles is straightforward; a medium to heavy shovel is recommended. We can switch to a light to medium shovel in Section 3.5: excavations of the evidential versus behavioral divide between Fisher and N-P turn out to be mostly built on sand. As David Cox observes, Fisher is often more performance-oriented in practice, but not in theory, while the reverse is true for Neyman and Pearson. At times, Neyman exaggerates the behavioristic conception just to accentuate how much Fisher's tests need reining in. Likewise, Fisher can be spotted running away from his earlier behavioristic positions just to derogate the new N-P movement, whose popularity threatened to eclipse the statistics program that was, after all, his baby. Taking the polemics of Fisher and Neyman at face value, many are unaware how much they are based on personality and professional disputes. Hearing the actual voices of Fisher, Neyman, and Pearson (F and N-P), you don't have to accept the gospel of “what the founders really thought.” Still, there's an entrenched history and philosophy of F and N-P: A thick-skinned jacket is recommended. On our third stop (Section 3.6) we witness a bit of magic. The very concept of an error probability gets redefined and, hey presto!, a reconciliation between Jeffreys, Fisher, and Neyman is forged. Wear easily removed shoes and take a stiff walking stick. The Unificationist tribes tend to live near underground springs and lakeshore bounds; in the heady magic, visitors have been known to accidentally fall into a pool of quicksand.
3.4 Some Howlers and Chestnuts of Statistical Tests
The well-known definition of a statistician as someone whose aim in life is to be wrong in exactly 5 per cent of everything they do misses its target.
(Sir David Cox 2006a, p. 197)
Showing that a method's stipulations could countenance absurd or counterintuitive results is a perfectly legitimate mode of criticism. I reserve the term “howler” for common criticisms based on logical fallacies or conceptual misunderstandings. Other cases are better seen as chestnuts – puzzles that the founders of statistical tests never cleared up explicitly. Whether you choose to see my “howler” as a “chestnut” is up to you. Under each exhibit is the purported basis for the joke.
Exhibit (iii): Armchair Science.
Did you hear the one about the statistical hypothesis tester … who claimed that observing “heads” on a biased coin that lands heads with probability 0.05 is evidence of a statistically significant improvement over the standard treatment of diabetes, on the grounds that such an event occurs with low probability (0.05)?
The “armchair” enters because diabetes research is being conducted solely by flipping a coin. The joke is a spin-off from Kadane (2011):
Flip a biased coin that comes up heads with probability 0.95, and tails with probability 0.05. If the coin comes up tails reject the null hypothesis. Since the probability of rejecting the null hypothesis if it is true is 0.05, this is a valid 5 percent level test. It is also very robust against data errors; indeed it does not depend on the data at all. It is also nonsense, of course, but nonsense allowed by the rules of significance testing.
(p. 439)
Basis for the joke: Fisherian test requirements are (allegedly) satisfied by any method that rarely rejects the null hypothesis.
But are they satisfied? I say no. The null hypothesis in Kadane's example can be in any field, diabetes, or the mean deflection of light. (Yes, Kadane affirms this.) He knows the test entirely ignores the data, but avers that “it has the property that Fisher proposes” (Kadane 2016, p. 1). Here's my take: in significance tests and in scientific hypothesis testing more generally, data can disagree with H only by being counter to what would be expected under the assumption that H is correct. An improbable series of coin tosses or plane crashes does not count as a disagreement from hypotheses about diabetes or light deflection. In Kadane's example, there is accordance so long as a head occurs – but this is a nonsensical distance measure. Were someone to tell you that any old improbable event (three plane crashes in one week) tests a hypothesis about light deflection, you would say that person didn't understand the meaning of testing in science or in ordinary life. You'd be right (for some great examples, see David Hand 2014).
Kadane knows it's nonsense, but thinks the only complaint a significance tester can have is its low power. What's the power of this “test” against any alternative? It's just the same as the probability it rejects, period, namely, 0.05. So an N-P tester could at least complain. Now I agree that bad tests may still be tests; but I'm saying Kadane's is no test at all. If you want to insist Fisher permits this test, fine, but I don't think that's a very generous interpretation. As egregious as this howler is, it is instructive because it shows like nothing else the absurdity of a crass performance view that claims: reject the null and infer evidence of a genuine effect, so long as it is done rarely. Crass performance is bad enough, but this howler commits a further misdemeanor: it overlooks the fact that a test statistic d(x) must track discrepancies from H0, becoming bigger (or smaller) as discrepancies increase (I list it as (ii) in Section 3.2). With any sensible distance measure, a misfit with H0 must be because of the falsity of H0. The probability of “heads” under a hypothesis about light deflection isn't even defined, because deflection hypotheses do not assign probabilities to coin-tossing trials. Fisher wanted test statistics to reduce the data from the generating mechanism, and here it's not even from the mechanism.
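To make the complaint concrete, here is a minimal simulation sketch (my own illustration; the function name and numbers are hypothetical, not Kadane's or Fisher's code). Because the coin never consults the data, the probability of “rejecting” is 0.05 no matter what is true of diabetes or light deflection, so the “power” against every alternative equals the size:

```python
import random

def kadane_coin_test(n_trials=100_000, seed=1):
    """Kadane-style 'test': reject H0 whenever a biased coin lands tails (probability 0.05).
    The data bearing on the hypothesis under test are never consulted."""
    rng = random.Random(seed)
    rejections = sum(rng.random() < 0.05 for _ in range(n_trials))
    return rejections / n_trials

# The long-run rejection rate is ~0.05 whether H0 is true or false:
# size 0.05, power 0.05 against every alternative, and no test statistic
# that tracks discrepancies from H0.
print(kadane_coin_test())  # ~0.05
```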
Kadane regards this example as “perhaps the most damaging critique” of significance tests (2016, p. 1). Well, Fisher can get around this easily enough.
Exhibit (iv): Limb-sawing Logic. Did you hear the one about significance testers sawing off their own limbs?
As soon as they reject the null hypothesis H0 based on a small P-value, they no longer can justify the rejection because the P-value was computed under the assumption that H0 holds, and now it doesn't.
Basis for the joke: If a test assumes H, then as soon as H is rejected, the grounds for its rejection disappear!
This joke, and I swear it is widely repeated but I won't name names, reflects a serious misunderstanding about ordinary conditional claims. The assumption we use in testing a hypothesis H, statistical or other, is an implicationary or i-assumption. We have a conditional, say: If H then expect x, with H the antecedent. The entailment from H to x, whether it is statistical or deductive, does not get sawed off after the hypothesis or model H is rejected when the prediction is not borne out. A related criticism is that statistical tests assume the truth of their test or null hypotheses. No, once again, they may serve only as i-assumptions for drawing out implications. The howler occurs when a test hypothesis that serves merely as an i-assumption is purported to be an actual assumption, needed for the inference to go through. A little logic goes a long way toward exposing many of these howlers. As the point is general, we use H.
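In bare logical form (a minimal schematic added here for illustration, not a quotation from any of the authors discussed), the i-assumption is simply the antecedent of a conditional, and that conditional remains standing after its antecedent is rejected:

```latex
% Modus tollens: H_0 is assumed only in order to draw out the prediction x;
% concluding not-H_0 does not discharge or "saw off" the conditional premise.
\[
  \big(H_0 \rightarrow \text{expect } x\big), \qquad \neg x \;\;\vdash\;\; \neg H_0
\]
```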
This next challenge is by Harold Jeffreys. I won't call it a howler because it hasn't, to my knowledge, been excised by testers: it's an old chestnut, and a very revealing one.
Exhibit (v): Jeffreys’ Tail Area Criticism.
Did you hear the one about statistical hypothesis testers rejecting H0 because of outcomes it failed to predict?
What's unusual about that?
What's unusual is that they do so even when these unpredicted outcomes haven't occurred!
Actually, one can't improve upon the clever statement given by Jeffreys himself. Using P-values, he avers, implies that “a hypothesis that may be true may be rejected because it has not predicted observable results that have not occurred” (1939/1961, p. 385).
Basis for the joke: The P-value, Pr(d ≥ d0; H0), uses the “tail area” of the curve under H0. d0 is the observed difference, but {d ≥ d0} includes differences even further from H0 than d0.
This has become the number one joke in comical statistical repertoires. Before debunking it, let me say that Jeffreys shows a lot of admiration for Fisher: “I have in fact been struck repeatedly in my own work … to find that Fisher had already grasped the essentials by some brilliant piece of common sense, and that his results would either be identical with mine or would differ only in cases where we should both be very doubtful” (ibid., p. 393). The famous quip is funny because it seems true, yet paradoxical. Why consider more extreme outcomes that didn't occur? The non-occurrence of more deviant results, Jeffreys goes on to say, “might more reasonably be taken as evidence for the law [in this case, H0], not against it” (ibid., p. 385). The implication is that considering outcomes beyond d0 is to unfairly discredit H0, in the sense of finding more evidence against it than if only the actual outcome d0 is considered.
The opposite is true.
Considering the tail area makes it harder, not easier, to find an outcome statistically significant (although this isn't the only function of the tail area). Why? Because it requires not merely that Pr(d = d0; H0) be small, but that Pr(d ≥ d0; H0) be small. This alone squashes the only sense in which this could be taken as a serious criticism of tests. Still, there's a legitimate question about why the tail area probability is relevant. Jeffreys himself goes on to give it a rationale: “If mere improbability of the observations, given the hypothesis, was the criterion, any hypothesis whatever would be rejected. Everybody rejects the conclusion” (ibid., p. 385), so some other criterion is needed. Looking at the tail area supplies one; another would be a prior, which is Jeffreys' preference.
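A toy calculation may help fix the point (my own illustration, not an example from Jeffreys or Fisher): with 20 tosses of a fair coin and 13 heads observed, the point probability under H0 is about 0.07, yet the tail probability is about 0.13. Demanding that the tail area be small is therefore the stricter requirement:

```python
from math import comb

n, p, observed = 20, 0.5, 13

# Point probability Pr(d = d0; H0) versus upper-tail probability Pr(d >= d0; H0)
point_prob = comb(n, observed) * p**observed * (1 - p)**(n - observed)
tail_prob = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(observed, n + 1))

print(f"Pr(X = 13; H0)  = {point_prob:.4f}")   # ~0.0739
print(f"Pr(X >= 13; H0) = {tail_prob:.4f}")    # ~0.1316, not small at the usual levels
```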
It's worth reiterating that Jeffreys correctly points out that “everybody rejects” the idea that the improbability of data under H suffices for evidence against H. Shall we choose priors or tail areas? Jeffreys chooses default priors. Interestingly, as Jeffreys recognizes, for Normal distributions “the tail area represents the probability, given the data” that the actual discrepancy is in the direction opposite to that observed – d0 is the wrong “sign” (ibid., p. 387). (This relies on a uniform prior probability for the parameter.) This connection between P-values and posterior probabilities is often taken as a way to “reconcile” them, at least for one-sided tests (Sections 4.4, 4.5). This was not one of Fisher's given rationales.
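A quick numerical check of the correspondence Jeffreys notes (a sketch under the assumptions stated above: Normal data with known standard deviation and a uniform prior on the mean; the particular numbers are illustrative): the one-sided P-value for H0: μ = 0 coincides with the posterior probability that μ lies on the opposite side of 0 from the observed mean.

```python
from math import sqrt
from statistics import NormalDist

sigma, n, xbar = 1.0, 25, 0.3          # known sd, sample size, observed mean (illustrative)
se = sigma / sqrt(n)

# One-sided P-value: Pr(Xbar >= xbar; mu = 0)
p_value = 1 - NormalDist(mu=0.0, sigma=se).cdf(xbar)

# Posterior under a uniform (flat) prior: mu | xbar ~ N(xbar, se^2),
# so Pr(mu <= 0 | xbar) is the probability the discrepancy has the "wrong sign".
posterior_wrong_sign = NormalDist(mu=xbar, sigma=se).cdf(0.0)

print(round(p_value, 4), round(posterior_wrong_sign, 4))  # both ~0.0668
```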
Note that the joke talks about outcomes the null does not predict – just what we wouldn't know without an assumed test statistic or alternative. One reason to evoke the tail area in Fisherian tests is to determine what H0 “has not predicted,” that is, to identify a sensible test statistic d(x). Fisher, strictly speaking, has only the null distribution, with an implicit interest in tests with sensitivity of a given type. Fisher discusses this point in relation to the lady tasting tea (1935a, pp. 14–15). Suppose I take an observed difference d0 as grounds to reject H0 on account of its being improbable under H0, when in fact larger differences (larger d values) are even more probable under H0. Then, as Fisher rightly notes, the improbability of the observed difference would be a poor indication of underlying discrepancy. (In N-P terms, it would be a biased test.) Looking at the tail area would reveal this fallacy; whereas it would be readily committed, Fisher notes, in accounts that only look at the improbability of the observed outcome d0 under H0.
When E. Pearson (1970) takes up Jeffreys' question: “Why did we use tail-area probabilities…?”, his reply is that “this interpretation was not part of our approach” (p. 464). Tail areas simply fall out of the N-P desiderata of good tests:
Given the lambda criterion one needed to decide at what point H0 should be regarded as no longer tenable, that is, where should one choose to bound the rejection region? To help in reaching this decision it appeared that the probability of falling into the region chosen, if H0 were true, was one necessary piece of information.
(ibid.)
So looking at the tail area could be seen as the result of formulating a sensible distance measure (for Fisher), or erecting a good critical region (for Neyman and Pearson).
Pearson's reply doesn't go far enough; it does not by itself explain why reporting the probability of falling into the rejection region is relevant for inference. It points to a purely performance-oriented justification that I know Pearson shied away from: it ensures data fall in a critical region rarely under H0 and sufficiently often under alternatives in H1 – but this tends to be left as a pre-data, performance goal (recall Birnbaum's Conf, Souvenir D). It is often alleged the N-P tester only reports whether or not x falls in the rejection region. Why would N-P theorists collapse all outcomes in this region?
In my reading, the error statistician does not collapse the result beyond what the minimal sufficient statistic requires for the question at hand. From our Translation Guide, Souvenir C, considering (d(X) ≥ d(x0)) signals that we're interested in the method, and we insert “the test procedure would have yielded” before d(X). We report what was observed x0 and the corresponding d(x0) – or d0 – but we require the methodological probability, via the sampling distribution of d(X) – abbreviated as d. This could mean looking at other stopping points, other endpoints, and other variables. We require that with high probability our test would have warned us if the result could easily have come about in a universe where the test hypothesis is true, that is, Pr(d(X) < d(x0); H0) is high. Besides, we couldn't throw away the detailed data, since they're needed to audit model assumptions.
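For a stylized one-sided Normal test (a sketch of my own, with an illustrative observed statistic, not a formula from the text), this requirement is simply the complement of the P-value: when the P-value is small, Pr(d(X) < d(x0); H0) is correspondingly high.

```python
from statistics import NormalDist

d_obs = 2.2  # illustrative observed value of a standardized test statistic d(x0)

# Under H0, take d(X) ~ N(0, 1) for this stylized one-sided test.
p_value = 1 - NormalDist().cdf(d_obs)   # Pr(d(X) >= d(x0); H0)
warning_prob = NormalDist().cdf(d_obs)  # Pr(d(X) < d(x0); H0) = 1 - P-value

print(f"P-value = {p_value:.4f}; Pr(d(X) < d(x0); H0) = {warning_prob:.4f}")
```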
To conclude this exhibit, considering the tail area does not make it easier to reject H0 but harder. Harder because it's not enough that the outcome be improbable under the null; outcomes even greater must be improbable under the null. Pr(d(X) = d(x0); H0) could be small while Pr(d(X) ≥ d(x0); H0) is not small. This leads to blocking a rejection when it should be blocked, because it means the test could readily produce even larger differences under H0. Considering other possible outcomes that could have arisen is essential for assessing the test's capabilities. To understand the properties of our inferential tool is to understand what it would do under different outcomes, under different conjectures about what's producing the data. (Yes, the sample space matters post-data.) I admit that neither Fisher nor N-P adequately pinned down an inferential justification for tail areas, but now we have.
A bit of foreshadowing of a later shore excursion: some argue that looking at d(X) ≥ d(x0) actually does make it easier to find evidence against H0. How can that be? Treating (1 – β)/α as a kind of likelihood ratio in favor of an alternative over the null, then fed into a Likelihoodist or Bayesian algorithm, it can appear that way. Stay tuned.
Exhibit (vi): Two Measuring Instruments of Different Precisions.
Did you hear about the frequentist who, knowing she used a scale that's right only half the time, claimed her method of weighing is right 75% of the time?
She says, “I flipped a coin to decide whether to use a scale that's right 100% of the time, or one that's right only half the time, so, overall, I'm right 75% of the time.” (She wants credit because she could have used a better scale, even knowing she used a lousy one.)
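Her arithmetic can be reproduced in a few lines (a simulation sketch of my own, with hypothetical names): the 0.75 is the unconditional, pre-data average over which instrument the coin happens to select, whereas once we know the lousy scale was used, the relevant figure is 0.5:

```python
import random

rng = random.Random(0)
n = 100_000
correct_overall = correct_given_lousy = lousy_uses = 0

for _ in range(n):
    use_good_scale = rng.random() < 0.5                     # coin flip picks the instrument
    correct = True if use_good_scale else rng.random() < 0.5  # good scale always right; lousy one half the time
    correct_overall += correct
    if not use_good_scale:
        lousy_uses += 1
        correct_given_lousy += correct

print(correct_overall / n)                # ~0.75: unconditional performance over the mixture
print(correct_given_lousy / lousy_uses)   # ~0.50: relevant once the lousy scale is known to be used
```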