(pp. 44–5)
It is assumed Fisher is speaking of N-P, or at least Neyman. But N-P do not recommend such habitual practice.
Long Runs Are Hypothetical.
What about the allegation that N-P error probabilities allude to actual long-run repetitions, while the P-value refers only to a hypothetical distribution? N-P error probabilities are also about hypothetical would-be’s. Each sample of size n gives a single value of the test statistic d(X). Our inference is based on this one sample. The third requirement (Pearson’s “Step 3”) for tests is that we be able to compute the distribution of d(X), under the assumption that the world is approximately like H0, and under discrepancies from H0. Different outcomes would yield different d(X) values, and we consider the frequency distribution of d(X) over hypothetical repetitions.
At the risk of overkill, the sampling distribution is all about hypotheticals: the relative frequency of outcomes under one or another hypothesis. These also equal the relative frequencies assuming you really did keep taking samples in a long run, tiring yourself out in the process. It doesn’t follow that the value of the hypothetical frequencies depends on referring to, much less actually carrying out, that long run. A statistical hypothesis has implications for some hypothetical long run in terms of how frequently this or that would occur. A statistical test uses the data to check how well the predictions are met. The sampling distribution is the testable meeting-ground between the two.
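To make the hypothetical character of the sampling distribution concrete, here is a minimal simulation sketch (not from the text; the one-sided Normal test, n = 25, σ = 1, the observed d(X), and all other numbers are assumptions chosen purely for illustration). The P-value and the power appear as relative frequencies of d(X) over simulated, would-be repetitions:

```python
# Illustrative sketch: the sampling distribution of a test statistic d(X) as a
# frequency distribution over hypothetical repetitions. All settings assumed.
import numpy as np

rng = np.random.default_rng(1)
n, sigma, reps = 25, 1.0, 50_000

def d(sample):
    """Standardized sample mean: d(X) = sqrt(n) * x_bar / sigma."""
    return np.sqrt(n) * sample.mean() / sigma

# Hypothetical repetitions under H0 (mu = 0) and under a discrepancy (mu = 0.4)
d_H0  = np.array([d(rng.normal(0.0, sigma, n)) for _ in range(reps)])
d_alt = np.array([d(rng.normal(0.4, sigma, n)) for _ in range(reps)])

d_obs = 2.0                              # a single observed value of d(X)
p_value = (d_H0 >= d_obs).mean()         # relative frequency of {d >= d_obs} under H0
power   = (d_alt >= 1.96).mean()         # frequency of rejection (1.96 cutoff) under mu = 0.4
print(f"P-value ~ {p_value:.3f}, power against mu = 0.4 ~ {power:.2f}")
```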
The same pattern of reasoning is behind resampling from the one and only sample in order to generate a sampling distribution. (We meet with resampling in Section 4.10.) The only gap is to say why such a hypothetical (or counterfactual) is relevant for inference in the case at hand. Merely proposing that error probabilities give a vague “strength of evidence” to an inference won’t do. Our answer is that they capture the capacities of tests, which in turn tell us how severely tested various claims may be said to be.
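For readers who like to see the resampling idea in miniature, here is a hedged sketch (with assumed, illustrative data): bootstrap resampling generates a sampling distribution for a statistic by drawing, with replacement, from the one observed sample.

```python
# Minimal bootstrap sketch (illustrative only): resample with replacement from
# the single observed sample to build a sampling distribution for the mean.
import numpy as np

rng = np.random.default_rng(2)
sample = rng.normal(0.3, 1.0, 25)        # stands in for the one and only sample

boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(10_000)
])
se_boot = boot_means.std(ddof=1)         # bootstrap standard error of the mean
print(f"observed mean = {sample.mean():.3f}, bootstrap SE = {se_boot:.3f}")
```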
It’s Time to Get Beyond the “Inconsistent Hybrid” Charge
Gerd Gigerenzer is a wonderful source on how Fisherian and N-P methods led to a statistical revolution in psychology. He is famous for, among much else, arguing that the neat and tidy accounts of statistical testing in social science texts are really an inconsistent hybrid of elements from N-P’s behavioristic philosophy and Fisher’s more evidential approach (Gigerenzer 2002, p. 279). His tribe is an offshoot of the Incompatibilists, but with a Freudian analogy to illuminate the resulting tension and anxiety that a researcher is seen to face.
N-P testing, he says, “functions as the Superego of the hybrid logic” (ibid., p. 280). It requires alternatives, significance levels, and power to be prespecified, while strictly outlawing evidential or inferential interpretations about the truth of a particular hypothesis. The Fisherian “Ego gets things done … and gets papers published” (ibid.). Power is ignored, and the level of significance is found after the experiment, cleverly hidden by rounding up to the nearest standard level. “The Ego avoids … exact predictions of the alternative hypothesis, but claims support for it by rejecting a null hypothesis” and in the end is “left with feelings of guilt and shame for having violated the rules” (ibid.). Somewhere in the background lurks his Bayesian Id, driven by wishful thinking into misinterpreting error probabilities as degrees of belief.
As with most good caricatures, there is a large grain of truth in Gigerenzer’s Freudian metaphor – at least as regards the received view of these methods. I say it’s time to retire the “inconsistent hybrid” allegation. Reporting the attained significance level is entirely legitimate and is recommended in N-P tests, so long as one is not guilty of other post-data selections causing actual P-values to differ from reported or nominal ones. By failing to explore the inferential basis for the stipulations, there’s enormous unclarity as to what’s being disallowed and why, and what’s mere ritual or compulsive hand washing (as he might put it (ibid., p. 283)). Gigerenzer’s Ego might well deserve to feel guilty if he has chosen the hypothesis, or characteristic to be tested, based on the data, or if he claims support for a research hypothesis by merely rejecting a null hypothesis – the illicit NHST animal. A post-data choice of test statistic may be problematic, but not an attained significance level.
Gigerenzer recommends that statistics texts teach the conflict and stop trying “to solve the conflict between its parents by denying its parents” (2002, p. 281). I, on the other hand, think we should take responsibility for interpreting the tools according to their capabilities. Polemics between Neyman and Fisher, however lively, taken at face value, are a highly unreliable source; we should avoid chiseling into even deeper stone the hackneyed assignments of statistical philosophy – “he’s inferential, he’s an acceptance sampler.” The consequences of the “inconsistent hybrid” allegation are dire: both schools are caricatured, robbed of features that belong in an adequate account.
Hubbard and Bayarri (2003) are a good example of this; they proclaim an N-P tester is forbidden – forbidden! – from reporting the observed P-value. They eventually concede that an N-P test “could be defined equivalently in terms of the p value … the null hypothesis should be rejected if the observed p < α, and accepted otherwise” (p. 175). But they aver “no matter how small the p value is, the appropriate report is that the procedure guarantees a 100α% false rejection of the null on repeated use” (ibid.). An N-P tester must robotically obey the reading that has grown out of the Incompatibilist tribe to which they belong. A user must round up to the predesignated α. This type of prohibition gives a valid guilt trip to Gigerenzer’s Ego; yet the hang-up stems from the Freudian metaphor, not from Neyman and Pearson, who say:
it is doubtful whether the knowledge that Pz [the P-value associated with test statistic z] was really 0.03 (or 0.06) rather than 0.05, … would in fact ever modify our judgment … regarding the origin of a single sample.
(Neyman and Pearson 1928, p. 27)
But isn’t it true that rejection frequencies needn’t be indicative of the evidence against a null? Yes. Kadane’s example, if allowed, shows how to get a small rejection frequency with no evidence. But this was to be a problem for Fisher, solved by N-P (even if Kadane is not fond of them either). Granted, even in tests not so easily dismissed, crude rejection frequencies differ from an evidential assessment, especially when some of the outcomes leading to rejection vary considerably in their evidential force. This is the lesson of Cox’s famous “two machines with different precisions.” Some put this in terms of selecting the relevant reference set, which “need not correspond to all possible repetitions of the experiment” (Kalbfleisch and Sprott 1976, p. 272). We’ve already seen that relevant conditioning is open to an N-P tester. Others prefer to see it as a matter of adequate model specification. So once again it’s not a matter of Fisher vs. N-P.
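Cox’s example is easy to sketch numerically. The instrument precisions and the observed value below are assumptions for illustration only; the point is that the same observed x carries very different evidential force depending on which machine produced it, so the relevant reference set conditions on the machine actually used rather than averaging over all possible repetitions.

```python
# Illustrative sketch (assumed numbers) of Cox's "two machines with different
# precisions": one precise and one crude instrument, chosen by a fair coin flip,
# both measuring a quantity theta; one-sided test of H0: theta = 0.
from scipy.stats import norm

sigma_precise, sigma_crude = 1.0, 10.0   # the two instruments' precisions
x_obs = 3.0                              # the single observed measurement

p_precise = norm.sf(x_obs / sigma_precise)    # P-value conditional on the precise machine
p_crude   = norm.sf(x_obs / sigma_crude)      # P-value conditional on the crude machine
p_mixture = 0.5 * p_precise + 0.5 * p_crude   # unconditional: averages over the coin flip

print(f"P (precise machine) = {p_precise:.4f}")   # strong indication against H0
print(f"P (crude machine)   = {p_crude:.4f}")     # little indication against H0
print(f"P (unconditional)   = {p_mixture:.4f}")   # relevant to neither case at hand
```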
I’m prepared to admit Neyman’s behavioristic talk. Mayo (1996, Chapter 11) discusses: “Why Pearson rejected the (behavioristic) N-P theory” (p. 361). Pearson does famously declare that “the behavioristic conception is Neyman’s not mine” (1955, p. 207). Furthermore, Pearson explicitly addresses “the situation where statistical tools are applied to an isolated investigation of considerable importance …” (1947, p. 170).
In other and, no doubt, more numerous cases there is no repetition of the same type of trial or experiment, but all the same we can and many of us do use the same test rules … Why do we do this? … Is it because the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment?
Or is it because we are content that the application of a rule, now in this investigation, now in that, should result in a long-run frequency of errors in judgment which we control at a low figure?
(ibid., p. 172)
While tantalizingly leaving the answer dangling, it’s clear that for Pearson: “the formulation of the case in terms of hypothetical repetition helps to that clarity of view needed for sound judgment” (ibid.) in learning about the particular case at hand. He gives an example from his statistical work in World War II:
Two types of heavy armour-piercing naval shell of the same caliber are under consideration; they may be of different design or made by different firms … Twelve shells of one kind and eight of the other have been fired; two of the former and five of the latter failed to perforate the plate …
(Pearson 1947, p. 171)
Starting from the basis that individual shells will never be identical in armour-piercing qualities, … he has to consider how much of the difference between (i) two failures out of twelve and (ii) five failures out of eight is likely to be due to this inevitable variability.
(ibid.)
He considers what other outcomes could have occurred, and how readily, in order to learn what variability alone is capable of producing.5 Pearson opened the door to the evidential interpretation, as I noted in 1996, and now I go further.
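One way to see the “what other outcomes could have occurred, and how readily” reasoning is to put Pearson’s counts in a 2×2 table and compute an exact tail probability. The sketch below uses Fisher’s exact test as an illustrative reconstruction; it is not claimed to be Pearson’s own computation.

```python
# Pearson's shell data as a 2x2 table: how readily could variability alone
# produce a difference at least as large as 2/12 failures vs. 5/8 failures?
# Illustrative reconstruction via Fisher's exact (hypergeometric) test.
from scipy.stats import fisher_exact

#               fail  perforate
table = [[2, 10],      # shell type (i): 2 failures out of 12
         [5,  3]]      # shell type (ii): 5 failures out of 8

odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"exact P-value = {p_value:.3f}")
```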
Having looked more carefully at the history before the famous diatribes, and especially at Neyman’s applied work, I now hold that Neyman largely rejected the behavioristic conception as well! Most of the time, anyhow. But that’s not the main thing. Even if we couldn’t point to quotes and applications that break out of the strict “evidential versus behavioral” split, we should be the ones to interpret the methods for inference, and supply the statistical philosophy that directs their right use.
Souvenir L: Beyond Incompatibilist Tunnels
What people take away from the historical debates is Fisher (1955) accusing N-P, or mostly Neyman, of converting his tests into acceptance sampling rules more appropriate for five-year plans in Russia, or making money in the USA, than for science. Still, it couldn’t have been too obvious that N-P distorted his tests, since Fisher tells us only in 1955 that it was Barnard who explained that, despite agreeing mathematically in very large part, there is this distinct philosophical position. Neyman suggests that his terminology was to distinguish what he (and Fisher!) were doing from the attempts to define a unified rational measure of belief on hypotheses. N-P both denied there was such a thing. Given Fisher’s vehement disavowal of subjective Bayesian probability, N-P thought nothing of crediting Fisherian tests as a step in the development of “inductive behavior” (in their 1933 paper).
The radical difference in either methods or philosophy is a myth. Yet, as we’ll see, the hold it has over people continues to influence the use and discussion of tests. It’s based almost entirely on sniping between Fisher and Neyman from 1935 until Neyman left for the USA in 1938. Fisher didn’t engage much with statistical developments during World War II. Barnard describes Fisher as cut off “by some mysterious personal or political agency. Fisher’s isolation occurred, I think, at a particularly critical time, when opportunities existed for a fruitful fusion of ideas stemming from Neyman and Pearson and from Fisher” (Barnard 1985, p. 2). Lehmann observes that Fisher kept to his resolve not to engage in controversy with Neyman until the highly polemical exchange of 1955, at age 65. Fisher altered some of the lines of earlier editions of his books. For instance, Fisher’s lack of interest in the attained P-value was made clear in Statistical Methods for Research Workers (SMRW) (1934a, p. 80):
… in practice we do not want to know the exact value of P for any observed value of [the test statistic], but, in the first place, whether or not the observed value is open to suspicion.
If P is between .1 and .9 there is certainly no reason to suspect the hypothesis tested. If it is below .02 it is strongly indicated that the hypothesis fails to account for the whole of the facts. We shall not often be astray if we draw a conventional line at .05.
Lehmann explains that it was only “fairly late in life, Fisher’s attitude had changed” (Lehmann 2011, p. 52). In the 13th edition of SMRW, Fisher changed his last sentence to:
The actual value of P obtainable … indicates the strength of the evidence against the hypothesis. [Such a value] is seldom to be disregarded.
(p. 80)
Even so, this at most suggests how the methodological (error) probability is thought to provide a measure of evidential strength – it doesn’t abandon error probabilities. There’s a deeper reason for this backtracking by Fisher; I’ll save it for Excursion 5. One other thing to note: F and N-P were creatures of their time. Their verbiage reflects the concern with “operationalism” and “behaviorism,” growing out of positivistic and verificationist philosophy. I don’t deny the value of tracing out the thrust and parry between Fisher and Neyman in these excursions. None of the founders solved the problem of an inferential interpretation of error probabilities – though they each offered tidbits. Their name-calling – “you’re too mechanical,” “no, you are” – at most shows, as Gigerenzer and Marewski observe, that they all rejected mechanical statistics (2015, p. 422).
The danger is when one group’s interpretation is the basis for a historically and philosophically “sanctioned” reinterpretation of one or another method. Suddenly, rigid rules that the founders never endorsed are imposed. Through the Incompatibilist philosophical tunnel, as we are about to see, these reconstruals may serve as an effective way to dismiss the entire methodology – both F and N-P. After completing this journey, you shouldn’t have to retrace this “he said/they said” dispute again. It’s the methods, stupid.
3.6 Hocus-Pocus: P-values Are Not Error Probabilities, Are Not Even Frequentist!
Fisher saw the p value as a measure of evidence, not as a frequentist evaluation. Unfortunately, as a measure of evidence it is very misleading.
(Hubbard and Bayarri 2003, p. 181)
This entire tour, as you know, is to disentangle a jungle of conceptual issues, not to defend or criticize any given statistical school. In sailing forward to scrutinize Incompatibilist tribes who protest against mixing p’s and α’s, we need to navigate around a pool of quicksand. They begin by saying P-values are for evidence and inference, unlike error probabilities. N-P error probabilities are too performance oriented to be measures of evidence. In the next breath we’re told P-values aren’t good measures of evidence either. A good measure of evidence, it’s assumed, should be probabilist, in some way, and P-values disagree with probabilist measures, be they likelihood ratios, Bayes factors, or posteriors. If you reinterpret error probabilities, they promise, you can make peace with all tribes. Whether we get on firmer ground or sink in a marshy swamp will have to be explored.
Berger’s Unification of Jeffreys, Neyman, and Fisher
With “reconciliation” and “unification” in the air, Jim Berger, a statistician deeply influential in statistical foundations, sets out to see if he can get Fisher, Neyman, and (non-subjective) Bayesian Jeffreys to agree on testing (2003). A compromise awaits, if we nip and tuck the meaning of “error probability” (Section 3.5). If you’re an N-P theorist and like your error probability1, you can keep it, he promises, but he thinks you will want to reinterpret it. It then becomes possible to say that a P-value is not an error probability (full stop), meaning it’s not the newly defined error probability2. What’s error probability2? It’s a type of posterior probability in a null hypothesis, conditional on the outcome, given a prior. It may still be frequentist in some sense. On this reinterpretation, P-values are not error probabilities. Neither are the N-P Type I and II error probabilities, α and β. Following the philosopher’s clarifying move via subscripts, there is error probability1 – the usual frequentist notion – and error probability2 – notions from probabilism that had never been called error probabilities before.
In commenting on Berger (2003), I noted my surprise at his redefinition (Mayo 2003b). His reply: “Why should the frequentist school have exclusive right to the term ‘error probability?’ It is not difficult to simply add the designation ‘frequentist’ (or Type I or Type II) or ‘Bayesian’ to the term to differentiate between the schools” (Berger 2003, p. 30). That would work splendidly. So let error probability2 = Bayesian error probability. Frankly, I didn’t think Bayesians would want the term. In a minute, however, Berger will claim they alone are the true frequentist error probabilities! If you feel yourself sinking in a swamp of sliding meanings, remove your shoes, flip onto your back atop your walking stick, and you’ll stop sinking. Then, you need only to pull yourself to firm land. (See Souvenir M.)
The Bayes Factor.
In 1987, Berger and Sellke said that in order to consider P-values as error probabilities we need to introduce a decision or test rule. Berger (2003) proposes such a rule, and error probability2 is born. In trying to merge different methodologies, there’s always a danger of being biased in favor of one, begging the question against the others. From the severe tester’s perspective, this is what happens here, but so deftly that you might miss it if you blink.6
His example involves X1, …, Xn, IID data from N(θ, σ²), with σ² known, and the test is of two simple hypotheses H0: θ = θ0 and H1: θ = θ1. Consider now their two P-values: “for i = 0, 1, let pi be the p-value in testing Hi against the other hypothesis” (ibid., p. 6). Then reject H0 when p0 ≤ p1, and accept H0 otherwise. If you reject H0, you next compute the posterior probability of H0 using one of Jeffreys’ default priors, giving 0.5 to each hypothesis. The computation rests on the Bayes factor, or likelihood ratio, B(x) = Pr(x | H0)/Pr(x | H1):
Pr(H0 | x) = B(x)/[1 + B(x)].
The priors drop out, being 0.5. As before, x refers to a generic value for X.
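A small sketch may help fix the pieces just introduced: the two P-values p0 and p1, Berger’s rejection rule, the Bayes factor B(x), and the resulting posterior Pr(H0 | x) under the equal 0.5 priors. The particular values of θ0, θ1, σ, n, and the observed mean are assumptions for illustration only.

```python
# Illustrative computation for two simple hypotheses about a Normal mean with
# sigma known: P-values for each hypothesis against the other, Berger's rule,
# the Bayes factor based on the sample mean, and the posterior for H0.
import numpy as np
from scipy.stats import norm

theta0, theta1, sigma, n = 0.0, 1.0, 1.0, 10   # assumed settings (theta0 < theta1)
x_bar = 0.8                                    # assumed observed sample mean
se = sigma / np.sqrt(n)

# P-value for each hypothesis, testing it against the other
p0 = norm.sf((x_bar - theta0) / se)            # large x_bar discredits H0
p1 = norm.cdf((x_bar - theta1) / se)           # small x_bar discredits H1
reject_H0 = p0 <= p1                           # reject H0 when p0 <= p1

# Bayes factor B(x) = Pr(x|H0)/Pr(x|H1) via the sufficient statistic x_bar,
# and the posterior Pr(H0|x) = B(x)/(1 + B(x)) with the 0.5 priors dropping out
B = norm.pdf(x_bar, theta0, se) / norm.pdf(x_bar, theta1, se)
post_H0 = B / (1 + B)

print(f"p0 = {p0:.4f}, p1 = {p1:.4f}, reject H0: {reject_H0}")
print(f"B(x) = {B:.3f}, Pr(H0|x) = {post_H0:.3f}")
```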