Statistical Inference as Severe Testing
In his 1980 retraction, Hacking, following Peirce, denies there’s a logic of statistical inference, explaining that it was a false analogy with deduction that led everyone to suppose the probability is to be assigned to the conclusion rather than to the overall method (Section 2.1). We should all be over that discredited view by now.
Elliott Sober.
Our passage pops up in Elliott Sober, who combines the more disconcerting aspects of earlier interpretations. According to Sober (2008, p. 7), “Neyman and Pearson think of acceptance and rejection” as acts that should only “be regulated by prudential considerations, not by ‘evidence,’ which, for them, is a will o’ the wisp … There is no such thing as allowing ‘evidence’ to regulate what we believe. Rather, we must embrace a policy and stick to it.” Sober regards this as akin to Pascal’s theological argument for believing in God, declaring, “Pascal’s concept of prudential acceptance lives on in frequentism” (ibid.).
I don’t think it’s plausible to read Neyman and Pearson, their theory or their applications, and come away with the view that they are denying there is such a thing as evidence. Continuing in the paper critics love to cite, N-P 1933:
We ask whether the variation in a certain character may be considered as following the normal law; … whether regression is linear; whether the variance in a number of samples differs significantly. … [W]e are not concerned with the exact value of particular parameters, but seek for information regarding the conditions and factors controlling the events.
(ibid., p. 145)
Plainly, using data to obtain information regarding factors controlling events is indicative of using data as evidence. What goes under the banner of reliabilism in epistemology is scarcely different from what N-P offer for statistics: a means of arriving at a measurement through a procedure that is rarely wrong and, if wrong, is not far wrong, with mistakes likely discovered in later probes.
I could multiply ad nauseam similar readings of this passage. By the time statistician Robert Kass (2011, p. 8) gets to it, the construal is so hardened he doesn’t even need to provide a reference:
We now recognize Neyman and Pearson to have made permanent, important contributions to statistical inference through their introduction of hypothesis testing and confidence. From today’s vantage point, however, their behavioral interpretation seems quaint, especially when represented by their famous dictum,
at which point our famous passage appears. I can see rejecting the extreme behavioristic view, but am not sure why Kass calls it “quaint” for an account to control error probabilities. I thought he (here and in Brown and Kass 2009) was at pains to insist on performance characteristics, declaring even Bayesians need them. I return to Kass in Excursion 6. At least he does not try to stick them with Pascal’s wager!
Let’s grant for the sake of argument that Neyman became a full-blown behaviorist and thought the only justification for tests was low errors in the long run. Pearson absolutely disagreed. What’s interesting is this. In the context of why Neyman regards the Bertrand–Borel debate as having “served as an inspiration to Egon S. Pearson and myself,” the relevance of error probabilities is not hard to discern. Why report what would happen in repetitions were outcome x to be taken as indicating claim C? Because it’s the way to design stringent tests and make probability claims pre-data that are highly informative post-data as to how well tested claims are.
5.8 Neyman’s Performance and Fisher’s Fiducial Probability
Many say fiducial probability was Fisher’s biggest blunder; others suggest it still hasn’t been understood. Most discussions avoid a side trip to the Fiducial Islands altogether, finding the surrounding brambles too thorny to negotiate. I now think this is a mistake, and it is a mistake at the heart of the consensus interpretation of the N-P vs. Fisher debate. We don’t need to solve the problems of fiducial inference, fortunately, to avoid taking the words of the Fisher–Neyman dispute out of context. Although the Fiducial Islands are fraught with minefields, new bridges are being built connecting some of the islands to Power Peninsula and the general statistical mainland.
So what is fiducial inference? I begin with Cox’s contemporary treatment, distilled from much controversy. The following passages swap his upper limit for the lower limit to keep to the example Fisher uses:
We take the simplest example, … the normal mean when the variance is known, but the considerations are fairly general. The lower limit
$$\bar{x} - z_c\,\sigma/\sqrt{n}$$
derived here from the probability statement
$$\Pr(\mu > \bar{X} - z_c\,\sigma/\sqrt{n}) = 1 - c$$
is a particular instance of a hypothetical long run of statements a proportion 1 − c of which will be true, … assuming our model is sound. We can, at least in principle, make such a statement for each c and thereby generate a collection of statements, sometimes called a confidence distribution.
(Cox 2006a, p. 66; $\bar{x}$ for $\bar{y}$, $\bar{X}$ for $\bar{Y}$, and $z_c$ for $k^*_c$)
Once $\bar{x}$ is observed, $\bar{x} - z_c\,\sigma/\sqrt{n}$ is what Fisher calls the fiducial c percent limit for μ. It is, of course, the specific 1 − c lower confidence interval estimate (Section 3.7).
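The collection of lower limits, one for each c, can be tabulated directly. The following is a minimal Python sketch, assuming a Normal model with known σ and taking $z_c$ to satisfy Φ($z_c$) = 1 − c; the function names and the numerical inputs (observed mean 10.0, σ = 2.0, n = 16) are hypothetical choices for illustration, not values from the text.

```python
from math import sqrt
from statistics import NormalDist


def lower_limit(xbar: float, sigma: float, n: int, c: float) -> float:
    """Lower limit xbar - z_c * sigma / sqrt(n), where Phi(z_c) = 1 - c."""
    z_c = NormalDist().inv_cdf(1 - c)
    return xbar - z_c * sigma / sqrt(n)


def confidence_distribution(xbar: float, sigma: float, n: int, levels):
    """Map each c to its lower limit: one statement 'mu > limit' per c."""
    return {c: lower_limit(xbar, sigma, n, c) for c in levels}


if __name__ == "__main__":
    # Hypothetical inputs, chosen only to illustrate the calculation.
    table = confidence_distribution(10.0, 2.0, 16, [0.5, 0.2, 0.05, 0.025])
    for c, limit in table.items():
        print(f"c = {c:<5} infer mu > {limit:.3f} at confidence level {1 - c}")
```

Each row is one member of the family of statements; the smaller c is, the lower (more cautious) the limit, and at c = 0.5 the limit is the observed mean itself.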
Here’s Fisher in the earliest paper on fiducial inference in 1930. He sets 1 − c as 0.95. Starting from the significance test of a specific μ, he identifies the corresponding 95 percent value $\bar{x}_{.05}$, such that in 95% of samples $\bar{X} < \bar{x}_{.05}$. In the normal testing example, $\bar{x}_{.05} = \mu + 1.65\,\sigma/\sqrt{n}$. Notice that, with $\mu = \mu_0$, $\bar{x}_{.05}$ is the cut-off for a 0.05 one-sided test T+ (of $\mu \le \mu_0$ vs. $\mu > \mu_0$).
[W]e have a relationship between the statistic [$\bar{X}$] and the parameter μ, such that [$\bar{x}_{.05}$] is the 95 per cent. value corresponding to a given μ, and this relationship implies the perfectly objective fact that in 5 per cent. of samples [$\bar{X} > \bar{x}_{.05}$]. That is,
$$\Pr(\bar{X} \le \mu + 1.65\,\sigma/\sqrt{n}) = 0.95.$$
(Fisher 1930, p. 533; substituting μ for θ and $\bar{X}$ for T.)
$\bar{X} > \bar{x}_{.05}$ occurs whenever the generic $\mu < \bar{X} - 1.65\,\sigma/\sqrt{n}$. For a particular observed $\bar{x}$, $\bar{x} - 1.65\,\sigma/\sqrt{n}$ is the “fiducial 5 per cent. value of μ.”
We may know as soon as [$\bar{X}$] is calculated what is the fiducial 5 per cent. value of μ, and that the true value of μ will be less than this value in just 5 per cent. of trials. This then is a definite probability statement about the unknown parameter μ which is true irrespective of any assumption as to its a priori distribution.
(ibid.) 3
This seductively suggests $\mu < \bar{x} - 1.65\,\sigma/\sqrt{n}$ gets the probability 0.05 – a fallacious probabilistic instantiation.
However, there’s a kosher probabilistic statement about $\bar{X}$; it’s just not a probabilistic assignment to a parameter. Instead, a particular substitution is, to paraphrase Cox, “a particular instance of a hypothetical long run of statements 95% of which will be true.” After all, Fisher was abundantly clear that the fiducial bound should not be regarded as an inverse inference to a posterior probability. We could only obtain an inverse inference by considering μ to have been selected from a superpopulation of μ’s, with known distribution. The posterior probability would then be a deductive inference and not properly inductive. In that case, says Fisher, we’re not doing inverse or Bayesian inference.
In reality the statements with which we are concerned differ materially in logical content from inverse probability statements, and it is to distinguish them from these that we speak of the distribution derived as a fiducial frequency distribution, and of the working limits, at any required level of significance, … as the fiducial limits at this level.
(Fisher 1936, p. 253)
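The long-run reading of the fiducial limit can be checked numerically. The following is a minimal Python sketch, assuming Normal data with known σ; the true μ, σ, n, trial count, and seed are illustrative assumptions rather than values from the text, and 1.645 is the usual Normal 0.95 quantile that the text rounds to 1.65. Here μ is a fixed constant throughout: no prior or superpopulation of μ’s is used, and what varies from trial to trial is the sample mean.

```python
import random
from math import sqrt

Z_95 = 1.645  # Normal 0.95 quantile (rounded to 1.65 in the text)


def coverage(mu: float, sigma: float, n: int, trials: int, seed: int = 0) -> float:
    """Fraction of trials in which 'mu > Xbar - 1.645*sigma/sqrt(n)' is true."""
    rng = random.Random(seed)
    se = sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        xbar = rng.gauss(mu, se)   # one draw from the sampling distribution of the mean
        lower = xbar - Z_95 * se   # the lower limit computed from this sample
        hits += mu > lower         # this particular statement is simply true or false
    return hits / trials


if __name__ == "__main__":
    # Illustrative values only; any fixed mu gives about the same proportion.
    print(coverage(mu=0.0, sigma=1.0, n=25, trials=20000))
    print(coverage(mu=3.0, sigma=2.0, n=10, trials=20000, seed=1))
```

Roughly 95% of the generated statements come out true whatever fixed μ is chosen, which is the sense in which the probability attaches to the method of producing the statements.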
So, what is being assigned the fiducial probability? It’s the method of reaching claims to which the probability attaches. This is even clearer in his 1936 discussion where σ is unknown and must be estimated. Because $\bar{X}$ and S (using the Student’s t pivot) are sufficient statistics “we may infer, without any use of probabilities a priori, a frequency distribution for μ which shall correspond with the aggregate of all such statements … to the effect that the probability μ is less than $\bar{x} - 2.145\,s/\sqrt{n}$ is exactly one in forty” (ibid., p. 253). This uses Student’s t distribution with n = 15. It’s plausible, at that point, to suppose Fisher means for $\bar{x}$ to be a random variable.
Suppose you’re Neyman and Pearson working in the early 1930s aiming to clarify and justify Fisher’s methods. ‘I see what’s going on,’ we can imagine Neyman declaring. There’s a method for outputting statements such as would take the general form
$$\mu > \bar{X} - 2.145\,S/\sqrt{n}.$$
Some would be in error, others not. The method outputs statements with a probability (some might say a propensity) of 0.975 of being correct. “We may look at the purpose of tests from another viewpoint”: probability assures us of the performance of a method (it’s methodological).
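The performance reading for the case of unknown σ can likewise be checked by simulation. This is a minimal Python sketch, assuming Normal data with n = 15 and using 2.145, the tabled 0.975 quantile of Student’s t with 14 degrees of freedom (the value for which the error rate is one in forty); the true μ and σ, the trial count, and the seed are arbitrary illustrative assumptions.

```python
import random
from math import sqrt

T_975_DF14 = 2.145  # tabled 0.975 quantile of Student's t, 14 degrees of freedom


def t_coverage(mu: float, sigma: float, trials: int, n: int = 15, seed: int = 0) -> float:
    """Fraction of samples for which 'mu > Xbar - 2.145*S/sqrt(n)' is correct."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = sum(sample) / n
        s = sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))  # sample std. dev.
        lower = xbar - T_975_DF14 * s / sqrt(n)
        correct += mu > lower
    return correct / trials


if __name__ == "__main__":
    # sigma is never used in forming the limit; it only generates the data.
    print(t_coverage(mu=5.0, sigma=3.0, trials=20000))
    print(t_coverage(mu=0.0, sigma=0.5, trials=20000, seed=2))
```

The proportion of correct statements stays near 0.975 for either value of σ, illustrating why the t pivot frees the error probability from the unknown variance.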
At the time, Neyman thought his development of confidence intervals (in 1930) was essentially the same as Fisher’s fiducial intervals. There was evidence for this. Recall the historical side trip of Section 3.7. When Neyman gave a (1934) paper to the Royal Statistical Society discussing confidence intervals, seeking to generalize fiducial limits, he made it clear that the term confidence coefficient refers to “probability of our being right when applying a certain rule” for making statements set out in advance (p. 140). Much to Egon Pearson’s relief, Fisher called Neyman’s generalization “a wide and very handsome one,” even though it didn’t achieve the uniqueness Fisher had wanted (Fisher 1934c, p. 137). There was even a bit of a mutual admiration society, with Fisher saying “Dr Neyman did him too much honour” in crediting him for the revolutionary insight of Student’s t pivotal, giving the credit to Student. Neyman (1934, p. 141) responds that of course in calling it Student’s t he is crediting Student, but “this does not prevent me from recognizing and appreciating the work of Professor Fisher concerning the same distribution.”
In terms of our famous passage, we may extract this reading: In struggling to extricate Fisher’s fiducial limits, without slipping into fallacy, they are led to the N-P construal. Since fiducial probability was to apply to significance testing as well as estimation, it stands to reason that the performance notion would find its way into the N-P 1933 paper. 4 So the error probability applies to the method, but the question is whether it’s intended to qualify a given inference, or only to express future long-run assurance (performance).
N-P and Fisher Dovetail: It’s Interpretation, not Mathematics
David Cox shows that the Neyman–Pearson theory of tests and confidence intervals arrives at the same place as the Fisherian, even though in a sense they proceed in the opposite direction. Suppose that there is a full model covering both null and alternative possibilities. To establish a significance test, we need to have an appropriate test statistic d(X) such that the larger the d(X) the greater the discrepancy with the null hypothesis in the respect of interest. But it is also required that the probability distribution of d(X) be known under the assumption of the null hypothesis. In focusing on the logic, we’ve mostly considered just one unknown parameter, e.g., the mean of a Normal distribution. In most realistic cases there are additional parameters required to compute the P-value, sometimes called “nuisance” parameters λ, although they are just as legitimate as the parameter we happen to be interested in. We’d like to free the computation of the P-value from these other unknown parameters. This is the error statistician’s way to ensure as far as possible that observed discordances may be blamed on discrepancies between the null and what’s actually bringing about the data. We want to solve the classic Duhemian problems of falsification.
As Cox puts it, we want a test statistic with a distribution that is split off from the unknown nuisance parameters, which we can abbreviate as λ. The full parameter space Θ is partitioned into components $\Theta = (\psi, \lambda)$, such that the null hypothesis is that $\psi = \psi_0$, with λ an unknown nuisance parameter. Interest may focus on alternatives $\psi > \psi_0$. We do have information in the data about the unknown parameters, and the natural move is to estimate them using the data. The twin goals of computing the P-value, $\Pr(d > d_0; H_0)$, free of unknowns, and constructing tests that are appropriately sensitive, produce the same tests entailed by N-P theory, namely replacing the nuisance parameter by a sufficient statistic V. To say that V is a sufficient statistic for the nuisance parameter λ means that the probability of d(X) conditional on V does not depend on λ; under the null it depends only on the value $\psi_0$ of the parameter of interest. So we are back to the simple situation with a null having just a single parameter ψ. This “largely determines the appropriate test statistic by the requirement of producing the most sensitive test possible with the data at hand” (Cox and Mayo 2010, p. 292). Cox calls this “conditioning for separation from nuisance parameters” (ibid.). I draw from Cox and Mayo (2010).
In the most familiar class of cases, this strategy for constructing appropriately sensitive or powerful tests, separate from nuisance parameters, produces the same tests as N-P theory. In fact, when statistic V is a special kind of sufficient statistic for nuisance parameter λ (called complete), there is no other way of achieving the N-P goal of an exactly α-level test that is fixed regardless of nuisance parameters – these are called similar tests. 5 Thus, replacing the nuisance parameter with a sufficient statistic “may be regarded as an outgrowth of the aim of calculating the relevant P-value independent of unknowns, or alternatively, as a byproduct of seeking to obtain most powerful similar tests.” These dual ways of generating tests reveal the underpinnings of a substantial part of standard, elementary statistical methods, including key problems about Binomial, Poisson, and Normal distributions, the method of least squares, and linear models. 6 (ibid., p. 293)
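Conditioning on a sufficient statistic to remove a nuisance parameter can be illustrated with a standard textbook case. The following Python sketch rests on an assumed setup that is not taken from the text: two independent counts, X1 distributed as Poisson(λ) and X2 as Poisson(ψλ), with ψ the parameter of interest and λ the nuisance parameter. For fixed ψ the total V = X1 + X2 is sufficient for λ, and conditional on V = v, X2 is Binomial(v, ψ/(1 + ψ)), which does not involve λ. The function name and the example counts are hypothetical.

```python
from math import comb


def conditional_p_value(x1: int, x2: int, psi0: float = 1.0) -> float:
    """P-value for H0: psi = psi0 against psi > psi0, conditional on V = x1 + x2.

    Assumed model: X1 ~ Poisson(lam), X2 ~ Poisson(psi * lam), independent.
    Given V = v, X2 ~ Binomial(v, psi / (1 + psi)), free of the nuisance lam.
    """
    v = x1 + x2
    p = psi0 / (1.0 + psi0)
    # Pr(X2 >= x2 | V = v; H0): large x2 indicates discrepancy toward psi > psi0.
    return sum(comb(v, k) * p**k * (1 - p) ** (v - k) for k in range(x2, v + 1))


if __name__ == "__main__":
    # Hypothetical counts: 3 events in the first series, 9 in the second.
    print(conditional_p_value(3, 9))
```

No estimate of λ enters the calculation; the P-value is computed from the conditional Binomial distribution alone, which is what makes the test similar in the sense described above.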
If you begin from the “three steps” in test generation described by E. Pearson in the opening to Section 3.2, rather than the later N-P–Wald approach, they’re already starting from the same point. The only difference is in making the alternative explicit. Fisher (1934b) made the connection to the N-P (1933) result on uniformly most powerful tests:
… where a sufficient statistic exists, the likelihood, apart from a factor independent of the parameter to be estimated, is a function only of the parameter and the sufficient statistic, explains the principal result obtained by Neyman and Pearson in discussing the efficacy of tests of significance. Neyman and Pearson introduce the notion that any chosen test of a hypothesis $H_0$ is more powerful than any other equivalent test, with regard to an alternative hypothesis $H_1$, when it rejects $H_0$ in a set of samples having an assigned aggregate frequency ε when $H_0$ is true, and the greatest possible aggregate frequency when $H_1$ is true…
(pp. 294–5)
It is inevitable, therefore, that if such a statistic exists it should uniquely define the contours best suited to discriminate among hypotheses differing only in respect of this parameter; … When tests are considered only in relation to sets of hypotheses specified by one or more variable parameters, the efficacy of the tests can be treated directly as the problem of estimation of these parameters. Regard for what has been established in that theory, apart from the light it throws on the results already obtained by their own interesting line of approach, should also aid in treating the difficulties inherent in cases in which no sufficient statistics exists.
(ibid., p. 296)
This article may be seen to mark the point after which Fisher’s attitude changes because of the dust-up with Neyman.
Neyman and Pearson Come to Fisher’s Rescue
Neyman and Pearson entered the fray on Fisher’s side as against the old guard (led by K. Pearson) regarding the key point of contention: showing statistical inference is possible without the sin of “inverse inference.” Fisher denounced the principle of indifference: “We do not know the function … specifying the super-population, but in view of our ignorance of the actual values of θ we may” take it that all values are equally probable (Fisher 1930, p. 531). “[B]ut however we might disguise it, the choice of this particular a priori distribution for the θ is just as arbitrary as any other…” (ibid.).
If, then, we follow writers like Boole, Venn, … in rejecting the inverse argument as devoid of foundation and incapable even of consistent application, how are we to avoid the staggering falsity of saying that however extensive our knowledge of the values of x … we know nothing and can know nothing about the values of θ ?
(ibid.)
When Fisher gave his paper in December 1934 (“The Logic of Inductive Inference”), the old guard were ready with talons drawn to attack his ideas, which challenged the overall philosophy of statistics they embraced. The opening thanks (by Arthur Bowley), which is typically a flowery, flattering affair, was couched in scathing, sarcastic terms (see Fisher 1935b, pp. 55–7). To Fisher’s support came Egon Pearson and Jerzy Neyman. Neyman dismissed “Bowley’s reaction to Fisher’s critical review of the traditional view of statistics as an understandable attachment to old ideas (1935, p. 73)” (Spanos 2008b, p. 16). Fisher agreed: “However true it may be that Professor Bowley is left very much where he was, the quotations show at least that Dr. Neyman and myself have not been left in his company” (1935a, p. 77).
So What Happened in 1935?
A pivotal event was a paper Neyman gave in which he suggested a different way of analyzing one of Fisher’s experimental designs. Then there was a meet-up in the hallway a few months later. Fisher stops by Neyman’s office at University College, on his way to a meeting which was to decide on Neyman’s reappointment in 1935:
And he said to me that he and I are in the same building … That, as I know, he has published a book – and that’s Statistical Methods for Research Workers – and he is upstairs from me so he knows something about my lectures – that from time to time I mention his ideas, this and that – and that this would be quite appropriate if I were not here in the College but, say, in California … but if I am going to be at University College, then this is not acceptable to him. And then I said, ‘Do you mean that if I am here, I should just lecture using your book?’ And then he gave an affirmative answer. … And I said, ‘Sorry, no. I cannot promise that.’ And then he said, ‘Well, if so, then from now on I shall oppose you in all my capacities.’ And then he enumerated – member of the Royal Society and so forth. There were quite a few. Then he left. Banged the door.