(Neyman in C. Reid 1998, p. 126)
Imagine if Neyman had replied: ‘I’d be very pleased to use Statistical Methods for Research Workers in my class.’ Or what if Fisher had said: ‘Of course you’ll want to use your own notes in your class, but I hope you will use a portion of my text when mentioning some of its key ideas.’ Never mind. That was it. Fisher went on to a meeting wherein he attempted to get others to refuse Neyman a permanent position, but was unsuccessful. It wasn’t just Fisher who seemed to need some anger management training, by the way. Erich Lehmann (in conversation and in 2011) points to a number of incidents wherein Neyman is the instigator of gratuitous ill-will. I find it hard to believe, however, that Fisher would have thrown Neyman’s wooden models onto the floor.
One evening, late that spring, Neyman and Pearson returned to their department after dinner to do some work. Entering they were startled to find strewn on the floor the wooden models which Neyman had used to illustrate his talk … Both Neyman and Pearson always believed that the models were removed by Fisher in a fit of anger.
(C. Reid 1998, p. 124, noted in Lehmann 2011, p. 59)
Neyman left soon after to start the program at Berkeley (1939), and Fisher didn’t remain long either, moving in 1943 to Cambridge and retiring in 1957 to Adelaide. I’ve already been disabusing you of the origins of the popular Fisher–N-P conflict (Souvenir L). In fact, it really only made an appearance long after the 1933 paper!
1955–6 Triad: Telling What’s True About the Fisher–Neyman Conflict
If you want to get an idea of what transpired in the ensuing years, look at Fisher’s charges and Neyman’s and Pearson’s responses 20 years later. This forms our triad: Fisher (1955), Pearson (1955), and Neyman (1956). Even at the height of mudslinging, Fisher said, “There is no difference to matter in the field of mathematical analysis … but in logical point of view” (1955, p. 70).
I owe to Professor Barnard … the penetrating observation that this difference in point of view originated when Neyman, thinking he was correcting and improving my own early work on tests of significance as a means to the ‘improvement of natural knowledge,’ in fact reinterpreted them in terms of that technological and commercial apparatus which is known as an acceptance procedure. … Russians are made familiar with the ideal that research in pure science can and should be geared to technological performance.
(ibid., pp. 69–70)
Pearson’s (1955) response: “To dispel the picture of the Russian technological bogey, I might recall how certain early ideas came into my head as I sat on a gate overlooking an experimental blackcurrant plot … !” (Pearson 1955, p. 204). He was “smitten” by an absence of logical justification for some of Fisher’s tests, and turned to Neyman to help him solve the problem. This takes us to where we began with our miserable passages, leading them to pin down the required character for the test statistic, the need for the alternative, and power considerations.
Until you disinter the underlying source of the problem – fiducial inference – the “he said/he said” appears to be all about something that it’s not. The reason Neyman adopts a performance formulation, Fisher (1955) charges, is that he denies the soundness of fiducial inference. Fisher thinks Neyman is wrong because he “seems to claim that the statement (a) ‘μ has a probability of 5 per cent. of exceeding x̄’ is a different statement from (b) ‘x̄ has a probability of 5 per cent. of falling short of μ’” (p. 74, replacing θ and T with μ and x̄). There’s no problem about equating these two so long as x̄ is a random variable. But watch what happens in the next sentence. According to Fisher, Neyman violates
… the principles of deductive logic [by accepting a] general symbolical statement such as
[1] Pr{(x̄ − ts) < μ < (x̄ + ts)} = 95 per cent.,
as rigorously demonstrated, and yet, when numerical values are available for the statistics x̄ and s, so that on substitution of these and use of the 5 per cent. value of t, the statement would read
[2] Pr{92.99 < μ < 93.01} = 95 per cent.,
to deny to this numerical statement any validity. This evidently is to deny the syllogistic process.
(Fisher 1955, p. 75, in Neyman 1956, p. 291)
But the move from (1) to (2) is fallacious! Is Fisher committing this fallacious probabilistic instantiation (and still defending it in 1955)? I. J. Good describes how many felt, and still feel:
It seems almost inconceivable that Fisher should have made the error which he did in fact make. [That is why] … so many people assumed for so long that the argument was correct. They lacked the daring to question it.
(Good 1971a, p. 138)
Neyman (1956) declares himself at his wit’s end in trying to convince Fisher of the inconsistencies in moving from (1) to (2). “Thus if X is a normal random variable with mean zero and an arbitrary variance greater than zero, then I expect” we may agree that Pr(X < 0) = 0.5 (ibid., p. 292). But observing, say, X = 1.7 yields Pr(1.7 < 0) = 0.5, which is clearly illicit. “It is doubtful whether the chaos and confusion now reigning in the field of fiducial argument were ever equaled in any other doctrine. The source of this confusion is the lack of realization that equation (1) does not imply (2)” (ibid., p. 293). It took the more complex example of Bartlett to demonstrate the problem: “Bartlett’s revelation [1936, 1939] that the frequencies in repeated sampling [from the same or different populations] need not agree with Fisher’s solution” in the case of a difference between two Normal means with different variances, “brought about an avalanche of rebuttals by Fisher and by Yates” (ibid., p. 292).7 Some think it was only the collapse of Fisher’s rebuttals that led Fisher to castigate N-P for assuming error probabilities and fiducial probabilities ought to agree, and begin to declare the idea “foreign to the development of tests of significance.” As statistician Sandy Zabell (1992, p. 378) remarks, “such a statement is curiously inconsistent with Fisher’s own earlier work” as in Fisher’s (1934b) endorsement of UMP tests, and his initial attitude toward Neyman’s confidence intervals. Because of Fisher’s stubbornness “he engaged in a futile and unproductive battle with Neyman which had a largely destructive effect on the statistical profession” (ibid., p. 382).8
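To see the gap between (1) and (2) concretely, here is a minimal simulation sketch in Python. The values of μ, σ, and n are my own illustrative assumptions, chosen only so the realized interval resembles Fisher’s 92.99–93.01; nothing in the code comes from Fisher’s or Neyman’s own examples. Over repeated samples the random interval covers μ about 95% of the time, which is statement (1); a particular realized interval, with numbers substituted, either contains μ or it does not, which is why Neyman calls the move to (2) illicit.

```python
import numpy as np
from scipy import stats

# Illustrative values (assumptions for this sketch, not Fisher's actual data)
rng = np.random.default_rng(0)
mu_true, sigma, n, reps = 93.0, 0.05, 25, 10_000
t_crit = stats.t.ppf(0.975, df=n - 1)

# Statement (1): over repeated samples, the random interval
# (x_bar - t*s/sqrt(n), x_bar + t*s/sqrt(n)) covers mu about 95% of the time.
covered = 0
for _ in range(reps):
    x = rng.normal(mu_true, sigma, n)
    x_bar, s = x.mean(), x.std(ddof=1)
    half = t_crit * s / np.sqrt(n)
    covered += (x_bar - half < mu_true < x_bar + half)
print(f"coverage over {reps} samples: {covered / reps:.3f}")  # close to 0.95

# Statement (2): once numbers are substituted, e.g. Pr{92.99 < mu < 93.01},
# the realized interval either contains mu or it does not; it is a 0/1 fact.
# Assigning it probability 0.95 is the instantiation Neyman rejects.
print("realized interval (92.99, 93.01) contains mu_true:", 92.99 < mu_true < 93.01)
```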
Fisher (1955) is spot on about one thing: When “Neyman denies the existence of inductive reasoning, he is merely expressing a verbal preference. For him ‘reasoning’ means what ‘deductive reasoning’ means to others” (p. 74). Nothing earth-shaking turns on the choice to dub every inference “an act of making an inference.” Neyman, much like Popper, had a good reason for drawing a bright red line between the use of probability (for corroboration or probativeness) and the probabilists’ use of confirmation: Fisher was blurring them.
… the early term I introduced to designate the process of adjusting our actions to observations is ‘inductive behavior’. It was meant to contrast with the term ‘inductive reasoning’ which R. A. Fisher used in connection with his ‘new measure of confidence or diffidence’ represented by the likelihood function and with ‘fiducial argument’. Both these concepts or principles are foreign to me.
(Neyman 1977, p. 100)
The Fisher–Neyman dispute is pathological: there’s no disinterring the truth of the matter. Perhaps Fisher altered his position out of professional antagonisms toward the new optimality revolution. Fisher’s stubbornness on fiducial intervals seems to lead Neyman to amplify the performance construal louder and louder; whereas Fisher grew to renounce performance goals he himself had held when it was found that fiducial solutions disagreed with them. Perhaps inability to identify conditions wherein the error probabilities “rubbed off” – where there are no “recognizable subsets” with a different probability of success – led Fisher to move to a type of default Bayesian stance. That Neyman (with the contributions of Wald, and later Robbins) might have gone overboard in his behaviorism, to the extent that even Egon wanted to divorce him – ending his 1955 reply to Fisher with the claim that “inductive behavior” was Neyman’s field, not his – is a different matter. Ironically, Pearson shared Neyman’s antipathy to “inferential theory” as Neyman (1962) defines it in the following:
In the present paper … the term ‘inferential theory’ … will be used to describe the attempts to solve the Bayes’ problem with a reference to confidence, beliefs, etc., through some supplementation … either a substitute a priori distribution [exemplified by the so-called principle of insufficient reason] or a new measure of uncertainty [such as Fisher’s fiducial probability].
(p. 16)
Fisher may have started out seeing fiducial probability as both a frequency of correct claims in an aggregate, and a rational degree of belief (1930, p. 532), but the difficulties in satisfying uniqueness led him to give up the former. Fisher always showed inductive logic leanings, seeking a single rational belief assignment. N-P were allergic to the idea. In the N-P philosophy, if there is a difference in problems or questions asked, we expect differences in which solutions are warranted. This is in sync with the view of the severe tester. In this sense, she is closer to Fisher, who viewed the posterior distribution as an answer to a different problem from that of the fiducial limits, where we expect the sample to change (Fisher 1930, p. 535).
Bridges to Fiducial Island: Neymanian Interpretation of Fiducial Inference?
For a long time Fiducial Island really was an island, with work on it side-stepped. A notable exception is Donald Fraser. Fraser will have no truck with those who dismiss fiducial inference as Fisher’s “biggest blunder.” “What? We still have to do a little bit of thinking! Tough!” (Fraser 2011, p. 330). Now, however, bridges are being built, despite minefields. Numerous programs are developing confidence distributions (CDs), and the impenetrable thickets are being penetrated. The word “fiducial” is even bandied about in these circles.9 Singh, Xie, and Strawderman (2007) say, “a CD is in fact Neymanian interpretation of Fisher’s fiducial distribution” (p. 132).
“[A]ny approach that can build confidence intervals for all levels, regardless of whether they are exact or asymptotically justified, can potentially be unified under the confidence distribution framework” (Xie and Singh 2013, p. 5). Moreover, “as a frequentist procedure, the CD-based method can bypass [the] difficult task of jointly modelling [nuisance parameters] and focus directly on the parameter of interest” (p. 28). This turns on what we’ve been calling the piecemeal nature of error statistics. “The idea that statistical problems do not have to be solved as one coherent whole is anathema to Bayesians but is liberating for frequentists” (Wasserman 2007, p. 261).
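As a rough picture of what a confidence distribution packages together, the sketch below treats the simplest case, a Normal mean with the Student’s t pivot. The sample values are hypothetical, and the code is only my illustration of the general idea, not the CD authors’ own implementation: a single function on the parameter space from which central intervals at every confidence level can be read off.

```python
import numpy as np
from scipy import stats

# A confidence distribution (CD) for a Normal mean via the t pivot:
# H_n(mu) = F_t((mu - x_bar) / (s / sqrt(n))), a distribution function on the
# parameter space that yields confidence intervals at every level.
# (Illustrative sketch only; the sample below is assumed, not from the cited papers.)
x = np.array([92.98, 93.01, 92.99, 93.02, 93.00])  # hypothetical sample
n, x_bar, s = len(x), x.mean(), x.std(ddof=1)

def cd(mu):
    """Confidence distribution evaluated at the parameter value mu."""
    return stats.t.cdf((mu - x_bar) / (s / np.sqrt(n)), df=n - 1)

def ci(level):
    """Central confidence interval of the given level, read from the CD."""
    alpha = (1 - level) / 2
    q = stats.t.ppf([alpha, 1 - alpha], df=n - 1)
    return x_bar + q * s / np.sqrt(n)

print("CD at the sample mean:", cd(x_bar))  # 0.5 by construction
print("95% interval:", ci(0.95))
print("90% interval:", ci(0.90))
```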
I’m not in a position to evaluate these new methods, or whether they lend themselves to a severity interpretation. The CD program does at least seem to show the wide landscape for which the necessary mathematical computations are attainable. While CDs do not supply the uniqueness that Fisher sought, given that a severity assessment is always relative to the question or problem of interest, this is no drawback. Nancy Reid claims the literature on the new frequentist–fiducial “fusions” isn’t yet clear on matters of interpretation.10 What is clear is that the frequentist paradigm is undergoing the “historical process of development … which is and will always go on” of which Pearson spoke (1962, p. 394).
Back to the ship!
1 The pagination is from the Selected Works of E.L. Lehmann (2012).
2 To get a fiducial distribution, the case has to be continuous.
3 It’s correct that ( ) iff ( ).
4 “[C]onsider that variation of the unknown parameter, μ, generates a continuum of hypotheses each of which might be regarded as a null hypothesis … [T]he data of the experiment, and the test of significance based upon them, have divided the continuum into two portions.” One, a region in which μ lies between the fixed fiducial limits, “is accepted by the test of significance, in the sense that values of μ within this region are not contradicted by the data at the level of significance chosen. The remainder … is rejected” (Fisher 1935a, p. 192).
5 The goal of exactly similar tests leads to tests that ensure
Pr(d(X) is significant at level α | v; H0) = α,
where v is the value of the statistic V used to estimate the nuisance parameter. A good summary may be found in Lehmann (1981).
6 Requiring exactly similar rejection regions “precludes tests that merely satisfy the weaker requirement of being able to calculate P approximately, with only minimal dependence on nuisance parameters,” which could be preferable especially when best tests are absent (ibid.).
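As a concrete instance of the similarity requirement in notes 5 and 6, the one-sample t test has size α whatever the value of the nuisance parameter σ. The Monte Carlo sketch below (sample size, σ values, and α are my own illustrative assumptions) checks this unconditional consequence of similarity; it does not display the conditional form written in note 5.

```python
import numpy as np
from scipy import stats

# The one-sample t test is a similar test: its rejection probability under
# H0: mu = mu0 equals alpha for every value of the nuisance parameter sigma.
# (Monte Carlo sketch with assumed sigma values and sample size, for illustration.)
rng = np.random.default_rng(1)
mu0, n, alpha, reps = 0.0, 20, 0.05, 50_000
t_crit = stats.t.ppf(1 - alpha, df=n - 1)  # one-sided test of H0 against mu > mu0

for sigma in (0.1, 1.0, 10.0):
    x = rng.normal(mu0, sigma, size=(reps, n))
    t_stat = (x.mean(axis=1) - mu0) / (x.std(axis=1, ddof=1) / np.sqrt(n))
    print(f"sigma={sigma:>5}: rejection rate = {np.mean(t_stat > t_crit):.3f}")  # ~0.05 each
```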
7 In that case, “the test rejects a smaller proportion of such repeated samples than the proportion specified by the level of significance” (Fisher 1939a, p. 173). Prakash Gorroochurn (2016) has a masterful historical discussion.
8 Buehler and Feddersen (1963) showed there were recognizable subsets even for the t test.
9 Efron predicts “that the old Fisher will have a very good 21st century. The world of applied statistics seems to need an effective compromise between Bayesian and frequentist ideas” (Efron 1998, p. 112).
10 The 4th Bayesian, Fiducial and Frequentist workshop (BFF4), May 2017. Other examples are Fraser and Reid (2002), Hannig (2009), Martin and Liu (2013), Schweder and Hjort (2016).
Excursion 6
(Probabilist) Foundations Lost, (Probative) Foundations Found
Itinerary
Tour I What Ever Happened to Bayesian Foundations?
6.1 Bayesian Ways: From Classical to Default
6.2 What are Bayesian Priors? A Gallimaufry
6.3 Unification or Schizophrenia: Bayesian Family Feuds
6.4 What Happened to Updating by Bayes’ Rule?
Tour II Pragmatic and Error Statistical Bayesians
6.5 Pragmatic Bayesians
6.6 Error Statistical Bayesians: Falsificationist Bayesians
6.7 Farewell Keepsake
Tour I
What Ever Happened to Bayesian Foundations?
By and large, Statistics is a prosperous and happy country, but it is not a completely peaceful one. Two contending philosophical parties, the Bayesians and the frequentists, have been vying for supremacy over the past two-and-a-half centuries. … Unlike most philosophical arguments, this one has important practical consequences. The two philosophies represent competing visions of how science progresses.
(Efron 2013, p. 130; emphasis added)
Surveying the statistical landscape from a hot-air balloon this morning, a bird’s-flight view of the past 100 of those 250 years unfolds before us. Except for the occasional whooshing sound of the balloon burner, it’s quiet enough to actually hear some of the warring statistical tribes as well as peace offerings and reconciliations – at least with a special sound amplifier they’ve supplied. It’s today’s perspective I mainly want to show you from here. Arrayed before us is a most impressive smorgasbord of technical methods, as statistics expands over increasing territory. Many professional statisticians are eclecticists; foundational discussions are often in very much of a unificationist spirit. If you observe the territories undergoing recent statistical crises, you can see pockets, growing in number over the past decade or two, who are engaged in refighting old battles. Unsurprisingly, the methods most often used are the ones most often blamed for abuses. Perverse incentives, we hear, led to backsliding, to slothful, cronyist uses of significance tests that have been deplored for donkey’s years. Big Data may have foisted statistics upon fields unfamiliar with the pitfalls stemming from vast numbers of correlations and multiple testing. A pow-wow of leading statisticians from different tribes was called by the American Statistical Association in 2015. We’ve seen the ASA 2016 Guide on how not to use P-values, but some of the “other approaches” also call for scrutiny:
In view of the prevalent misuses of and misconceptions concerning p-values, some statisticians prefer to supplement or even replace p-values with other approaches. … confidence, credibility, or prediction intervals; Bayesian methods; … likelihood ratios or Bayes Factors; and other approaches such as decision-theoretic modeling and false discovery rates.
(Wasserstein and Lazar 2016, p. 132)
Suppose you’re appraising a recommendation that frequentist methods should or can be replaced by a Bayesian method. Your first question should be: Which type of Bayesian interpretation? The choices are basically three: subjectivist, default, or frequentist. The problem isn’t just choosing amongst them but trying to pin down the multiple meanings being given to each! Classical subjective Bayesianism is home to a full-bodied statistical philosophy, but the most popular Bayesians live among rival tribes who favor one or another default or non-subjective prior probability. These are conventions chosen to ensure the data dominate the inference in some sense. By and large, these tribes do not see the growth of Bayesian methods as support for the classical subjective Bayesian philosophy, but rather as a set of technical tools that “work.” Their leaders herald frequentist–Bayesian unifications as the way to serve multiple Gods. Zeus is throwing a thunderbolt!
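To give a feel for what a “default” prior is, the toy sketch below contrasts a conventional Jeffreys prior with a strongly opinionated subjective prior for a binomial proportion. The data and both priors are my own illustrative choices, not examples drawn from any of the tribes discussed in this Tour.

```python
from scipy import stats

# Toy contrast between a default (Jeffreys) prior and a subjective prior for a
# binomial proportion theta, given assumed data: 7 successes in 20 trials.
# With a Beta(a, b) prior, the posterior is Beta(a + successes, b + failures).
successes, failures = 7, 13

priors = {
    "Jeffreys (default) Beta(0.5, 0.5)": (0.5, 0.5),
    "Subjective Beta(30, 10), strong prior belief theta is near 0.75": (30, 10),
}

for name, (a, b) in priors.items():
    post = stats.beta(a + successes, b + failures)
    lo, hi = post.ppf([0.025, 0.975])
    print(f"{name}: posterior mean {post.mean():.2f}, "
          f"95% credible interval ({lo:.2f}, {hi:.2f})")
# The default prior lets the data dominate; the subjective prior pulls the
# posterior toward the prior belief. That is the sense in which the conventions differ.
```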