Clearly, Barnard took Fisher’s side in the N-P vs. Fisher disputes; he wanted me to know he was the one responsible for telling Fisher that Neyman had converted “his” significance tests into tools for acceptance sampling, where only long-run performance matters (Pearson 1955 affirms this). Pearson was kept out of it. The set of hypothetical repetitions used in obtaining the relevant error probability, in Barnard’s view, should consist of “results of reasonably similar precision” (1971, p. 300). This is a very interesting idea, and it will come up again.
Big Picture Inference: Can Other Hypotheses Explain the Observed Deflection?
Even to the extent that they had found a deflection effect, it would have been fallacious to infer the effect “attributable to the sun’s gravitational field.” The question (ii) must be tackled: A statistical effect is not a substantive effect. Addressing the causal attribution demands the use of the eclipse data as well as considerable background information. Here we’re in the land of “big picture” inference: the inference is “given everything we know.” In this sense, the observed effect is used and is “non-novel” (in the use-novel sense). Once the deflection effect was known, imprecise as it was, it had to be used. Deliberately seeking a way to explain the eclipse effect while saving Newton’s Law of Gravity from falsification isn’t the slightest bit pejorative – so long as each conjecture is subject to severe test. Were any other cause to exist that produced a considerable fraction of the deflection effect, that alone would falsify the Einstein hypothesis (which asserts that all of the 1.75″ are due to gravity) (Jeffreys 1919, p. 138). That was part of the riskiness of the GTR prediction.
It’ s Not How Plausible, but How Well Probed
One famous case was that of Sir Oliver Lodge and his proposed “ether effect.” Lodge was personally invested in the Newtonian ether, as he believed it was through the ether that he was able to contact departed souls, in particular his son, Raymond. Lodge had “preregistered” in advance that if the eclipse results showed the Einstein deflection he would find a way to give a Newtonian explanation (Lodge 1919). Others, without a paranormal bent, felt a similar allegiance to Newton. “We owe it to that great man to proceed very carefully in modifying or retouching his Law of Gravitation” (Silberstein 1919, p. 397). But respect for Newton was kept out of the data analysis. They were free to try and try again with Newton-saving factors because, unlike in pejorative seeking, it would be extremely difficult for any such factor to pass if false – given the standards available and insisted on by the relevant community of scientists. Each Newton-saving hypothesis collapsed on the basis of a one-two punch: the magnitude of effect that could have been due to the conjectured factor is far too small to account for the eclipse effect; and were it large enough to account for the eclipse effect, it would have blatantly false or contradictory implications elsewhere. Could the refraction of the sun’s corona be responsible (as one scientist proposed)? Were it sufficient to explain the deflection, then comets would explode when they pass near the sun, which they do not! Or take another of Lodge’s ether modification hypotheses. As scientist Lindemann put it:
Sir Oliver Lodge has suggested that the deflection of light might be explained by assuming a change in the effective dielectric constant near a gravitating body. … It sounds quite promising at first … The difficulty is that one has in each case to adopt a different constant in the law, giving the dielectric constant as a function of the gravitational field, unless some other effect intervenes.
(1919, p. 114)
This would be a highly insevere way to retain Newton. These criticisms combine quantitative and qualitative severity arguments. We don’t need a precise quantitative measure of how frequently we’d be wrong with such ad hoc finagling. The Newton-saving factors might have been plausible but they were unable to pass severe tests. Saving Newton this way would be bad science.
As is required under our demarcation (Section 2.3): the 1919 players were able to embark upon an inquiry to pinpoint the source for the Newton anomaly. By 1921, it was recognized that the deflection effect was real, though inaccurately measured. Further, the effects revealed (corona effect, shadow effect, lens effect) were themselves used to advance the program of experimental testing of GTR. For instance, learning about the effect of the sun’s corona (corona effect) not only vouchsafed the eclipse result, but pointed to an effect that could not be ignored in dealing with radioastronomy. Time and space prevent going further, but I highly recommend you return at a later time. For discussion and references, see Mayo (1996, 2010a, e).
The result of all the analysis was merely evidence of a small piece of GTR: an Einstein-like deflection effect. The GTR “passed” the test, but clearly they couldn’t infer GTR severely. Even now, only its severely tested parts are accepted, at least for probing relativistic gravity. John Earman, in criticism of me, observes:
[W]hen high-level theoretical hypotheses are at issue, we are rarely in a position to justify a judgment to the effect that Pr(E | ~H & K) ≪ 0.5. If we take H to be Einstein’s general theory of relativity and E to be the outcome of the eclipse test, then in 1918 and 1919 physicists were in no position to be confident that the vast and then unexplored space of possible gravitational theories denoted by ~GTR does not contain alternatives to GTR that yield that same prediction for the bending of light as GTR.
(Earman 1992, p. 117)
A similar charge is echoed by Laudan (1997), Chalmers (2010), and Musgrave (2010). For the severe tester, being prohibited from regarding GTR as having passed severely – especially in 1918 and 1919 – is just what an account ought to do. (Do you see how this relates to our treatment of irrelevant conjunctions in Section 2.2?)
From the first exciting results to around 1960, GTR lay in the doldrums. This is called the period of hibernation or stagnation. Saying it remained uncorroborated or inseverely tested does not mean GTR was deemed scarcely true, improbable, or implausible. It hadn’t failed tests, but there were too few link-ups between the highly mathematical GTR and experimental data. Uncorroborated is very different from disconfirmed. We need a standpoint that lets us express being at that stage in a problem, and viewing inference as severe testing gives us one. Soon after, things would change, leading to the Renaissance from 1960 to 1980. We’ll pick this up at the end of Sections 3.2 and 3.3. To segue into statistical tests, here’s a souvenir.
Souvenir I: So What Is a Statistical Test, Really?
So what’s in a statistical test? First there is a question or problem, a piece of which is to be considered statistically, either because of a planned experimental design, or by embedding it in a formal statistical model. There are (A) hypotheses, and a set of possible outcomes or data; (B) a measure of accordance or discordance, fit, or misfit, d(X), between possible answers (hypotheses) and data; and (C) an appraisal of a relevant distribution associated with d(X). Since we want to tell what’s true about tests now in existence, we need an apparatus to capture them, while also offering latitude to diverge from their straight and narrow paths.
(A) Hypotheses. A statistical hypothesis Hᵢ is generally couched in terms of an unknown parameter θ. It is a claim about some aspect of the process that might have generated the data, x₀ = (x₁, …, xₙ), given in a model of that process. Statistical hypotheses assign probabilities to various outcomes x “computed under the supposition that Hᵢ is correct (about the generating mechanism).” That is how to read f(x; Hᵢ), or as I often write it: Pr(x; Hᵢ). This is just an analytic claim about the assignment of probabilities to x stipulated in Hᵢ.
In the GTR example, we consider n IID Normal random variables: (X₁, …, Xₙ) that are N(μ, σ²). Nowadays, the GTR value for λ = μ is set at 1, and the test might be of H₀: μ ≤ 1 vs. H₁: μ > 1. The hypothesis of interest will typically be a claim C posed after the data, identified within the predesignated parameter spaces.
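To make (A) concrete, here is a minimal Python sketch of how such a one-sided test of H₀: μ ≤ 1 vs. H₁: μ > 1 might be set up for IID Normal data. The data values, sample size, and assumed-known σ are hypothetical illustrations, not figures from the eclipse analyses.

```python
# Minimal sketch (hypothetical numbers): one-sided test H0: mu <= 1 vs. H1: mu > 1
# for n IID Normal observations with an assumed-known sigma.
import numpy as np
from scipy import stats

mu_0 = 1.0                                     # hypothesized (GTR) parameter value
sigma = 0.3                                    # assumed-known standard deviation (hypothetical)
x = np.array([1.05, 1.12, 0.98, 1.20, 1.08])   # hypothetical measurements
n = len(x)

# Distance of the sample mean from mu_0, in standard error units
d_obs = (x.mean() - mu_0) / (sigma / np.sqrt(n))
p_value = 1 - stats.norm.cdf(d_obs)            # Pr(d(X) >= d(x0); H0)
print(d_obs, p_value)
```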
(B) Distance function and its distribution. A function of the sample d(X), the test statistic, reflects how well or poorly the data (X = x₀) accord with the hypothesis H₀, which serves as a reference point. The term “test statistic” is generally reserved for statistics whose distribution can be computed under the main or test hypothesis. If we just want to speak of a statistic measuring distance, we’ll call it that.
It is the observed distance d(x₀) that is described as “significantly different” from the null hypothesis H₀. I use x to say something general about the data, whereas x₀ refers to a fixed data set.
(C) Test rule T. Some interpretative move or methodological rule is required for an account of inference. One such rule might be to infer that x is evidence of a discrepancy δ from H₀ just when d(x) ≥ c, for some value of c. Thanks to the requirement in (B), we can calculate the probability that {d(X) ≥ c} under the assumption that H₀ is true. We want also to compute it under various discrepancies from H₀, whether or not there’s an explicit specification of H₁. Therefore, we can calculate the probability of inferring evidence for discrepancies from H₀ when in fact the interpretation would be erroneous. Such an error probability is given by the probability distribution of d(X) – its sampling distribution – computed under one or another hypothesis.
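Continuing the hypothetical Normal setup sketched above, the calculation described in (C) might look as follows; the cutoff c and discrepancy δ below are chosen purely for illustration.

```python
# Sketch (hypothetical numbers): error probabilities from the sampling distribution
# of d(X) = (Xbar - mu_0)/(sigma/sqrt(n)), under H0 and under a discrepancy delta.
import numpy as np
from scipy import stats

n, sigma, c = 5, 0.3, 1.96                     # hypothetical sample size, sigma, cutoff
se = sigma / np.sqrt(n)

# Probability of inferring a discrepancy when H0 is true: Pr(d(X) >= c; mu = mu_0)
alpha = 1 - stats.norm.cdf(c)

# Probability of exceeding c when mu exceeds mu_0 by delta: d(X) ~ N(delta/se, 1)
delta = 0.2
prob_detect = 1 - stats.norm.cdf(c - delta / se)
print(alpha, prob_detect)
```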
To develop an account adequate for solving foundational problems, special stipulations and even reinterpretations of standard notions may be required. (D) and (E) reflect some of these.
(D) A key role of the distribution of d(X) will be to characterize the probative abilities of the inferential rule for the task of unearthing flaws and misinterpretations of data. In this way, error probabilities can be used to assess the severity associated with various inferences. We are able to consider outputs outside the N-P and Fisherian schools, including “report a Bayes ratio” or “infer a posterior probability,” by leaving our measure of agreement or disagreement open. We can then try to compute an associated error probability and severity measure for these other accounts.
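The severity assessment itself is developed later in the book; the following is only my own illustrative sketch of how an error probability might be converted into a severity-type assessment for the claim μ > μ₁ in the hypothetical Normal setup used above. The formula here is an assumption for illustration, not the text’s definition.

```python
# Hedged sketch (my illustration, hypothetical numbers): gauge how severely the
# claim mu > mu_1 has been probed by asking how probable a smaller d(X) than the
# one observed would be, were mu only mu_1.
import numpy as np
from scipy import stats

n, sigma, mu_0 = 5, 0.3, 1.0                   # hypothetical test specification
se = sigma / np.sqrt(n)
x_bar = 1.25                                   # hypothetical observed sample mean
d_obs = (x_bar - mu_0) / se

mu_1 = 1.1                                     # discrepancy of interest
sev = stats.norm.cdf(d_obs - (mu_1 - mu_0) / se)   # Pr(d(X) <= d(x0); mu = mu_1)
print(sev)
```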
(E) Empirical background assumptions. Quite a lot of background knowledge goes into implementing these computations and interpretations. They are guided by the goal of assessing severity for the primary inference or problem, housed in the manifold steps from planning the inquiry, to data generation and analyses.
We’ve arrived at the N-P gallery, where Egon Pearson (actually a hologram) is describing his and Neyman’s formulation of tests. Although obviously the museum does not show our new formulation, their apparatus is not so different.
3.2 N-P Tests: An Episode in Anglo-Polish Collaboration
We proceed by setting up a specific hypothesis to test, H₀ in Neyman’s and my terminology, the null hypothesis in R. A. Fisher’s … in choosing the test, we take into account alternatives to H₀ which we believe possible or at any rate consider it most important to be on the look out for … Three steps in constructing the test may be defined:
Step 1. We must first specify the set of results …
Step 2. We then divide this set by a system of ordered boundaries …
such that as we pass across one boundary and proceed to the next, we come to a class of results which makes us more and more inclined, on the information available, to reject the hypothesis tested in favour of alternatives which differ from it by increasing amounts.
Step 3. We then, if possible, associate with each contour level the chance that, if H₀ is true, a result will occur in random sampling lying beyond that level …
In our first papers [in 1928] we suggested that the likelihood ratio criterion, λ, was a very useful one … Thus Step 2 preceded Step 3. In later papers [1933–1938] we started with a fixed value for the chance, ε, of Step 3 … However, although the mathematical procedure may put Step 3 before 2, we cannot put this into operation before we have decided, under Step 2, on the guiding principle to be used in choosing the contour system. That is why I have numbered the steps in this order.
(Egon Pearson 1947, p. 173)
In addition to Pearson’s 1947 paper, the museum follows his account in “The Neyman–Pearson Story: 1926–34” (Pearson 1970). The subtitle is “Historical Sidelights on an Episode in Anglo-Polish Collaboration”!
We meet Jerzy Neyman at the point he’s sent to have his work sized up by Karl Pearson at University College in 1925/26. Neyman wasn’t that impressed:
Neyman found … [K.] Pearson himself surprisingly ignorant of modern mathematics. (The fact that Pearson did not understand the difference between independence and lack of correlation led to a misunderstanding that nearly terminated Neyman’s stay … )
(Lehmann 1988, p. 2)
Thus, instead of spending his second fellowship year in London, Neyman goes to Paris where his wife Olga (“Lola”) is pursuing a career in art, and where he could attend lectures in mathematics by Lebesgue and Borel. “[W]ere it not for Egon Pearson [whom I had briefly met while in London], I would have probably drifted to my earlier passion for [pure mathematics]” (Neyman quoted in Lehmann 1988, p. 3).
What pulled him back to statistics was Egon Pearson’s letter in 1926. E. Pearson had been “suddenly smitten” with doubt about the justification of tests then in use, and he needed someone with a stronger mathematical background to pursue his concerns. Neyman had just returned from his fellowship years to a hectic and difficult life in Warsaw, working multiple jobs in applied statistics.
[H]is financial situation was always precarious. The bright spot in this difficult period was his work with the younger Pearson. Trying to find a unifying, logical basis which would lead systematically to the various statistical tests that had been proposed by Student and Fisher was a ‘big problem’ of the kind for which he had hoped …
(ibid., p. 3)
N-P Tests: Putting Fisherian Tests on a Logical Footing
For the Fisherian simple or “pure” significance test, alternatives to the null “lurk in the undergrowth but are not explicitly formulated probabilistically” (Mayo and Cox 2006, p. 81). Still there are constraints on a Fisherian test statistic. Criteria for the test statistic d(X) are:
(i) it reduces the data as much as possible;
(ii) the larger d(x₀), the further the outcome from what’s expected under H₀, with respect to the particular question;
(iii) the P-value can be computed: p(x₀) = Pr(d(X) ≥ d(x₀); H₀).
Fisher, arch falsificationist, sought test statistics that would be sensitive to discrepancies from the null. Desiderata (i)–(iii) are related, as emerged clearly from N-P’s work.
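Criterion (iii) hinges on the distribution of d(X) being computable under H₀. As a minimal sketch, that distribution can also be approximated by simulation; the setup below reuses the hypothetical Normal example from Souvenir I, and the observed distance d(x₀) is likewise hypothetical.

```python
# Sketch (hypothetical numbers): approximate the sampling distribution of
# d(X) = (Xbar - mu_0)/(sigma/sqrt(n)) under H0 by simulation, then read off
# the P-value Pr(d(X) >= d(x0); H0) for a hypothetical observed distance.
import numpy as np

rng = np.random.default_rng(1)
n, mu_0, sigma = 5, 1.0, 0.3        # hypothetical test specification
d_obs = 1.8                         # hypothetical observed distance d(x0)

sims = rng.normal(mu_0, sigma, size=(100_000, n))        # draws under H0
d_sim = (sims.mean(axis=1) - mu_0) / (sigma / np.sqrt(n))
p_value = np.mean(d_sim >= d_obs)
print(p_value)
```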
Fisher introduced the idea of a parametric statistical model, which may be written Mθ(x). Karl Pearson and others had been prone to mixing up a parameter θ, say the mean of a population, with a sample mean. As a result, concepts that make sense for statistic X̅, like having a distribution, were willy-nilly placed on a fixed parameter θ. Neyman and Pearson [N-P] gave mathematical rigor to the components of Fisher’s tests and estimation. The model can be represented as a pair (S, Θ) where S denotes the set of all possible values of the sample X = (X₁, …, Xₙ) – one such value being the data x₀ = (x₁, …, xₙ) – and Θ denotes the set of all possible values of the unknown parameter(s) θ. In hypothesis testing, Θ is used as shorthand for the family of probability distributions or, in continuous cases, densities indexed by θ. Without the abbreviation, we’d write the full model as
Mθ(x) ≔ {f(x; θ), θ ∈ Θ},
where f(x; θ), for all x ∈ S, is the distribution (or density) of the sample. We don’t test all features of the model at once; it’s part of the test specification to indicate which features (parameters) of the model are under test. The generic form of null and alternative hypotheses is
H₀: θ ∈ Θ₀ vs. H₁: θ ∈ Θ₁,
where (Θ₀, Θ₁) constitute subsets of Θ that partition Θ. Together, Θ₀ and Θ₁ exhaust the parameter space. N-P called H₀ the test hypothesis, which is preferable to null hypothesis, since for them it’s on par with alternative H₁; but for brevity and familiarity, I mostly call H₀ the null. I follow A. Spanos’ treatment.
Lambda Criterion
What were Neyman and Pearson looking for in their joint work from 1928? They sought a criterion for choosing, as well as generating, sensible test statistics. Working purely on intuition, which they later imbued with a justification, N-P employ the likelihood ratio. Pearson found the spark of the idea from correspondence with Gosset, known as Student, but we will see that generating good tests requires much more than considering alternatives.
How can we consider the likelihood ratio of hypotheses when one or both can contain multiple values of the parameter? They consider the maximum values that the likelihood could take over ranges of the parameter space. In particular, they take the maximum likelihood over all possible values of θ in the entire parameter space Θ (not Θ₁), and compare it to the maximum over the restricted range of values in Θ₀, to form the ratio
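The excerpt breaks off before displaying the ratio itself; purely as an illustration of the comparison just described (not the book’s own continuation), here is a sketch for the one-sided Normal example, where the maximum of the likelihood over Θ₀ = {μ: μ ≤ 1} is compared with its maximum over all of Θ. Small values of the ratio tell against H₀.

```python
# Sketch (hypothetical data, assumed-known sigma): likelihood ratio criterion
# lambda = max over Theta_0 of L(theta; x) / max over Theta of L(theta; x).
import numpy as np
from scipy import stats

x = np.array([1.05, 1.12, 0.98, 1.20, 1.08])   # hypothetical sample
sigma = 0.3                                    # assumed-known sigma
mu_0 = 1.0                                     # H0: mu <= 1 vs. H1: mu > 1

def log_lik(mu):
    return stats.norm.logpdf(x, loc=mu, scale=sigma).sum()

mle_full = x.mean()                            # maximizer of L over all of Theta
mle_null = min(x.mean(), mu_0)                 # maximizer of L over Theta_0 = (-inf, 1]
lam = np.exp(log_lik(mle_null) - log_lik(mle_full))
print(lam)                                     # 0 < lambda <= 1; small values disfavor H0
```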