(ibid.)⁹
The bottom line is, there was no failed replication; there was one set of eclipse data that was unusable.
5. Substantively based hypotheses. We know it’s fallacious to take a statistically significant result as evidence in affirming a substantive theory, even if that theory predicts the significant result. A qualitative version of FEV, or, equivalently, an appeal to severity, underwrites this. Can failing to reject statistical null hypotheses ever inform about substantive claims? Yes. First consider how, in the midst of auditing, there’s a concern to test a claim: is an apparently anomalous result real or spurious?
Finding cancer clusters is sometimes compared to our Texas Marksman drawing a bull’s-eye around the shots after they were fired into the barn. They often turn out to be spurious. Physical theory, let’s suppose, suggests that because the quantum of energy in non-ionizing electromagnetic fields, such as those from high-voltage transmission lines, is much less than is required to break a molecular bond, there should be no carcinogenic effect from exposure to such fields. Yet a cancer association was reported in 1979 (Wertheimer and Leeper 1979). Was it real? In a randomized experiment where two groups of mice are under identical conditions except that one group is exposed to such a field, the null hypothesis that the cancer incidence rates in the two groups are identical may well be true. Testing this null is a way to ask: was the observed cancer cluster really an anomaly for the theory? Were the apparently anomalous results for the theory genuine, it is expected that H₀ would be rejected, so if it’s not, it cuts against the reality of the anomaly. Cox gives this as one of the few contexts where a reported small P-value alone might suffice.
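A minimal sketch of the kind of two-group comparison being described, with made-up counts (not data from any of the studies cited), might run as follows:

```python
# Hypothetical counts, for illustration only -- not data from any actual EMF study.
from scipy.stats import fisher_exact

cases_exposed, n_exposed = 12, 100   # tumours among exposed mice (assumed)
cases_control, n_control = 10, 100   # tumours among unexposed mice (assumed)

table = [[cases_exposed, n_exposed - cases_exposed],
         [cases_control, n_control - cases_control]]

# H0: the cancer incidence rates in the two groups are identical
odds_ratio, p_value = fisher_exact(table, alternative="two-sided")
print(f"odds ratio = {odds_ratio:.2f}, P-value = {p_value:.3f}")
```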
This wouldn’t entirely settle the issue, and our knowledge of such things is always growing. Nor does it, in and of itself, show the flaw in any studies purporting to find an association. But several of these pieces taken together can discount the apparent effect with severity. It turns out that the initial researchers in the 1979 study did not actually measure magnetic fields from power lines; when they were measured no association was found. Instead they used the wiring code in a home as a proxy. All they really showed, it may be argued, was that people who live in the homes with poor wiring code tend to be poorer than the control (Gurney et al. 1996). The study was biased. Twenty years of study continued to find negative results (Kheifets et al. 1999). The point just now is not when to stop testing – more of a policy decision – or even whether to infer, as they did, that there’s no evidence of a risk, and no theoretical explanation of how there could be. It is rather the role played by a negative statistical result, given the background information that, if the effects were real, these tests were highly capable of finding them. It amounts to a failed replication (of the observed cluster), but with a more controlled method. If a well-controlled experiment fails to replicate an apparent anomaly for an independently severely tested theory, it indicates the observed anomaly is spurious. The indicated severity and potential gaps are recorded; the case may well be reopened. Replication researchers might take note.
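The capability claim just made – that if the effects were real these tests would very probably have found them – is the kind of thing that can be checked directly: under an assumed real effect of the size at issue, how often would such an experiment reject the null? A rough simulation-based sketch, with assumed rates and sample sizes rather than the designs of the cited studies:

```python
# Rough simulated power: how often would such an experiment reject H0
# if the effect were real?  Rates and sample size are assumptions for illustration.
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(1)
n_per_group = 200
baseline_rate, exposed_rate = 0.10, 0.20   # assumed "real effect" scenario
alpha, n_sims = 0.05, 2000

rejections = 0
for _ in range(n_sims):
    exposed_cases = rng.binomial(n_per_group, exposed_rate)
    control_cases = rng.binomial(n_per_group, baseline_rate)
    table = [[exposed_cases, n_per_group - exposed_cases],
             [control_cases, n_per_group - control_cases]]
    _, p = fisher_exact(table)
    rejections += p < alpha

print(f"estimated power to detect the assumed effect: {rejections / n_sims:.2f}")
```

If this estimated capability is high, a negative result carries real weight; if it is low, the non-rejection probes little.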
Another important category of tests that Cox develops is what he calls testing discrete families of models, where there’s no nesting. In a nutshell, each model is taken in turn to assess if the data are compatible with one, both, or neither of the possibilities (Cox 1977, p. 59). Each gets its own severity assessment.
Who Says You Can’t Falsify Alternatives in a Significance Test?
Does the Cox–Mayo formulation of tests change the logic of significance tests in any way? I don’t think so and neither does Cox. But it’s different from some of the common readings. Nothing turns on whether you wish to view it as a revised account. SEV goes a bit further than FEV, and I do not saddle Cox with it. The important thing is how you get a nuanced interpretation, and we have barely begun our travels! Note the consequences for a familiar bugaboo about falsifying alternatives to significance tests. Burnham and Anderson (2014) make a nice link with Popper:
While the exact definition of the so-called “scientific method” might be controversial, nearly everyone agrees that the concept of “falsifiability” is a central tenant [sic] of empirical science (Popper 1959). It is critical to understand that historical statistical approaches (i.e., P values) leave no way to “test” the alternative hypothesis. The alternative hypothesis is never tested, hence cannot be rejected or falsified! … Surely this fact alone makes the use of significance tests and P values bogus. Lacking a valid methodology to reject/falsify the alternative science hypotheses seems almost a scandal.
(p. 629)
I think we should be scandalized. But not for the reasons alleged. Fisher emphasized that, faced with a non-significant result, a researcher’s attitude wouldn’t be full acceptance of H₀ but, depending on the context, more like the following:
The possible deviation from truth of my working hypothesis, to examine which test is appropriate, seems not to be of sufficient magnitude to warrant any immediate modification.
Or, … the body of data available so far is not by itself sufficient to demonstrate their [the deviations] reality.
(Fisher 1955, p. 73)
Our treatment cashes out these claims by either indicating the magnitudes ruled out statistically, or inferring that the observed difference is sufficiently common, even if spurious.
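To make the first disjunct concrete: in a one-sided Normal test of H₀: μ ≤ μ₀ versus H₁: μ > μ₀ (σ known), a non-significant mean x̄ warrants μ ≤ μ₀ + γ just to the extent that the test would very probably have yielded a larger observed difference were μ as great as μ₀ + γ. A minimal severity sketch with illustrative numbers (not an example from the text):

```python
# Severity for inferring mu <= mu0 + gamma after a non-significant result in a
# one-sided Normal test of H0: mu <= mu0 vs H1: mu > mu0, sigma known:
#   SEV(mu <= mu0 + gamma) = Pr(Xbar > xbar_obs; mu = mu0 + gamma).
# The numbers below are illustrative assumptions.
from scipy.stats import norm

mu0, sigma, n = 0.0, 1.0, 100
xbar_obs = 0.1                       # assumed non-significant observed mean
se = sigma / n ** 0.5

for gamma in (0.1, 0.2, 0.3, 0.4):
    sev = 1 - norm.cdf((xbar_obs - (mu0 + gamma)) / se)
    print(f"SEV(mu <= mu0 + {gamma:.1f}) = {sev:.3f}")
```

Discrepancies γ for which the severity is high are the magnitudes “ruled out statistically”; those with severity near 0.5 remain poorly probed.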
If you work through the logic, you’ll see that in each case of the taxonomy the alternative may indeed be falsified. Perhaps the most difficult one is ruling out model violations, but this is also the one that requires a less severe test, at least with a reasonably robust method of inference. So what do those who repeat this charge have in mind? Maybe they mean: you cannot falsify an alternative if you don’t specify it. But specifying a directional or type of alternative is an outgrowth of specifying a test statistic. Thus we still have the implicit alternatives in Table 3.4, all of which are open to being falsified with severity. It’s a key part of test specification to indicate which claims or features of a model are being tested. The charge might stand if a point null is known to be false, for in those cases we can’t say μ is precisely 0, say. In that case you wouldn’t want to infer it. One can still set upper bounds for how far off an adequate hypothesis can be. Moreover, there are many cases in science where a point null is inferred severely.
Nordtvedt Effect: Do the Earth and Moon Fall With the Same Acceleration?
We left off Section 3.1 with GTR going through a period of “stagnation” or “hibernation” after the early eclipse results. No one knew how to link it up with experiment. Discoveries around 1959–1960 sparked the “golden era” or “renaissance” of GTR, thanks to quantum mechanics, semiconductors, lasers, computers, and pulsars (Will 1986, p. 14). The stage was set for new confrontations between GTR’s predictions and experiments; from 1960 to 1980, a veritable “zoo” of rivals to GTR was erected, all of which could be constrained to fit the existing test results.
Not only would there have been too many alternatives to report pairwise comparisons with GTR, but the testing had to manage without having full-blown alternative theories of gravity. They could still ask, as they did in 1960: How could it be a mistake to regard the existing evidence as good evidence for GTR (or even for the deflection effect)?
They set out a scheme of parameters, the Parameterized Post-Newtonian (PPN) framework, that allowed experimental relativists to describe violations to GTR’s hypotheses – discrepancies with what it said about specific gravitational phenomena. One parameter is λ – the curvature of spacetime. An explicit goal was to prevent researchers from being biased toward accepting GTR prematurely (Will 1993, p. 10). These alternatives, by the physicists’ own admission, were set up largely as straw men to either set firmer constraints on estimates of parameters, or, more interestingly, find violations. They could test 10 or 20 or 50 rivals without having to develop them! The work involved local statistical testing and estimation of parameters describing curved space.
Interestingly, these were non-novel hypotheses set up after the data were known. However, rival theories had to be viable; they had to (1) account for experimental results already severely passed and (2) be able to show the relevance of the data for gravitational phenomena. They would have to be able to analyze and explore data about as well as GTR. They needed to permit stringent probing to learn more about gravity. (For an explicit list of requirements for a viable theory, see Will 1993, pp. 18–21.¹⁰)
All the viable members of the zoo of GTR rivals held the equivalence principle (EP), roughly the claim that bodies of different composition fall with the same accelerations in a gravitational field. This principle was inferred with severity by passing a series of null hypotheses (examples include the Eötvös experiments) that assert a zero difference in the accelerations of two differently composed bodies. Because these null hypotheses passed with high precision, it was warranted to infer that: “gravity is a phenomenon of curved spacetime,” that is, it must be described by a “metric theory of gravity” (ibid., p. 10). Those who deny we can falsify non-nulls take note: inferring that an adequate theory must be relativistic (even if not necessarily GTR) was based on inferring a point null with severity! What about the earth and moon, examples of self-gravitating bodies? Do they also fall at the same rate?
While long corroborated for solar system tests, the equivalence principle (later the weak equivalence principle, WEP) was untested for such massive self-gravitating bodies (which requires the strong equivalence principle). Kenneth Nordtvedt discovered in the 1960s that in one of the most promising GTR rivals, the Brans–Dicke theory, the moon and earth fell at different rates, whereas for GTR there would be no difference. Clifford Will, the experimental physicist I’ve been quoting, tells how in 1968 Nordtvedt finds himself on the same plane as Robert Dicke. “Escape for the beleaguered Dicke was unfeasible at this point. Here was a total stranger telling him that his theory violated the principle of equivalence!” (1986, pp. 139–40). To Dicke’s credit, he helped Nordtvedt design the experiment. A new parameter to describe the Nordtvedt effect was added to the PPN framework, i.e., η. For GTR, η = 0, so the statistical or substantive null hypothesis tested is that η = 0 as against η ≠ 0 for rivals.
How can they determine the rates at which the earth and moon are falling? Thank the space program. It turns out that measurements of the round trip travel times between the earth and moon (between 1969 and 1975) enable the existence of such an anomaly for GTR to be probed severely (and the measurements continue today). Because the tests were sufficiently sensitive, these measurements provided good evidence that the Nordtvedt effect is absent, set upper bounds to the possible violations, and provided evidence for the correctness of what GTR says with respect to this effect.
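The logic of setting “upper bounds to the possible violations” can be sketched with purely hypothetical numbers (the actual lunar laser ranging analyses are of course far more involved): from a statistically insignificant estimate of η and its standard error, one reports the values of η that the data nevertheless rule out.

```python
# Purely hypothetical sketch: one-sided upper bound on the Nordtvedt parameter eta
# from a statistically insignificant estimate.  These are NOT the published
# lunar-laser-ranging numbers; they only illustrate the form of the inference.
from scipy.stats import norm

eta_hat, se = 0.0003, 0.0005          # assumed estimate and standard error
for conf in (0.95, 0.99):
    upper = eta_hat + norm.ppf(conf) * se
    print(f"{conf:.0%} upper bound: eta < {upper:.4f}")
```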
So the old saw that we cannot falsify η ≠ 0 is just that, an old saw. Critics take Fisher’s correct claim, that failure to reject a null isn’t automatically evidence for its correctness, as claiming we never have such evidence. Even he says it lends some weight to the null (Fisher 1955). With the N-P test, the null and alternative needn’t be treated asymmetrically. In testing H₀: μ ≥ μ₀ vs. H₁: μ < μ₀, a rejection falsifies a claimed increase.¹¹ Nordtvedt’s null result added weight to GTR, not in rendering it more probable, but in extending the domain for which GTR gives a satisfactory explanation. It’s still provisional in the sense that gravitational phenomena in unexplored domains could introduce certain couplings that, strictly speaking, violate the strong equivalence principle. The error statistical standpoint describes the state of information at any one time, with indications of where theoretical uncertainties remain.
You might discover that critics of a significance test’s falsifying ability are themselves in favor of methods that preclude falsification altogether! Burnham and Anderson raised the scandal, yet their own account provides only a comparative appraisal of fit in model selection. No falsification there.
Souvenir K: Probativism
[A] fundamental tenet of the conception of inductive learning most at home with the frequentist philosophy is that inductive inference requires building up incisive arguments and inferences by putting together several different piece-meal results … the payoff is an account that approaches the kind of full-bodied arguments that scientists build up in order to obtain reliable knowledge and understanding of a field.
(Mayo and Cox 2006, p. 82)
The error statistician begins with a substantive problem or question. She jumps in and out of piecemeal statistical tests both formal and quasi-formal. The pieces are integrated in building up arguments from coincidence, informing background theory, self-correcting via blatant deceptions, in an iterative movement. The inference is qualified by using error probabilities to determine not “how probable,” but rather, “how well-probed” claims are, and what has been poorly probed. What’s wanted are ways to measure how far off what a given theory says about a phenomenon can be from what a “correct” theory would need to say about it by setting bounds on the possible violations.
An account of testing or confirmation might entitle you to confirm, support, or rationally accept a large-scale theory such as GTR. One is free to reconstruct episodes this way – after the fact – but as forward-looking accounts, they fall far short. Even if somehow magically it was known in 1960 that GTR was true, it wouldn’t snap experimental relativists out of their doldrums because they still couldn’t be said to have understood gravity, how it behaves, or how to use one severely affirmed piece to opportunistically probe entirely distinct areas. Learning from evidence turns not on appraising or probabilifying large-scale theories but on piecemeal tasks of data analysis: estimating backgrounds, modeling data, and discriminating signals from noise. Statistical inference is not radically different from, but is illuminated by, sexy science, which increasingly depends on it. Fisherian and N-P tests become parts of a cluster of error statistical methods that arise in full-bodied science. In Tour II, I’ll take you to see the (unwarranted) carnage that results from supposing they belong to radically different philosophies.
1 You will recognize the above as echoing Popperian “theoretical novelty” – Popper developed it to fit the Einstein test.
2 “A ray of light nicking the edge of the sun, for example, would bend a minuscule 1.75 arcseconds – the angle made by a right triangle 1 inch high and 1.9 miles long” (Buchen 2009).
3 To grasp this, consider that a single black swan proves the hypothesis H: some swans are not white, even though a white swan would not be taken as strong evidence for H’s denial. H’s denial would be that all swans are white.
4 The general likelihood ratio Λ(X) should be contrasted with the simple likelihood ratio associated with the well-known Neyman–Pearson (N-P) lemma, which assumes that the parameter space Θ includes only two values, i.e., Θ ≔ (θ₀, θ₁). In such a case no estimation is needed because one can take the simple likelihood ratio. Even though the famous lemma for UMP tests uses the highly artificial case of point against point hypotheses (θ₀, θ₁), it is erroneous to suppose the recommended tests are intended for this case. A UMP test, after all, alludes to all the possible parameter values, so just picking two and ignoring the others would not be UMP.
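One common way to display the contrast (a standard textbook rendering, not a quotation from Cox or from the lemma itself) is

$$
\Lambda(\mathbf{x}) \;=\; \frac{\sup_{\theta \in \Theta_0} L(\theta;\mathbf{x})}{\sup_{\theta \in \Theta} L(\theta;\mathbf{x})}
\qquad \text{versus} \qquad
\frac{L(\theta_1;\mathbf{x})}{L(\theta_0;\mathbf{x})},
$$

where the general statistic requires maximizing the likelihood over the relevant parameter spaces (hence the estimation), while the simple ratio merely evaluates the likelihood at the two specified points. (Conventions differ; some authors invert the ratio or put the supremum over Θ₁ in the numerator.)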
5 Initial developments of the severity idea were Mayo (1983, 1988, 1991, 1996). In Mayo and Spanos (2006, 2011), it was developed much further.
6 “By the fall of 1932 there appeared to be several reasons why Neyman might never become a professor in Poland. One was his subject matter, which was not generally recognized as an academic specialty. Another was the fact that he was married to a Russian – and an independent, outspoken Russian who lived on occasion apart from her husband, worked and painted in Paris, traveled on a freighter as a nurse for the adventure of it, and sometimes led tourist excursions into the Soviet Union.” (C. Reid 1998, p. 105).
7 “A significance test is defined by a set of [critical] regions [wα] satisfying the following essential requirements. First, [the regions must be nested: wα₁ ⊆ wα₂ whenever α₁ < α₂]; this is to avoid such nonsense as saying that data depart significantly from H₀ at the 1% level but not at the 5% level.” Next “we require that, for all α, [Pr(Y ∈ wα; H₀) = α].” (Cox and Hinkley 1974, pp. 90–1)
8 Barnard was surprised when I showed their paper to him, claiming it was a good example of why scientists tended not to take philosophers seriously. But in this case even the physicists were sufficiently worried to reanalyze the experiment.
9 Data from ESA’s Gaia mission should enable light deflection to be measured with an accuracy of 2 × 10⁻⁶ (Mignard and Klioner 2009, p. 308).
10 While a viable theory can’t just postulate the results ad hoc, “this does not preclude ‘arbitrary parameters’ being required for gravitational theories to accord with experimental results” (Mayo 2010a, p. 48).
11 Some recommend “equivalence testing” where H₀: μ ≥ μ₀ or μ ≤ −μ₀, and rejecting both sets bounds on μ. One might worry about low-powered tests, but it isn’t essentially different from setting upper bounds for a more usual null. (For discussion see Lakens 2017, Senn 2001a, 2014, R. Berger and Hsu 1996, R. Berger 2014, Wellek 2010).
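A minimal sketch of the two one-sided tests (TOST) version of this idea, with an assumed equivalence margin and simulated data (not an implementation from any of the papers cited):

```python
# Two one-sided tests (TOST): reject both H0: mu >= delta and H0: mu <= -delta
# to infer -delta < mu < delta.  Margin and data are illustrative assumptions.
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(0)
x = rng.normal(loc=0.05, scale=1.0, size=80)   # simulated observations
delta, alpha = 0.5, 0.05                        # assumed equivalence margin, level

p_lower = ttest_1samp(x, popmean=-delta, alternative="greater").pvalue
p_upper = ttest_1samp(x, popmean=delta, alternative="less").pvalue

if max(p_lower, p_upper) < alpha:
    print(f"both one-sided nulls rejected: infer -{delta} < mu < {delta}")
else:
    print("equivalence not established at this sample size")
```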
Tour II
It’s The Methods, Stupid
There is perhaps in current literature a tendency to speak of the Neyman–Pearson contributions as some static system, rather than as part of the historical process of development of thought on statistical theory which is and will always go on.