Here is what one can say on this subject: One should carefully guard against the tendency to consider as striking an event that one has not specified beforehand, because the number of such events that may appear striking, from different points of view, is very substantial.
(ibid., pp. 964–5)
The stage fades to black, then a spotlight beams on Neyman and Pearson mid-stage.
N-P: We appear to find disagreement here, but are inclined to think that … the two writers [Bertrand and Borel] are not really considering precisely the same problem. In general terms the problem is this: Is it possible that there are any efficient tests of hypotheses based upon the theory of probability, and if so, what is their nature? … What is the precise meaning of the words ‘an efficient test of a hypothesis’?
[W]e may consider some specified hypothesis, as that concerning the group of stars, and look for a method which we should hope to tell us, with regard to a particular group of stars, whether they form a system, or are grouped ‘by chance,’ … their relative movements unrelated.
(1933, p. 140; emphasis added)
If this were what is required of ‘an efficient test,’ we should agree with Bertrand in his pessimistic view. For however small be the probability that a particular grouping of a number of stars is due to ‘chance,’ does this in itself provide any evidence of another ‘cause’ for this grouping but ‘chance’? … Indeed, if x is a continuous variable – as for example is the angular distance between two stars – then any value of x is a singularity of relative probability equal to zero. We are inclined to think that as far as a particular hypothesis is concerned, no test based upon the theory of probability can by itself provide any valuable evidence of the truth or falsehood of that hypothesis. But we may look at the purpose of tests from another view-point.
(ibid., pp. 141–2)
Fade to black, spot on narrator mid-stage:
Narrator: We all know our famous lines are about to come. But let’s linger on the “as far as a particular hypothesis is concerned.” For any particular case, one may identify data-dependent features x that would be highly improbable “under the particular hypothesis of chance.” Every outcome would too readily be considered statistically unusual. We must “carefully guard,” Borel warns, “against the tendency to consider as striking an event that one has not specified beforehand” (Lehmann 1993a, p. 964). If you are required to set the test’s capabilities ahead of time, then you need to specify the type of falsity of H₀ – the test statistic – beforehand. An efficient test should capture Fisher’s desire for tests sensitive to departures of interest. You should also wish to avoid tests that are more likely to find discrepancies when there are none than when they are present. Listen to Neyman’s reflection on Borel’s remarks much later on, in 1977.
Fade to black. Spotlight on an older Neyman, stage right. (He’s in California; in the background there are palm trees, and Berkeley.)
Neyman: The question (what is an efficient test of a statistical hypothesis) is about an intelligible methodology for deciding whether the observed [difference] … contradicts the stochastic model …
[T]his question was the subject of a lively discussion by Borel and others. Borel was optimistic but insisted that: (a) the criterion to test a hypothesis (a ‘statistical hypothesis’) using some observations must be selected not after the examination of the results of observation, but before, and (b) this criterion should be a function of the observations ‘en quelque sorte remarquable’ [of a remarkable sort]. It is these remarks of Borel that served as an inspiration to Egon S. Pearson and myself in our effort to build a frequentist theory of testing hypotheses.
(Neyman 1977, pp. 102–3)
Fade to black. Spotlight on an older Egon Pearson writing a letter to Neyman about the preprint Neyman sent of his 1977 paper. (The letter is unpublished, but I cite Lehmann 1993a.)
Pearson: I remember that you produced this quotation [from Borel] when we began to get our [1933] paper into shape … The above stages [wherein he had been asking ‘Why use that particular test statistic?’] led up to Borel’s requirement of finding … a criterion which was “a function of the observations ‘en quelque sorte remarquable’”. Now my point is that you and I (perhaps my first leading) had ourselves reached the Borel requirement independently of Borel, because we were serious humane thinkers; Borel’s expression neatly capped our own.
(pp. 966–7)
Fade to black. End play.
Egon has the habit of leaving tantalizing claims unpacked, and this is no exception: What exactly is the Borel requirement he thinks they’d reached due to their being “serious humane thinkers”? I can well imagine turning this episode into something like Michael Frayn’s expressionist play, Copenhagen, wherein a variety of alternative interpretations are entertained based on subsequent work and remarks. I don’t say that a re-run would enjoy a long life on Broadway, but a small handful of us would relish it.
Inferential Rationales for Test Requirements
It’s not hard to see that “as far as a particular” star grouping is concerned, we cannot expect a reliable inference to just any non-chance effect discovered in the data. The more specific the feature is to these particular observations, the more improbable it is. What’s the probability of three hurricanes followed by two plane crashes? To cope with the fact that any sample is improbable in some respect, statistical methods do one of two things: appeal to prior probabilities or to error probabilities of a procedure. The former can check our tendency to find a more likely explanation H′ than chance by giving an appropriately low prior weight to H′. The latter says we need to consider the problem as one of a general type: a general method, leading from a test statistic to some assertion about an alternative hypothesis that expresses the non-chance effect. Such assertions may be in error, but we can control such erroneous interpretations.
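To make the “any sample is improbable in some respect” point concrete, here is a minimal numerical sketch of my own (the fair-coin example is illustrative, not Borel’s or N-P’s): every particular outcome of a long chance process has vanishingly small probability, so the bare improbability of the observed configuration cannot, by itself, discredit “chance”; the feature must be specified beforehand.

```python
# Illustrative sketch: any *particular* outcome of a chance process is astronomically
# improbable, yet some such outcome must occur. Improbability of the observed
# configuration, by itself, is therefore no evidence against "chance."
import numpy as np

rng = np.random.default_rng(1)
n = 50
sequence = rng.integers(0, 2, size=n)   # one outcome of 50 fair coin tosses
p_exact = 0.5 ** n                      # probability of this exact sequence under chance
print("observed sequence:", "".join(map(str, sequence)))
print(f"probability of this exact sequence under chance: {p_exact:.1e}")   # ~8.9e-16
```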
Isn’t this taken care of by Fisher’s requirement that Pr(P < p₀; H₀) = p₀ – that the test rarely rejects the null if true? It may be, in practice, Neyman and Pearson thought, but only under certain conditions that were not explicitly codified by Fisher’s simple significance tests. With just the null hypothesis, it is unwarranted to take low P-values as evidence for a specific “cause” or non-chance explanation. A statistical effect, even if genuine, underdetermines its explanation; several rivals can be erected post-data, but the ways they could be in error would not have been probed. The fallacy of rejection looms. Fisher (1935a, p. 187) is well aware that “the same data may contradict the hypothesis in any of a number of different ways,” and that different corresponding tests would be used.
The notion that different tests of significance are appropriate to test different features of the same null hypothesis presents no difficulty to workers engaged in practical experimentation. … [T]he experimenter … is aware of what observational discrepancy it is which interests him, and which he thinks may be statistically significant, before he enquires what test of significance, if any, is available appropriate to his needs.
(ibid., p. 190)
Even if “an experienced experimenter” knows the appropriate test, this doesn’t lessen the importance of N-P’s interest in seeking to identify a statistical rationale for the choices made on informal grounds. There’s legitimate concern about selecting the alternative that gives the more impressive P-value.
Here’s Pearson writing with C. Chandra Sekar on testing whether a sample has been drawn from a single Normal population:
… it is not possible to devise an efficient test if we only bring into the picture this single normal probability distribution with its two unknown parameters. We must also ask how sensitive the test is in detecting failure of the data to comply with the hypotheses tested, and to deal with this question effectively we must be able to specify the directions in which the hypothesis may fail.
(Pearson and Chandra Sekar 1936, p. 121)
And while:
It is sometimes held that the appropriate test can be chosen after examining the data [but it will be hard to be unprejudiced at this point].
(ibid., p. 127)
Their position is:
To base the choice of the test of a statistical hypothesis upon an inspection of the observations is a dangerous practice; a study of the configuration of a sample is almost certain to reveal some feature, or features, which are exceptions if the hypothesis is true … By choosing the feature most unfavourable to H₀ out of a very large number of features examined it will usually be possible to find some reason for rejecting the hypothesis. It must be remembered, however, that the point now at issue will not be whether it is exceptional to find a given criterion with so unfavourable a value. We shall need to find an answer to the more difficult question. Is it exceptional that the most unfavourable criterion of the n, say, examined should have as unfavourable a value as this?
(ibid., p. 127)
In short, we’d have to adjust the attained P-value. In so doing, the goal is not behavioristic; it is to avoid glaring fallacies in the test at hand, fallacies we know all too well.
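To see numerically what that adjustment guards against, here is a hedged simulation sketch (my own construction, not Pearson and Chandra Sekar’s calculation; it assumes, for simplicity, n independent, well-behaved P-values): cherry-picking the most unfavourable of n criteria makes small P-values routine under H₀, whereas asking whether the minimum of the n is itself exceptional restores the nominal error rate.

```python
# Sketch of the selection effect: generate n "feature" P-values under H0 (assumed
# independent and uniform), take the most unfavourable one, and compare the naive
# verdict with the adjusted question: is it exceptional that the *smallest* of n
# P-values is this small? Under independence, Pr(min P <= p; H0) = 1 - (1 - p)^n.
import numpy as np

rng = np.random.default_rng(0)
n_features, n_trials, alpha = 20, 10_000, 0.05

naive, adjusted = 0, 0
for _ in range(n_trials):
    p = rng.uniform(size=n_features)          # each P-value is U(0,1) when H0 is true
    p_min = p.min()                           # the feature "most unfavourable to H0"
    p_adj = 1 - (1 - p_min) ** n_features     # P-value of the minimum itself
    naive += p_min <= alpha
    adjusted += p_adj <= alpha

print(f"'significant' results using the cherry-picked P-value: {naive / n_trials:.3f}")   # ~0.64
print(f"after adjusting for the {n_features} features searched: {adjusted / n_trials:.3f}")   # ~0.05
```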
The statistician who does not know in advance with which type of alternative to H₀ he may be faced, is in the position of a carpenter who is summoned to a house to undertake a job of an unknown kind and is only able to take one tool with him! Which shall it be? Even if there is an ‘omnibus’ tool, it is likely to be far less sensitive at any particular job than a specialized one; but the specialized tool will be quite useless under the wrong conditions.
(ibid., p. 126)
Neyman (1952) demonstrates that choosing the alternative post-data allows a result that leads to rejection in one test to yield non-rejection in another, despite both adhering to a fixed significance level. (Fisher concedes this as well.) If you are keen to ensure the test is capable of teaching about discrepancies of interest, you should prespecify an alternative hypothesis, where the null and alternative hypotheses exhaust the space, relative to a given question.
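A hedged sketch of that phenomenon (my own stand-in, not Neyman’s 1952 example): under a Normal model with σ known and H₀: μ = 0, the same data can be rejected by a level-0.05 test directed at μ > 0 yet retained by a level-0.05 test directed at μ < 0, and letting the data pick the direction roughly doubles the actual error rate.

```python
# Same data, same null H0: mu = 0 (sigma = 1 known), two tests each of level 0.05:
# one aimed at the alternative mu > 0, the other at mu < 0. One rejects, one does not.
# Choosing the alternative after seeing the data inflates the error rate to about 0.10.
import numpy as np

rng = np.random.default_rng(3)
n = 100
x = rng.normal(loc=0.5, scale=1.0, size=n)      # observed sample (true mean 0.5 here)
z = np.sqrt(n) * x.mean()                       # z statistic under H0: mu = 0

upper_rejects = z > 1.645                       # level-0.05 test directed at mu > 0
lower_rejects = z < -1.645                      # level-0.05 test directed at mu < 0
print(f"z = {z:.2f}; upper-tailed test rejects: {upper_rejects}; lower-tailed: {lower_rejects}")

# Error rate under H0 if the "interesting" direction is chosen post-data:
z_null = np.sqrt(n) * rng.normal(size=(100_000, n)).mean(axis=1)
either = (z_null > 1.645) | (z_null < -1.645)
print(f"actual rejection rate under H0 with post-data choice of alternative: {either.mean():.3f}")   # ~0.10
```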
The Deconstruction So Far
If we accept the words “an efficient test of the hypothesis H” to mean a statistical (methodological) falsification rule that controls the probabilities of erroneous interpretations of data, and ensures the rejection was because of the underlying cause (as modeled), then we agree with Borel that efficient tests are possible. This requires (i) a prespecified test criterion to avoid verification biases while ensuring power (efficiency), and (ii) consideration of alternative hypotheses to avoid fallacies of acceptance and rejection. We should steer away from isolated or particular curiosities toward those that are tracking genuine effects. Fisher is to be credited, Pearson remarks, for his “emphasis on planning an experiment, which led naturally to the examination of the power function, both in choosing the size of sample so as to enable worthwhile results to be achieved, and in determining the most appropriate tests” (Pearson 1962, p. 277). If you’re planning, you’re prespecifying, perhaps, nowadays, by explicit preregistration.
“We agree also that not any character, x, whatever is equally suitable to be a basis for an efficient test” (Neyman and Pearson 1933, p. 142). The test “criterion should be a function of the observations,” and of the alternatives, such that there is a known statistical relationship between the characteristic of the data and the underlying distribution (Neyman 1977, p. 103). It must enable the error probabilities to be computed under the null and also under discrepancies from the null, despite any unknown parameters.
An exemplary characteristic of this sort is the remarkable property offered by pivotal test statistics such as Z or T, whose distributions are known:

Z = (X̄ − μ)/(σ/√n) and T = (X̄ − μ)/(s/√n).

Z has the standard Normal distribution, and T the Student’s t distribution; in T, σ is unknown and thus replaced by its estimator s.
Consider the pivot Z. The probability that Z > 1.96 is 0.025. But by pivoting, Z > 1.96 is equivalent to μ < X̄ − 1.96σ/√n, so it too has probability 0.025; likewise, Z < −1.96 is equivalent to μ > X̄ + 1.96σ/√n. Therefore, the procedure that asserts X̄ − 1.96σ/√n < μ < X̄ + 1.96σ/√n asserts correctly 95% of the time!² We can make valid probabilistic claims about the method that hold post-data, if interpreted correctly. For the severe tester, these also inform about claims that are well and poorly tested (Section 3.7). This leads us on a side trip to Fisher’s fiducial territory (Section 5.8), and the initial development of the behavioral performance idea. First, let’s trace some of the ways our miserable citation has been interpreted by contemporaries.
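Before turning to those contemporaries, here is a quick simulation check of the pivotal claim above, under the simple one-sample Normal model with σ known (my own sketch, with hypothetical numbers): whatever the true μ, the procedure that asserts X̄ − 1.96σ/√n < μ < X̄ + 1.96σ/√n asserts correctly about 95% of the time.

```python
# Coverage check for the pivot-based procedure: assert
#   x_bar - 1.96*sigma/sqrt(n) < mu < x_bar + 1.96*sigma/sqrt(n)
# and count how often the assertion is correct over repeated samples.
import numpy as np

rng = np.random.default_rng(7)
mu_true, sigma, n, trials = 3.2, 1.0, 25, 100_000
half_width = 1.96 * sigma / np.sqrt(n)

# sampling distribution of the sample mean for each trial
x_bar = rng.normal(loc=mu_true, scale=sigma / np.sqrt(n), size=trials)
correct = (x_bar - half_width < mu_true) & (mu_true < x_bar + half_width)
print(f"proportion of correct assertions: {correct.mean():.3f}")   # close to 0.95
```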
The Miserable Passage in the Hands of Contemporaries
Without hoping to know whether each separate hypothesis is true or false, we may search for rules to govern our behavior with regard to them, in following which we insure that, in the long run of experience, we shall not be too often wrong. (Neyman and Pearson 1933, p. 142)
Ian Hacking.
According to Ian Hacking (1965) this passage shows that Neyman and Pearson endorse something “more radical than anything I have mentioned so far” (p. 103). What they are saying is that “there is no alternative to certainty and ignorance” (p. 104). If probability only applies to the rule’s long-run error control, Hacking is saying in 1965, it’s not an account of inductive inference. This is precisely what he comes to deny in his 1980 “retraction” (Section 2.1, Exhibit (iii)), but here he’s leading the posse in philosophy toward the most prevalent view, even though it comes in different forms.
Isaac Levi.
Isaac Levi (1980, p. 404), reacting to Hacking (1965), claims “Hacking misinterprets these remarks [in our passage] when he attributes to Neyman and Pearson the view that ‘there is no alternative to certainty or ignorance.’” Even N-P allow intermediate standpoints when legitimate prior probabilities are available. Finding them to be so rarely available, N-P were led to methods whose validity would not depend on priors. Except for such cases, Levi concurs with the early Hacking that N-P deny evidence altogether. According to Levi, N-P are “objectivist necessitarians” who stake out a rather robotic position: tests only serve as routine programs for “selecting policies rather than using such reports as evidence” (ibid., p. 408). While this might be desirable in certain contexts, Levi objects, “this does not entitle objectivist necessitarians to insist that rational agents should always assess benefits in terms of the long run nor to favor routinization over deliberation” (ibid.).
These construals by philosophers made sense in the context of seeking an inductive logic that would assign a degree of rational belief, support, or confirmation to statistical hypotheses. N-P opposed such a view. Their attitude is: we’re just giving examples to illustrate and capsulize a rationale to underwrite the tests chosen on intuitive grounds. Even in Neyman–Pearson (1928):
[T]he tests should only be regarded as tools which must be used with discretion and understanding, and not as instruments which in themselves give the final verdict. … we must not discard the original hypothesis until we have examined the alternative suggested, and have satisfied ourselves that it does involve a change in the real underlying factors in which we are interested … that the alternative hypothesis is not error in observation, error in record, variation due to some outside factor that it was believed had been controlled, or to any one of many causes …
(p. 58)
In the 1933 paper, they explicitly distinguish their account from contexts where values enter other than controlling erroneous interpretations of data: “[I]t is possible that other conceptions of relative value may be introduced. But the problem is then no longer the simple one of discriminating between hypotheses” (1933, p. 148).
Howson and Urbach.
Howson and Urbach interpret this same passage in yet another, radical manner. They regard it as “evident” that for Neyman and Pearson, acceptance and rejection of hypotheses is “the adoption of the same attitude towards them as one would take if one had an unqualified belief in their truth or falsehood” (1993, p. 204), putting up “his entire stock of worldly goods” upon a single statistically significant result (p. 203). Even on the strictest behavioristic formulation, “to accept a hypothesis H means only to decide to take action A rather than action B” (Neyman 1950, p. 259). It could be “decide” to declare the study unreplicable, publish a report, tell a patient to get another test, announce a genuine experimental effect, or whatever. A particular action always had to be spelled out; it was never to take any and all actions as if you had “unqualified belief.”
Neyman, not Pearson, is deemed the performance-oriented one, but even he used conclude and decide interchangeably:
The analyses we performed led us to ‘conclude’ or ‘decide’ that the hypotheses tested could be rejected without excessive risk of error. In other words, after considering the probability of error (that is, after considering how frequently we would be in error if in conditions of our data we rejected the hypotheses tested), … we decided to act on the assumption (or concluded) that the two groups are not random samples from the same population.
(1976, pp. 750–1; the emphasis is Neyman’s)
What would make the reading closer to severity than performance is for the error probability to indicate what would/would not be a warranted cause of the observations. It’s important, too, to recognize Neyman’s view of inquiry: “A study of any serious substantive problem involves a sequence of incidents at which one is forced to pause and consider what to do next. In an effort to reduce the frequency of misdirected activities one uses statistical tests” (1976, p. 737). Rather than a series of unrelated tests, a single inquiry involves numerous piecemeal checks, and the error control promotes the “lift-off.” Mistakes in one part ramify in others so as to check the overall inference. Even if Neyman wasn’t consciously aware of the rationale behind tests picked out by these concerns, they still may be operative.