This was supposed to be something Fisher would like, so what happened to P-values? They have a slight walk-on part: the rejected hypothesis is the one that has the lower P-value. Its value is irrelevant, but it directs you to which posterior to compute. We might understand his Bayesian error probabilities this way: If I’ve rejected H₀, I’d be wrong if H₀ were true, so Pr(H₀|x) is a probability of being wrong about H₀. It’s the Bayesian Type I error probability₂. If instead you reject H₁, then you’d be wrong if H₁ were true. So in that case you report the Bayesian Type II error probability₂, which would be Pr(H₁|x) = 1/[1 + B(x)]. Whatever you think of these, they’re quite different from error probability₁, which does not use priors on the Hᵢ.
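To make the arithmetic concrete, here is a minimal sketch (my illustration, not Berger’s code) of how these posteriors follow from a Bayes factor B(x) under the default 0.5 priors, matching the formula Pr(H₁|x) = 1/[1 + B(x)] above:

```python
def bayesian_error_probabilities(B):
    """Posteriors under equal 0.5 priors, given the Bayes factor
    B(x) = likelihood of x under H0 / likelihood of x under H1."""
    pr_H0 = B / (1.0 + B)    # Pr(H0 | x): Bayesian Type I error probability(2)
    pr_H1 = 1.0 / (1.0 + B)  # Pr(H1 | x): Bayesian Type II error probability(2)
    return pr_H0, pr_H1

# Illustrative Bayes factor only: B(x) = 1/3 (evidence against H0)
print(bayesian_error_probabilities(1 / 3))  # ≈ (0.25, 0.75)
```

Note that neither number involves the sampling distribution of a test statistic; that is the sense in which they differ from error probability₁.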
Sleight of Hand?

Surprisingly, Berger claims to give a “dramatic illustration of the nonfrequentist nature of P-values” (ibid., p. 3). Wait a second, how did they become non-frequentist? What he means is that the P-value can be shown to disagree with the special posterior probability for H₀, defined as error probability₂. They’re not called Bayesian error probabilities any more but frequentist conditional error probabilities (CEPs). Presto! A brilliant sleight of hand.
This 0.5 prior is not supposed to represent degree of belief, but it is Berger’s “objective” default Bayesian prior. Why does he call it frequentist? He directs us to an applet showing that if we imagine randomly selecting our test hypothesis from a population of null hypotheses, 50% of which are true, the rest false, and then compute the relative frequency of true nulls conditional on their having been rejected at significance level p, we get a number that is larger than p. This violates what he calls the frequentist principle (not to be confused with FEV):
Berger’s frequentist principle: Pr(H₀ true | H₀ rejected at level p) should equal p.
This is very different from what a P-value gives us, namely, Pr(P ≤ p; H₀) = p (or Pr(d(X) ≥ d(x₀); H₀) = p).
He actually states the frequentist principle more vaguely; namely, that the reported error probability should equal the actual one, but the computation is to error probability₂. If I’m not being as clear as possible, it’s because Berger isn’t, and I don’t want to prematurely saddle him with one of at least two interpretations he moves between. For instance, Berger says the urn of nulls applet is just a heuristic, showing how it could happen. So suppose the null was randomly selected from an urn of nulls, 50% of which are true. Wouldn’t 0.5 be its frequentist prior? One has to be careful. First consider a legitimate frequentist prior. Suppose I selected the hypothesis H₀: that the mean temperature in the water, θ, is 150 degrees (Section 3.2). I can see this value resulting from various features of the lake and cooling apparatus, and identify the relative frequency that θ takes different values. {Θ = θ} is an event associated with random variable Θ. Call this an empirical or frequentist prior just to fix the notion. What’s imagined in Berger’s applet is very different. Here the analogy is with diagnostic screening for disease, so I will call it that (Section 5.6). We select one null from an urn of nulls, which might include all hypotheses from a given journal, a given year, or lots of other things.⁷ If 50% of the nulls in this urn are true, the experiment of randomly selecting a null from the urn could be seen as a Bernoulli trial with two outcomes: a null that is true or false. The probability of selecting a null that has the property “true” is 0.5. Suppose I happen to select H₀: θ = 150, the hypothesis from the accident at the water plant. It would be incorrect to say 0.5 was the relative frequency with which θ = 150 would emerge under the empirical prior. So there’s a frequentist computation, but it differs from what Neyman’s empirical Bayesian would assign it. I’ll come back to this later (Excursion 6).
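For readers who want to see the urn heuristic in action, here is a minimal simulation sketch; the 0.8 power assigned to false nulls, like the other settings, is an illustrative assumption of mine, not a value from Berger:

```python
import random

def prop_true_among_rejected(p=0.05, power=0.8, prop_true=0.5, trials=100_000):
    """Estimate Pr(H0 true | H0 rejected at level p) in the urn-of-nulls
    heuristic: true nulls are rejected with probability p, false nulls
    with probability equal to the test's power."""
    rejected = rejected_true = 0
    for _ in range(trials):
        null_is_true = random.random() < prop_true  # draw a null from the urn
        if random.random() < (p if null_is_true else power):
            rejected += 1
            rejected_true += null_is_true
    return rejected_true / rejected

# Expected value: 0.5*p / (0.5*p + 0.5*power) = 0.05/0.85 ≈ 0.059 > p.
print(prop_true_among_rejected())
```

The screening posterior exceeds p, which is the “violation” of Berger’s principle; but notice that what is computed is a diagnostic screening probability, not error probability₁.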
Suppose instead we keep to the default Bayesian construal that Berger favors. The priors come from one or another conventional assignment. On this reading, his frequentist principle is: the P-value should equal the default posterior on H₀. That is, a reported P-value should equal error probability₂. By dropping the designation “Bayesian” that he himself recommended “to differentiate between the schools” (p. 30), it’s easy to see how confusion ensues.
Berger emphasizes that the confusion he is on about “is different from the confusion between a P-value and the posterior probability of the null hypothesis” (p. 4). What confusion? That of thinking P-values are frequentist error probabilities₂ – but he has just introduced the shift of meaning! The only way error probability₂ inherits a frequentist meaning is by reference to the heuristic (where the prior is the proportion of true nulls in a hypothetical urn of nulls), giving a diagnostic screening posterior probability. The subscripts are a lifesaver for telling what’s true when definitions shift about throughout an argument. The frequentist had only ever wanted error probabilities₁ – the ones based solely on the sampling distribution of d(X). Yet now he declares that error probability₂ – Bayesian error probability – is the only real or relevant frequentist error probability! If this is the requirement, preset α and β aren’t error probabilities either.
It might be retorted, however, that this was to be a compromise position. We can’t dismiss it out of hand merely because it requires Neyman and Fisher to become default Bayesians. To smoke the peace pipe, everyone has to give a little. According to Berger, “Neyman criticized p-values for violating the frequentist principle” (p. 3). With Berger’s construal, it is not violated. So it appears Neyman gets something. Does he? We know N-P used P-values, and never saw them as non-frequentist; and surely Neyman wouldn’t be criticizing a P-value for not being equal to a default (or other) posterior probability. Hence Nancy Reid’s quip: “the Fisher/Jeffreys agreement is essentially to have Fisher” kowtow to Jeffreys (N. Reid 2003). The surest sign that we’ve swapped out meanings is the selling points.
Consider the Selling Points
“Teaching statistics suddenly becomes easier … it is considerably less important to disabuse students of the notion that a frequentist error probability is the probability that the hypothesis is true, given the data” (Berger 2003, p. 8), since his error probability₂ actually has that interpretation. We are also freed of having to take into account the stopping rule used in sequential tests (ibid.). As Berger dangles his tests in front of you with the labels “frequentist,” “error probabilities,” and “objectivity,” there’s one thing you know: if the methods enjoy the simplicity and freedom of paying no price for optional stopping, you’ll want to ask if they’re also controlling error probabilities₁. When that handwringing disappears, unfortunately, so does our assurance that we block inferences that have passed with poor severity.
Whatever you think of default Bayesian tests, Berger’s error probability₂ differs from N-P’s error probability₁. N-P requires controlling the Type I and II error probabilities at low values regardless of prior probability assignments. The scrutiny here is not of Berger’s recommended tests – that comes later. It is merely to shine a light on the type of shifting meanings that our journey calls for. Always carry your walking stick – it serves as a metaphorical subscript to keep you afloat.
Souvenir M: Quicksand Takeaway
The howlers and chestnuts of Section 3.4 call attention to: the need for an adequate test statistic, the difference between an i-assumption and an actual assumption, and the fact that tail areas serve to raise, not lower, the bar for rejecting a null hypothesis. The stop in Section 3.5 pulls back the curtain on one front of typical depictions of the N-P vs. Fisher battle, and Section 3.6 disinters equivocal terms in a popular peace treaty between the N-P, Fisher, and Jeffreys tribes. Of these three stops, I admit that the last may still be murky. One strategy we used to clarify is subscripts to distinguish slippery terms. Probabilities of Type I and Type II errors, as well as P-values, are defined exclusively in terms of the sampling distribution of d(X), under a statistical hypothesis of interest. That’s error probability₁. Error probability₂, in addition to requiring priors, involves conditioning on the particular outcome, with the hypothesis varying. There’s no consideration of the sampling distribution of d(X), if you’ve conditioned on the actual outcome. A second strategy is to consider the selling points of the new “compromise” construal, to gauge what it’s asking you to buy.
Here’s from our guidebook:

You’re going to need to be patient. Depending on how much quicksand is around you, it could take several minutes or even hours to slowly, methodically get yourself out …

Relax. Quicksand usually isn’t more than a couple feet deep … If you panic you can sink further, but if you relax, your body’s buoyancy will cause you to float.

Breathe deeply … It is impossible to “go under” if your lungs are full of air. (WikiHow 2017)

In later excursions, I promise, you’ll get close enough to the edge of the quicksand to roll easily to hard ground. More specifically, all of the terms and arguments of Section 3.6 will be excavated.
1 Neyman (1976) said he was “not aware of a conceptual difference between a ‘test of a statistical hypothesis’ and a ‘test of significance’ and uses these terms interchangeably” (p. 737). We will too, with qualifications as needed.
2 Thanks to the interpretation being fairly intimately related to the test, we get the error probabilities (formal or informal) attached to the interpretation.
3 Note that p_obs and t_obs are the same as our p₀ and d₀ (or d(x₀)).
4 Fisher, in a 1926 paper, gives another nice rendering: “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance. The very high odds sometimes claimed for experimental results should usually be discounted, for inaccurate methods of estimating error have far more influence than has the particular standard of significance chosen” (pp. 504–5).
5 Pearson said that a statistician has an α and a β side; the former alludes to what they say in theory, the latter to what they do in practice. In practice, even Neyman, so often portrayed as performance-oriented, was as inferential as Pearson.
6 We are forced to spend more time on P-values than one would wish simply because so many of the criticisms and proposed reforms are in terms of them.
7 It is ironic that it’s in the midst of countering a common charge – that he requires repeated sampling from the same population – that Neyman (1977) talks about a series of distinct scientific inquiries (presumably independent) with Type I and Type II error probabilities (for specified alternatives) α₁, α₂, …, αₙ, … and β₁, β₂, …, βₙ, …:

I frequently hear a particular regrettable remark … that the frequency interpretation of either the level of significance α or of power (1 − β) is only possible when one deals many times WITH THE SAME HYPOTHESIS H, TESTED AGAINST THE SAME ALTERNATIVE.

(Neyman 1977, p. 109; his use of capitals)

From the Central Limit Theorem, Neyman remarks:

The relative frequency of the first kind of errors will be close to the arithmetic mean of numbers α₁, α₂, …, αₙ, … Also the relative frequency of detecting the falsehood of the hypotheses tested, when false … will differ but little from the average of [the corresponding powers, for specified alternatives].
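A quick simulation sketch (mine; the αᵢ are drawn arbitrarily between 0.01 and 0.05 purely for illustration) bears out Neyman’s remark that the long-run relative frequency of Type I errors across distinct inquiries approaches the average of the αᵢ:

```python
import random

random.seed(1)  # reproducibility of this illustration
alphas = [random.uniform(0.01, 0.05) for _ in range(100_000)]  # distinct inquiries
# Suppose each inquiry tests a true null; a Type I error then occurs
# with probability alpha_i in inquiry i.
errors = sum(random.random() < a for a in alphas)
print(errors / len(alphas))       # relative frequency of Type I errors
print(sum(alphas) / len(alphas))  # arithmetic mean of the alpha_i; nearly equal
```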
Tour III
Capability and Severity: Deeper Concepts
From the itinerary: A long-standing family feud among frequentists is between hypothesis tests and confidence intervals (CIs), but in fact there’s a clear duality between the two. The dual mission of the first stop (Section 3.7) of this tour is to illuminate both CIs and severity by means of this duality. A key idea is arguing from the capabilities of methods to what may be inferred. The severity analysis seamlessly blends testing and estimation. A typical inquiry first tests for the existence of a genuine effect and then estimates magnitudes of discrepancies, or inquires whether theoretical parameter values are contained within a confidence interval. At the second stop (Section 3.8) we reopen a highly controversial matter of interpretation that is often taken as settled. It relates to statistics and the discovery of the Higgs particle – displayed in a recently opened gallery on the “Statistical Inference in Theory Testing” level of today’s museum.
3.7 Severity, Capability, and Confidence Intervals (CIs)
It was shortly before Egon offered him a faculty position at University College starting in 1934 that Neyman gave a paper at the Royal Statistical Society (RSS) which included a portion on confidence intervals, intending to generalize Fisher’s fiducial intervals. With K. Pearson retired (he’s still editing Biometrika, but across campus with his assistant Florence David), the tension is between E. Pearson, along with remnants of K.P.’s assistants, and Fisher, on the second and third floors, respectively. Egon hoped Neyman’s coming on board would melt some of the ice.
Neyman’s opinion was that “Fisher’s work was not really understood by many statisticians … mainly due to Fisher’s very condensed form of explaining his ideas” (C. Reid 1998, p. 115). Neyman sees himself as championing Fisher’s goals by means of an approach that gets around these expository obstacles. So Neyman presents his first paper to the Royal Statistical Society (June 1934), which includes a discussion of confidence intervals, and, as usual, comments (later published) follow. Arthur Bowley (1934), a curmudgeon on the K.P. side of the aisle, rose to thank the speaker. Rubbing his hands together in gleeful anticipation of a blow against Neyman by Fisher, he declared: “I am very glad Professor Fisher is present, as it is his work that Dr Neyman has accepted and incorporated. … I am not at all sure that the ‘confidence’ is not a confidence trick” (p. 132). Bowley was to be disappointed. When it was Fisher’s turn, he was full of praise: “Dr Neyman … claimed to have generalized the argument of fiducial probability, and he had every reason to be proud of the line of argument he had developed for its perfect clarity” (Fisher 1934c, p. 138). Caveats were to come later (Section 5.7). For now, Egon was relieved:
Fisher had on the whole approved of what Neyman had said. If the impetuous Pole had not been able to make peace between the second and third floors of University College, he had managed at least to maintain a friendly foot on each!
(C. Reid 1998, p. 119)
CIs, Tests, and Severity.
I’m always mystified when people say they find P-values utterly perplexing while they regularly consume polling results in terms of confidence limits. You could substitute one for the other.
Suppose that 60% of 100 voters randomly selected from a population U claim to favor candidate Fisher. An estimate of the proportion of the population who favor Fisher, θ, at least at this point in time, is typically given by means of confidence limits. A 95% confidence interval for θ is θ̂ ± 1.96σ_θ̂, where θ̂ is the observed proportion and σ_θ̂ = √[θ(1 − θ)/n], which we estimate by plugging in θ̂ for θ to get σ̂_θ̂ = √[θ̂(1 − θ̂)/n] ≈ 0.049. The 95% CI limits for θ are 0.6 ± 0.09, using the Normal approximation. The lower limit is 0.51 and the upper limit is 0.69. Often, 0.09 is reported as the margin of error. We could just as well have asked, having observed θ̂ = 0.6,
what value of θ would 0.6 be statistically significantly greater than at the 0.025 level, and what value of θ would 0.6 be statistically significantly less than at the 0.025 level?
The two answers would yield 0.51 and 0.69, respectively. So infer θ > 0.51 and infer θ < 0.69 (against their denials), each at level 0.025, for a combined error probability of 0.05.
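A minimal computational sketch of the interval and the duality (my illustration; the 1.96 cutoff and plug-in standard error follow the Normal approximation above, and the text’s 0.09 margin reflects rounding):

```python
from math import sqrt

n, theta_hat = 100, 0.60                    # 60 of 100 sampled voters favor Fisher
se = sqrt(theta_hat * (1 - theta_hat) / n)  # plug-in standard error, ~0.049
margin = 1.96 * se                          # margin of error, ~0.096 (~0.09 rounded)
lower, upper = theta_hat - margin, theta_hat + margin
print(f"95% CI: [{lower:.3f}, {upper:.3f}]")  # ~[0.504, 0.696]
# Duality with tests: 0.6 is statistically significantly greater than any
# theta below `lower`, and significantly less than any theta above `upper`,
# each at the 0.025 level.
```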
Not only is there a duality between confidence interval estimation and tests: they were developed by Jerzy Neyman at the same time he was developing tests! The 1934 paper in the opening to this tour builds on Fisher’s fiducial intervals of 1930, but Neyman had been lecturing on the idea in Warsaw for a few years already. Providing upper and lower confidence limits shows the range of plausible values for the parameter and avoids an “up/down” dichotomous tendency of some users of tests. Yet, for some reason, CIs are still often used in a dichotomous manner: rejecting μ values excluded from the interval, accepting (as plausible or the like) those included. There’s the tendency, as well, to fix the confidence level at a single 1 − α, usually 0.9, 0.95, or 0.99. Finally, there’s the adherence to a performance rationale: the estimation method will cover the true θ 95% of the time in a series of uses. We will want a much more nuanced, inferential construal of CIs. We take some first steps toward remedying these shortcomings by relating confidence limits to tests and to severity.
To make these connections simply, return to our test T+: an IID sample from a Normal distribution, H₀: μ ≤ μ₀ against H₁: μ > μ₀. In a CI estimation procedure, an observed statistic is used to set an upper or lower (one-sided) bound, or both upper and lower (two-sided) bounds, for parameter μ. Good and best properties of tests go over into good or best properties of corresponding confidence intervals. In particular, the uniformly most powerful (UMP) test T+ corresponds to a uniformly most accurate lower confidence bound (see Lehmann and Romano 2005, p. 72). The (1 − α) uniformly most accurate (UMA) lower confidence bound for μ, which I write as μ̂₁₋α, corresponding to test T+ is

Infer: μ > X̄ − c_α(σ/√n),

where X̄ is the sample mean, and the area to the right of c_α under the standard Normal distribution is α. That is, Pr(Z ≥ c_α) = α, where Z is the standard Normal statistic. Here are some useful approximate values for c_α:
α:    0.5   0.16   0.05   0.025   0.02   0.005   0.001
c_α:  0     1      1.65   1.96    2      2.5     3
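As a minimal sketch (my illustration, using the approximate c_α values above and hypothetical data), the lower bound is computed as:

```python
from math import sqrt

# Approximate standard Normal upper-tail cutoffs from the table above.
C_ALPHA = {0.5: 0, 0.16: 1, 0.05: 1.65, 0.025: 1.96, 0.02: 2, 0.005: 2.5, 0.001: 3}

def uma_lower_bound(x_bar, sigma, n, alpha=0.025):
    """(1 - alpha) lower confidence bound for mu corresponding to test T+:
    infer mu > x_bar - c_alpha * sigma / sqrt(n)."""
    return x_bar - C_ALPHA[alpha] * sigma / sqrt(n)

# Hypothetical numbers: x_bar = 152, sigma = 10, n = 100.
print(uma_lower_bound(152, 10, 100))  # 0.975 lower bound ≈ 150.04
```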
The Duality
“Infer: μ > X̄ − c_α(σ/√n)” alludes to the rule for inferring; it is the CI estimator. Substituting the observed mean x̄₀ for X̄ yields the estimate μ̂ = x̄₀ − c_α(σ/√n). Here are some abbreviations, alluding throughout to our example of a UMA estimator: