
µ > 0.05 has a higher confidence level (0.93) than does µ > 0.15 (0.7),3 even though the point hypothesis µ = 0.05 is less likely than µ = 0.15 (the latter is closer to the observed mean than is the former). Each point in the lower CI corresponds to a different lower bound, each associated with a different confidence level and a corresponding severity assessment. That’s how to distinguish them.

  Second, there’s an equivocation, or at least a potential equivocation, in Cumming’s assertion “that for [2.5%] of replications the [lower limit] will exceed the true value” (Cumming 2012, p. 112, replacing 5% with 2.5%). This is not a true claim if “lower limit” is replaced by a particular lower limit, x̄₀ − 1.96σ/√n; it holds only for the generic lower limit X̄ − 1.96σ/√n. That is, we can’t say µ exceeds zero 2.5% of the time, which would be to assign a probability of 0.975 to µ > 0. Yet this misinterpretation of CIs is legion, as we’ll see in a historical battle about fiducial intervals (Section 5.8).
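  To see the equivocation concretely, here is a minimal simulation sketch (the true µ, σ, and n are hypothetical choices for illustration, not numbers from the text): the generic lower limit falls above the true µ in about 2.5% of replications, but any one realized limit is a fixed number about which no such probability statement holds.

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, n = 0.10, 0.5, 100      # hypothetical true mean, known sd, sample size
reps = 100_000

# Generic lower limit: X-bar - 1.96*sigma/sqrt(n), recomputed each replication.
xbars = rng.normal(mu, sigma / np.sqrt(n), size=reps)
lower = xbars - 1.96 * sigma / np.sqrt(n)

# The *procedure* errs (lower limit exceeds the true mu) in ~2.5% of replications:
print(np.mean(lower > mu))          # ~0.025

# But a *particular* realized limit is just a number; "mu > lower[0]" is
# simply true or false, not an event with probability 0.975:
print(lower[0], mu > lower[0])
```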

  4.4 Do P-Values Exaggerate the Evidence?

  “Significance levels overstate the evidence against the null hypothesis,” is a line you may often hear. Your first question is:

  What do you mean by overstating the evidence against a hypothesis?

  Several (honest) answers are possible. Here is one possibility:

  What I mean is that when I put a lump of prior weight π0 of 1/2 on a point null H0 (or a very small interval around it), the P-value is smaller than my Bayesian posterior probability on H0.

  More generally, the “P-values exaggerate” criticism typically boils down to showing that if inference is appraised via one of the probabilisms – Bayesian posteriors, Bayes factors, or likelihood ratios – the evidence against the null (or against the null and in favor of some alternative) isn’t as big as 1 − P.

  You might react by observing that: (a) P-values are not intended as posteriors in H0 (or Bayes ratios, likelihood ratios) but rather are used to determine if there’s an indication of discrepancy from, or inconsistency with, H0. This might only mean it’s worth getting more data to probe for a real effect. It’s not a degree of belief or comparative strength of support to walk away with. (b) Thus there’s no reason to suppose a P-value should match numbers computed in very different accounts, that differ among themselves, and are measuring entirely different things. Stephen Senn gives an analogy with “height and stones”:

  … [S]ome Bayesians in criticizing P-values seem to think that it is appropriate to use a threshold for significance of 0.95 of the probability of the alternative hypothesis being true. This makes no more sense than, in moving from a minimum height standard (say) for recruiting police officers to a minimum weight standard, declaring that since it was previously 6 foot it must now be 6 stone.

  (Senn 2001b, p. 202)

  To top off your rejoinder, you might ask: (c) Why assume that “the” or even “a” correct measure of evidence (relevant for scrutinizing the P-value) is one of the probabilist ones?

  All such retorts are valid, and we’ll want to explore how they play out here. Yet, I want to push beyond them. Let’s be open to the possibility that evidential measures from very different accounts can be used to scrutinize each other.

  Getting Beyond “I’m Rubber and You’re Glue”.

  The danger in critiquing statistical method X from the standpoint of the goals and measures of a distinct school Y is that of falling into begging the question. If the P-value is exaggerating evidence against a null, meaning it seems too small from the perspective of school Y, then Y’s numbers are too big, or just irrelevant, from the perspective of school X. Whatever you say about me bounces off and sticks to you. This is a genuine worry, but it’s not fatal. The goal of this journey is to identify minimal theses about “bad evidence, no test (BENT)” that enable some degree of scrutiny of any statistical inference account – at least on the meta-level. Why assume all schools of statistical inference embrace the minimum severity principle? I don’t, and they don’t. But by identifying when methods violate severity, we can pull back the veil on at least one source of disagreement behind the battles.

  Thus, in tackling this latest canard, let’s resist depicting the critics as committing a gross blunder of confusing a P-value with a posterior probability in a null. We resist, as well, merely denying we care about their measure of support. I say we should look at exactly what the critics are on about. When we do, we will have gleaned some short-cuts for grasping a plethora of critical debates. We may even wind up with new respect for what a P-value, the least popular girl in the class, really does.

  To visit the core arguments, we travel to 1987, to papers by J. Berger and Sellke, and Casella and R. Berger. These, in turn, are based on a handful of older ones (Cox 1977, E, L, & S 1963, Pratt 1965), and current discussions invariably revert back to them. Our struggles through the quicksand of Excursion 3, Tour II, are about to pay large dividends.

  J. Berger and Sellke, and Casella and R. Berger.

  Berger and Sellke (1987a) make out the conflict between P-values and Bayesian posteriors by considering the two-sided test of the Normal mean, H0: μ = 0 vs. H1: μ ≠ 0. “Suppose that X = (X1, …, Xn), where the Xi are IID N(μ, σ²), σ² known” (p. 112). Then the test statistic is z = √n|x̄ − μ0|/σ, with μ0 = 0 here, and the P-value will be twice the P-value of the corresponding one-sided test.

  Starting with a lump of prior, generally 0.5, on the point hypothesis H0, they find the posterior probability in H0 is larger than the P-value for a variety of different priors on the alternative. However, the result depends entirely on how the remaining 0.5 is allocated or smeared over the alternative (a move dubbed spike and smear). Using what they call a Jeffreys-type prior, the 0.5 is spread out over the alternative parameter values as if the parameter is itself distributed N(µ0, σ). Now Harold Jeffreys recommends the lump prior only to capture cases where a special value of a parameter is deemed plausible, for instance, the GTR deflection effect λ = 1.75″, after about 1960. The rationale is to avoid a 0 prior on H0 and enable it to receive a reasonable posterior probability.

  By subtitling their paper “The irreconcilability of P-values and evidence,” Berger and Sellke imply that if P-values disagree with posterior assessments, they can’t be measures of evidence at all. Casella and R. Berger (1987) retort that “reconciling” is at hand, if you move away from the lump prior. So let’s see how this unfolds. I assume throughout, as do the critics, that the P-values are “audited,” so that neither selection effects nor violated model assumptions are in question at this stage. I see no other way to engage their arguments.

  Table 4.1 gives the values of Pr(H0 | x). We see that we would declare no evidence against the null, and even evidence for it (to the degree indicated by the posterior), whenever d(x) fails to reach a 2.5 or 3 standard error difference. With n = 50, “one can classically ‘reject H0 at significance level p = 0.05,’ although Pr(H0 | x) = 0.52 (which would actually indicate that the evidence favors H0)” (J. Berger and Sellke 1987, p. 113).

  Table 4.1 Pr(H0 | x) for Jeffreys-type prior

  P (one-sided)   zα       n = 10   n = 20   n = 50   n = 100   n = 1000
  0.05            1.645    0.47     0.56     0.65     0.72      0.89
  0.025           1.960    0.37     0.42     0.52     0.60      0.82
  0.005           2.576    0.14     0.16     0.22     0.27      0.53
  0.0005          3.291    0.024    0.026    0.034    0.045     0.124

  (From Table 1, J. Berger and T. Sellke (1987), p. 113, using the one-sided P-value)
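  The entries can be reproduced from the standard spike-and-smear calculation: with π0 = 1/2 on H0 and the other half smeared over the alternative with prior standard deviation σ (the Jeffreys-type choice), the Bayes factor in favor of H0 given z is B01 = √(1 + n) exp(−z²n/(2(1 + n))), so Pr(H0 | x) = B01/(1 + B01). A minimal sketch (my code, not the authors’; the helper name post_null is mine):

```python
from math import exp, sqrt

def post_null(z: float, n: int, pi0: float = 0.5) -> float:
    """Posterior Pr(H0 | x) for H0: mu = mu0 vs. H1: mu != mu0, with a
    'spike' pi0 on H0 and the rest smeared as mu ~ N(mu0, sigma^2).
    b01 is the Bayes factor for H0 from the Normal marginal likelihoods."""
    b01 = sqrt(1 + n) * exp(-z**2 * n / (2 * (1 + n)))
    return 1.0 / (1.0 + (1 - pi0) / pi0 / b01)

for z in (1.645, 1.960, 2.576, 3.291):
    print(z, [round(post_null(z, n), 3) for n in (10, 20, 50, 100, 1000)])
```

  Run as-is, this agrees with Table 4.1 up to rounding differences of a couple of hundredths (e.g., 0.52 at z = 1.96, n = 50; 0.82 at n = 1000).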

  If n = 1000, a result statistically significant at the 0.05 level results in the posterior probability to μ = 0 going up from 0.5 (the lump prior) to 0.82! From their Bayesian perspective, this suggests P-values are exaggerating evidence against H0. Error statistical testers, on the other hand, balk at the fact that using the recommended priors allows statistically significant results to be interpreted as no evidence against H0 – or even evidence for it. They point out that 0 is excluded from the two-sided confidence interval at level 0.95. Although a posterior probability doesn’t have an error probability attached, a tester can evaluate the error probability credentials of these inferences. Here we’d be concerned with a Type II error: failing to find evidence against the null, and providing a fairly high posterior for it, when it’s false (Souvenir I).

  Let’s use a less extreme example where we have some numbers handy: our water-plant accident. We had σ = 10, n = 100, leading to the nice (σ/√n) value of 1. Here it would be two-sided, to match their example: H0: μ = 150 vs. H1: μ ≠ 150. Look at the second entry of the 100 column, the posterior when zα = 1.96. With the Jeffreys prior, perhaps championed by the water coolant company, J. Berger and Sellke assign a posterior of 0.6 to H0: μ = 150 degrees when a mean temperature of 152 (151.96) degrees is observed – reporting decent evidence the cooling mechanism is working just fine. How often would this occur even if the actual underlying mean temperature is, say, 151 degrees? With a two-sided test, cutting off 2 standard errors on either side, we’d reject whenever either x̄ ≥ 152 or x̄ ≤ 148. The probability of the second is negligible under µ = 151, so the probability we want is Pr(X̄ < 152; µ = 151) = Pr(Z < 1) ≈ 0.84. The probability of declaring evidence for 150 degrees (with posterior of 0.6 to H0), even if the true mean temperature is actually 151 degrees, is around 0.84; 84% of the time they erroneously fail to ring the alarm, and would boost their probability of μ = 150 from 0.5 to 0.6. Thus, from our minimal severity principle, the statistically significant result can’t even be taken as evidence for compliance with 151 degrees, let alone as evidence for the null of 150 (Table 3.1).
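  That error probability is a one-line computation; a quick sketch with the section’s numbers (σ = 10, n = 100, so σ/√n = 1):

```python
from scipy.stats import norm

mu0, mu_alt, se = 150.0, 151.0, 1.0    # se = sigma/sqrt(n) = 10/10
lo, hi = mu0 - 2 * se, mu0 + 2 * se    # reject when xbar <= 148 or xbar >= 152

# Probability of *failing* to reject (and so reporting "evidence for" 150)
# when the true mean temperature is 151 degrees:
beta = norm.cdf(hi, loc=mu_alt, scale=se) - norm.cdf(lo, loc=mu_alt, scale=se)
print(round(beta, 3))                  # ~0.84
```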

  Is this a problem for them? It depends what you think of that prior. The N-P test, of course, does not use a prior, although, as noted earlier, one needn’t rule out a frequentist prior on mean water temperature after an accident (Section 3.2). For now our goal is making out the criticism.

  Jeffreys–Lindley “Paradox” or Bayes/Fisher Disagreement

  But how, intuitively, does it happen that a statistically significant result corresponds to a Bayes boost for H0? Go back to J. Berger and Sellke’s example of Normal testing of H0: μ = 0 vs. H1: μ ≠ 0. Some sample mean will be close enough to 0 to increase the posterior for H0. By choosing a sufficiently large n, even a statistically significant result can correspond to large posteriors on H0. This is the Jeffreys–Lindley “paradox,” which some more aptly call the Bayes/Fisher disagreement. Lindley’s famous result dealt with just this example, two-sided Normal testing with known variance. With a lump given to the point null, and the rest appropriately spread over the alternative, an n can be found such that an α-significant result corresponds to Pr(H0 | x) = (1 − α)! We can see, by extending Table 4.1 to arbitrarily large n, that we can get a posterior for the null of 0.95 when the (two-sided) P-value is 0.05. Many say you should decrease the required P-value for significance as n increases; Cox and Hinkley (1974, p. 397) provide formulas to achieve this and avoid the mismatch. There’s nothing in N-P or Fisherian theory to oppose this. I won’t do that here, as I want to make out the criticism. We need only ensure that the interpretation takes account of the (obvious) fact that, with a fixed P-value and increasing n, the test is more and more sensitive to smaller and smaller discrepancies. Using a smaller plate at the French restaurant may make the portion appear bigger, but, Jackie Mason notwithstanding, knowing the size of the plate, I can see there’s not much there.
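  The “paradox” is easy to exhibit numerically. A sketch reusing the spike-and-smear posterior from the Table 4.1 snippet (post_null is my helper name, not the authors’): hold the result exactly at the two-sided 0.05 cutoff (z = 1.96) and let n grow.

```python
from math import exp, sqrt

def post_null(z, n, pi0=0.5):
    # Spike-and-smear posterior Pr(H0 | x), as in the Table 4.1 sketch.
    b01 = sqrt(1 + n) * exp(-z**2 * n / (2 * (1 + n)))
    return 1.0 / (1.0 + (1 - pi0) / pi0 / b01)

# A result fixed at the two-sided 0.05 level becomes evidence FOR the null
# as n grows: the posterior for H0 climbs past 0.95.
for n in (100, 10_000, 100_000, 1_000_000):
    print(n, round(post_null(1.96, n), 3))
```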

  Why assign the lump of ½ as prior to the point null? “The choice of π0 = 1/2 has obvious intuitive appeal in scientific investigations as being ‘objective’” say J. Berger and Sellke (1987, p. 115). But is it? One starts by making H0 and H1 equally probable, then the 0.5 accorded to H1 is spread out over all the values in H1: “The net result is that all values of [μ] are far from being equally likely” (Senn 2015a). Any small group of μ values in H1 gets a tiny prior. David Cox describes how it happens:

  … if a sample, say at about the 5% level of significance, is achieved, then either H0 is true or some alternative in a band of order 1/√n; the latter possibility has, as n → ∞, prior probability of order 1/√n and hence at a fixed level of significance the posterior probabilities shift in favour of H0 as n increases.

  (Cox 1977, p. 59)
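  Cox’s point is quick to check. A sketch under the N(0, σ²) smear (my numbers for illustration): the prior mass on alternatives within one standard error (σ/√n) of the null shrinks like 1/√n.

```python
from scipy.stats import norm

sigma = 1.0
for n in (10, 100, 1_000, 10_000):
    se = sigma / n**0.5
    # Prior mass the N(0, sigma^2) smear puts on the band of alternatives
    # within one standard error of the null; it shrinks like 1/sqrt(n).
    band = norm.cdf(se, scale=sigma) - norm.cdf(-se, scale=sigma)
    print(n, round(band, 4))
```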

  What justifies the lump prior of 0.5?

  A Dialogue at the Water Plant Accident

  EPA Rep: The mean temperature of the water was found statistically significantly higher than 150 degrees at the 0.025 level.

  Spiked Prior Rep: This even strengthens my belief the water temperature’s no different from 150. If I update the prior of 0.5 that I give to the null hypothesis, my posterior for H0 is still 0.6; it’s not 0.025 or 0.05, that’s for sure.

  EPA Rep: Why do you assign such a high prior probability to H0?

  Spiked Prior Rep: If I gave H0 a value lower than 0.5, then, if there’s evidence to reject H0, at most I would be claiming an improbable hypothesis has become more improbable.

  [W]ho, after all, would be convinced by the statement ‘I conducted a Bayesian test of H0, assigning prior probability 0.1 to H0, and my conclusion is that H0 has posterior probability 0.05 and should be rejected?’

  (J. Berger and Sellke 1987, p. 115)

  This quote from J. Berger and Sellke is peculiar. They go on to add: “We emphasize this obvious point because some react to the Bayesian–classical conflict by attempting to argue that [prior] π0 should be made small in the Bayesian analysis so as to force agreement” (ibid.). We should not force agreement. But it’s scarcely an obvious justification for a lump of prior on the null H0 – one which results in a low capability to detect discrepancies – that it ensures, if they do reject H0, there will be a meaningful drop in its probability. Let’s listen to the pushback from Casella and R. Berger (1987a), the Berger being Roger now (I use initials to distinguish them).

  The Cult of the Holy Spike and Slab.

  Casella and R. Berger (1987a) charge that the problem is not P-values but the high prior, and that “concentrating mass on the point null hypothesis is biasing the prior in favor of H0 as much as possible” (p. 111), whether in one- or two-sided tests. According to them:

  The testing of a point null hypothesis is one of the most misused statistical procedures. In particular, in the location parameter problem, the point null hypothesis is more the mathematical convenience than the statistical method of choice.

  (ibid., p. 106)

  Most of the time “there is a direction of interest in many experiments, and saddling an experimenter with a two-sided test would not be appropriate” (ibid.). The “cult of the holy spike” is an expression I owe to Sander Greenland (personal communication).

  By contrast, we can reconcile P-values and posteriors in one-sided tests if we use more diffuse priors (e.g., Cox and Hinkley 1974, Jeffreys 1939/1961, Pratt 1965). In fact, Casella and Berger show that, for sensible priors in that case, the P-value is at least as big as the minimum value of the posterior probability on the null, again contradicting claims that P-values exaggerate the evidence.4

  J. Berger and Sellke (1987) adhere to the spikey priors, but, following E, L, & S (1963), they’re keen to show that P-values exaggerate evidence even in cases less extreme than the Jeffreys posteriors in Table 4.1. Consider the likelihood ratio of the null hypothesis over the hypothesis most generous to the alternative, they say. This is the point alternative with maximum likelihood, Hmax – arrived at by setting μ equal to its maximum likelihood estimate, x̄. Through their tunnel, it’s disturbing that even using this likelihood ratio, the posterior for H0 is still larger than 0.05 – when they give a 0.5 spike to both H0 and Hmax. Some recent authors see this as the key to explaining today’s lack of replication of significant results. Through the testing tunnel, things look different (Section 4.5).
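  The arithmetic behind that bound is short. For the Normal case the likelihood ratio of H0 to Hmax is exp(−z²/2), so with 0.5 spikes on each, a just-significant z = 1.96 still leaves a posterior of about 0.13 on H0. A sketch (variable names are mine):

```python
from math import exp

z = 1.96                                 # just significant, two-sided 0.05
lr_null_vs_max = exp(-z**2 / 2)          # L(mu0) / L(x_bar): most generous to H1
post_h0 = lr_null_vs_max / (1 + lr_null_vs_max)   # 0.5 spikes on H0 and Hmax
print(round(lr_null_vs_max, 3), round(post_h0, 3))  # ~0.146, ~0.128 > 0.05
```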

  Why Blame Us Because You Can’t Agree on Your Posterior?

  Stephen Senn argues that the reason for the wide range of variation of the posterior is the fact that it depends radically on the choice of alternative to the null and its prior.5 According to Senn, “… the reason that Bayesians can regard P-values as overstating the evidence against the null is simply a reflection of the fact that Bayesians can disagree sharply with each other” (Senn 2002, p. 2442). Senn illustrates how “two Bayesians having the same prior probability that a hypothesis is true and having seen the same data can come to radically different conclusions because they differ regarding the alternative hypothesis” (Senn 2001b, p. 195). One of them views the problem as a one-sided test and gets a posterior on the null that matches the P-value; a second chooses a Jeffreys-type prior in a two-sided test, and winds up with a posterior to the null of 1 − p!

  Here’s a recap of Senn’s example (ibid., p. 200): Two scientists A and B are testing a new drug to establish its treatment effect, δ, where positive values of δ are good. Scientist A has a vague prior whereas B, while sharing the same probability that δ is positive, is less pessimistic than A regarding the effect of the drug. If it’s not useful, B believes it will have no effect. They “share the same belief that the drug has a positive effect. Given that it has a positive effect, they share the same belief regarding its effect. … They differ only in belief as to how harmful it might be.” A clinical trial yields a difference of 1.65 standard units, a one-sided P-value of 0.05. The result is that A gives 1/20 posterior probability to H0: the drug does not have a positive effect, while B gives a probability of 19/20 to H0. B is using the two-sided test with a lump of prior on the null (H0: μ = 0 vs. H1: μ ≠ 0), while A is using a one-sided test T+ (H0: μ ≤ 0 vs. H1: μ > 0).
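  A numerical sketch of the divergence (the prior choices and sample sizes are my illustrations, not Senn’s exact ones): with a flat improper prior, A’s posterior Pr(μ ≤ 0 | x) simply equals the one-sided P-value, while B’s lump-plus-smear prior drives the posterior for the point null toward 0.95 as n grows.

```python
import numpy as np
from scipy.stats import norm

z, sigma = 1.645, 1.0                     # one-sided P-value of 0.05

# Scientist A: flat prior on mu, so Pr(mu <= 0 | x) equals the P-value.
print("A:", round(norm.sf(z), 3))         # 0.05

# Scientist B: lump 0.5 at mu = 0, the rest half-normal over mu > 0 (scale sigma).
# For Normal data the marginal under B's alternative has a closed form.
def post_b(n):
    se = sigma / np.sqrt(n)               # standard error of the mean
    xbar = z * se                         # observed mean, held at z = 1.645
    like0 = norm.pdf(xbar, 0, se)         # likelihood at the point null
    v = 1 / (1 / se**2 + 1 / sigma**2)    # posterior variance of mu under the smear
    mu_p = v * xbar / se**2               # posterior mean of mu under the smear
    m1 = 2 * norm.pdf(xbar, 0, np.hypot(se, sigma)) * norm.cdf(mu_p / np.sqrt(v))
    return like0 / (like0 + m1)           # Pr(H0 | x) with the 0.5 lump

for n in (16, 10_000, 100_000):
    print("B, n =", n, ":", round(post_b(n), 3))
# B's posterior for "no effect" climbs past 0.95 while A's stays at 0.05.
```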
