Statistical Inference as Severe Testing

by Deborah G Mayo


  (Johnson 2013b, pp. 1720–1)

  According to Johnson, Bayesians reject hypotheses based on a sufficiently high posterior and “the alternative hypothesis is accepted if BF10 > k” (ibid., p. 1726, k for his γ). A weaker stance might stop with the comparative report Lik(μmax)/Lik(μ0). It’s good that he supplies a falsification rule.

  Johnson views his method as showing how to specify an alternative hypothesis – he calls it the “implicitly tested” alternative (ibid., p. 1739) – when H0 is rejected. H0 and H1 are each given a 0.5 prior. Unlike N-P, the test does not exhaust the parameter space; it’s just two points.

  [D]efining a Bayes factor requires the specification of both a null hypothesis and an alternative hypothesis, and in many circumstances there is no objective mechanism for defining an alternative hypothesis. The definition of the alternative hypothesis therefore involves an element of subjectivity and it is for this reason that scientists generally eschew the Bayesian approach toward hypothesis testing.

  (Johnson 2013a, p. 19313)

  He’s right that comparative inference, as with Bayes factors, leaves open a wide latitude of appraisals by dint of the alternative chosen, and any associated priors.

  In his attempt to rein in that choice, Johnson offers an illuminating way to relate the Bayes factor and the standard cut-offs for rejection, at least in UMP tests such as this. (He even calls it a uniformly most powerful Bayesian test!) We focus on the cases where we just reach statistical significance at various levels. Setting k as the Bayes factor you want, you can obtain the corresponding cut-off for rejection by computing √(2 log k): this matches the zα corresponding to an N-P, UMP one-sided test. The UMP test T+ is of the form: Reject H0 iff X̄ ≥ x̄α, where x̄α = μ0 + zα σ/√n, which is zα σ/√n for the case μ0 = 0. Thus he gets (2013b, p. 1730) the implicitly tested alternative μ1 = zα σ/√n.

  Since this is the alternative under which the observed data, which we are taking to be the just statistically significant x̄ = zα σ/√n, have maximal probability, write it as Hmax and μ1 as μmax. The computations are rather neat; see Note 10. (The last row of Table 4.3 gives an equivalent form.) The reason the LR in favor of the (maximal) alternative gets bigger and bigger is that Pr(x; H0) is getting smaller and smaller with increasingly large x values.

  Table 4.3 V. Johnson’s implicit alternative analysis for T+: H0: μ ≤ 0 vs. H1: μ > 0

  P-value (one-sided)   zα            Lik(μmax)/Lik(μ0)   μmax              Pr(H0|x)    Pr(Hmax|x)
  0.05                  1.65          3.87                1.65σ/√n          0.2         0.8
  0.025                 1.96          6.84                1.96σ/√n          0.128       0.87
  0.01                  2.33          15                  2.33σ/√n          0.06        0.94
  0.005                 2.58          28                  2.58σ/√n          0.03        0.97
  0.0005                3.29          227                 3.3σ/√n           0.004       0.996
  General               √(2 log k)    k                   √(2 log k)σ/√n    1/(1 + k)   k/(1 + k)
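  The arithmetic behind Table 4.3 can be reproduced in a few lines. The sketch below is mine, not the book’s; it assumes the normal test T+ with known σ, takes the just α-significant outcome as observed, and uses the 0.5/0.5 priors on H0 and Hmax, so that the Bayes factor is exp(zα²/2).

```python
# A sketch (not from the text): recompute the Table 4.3 entries for test T+
# from the one-sided cut-off z_alpha alone, assuming a normal mean with known
# sigma, the just-significant outcome observed, and 0.5/0.5 priors on H0 and Hmax.
import numpy as np
from scipy.stats import norm

alphas = [0.05, 0.025, 0.01, 0.005, 0.0005]
print(f"{'alpha':>7} {'z_alpha':>8} {'BF10=exp(z^2/2)':>16} {'Pr(H0|x)':>9} {'Pr(Hmax|x)':>10}")
for a in alphas:
    z = norm.isf(a)                 # one-sided cut-off z_alpha
    bf10 = np.exp(z ** 2 / 2)       # Lik(mu_max)/Lik(mu_0)
    post_h0 = 1 / (1 + bf10)        # posterior on H0 under equal priors
    post_hmax = bf10 / (1 + bf10)   # posterior on Hmax; mu_max itself is z*sigma/sqrt(n)
    print(f"{a:>7} {z:8.2f} {bf10:16.2f} {post_h0:9.3f} {post_hmax:10.3f}")
```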

  Johnson’s approach is intended to “provide a new form of default, nonsubjective Bayesian tests” (2013b, p. 1719), and he extends it to a number of other cases as well. Given it has the same rejection region as a UMP error statistical test, he suggests it “can be used to translate the results of classical significance tests into Bayes factors and posterior model probabilities” (ibid.). To bring them into line with the BF, however, you’ll need a smaller α level. Johnson recommends levels more like 0.01 or 0.005. Is there anything lost in translation?

  There’s no doubt that if you reach a smaller significance level in the same test, the discrepancy you are entitled to infer is larger. You’ve made the hurdle for rejection higher: any observed mean that makes it over must be larger. It also means that more will fail to make it over the hurdle: the Type II error probability increases. Using the 1.96 cut-off, a discrepancy of 2.46 SE (add 0.5 SE to the cut-off), call it 2.5, will be detected 70% of the time (the Type II error is 0.3), whereas using a 2.6 cut-off has less than a 50% (0.46) chance of detecting a 2.5 discrepancy (a Type II error of 0.54!). Which is a better cut-off for rejection? The severe tester eschews rigid cut-offs. In setting up a test, she looks at the worst cases she can live with; post-data she reports the discrepancies well or poorly warranted at the attained levels. (Recall, discrepancy always refers to parameter values.) Johnson proposes to make up for the loss of power by increasing the sample size, but it’s not that simple. We know that as sample size increases, the discrepancy indicated by results that reach a given level of significance decreases. Still, you get a Bayes factor and a default posterior probability that you didn’t have with ordinary significance tests. What’s not to like?
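  The power figures in the last paragraph can be checked directly. Here is a minimal sketch (mine, not the author’s): discrepancies and cut-offs are in σ/√n units, and power is the probability of rejection when μ equals the discrepancy.

```python
# Sketch: power of test T+ against a 2.5 SE discrepancy for the two cut-offs above.
from scipy.stats import norm

discrepancy = 2.5                           # in sigma/sqrt(n) units
for cutoff in (1.96, 2.6):
    power = norm.sf(cutoff - discrepancy)   # Pr(X-bar exceeds cut-off; mu = discrepancy)
    print(f"cut-off {cutoff}: power = {power:.2f}, Type II error = {1 - power:.2f}")
# cut-off 1.96 -> power ~0.71 (the text's 0.70 uses the exact 2.46); cut-off 2.6 -> power ~0.46
```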

  We perform our two-part criticism, based on the minimal severity requirement. The procedure under the looking glass is: having obtained a statistically significant result, say at the 0.005 level, reject H0 in favor of Hmax: μ = μmax. Giving priors of 0.5 to both H0 and Hmax, you can report the posteriors. Clearly, (S-1) holds: Hmax accords with x̄ – it’s equal to it. Our worry is with (S-2). H0 is being rejected in favor of Hmax, but should we infer it? The severity associated with inferring that μ is as large as μmax is

  Pr(Z < zα; μ = μmax) = 0.5.

  This is our benchmark for poor evidence. So (S-2) doesn’t check out. You don’t have to use severity; just ask: what confidence level would permit the inference μ ≥ μmax? (Answer: 0.5.) Yet Johnson assigns Pr(Hmax|x) = 0.97. Hmax is comparatively more likely than H0 as x̄ moves further from 0 – but that doesn’t mean we’d want to infer there’s evidence for Hmax. If we add a column to Table 4.3 for SEV(μ ≥ μmax), it would be 0.5 all the way down!

  To have some numbers, in our example (H0: μ ≤ 0 vs. H1: μ > 0), σ = 1, n = 25, and the 0.005 cut-off is 2.58σ/√n = 0.51, round to 0.5. When a significance tester says the difference is statistically significantly greater than 0 at the 0.005 level, she isn’t saying anything as strong as “there is fairly good evidence that μ = 0.5.” Here it gets a posterior of 0.97. While the goal of the reform was to tamp down on “overstated evidence,” it appears to do just the opposite from a severe tester’s perspective.

  How can I say it’s lousy if it’s the maximally likely estimate? Because there is the variability of the estimator, and statistical inference must take this into account. It’s true that the error statistician’s inference isn’t the point alternative these critics want us to consider (Hmax), but they’re the ones raising the criticism of ostensive relevance to us, and we’re struggling in good faith to see what there might be in it. Surely to infer μ = 0.5 is to infer μ > 0.4. Our outcome of 0.5 is 0.5 standard error in excess of 0.4, resulting in SEV(μ > 0.4) = 0.7. Still rather poor. Equivalently, it is to form the 0.7 lower confidence limit (μ > 0.4).
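  A quick check of these severity numbers for the stated example (σ = 1, n = 25, observed mean 0.5); this is my own sketch, not code from the book:

```python
# Sketch: SEV(mu > mu1) = Pr(X-bar <= observed x-bar; mu = mu1) for test T+ with
# sigma = 1, n = 25, and the just-0.005-significant outcome x-bar = 0.5.
from math import sqrt
from scipy.stats import norm

sigma, n, xbar = 1.0, 25, 0.5
se = sigma / sqrt(n)                      # 0.2

def severity(mu1):
    return norm.cdf((xbar - mu1) / se)

print(round(severity(0.5), 2))   # 0.5  -- SEV(mu >= mu_max): the poor-evidence benchmark
print(round(severity(0.4), 2))   # 0.69 -- SEV(mu > 0.4), roughly the 0.7 in the text
```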

  Johnson (2013a, p. 19314) calls the 0.5 spikes equipoise, but what happened to the parameter values in between H0 and Hmax? Do they get a prior of 0? To be clear, it may be desirable or at least innocuous for a significance tester to require smaller P-values. What is not desirable or innocuous is basing the altered specification on a BF appraisal, if in fact it is an error statistical justification you’re after. Defenders of the argument may say they’re just showing the upper bound of evidence that can accrue, even if we imagine being as biased as possible against the null and for the alternative. But are they? A fair assessment, say Casella and R. Berger, wouldn’t have the spike prior on the null – yes, it’s still there. If you really want a match, why not use the frequentist matching priors for this case? (Prior 3 in Exhibit (vii).) The spiked prior still has a mismatch between BF and P-value.10 This is the topic of megateam battles (Benjamin et al. 2017 and Lakens et al. 2018).

  Exhibit (viii): Whether P-values Exaggerate Depends on Philosophy.

  When a group of authors holding rather different perspectives get together to examine a position, the upshot can take them out of their usual comfort zones. We need more of that. (See also the survey in Hurlbert and Lombardi 2009, and Haig 2016.) Here’s an exhibit from Greenland et al. (2016). They greet each member of a list of incorrect interpretations of P-values with “No!”, but then make this exciting remark:

  There are other interpretations of P values that are controversial, in that whether a categorical “No!” is warranted depends on one’s philosophy of statistics and the precise meaning given to the terms involved. The disputed claims deserve recognition if one wishes to avoid such controversy. …

  For example, it has been argued that P values overstate evidence against test hypotheses, based on directly comparing P values against certain quantities (likelihood ratios and Bayes factors) that play a central role as evidence measures in Bayesian analysis … Nonetheless, many other statisticians do not accept these quantities as gold standards, and instead point out that P values summarize crucial evidence needed to gauge the error rates of decisions based on statistical tests (even though they are far from sufficient for making those decisions). Thus, from this frequentist perspective, P values do not overstate evidence and may even be considered as measuring one aspect of evidence … with 1 − P measuring evidence against the model used to compute the P value.

  (p. 342)

  It’s erroneous to fault one statistical philosophy from the perspective of a philosophy with a different and incompatible conception of evidence or inference. The severity principle always evaluates a claim as against its denial within the framework set. In N-P tests, the frame is within a model, and the hypotheses exhaust the parameter space. Part of the problem may stem from supposing N-P tests infer a point alternative, and then seeking that point. Whether or not you agree with the error statistical form of inference, you can use the severity principle to get beyond this particular statistics battle.

  Souvenir R: The Severity Interpretation of Rejection (SIR)

  In Tour II you have visited the tribes who lament that P-values are sensitive to sample size (Section 4.3), and that they exaggerate the evidence against a null hypothesis (Sections 4.4, 4.5). We’ve seen that significance tests take sample size into account in order to critique the discrepancies indicated objectively. A researcher may choose to decrease the P-value as n increases, but there’s no problem in understanding that the same P-value reached with a larger sample size indicates fishing with a finer mesh. Surely we should not commit the fallacy exposed over 50 years ago.

  Here’s a summary of the severe tester’s interpretation (of a rejection), putting it in terms that seem most clear:

  SIR: The Severity Interpretation of a Rejection in test T+ (small P-value):

  (i): [Some discrepancy is indicated]: d(x0) is a good indication of µ > µ1 = µ0 + γ if there is a high probability of observing a less statistically significant difference than d(x0) if µ = µ0 + γ.

  N-P and Fisher tests officially give the case with γ = 0. In that case, what does a small P-value mean? It means the test very probably (1 − P) would have produced a result more in accord with H0, were H0 an adequate description of the data-generating process. So it indicates a discrepancy from H0, especially if I can bring it about fairly reliably. To avoid making mountains out of molehills, it’s good to give a second claim about the discrepancies that are not indicated:

  (ii): [I’m not that impressed]: d(x0) is a poor indication of µ > µ1 = µ0 + γ if there is a high probability of an even more statistically significant difference than d(x0) even if µ = µ0 + γ.
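  To see (i) and (ii) working together on the running example (μ0 = 0, σ = 1, n = 25, observed mean 0.5), here is a rough sketch of my own; the “good”/“poor” labels use illustrative thresholds (0.9 and 0.5) that are not the author’s, since the severe tester eschews rigid cut-offs:

```python
# Sketch: for candidate discrepancies gamma, compute the probability of a less
# statistically significant difference than the one observed, were mu = mu0 + gamma.
# High values indicate mu > mu0 + gamma well (SIR (i)); low values indicate it poorly (SIR (ii)).
from math import sqrt
from scipy.stats import norm

mu0, sigma, n, xbar = 0.0, 1.0, 25, 0.5
se = sigma / sqrt(n)

for gamma in (0.0, 0.1, 0.2, 0.3, 0.4, 0.5):
    sev = norm.cdf((xbar - (mu0 + gamma)) / se)
    label = "good indication" if sev >= 0.9 else ("poor indication" if sev <= 0.5 else "middling")
    print(f"mu > {mu0 + gamma:.1f}: {sev:.2f}  ({label})")
```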

  As for the exaggeration allegation, merely finding a single statistically significant difference, even if audited, is indeed weak: it’s an indication of some discrepancy from a null, a first step in the task of identifying a genuine effect. But a legitimate significance tester would never condone rejecting H0 in favor of alternatives that correspond to a low severity or confidence level such as 0.5. Stephen Senn sums it up: “Certainly there is much more to statistical analysis than P-values but they should be left alone rather than being deformed … to become second class Bayesian posterior probabilities” (Senn 2015a). Reformers should not be deformers.

  There is an urgency here. Not only do some reforms run afoul of the minimal severity requirement; supposing things are fixed by lowering P-values also ignores or downplays the main causes of non-replicability. According to Johnson:

  [I]t is important to note that this high rate of nonreproducibility is not the result of scientific misconduct, publication bias, file drawer biases, or flawed statistical designs; it is simply the consequence of using evidence thresholds that do not represent sufficiently strong evidence in favor of hypothesized effects.

  (2013a, p. 19316)

  This sanguine perspective sidesteps the worry about the key sources of spurious statistical inferences: biasing selection effects and violated assumptions, at all levels. (Fortunately, recent reforms admit this; Benjamin et al. 2017.) Catching such misdemeanors requires auditing, the topic of Tours III and IV of this Excursion.

  1 SEV(µ ≥ 0.1) with x̄ = 0.1 and n = 400 is computed by considering Pr(X̄ < 0.1; µ = 0.1). Standardizing yields z = √400 (0.1 − 0.1)/1 = 0. So SEV(µ ≥ 0.1) = 0.5!

  2 Let’s use this to illustrate the MM fallacy: Compare (i) n = 100 and (ii) n = 10,000 in the same test T+. With n = 100, 1SE = 0.1; with n = 10,000, 1SE = 0.01. The just 0.025 significant outcomes in the two tests are (i) x̄ = 0.2 and (ii) x̄ = 0.02. Consider the 0.93 lower confidence bound for each. Subtracting 1.5 SE from the outcome yields μ > 0.5(1/√n): (i) for n = 100, the inferred 0.93 lower estimate is μ > 0.5(1/10) = 0.05; (ii) for n = 10,000, the inferred 0.93 lower estimate is μ > 0.5(1/100) = 0.005. So a difference that is just statistically significant at the same level, 0.025, permits inferring μ > 0.05 when n = 100, but only µ > 0.005 when n = 10,000 (Section 3.7).
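  A tiny sketch (mine, not the book’s) reproducing the two lower bounds in this footnote:

```python
# Sketch: 0.93 lower confidence bounds (outcome minus 1.5 SE) for the just-0.025-significant
# outcomes in test T+ with sigma = 1, contrasting n = 100 with n = 10,000.
from math import sqrt

sigma = 1.0
for n, xbar in ((100, 0.2), (10_000, 0.02)):
    se = sigma / sqrt(n)
    print(f"n = {n}: infer mu > {xbar - 1.5 * se:.3f}")
# n = 100 -> mu > 0.050; n = 10000 -> mu > 0.005
```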

  3 Subtract 1.5 SE and 0.5 SE from x̄ = 0.2, respectively.

  4 Casella and R. Berger (1987b) argue, “We would be surprised if most researchers would place even a 10% prior probability on H0. We hope that the casual reader of Berger and Delampady realizes that the big discrepancies between P-values and P(H0|x) … are due to a large extent to the large value of [the prior of 0.5 to H0] that was used.” The most common uses of a point null, asserting the difference between means is 0, or that a regression coefficient is 0, merely describe a potentially interesting feature of the population, with no special prior believability. “Berger and Delampady admit …, P-values are reasonable measures of evidence when there is no a priori concentration of belief about H0” (ibid., p. 345). Thus, “the very argument that Berger and Delampady use to dismiss P-values can be turned around to argue for P-values” (ibid., p. 346).

  5 In defending spiked priors, Berger and Sellke move away from the importance of effect size. “Precise hypotheses … ideally relate to, say, some precise theory being tested. Of primary interest is whether the theory is right or wrong; the amount by which it is wrong may be of interest in developing alternative theories, but the initial question of interest is that modeled by the precise hypothesis test” (1987, p. 136).

  6 The spiked prior drops out, so the result is the same as a uniform prior on the null and alternative.

  7 Bernardo shocked his mentor in announcing that the Lindley paradox is really an indictment of the Bayesian computations: “Whether you call this a paradox or a disagreement, the fact that the Bayes factor for the null may be arbitrarily large for sufficiently large n, however relatively unlikely the data may be under H0, is … deeply disturbing” (Bernardo 2010, p. 59).

  8 In the special case, where there’s appreciable evidence for a special parameter, Senn argues that Jeffreys only required H1’s posterior probability to be greater than 0.5. One has, so to speak, used up the prior belief by using the spiked prior (Senn 2015a).

  9 The entries for the inverse are useful. This is adapted from Berger and Sellke (1987), Table 3.

  P-value (one-sided)   zα     Lik(μ0)/Lik(μmax)
  0.05                  1.65   0.258
  0.025                 1.96   0.146
  0.01                  2.33   0.067
  0.005                 2.58   0.036
  0.0005                3.29   0.0044

  10 Computations

  1. Suppose the outcome is just statistically significant at the α level: x̄ = μ0 + zα σ/√n (with μ0 = 0, x̄ = zα σ/√n).

  2. So the most likely alternative is μmax = x̄ = zα σ/√n.

  3. The ratio of the likelihood of the maximally likely alternative Hmax to the likelihood of H0 is:

  Lik(μmax)/Lik(μ0) = exp[n(x̄ − μ0)²/(2σ²)] = exp[zα²/2].

  This gives the Bayes factor: BF10. (BF01 would be exp[−z²/2].)

  4. Set Lik(x|Hmax)/Lik(x|H0) = k.

  5. So exp[z²/2] = k.

  Since the natural log (ln) and exp are inverses:

  log k = log(exp[z²/2]) = z²/2;

  2 log k = z², so √(2 log k) = z.
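  Going in the direction Johnson describes (pick the Bayes factor threshold k, then read off the cut-off), the correspondence can be checked numerically; this sketch is mine, not the book’s:

```python
# Sketch: from a Bayes factor threshold k, recover the z cut-off sqrt(2 ln k)
# and the corresponding one-sided significance level.
from math import log, sqrt
from scipy.stats import norm

for k in (3.87, 6.84, 28, 227):
    z = sqrt(2 * log(k))
    print(f"k = {k}: z cut-off = {z:.2f}, one-sided alpha = {norm.sf(z):.4f}")
# e.g. k = 28 gives z ~ 2.58 and alpha ~ 0.005, matching Table 4.3
```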

  Tour III

  Auditing: Biasing Selection Effects and Randomization

  This account of the rationale of induction is distinguished from others in that it has as its consequences two rules of inductive inference which are very frequently violated … The first of these is that the sample must be a random one … The other rule is that the character [to be studied] must not be determined by the character of the particular sample taken.

  (Peirce 1.95)

  The biggest source of handwringing about statistical inference these days boils down to the fact that it has become very easy to infer claims that have been subjected to insevere tests. High-powered computerized searches and data trolling permit sifting through reams of data, often collected by others, where in fact no responsible statistical assessments are warranted. “We’re more fooled by noise than ever before, and it’s because of a nasty phenomenon called ‘big data’. With big data, researchers have brought cherry picking to an industrial level” (Taleb 2013). Selection effects alter a method’s error probabilities, and yet a fundamental battle in the statistics wars revolves around their relevance (Section 2.4). We begin with selection effects, and our first stop is to listen in on a court case taking place.

 
