Moreover, its selection is within this family. But we know that all these LRMs are statistically inadequate! As with other purely comparative measures, there’s no falsification of models.
What if we start with the adequate model that the PR arrived at, the autoregressive model with a trend? In that case, the AIC ranks a model with the wrong number of trends at the very top. That is, it ranks a statistically inadequate model higher than the statistically adequate one. Moreover, the Akaike method for ranking isn’t assured of having decent error probabilities. When the Akaike ranking is translated into an N-P test comparing this pair of models, the Type I error probability is around 0.18, and no warning of the laxity is given. As noted, model selection methods don’t home in on models outside the initial family. By contrast, building the model through our M-S approach is intended to accomplish both tasks – building and checking – in one fell swoop.
Leading proponents of AIC, Burnham and Anderson (2014, p. 627), are quite critical of error probabilities, declaring “P values are not proper evidence as they violate the likelihood principle (Royall 1997).” This tells us their own account forfeits control of error probabilities. “Burnham and Anderson (2002) in their textbook on likelihood methods for assessing models warn against data dredging … But there is nothing in the evidential measures recommended by Burnham and Anderson” to pick up on this (Dienes 2008, p. 144). See also Spanos (2014).
M-S Tests and Predesignation.
Don’t statistical M-S tests go against the error statistician’s much-ballyhooed requirement that hypotheses be predesignated? The philosopher of science Rosenkrantz says yes:
[O]rthodox tests … show how to test underlying assumptions of randomness, independence and stationarity, where none of these was the predesignated object of the test (the “tested hypothesis”). And yet, astoundingly in the face of all this, orthodox statisticians are one in their condemnation of “shopping for significance,” picking out significant correlations in data post hoc, or “hunting for trends …”. It is little wonder that orthodox tests tend to be highly ambivalent on the matter of predesignation.
(Rosenkrantz 1977, pp. 204–5)
Are we hoisted by our own petards? No. This is another case where failing to disentangle a rule’s raison d’être leads to confusion. The aim of predesignation, as with the preference for novel data, is to avoid biasing selection effects in your primary statistical inference (see Tour III). The data are remodeled to ask a different question. Strictly speaking, our model assumptions are predesignated as soon as we propose a given model for statistical inference. These are the pigeonholes in the PR menu. It has never been a matter of the time – of who knew what, when – but a matter of avoiding erroneous interpretations of the data at hand. M-S tests in the error statistical methodology are deliberately designed to be independent of (or orthogonal to) the primary question at hand. The model assumptions, singly or in groups, arise as argumentative assumptions, ready to be falsified by criticism. In many cases, the inference is as close to a deductive falsification as could be wished.
Parametric tests of assumptions may themselves have assumptions, which is why judicious combinations of varied tests are called upon to ensure their overall error probabilities. Order matters: tests of the distribution, e.g., Normal, Binomial, or Poisson, assume IID, so one doesn’t start there. The inference in the case of an M-S test of assumptions is not a statistical inference to a generalization: it’s explaining given data, as with explaining a “known effect,” only keeping to the statistical categories of distribution, independence/dependence, and homogeneity/heterogeneity (Section 4.6). Rosenkrantz’s concerns pertain to the kind of pejorative hunting for variables to include in a substantive model. That’s always kept distinct from the task of M-S testing, including respecifying.
Our argument for a respecified model is a convergent argument: questionable conjectures along the way don’t bring down the tower (Section 1.2). Instead, problems ramify so that the specification finally deemed adequate has been sufficiently severely tested for the task at hand. The trends and perhaps the lags that are required to render the statistical model adequate generally cry out for a substantive explanation. It may well be that different statistical models are adequate for probing different questions.5 Violated assumptions are responsible for a good deal of non-replication, and yet this has gone largely unattended in current replication research.
Take-away of Excursion 4.
For a severe tester, a crucial part of a statistical method’s objectivity (Tour I) is registering how test specifications such as sample size (Tour II) and biasing selection effects (Tour III) alter its error-probing capacities. Testing assumptions (Tour IV) is also crucial to auditing. If a probabilist measure such as a Bayes factor is taken as a gold standard for critiquing error statistical tests, significance levels and other error probabilities appear to overstate evidence – at least on certain choices of priors. From the perspective of the severe tester, it can be just the reverse. Preregistered reports are promoted to advance replication by blocking selective reporting. Thus there is a tension between preregistration and probabilist accounts that downplay error probabilities, that declare them relevant only for long runs, or that view them as tantamount to considering hidden intentions. Moreover, in the interest of promoting Bayes factors, researchers who most deserve censure are thrown a handy life preserver. Violating the LP, using the sampling distribution for inferences with the data at hand, and the importance of error probabilities form an interconnected web of severe testing. They are necessary for every one of the requirements for objectivity.
1 The sufficient statistic in the negative Binomial case is N, the number of trials until the fourth success. In the Binomial case, it is X̅ (Cox and Mayo 2010, p. 286).
2 There is considerable discussion as to which involve pejorative “double use” of data, and which give adequate frequentist guarantees or calibrations (Ghosh et al. 2006, pp. 175–84; Bayarri and Berger 2004). But I find their rationale unclear.
3 The bootstrapped distribution is conditional on the observed x.
4 For each yₜ, form the squared residual. The sum of the squared residuals, σ̂² = (1/n)∑ₜ ûₜ², gives an estimate of σ² for the model.
The AIC score for each contender in the case of the LRM, with sample size n, is log(σ̂ᵢ²) + 2Kᵢ/n, where Kᵢ is the number of parameters in model i. The models are then ranked with the smallest being preferred. The log-likelihood is the goodness-of-fit measure which is traded against simplicity, but if the statistical model is misspecified, one is using the wrong measure of fit.
For a comparison of the AIC using these data, and a number of related model-selection measures, see Spanos (2010a). None of these points change using the unbiased variant of AIC.
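To make the normalized AIC score in note 4 concrete, here is a minimal Python sketch – my own illustration; the data, the simple trend regression, and the parameter count K are assumed for the example, not taken from Spanos’ data:

```python
import math

# Hypothetical data (for illustration only; not the data discussed in the text)
y = [2.1, 2.9, 3.7, 4.2, 5.1, 5.8, 7.0, 7.4, 8.3, 9.1]
t = list(range(1, len(y) + 1))
n = len(y)

# Ordinary least squares fit of a simple trend LRM: y_t = b0 + b1*t + u_t
t_bar, y_bar = sum(t) / n, sum(y) / n
b1 = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y)) / sum(
    (ti - t_bar) ** 2 for ti in t
)
b0 = y_bar - b1 * t_bar

# Residuals, and sigma^2 estimated by the normalized sum of squared residuals
residuals = [yi - (b0 + b1 * ti) for ti, yi in zip(t, y)]
sigma2_hat = sum(r ** 2 for r in residuals) / n

# Normalized AIC score: log(sigma^2_hat) + 2K/n; here K counts b0, b1, sigma^2
K = 3
aic = math.log(sigma2_hat) + 2 * K / n
print(f"sigma^2_hat = {sigma2_hat:.4f}, AIC score = {aic:.4f}")
```

Competing specifications (e.g., with extra trend terms) would each get such a score, the smallest being preferred – which is exactly where the misspecification worry above bites, since the fit measure presupposes the model’s own error assumptions.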
5 When two different models capture the data adequately, they are called reparameterizations of each other.
Excursion 5
Power and Severity
Itinerary
Tour I Power: Pre-data and Post-data
5.1 Power Howlers, Trade-offs, and Benchmarks
5.2 Cruise Severity Drill: How Tail Areas (Appear to) Exaggerate the Evidence
5.3 Insignificant Results: Power Analysis and Severity
5.4 Severity Interpretation of Tests: Severity Curves
Tour II How Not to Corrupt Power
5.5 Power Taboos, Retrospective Power, and Shpower
5.6 Positive Predictive Value: Fine for Luggage
Tour III Deconstructing the N-P versus Fisher Debates
5.7 Statistical Theatre: “Les Miserables Citations”
5.8 Neyman’s Performance and Fisher’s Fiducial Probability
Tour I
Power: Pre-data and Post-data
A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects. In psychology, and especially in soft psychology, under the sway of the Fisherian scheme, there has been little consciousness of how big things are.
(Cohen 1990, p. 1309)
So how would you use power to consider the magnitude of effects were you drawn forcibly to do so? In with your breakfast is an exercise to get us started on today’s shore excursion.
Suppose you are reading about a statistically significant result x (just at level α) from a one-sided test T+ of the mean of a Normal distribution with n IID samples, and known σ: H0: µ ≤ 0 against H1: µ > 0.
Underline the correct word, from the perspective of the (error statistical) philosophy, within which power is defined.
If the test’s power to detect µ′ is very low (i.e., POW(µ′) is low), then the statistically significant x is poor/good evidence that µ > µ′.
Were POW(µ′) reasonably high, the inference to µ > µ′ is reasonably/poorly warranted.
We’ve covered this reasoning in earlier travels (e.g., Section 4.3), but I want to launch our new tour from the power perspective. Assume the statistical test has passed an audit (for selection effects and underlying statistical assumptions) – you can’t begin to analyze the logic if the premises are violated.
During our three tours on Power Peninsula, a partially uncharted territory, we’ll be residing at local inns, not returning to the ship, so pack for overnights. We’ll visit its museum, but mostly meet with different tribal members who talk about power – often critically. Power is one of the most abused notions in all of statistics, yet it’s a favorite for those of us who care about magnitudes of discrepancies. Power is always defined in terms of a fixed cut-off, cα, computed under a value of the parameter under test; since these vary, there is really a power function. If someone speaks of the power of a test tout court, you cannot make sense of it without qualification. First defined in Section 3.1, the power of a test against µ′ is the probability it would lead to rejecting H0 when μ = µ′:
POW(T, µ′) = Pr(d(X) ≥ cα; μ = µ′), or Pr(test T rejects H0; μ = µ′).
If it’s clear what the test is, we just write POW(µ′). Power measures the capability of a test to detect µ′ – where the detection is in the form of producing a d ≥ cα. While power is computed at a point μ = µ′, we employ it to appraise claims of the form μ > µ′ or μ < µ′.
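As a minimal sketch of this definition – my own illustration, with assumed values of n, σ, and α – the power function of test T+ can be computed directly from the Normal distribution; replacing cα with the observed d(x₀) would give the “attained power” Π(γ) mentioned below:

```python
from statistics import NormalDist

def power_Tplus(mu_prime, mu0=0.0, sigma=1.0, n=25, alpha=0.025):
    """POW(T+, mu') = Pr(d(X) >= c_alpha; mu = mu') for the one-sided Normal
    test T+ of H0: mu <= mu0 vs H1: mu > mu0, sigma known, n IID samples."""
    z = NormalDist()
    c_alpha = z.inv_cdf(1 - alpha)                # fixed cut-off for d(X)
    shift = (mu_prime - mu0) * n ** 0.5 / sigma   # mean of d(X) when mu = mu'
    return 1 - z.cdf(c_alpha - shift)

# Power grows as the alternative mu' moves away from mu0 (assumed example values)
for mu_prime in (0.0, 0.2, 0.4, 0.6):
    print(f"POW({mu_prime}) = {power_Tplus(mu_prime):.3f}")
```

At μ = μ0 the function returns α itself, underscoring that power is computed under hypothetical parameter values, not from the observed data.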
Power is an ingredient in N-P tests, but even practitioners who declare they never set foot into N-P territory, but live only in the land of Fisherian significance tests, invoke power. This is all to the good, and they shouldn’t fear that they are dabbling in an inconsistent hybrid.
Jacob Cohen’s (1988) Statistical Power Analysis for the Behavioral Sciences is displayed at the Power Museum’s permanent exhibition. Oddly, he makes some slips in the book’s opening. On page 1 Cohen says: “The power of a statistical test is the probability it will yield statistically significant results.” Also faulty is what he says on page 4: “The power of a statistical test of a null hypothesis is the probability that it will lead to the rejection of the null hypothesis, i.e., the probability that it will result in the conclusion that the phenomenon exists.” Cohen means to add “computed under an alternative hypothesis,” else the definitions are wrong. These snafus do not take away from Cohen’s important tome on power analysis, yet I can’t help wondering if these initial definitions play a bit of a role in the tendency to define power as ‘the probability of a correct rejection,’ which slips into erroneously viewing it as a posterior probability (unless qualified).
Although keeping to the fixed cut-off cα is too coarse for the severe tester’s tastes, it is important to keep to the given definition for understanding the statistical battles. We’ve already had sneak previews of “achieved sensitivity” or “attained power” [Π(γ) = Pr(d(X) ≥ d(x₀); µ0 + γ)] by which members of Fisherian tribes are able to reason about discrepancies (Section 3.3). N-P accorded three roles to power: the first two are pre-data, for planning and comparing tests; the third is for interpretation post-data. It’s the third that they don’t announce very loudly, whereas that will be our main emphasis. Have a look at this museum label referring to a semi-famous passage by E. Pearson. Barnard (1950, p. 207) has just suggested that error probabilities of tests, like power, while fine for pre-data planning, should be replaced by other measures (likelihoods perhaps?) after the trial. What did Egon say in reply to George?
[I]f the planning is based on the consequences that will result from following a rule of statistical procedure, e.g., is based on a study of the power function of a test and then, having obtained our results, we do not follow the first rule but another, based on likelihoods, what is the meaning of the planning?
(Pearson 1950, p. 228)
This is an interesting and, dare I say, powerful reply, but it doesn’t quite answer George. By all means apply the rule you planned to, but there’s still a legitimate question as to the relationship between the pre-data capability or performance measure and post-data inference. The severe tester offers a view of this intimate relationship. In Tour II we’ll be looking at interactive exhibits far outside the museum, including N-P post-data power analysis, retrospective power, and a notion I call shpower. Employing our understanding of power, scrutinizing a popular reinterpretation of tests as diagnostic tools will be straightforward. In Tour III we go a few levels deeper in disinterring the N-P vs. Fisher feuds. I suspect there is a correlation between those who took Fisher’s side in the early disputes with Neyman and those leery of power. Oscar Kempthorne, being interviewed by J. Leroy Folks (1995), said:
Well, a common thing said about [Fisher] was that he did not accept the idea of the power. But, of course, he must have. However, because Neyman had made such a point about power, Fisher couldn’t bring himself to acknowledge it.
(p. 331)
However, since Fisherian tribe members have no problem with corresponding uses of sensitivity, P-value distributions, or CIs, they can come along on a severity analysis. There’s more than one way to skin a cat, if one understands the relevant statistical principles. The issues surrounding power are subtle, and unraveling them will require great care, so bear with me. I will give you a money-back guarantee that by the end of the excursion you’ll have a whole new view of power. Did I mention you’ll have a chance to power the ship into port on this tour? Only kidding; however, you will get to show your stuff in a Cruise Severity Drill (Section 5.2).
5.1 Power Howlers, Trade-offs, and Benchmarks
In the Mountains out of Molehills (MM) Fallacy (Section 4.3), a rejection of H0 just at level α with a larger sample size (higher power) is taken as evidence of a greater discrepancy from H0 than with a smaller sample size (in tests otherwise the same). Power can be increased by increasing sample size, but also by computing it in relation to alternatives further and further from H0. Some are careful to avoid the MM fallacy when the high power is due to large n, but then fall right into it when it is due to considering a very discrepant µ′. For our purposes, our one-sided T+ will do.
Mountains out of Molehills (MM) Fallacy (second form). Test T+: The fallacy of taking a just statistically significant difference at level α (i.e., d(x₀) = dα) as a better indication of a discrepancy µ′ if POW(µ′) is high than if POW(µ′) is low.
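A minimal numerical sketch of why this is a fallacy – my own example, with assumed σ, α, and sample sizes, using the severity computation from Section 4.3: for an outcome that just reaches the cut-off, the higher the power at µ′, the lower the severity with which µ > µ′ passes.

```python
from statistics import NormalDist

z = NormalDist()

def power(mu_prime, n, sigma=1.0, alpha=0.025, mu0=0.0):
    """POW(mu') for T+ with cut-off x_bar_alpha = mu0 + z_alpha*sigma/sqrt(n)."""
    se = sigma / n ** 0.5
    x_bar_alpha = mu0 + z.inv_cdf(1 - alpha) * se
    return 1 - z.cdf((x_bar_alpha - mu_prime) / se)

def severity_just_significant(mu_prime, n, sigma=1.0, alpha=0.025, mu0=0.0):
    """SEV(mu > mu') when the observed mean just reaches the cut-off."""
    se = sigma / n ** 0.5
    x_bar_obs = mu0 + z.inv_cdf(1 - alpha) * se   # just-significant outcome
    return z.cdf((x_bar_obs - mu_prime) / se)

# Same just-significant result, two sample sizes (values assumed for illustration)
mu_prime = 0.2
for n in (25, 400):
    print(f"n={n:4d}: POW({mu_prime}) = {power(mu_prime, n):.2f}, "
          f"SEV(mu > {mu_prime}) = {severity_just_significant(mu_prime, n):.2f}")
```

With the larger n (higher power at µ′ = 0.2), the just-significant result is worse evidence that µ > 0.2, not better – the reverse of what the MM fallacy assumes.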
Two Points Stephen Senn Correctly Dubs Nonsense and Ludicrous
Start with an extreme example: suppose someone is testing H0: the drug cures no one. An alternative H1 is that it cures nearly everyone. Clearly these are not the only possibilities. Say the test is practically guaranteed to reject H0 if, in fact, H1 is true and the drug cures practically everyone. The test has high power to detect H1. You wouldn’t say that its rejecting H0 is evidence of H1. H1 entails that it’s very probable you’ll reject H0; but rejecting H0 doesn’t warrant H1. To think otherwise is to allow problematic statistical affirming the consequent – the basis for the MM fallacy (Section 2.1). This obvious point lets you zero in on some confusions about power.
Stephen Senn’s contributions to statistical foundations are once again spot on. In drug development, it is typical to require a high power of 0.8 or 0.9 to detect effects deemed of clinical relevance. The clinically relevant discrepancy, as Senn sees it, is the discrepancy “one should not like to miss” (2007, p. 196). Senn labels this delta, Δ. He is considering a difference between means, so the null hypothesis is typically 0. We’ll apply severity to his example in Exhibit (iv) of this tour. Here the same points will be made with respect to our one-sided Normal test T+: H0: μ ≤ μ0 vs. H1: μ > μ0, letting μ0 = 0, σ known. We may view Δ as the value of μ of clinical relevance. (Nothing changes in this discussion if σ is estimated as s.) The test takes the form
Reject H0 iff Z ≥ zα (Z is the standard Normal variate).
“Reject H0” is the shorthand for “infer a statistically significant difference” at the level of the test. Though Z is the test statistic, it makes for a simpler presentation to use the cut-off for rejection in terms of the sample mean: Reject H0 iff X̄ ≥ x̄α, where x̄α = μ0 + zα σ/√n.
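To fix ideas with numbers – my own sketch, with assumed n, σ, and α – here is that cut-off x̄α alongside the alternative against which T+ has 0.8 power, the value the next paragraph abbreviates as μ.8, showing how far above the cut-off it sits:

```python
from statistics import NormalDist

z = NormalDist()

# Assumed values for illustration
mu0, sigma, n, alpha = 0.0, 1.0, 25, 0.025
se = sigma / n ** 0.5

# Cut-off for rejection in terms of the sample mean
x_bar_alpha = mu0 + z.inv_cdf(1 - alpha) * se

# Alternative detected with power 0.8:
# POW(mu) = 0.8  <=>  mu = x_bar_alpha + z_{0.8} * se, with z_{0.8} ~ 0.84
mu_power_08 = x_bar_alpha + z.inv_cdf(0.8) * se

print(f"cut-off x_bar_alpha        = {x_bar_alpha:.3f}")
print(f"alternative with 0.8 power = {mu_power_08:.3f}")
```

That gap of roughly 0.84 standard errors is what drives Senn’s point below: reaching the cut-off does not entitle one to infer a discrepancy as large as the one the test was powered to detect.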
Let’s abbreviate the alternative against which test T+ has 0.8 power by μ.8, when it’s clear what test we’re talking about. So POW(μ.8) = 0.8, and let’s suppose μ.8 is the clinically relevant difference Δ. Senn asks, what does μ.8 mean in relation to what we are entitled to infer when we obtain statistical significance? Can we say, upon rejecting the null hypothesis, that the treatment has a clinically relevant effect, i.e., μ ≥ μ.8 (or μ > μ.8)?
“This is a surprisingly widespread piece of nonsense which has even made its way into one book on drug industry trials” (ibid., p. 201). The reason it is nonsense, Senn explains, is that μ.8 must be in excess of the cut-off for rejection; in particular, μ.8 = x̄α + 0.84σ/√n (where 0.84 is the 0.8 quantile of the standard Normal, since POW(μ.8) = 0.8). We know we are only entitled to infer μ exceeds the lower bound of the confidence interval at a reasonable level; whereas μ.8 is actually the upper bound of a 0.8 (one-sided) confidence interval, formed having observed x̄ = x̄α. All we are to infer, officially, from just reaching the cut-off x̄α, is that μ >