Dr. Paul Hack, CEO of Best Drug Co., is accused of issuing a report on the benefits of Drug X that exploits a smattering of questionable research practices (QRPs): It ignores multiple testing, uses data-dependent hypotheses, and is oblivious to a variety of selection effects. What happens shines a bright spotlight on a mix of statistical philosophy and evidence. The case is fictional; any resemblance to an actual case is coincidental.
Round 1
The prosecution marshals their case, calling on a leader of an error statistical tribe who is also a scientist at Best: Confronted with a lack of statistically significant results on any of 10 different prespecified endpoints (in randomized trials on Drug X), Dr. Hack proceeded to engage in a post-data dredging expedition until he unearthed a subgroup wherein a nominally statistically significant benefit [B] was found. That alone was the basis for a report to shareholders and doctors that Drug X shows impressive benefit on factor B. Colleagues called up by the prosecution revealed further details: Dr. Hack had ordered his chief data analyst to “shred and dice the data into tiny julienne slices until he got some positive results,” sounding like the adman for the Chop-o-Matic. The P-value computed from such a post-hoc search cannot be regarded as an actual P-value, yet Dr. Hack performed no adjustment of P-values, nor did he disclose that the searching expedition had been conducted. Moreover, we learn, the primary endpoint describing the basic hypothesis about the mechanism by which Drug X might offer benefits attained a non-significant P-value of 0.52. Despite knowing the FDA would almost certainly not approve Drug X based on post-hoc searching, Dr. Hack optimistically reported on profitability for Best, thanks to the “positive” trials on Drug X.
Anyone who trades biotech stocks knows that when a company reports ‘We failed to meet primary and perhaps secondary endpoints,’ the stock is negatively affected, at times greatly. When one company decides to selectively report or be overly optimistic, it’s unfair to patients and stockholders.
Round 2
Next to be heard from are defenders of Dr. Hack: There’s no need to adjust for post-hoc data dredging; the fact that significance tests require such adjustments is actually one of their big problems. What difference does it make if Dr. Hack intended to keep trying and trying again until he found something? Intentions are irrelevant to the import of data. Others insist that the position on cherry picking is open to debate, depending on one’s philosophy of evidence. For the courts to take sides would set an ominous precedent. They cite upstanding statisticians who can attest to the irrelevance of such considerations.
Round 3
A second wave of Hack’s defenders (which could be the same as in Round 2) pile on, with a list of reasons for P-phobia: Significance levels exaggerate evidence, force us into dichotomous thinking, are sensitive to sample size, aren’t measures of evidence because they aren’t comparative reports, and violate the likelihood principle. Even expert prosecutors, they claim, construe a P-value as the probability the results are due to chance, which is to commit the prosecutor’s fallacy (misinterpreting P-values as posterior probabilities), so they are themselves confused.
Dr. Hack’s lawyer jumps at the opening before him: You see there is disagreement among scientists, at a basic philosophical level. To hold my client accountable would be no different than banning free and open discussion of rival interpretations of data amongst scientists.
Dr. Hack may not get off the hook in the end – at least in fields where best practice manuals encode taking account of selection effects – but that could change if enough people adopt the stance of friends of Hack. In any event, it is not too much of a caricature of actual debates taking place. You, the citizen scientist, have the tough job of sifting through the cacophony. Severity principle in hand, you can at least decipher where the controversy is coming from. To limber up you might disinter the sources of claims of Round 3.
4.6 Error Control is Necessary for Severity Control
To base the choice of the test of a statistical hypothesis upon an inspection of the observations is a dangerous practice; a study of the configuration of a sample is almost certain to reveal some feature, or features, which are exceptional if the [chance] hypothesis is true.
(Pearson and Chandra Sekar 1936, p. 127)
The likelihood principle implies … the irrelevance of predesignation, of whether an hypothesis was thought of beforehand or was introduced to explain the known effects.
(Rosenkrantz 1977, p. 122)
Here we encounter the same source of tribal rivalry first spotted in Excursion I with optional stopping (Tour II). Yet we also allowed that data dependencies, double counting, and non-novel results are not always problematic. The advantage of the current philosophy of statistics is that it makes it clear that the problem – when it is a problem – is that these gambits alter how well or severely probed claims are. We defined problematic cases as those where data or hypotheses are selected or generated, or a test criterion is specified, in such a way that the minimal severity requirement is violated, altered (without the alteration being mentioned), or unable to be assessed (Section 2.4).
Because they alter the severity, they must be taken account of in auditing a result, which includes checking for (i) selection effects, (ii) violations of model assumptions, and (iii) obstacles to any move from statistical to substantive causal or other theoretical claims.
There is no point in raising thresholds for significance if your methodology does not pick up on biasing selection effects. Yet, surprisingly, that is the case for many of the methods advocated by critics of significance tests, and related error statistical methods.
Two problems that plague frequentist inference: multiple comparisons and multiple looks, or, as they are more commonly called, data dredging and peeking at the data. The frequentist solution to both problems involves adjusting the P value … But adjusting the measure of evidence because of considerations that have nothing to do with the data defies scientific sense, belies the claim of ‘objectivity’ that is often made for the P value.
(Goodman 1999, p. 1010)
This is epidemiologist Steven Goodman.1 To his credit, he recognizes the philosophical origins of his position.
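To make concrete the kind of adjustment Goodman is objecting to, here is a minimal sketch using the familiar Bonferroni correction; the P-values and the number of comparisons are hypothetical, chosen only for illustration.

```python
# Sketch of a standard frequentist adjustment for multiple comparisons
# (Bonferroni): each P-value is scaled by the number of tests performed,
# so a nominally significant result from a search over several endpoints
# may no longer count as significant. All values are illustrative.
raw_p_values = [0.03, 0.20, 0.45, 0.008, 0.60]   # hypothetical P-values from 5 endpoints
m = len(raw_p_values)

adjusted = [min(1.0, p * m) for p in raw_p_values]  # Bonferroni: p_adj = min(1, m * p)

for p, p_adj in zip(raw_p_values, adjusted):
    verdict = "significant at 0.05" if p_adj < 0.05 else "not significant"
    print(f"raw p = {p:.3f}   adjusted p = {p_adj:.3f}   -> {verdict}")
```

The point of contention is visible at a glance: the adjustment depends on how many comparisons were performed, not on anything intrinsic to the single reported data set.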
Older arguments from Edwards, Lindman, and Savage (1963) (E, L, & S) live on in contemporary forms. The Bayesian psychologist Eric-Jan Wagenmakers tells us:
[I]f the sampling plan is ignored, the researcher is able to always reject the null hypothesis, even if it is true. This example is sometimes used to argue that any statistical framework should somehow take the sampling plan into account. Some people feel that ‘optional stopping’ amounts to cheating … This feeling is, however, contradicted by a mathematical analysis.
(2007, p. 785)
Being contradicted by mathematics is a heavy burden to overcome. Look closely and you’ll see we are referred to E, L, & S. But the “proof” assumes the Likelihood Principle (LP), by which error probabilities drop out (Section 1.5). Error probabilities and severity are altered, but if your account has no antennae to pick up on them, then, to you, there’s no effect.
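A quick simulation makes the point vivid. This is a sketch only, with an illustrative peeking schedule and sample-size cap: the null hypothesis is true, yet the probability that optional stopping eventually yields a nominally significant result at the 0.05 level is far above 0.05.

```python
import numpy as np

# Sketch: "try and try again" (optional stopping) when the null is true.
# Data are N(0, 1), so the null hypothesis mu = 0 holds. We peek after every
# 10 observations and stop as soon as the nominal two-sided P-value dips
# below 0.05, up to a cap of 500 observations. Numbers are illustrative.
rng = np.random.default_rng(1)

def rejects_with_peeking(max_n=500, peek_every=10):
    x = rng.normal(0.0, 1.0, size=max_n)
    for n in range(peek_every, max_n + 1, peek_every):
        z = x[:n].mean() * np.sqrt(n)      # z-statistic with known sigma = 1
        if abs(z) > 1.96:                  # nominal two-sided 0.05 threshold
            return True
    return False

trials = 1000
hits = sum(rejects_with_peeking() for _ in range(trials))
print(f"Proportion of true nulls 'rejected' at nominal 0.05: {hits / trials:.2f}")
# Well above 0.05; with no cap on the sample size, the nominal threshold is
# eventually crossed with probability approaching 1.
```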
Holders of the LP point fingers at error statisticians for worrying about “the sample space” and “intentions.” To leaders of movements keen to rein in researcher flexibility, by contrast, a freewheeling attitude toward data-dependent hypotheses and stopping rules is pegged as a major source of spurious significance levels. Simmons, Nelson, and Simonsohn (2011) list as their first requirement for authors: “Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article” (p. 1362). I’d relax it a little, requiring that they report how their stopping plan alters the relevant error probability. So let me raise an “either-or” question: Either your methodology picks up on influences on the error-probing capacities of methods or it does not. If it does, then you are in sync with the minimal severity requirement. We may compare our different ways of satisfying it. If it does not, then we’ve hit a crucial nerve. If you care, but your method fails to reflect that concern, then a supplement is in order. Opposition in methodology of statistics is fighting over trifles if it papers over this crucial point. If there is to be a meaningful “reconciliation,” it will have to be here.
I’m raising this in a deliberately stark, provocative fashion to call attention to an elephant in the room (or on our ship). The widespread concern over a “crisis of replication” has nearly everyone rooting for predesignation of aspects of the data analysis, but if they’re not also rooting for error control, what are they cheering for? Let’s allow there are other grounds to champion predesignation, so don’t accuse me of being unfair … yet. The tester sticks her neck out: she requires such considerations to have a direct effect on inferential measures.
Paradox of Replication
We often hear it’s too easy to obtain small p-values, yet replication attempts find it difficult to get small p-values with preregistered results. This shows the problem isn’t p-values but failing to adjust them for cherry picking, multiple testing, post-data subgroups and other biasing selection effects.
(Mayo 2016, p. 1)
Criticism assumes a standard. If you criticize hunting and snooping because of the lack of control of false positives (Type I errors), then the assumption is that those matter. Suppose someone claims it’s too easy to satisfy standard significance thresholds, while chiming in with those bemoaning the lack of replication of statistically significant results.
Critic of tests: It’s too easy to get low significance levels.
You: Why is it so hard to get small P-values in replication research?
Wait for their answer.
Critic: Aside from expected variability in results, there was likely some P-hacking, cherry picking, and other QRPs in the initial studies.
You: So, I take it you want methods that pick up on these biasing effects, and you favor techniques to check or avoid them.
Critic: Actually I think we should move to methods where selection effects make no difference.
(Bayes factors, Bayesian posteriors)
Now what? One possibility is a belief in magical thinking, that ignoring biasing effects makes them disappear. That doesn’t feel very generous. Or maybe the unwarranted effects really do disappear. Not to the tester. Consider the proposals from Tour II: Bayes ratios, with or without their priors. Imagine my low P-value emerged from the kind of data dredging that threatens the P-value’s validity. I go all the way to getting a z-value of 2.5, apparently satisfying Johnson’s approach. I erect the maximally likely alternative; it’s around 20 times more likely than the null, and I am entitled, on this system, to infer a posterior of 0.95 on H_max. Error statistical testers would complain that the probability of finding a spurious P-value this way is high; if they are right, as I think they are, then the probability of finding a spurious posterior of 0.95 is just as high. That is why Bayes factors are at odds with the severity requirement. I’m not saying in principle they couldn’t be supplemented in order to control error probabilities – nor that if you tell Bayesians what you want, they can’t arrange it (with priors, shrinkage, or what have you). I’m just saying that the data-dredged hypothesis that finds its way into a significance test can also find its way into a Bayes factor. There’s one big difference: I have error statistical grounds to criticize the former. If I switch tribes to one where error probabilities are irrelevant, my grounds for criticism disappear.
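As a check on the arithmetic in this passage – assuming a normal test statistic, a point alternative placed at the observed mean, and equal prior weight on the two hypotheses – the numbers come out close to those quoted.

```python
import math

# Sketch of the arithmetic: a z-value of 2.5, the "maximally likely"
# alternative H_max located at the observed mean, equal priors on H0 and H_max.
z = 2.5

# Likelihood ratio of H_max to H0 for a normal test statistic: exp(z^2 / 2)
likelihood_ratio = math.exp(z**2 / 2)               # about 22.8, "around 20 times"

# With prior odds of 1, posterior on H_max = LR / (1 + LR)
posterior_h_max = likelihood_ratio / (1 + likelihood_ratio)

print(f"Likelihood ratio H_max : H0 = {likelihood_ratio:.1f}")
print(f"Posterior on H_max (equal priors) = {posterior_h_max:.3f}")   # about 0.96
```

Nothing in this calculation registers how H_max was arrived at; the dredging that discredits the P-value leaves the likelihood ratio, and hence the posterior, untouched.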
The valid criticism of our imaginary Dr. Hack, in Round 1, is this: he purports to have found an effect that would be difficult to generate if there were no genuine discrepancy from the null, when in fact it is easy to generate it. It is frequently brought about in a world where the null hypothesis is true. The American Statistical Association’s statement on P-values (2016, p. 131) correctly warns that “[c]onducting multiple analyses of the data and reporting only those with certain p-values” leads to spurious significance levels. Their validity is lost, and the alarm goes off when we audit. When Hack’s defenders maintain that scientists should not be convicted for engaging in the all-important task of exploration, you can readily agree. But you can still insist that the results of explorations be independently tested or separately defended. If you’re not controlling error probabilities, however, there’s no alarm bell. This leads to the next group, Round 2, declaring that it makes no sense to adjust measures of evidence “because of considerations that have nothing to do with the data,” thereby denying the initial charge against poor Dr. Hack – or should I say lucky Dr. Hack, because the whole matter has now come down to something “quasi-philosophical,” a murky business at best. Round 3 piles on with ‘P-values are invariably misinterpreted,’ and no one really likes them much anyway. The P-value is the most unpopular girl in the class, and I wouldn’t even take you to visit P-value tribes – or “cults” as some call them2 (I prefer speaking of observed significance levels) – if it weren’t that they have suddenly occupied so much importance in the statistics wars.
You might say that even if some people deny that selection effects actually alter the “evidence,” the question of whether they can be ignored in interpreting data in legal or policy settings is not open for debate. After all, statutes in law and medicine require taking them into account. For example, the Reference Manual on Scientific Evidence for lawyers makes it clear that finding post-data subgroups that show impressive effects – when primary and secondary endpoints are insignificant – calls for adjustments. In their chapter, David Kaye and David Freedman emphasize the importance of asking:
How many tests have been performed? Repeated testing complicates the interpretation of significance levels. If enough comparisons are made, random error almost guarantees that some will yield ‘significant’ findings, even when there is no real effect …
If a fair coin is tossed a few thousand times, it is likely that at least one string of ten consecutive heads will appear. … Ten heads in the first ten tosses means one thing [evidence the coin is biased]; a run of ten heads somewhere along the way to a few thousand tosses of a coin means quite another.
(Kaye and Freedman 2011, pp. 127–8)
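A short simulation of the coin example shows how often a fair coin tossed a few thousand times produces at least one run of ten consecutive heads; the number of tosses and repetitions are illustrative.

```python
import numpy as np

# Sketch of the Kaye and Freedman coin example: with a fair coin tossed a few
# thousand times, how often does at least one run of ten heads appear somewhere?
rng = np.random.default_rng(0)

def has_run_of_heads(n_tosses=3000, run_length=10):
    tosses = rng.integers(0, 2, size=n_tosses)   # 1 = heads, 0 = tails
    streak = 0
    for t in tosses:
        streak = streak + 1 if t == 1 else 0
        if streak >= run_length:
            return True
    return False

trials = 1000
hits = sum(has_run_of_heads() for _ in range(trials))
print(f"Proportion of experiments with a run of 10 heads: {hits / trials:.2f}")
# A run of ten heads somewhere in a few thousand tosses is a common outcome of
# chance; ten heads in the first ten tosses (probability 1/1024) is not.
```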
Nevertheless, statutes can be changed if their rationale is overturned. The issue is not settled. There are regularly cases where defenders emphasize the lack of consensus and argue precisely as defenders in Round 2.3
Error Control is Only for Long Runs.
But wait a minute. I am overlooking the reasons given for ignoring error control, some even in direct reply to Mayo (2016):
Bayesian analysis does not base decisions on error control. Indeed, Bayesian analysis does not use sampling distributions … As Bayesian analysis ignores counterfactual error rates, it cannot control them.
(Kruschke and Liddell 2017, pp. 13, 15)
This recognition may be a constructive step, admitting lack of error control. But on their view, caring about error control could only matter if you were in the business of performance in the long run. Is this true? By the way, the error control is actual; counterfactuals are only used in determining what the error probabilities are. Kruschke and Liddell continue:
[I]f the goal is specifically to control the rate of false alarms when the decision rule is applied repeatedly to imaginary data from the null hypothesis, then, by definition, the analyst must compute a p value and corresponding CIs. … the analyst must take into account the exact stopping and testing intentions. … any such procedure does not yield the credibility of parameter values … but instead yields the probability of imaginary data … .
(ibid., p. 15)
What’s this about imaginary data? Suppose a researcher reports data x claiming to show impressive improvement in an allergic reaction among patients given a drug. They don’t report that they dredged different measures of improvement or searched through post-hoc subgroups among patients. The actual data x only showed positive results on allergic reaction, let’s say. Why report imaginary data that could have resulted but didn’t? Looking at imaginary data sounds really damning, but when we see what they mean, we’re inclined to regard it as a scandal to hide such information. The problem is not about long runs – lose the alleged time aspect. I can simulate the sampling distribution to see the relative frequencies any day of the week. I can show you if a method had very little capability of reporting bupkis when it should have reported bupkis (nothing). It didn’t do its job today. If Kruschke and Liddell’s measure of credibility ignores what’s regarded as QRPs in common statutes of evidence, you may opt for error control. True, error control is a necessary, not a sufficient, condition for a severity assessment. It follows from the necessity that an account with poor error control can neither falsify nor corroborate with severity.
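Here is a sketch of that sort of simulation, with illustrative numbers: ten independent endpoints, a true null on every one, and the search declared a ‘success’ if any endpoint reaches nominal significance at the 0.05 level.

```python
import numpy as np

# Sketch: simulate the sampling distribution of a dredging procedure. There is
# no real effect on any endpoint; the procedure searches 10 endpoints and
# reports whichever looks nominally significant. Numbers are illustrative.
rng = np.random.default_rng(2)

n_endpoints = 10      # hypothetical number of endpoints (or subgroups) searched
n_per_arm = 50        # hypothetical sample size per arm
trials = 5000

hits = 0
for _ in range(trials):
    # Null is true everywhere: treatment and control are both N(0, 1).
    treat = rng.normal(0.0, 1.0, size=(n_endpoints, n_per_arm))
    control = rng.normal(0.0, 1.0, size=(n_endpoints, n_per_arm))
    z = (treat.mean(axis=1) - control.mean(axis=1)) / np.sqrt(2.0 / n_per_arm)
    if np.any(np.abs(z) > 1.96):          # any nominally "significant" endpoint?
        hits += 1

print(f"Relative frequency of a spurious 'significant' finding: {hits / trials:.2f}")
# Roughly 1 - 0.95**10, i.e. about 0.4: the procedure frequently fails to
# report bupkis when bupkis is exactly what it should report.
```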
Direct Protection Requires Skin off Your Nose.
Testers are prepared to admit a difference in goals. Rival tribes may say lack of error control is no skin off their noses, since what they want are posterior probabilities and Bayes factors. Severe testers, on the other hand, appeal to statistics for protection against being misled by a cluster of biases and mistakes. A method directly protects against a QRP only insofar as its commission is skin off its nose. It must show up in the statistical analysis. More than that, there must be a general rationale for the concern. That a Bayesian probabilist avoids selection effects that hurt severity doesn’t automatically mean they do so directly because of the severity violation. Moreover, it shouldn’t be an option whether to take account of them or not; they must be. The tester holds this even granting there are cases where auditing shows no damage to severity.