(ibid., p. 151)
Certainly, the results of this study cast serious doubts on the validity of the psychoanalytic claims regarding the importance of the infant [training] … .
(ibid., pp. 158–9)
In addition to their astuteness in taking account of searching, notice they’re not reticent in reporting negative results. Since few if any were statistically significant, Fisher’s requirement for demonstrating an effect fails. This casts serious doubt on what was a widely accepted basis for recommending Freudian-type infant training.7 The presumption of a genuine effect is statistically falsified, or very close to it.
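To see why a handful of nominally significant results among many comparisons is roughly what “no genuine effect” predicts, here is a minimal back-of-the-envelope sketch in Python. The number of comparisons (k = 40) is purely illustrative, not the count from the Sewell study.

```python
from scipy.stats import binom

alpha = 0.05   # nominal significance level of each test
k = 40         # illustrative number of independent comparisons searched

expected = alpha * k                          # expected count of "significant" results under the global null
p_at_least_one = 1 - (1 - alpha) ** k         # chance of at least one nominal significance by chance alone
p_three_or_more = 1 - binom.cdf(2, k, alpha)  # chance of three or more, still by chance alone

print(f"Expected significant results if no effects exist: {expected:.1f}")
print(f"P(at least one nominally significant result):     {p_at_least_one:.2f}")
print(f"P(three or more nominally significant results):   {p_three_or_more:.2f}")
```

With 40 such comparisons, roughly two nominally significant results are expected even if every null is true, which is why a scattering of isolated “significant” findings falls short of Fisher’s requirement for demonstrating a genuine effect.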
When Searching Doesn’t Damage Severity: Explaining a Known Effect
What is sometimes overlooked is that the problem is not that a method is guaranteed to output some claim or other that fits the data; the problem is doing so unreliably. Some criticisms of adjusting for selection are based on examples where severity is improved by searching. We should not be tempted to run together examples of pejorative data-dependent hunting with cases that are superficially similar, but unproblematic. For example, searching for a DNA match with a criminal’s DNA is somewhat akin to finding a statistically significant departure from a null hypothesis: “one searches through data and concentrates on the one case where a ‘match’ with the criminal’s DNA is found, ignoring the non-matches” (Mayo and Cox 2006, p. 94). Isn’t an error statistician forced to adjust for hunting here as well? No.
In illicit hunting and cherry picking, the concern is that of inferring a genuine effect when none exists; whereas “here there is a known effect or specific event, the criminal’s DNA, and reliable procedures are used to track down the specific cause or source” – or so we assume with background knowledge of a low “erroneous match” rate. “The probability is high that we would not obtain a match with person i, if i were not the criminal”; so, by the severity criterion (or FEV), finding the match is good evidence that i is the criminal. “Moreover, each non-match found, by the stipulations of the example, virtually excludes that person.” Thus, the more such negative results, the stronger is the evidence against i when a match is finally found. Negative results fortify the inferred match. Since “at most one null hypothesis of innocence is false, evidence of innocence on one individual increases, even if only slightly, the chance of guilt of another” (ibid.).
Philip Dawid (2000, p. 325) invites his readers to assess whether they are “intuitive Bayesians or intuitive frequentists” by “the extreme case that the data base contains records on everyone in the population.” One could imagine this put in terms of ‘did you hear about the frequentist who thought finding a non-match with everyone but Sam is poor evidence that Sam is guilty?’ The criticism is a consequence of blurring pejorative and non-pejorative cases. It would be absurd to consider the stringency of the probe as diminishing as more non-matches are found. Thus we can remain intuitive frequentist testers. A cartoon shows a man finding his key after searching, with the caption “always the last place you look.” Searching for your lost key is like the DNA search for a criminal. (Note the echoes with the philosophical dispute about the relevance of novel predictions; Section 2.4.)
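The contrast between the two kinds of searching can be made vivid with a small simulation. The numbers below (database size, per-person erroneous-match rate, the count of null effects hunted through) are illustrative assumptions, not figures from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
trials = 100_000

# Case 1: DNA-style search for a known effect (the criminal's DNA).
# The true source always matches; each of the N - 1 innocents matches
# only with a tiny erroneous-match rate e (both numbers are illustrative).
N, e = 1_000, 1e-4
sole_match_is_source = (rng.binomial(N - 1, e, size=trials) == 0).mean()
print(f"DNA search: the sole match is the true source in "
      f"{sole_match_is_source:.3f} of searches")

# Case 2: hunting through k effects that are all truly null, declaring a
# "discovery" whenever any one test reaches nominal significance at 0.05.
k = 60
spurious_hit = (rng.binomial(k, 0.05, size=trials) > 0).mean()
print(f"Hunting nulls: at least one spurious 'discovery' turns up in "
      f"{spurious_hit:.3f} of searches")
```

Searching the entire database leaves the chance of an erroneous verdict tied to the tiny per-person false-match rate, whereas hunting through genuinely null effects all but guarantees that something “significant” will be found; only the second kind of search erodes severity.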
If an effect is known to be genuine, then a sought-for and found explanation needn’t receive a low severity. Don’t misunderstand: not just any explanation of a known effect passes severely, but one mistake – spurious effect – is already taken care of. Once the deflection effect was found to be genuine, it had to be a constraint on theorizing about its cause. In other cases, the trick is to hunt for a way to make the effect manifest in an experiment. When teratogenicity was found in babies whose mothers had been given thalidomide, it took researchers quite some time to find an animal in which the effect showed up: finally it was replicated in New Zealand rabbits! It is one thing if you are really going where the data take you, as opposed to subliminally taking the data where you want them to go. Severity makes the needed distinctions.
Renouncing Error Probabilities Leaves Trenchant Criticisms on the Table
Some of the harshest criticisms of frequentist error-statistical methods these days rest on principles that the critics themselves purport to reject. An example is for a Bayesian to criticize a reported P-value on the grounds that it failed to adjust for searching, while denying that searching matters to evidence. If it is what I call a “for thee and not for me” argument, the critic is not being inconsistent. She accepts the “I don’t care, but you do” horn. When a Bayesian, like I. J. Good, says a Fisherian but not a Bayesian can cheat with optional stopping, he means that error probabilities aren’t the Bayesian’s concern (Section 1.5). (Error probabilities without a subscript always refer to error probability₁ (Section 3.6) from frequentist methods.) It’s perfectly fair to take the “we don’t care” horn of my dilemma, and that’s a great help in getting beyond the statistics wars. What’s unfair is dismissing those who care as fetishizing imaginary data. Critics of error statistics should admit that the consequences are sufficiently concerning to be ensconced in statutes of best practices – at least if the statistics wars are to become less shrill and more honest.
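Good’s point about cheating with optional stopping can be checked directly. The sketch below – with an arbitrary cap of 1,000 observations, my own illustrative choice – estimates the actual probability of eventually reporting nominal 0.05 significance when the null is true and one simply keeps sampling until the z statistic crosses 1.96.

```python
import numpy as np

rng = np.random.default_rng(7)

def try_until_significant(max_n=1_000, z_crit=1.96):
    """Sample from N(0, 1) -- so the null is true -- testing after each
    observation and stopping as soon as |z| exceeds z_crit."""
    total = 0.0
    for n in range(1, max_n + 1):
        total += rng.standard_normal()
        z = total / np.sqrt(n)
        if abs(z) > z_crit:
            return True          # a "significant" result gets reported
    return False

trials = 2_000
rejections = sum(try_until_significant() for _ in range(trials))
print(f"Nominal level: 0.05; actual rejection rate with optional stopping: "
      f"{rejections / trials:.2f}")
```

With this stopping rule the chance of an eventual “significant” result is several times the advertised 5 percent, and it creeps toward 1 as the cap is raised – exactly the error probability the Fisherian cares about and the Bayesian conditioning only on the realized data does not.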
What should we say about a distinct standpoint you will come across? First a critic berates a researcher for reporting an unadjusted P-value despite hunting and multiple testing. Next, because the critic’s own methodology eschews the error statistical rationale on which those concerns rest, she is forced to switch to different grounds for complaining – generally by reporting disbelief in the effect that was hunted. Recall Royall blocking “the deck is made up of aces of diamonds,” despite its being most likely (given the one ace of diamonds drawn), by switching to the belief category (Section 1.4). That might work in such trivial cases. In others, it weakens the intended criticism to the point of having it obliterated by those who deserve to be charged with severe testing crimes.
Exhibit (x): Relinquishing Their Strongest Criticism: Bem.
There was an ESP study that got attention a few years back (Bem 2011). Anyone choosing to take up an effect that has been demonstrated only with questionable research practices or has been falsified must show they have avoided the well-known tricks. But Bem openly admits he went on a fishing expedition to find results that appear to show an impressive non-chance effect, which he credits to ESP (subjects did better than chance at predicting which erotic picture they’ll be shown in the future). The great irony is that Wagenmakers et al. (2011), keen as they are to show “Psychologists Must Change the Way They Analyze Their Data” and trade significance tests for Bayes factors, relinquish their strongest grounds for criticism. While they mention Bem’s P-hacking (fishing for a type of picture subjects get right most often), this isn’t their basis for discrediting Bem’s results. After all, Wagenmakers looks askance at adjusting for selection effects:
P values can only be computed once the sampling plan is fully known and specified in advance. In scientific practice, few people are keenly aware of their intentions, particularly with respect to what to do when the data turn out not to be significant after the first inspection. Still fewer people would adjust their p values on the basis of their intended sampling plan.
(Wagenmakers 2007, p. 784)
Rather than insist they ought to adjust, Wagenmakers dismisses a concern with “hypothetical actions for imaginary data” (ibid.). To criticize Bem, Wagenmakers et al. (2011) resort to a default Bayesian prior that makes the null hypothesis comparatively more probable than a chosen alternative (along the lines of Excursion 4, Tour II). Not only does this forfeit their strongest criticism, they give Bem et al. (2011) a cudgel to thwack back at them:
Whenever the null hypothesis is sharply defined but the prior distribution on the alternative hypothesis is diffused over a wide range of values, as it is in … Wagenmakers et al. (2011), it boosts the probability that any observed data will be higher under the null hypothesis than under the alternative. This is known as the Lindley-Jeffreys paradox: A frequentist analysis that yields strong evidence in support of the experimental hypothesis can be contradicted by a misguided Bayesian analysis that concludes that the same data are more likely under the null.
(p. 717)
Instead of getting flogged, Bem is positioned to point to the flexibility of getting a Bayes factor in favor of the null hypothesis. Rather than showing psychologists should switch, the exchange is a strong argument for why they should stick to error statistical requirements.
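The “flexibility” Bem et al. point to is easy to exhibit numerically. Below is a minimal sketch – with my own illustrative numbers (z = 2.5, n = 100), not Bem’s data – of a z test of H0: μ = 0 against a N(0, τ²) prior on μ under H1, using the standard normal-normal marginal-likelihood formula. Widening the prior drives the Bayes factor toward the null even though the P-value stays fixed and small.

```python
import numpy as np
from scipy import stats

def bayes_factor_01(z, n, tau, sigma=1.0):
    """Bayes factor in favor of H0: mu = 0 over H1: mu ~ N(0, tau^2),
    for a z statistic computed from n observations with known sigma."""
    r = n * tau**2 / sigma**2                   # prior-to-sampling variance ratio
    return np.sqrt(1 + r) * np.exp(-0.5 * z**2 * r / (1 + r))

z, n = 2.5, 100
p_value = 2 * (1 - stats.norm.cdf(z))           # two-sided P-value, about 0.012
print(f"Two-sided P-value for z = {z}: {p_value:.3f}")

for tau in (0.1, 1.0, 10.0, 100.0):
    print(f"prior sd tau = {tau:7.1f}  ->  BF_01 = {bayes_factor_01(z, n, tau):8.2f}")
```

The same z that gives a P-value of roughly 0.01 yields a Bayes factor favoring the null by a factor of 4, 40, or more, depending entirely on how diffuse one makes the prior under the alternative.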
P-values Can’t Be Trusted Except When Used to Argue That P-values Can’t Be Trusted!
There is more than a whiff of inconsistency in proclaiming P-values cannot be trusted while in the same breath extolling the uses of statistical significance tests and P-values in mounting criticisms of significance tests and P-values. Isn’t a critic who denies the entire error statistical methodology – significance tests, N-P tests, and confidence intervals – also required to forfeit the results those methods give when they just happen to serve a criticism of the tests? How much more so when those criticisms are the basis for charging someone with fraud. Yet that is not what we see.
Uri Simonsohn became a renowned fraud-buster by inventing statistical tests to rigorously make out his suspicions of the work of social psychologists Dirk Smeesters, Lawrence Sanna, and others – “based on statistics alone” – as one of his titles reads. He shows the researcher couldn’t have gotten so little variability, or the results are too good to be true – along with a fastidious analysis of numerous papers to rule out, statistically, any benign explanations (Simonsohn 2013). Statistician Richard Gill, often asked for advice on such cases, notes: “The methodology here is not new. It goes back to Fisher (founder of modern statistics) in the 30’s … The tests of goodness of fit were, again and again, too good” (2014). I expected that tribes who deny the evidential weight of significance tests would come to the defense of the accused, but (to my knowledge) none did.
Note, too, that the argument of the fraud-busters underscores the severity rationale for the case at hand. Critics called in to adjudicate high-profile cases of suspected fraud are not merely trying to ensure they will rarely erroneously pinpoint frauds in the long run. They are making proclamations on the specific case at hand – and in some cases, a person’s job depends on it. They will use a cluster of examples to mount a strong argument from coincidence that the data in front of us could not have occurred without finagling. Other tools are used to survey a group of significance tests in a whole field, or by a given researcher. For instance, statistical properties of P-values are employed to ascertain if too many P-values at a given level are attained. These are called P-curves (Simonsohn et al. 2014). Such fraud detection machines at most give an indication about a field or group of studies. Of course, once known, they might themselves be gamed. But it’s an intriguing new research field; and it is an interesting fact that when scientists need to warrant serious accusations of bad statistics, if not fraud, they turn to error statistical reasoning and to statistical tests. If you got rid of them, they’d only have to be reinvented by those who insist on holding others accountable for their statistical inferences.
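The reasoning behind a P-curve can be sketched in a few lines of simulation. This is only the bare idea – one-sided z tests, keeping the statistically significant P-values and looking at how they distribute themselves – not Simonsohn et al.’s actual procedure, and the effect sizes and study counts are illustrative.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)

def significant_p_values(effect, n_studies=100_000, n=30):
    """One-sided P-values from z statistics with noncentrality effect*sqrt(n),
    keeping only those below 0.05. effect = 0 mimics a literature of true
    nulls; effect > 0 a literature studying a genuine effect."""
    z = effect * np.sqrt(n) + rng.standard_normal(n_studies)
    p = 1 - stats.norm.cdf(z)
    return p[p < 0.05]

bins = np.arange(0, 0.051, 0.01)
for label, effect in [("true nulls (flat curve):            ", 0.0),
                      ("genuine effect (right-skewed curve): ", 0.4)]:
    p = significant_p_values(effect)
    counts, _ = np.histogram(p, bins=bins)
    print(label, np.round(counts / counts.sum(), 2))
```

A genuine effect piles the significant P-values down near zero; true nulls spread them evenly; and in the P-curve literature a pile-up just below 0.05 is treated as the signature of selection rather than of a real effect.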
Exhibit (xi): How Data-dependent Selections Invalidate Error Probability Guarantees.
That a statistical method directly protects against data-dependent selections can be shown by demonstrating how such selections cause a breakdown in the method’s guarantees. Philosopher of science Ronald Giere considers Neyman–Pearson interval estimation for a Binomial proportion. If assumptions are met, the sample mean will differ from the true value of θ by more than 2 standard errors less than 5 percent of the time, approximately. Giere shows how to make the probability of successful estimates not 0.95 but 0! “This will be sufficient to prove the point [the inadmissibility of this method] because Neyman’s theory asserts that the average ratio of success is independent of the constitutions of the populations examined” (Giere 1969, p. 375). Take a population of A’s and to each set of n members assign a shared property. The full population has U members, where U > 2n. Then arbitrarily assign this same property to U/2 − n additional members.
Given a sufficient store of logically independent properties, this can be done for all possible combinations of n A’s. The result is a population so constructed that while every possible n-membered sample contains at least one apparent regularity [a property shared by all members of the sample], every independent property has an actual ratio of exactly one-half in the total population.
(ibid., p. 376)8
The bottom line is that showing how error probabilities can be distorted by finagling shows the value of these methods. It’s hard to see how accounts that claim error probabilities are irrelevant can supply such direct protection, although they may indirectly block the same fallacies. That remains to be shown.
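Giere’s construction can be mimicked by simulation. In the sketch below the many candidate properties are assigned at random, each held by half of a very large population, rather than by his exact combinatorial scheme, and the sample size and number of properties are illustrative; this is enough to show the advertised 95 percent coverage collapsing once the reported property is hunted from the data.

```python
import numpy as np

rng = np.random.default_rng(11)

n, n_props, trials = 10, 20_000, 2_000
true_freq = 0.5          # every candidate property holds for half the population
covered, found = 0, 0

for _ in range(trials):
    # Each sampled member has each property independently with probability 1/2
    # (a stand-in for Giere's very large, half-and-half population).
    sample = rng.random((n, n_props)) < true_freq

    # Hunt through the properties for an "apparent regularity":
    # one shared by every member of the sample.
    shared = np.flatnonzero(sample.all(axis=0))
    if shared.size == 0:
        continue                                 # no regularity found this time
    found += 1

    # Report the usual 95% interval for that property's population frequency.
    p_hat = sample[:, shared[0]].mean()          # equals 1 by the way it was hunted
    se = np.sqrt(p_hat * (1 - p_hat) / n)
    covered += (p_hat - 2 * se <= true_freq <= p_hat + 2 * se)

print(f"Shared property found in {found}/{trials} samples")
print(f"Advertised coverage: 0.95;  actual coverage of the hunted property: "
      f"{covered / max(found, 1):.2f}")
```

Because the property is chosen precisely for fitting the sample, the estimate sits at 1 no matter what the population is like, and the guaranteed 95 percent success rate drops to 0 – Giere’s point exactly: the guarantee attaches to a procedure fixed in advance, not to whatever is hunted from the data.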
Souvenir S: Preregistration and Error Probabilities
“One of the best-publicized approaches to boosting reproducibility is preregistration … to prevent cherry picking statistically significant results” (Baker 2016, p. 454). It shouldn’t be described as too onerous to carry out. Selection effects alter the outcomes in the sample space, showing up in altered error probabilities. If the sample space (and so error probabilities) is deemed irrelevant post-data, the direct rationale for preregistration goes missing. Worse, in the interest of promoting a methodology that downplays error probabilities, researchers who most deserve lambasting are thrown a handy line of defense. Granted, it is often presupposed that error probabilities are relevant only for long-run performance goals. I’ve been disabusing you of that notion. Perhaps some of the “never error probabilities” tribe will shift their stance now: ‘But Mayo, using error probabilities for severity differs from the official line, which is all about performance.’ One didn’t feel too guilty denying a concern with error probabilities before. If viewing statistical inference as severe tests yields such a concession, I will consider my project a success. Actually, my immediate goal is less ambitious: to show that looking through the severity tunnel lets you unearth the crux of major statistical battles. In the meantime, no fair critic of error statistics should proclaim error control is all about hidden intentions that a researcher can’t be held responsible for. They should be.
4.7 Randomization
The purpose of randomisation … is to guarantee the validity of the test of significance, this test being based on an estimate of error made possible by replication.
(Fisher [1935b] 1951, p. 26)
The problem of analysing the idea of randomization is more acute, and at present more baffling, for subjectivists than for objectivists, more baffling because an ideal subjectivist would not need randomization at all. He would simply choose the specific layout that promised to tell him the most.
(Savage 1962, p. 34)
Randomization is a puzzle for Bayesians. The intuitive need for randomization is clear, but there is a standard result that Bayesians need not randomize.
(Berry and Kadane 1997, p. 813)
Many Bayesians (though there are some very prominent exceptions) regard it as irrelevant and most frequentists (again there are some exceptions) consider it important.
(Senn 2007, p. 34)
There’s a nagging voice rarely heard from in today’s statistical debates: if an account has no niche for error statistical reasoning, what happens to design principles whose primary purpose is to afford it? Randomization clearly exploits counterfactual considerations of outcomes that could have occurred, so dear to the hearts of error statisticians.
Some of the greatest contributions of statistics to science involve adding additional randomness and leveraging that randomness. Examples are randomized experiments, permutation tests, cross-validation and data-splitting. These are unabashedly frequentist ideas and, while one can strain to fit them into a Bayesian framework, they don’ t really have a place in Bayesian inference.
(Wasserman 2008, p. 465)
One answer is to recognize that, apart from underwriting significance tests and the estimation of the standard error, randomization also has a role in preventing types of biased selections, especially where the context requires convincing others. Although these and other justifications are interesting and important, their defenders tend to regard them as subsidiary and largely optional uses for randomization.
Randomization country is huge; one scarcely does it justice in a single afternoon’s tour. Moreover, a deeper look would require that I call in a more expert field guide. A glimpse will shed light on core differences that interest us. Let’s focus on the random allocation of a treatment or intervention, in particular in comparative treatment-control studies.
The problem with attributing Mary’s lack of dementia (by age 80) to her having been taking HRT since menopause is that we don’t know what her condition would have been like had she not been so treated. Moreover, she’s just one case, and we’re interested in treatment effects that are statistical. A factor is sometimes said to statistically contribute to some response when the average response in the experimental population of treated individuals would be higher (or lower) than it would have been had they not been treated – in effect comparing two counterfactual populations. Randomized control experiments let us peer into these counterfactual populations by finding out about the difference between the average response in the treated group, μT, and the average response in a control group, μC. With randomized control trials (RCTs), there is a deliberate introduction of a probabilistic assignment of a treatment of interest, using known chance mechanisms, such as a random number generator. Letting Δ = μT − μC, one may consider what Cox calls a strong (“no effect”) null – that the average response is no different among the treated than among the control group, H0: Δ = 0 (vs. H1: Δ ≠ 0) – or a one-sided null, that it is no greater: H0: Δ ≤ 0 vs. H1: Δ > 0.
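Fisher’s claim that randomization “guarantees the validity of the test of significance” is usually cashed out with a randomization (permutation) test: under the sharp hypothesis that the treatment makes no difference to any unit (a stronger claim than Δ = 0), the observed split of outcomes between groups is just one of the equally probable re-randomizations. Here is a minimal sketch, with made-up outcome data chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(5)

# Made-up outcomes for illustration (e.g., a cognitive score).
treated = np.array([7.1, 6.4, 8.0, 7.7, 6.9, 7.5, 8.2, 7.0])
control = np.array([6.2, 6.8, 6.0, 7.1, 6.5, 5.9, 6.7, 6.3])
observed = treated.mean() - control.mean()

pooled = np.concatenate([treated, control])
n_t = len(treated)

# Re-randomize the treatment labels many times; under the sharp null of
# no effect on any unit, each re-labelling is equally probable.
reps = 50_000
perm_diffs = np.empty(reps)
for i in range(reps):
    perm = rng.permutation(pooled)
    perm_diffs[i] = perm[:n_t].mean() - perm[n_t:].mean()

# One-sided P-value for H1: Delta > 0.
p_value = (perm_diffs >= observed).mean()
print(f"Observed difference in means: {observed:.2f};  "
      f"randomization P-value: {p_value:.4f}")
```

The validity of this test rests on the physical act of random assignment rather than on a model of how the outcomes were generated – which is why the counterfactual outcomes that could have occurred are doing real inferential work here.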