Statistical Inference as Severe Testing

by Deborah G. Mayo

1. The reported (nominal) statistical significance result is spurious (it’s not even an actual P-value). This can happen in two ways: biasing selection effects, or violated assumptions of the model.

  2. The reported statistically significant result is genuine, but it’s an isolated effect not yet indicative of a genuine experimental phenomenon. (Isolated low P-value ⇏ H: statistical effect.)

  3. There’s evidence of a genuine statistical phenomenon but either (i) the magnitude of the effect is less than purported, call this a magnitude error,⁴ or (ii) the substantive interpretation is unwarranted (H ⇏ H*).

  I will call an audit of a P-value a check of any of these concerns, generally in order, depending on the inference. That’s why I place the background information for auditing throughout our “series of models” representation (Figure 2.1). Until audits are passed, the relevant statistical inference is to be reported as “unaudited.” Until 2 is ruled out, it’s a mere “indication,” perhaps, in some settings, grounds to get more data.

  Meehl’s criticism concerns a violation described in 3(ii). Like many criticisms of significance tests these days, it’s based on an animal that goes by the acronym NHST (null hypothesis significance testing). What’s wrong with NHST in relation to Fisherian significance tests? The museum label says it for me:

  If NHST permits going from a single small P-value to a genuine effect, it is illicit; and if it permits going directly to a substantive research claim it is doubly illicit!

  We can add: if it permits biasing selection effects it’s triply guilty. Too often NHST refers to a monster describing highly fallacious uses of Fisherian tests, introduced in certain social sciences. I now think it’s best to drop the term NHST. Statistical tests will do, although our journey requires we employ the terms used in today’s battles.

  Shall we blame the wise and sagacious Meehl for selective reading of Fisher? I don’t know. Meehl gave me the impression that he was irked that using significance tests seemed to place shallow areas of psychology on a firm falsification footing, whereas more interesting, deep psychoanalytic theories were stuck in pseudoscientific limbo. He and Niels Waller give me the honor of being referred to in the same breath as Popper and Salmon:

  For the corroboration to be strong, we have to have ‘Popperian risk’, … ‘severe test’ [as in Mayo], or what philosopher Wesley Salmon called a highly improbable coincidence [“damn strange coincidence”].

  (Meehl and Waller 2002, p. 284)

  Yet we mustn’t blur an argument from coincidence merely to a real effect with one that underwrites arguing from coincidence to research hypothesis H*. Meehl worried that, by increasing the sample size, trivial discrepancies can lead to a low P-value, and, using NHST, evidence for H* is too readily attained. Yes, if you plan to perform an illicit inference, then whatever makes the inference easier (increasing sample size) is even more illicit. Since proper statistical tests block such interpretations, there’s nothing anti-Popperian about them.
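
  A minimal numerical sketch of Meehl’s worry (my illustration, not the book’s): with a large enough sample, even a trivially small true discrepancy from the null yields a tiny P-value. The one-sample z-test, the 0.02-standard-deviation shift, and the sample sizes are illustrative assumptions, not taken from the text.

```python
# Sketch: a fixed, trivially small true shift (0.02 sigma) tested at increasing n.
# With large n the P-value shrinks toward zero even though the effect stays trivial.
import math
import numpy as np

rng = np.random.default_rng(0)
true_shift = 0.02          # a "trivial" discrepancy: 0.02 standard deviations

def one_sided_z_pvalue(sample):
    """One-sample z-test of H0: mu = 0 vs H1: mu > 0, with sigma treated as known (= 1)."""
    z = sample.mean() * math.sqrt(len(sample))   # (xbar - 0) / (1 / sqrt(n))
    return 0.5 * math.erfc(z / math.sqrt(2))     # upper-tail area under N(0, 1)

for n in (100, 10_000, 1_000_000):
    sample = rng.normal(loc=true_shift, scale=1.0, size=n)
    print(f"n = {n:>9,}: P-value ≈ {one_sided_z_pvalue(sample):.3g}")
```

  With these numbers the expected z-score is 0.2 at n = 100 but 20 at n = 1,000,000, so the million-observation P-value is essentially zero while the discrepancy remains the same trivial 0.02σ.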

  The fact that selective reporting leads to unreplicable results is an asset of significance tests: If you obtained your apparently impressive result by violating Fisherian strictures, preregistered tests will give you a much deserved hard time when it comes to replication. On the other hand, evidence of a statistical effect H does give a B-boost to H*, since if H* is true, a statistical effect follows (a statistical affirming of the consequent).
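
  To spell out the “B-boost” in one line (my notation, not the book’s): write E for the evidence of the statistical effect H. By Bayes’ theorem, H* gets a boost whenever it makes E more probable than E is on average:

```latex
% Bayes' theorem: the posterior-to-prior ratio for H* equals the likelihood ratio
% P(E | H*) / P(E). If H* entails the statistical effect, P(E | H*) is near 1,
% so H* is "B-boosted" whenever P(E) < 1 -- probabilistic affirming of the consequent.
\[
\frac{P(H^{*}\mid E)}{P(H^{*})} \;=\; \frac{P(E\mid H^{*})}{P(E)} \;>\; 1
\quad\text{whenever } P(E\mid H^{*}) > P(E).
\]
```

  Such a boost, of course, says nothing about how well probed H* is – which is the severe tester’s complaint.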

  Meehl’s critiques rarely mention the methodological falsificationism of Neyman and Pearson. Why is the field that cares about power – which is defined in terms of N-P tests – so hung up on simple significance tests? We’ll disinter the answer later on. With N-P tests, the statistical alternative to the null hypothesis is made explicit: the null and alternative exhaust the possibilities. There can be no illicit jumping of levels from statistical to causal (from H to H*). Fisher didn’t allow jumping either, but he was less explicit. Statistically significant increased yields in Fisher’s controlled trials on fertilizers, as Gigerenzer notes, are intimately linked to a causal alternative. If the fertilizer does not increase yield (H* is false, so ~H* is true), then no statistical increase is expected, if the test is run well.⁵ Thus, finding statistical increases (rejecting H₀) is grounds to falsify ~H* and find evidence of H*. Unlike the typical psychology experiment, here rejecting a statistical null very nearly warrants a statistical causal claim. If you want a statistically significant effect to (statistically) warrant H*, show:

  If ~H* is true (research claim H* is false), then H₀ won’t be rejected as inconsistent with data, at least not regularly.
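
  As a minimal sketch (my own, with made-up numbers) of what making the alternative explicit buys you in an N-P test: with H₀: μ = 0 against H₁: μ = μ₁ > 0 and σ known, both the type I error probability and the power against μ₁ can be computed before any data are collected.

```python
# Sketch of an N-P test with an explicit alternative: H0: mu = 0 vs H1: mu = mu1 > 0,
# sigma known. Both error probabilities are fixed in advance of the data.
from statistics import NormalDist

def np_test_power(mu1, sigma, n, alpha=0.05):
    """Power of the one-sided z-test (reject H0 when xbar exceeds the alpha cutoff)
    against the explicit alternative mu = mu1."""
    std_normal = NormalDist()
    cutoff = std_normal.inv_cdf(1 - alpha) * sigma / n ** 0.5   # reject H0 if xbar > cutoff
    # Power = P(xbar > cutoff | mu = mu1), with xbar ~ N(mu1, sigma^2 / n)
    return 1 - NormalDist(mu1, sigma / n ** 0.5).cdf(cutoff)

# Illustrative numbers only: a 0.5-sigma alternative with n = 50 observations.
print(f"alpha = 0.05, power against mu1 = 0.5: {np_test_power(0.5, 1.0, 50):.2f}")  # about 0.97
```

  The displayed condition asks for the analogous property one level up: that H₀ would rarely be rejected were the research claim H* false.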

  Psychology should move to an enlightened reformulation of N-P and Fisher (see Section 3.3). To emphasize the Fisherian (null hypothesis only) variety, we follow the literature in calling them “simple” significance tests. They are extremely important in their own right: They are the basis for testing assumptions without which statistical methods fail scientific requirements. View them as just one member of a panoply of error statistical methods.

  Statistics Can’t Fix Intrinsic Latitude.

  The problem Popper found with Freudian and Adlerian psychology is that any observed behavior could be readily interpreted through the tunnel of either theory. Whether a man jumped in the water to save a child or failed to save her, you can invoke Adlerian inferiority complexes or Freudian theories of sublimation or Oedipal complexes (Popper 1962, p. 35). Both Freudian and Adlerian theories can explain whatever happens. This latitude has nothing to do with statistics. As we learned from Exhibit (vi), Section 2.3, we should really speak of the latitude offered by the overall inquiry: research question, auxiliaries, and interpretive rules. If it has self-sealing facets to account for any data, then it fails to probe with severity. Statistical methods cannot fix this. Applying statistical methods is just window dressing. Notice that Freud/Adler, as Popper describes them, are amongst the few cases where the latitude really is part of the theory or terminology. It’s not obvious that Popper’s theoretical novelty bars this, unless one of Freud/Adler is deemed first. We’ve arrived at the special topical installation on:

  2.6 The Reproducibility Revolution (Crisis) in Psychology

  I was alone in my tastefully furnished office at the University of Groningen. … I opened the file with the data that I had entered and changed an unexpected 2 into a 4; then, a little further along, I changed a 3 into a 5. … When the results are just not quite what you’d so badly hoped for; when you know that that hope is based on a thorough analysis of the literature; … then, surely, you’re entitled to adjust the results just a little? … I looked at the array of data and made a few mouse clicks to tell the computer to run the statistical analyses. When I saw the results, the world had become logical again.

  (Stapel 2014, p. 103)

  This is Diederik Stapel describing his “first time” – when he was still collecting data and not inventing them. After the Stapel affair (2011), psychologist Daniel Kahneman warned that he “saw a train wreck looming” for social psychology and called for a “daisy chain” of replication to restore credibility to some of the hardest hit areas, such as priming studies (Kahneman 2012). Priming theory holds that exposure to an experience can unconsciously affect subsequent behavior. Kahneman (2012) wrote: “right or wrong, your field is now the poster child for doubts about the integrity of psychological research.” One of the outgrowths of this call was the 2011–2015 Reproducibility Project, part of the Center for Open Science Initiative at the University of Virginia. In a nutshell: This is a crowd-sourced effort to systematically subject published statistically significant findings to checks of reproducibility. In 2011, 100 articles from leading psychology journals from 2008 were chosen; in August of 2015, it was announced that only around 33% could be replicated (depending on how that was defined). Whatever you think of the results, it’s hard not to be impressed that a field could organize such a self-critical project, obtain the resources, and galvanize serious-minded professionals to carry it out.

  First, on the terminology: The American Statistical Association (2017, p. 1) calls a study “reproducible if you can take the original data and the computer code used … and reproduce all of the numerical findings …” In the case of Anil Potti, they couldn’t reproduce his numbers. By contrast, replicability refers to “the act of repeating an entire study, independently of the original investigator without the use of the original data (but generally using the same methods)” (ibid.).⁶ The Reproducibility Project, however, is really a replication project (as the ASA defines it). These points of terminology shouldn’t affect our discussion. The Reproducibility Project is appealing to what most people have in mind in saying a key feature of science is reproducibility, namely replicability.

  So how does the Reproducibility Project proceed? A team of (self-selected) knowledgeable replicators, using a protocol that is ideally to be approved by the initial researchers, reruns the study on new subjects. A failed replication occurs when there’s a non-statistically significant or negative result, that is, a P-value that is not small (say > 0.05). Does a negative result mean the original result was a false positive? Or that the attempted replication was a false negative? The interpretation of negative statistical results is itself controversial, particularly as they tend to keep to Fisherian tests, and effect sizes are often fuzzy. When RCTs fail to replicate observational studies, the presumption is that, were the effect genuine, the RCTs would have found it. That is why they are taken as an indictment of the observational study. But here, you could argue, the replication of the earlier research is not obviously a study that checks its correctness. Yet that would be to overlook the strengthened features of the replications in the 2011 project: they are preregistered, and are designed to have high power (against observed effect sizes). What is more, they are free of some of the “perverse incentives” of usual research. In particular, the failed replications are guaranteed to be published in a collective report. They will not be thrown in file drawers, even if negative results ensue.
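
  A rough sketch of what “designed to have high power against observed effect sizes” can amount to in practice (a textbook normal approximation with illustrative numbers of my own, not the project’s actual protocol): choose the replication’s per-group sample size so that, if the true standardized effect equaled the one originally observed, a two-sided test would detect it with, say, 90% power.

```python
# Sketch: per-group n for a two-sample, two-sided test to reach the stated power,
# assuming the true standardized effect equals the originally observed Cohen's d.
from math import ceil
from statistics import NormalDist

def n_per_group(d_observed, power=0.90, alpha=0.05):
    """Normal-approximation sample size per group for a two-sided two-sample test."""
    z = NormalDist().inv_cdf
    return ceil(2 * ((z(1 - alpha / 2) + z(power)) / d_observed) ** 2)

# Illustrative only: suppose the original study reported a standardized effect of d = 0.4.
print(n_per_group(0.4))   # about 132 subjects per group
```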

  Some ironic consequences immediately enter in thinking about the project. The replication researchers in psychology are the same people who hypothesize that a large part of the blame for lack of replication may be traced to the reward structure: to incentives to publish surprising and sexy studies, coupled with an overly flexible methodology opening the door to promiscuous QRPs (questionable research practices). Call this the flexibility, rewards, and bias hypothesis. Supposing this hypothesis is correct, as is quite plausible, what happens when non-replication becomes sexy and publishable? Might non-significance become the new significance? Science likely wouldn’t have published individual failures to replicate, but it welcomed the splashy OSC report of the poor rate of replication they uncovered, as well as back-and-forth updates by critics. Brand new fields of meta-research open up for replication specialists, all ostensibly under the appealing banner of improving psychology. Some ask: should authors be prey to results conducted by a self-selected group – results that could obviously impinge rather unfavorably on them? Many say no and even liken the enterprise to a witch hunt. Kahneman (2014) called for “a new etiquette” requiring original authors to be consulted on protocols:

  … tension is inevitable when the replicator does not believe the original findings and intends to show that a reported effect does not exist. The relationship between replicator and author is then, at best, politely adversarial. The relationship is also radically asymmetric: the replicator is in the offense, the author plays defense. The threat is one-sided because of the strong presumption in scientific discourse that more recent news is more believable.

  (p. 310)

  It’s not hard to find potential conflicts of interest and biases on both sides. There are the replicators’ attitudes – not only toward the claim under study, but toward the very methodology used to underwrite it – usually statistical significance tests. Every failed replication is seen (by some) as one more indictment of the method (never minding its use in showing irreplication). There’s the replicator’s freedom to stop collecting data once minimal power requirements are met, and the fact that subjects – often students, whose participation is required – are aware of the purpose of the study, revealed at the end. (They are supposed to keep it confidential over the life of the experiment, but is that plausible?) On the other hand, the door may be open too wide for the original author to blame any failed replication on lack of fidelity to nuances of the original study. Lost in the melee is the question of whether any constructive criticism is emerging.

  Incidentally, here’s a case where it might be argued that loss and cost functions are proper, since the outcome goes beyond statistical inference to reporting a failure to replicate Jane’s study, perhaps overturning her life’s research.

  What Might a Real Replication Revolution in Psychology Require?

  Even absent such concerns, the program seems to be missing the real issues that leap out at the average reader of the reports. The replication attempts in psychology stick to what might be called “purely statistical” issues: can we get a low P-value or not? Even in the absence of statistical flaws, research conclusions may be disconnected from the data used in their testing, especially when experimentally manipulated variables serve as proxies for variables of theoretical interest. A serious (and ongoing) dispute arose when a researcher challenged the team who failed to replicate her hypothesis that subjects “primed” with feelings of cleanliness, sometimes through unscrambling soap-related words, were less harsh in judging the immorality of such bizarre actions as eating your dog after it has been run over. A focus on the P-value computation ignores the larger question of the methodological adequacy of the leap from the statistical to the substantive. Is unscrambling soap-related words an adequate proxy for the intended cleanliness variable? The less said about eating your run-over dog, the better. At this point Bayesians might argue, “We know these theories are implausible, we avoid the inferences by invoking our disbeliefs.” That can work in some cases, except that the researchers find them plausible, and, more than that, can point to an entire literature on related studies, say, connecting physical and moral purity or impurity (part of “embodied cognition”), e.g., Schnall et al. 2008. The severe tester shifts the unbelievability assignment. What’s unbelievable is supposing their experimental method provides evidence for purported effects! Some philosophers look to these experiments on cleanliness and morality, and many others, to appraise their philosophical theories “experimentally.”⁷ Whether or not this is an advance over philosophical argument, philosophers should be taking the lead in critically evaluating the methodology, in psychology and, now, in philosophy.

  Our skepticism is not a denial that we may often use statistical tests to infer a phenomenon quite disparate from the experimental manipulations. Even an artificial lab setting can teach us about a substantive phenomenon “in the wild” so long as there are testable implications for the statistically modeled experiment. The famous experiments by Harlow, showing that monkeys prefer a cuddly mom to a wire-mesh mom that supplies food (Harlow 1958), are perfectly capable of letting us argue from coincidence to what matters to actual monkeys. Experiments in social psychology are rarely like that.

  The “replication revolution in psychology” won’t be nearly revolutionary enough until researchers subject to testing the methods and measurements intended to link statistics with what they really want to know. If you are an ordinary skeptical reader, outside psychology, you’re probably flummoxed that researchers blithely assume that role playing by students, unscrambling of words, and those long-standing 5-, 7-, or 10-point questionnaires are really measuring the intended psychological attributes. Perhaps it’s taboo to express this. Imagine that Stapel had not simply fabricated his data, and he’d found that students given a mug of M&M’s emblazoned with the word “capitalism” ate statistically significantly more candy than those with a scrambled word on their mug – as one of his make-believe studies proposed (Stapel 2014, pp. 127–8). Would you think you were seeing the effects of greed in action?

  Psychometrician Joel Michell castigates psychology for having bought the operationalist Stevens’ (1946, p. 667) “famous definition of measurement as ‘the assignment of numerals to objects or events according to rules’”, a gambit he considers a deliberate and “pathological” ploy to deflect “attention from the issue of whether psychological attributes are quantitative” to begin with (Michell 2008, p. 9). It’s easy enough to have a rule for assigning numbers on a Likert questionnaire, say on degrees of moral opprobrium (never OK, sometimes OK, don’t know, always OK), if it’s not required to have an independent source of its validity. (Are the distances between units really equal, as statistical analysis requires?) I prefer not to revisit studies against which it’s easy to take pot shots. Here’s a plausible phenomenon, confined, fortunately, to certain types of people.
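
  A toy sketch of the parenthetical worry about equal distances (my own made-up responses and codings, not Michell’s or the book’s): two numeric codings that both respect the category order can reverse which group has the higher mean, so a mean – and any test computed from it – quietly presupposes the spacing.

```python
# Sketch: the same ordinal responses under two order-preserving numeric codings.
# Which group "scores higher on average" depends entirely on the assumed spacing.
from statistics import mean

groups = {
    "A": ["sometimes OK"] * 6 + ["always OK"] * 4,
    "B": ["don't know"] * 10,
}

codings = {
    "equal spacing":   {"never OK": 1, "sometimes OK": 2, "don't know": 3,   "always OK": 4},
    "unequal spacing": {"never OK": 1, "sometimes OK": 2, "don't know": 2.1, "always OK": 7},
}

for name, code in codings.items():
    means = {g: mean(code[r] for r in responses) for g, responses in groups.items()}
    print(f"{name:15s}: mean(A) = {means['A']:.2f}, mean(B) = {means['B']:.2f}")
# equal spacing  : mean(A) = 2.80, mean(B) = 3.00  -> B looks higher
# unequal spacing: mean(A) = 4.00, mean(B) = 2.10  -> A looks higher
```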

  Macho Men: Falsifying Inquiries

  I have no doubts that certain types of men feel threatened by the success of their female partners, wives, or girlfriends – more so than the other way around. I’ve even known a few. Some of my female students, over the years, confide that their boyfriends were angered when the women got better grades than they did! I advise them to drop the bum immediately if not sooner. The phenomenon is backed up by field statistics (e.g., on divorce and salary differentials where a woman earns more than a male spouse, Thaler 2013⁸). As we used H (the statistical hypothesis) and H* (a corresponding causal claim), let’s write this more general phenomenon as ℋ. Can this be studied in the lab? Ratliff and Oishi (2013) “examined the influence of a romantic partner’s success or failure on one’s own implicit and explicit self-esteem” (p. 688). Their statistical studies show that

 
