
Statistical Inference as Severe Testing


by Deborah G. Mayo


  H: “men’s implicit self-esteem is lower when a partner succeeds than when a partner fails.” (ibid.)

  To take the weakest construal, H is the statistical alternative to a “no effect” null H0. The “treatment” is to think and write about a time their partner succeeded at something or failed at something. The effect will be a measure of “self-esteem,” obtained either explicitly, by asking: “How do you feel about yourself?” or implicitly, based on psychological tests of positive word associations (with “me” versus “other”). Subjects are randomly assigned to five “treatments”: think and write about a time your partner (i) succeeded, (ii) failed, (iii) succeeded when you failed, (iv) failed when you succeeded, or (v) a typical day (control) (ibid., p. 695). Here are a few of the several statistical null hypotheses (as no significant results are found among women, these allude to males thinking about female partners):

  (a) The average implicit self-esteem is no different when subjects think about their partner succeeding (or failing) as opposed to an ordinary day.

  (b) The average implicit self-esteem is no different when subjects think about their partner succeeding while the subject fails (“she does better than me”) as opposed to her merely succeeding at something.

  (c) The average implicit self-esteem is no different when subjects think about their partner succeeding as opposed to failing (“she succeeds at something”).

  (d) The average explicit self-esteem is no different under any of the five conditions.

  These statistical null hypotheses are claims about the distributions from which participants are sampled, limited to populations of experimental subjects – generally students who receive course credit. They merely assert that the treated and non-treated groups can be seen to come from the same population as regards the average effect in question.
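  To fix ideas, here is a minimal sketch of how a statistical null like (a) would be appraised as a two-sample comparison of mean implicit self-esteem scores. The numbers, group sizes, and scale below are hypothetical placeholders, not the study’s data; the point is only what “rejecting the statistical null” amounts to.

```python
# Minimal sketch: testing a null like (a) as a two-sample comparison of
# mean implicit self-esteem. All numbers are hypothetical placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical implicit self-esteem scores for two of the five conditions
partner_succeeded = rng.normal(loc=0.30, scale=0.40, size=30)  # condition (i)
ordinary_day = rng.normal(loc=0.45, scale=0.40, size=30)       # control (v)

# Null (a): no difference in average implicit self-esteem between the groups
t_stat, p_value = stats.ttest_ind(partner_succeeded, ordinary_day)
print(f"t = {t_stat:.2f}, two-sided P-value = {p_value:.3f}")

# A small P-value would reject the statistical null (a); by itself that says
# nothing about the causal claim H* or the higher-level theory.
```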

  None of these nulls can be statistically rejected except (c)! Each negative result is anomalous for H. Should they take the research hypothesis as disconfirmed? Or as casting doubt on their test? Or should they focus on the null hypotheses that were rejected, in particular null (c)? They opt for the third, viewing their results as “demonstrating that men who thought about their romantic partner’s success had lower implicit self-esteem than men who thought about their romantic partner’s failure” (ibid., p. 698). This is a highly careful wording. It refers only to a statistical effect, restricted to the experimental subjects. That’s why I write it as H. Of course they really want to infer a causal claim – that the self-esteem of the males studied is negatively influenced, on average, by female partner success of some sort: H*. More than that, they’d like the results to be evidence that H* holds in the population of men in general, and speaks to the higher level theory ℋ.

  On the face of it, it’s a jumble. We do not know if these negative results reflect negatively on a research causal hypothesis – even limited to the experimental population – or whether the implicit self-esteem measure is actually picking up on something else, or whether the artificial writing assignment is insufficiently relevant to the phenomenon of interest. The auxiliaries linking the statistical and the substantive, the audit of the P-values and the statistical assumptions – all are potential sources of blame as we cast about for a solution to the Duhemian challenge. Things aren’t clear enough to say the researchers should have regarded their research hypothesis as disconfirmed, much less falsified. This is the nub of the problem.

  What Might a Severe Tester Say?

  I’ll let her speak:

  It appears from failing to reject (a) that our “treatment” has no bearing on the phenomenon of interest. It was somewhat of a stretch to suppose that thinking about her “success” (examples given are dancing, cooking, solving an algebra problem) could really be anything like the day Ann got a raise while Mark got fired. Take null hypothesis (b). It was expected that “she beat me in X” would have a greater negative impact on self-esteem than merely “she succeeded at X.” Remember these are completely different groups of men, thinking about whatever it is they chose to. That the macho man’s partner bowled well one day should have been less deflating than her getting a superior score. We confess that the non-significant difference in (b) casts a shadow on whether the intended phenomenon is being picked up at all. We could have interpreted it as supporting our research hypothesis. We could view it as lending “some support to the idea that men interpret ‘my partner is successful’ as ‘my partner is more successful than me’” (ibid., p. 698). We could have reasoned that the two conditions show no difference because any success of hers is always construed by macho man as “she showed me up.” This skirts too close to viewing the data through the theory, to a self-sealing fallacy. Our results lead us to question whether this study is latching onto the phenomenon of interest. In fact, insofar as the general phenomenon ℋ (males taking umbrage at a partner’s superior performance) is plausible, it would imply no effect would be found in this artificial experiment. Thus spake the severe tester.

  I want to be clear that I’m not criticizing the authors for not proceeding with the severe tester’s critique; it would doubtless be considered outlandish and probably would not be accepted for publication. I deliberately looked at one of the better inquiries that also had a plausible research hypothesis. View this as a futuristic self-critical researcher.

  While we’re at it, are these implicit self-esteem tests off the table? Why? The authors admit that explicit self-esteem was unaffected (in men and women). Surely if explicit self-esteem had shown a significant difference, they would have reported it as support for their research hypothesis. Many psychology measurements not only lack a firm, independent check on their validity; if they disagree with more direct measurements, the disagreement is easily explained away or even taken as a point in their favor. Why do no differences show up on explicit measures of self-esteem? Available reasons: men do not want to admit their self-esteem goes down when their partner succeeds, or they might be unaware of it. Maybe so, but this assumes what hasn’t been vouchsafed. Why not revisit the subjects at a later date to compare their scores on implicit self-esteem? If we find no difference from their scores under the experimental manipulation, we’d have some grounds to deny it was validly measuring the effect of the treatment.
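  The proposed follow-up could be analyzed as a simple within-subject comparison. A minimal sketch, with hypothetical placeholder data rather than anything reported in the study:

```python
# Sketch of the proposed follow-up: re-measure the same subjects' implicit
# self-esteem later, outside the manipulation, and compare with their scores
# under the experimental condition. Data are hypothetical placeholders.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
under_manipulation = rng.normal(loc=0.30, scale=0.40, size=25)
weeks_later = under_manipulation + rng.normal(loc=0.0, scale=0.10, size=25)

t_stat, p_value = stats.ttest_rel(under_manipulation, weeks_later)
print(f"paired t = {t_stat:.2f}, P-value = {p_value:.3f}")

# If the later scores do not differ from those under the manipulation, that
# is some ground to doubt the measure picked up an effect of the treatment.
```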

  Here’s an incentive: they’re finding that the replication revolution has not made top psychology journals more inclined to publish non-replications – even of effects they have published. The journals want new, sexy effects. Here’s sexy: stringently test (and perhaps falsify) some of the seminal measurements or types of inquiry used in psychology. In many cases we may be able to falsify given studies. If that’s not exciting enough, imagine showing that some of the areas now studied admit of no robust, generalizable effects. You might say it would be ruinous to set out to question basic methodology. Huge literatures on the “well established” Macbeth effect, and many others besides, might come in for question. I said it would be revolutionary for psychology. Psychometricians are quite sophisticated, but their work appears separate from replication research. Who would want to undermine their own field? Already we hear of psychology’s new “spirit of self-flagellation” (Dominus 2017). It might be an apt job for philosophers of science, with suitable expertise, especially now that these studies are being borrowed in philosophy. 9

  A hypothesis to be considered must always be: the results point to the inability of the study to severely probe the phenomenon of interest. The goal would be to build up a body of knowledge on closing existing loopholes when conducting a type of inquiry. How do you give evidence of “sincerely trying (to find flaws)”? Show that you would find the studies poorly run, if the flaws were present. When authors point to other studies as offering replication, they should anticipate potential criticisms rather than showing “once again I can interpret my data through my favored theory.” The scientific status of an inquiry is questionable if it cannot or will not distinguish the correctness of inferences from problems stemming from a poorly run study. What must be subjected to grave risk are assumptions that the experiment was well run. This should apply as well to replication projects, now under way. If the producer of the report is not sufficiently self-skeptical, then we the users must be.

  Live Exhibit (vii): Macho Men.

  Entertainment on this excursion is mostly home grown. A reenactment of this experiment will do. Perhaps hand questionnaires to some of the males after they lose to their partners in shuffleboard or ping-pong. But be sure to include the most interesting information unreported in the study on self-esteem and partner success. Possibly it was never even looked at: what did the subjects write about? What kind of question would Mr. “My-self-esteem-goes-down-when-she-succeeds” elect to think and write about? Consider some questions that would force you to reinterpret even the statistically significant results.

  Exhibit (viii): The Multiverse.

  Gelman and Loken (2014) call attention to the fact that even without explicitly cherry picking, there is often enough leeway in the “forking paths” between data and inference so that a researcher may be led to a desired inference. Growing out of this recognition is the idea of presenting the results of applying the same statistical procedure, but with different choices along the way (Steegen et al. 2016). They call it a multiverse analysis. One lists the different choices thought to be plausible at each stage of data processing. The multiverse displays “… which constellation of choices corresponds to which statistical result” (p. 707).
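  In outline, a multiverse analysis is just systematic bookkeeping over the forking paths. The sketch below enumerates hypothetical stand-in choice points (not Steegen et al.’s actual specifications) and records the P-value each “universe” yields:

```python
# Rough sketch of a multiverse analysis: run the same test under every
# combination of defensible data-processing choices and record the P-value.
# The choice points and analysis function below are hypothetical placeholders.
import random
from itertools import product

fertility_cutoffs = ["cycle days 7-14", "cycle days 6-14", "cycle days 9-17"]
exclusion_rules = ["keep all", "drop irregular cycles"]
relationship_defs = ["single (strict)", "single (broad)"]

def run_analysis(cutoff, exclusion, relationship):
    """Placeholder: process the raw data under one set of choices, run the
    statistical test of interest, and return its P-value."""
    return random.uniform(0, 1)  # stand-in for a real analysis

multiverse = [
    (choices, run_analysis(*choices))
    for choices in product(fertility_cutoffs, exclusion_rules, relationship_defs)
]

# Display which constellation of choices corresponds to which result
for choices, p in multiverse:
    print(choices, f"P = {p:.3f}")
```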

  They consider an example from 2012 purporting to show that single women prefer Obama to Romney when they are highly fertile; the reverse when they’re at low fertility.

  In two studies with relatively large and diverse samples of women, we found that ovulation had different effects on women’s religious and political orientation depending on whether women were single or in committed relationships. Ovulation led single women to become more socially liberal, less religious, and more likely to vote for Barack Obama.

  (Durante et al. 2013, p. 1013)

  Unlike the Macho Men study, this one’s not intuitively plausible. In fact, it was pummeled so vehemently by the public that it had to be pulled off CNN. 10 Should elaborate statistical criticism be applied to such studies? I had considered them only human interest stories. But Gelman rightly finds in them some general lessons.

  One of the choice points in the ovulation study would be where to draw the line at “highly fertile” based on days in a woman’s cycle. It wasn’t based on any hormone check but on an online questionnaire asking when they’d had their last period. There’s latitude in using such information to decide whether to place someone in a low or high fertility group (Steegen et al. 2016, p. 705, find five sets of data that could have been used). It turned out that under the other five choice points, many of the results were not statistically significant. Each of the different consistent combinations of choice points could count as a distinct hypothesis, and you can then consider how many of them are statistically insignificant.

  If no strong arguments can be made for certain choices, we are left with many branches of the multiverse that have large P values. In these cases, the only reasonable conclusion on the effect of fertility is that there is considerable scientific uncertainty. One should reserve judgment …

  (ibid., p. 708)

  Reserve judgment? If we’re to apply our severe testing norms to such examples, and not dismiss them as entertainment only, then we’d go further. Here’s another reasonable conclusion: the core presumptions are falsified (or would be with little effort). Say each person with high fertility in the first study is tested for candidate preference next month, when she is in the low fertility stage. If the voting preferences are the same, the claimed effect is falsified. The spirit of their multiverse analysis is a quintessentially error statistical gambit. Anything that increases the flabbiness in uncovering flaws lowers the severity of the test that has passed (we’ll visit P-value adjustments later on). But the onus isn’t on us to give them a pass. As we turn to impressive statistical meta-critiques, what can be overlooked is whether the entire inquiry makes any sense. Readers will have many other tomatoes to toss at the ovulation research. Unless the overall program is falsified, the literature will only grow. We don’t have to destroy statistical significance tests when what we really want is to show that a lot of studies constitute pseudoscience.
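  A back-of-the-envelope calculation shows why such flexibility erodes severity. Assuming, purely for illustration, independent forks each tested at the 0.05 level:

```python
# Back-of-the-envelope sketch: with k defensible forks, each analyzed at the
# 0.05 level, the chance of at least one nominally significant result under a
# true null grows quickly. (Independent forks are assumed purely for
# illustration; real forks are correlated, but the direction of the effect
# is the same.)
for k in [1, 3, 5, 10, 20]:
    prob_at_least_one = 1 - 0.95 ** k
    print(f"{k:2d} forks: P(at least one P < 0.05 | null) = {prob_at_least_one:.2f}")
```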

  Souvenir G: The Current State of Play in Psychology

  Failed replications, we hear, are creating a “cold war between those who built up modern psychology and those” tearing it down with failed replications (Letzter 2016). The severe tester is free to throw some fuel on both fires.

  The widespread growth of preregistered studies is all to the good; it’s too early to see if better science will result. Still, credit is due to those sticking their necks out to upend the status quo. I say it makes no sense to favor preregistration and also deny the relevance to evidence of optional stopping and outcomes other than the one observed. That your appraisal of the evidence is altered when you actually see the history supplied by the registered report is equivalent to worrying about biasing selection effects when they’re not written down; your statistical method should pick up on them.
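  The optional stopping point can be made concrete with a small simulation, generic rather than tied to any study discussed here: keep peeking and stop at the first nominally significant result, and the actual Type I error probability is inflated well beyond the advertised level.

```python
# Simulation sketch: optional stopping. Peek after every batch of
# observations and stop as soon as P < 0.05. Even though the null is true,
# the rate of (false) rejections climbs well above the nominal 0.05.
# Numbers are generic illustrations, not from any study discussed here.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n_trials, batch, max_n = 2000, 10, 200
false_rejections = 0

for _ in range(n_trials):
    data = np.empty(0)
    while data.size < max_n:
        data = np.concatenate([data, rng.normal(0, 1, batch)])  # null is true
        if stats.ttest_1samp(data, popmean=0).pvalue < 0.05:
            false_rejections += 1
            break

print(f"Type I error rate with optional stopping: {false_rejections / n_trials:.2f}")
```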

  By reviewing the hypotheses and analysis plans in advance, RRs (registered reports) should also help neutralize P-hacking and HARKing (hypothesizing after the results are known) by authors, and CARKing (critiquing after the results are known) by reviewers with their own investments in the research outcomes, although empirical evidence will be required to confirm that this is the case.

  (Munafò et al. 2017, p. 5)

  The papers are provisionally accepted before the results are in. To the severe tester, that requires the author to explain how she will pinpoint blame for negative results. I see nothing in preregistration, in and of itself, to require that. It would be wrong-headed to condemn CARKing: post-data criticism of assumptions and inquiries into hidden biases might be altogether warranted. For instance, one might ask about the attitude toward the finding conveyed by the professor: what did the students know and when did they know it? Of course, they must not be ad hoc saves of the finding.

  The field of meta-research is bursting at the seams: distinct research into changing incentives is under way. The severe tester may seem jaundiced in raising qualms, but she doesn’t automatically assume that research into incentivizing researchers to behave in a fashion correlated with good science – data sharing, preregistration – is itself likely to improve the original field. Not without thinking through what would be needed to link statistics up with the substantive research problem. In some fields, one wonders if they would be better off ignoring statistical experiments and writing about plausible conjectures about human motivations, prejudices, or attitudes, perhaps backed by interesting field studies. It’s when researchers try to test them using sciency methods that the project becomes pseudosciency.

  2.7 How to Solve the Problem of Induction Now

  Viewing inductive inference as severe testing, the problem of induction is transformed into the problem of showing the existence of severe tests and methods for identifying insevere ones. The trick isn’t to have a formal, context-free method that you can show is reliable – as with the traditional problem of induction; the trick is to have methods that alert us when an application is shaky. As a relaxing end to a long tour, our evening speaker on ship, a severe tester, will hold forth on statistics and induction.
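  To give one concrete instance of assessing severity in a simple statistical setting, here is a minimal sketch for a one-sided test of a Normal mean with known standard deviation; the numbers are illustrative only, and the formal treatment comes later in the book.

```python
# Minimal sketch of a severity assessment for a one-sided test of a Normal
# mean (sigma known). For the claim C: mu > mu1, given observed mean xbar
# from n observations, severity is the probability of a less extreme result
# were mu only mu1:
#   SEV(mu > mu1) = Pr(Xbar <= xbar; mu = mu1) = Phi((xbar - mu1) / (sigma / sqrt(n)))
# Numbers below are illustrative only.
from math import sqrt
from scipy.stats import norm

def severity(xbar, mu1, sigma, n):
    return norm.cdf((xbar - mu1) / (sigma / sqrt(n)))

xbar, sigma, n = 0.4, 1.0, 100  # observed mean, known sd, sample size
for mu1 in [0.0, 0.2, 0.3, 0.4, 0.5]:
    print(f"SEV(mu > {mu1:.1f}) = {severity(xbar, mu1, sigma, n):.3f}")

# The same data warrant mu > 0 with severity near 1, but mu > 0.4 only with
# severity 0.5: some claims pass severely, others do not.
```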

  Guest Speaker: A Severe Tester on Solving Induction Now

  Here’s his talk:

  For a severe tester like me, the current and future problem of induction is to identify fields and inquiries where inference problems are solved efficiently, and ascertain how obstacles are overcome – or not. You’ve already assembled the ingredients for this final leg of Tour II, including: lift-off, convergent arguments (from coincidence), pinpointing blame (Duhem’s problem), and falsification. Essentially, the updated problem is to show that there exist methods for controlling and assessing error probabilities. Does that seem too easy? The problem has always been rather minimalist: to show at least some reliable methods exist; the idea being that they could then be built upon. Just find me one. They thought enumerative induction was the one, but it’s not. I will examine four questions: 1. What warrants inferring a hypothesis that stands up to severe tests? 2. What enables induction (as severe testing) to work? 3. What is Neyman’s quarrel with Carnap? and 4. What is Neyman’s empirical justification for using statistical models?

  1. What Warrants Inferring a Hypothesis that Passes Severe Tests?

  Suppose it is agreed that induction is severe testing. What warrants moving from H’s passing a severe test to inferring H? Even with a strong argument from coincidence, akin to my weight gain showing up on myriad calibrated scales, there is no logical inconsistency in invoking a hypothesis of conspiracy: all these instruments conspire to produce results as if H were true but in fact H is false. The ultra-skeptic may invent a rigged hypothesis R:

  R: Something other than H actually explains the data

  without actually saying what this something else is. That is, we’re imagining the extreme position of someone who simply asserts: H is actually false, although everything is as if it’s true. Weak severity alone can block inferring a generic rigged hypothesis R as a way to discount a severely tested H. It can’t prevent you from stopping there and never allowing that a hypothesis is warranted. (Weak severity merely blocks inferring claims when little if anything has been done to probe them.) Nevertheless, if someone is bound to discount a strong argument for H by rigging, then she will be adopting a highly unreliable method. Why? Because a conspiracy hypothesis can always be adduced! Even with claims that are true, or where problems are solved correctly, you would have no chance of finding this out. I began with the stipulation that we wish to learn. Inquiry that blocks learning is pathological. Thus, because I am a severe tester, I hold both strong and weak severity:

 
