Psychology’s Crisis
Ellen Winner
Professor of psychology, Boston College
The field of psychology is experiencing a crisis. Our studies do not replicate. When Science recently published the results of attempts to replicate 100 studies, those results were not confidence-inspiring, to say the least.* Average effect sizes declined substantially, and while 97 percent of the original papers reported significant p values, only 36 percent of the replications did.
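A toy simulation makes it easier to see why the two numbers move together. The figures below (a small true effect, thirty participants per group, publication only of results reaching p < .05 in the expected direction) are my own illustrative assumptions, not data from the Science project; the point is simply that filtering on significance inflates published effect sizes, so honest replications of the same designs will, on average, come in smaller and reach significance less often.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Illustrative assumptions (mine, not the Science project's data): a small
# true effect, modest samples, and a journal that publishes only results
# reaching p < .05 in the expected direction.
true_effect = 0.2        # true standardized mean difference (Cohen's d)
n_per_group = 30
n_studies = 10_000

published = []
for _ in range(n_studies):
    treatment = rng.normal(true_effect, 1.0, n_per_group)
    control = rng.normal(0.0, 1.0, n_per_group)
    t, p = stats.ttest_ind(treatment, control)
    if p < 0.05 and t > 0:                      # the publication filter
        pooled_sd = np.sqrt((treatment.var(ddof=1) + control.var(ddof=1)) / 2)
        published.append((treatment.mean() - control.mean()) / pooled_sd)

print("true effect size:", true_effect)
print("mean published effect size:", round(float(np.mean(published)), 2))
# The published mean lands far above the true 0.2, so unfiltered replications
# of the same designs will, on average, report much smaller effects.
```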
The same difficulty in reproducing findings is found in other scientific fields. Psychology is not alone. We know why so many studies that don’t replicate were published in the first place: because of the intense pressure to publish in order to get tenure and grants and teach fewer courses—and because of journals’ preference for publishing counterintuitive findings over less surprising ones. But it is worth noting that one-shot priming studies are far more likely to be flukes than longitudinal descriptive studies (e.g., studies examining changes in language in the second year of life) and qualitative studies (e.g., studies in which people are asked to reflect on and explain their responses and those of others).
In reaction to these jarring findings, journals are now changing their policies. No longer will they accept single studies with small sample sizes and p values hovering just below .05. But this is only the first step. Because new policies will result in fewer publications per researcher, universities will have to change their hiring, tenure, and reward systems, and so will granting and award-giving agencies. We need to stop the lazy practice of counting publications and citations, and instead read critically for quality. That takes time.
Good will come of this. Psychology will report findings that are more likely to be true, less likely to lead to urban myths. This will enhance the field’s reputation and, more important, our understanding of human nature.
The Truthiness of Scientific Research
Judith Rich Harris
Independent investigator and theoretician; author, The Nurture Assumption
The topic itself is not new. For decades, there have been rumors about famous historical scientists like Newton, Kepler, and Mendel; the charge was that their research results were too good to be true. They must have faked the data, or at least prettied it up a bit. But Newton, Kepler, and Mendel nonetheless retained their seats in the Science Hall of Fame. The usual reaction of those who heard the rumors was a shrug. So what? They were right, weren’t they?
What’s new is that nowadays everyone seems to be doing it, and they’re not always right. In fact, according to John Ioannidis, they’re not even right most of the time. John Ioannidis is the author of a paper titled “Why Most Published Research Findings Are False,” which appeared in a medical journal in 2005.* Nowadays this paper is described as “seminal,” but at first it received little attention outside the field of medicine, and even medical researchers didn’t seem to be losing sleep over it.
Then people in my own field, psychology, began to voice similar doubts. In 2011, the journal Psychological Science published a paper titled "False-Positive Psychology: Undisclosed Flexibility in Data Collection and Analysis Allows Presenting Anything as Significant." In 2012, the same journal published a paper on "the prevalence of questionable research practices."* In an anonymous survey of more than 2,000 psychologists, 53 percent admitted that they had failed to report all of a study’s dependent measures, 38 percent had decided to exclude data after calculating the effect it would have on the outcome, and 16 percent had stopped collecting data earlier than planned because they had gotten the results they were looking for.
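The last of those practices, stopping data collection once the numbers look right, is exactly the "undisclosed flexibility" the 2011 paper had in mind, and a minimal simulation shows why it matters. The batch size, maximum sample, and number of runs below are illustrative choices of mine, not figures from any of the cited studies; the only claim is the direction of the bias.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Illustrative setup (my own numbers, not the survey's): there is NO real
# effect, but the simulated researcher runs a t test after every 10 new
# participants per group and stops as soon as p < .05, up to 100 per group.
n_sims, batch, max_n = 5_000, 10, 100
false_positives = 0

for _ in range(n_sims):
    group_a, group_b = [], []
    while len(group_a) < max_n:
        group_a.extend(rng.normal(0, 1, batch))   # both groups come from the
        group_b.extend(rng.normal(0, 1, batch))   # same population: no effect
        if stats.ttest_ind(group_a, group_b).pvalue < 0.05:
            false_positives += 1                  # "found" an effect that isn't there
            break

print("false-positive rate with optional stopping:", false_positives / n_sims)
# Prints a rate well above the nominal 5 percent, which is how undisclosed
# early stopping lets researchers present nearly anything as significant.
```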
The final punch landed in August 2015. The news was published first in the journal Science and quickly announced to the world by the New York Times, under a headline that was surely facetious: “Psychologists welcome analysis casting doubt on their work.” The article itself painted a more realistic picture. “The field of psychology sustained a damaging blow,” it began. “A new analysis found that only 36 percent of findings from almost 100 studies in the top three psychology journals held up when the original experiments were rigorously redone.” On average, effects found in the replications were only half the magnitude of those reported in the original publications.
Why have things gone so badly awry in psychological and medical research? And what can be done to put them right again?
I think there are two reasons for the decline of truth and the rise of truthiness in scientific research. First, research is no longer something people do for fun, because they’re curious. It has become something that people are required to do, if they want a career in the academic world. Whether they enjoy it or not, whether they are good at it or not, they’ve got to turn out papers every few months or their career is down the tubes. The rewards for publishing have become too great relative to the rewards for doing other things, such as teaching. People are doing research for the wrong reasons: not to satisfy their curiosity but to satisfy their ambitions. There are too many journals publishing too many papers. Most of what’s in them is useless, boring, or wrong. The solution is to stop rewarding people on the basis of how much they publish. Surely the tenure committees at great universities could come up with other criteria on which to base their decisions!
The second thing that has gone awry is the vetting of research papers. Most journals send out submitted manuscripts for review. The reviewers are unpaid experts in the same field, who are expected to read the manuscript carefully, make judgments about the importance of the results and the validity of the procedures, and put aside any thoughts of how the publication of this paper might affect their own prospects. It’s a hard job that has gotten harder over the years, as research has become more specialized and data analysis more complex. I propose that this job be performed by paid experts—accredited specialists in the analysis of research. Perhaps this could provide an alternative path into academia for people who don’t particularly enjoy the nitty-gritty of doing research but love ferreting out the flaws in the research of others.
In Woody Allen’s movie Sleeper, set 200 years in the future, a scientist explains that people used to think that wheat germ was healthy and that steak, cream pie, and hot fudge were unhealthy—“precisely the opposite of what we now know to be true.” It’s a joke that hits too close to home. Bad science gives science a bad name. Whether wheat germ is or isn’t good for people is a minor matter. But whether people believe in scientific research or scoff at it is of crucial importance to the future of our planet and its inhabitants.
Blinded by Data
Gary Klein
Psychologist; senior scientist, MacroCognition LLC; author, Seeing What Others Don’t
The 23 October 2015 issue of the journal Science reported a feel-good story about how some children in India had received cataract surgery and were able to see. On the surface, nothing in this incident should surprise us. Ready access to cataract surgery is something we take for granted. But the story is not that simple.
The children had been born with cataracts. They had never been able to see. By the time their condition was diagnosed—they came from impoverished and uneducated families in remote regions—the regional physicians had told the parents that it was too late for a cure because the children were past a critical period for gaining vision. Nevertheless, a team of eye specialists arranged for the cataract surgery to be performed even on teenagers. Now, hundreds of formerly blind children are able to see. After having the surgery four years earlier, one young man of twenty-two can ride a bicycle through a crowded market.
The concept of a critical period for developing vision was based on studies that David Hubel and Torsten Wiesel performed on cats and monkeys. The results showed that without visual signals during a critical period of development, vision is impaired for life. For humans, this critical window was thought to close tight by the time a child was eight years old. (For ethical reasons, no comparable studies were run on humans.) Hubel and Wiesel were awarded the Nobel Prize in 1981 for their work. And physicians around the world stopped performing cataract surgery on children older than eight. The data were clear. But they were wrong. The results of the cataract surgeries on Indian teenagers disprove the critical-period data.
In this light, an apparent “feel-good” story becomes a “feel-bad” story about innumerable other children who were denied cataract surgery because they were too old. Consider all the children who endured a lifetime of blindness because of excessive faith in misleading data.
The theme of excessive faith in data was illustrated by another 2015 news item. Brian Nosek and a team of researchers set out to replicate 100 high-profile psychology experiments that had been performed in 2008. They reported their findings in the 28 August 2015 issue of Science. Only about a third of the original findings were replicated, and even for these, the effect sizes were much smaller than in the initial reports.
Other fields have run into the same problem. A few years ago the journal Nature reported that most of the cancer studies selected for review could not be replicated. In October 2015, Nature devoted a special issue to exploring various ideas for reducing the number of non-reproducible findings. Many others have begun examining how to reduce the chances of unreliable data.
I think this is the wrong approach. It exemplifies the bedrock bias: a desire for a firm piece of evidence that can be used as a foundation for deriving inferences.
Scientists appreciate the tradeoff between Type I errors (detecting effects that aren’t actually present—false positives) and Type II errors (failing to detect an effect that is present—false negatives). When you put more energy into reducing Type I errors, you run the risk of increasing Type II errors, missing findings and discoveries. Thus we might change the required significance level from .05 to .01, or even .001, to reduce the chances of a false positive, but in so doing we would greatly increase the false negatives.
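A small simulation makes the tradeoff concrete. The sample size, true effect size, and thresholds below are illustrative assumptions of mine rather than figures from any study discussed here; what matters is only the direction in which the two error rates move as the significance level tightens.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Illustrative numbers (mine): 30 participants per group, a moderate true
# effect of 0.4 for the "real" studies, 5,000 simulated studies per case.
n, n_sims, true_effect = 30, 5_000, 0.4

def simulated_p_values(effect):
    """p values from two-sample t tests on studies with the given true effect."""
    return np.array([
        stats.ttest_ind(rng.normal(effect, 1, n), rng.normal(0, 1, n)).pvalue
        for _ in range(n_sims)
    ])

p_null = simulated_p_values(0.0)          # no effect: "significant" = Type I error
p_real = simulated_p_values(true_effect)  # real effect: "not significant" = Type II error

for alpha in (0.05, 0.01, 0.001):
    print(f"alpha={alpha}:  Type I rate {np.mean(p_null < alpha):.3f},"
          f"  Type II rate {np.mean(p_real >= alpha):.3f}")
# Tightening alpha cuts the false positives but steadily raises the share of
# real effects that go undetected, which is the tradeoff described above.
```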
The bedrock bias encourages us to make extreme efforts to eliminate false positives, but that approach would slow progress. A better perspective is to give up the quest for certainty and accept the possibility that any datum may be wrong. After all, skepticism is a mainstay of the scientific enterprise.
I recall a conversation with a decision researcher who insisted that we cannot trust our intuitions; instead, we should trust the data. I agreed that we should never trust intuitions (we should listen to our intuitions but evaluate them), but I didn’t agree that we should trust the data. There are too many examples, as described above, where the data can blind us.
What we need is the ability to draw on relevant data without committing ourselves to their validity. We need to be able to derive inferences, make speculations, and form anticipations in the face of ambiguity and uncertainty. And to do that, we will need to overcome the bedrock bias—to free ourselves from the expectation that we can trust the data.
I’m not arguing that it’s OK to get the research wrong—witness the consequence of all the Indian children who suffered unnecessary blindness. My argument is that we shouldn’t ignore the possibility that the data might be wrong. The team of Indian eye specialists responded to anecdotes about cases of recovered vision and explored the possible benefits of cataract surgery past the critical period.
The heuristics-and-biases community has done an impressive job of sensitizing us to the limits of our heuristics and intuitions. Perhaps we need a parallel effort to sensitize us to the limits of the data—a research agenda demonstrating the kinds of traps we fall into when we trust the data too much. This agenda might examine the underlying causes of the bedrock bias, and possible methods for de-biasing ourselves. A few cognitive scientists have performed experiments on the difficulty of working with ambiguous data, but I think we need more: a larger, coordinated research program.
Such an enterprise would have implications beyond the scientific community. We live in an era of Big Data, an era in which quants are taking over Wall Street, an era of evidence-based strategies. In a world that is increasingly data-centered, there may be value in learning how to work with imperfect data.
The Epistemic Trainwreck of Soft-Side Psychology
Philip Tetlock
Psychologist and political scientist; Annenberg University Professor, University of Pennsylvania; co-author (with Dan Gardner), Superforecasting
Thirty-five years ago, I was an insecure assistant professor at the University of California at Berkeley, and a curmudgeonly senior colleague from the hard-science side of psychology took me aside to warn me that I was wasting whatever scientific talent I might have. My field, broad-brushed as the soft side of psychology, was well intentioned but premature. Soft-siders wanted to help people but they hadn’t a clue how to do it.
Now I get to play curmudgeon. The recent wave of disclosures about the non-replicability of many soft-side research phenomena suggests that my skeptical elder knew more than I then realized. The big soft-side scientific news is that a disconcertingly large fraction of that research does not bear close scrutiny. The exact fraction is hard to gauge; my current guess is at least 25 percent and perhaps as high as 50 percent. But historians of science will not have a hard time portraying this epistemic trainwreck as retrospectively inevitable. Social psychology and overlapping disciplines evolved into fields that incentivized scholars to get over the talismanic p < .05 significance line to support claims to original discoveries, and disincentivized the grunt work of assessing replicability and scoping out boundary conditions. And as Duarte et al. point out in Behavioral and Brain Sciences ("Political Diversity Will Improve Social Psychological Science"), the growing political homogeneity of the field selectively incentivized the production of counterintuitive findings that would jar the public into realizing how deeply unconsciously unfair the social order is. This has proved a dangerous combination.
In our rushed quest to establish our capacity for surprising smart outsiders while also helping those who had long gotten the short end of the status stick, soft-siders forgot the normative formula that Robert Merton offered in 1942 for successful social science: the CUDOS norms (Communalism, Universalism, Disinterestedness, and Organized Skepticism) for protecting us from absurdities like Stalinist genetics and Aryan physics. The road to scientific hell is paved with political intentions, some well intentioned, some maniacally evil. If you value science as a purely epistemic game, the effects are equally corrosive. When you replace the pursuit of truth with the protection of dogma, you get politically/religiously tainted knowledge. Mertonian science imposes monastic discipline; it bars even flirting with ideologues.
I timed my birth badly, but those entering the field today should see the trainwreck as a gold mine. My generation’s errors are their opportunities. Silicon-Valley-powered soft science gives us the means of enforcing Mertonian norms of radical transparency in data collection, sharing, and interpretation. We can now run forecasting tournaments around Open Science Collaborations in which clashing schools of thought ante up their predictions on the outcomes of well-designed, large-sample-size studies, earning or losing credibility as a function of rigorously documented track records rather than who can sneak what by which sympathetic editors. Once incentives and norms are reasonably aligned, soft science should firm up fast. Too bad I cannot bring myself to believe in reincarnation.
Science Itself
Paul Bloom
Ragen Professor of Psychology and Cognitive Science, Yale University; author, Just Babies: The Origins of Good and Evil
The most exciting recent scientific news is about science itself: how it is funded, how scientists communicate with one another, how findings get distributed to the public—and how it can go wrong. My own field of psychology has been Patient Zero here, with well-publicized cases of fraud, failures to replicate important studies, and a host of concerns, some of them well founded, about how we do our experiments and analyze our results.
There’s a lot to complain about with regard to how this story has played out in the popular press and over social media. Psychology—and particularly social psychology—has been unfairly singled out. The situation is at least as bad in other fields, such as cancer research. More important, legitimate concerns have been exaggerated and used by partisans on both the left and the right to dismiss any findings that don’t fit their interests and ideologies.
But it’s a significant story, and a lot of good can come from it. It’s important for non-scientists to have some degree of scientific literacy, and this means more than a familiarity with certain theories and discoveries. It requires an appreciation of how science works, and how it stands apart from other human activities, most notably religion. A serious public discussion of what scientists are doing wrong and how they can do better will not only lead to better science but will help advance scientific understanding more generally.
A Compelling Explanation for Scientific Misconduct
Leo M. Chalupa
Neurobiologist; Vice President for Research, George Washington University
There were plenty of remarkable discoveries this past year, my favorite demonstrating that running promotes the generation of new neurons in the aging brain. That prompted me to get back on the treadmill. But the big scientific news story for me was not any single event; rather, I have been struck by the emergence over the past several years of two related trends in the scientific world. Neither of these has made front-page news, although both are well known to those of us in the science business.
The first of these is the apparent increase in the reported incidence of research findings that cannot be replicated. The causes for this are myriad. In some cases, it’s simply because some vital piece of information, required to repeat a given experiment, has been inadvertently (or at times intentionally) omitted. More often, it is the result of sloppy work—poor experimental design, inappropriate statistical analysis, lack of appropriate controls.