
The Scientific Attitude


by Lee McIntyre


  Examples surround us. Irreproducible studies on morning TV shows that report correlations between eating grapefruit and having an innie belly button, or between coffee and cancer, rightly make us suspicious. The definitive example, however, was given in Simmons’s original paper, where researchers were able—by judicious manipulation of their degrees of freedom—to prove that listening to the Beatles song “When I’m 64” had an effect on the listener’s age. Note that researchers did not show that listening to the song made one feel younger; they showed that it actually made one younger.37 Demonstrating something that is causally impossible is the ultimate indictment of one’s analysis. Yet p-hacking is not normally considered to be fraud.

  In most cases researchers could be honestly making necessary choices about data collection and analysis, and they could really believe they are making the correct choices, or at least reasonable choices. But their bias will influence those choices in ways that researchers may not be aware of. Further, researchers may simply be using the techniques that “work”—meaning they give the results the researcher wants.38
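  To see how much damage even honest-seeming flexibility can do, consider a minimal simulation sketch, written here in Python. It is not taken from Simmons’s paper; the sample sizes, the number of outcome measures, and the helper function flexible_analysis are illustrative assumptions. The simulated experiment contains no real effect at all, yet an analyst who is free to test several correlated outcome measures and report only the best-looking comparison will cross the 0.05 threshold noticeably more often than the nominal 5 percent of the time.

import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

def flexible_analysis(n_per_group=20, n_outcomes=3):
    """One simulated experiment with truly no effect, in which the analyst
    may test several correlated outcome measures (a researcher degree of
    freedom) and report only the smallest p-value."""
    base_a = rng.normal(size=n_per_group)  # control group's underlying scores
    base_b = rng.normal(size=n_per_group)  # "treatment" group's scores, drawn from the same population
    p_values = []
    for _ in range(n_outcomes):
        # Each outcome variable is just a noisy re-measurement of the same scores.
        y_a = base_a + rng.normal(size=n_per_group)
        y_b = base_b + rng.normal(size=n_per_group)
        p_values.append(stats.ttest_ind(y_a, y_b).pvalue)
    return min(p_values)

n_simulations = 5000
hits = sum(flexible_analysis() < 0.05 for _ in range(n_simulations))
print(f"False-positive rate with 3 outcomes to choose from: {hits / n_simulations:.1%}")
# Nominally this should be 5%; even this one modest freedom pushes it noticeably higher.

  Combining several such degrees of freedom (optional stopping, dropping experimental conditions, adding covariates after peeking at the data) drives the rate higher still, which is why the disclosure requirements discussed later in this section matter.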

  Add to this the pressures of competition for research funding and the “publish or perish” environment to win tenure or other career advancement at most colleges and universities, and one has created an ideal motivational environment to “make the correct choices” about one’s work. As Uri Simonsohn, one of the coauthors of the original study, put it: “everyone has p-hacked a little.”39 Another commentator on this debate, however, sounded a more ominous warning in the title of his paper: “Why Most Published Research Findings Are False.”40

  But is all of this really so bad? Of course scientists only want to report positive results. Who would want to read through all of those failed hypotheses or see all the data about things that didn’t work? Yet this kind of selective reporting—some might call it “burying”—is precisely what some pseudoscientists do to make themselves appear more scientific. In his excellent book Understanding Scientific Reasoning, Ron Giere outlines one such example: Jeane Dixon and other “futurists” brag of being able to predict the future, which they do by making thousands of predictions about the coming year and then reporting only the ones that came true at year’s end.41
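  The arithmetic behind such selective reporting is worth making explicit. Here is a toy sketch in Python, with invented numbers: if a “futurist” makes a thousand guesses at the start of the year and each has even a small chance of coming true by accident, the year-end highlight reel practically writes itself.

import random

random.seed(7)

N_PREDICTIONS = 1000      # guesses published in January (invented figure)
LUCK_PER_GUESS = 0.05     # assumed chance each guess comes true by accident

hits = sum(random.random() < LUCK_PER_GUESS for _ in range(N_PREDICTIONS))
print(f"Predictions made in January: {N_PREDICTIONS}")
print(f"'Successful' predictions trumpeted in December: {hits}")  # roughly 50
print("Failed predictions mentioned in December: 0")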

  To many, all of this will seem to be a bastardization of the scientific attitude because, although one is looking at the evidence, it is being used in an inappropriate way. With no prior hypothesis in mind to test, evidence is gathered merely for the sake of “data” to be mined. If one is simply on a treasure hunt for “p” (something that looks significant even though it probably isn’t), where is the warrant for one’s beliefs? One can’t help but think here of Popper and his admonition about positive instances. If one is looking for positive instances, they are easy to find! Furthermore, without a hypothesis to test, one has done further damage to the spirit of scientific investigation. We are no longer building a theory—and possibly changing it depending on what the evidence says—but only reporting correlations, even if we don’t know why they occur or may even suspect that they are spurious.

  Clearly, p-hacking is a challenge to the thesis that scientists embrace the scientific attitude. But unless one is willing to discard a large proportion of unreproducible studies as unscientific, what other choice do we have? Yet remember my claim that what makes science special is not that it never makes mistakes, but that it responds to them in a rigorous way. And here the scientific community’s response to the p-hacking crisis strongly supports our confidence in the scientific attitude.

  First, it is important to understand the scope of the p-value and the job that science has asked of it. It was never intended to be the sole measure of significance.

  [When] Ronald Fisher introduced the P value in the 1920s, he did not mean it to be a definitive test. He intended it simply as an informal way to judge whether evidence was significant in the old-fashioned sense: worthy of a second look. The idea was to run an experiment, then see if the results were consistent with what random chance might produce. … For all the P value’s apparent precision, Fisher intended it to be just one part of a fluid, non-numerical process that blended data and background knowledge to lead to scientific conclusions. But it soon got swept into a movement to make evidence-based decision-making as rigorous and objective as possible. … “The P value was never meant to be used the way it’s used today.”42

  Another problem with the p-value is that it is commonly misunderstood, even by the people who routinely use it:

  The p-value is easily misinterpreted. For example, it is often equated with the strength of a relationship, but a tiny effect size can have very low p-values with a large enough sample size. Similarly, a low p-value does not mean that a finding is of major … interest.43

  In other words, p-values cannot measure the size of an effect, only how unlikely the observed data would be if chance alone were at work. Other misconceptions surround how to interpret the probabilities expressed by a p-value. Does a study with a 0.01 p-value mean that there is just a 1 percent chance of the result being spurious?44 Actually, no. This cannot be determined without prior knowledge of the likelihood of the effect. In fact, according to one calculation, “a P value of 0.01 corresponds to a false-alarm probability of at least 11%, depending on the underlying probability that there is a true effect.”45
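  One way to arrive at a figure like that 11 percent is the “minimum Bayes factor” bound, which asks how strongly a given p-value could possibly favor a real effect and then converts that into a posterior probability. The short Python sketch below assumes fifty-fifty prior odds that a real effect exists (the most favorable case; long-shot hypotheses fare worse), and the function name is mine, invented for illustration rather than taken from the source of the quotation.

import math

def false_alarm_lower_bound(p_value, prior_prob_real_effect=0.5):
    """Lower bound on the chance that a 'significant' finding is a false
    alarm, using the minimum Bayes factor bound -e * p * ln(p), which is
    valid for p < 1/e. Name and defaults are illustrative assumptions."""
    if not 0.0 < p_value < 1.0 / math.e:
        raise ValueError("the -e * p * ln(p) bound applies only for 0 < p < 1/e")
    min_bayes_factor = -math.e * p_value * math.log(p_value)  # best case for the effect being real
    prior_odds_null = (1.0 - prior_prob_real_effect) / prior_prob_real_effect
    posterior_odds_null = prior_odds_null * min_bayes_factor
    return posterior_odds_null / (1.0 + posterior_odds_null)

# Even granting 50-50 prior odds that the effect is real, p = 0.01 leaves
# roughly an 11% chance of a false alarm; an unlikely hypothesis fares far worse.
print(f"{false_alarm_lower_bound(0.01, 0.5):.0%}")   # about 11%
print(f"{false_alarm_lower_bound(0.01, 0.1):.0%}")   # over 50% for a long-shot effect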

  All this may lead some to wonder whether the p-value is as important as some have said. Should it be the gold standard of statistical significance and publication? Now that the problems with p-hacking are beginning to get more publicity, some journals have stopped asking for it.46 Other critics have proposed various statistical tests to detect p-hacking and shine a light on it. Megan Head et al. have proposed a method of text mining to measure the extent of p-hacking. The basic idea here is that if someone is p-hacking, then the p-values in their studies will cluster just below the 0.05 level, which shows up as a telltale bump when the distribution of reported p-values is plotted as a curve.47
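  To see why that clustering is a red flag, here is a small simulation sketch, again in Python; the sample sizes, the number of interim “peeks,” and the function names are assumptions of mine, not Head et al.’s actual text-mining procedure. Honest fixed-sample studies of a nonexistent effect produce significant p-values spread roughly evenly between 0 and 0.05, whereas studies that keep adding participants and re-testing until the result crosses the threshold tend to pile up just under 0.05.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def fixed_n_study(n=40):
    """Null study (no real effect) analyzed once, at the planned sample size."""
    return stats.ttest_ind(rng.normal(size=n), rng.normal(size=n)).pvalue

def peeking_study(n_start=10, n_max=60, step=5):
    """Null study with optional stopping: re-test after every new batch of
    participants and stop the moment the p-value dips below 0.05."""
    a = list(rng.normal(size=n_start))
    b = list(rng.normal(size=n_start))
    p = stats.ttest_ind(a, b).pvalue
    while p >= 0.05 and len(a) < n_max:
        a.extend(rng.normal(size=step))
        b.extend(rng.normal(size=step))
        p = stats.ttest_ind(a, b).pvalue
    return p

n_sims = 4000
bin_edges = np.linspace(0.0, 0.05, 6)  # bins 0-0.01, 0.01-0.02, ..., 0.04-0.05
for label, study in [("fixed-n", fixed_n_study), ("peeking", peeking_study)]:
    significant = [p for p in (study() for _ in range(n_sims)) if p < 0.05]
    counts, _ = np.histogram(significant, bins=bin_edges)
    print(f"{label:8s} significant p-values per bin: {counts}")
# The fixed-n counts come out roughly flat; the peeking counts rise toward the
# 0.05 boundary, which is the shape that p-curve methods are designed to expose.

  A journal requirement to report p-curves, as discussed next, is essentially a demand that authors show their distribution of p-values does not have this shape.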

  The next step might be to change the reporting requirements in journals, so that authors are required to compute their own p-curves, which would enable other scientists to tell at a glance whether the results had been p-hacked. Other ideas have included mandatory disclosure of all degrees of freedom taken in producing the result of a paper, along with the size of the effect and any information about prior probabilities.48 Some of this is controversial because researchers cannot agree on the usefulness of what is reported.49 And, as Head points out:

  Many researchers have advocated abolishing NHST [Null Hypothesis Significance Testing]. However, others note that many of the problems with publication bias reoccur with other approaches, such as reporting effect sizes and their confidence intervals or Bayesian credible intervals. Publication biases are not a problem with p-values per se. They simply reflect the incentives to report strong (i.e. significant) facts.50

  So perhaps the solution is full disclosure and transparency? Maybe researchers should be able to use p-values, but the scientific community who assesses their work—peer reviewers and journal editors—should be on the lookout. In their paper, Simmons et al. propose several guidelines:

  Requirements for authors:

  Authors must decide the rule for terminating data collection before data collection begins and report this rule in the article.

  Authors must collect at least 20 observations per cell or else provide a compelling cost-of-data-collection justification.

  Authors must list all variables collected in a study.

  Authors must report all experimental conditions, including failed manipulations.

  If observations are eliminated, authors must also report what the statistical results are if those observations are included.

  If an analysis includes a covariate, authors must report the statistical results of the analysis without the covariate.

  Guidelines for reviewers:

  Reviewers should ensure that authors follow the requirements.

  Reviewers should be more tolerant of imperfection in results.

  Reviewers should require authors to demonstrate that their results do not hinge on arbitrary analytic decisions.

  If justifications of data collection or analysis are not compelling, reviewers should require the authors to conduct an exact replication.51

  And then, in a remarkable demonstration of the efficacy of these rules, Simmons et al. went back to their bogus hypothesis—about the Beatles song that “changed” the listener’s age—and redid it according to the above guidelines … and the effect disappeared. Can one imagine any pseudoscience taking such a rigorous approach to ferreting out error?

  The amazing thing about p-hacking is that even though it reflects a current crisis in science, it still makes a case for the value of the scientific attitude. It would have been easy for me to choose some less controversial or embarrassing example to demonstrate that the community of scientists has embraced the scientific attitude through its insistence on better quantitative methods. But I didn’t. I went for the worst example I could find. Yet it still came out that scientists were paying close attention to the problem and trying to fix it. Even though one of the procedures of science is flawed, the response from the scientific community has been pitch perfect: We’ve got this. We may not have it solved yet, and we may make more mistakes, but we are investigating the problem and trying to correct it. Although it may be in an individual scientist’s interests to be sloppy or lazy (or worse) and rely on p-hacking, this is an embarrassment to the scientific community at large. This is not who we are and we will correct this.

  At the end of their paper, Simmons and his coauthors say this:

  Our goal as scientists is not to publish as many articles as we can, but to discover and disseminate truth. Many of us—and this includes the three authors of this article—often lose sight of this goal, yielding to the pressure to do whatever is justifiable to compile a set of studies that we can publish. This is not driven by a willingness to deceive but by the self-serving interpretation of ambiguity, which enables us to convince ourselves that whichever decisions produced the most publishable outcome must have also been the most appropriate. This article advocates a set of disclosure requirements that imposes minimal costs on authors, readers, and reviewers. These solutions will not rid researchers of publication pressures, but they will limit what authors are able to justify as acceptable to others and to themselves. We should embrace these disclosure requirements as if the credibility of our profession depended on them. Because it does.52

  This warning is not lost on others. At the end of his own article “P-Hacking and Other Statistical Sins,” Steven Novella lists a set of general principles for scientific conduct, then brings up an odious comparison that should make all philosophers of science sit up and take notice:

  Homeopathy, acupuncture, and ESP research are all plagued by … deficiencies. They have not produced research results anywhere near the threshold of acceptance. Their studies reek of P-hacking, generally have tiny effect sizes, and there is no consistent pattern of replication. … But there is no clean dichotomy between science and pseudoscience. Yes, there are those claims that are far toward the pseudoscience end of the spectrum. All of these problems, however, plague mainstream science as well. The problems are fairly clear, as are the necessary fixes. All that is needed is widespread understanding and the will to change entrenched culture.53

  One can imagine no better testament—or call to arms—for the scientific attitude than that it aspires to create such a culture.54

  Peer Review

  We have already seen some of the advantages of peer review in the previous section. Where individuals may be tempted to take shortcuts in their work, they will be caught up short by representatives of the scientific community: the expert readers who agree to provide an opinion and comments on an anonymous manuscript, receiving for compensation only the knowledge that they are helping the profession and that their identity will be shielded.55 (There is also the advantage of getting to see new work first and having the chance to influence it through one’s comments.) But that is it. Outsiders are often shocked to learn that most scientific review work is done for free, or at most for minimal compensation such as a free journal subscription or book.

  The process of peer review is de rigueur at most scientific journals, where editors would shudder to share work that had not been properly vetted. In most instances, by the way, this means that every journal article is reviewed by not one but two experts, to provide an extra level of scrutiny. A check on a check. To the consternation of most authors, publication normally requires that both referees give the nod. And if changes are requested, it is common practice to send the revised paper back to the original reviewers. Inferior work does sometimes slip through, but the point is that this system is one in which the scientific attitude is demonstrated through the principles of fair play and objectivity. It is not perfect, but it is at least a mechanism that allows the scientific community as a whole to have a hand in influencing individual research results before they are shared with the entire community.

  And there is also another level of scrutiny, for if a mistake is made there is always retraction. If an error is not caught until after a journal article is published, the mistake can be flagged or the whole article can be taken back, with due publicity for future readers. Indeed, in 2010 two researchers, Ivan Oransky and Adam Marcus, founded a website called RetractionWatch.com where one can find an up-to-date list of scientific papers that have been retracted. The website was created because of concerns that retracted papers were not being given enough publicity, which might lead other scientists mistakenly to build upon them. Again, one can bemoan the fact that approximately six hundred scientific papers per year are retracted, or one can applaud the scientific community for making sure that the problem does not become worse. Publicizing retractions might even give researchers an incentive not to end up on such a public “wall of shame.” On their blog, Oransky and Marcus argue that Retraction Watch contributes in part to the self-correcting nature of science.

  Although many see retraction as an embarrassment (and it surely is for the parties involved), it is yet another example of the scientific attitude in action. It is not that mistakes never occur in science, but when they do there are well-publicized community standards by which these errors can be rectified. But it is important to note, although it is sometimes lost on the larger nonscientific community, that article retraction does not necessarily indicate fraud. It is enough merely for there to be an error large enough to undermine the original finding. (In some cases this is discovered when other scientists try to reproduce the work, which will be discussed in the next section.) The example of p-hacking alone surely indicates that there is enough to keep the watchdogs busy. But fraud is the ultimate betrayal of the scientific attitude (and as such it will be discussed in its own chapter). For now, it is enough to note that article retraction can occur for many reasons, all of which credit the vigilance of the scientific community.

  Peer review is thus an essential part of the practice of science. There is only so far that individual scientists can go in catching their own errors. Maybe they will miss some. Maybe they want to miss some. Having someone else review work before publication is the best possible way to catch errors and keep them from infecting the work of others. Peer review keeps one honest, even if one was honest in the first place. Remember that groups outperform individuals. And groups of experts outperform individual experts. Even before these principles were experimentally proven, they were incorporated into the standard of peer review. What happens when an idea or theory does not go through peer review? Maybe nothing. Then again, on the theme that perhaps we can learn most about science by looking at its failures, I propose that we here consider the details of an example I introduced in chapter 3, which demonstrates the pitfalls of forgoing community scrutiny before publication: cold fusion.

  To those who know only the “headline” version of cold fusion, it may be tempting to dismiss the whole episode as one of fraud or at least bad intent. With the previously noted book titles Bad Science and Cold Fusion: The Scientific Fiasco of the Century, it is easy to understand why people would want to cast aspersions on the two original researchers, then try to put the whole thing on a shelf and claim it could never happen again. What is fascinating, however, is the extent to which this episode illuminates the process of science. Whatever the ultimate conclusion after it was over—and there was plenty of armchair judgment after upwards of $50 million was spent to refute the original results and a governmental panel was appointed to investigate56—the entire cold fusion fiasco provides a cautionary tale for what can happen when the customary practices of science are circumvented.

  The problems in this case are so thick that one might use the example of cold fusion to illustrate any number of issues about science: weakness in analytical methods, the importance of data sharing and reproducibility, the challenges of subjectivity and cognitive bias, the role of empirical evidence, and the dangers of interference by politics and the media. Indeed, one of the books written on cold fusion has a final chapter that details the numerous lessons we can learn about the scientific process based on the failures demonstrated by cold fusion.57 Here I will focus on only one of these: peer review.

 
