by Peter Watts
When that picture was neutral, their choices were purely random. When it was pornographic or scary, though, they tended to guess right more often than not. It wasn’t a big effect; we’re talking a hit rate of maybe 53% instead of the expected 50%. But according to the stats, the effect was real in eight out of nine experiments.
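To get a feel for why a 53% hit rate can be "real" according to the stats and still unimpressive to look at, here is a minimal sketch of a one-sided binomial calculation. The trial counts are hypothetical round numbers chosen for illustration, not Bem's actual sample sizes, and `chance_tail` is a made-up helper name.

```python
from math import comb


def chance_tail(hits: int, trials: int) -> float:
    """Probability of getting at least `hits` correct guesses in `trials`
    attempts when every guess is a fair 50/50 coin flip."""
    # Count the outcomes with `hits` or more successes using exact integers,
    # then divide by the 2**trials equally likely outcomes.
    favourable = sum(comb(trials, k) for k in range(hits, trials + 1))
    return favourable / 2**trials


# Hypothetical sample sizes: the same 53% hit rate at three different scales.
for trials in (100, 1000, 10000):
    hits = trials * 53 // 100
    print(f"{hits}/{trials} hits: chance probability = {chance_tail(hits, trials):.3g}")
```

The same small edge goes from unremarkable to very hard to blame on luck as the number of trials grows, which is why a tiny effect can still clear a significance threshold.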
Now, of course, everyone and their dog is piling on to kick holes in the study. That’s cool; that’s what we do, that’s how it works. Perhaps the most telling critique is the only one that really matters; nobody has been able to replicate Bem’s results yet. That speaks a lot louder than some of the criticisms that have been leveled in recent days, at least partly because some of those criticisms seem, well, pretty dumb. (Bem himself responds to some of Alcock’s complaints9).
Let’s do a quick drive-by on a few of the methodological accusations folks have been making: Bem’s methodology wasn’t consistent. Bem log-transformed his data; oooh, maybe he did it because untransformed data didn’t give him the results he wanted. Bem ran multiple tests without correcting for the fact that the more often you run tests on a data set, the greater the chance of getting significant results through random chance. To name but a few.
Maybe my background in field biology makes me more forgiving of such stuff, but I don’t consider tweaking one’s methods especially egregious when it’s done to adapt to new findings. For example, Bem discovered that men weren’t as responsive as women to the level of eroticism in his initial porn selections (which, as a male, I totally buy; those Harlequin Romance covers don’t do it for me at all). So he ramped the imagery for male participants up from R to XXX. I suppose he could have continued to use nonstimulating imagery even after realising that it didn’t work, just as a fisheries biologist might continue to use the same net even after discovering that its mesh was too large to catch the species she was studying. In both cases the methodology would remain “consistent”. It would also be a complete waste of resources.
Bem also got some grief for using tests of statistical significance (i.e., what are the odds of getting results at least this extreme through random chance alone?) rather than Bayesian methods (i.e., given these specific results, what are the odds that our hypothesis is true?). (Carey gives a nice comparative thumbnail of the two approaches at The New York Times.10) I suspect this complaint could be legit. The problem I have with Bayes is that it takes your own preconceptions as a starting point: you get to choose up front the odds that psi is real, and the odds that it is not. If the data run counter to those odds, the theorem adjusts them to be more consistent with those findings on the next iteration; but obviously, if your starting assumption is that there’s a 99.9999999999% chance that precognition is bullshit, it’s gonna take a lot more data to swing those numbers than if you start from a bullshit-probability of only 80%. Wagenmakers et al tie this in to Laplace’s famous statement that “extraordinary claims require extraordinary proof” (to which we shall return at the end of this post), but another way of phrasing that is “the more extreme the prejudice, the tougher it is to shake”. And Bayes, by definition, uses prejudice as its launch pad.
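The dependence on the starting prejudice can be sketched in a few lines. This is an illustration under loud assumptions, not Wagenmakers et al's actual analysis: it pretends every study is independent and delivers the same Bayes factor of 3 in favour of psi, a number invented purely for the example. `studies_needed` is a hypothetical helper.

```python
def studies_needed(prior_prob: float, bayes_factor: float, target: float = 0.5) -> int:
    """Count how many identical, independent studies it takes to push the
    probability of a hypothesis from `prior_prob` up to at least `target`."""
    if bayes_factor <= 1:
        raise ValueError("evidence must favour the hypothesis (bayes_factor > 1)")
    # Bayes' theorem in odds form: posterior odds = prior odds * Bayes factor.
    odds = prior_prob / (1 - prior_prob)
    target_odds = target / (1 - target)
    count = 0
    while odds < target_odds:
        odds *= bayes_factor
        count += 1
    return count


# Assumed Bayes factor of 3 per study (hypothetical, for illustration only).
print(studies_needed(0.2, 3))    # starting from an 80% bullshit-probability
print(studies_needed(1e-12, 3))  # starting from 99.9999999999%
```

Under these made-up numbers the mild skeptic comes around after two studies, while the hardened one needs twenty-six.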
Wagenmakers et al ran Bem’s numbers using Bayesian techniques, starting with standard “default” values for their initial probabilities (they didn’t actually say what those values were, although they cited a source). They found “substantial” support for precognition (H1) in only one of Bem’s nine experiments, and “substantial” support for its absence (H0) in another two (they actually claim three, but for some reason they seem to have run Bem’s sixth experiment twice). They then reran the same data using a range of start-up values that differed from these “defaults”, just to be sure, and concluded that their results were robust. They refer the reader to an online appendix for the details of that analysis. I can’t show you the figure you’ll find there (for reasons that remain unclear, Tachyon is strangely reluctant to break copyright law), but its caption reads, in part,
“The results in favor of H1 are never compelling, except perhaps for the bottom right panel.”
Let me state unequivocally that the “perhaps” is disingenuous to the point of outright falsehood. The bottom right panel shows clear support for H1. So even assuming that these guys were on the money with all of their criticisms, even assuming that they’ve successfully demolished eight of Bem’s nine claims to significance, they’re admitting to evidence for the existence of precognition by their own reckoning. And yet they can’t bring themselves to admit it, even in a caption belied by its own figure.
To some extent, it was Bem’s decision to make his work replication-friendly that put this particular bullseye on his chest. He chose methods that were well-known and firmly established in the research community; he explicitly rejected arcane statistics in favor of simple ones that other social scientists would be comfortable with. (“It might actually be more logical from a Bayesian perspective to believe that some unknown flaw or artifact is hiding in the weeds of a complex experimental procedure or an unfamiliar statistical analysis than to believe that genuine psi has been demonstrated,” he writes. “As a consequence, simplicity and familiarity become essential tools of persuasion.”) Foreseeing that some might question the distributional assumptions underlying t-tests, he log-transformed his data to normalise it prior to analysis; this inspired Wagenmakers et al to wonder darkly “what the results were for the untransformed RTs—results that were not reported”. Bem also ran the data through nonparametric tests that made no distributional assumptions at all; Alcock then complained about unexplained, redundant tests that added nothing to the analysis (despite the fact that Bem had explicitly described his rationale), and about the use of multiple tests that didn’t correct for the increased odds of false positives.
This latter point is true in the general but not in the particular. Every grad student knows that desperate sinking feeling that sets in when their data show no apparent patterns at all, the temptation to inflict endless tests and transforms in the hope that please God something might show up. But Bem already had significant results; he used alternative analyses in case those results were somehow artefactual, and he kept getting significance no matter which way he came at the problem. Where I come from, it’s generally considered a good sign when different approaches converge on the same result.
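The general version of the multiple-testing worry is easy to put numbers on. A minimal sketch, assuming the tests are independent (which repeated analyses of a single data set are not, so treat this as an upper-end illustration rather than a description of Bem's situation); both function names are hypothetical.

```python
def familywise_error_rate(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across `num_tests` independent
    tests, each run at significance threshold `alpha`, when no effect exists."""
    return 1 - (1 - alpha) ** num_tests


def bonferroni_threshold(num_tests: int, alpha: float = 0.05) -> float:
    """Per-test threshold that keeps the overall false-positive rate at or
    below `alpha` (the standard Bonferroni correction)."""
    return alpha / num_tests


for n in (1, 5, 14, 50):
    print(f"{n:>2} tests: P(at least one false positive) = {familywise_error_rate(n):.3f}, "
          f"corrected threshold = {bonferroni_threshold(n):.4f}")
```

By the time you have run fourteen independent tests on pure noise, you are more likely than not to have found something "significant".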
Bem also considered the possibility that there might be some kind of bias in algorithms used by the computer to randomise its selection of pictures; he therefore replicated his experiments using different random-number generators. He showed all his notes, all the messy bits that generally don’t get presented when you want to show off your work in a peer-reviewed journal. He not only met the standards of rigor in his field: he surpassed them, and four reviewers (while not necessarily able to believe his findings) couldn’t find any methodological or analytical flaws sufficient to keep the work from publication.
Even Bem’s opponents admit to this. Wagenmakers et al explicitly state:
“Bem played by the implicit rules that guide academic publishing—in fact, Bem presented many more studies than would usually be required.”
They can’t logically attack Bem’s work without attacking the entire field of psychology. So that’s what they do:
“. . . our assessment suggests that something is deeply wrong with the way experimental psychologists design their studies and report their statistical results. It is a disturbing thought that many experimental findings, proudly and confidently reported in the literature as real, might in fact be based on statistical tests that are explorative and biased (see also Ioannidis, 2005). We hope the Bem article will become a signpost for change, a writing on the wall: psychologists must change the way they analyze their data.”
And you know, maybe they’re right. We biologists have always looked at those soft-headed new-agers over in the Humanities building with almost as much contempt as the physicists and chemists looked at us, back before we owned the whole genetic-engineering thing. I’m perfectly copacetic with the premise that psychology is broken. But if the field is really in such disrepair, why is it that none of those myriad less-rigorous papers acted as a wake-up call? Why snooze through so many decades of hack analysis only to pick on a paper which, by your own admission, is better than most?
Well, do you suppose anyone would be eviscerating Bem’s methodology with quite so much enthusiasm if he’d concluded that there was no evidence for precognition? Here’s a hint: Alcock’s critique painstakingly picks at every one of Bem’s experiments except for #7. Perhaps that seventh experiment finally got it right, you think. Perhaps Alcock gave that one a pass because Bem’s methodology was, for once, airtight? Let’s let Alcock speak for himself:
“The hit rate was not reported to be significant in this experiment. The reader is therefore spared my deliberations.”
Evidently bad methodology isn’t worth criticising, just so long as you agree with the results.
This leads nicely into what is perhaps the most basic objection to Bem’s work, a more widespread and gut-level response that both underlies and transcends the methodological attacks: sheer, eye-popping incredulity. This is bullshit. This has to be bullshit. This doesn’t make any goddamned sense.
It mustn’t be. Therefore it isn’t.
Of course, nobody phrases it that baldly. They’re more likely to claim that “there’s no mechanism in physics which could explain these results.” Wagenmakers et al went so far as to claim that Bem’s effect can’t be real because nobody is bankrupting the world’s casinos with their psychic powers, which is logically equivalent to saying that protective carapaces can’t be adaptive because lobsters aren’t bulletproof. As for the whacked-out argument that there’s no theoretical mechanism in place to describe the data, I can’t think of a more effective way of grinding science to a halt than to reject any data that don’t fit our current models of reality. If everyone thought that way, earth would still be a flat disk at the center of a crystal universe.
Some people deal with their incredulity better than others. (One of the paper’s reviewers opined that they found the results “ridiculous”, but recommended publication anyway because they couldn’t find fault with the methodology or the analysis.) Others take refuge in the mantra that “extraordinary claims require extraordinary evidence”.
I’ve always thought that was a pretty good mantra. If someone told me that my friend had gotten drunk and run his car into a telephone pole I might evince skepticism out of loyalty to my friend, but a photo of the accident scene would probably convince me. People get drunk, after all (especially my friends); accidents happen. But if the same source told me that a flying saucer had used a tractor beam to force my friend’s car off the road, a photo wouldn’t come close to doing the job. I’d just reach for the Photoshop manual to figure out how the image had been faked. Extraordinary claims require extraordinary evidence.
The question, here in the second decade of the 21st Century, is: what constitutes an “extraordinary claim”? A hundred years ago it would have been extraordinary to claim that a cat could be simultaneously dead and alive; fifty years ago it would have been extraordinary to claim that life existed above the boiling point of water, kilometers deep in the earth’s crust. Twenty years ago it was extraordinary to suggest that the universe was not only expanding but that the rate of expansion was accelerating. Today, physics concedes the theoretical possibility of time travel (in fact, I’ve been led to believe that the whole arrow-of-time thing has always been problematic to the physicists; most of their equations work both ways, with no need for a unidirectional time flow).
Yes, I know. I’m skating dangerously close to the same defensive hysteria every new-age nutjob invokes when confronted with skepticism over the Healing Power of Petunias; yeah, well, a thousand years ago everybody thought the world was flat, too. The difference is that those nutjobs make their arguments in lieu of any actual evidence whatsoever in support of their claims, and the rejoinder of skeptics everywhere has always been “Show us the data. There are agreed-upon standards of evidence. Show us numbers, P-values, something that can pass peer review in a primary journal by respectable researchers with established reputations. These are the standards you must meet.”
How often have we heard this? How often have we pointed out that the UFO cranks and the Ghost Brigade never manage to get published in the peer-reviewed literature? How often have we pointed out that their so-called “evidence” isn’t up to our standards?
Well, Bem cleared that bar. And the response of some has been to raise it. All along we’ve been demanding that the fringe adhere to the same standards the rest of us do, and finally the fringe has met that challenge. And now we’re saying they should be held to a different standard, a higher standard, because they are making an extraordinary claim.
This whole thing makes me deeply uncomfortable. It’s not that I believe the case for precognition has been made; it hasn’t. Barring independent confirmation of Bem’s results, I remain a skeptic. Nor am I especially outraged by the nature of the critiques, although I do think some of them edge up against outright dishonesty. I’m on public record as a guy who regards science as a knock-down-drag-out between competing biases, personal more often than not.11 (On the other hand, if I’d tried my best to demolish evidence of precognition and still ended up with “substantial” support in one case out of nine, I wouldn’t be sweeping it under the rug with phrases like “never compelling” and “except perhaps”; I’d be saying “Holy shit, the dude may have overstated his case but there may be something to this anyway . . .”)
I am, however, starting to have second thoughts about Laplace’s principle. I’m starting to wonder if it’s especially wise to demand higher evidentiary standards for any claim we happen to find especially counterintuitive this week. A consistently-applied 0.05 significance threshold may be arbitrary, but at least it’s independent of the vagaries of community standards. The moment you start talking about extraordinary claims you have to define what qualifies as one, and the best definition I can come up with is: any claim which is inconsistent with our present understanding of the way things work. The inevitable implication of that statement is that today’s worldview is always the right one; we’ve already got a definitive lock on reality, and anything that suggests otherwise is especially suspect.
Which, you’ll forgive me for saying so, seems like a pretty extraordinary claim in its own right.
Maybe we could call it the Galileo Corollary.
1 https://www.nytimes.com/2011/01/06/science/06esp.html
2 http://www.huffingtonpost.com/julia-moulden/do-we-have-one-extra-sens_b_808417.html
3 http://healthland.time.com/2011/01/12/wait-esp-is-real/
4 http://www.winnipegfreepress.com/opinion/westview/science-journals-in-decline-113189344.html
5 http://www.csicop.org/specialarticles/show/back_from_the_future
6 http://www.csicop.org/specialarticles/show/response_to_bems_comments
7 http://www.csicop.org/specialarticles/show/response_to_alcocks_back_from_the_future_comments_on_bem
8 http://deanradin.blogspot.com/2010/12/my-comments-on-alcocks-comments-on-bems.html
9 http://www.csicop.org/specialarticles/show/response_to_alcocks_back_from_the_future_comments_on_bem
10 http://www.nytimes.com/2011/01/11/science/11esp.html
11 https://www.rifters.com/crawl/?p=886
Why I Suck.
Blog June 6, 2013
I’ve just sat through an entire season—which is to say three measly episodes, in what might be the new SOP for the BBC (see Sherlock)—of this new zombie show called In the Flesh.
Yeah, I know. These days, the very phrase “new zombie show” borders on oxymoronic. And yet, this really is a fresh spin on the old paradigm: imagine that, years after the dead clawed their way out of the ground and started feasting on the living, we figured out how to fix them. Not cure, exactly: think diabetes or HIV, management instead of recovery. Imagine a drug that repairs the mind, even if it can’t fix the rot or the pallor or the eyes.
Imagine the gradual reconnection of cognitive circuitry, and the flashbacks it provokes as animal memories reboot. Imagine what it must be like when the sudden fresh remembrance of people killed and eviscerated is regarded, clinically, as a sign of recovery.
In the Flesh imagines. It also imagines government-mandated reintegration of the recovering undead (“Partially-Deceased-Syndrome” is the politically-correct term; it comes replete with cheery pamphlets to help next-of-kin manage the transition). Contact lenses and pancake makeup to make the partly-dead more palatable to the communities in which they once lived. Therapy sessions in which the overwhelming guilt of freshly-remembered murder and cannibalism alternates with defiant self-justification: “We had to do it to survive. They blew our heads off without a second thought—they were protecting humanity! They get medals, we get medicated . . .” Hypertrophic Neighborhood Watch patrols who never let you forget that no matter how Human these creatures may seem now, a couple of missed injections is all it takes to turn them back into ravening monsters in the heart of our community . . .
What’s science fiction’s mission statement, again? Oh, right: to explore the social impact of scientific and technological change. Too much SF takes the Grand Tour Amusement Park approach, offers up an awesome parade of wonders and prognostications like some kind of futuristic freak show. It takes a show like In the Flesh to remind us that technology is only half of the equation, that the molecular composition of the hammer or the rpms of the chainsaw, in isolation, are of limited interest. Our mission hasn’t been accomplished until the hammer hits the flesh.