by Andrew Leigh
I don’t want to suggest that we can’t learn anything from these kinds of ‘natural experiments’. To prove the point, these examples were drawn from my own academic work. I’ve devoted years of my life to working on these and other non-randomised studies. In each case, my co-authors and I did our best to find a credible counter-factual. But all of these studies are limited by the assumptions that the methods required us to make. New developments in non-randomised econometrics – such as machine learning – are generally even more complicated than the older approaches.34 As economist Orley Ashenfelter notes, if an evaluator is predisposed to give a program the thumbs-up, statistical modelling ‘leaves too many ways for the researcher to fake it’.35
That’s why one leading econometrics text teaches non-random approaches by comparing each to the ‘experimental ideal’.36 Students are encouraged to ask the question: ‘If we could run a randomised experiment here, what would it look like?’ Another novel approach is to take data from a properly conducted randomised trial, and pretend that we wanted to run a non-randomised evaluation. With the randomised trial as our yardstick, this trick lets us see how closely a natural experiment matches the true result. In the case of job training, researchers found that someone who evaluated the program without a randomised trial would have got the wrong answer.37 Similarly, non-randomised studies suggested that free distribution of home computers had large positive effects on school students’ test scores – but randomised evaluations showed that they in fact had little benefit.38
Having a better counterfactual is why randomised trials are often described as the ‘gold standard’. It’s always struck me as an odd compliment, given that no serious economist wants to go back to the literal gold standard.39 I think what they mean is that in the gruelling evaluation event, randomised trials get the gold medal. Indeed, many researchers support the notion of an evidence hierarchy, with randomised trials at the top of the dais.40 For example, the nonpartisan US Congressional Budget Office prioritises random assignment studies in assessing evidence.41
*
Most of us have had the experience of reading a surprising academic finding in the newspaper. Journalists and editors love those studies that find a quirky relationship or overturn conventional wisdom. If it prompts breakfast conversations that start, ‘Darling, you won’t believe this . . .’ it’s sure to be popular.
In 2000, a supermarket experiment found that customers were more likely to buy a new kind of jam when they were offered a choice between six jams than if they were offered twenty-four jams.42 The paper has since been cited thousands of times and used as evidence that many of us are overwhelmed by choice.43 The research also spurred dozens of follow-up experiments on ‘the paradox of choice’. A decade after the initial study appeared, a team of psychologists collated as many of these replication experiments as they could find.44 Among the fifty replication studies, a majority went in the opposite direction from the original jam choice experiment. Averaging all the results, the psychologists concluded that the number of available options had ‘virtually zero’ impact on customer satisfaction or purchases.
Novelty excites. Even those who manage dusty academic journals can be tempted to publish a counterintuitive result. Perversely, because editors like to publish unexpected results, and because authors like to publish in the best places, prestigious journals end up publishing too many idiosyncratic findings.
One smoking gun emerged when statisticians looked at how far away published papers were from being statistically insignificant. In social science, a standard rule of thumb for determining whether a result is due to luck is the ‘95 per cent significance’ rule. At this threshold, the chance of mistakenly identifying a statistical relationship where none truly exists is 5 per cent. If your study shows a result that is statistically significant at the 95 per cent level, then many journal editors will believe it. If your result falls below the 95 per cent level, editors are more likely to throw the paper into the rejection bin.
Most academics want their students to aspire to better than a bare pass on the test. But in the case of statistical significance, it turned out that a surprising number of published papers in social science were only just scraping through. Analysis of research published in top sociology, finance, accounting, psychology and political science journals revealed a plethora of results that hold true at precisely the 95 per cent level.45 In other words, these studies pass a 95 per cent test, but would fail a 96 per cent test. This troubling finding immediately suggested that up to 5 per cent of published results might be due to luck rather than a genuine effect.
Worse still, if researchers were rerunning their analysis with different specifications until they got a result that was significant at the 95 per cent level (a practice known as ‘P-hacking’), then the resulting research might be even more error-prone. An unscrupulous academic who started each project with twenty junk theories could reasonably expect mere luck to make one of them appear significant at the 95 per cent level. Discard the other nineteen and – voilà! – there’s your publishable result.
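To see how quickly those odds mount up, here is a minimal simulation sketch (purely illustrative – the sample sizes and ‘junk theories’ are invented, not drawn from any study discussed here). Each imaginary researcher tests twenty theories on data where the program genuinely does nothing, and counts how many clear the 95 per cent bar by luck alone:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(seed=1)

n_researchers = 2000   # imaginary researchers, each with twenty junk theories
n_theories = 20
sample_size = 100      # people per group in each made-up study

lucky_hits = []
for _ in range(n_researchers):
    significant = 0
    for _ in range(n_theories):
        # Treatment and control are drawn from the same distribution:
        # by construction, the program has no effect at all.
        treatment = rng.normal(size=sample_size)
        control = rng.normal(size=sample_size)
        _, p_value = stats.ttest_ind(treatment, control)
        if p_value < 0.05:   # 'significant at the 95 per cent level'
            significant += 1
    lucky_hits.append(significant)

lucky_hits = np.array(lucky_hits)
print(f"Junk theories confirmed by luck, on average: {lucky_hits.mean():.2f}")    # roughly 1
print(f"Researchers with at least one 'finding': {(lucky_hits > 0).mean():.0%}")  # roughly 64%
```

The second figure is just 1 − 0.95^20, or about 64 per cent: test enough junk, and luck will hand most researchers at least one publishable-looking result.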
Another clue to the problem in published social science research came when researchers replicated their colleagues’ work. In 2011 psychologist Brian Nosek set about persuading a team of scholars to replicate published papers in some of his field’s best journals. Over the next three years, 270 academics collaborated in replicating 100 psychology studies.46 The result: only about one-third of the findings held up.
Similarly worrying results have emerged in other disciplines. In a subfield of genetics, biologists replicated only one out of nine published papers.47 In oncology and haematology, medical researchers had the same success rate – just one in nine.48 In macroeconomics, only half the chosen studies successfully replicated.49 One critic, John Ioannidis, points to the way that outside funding, competition between academics, and researchers’ ability to cherrypick results can skew findings. His conclusion: ‘Most published research findings are false.’50
In my view, Ioannidis’s verdict is too pessimistic. Still, it’s no surprise that people have begun talking about a ‘crisis’ in replicability.51 For academia to retain its credibility, disciplines need to do a better job of ensuring that fewer findings are overturned. One of the best ways of doing this is to encourage more replications. And the easiest studies to replicate are randomised trials.
Replication can be done in a range of ways. If we are worried that the success of an exercise program might be due to a charismatic trainer, then it would be useful to replicate the study with a different coach. If we think an anti-violence program might depend on how strictly crime is penalised, then it would be valuable to replicate it in a jurisdiction with softer laws.
One way to carry out replication is for a new research team to repeat the study: an approach that one foundation describes as ‘If at first you succeed, try again’.52 But another is for researchers to work together to conduct the same analysis in different places. For example, if we can randomise a tutoring program at the student level, then it may be useful to simultaneously run the trial across multiple schools, so we can compare the effectiveness of different teams.
In some fields, there is a powerful push towards doing trials in multiple locations. Australian medical researcher David Johnson told me that in his field of kidney disease, he wants to see an end to underpowered single-centre trials.53 ‘It’s seldom that clinical practice will ever be changed by a single-centre trial,’ Johnson argues. ‘The numbers tend to be small, and even if they’re big enough, you worry that they won’t generalise.’ For his kind of medical research, Johnson contends, the future lies in coordinated research across multiple centres.
Statistically, a randomised trial based in two or three countries should be accorded massively more weight than one based on just a single site. To see why, let’s go back to that 95 per cent cut-off for statistical significance. If we run a randomised evaluation of a junk program, then 5 per cent of the time – or one time in twenty – the result will turn out to be statistically significant. Oops.
Now let’s see what happens when the finding is replicated. Suppose we test a new education program, using the conventional 95 per cent level of statistical significance. And suppose the program doesn’t actually work. In that case, a randomised trial will give rise to a positive finding by chance one time in twenty. But with two randomised trials, the odds of mistakenly getting a positive finding in both trials would fall to one in 400. With three randomised trials, the chance that all three would find that the program had a significant impact drops to one in 8000. A significant result is far less likely to be a fluke if it holds up in multiple places.54
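If it helps to see where those figures come from, the probabilities of a fluke simply multiply – assuming the trials are run independently of one another:

$$
P(\text{one fluke}) = 0.05 = \tfrac{1}{20}, \qquad
P(\text{two flukes}) = 0.05^{2} = \tfrac{1}{400}, \qquad
P(\text{three flukes}) = 0.05^{3} = \tfrac{1}{8000}.
$$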
The job of compiling the randomised evidence is done by Cochrane for medicine and the Campbell Collaboration for social policy. Topic by topic, these organisations compile ‘systematic reviews’ that distil the relevant randomised trials into a format that can be easily accessed by practitioners, policymakers and the public.
The internationalisation of randomised trials presents new opportunities for replication across countries. In the 1980s, nine out of ten randomised policy trials were conducted in the United States, so multi-country studies were difficult.55 But today, just three in ten randomised policy trials are carried out in the US. It’s a trend that should be welcomed by Americans and non-Americans alike. If a program or pharmaceutical passes randomised evaluation in several countries, you can be more confident it really works.
Wherever a replication is conducted, it’s crucial that the results are reported. If researchers conceal findings that run counter to conventional wisdom, then the rest of us may form a mistaken impression of the results of available randomised trials. Like a golfer who takes a mulligan on every hole, researchers who quietly discard trials leave us with a scorecard that doesn’t reflect reality.
One way of countering ‘publication bias’ is to require that studies be registered before they start – by lodging a statement in advance in which the researchers specify the questions they are seeking to answer. This makes it more likely that studies are reported after they finish. In medicine, there are fifteen major clinical trial registers around the world, including ones operated by Australia and New Zealand, China, the European Union, India, Japan, the Netherlands and Thailand. All of these trials are then aggregated by the World Health Organization into the International Clinical Trials Registry Platform, which contains details of around 400,000 medical trials.
In recent years, development economists have established the Registry for International Development Impact Evaluations, which lists over 100 studies. Political scientists have created the Experiments in Governance and Politics Network (which lists around 700 studies). Economists have created the American Economic Association’s Randomized Controlled Trials registry (which lists around 1500 studies). Unlike in medicine, most social science journals do not yet refuse to publish unregistered trials, nor is there a requirement for all results to be published. But it is likely that disciplines such as economics and political science will move that way in coming years. As an added incentive – and with a touch of irony – development economists offered the first 100 researchers who submitted a trial to their registry a chance to win a tablet computer.
Even in medicine, the practice of researchers does not always live up to the requirements of the registry. Since 2007 the US government has required all drug trials to be registered at Clinicaltrials.gov, and their results published within a year of completing data collection. Yet a recent analysis found that only one in seven registered trials had complied with this requirement.56 Even four years after finishing data collection, less than half of all trials had publicly posted their results. Hopefully the revelation of these worrying statistics will improve compliance rates.
In 2006, researchers analysed all the available clinical trial data on Prozac-type antidepressants, and found that they were associated with a higher risk of suicide among teenagers.57 If the studies had been reported sooner, this disturbing finding might have been known earlier. Similarly, the decision by various advanced countries to stockpile the anti-influenza drug Tamiflu was made at a time when 60 per cent of the available studies had not reported their results.58 Analysis of these figures now suggests that Tamiflu may be less effective at reducing hospital admissions than previously thought. Randomised trials can’t help society make better choices if their results are buried.
One of the main purposes of requiring trials to be registered is to avoid researchers moving the goalposts. But a number of recent reviews have shown that a significant share of randomised trials in medicine engaged in ‘outcome switching’ – reporting measures that were different from those they registered at the beginning.59 The most egregious case involved a 1998 study of GlaxoSmithKline’s drug Paxil, whose registration said the trial would look at eight outcomes. None showed any impact, so the researchers then went fishing across nineteen more measures, of which four showed a significant effect. Paxil’s results were then reported as though these four measures had been their focus from the outset. In 2012 GlaxoSmithKline was fined US$3 billion for misreporting trial data on several of their drugs, including Paxil.60
*
‘I still remember the call like it was yesterday. It came from the man analysing the bloodwork. “Something’s not right,” he told me, “one of the girls has something going on with her white blood cells.” An hour later, he called back and said the girl had early-stage leukaemia. By the end of the day, her parents had driven her from Canberra to Sydney, and she was getting her first treatment. A year later, she was cancer free.’
Dick Telford is sitting in my office, chatting about a remarkable study he began in 2005, to look at the impact of physical exercise on child outcomes. If there’s a person who’s qualified to talk about fitness, it’s Telford. In the late 1960s he played Aussie Rules for Collingwood and Fitzroy, before switching to running. He’s run a marathon in 2 hours and 27 minutes, won a track medal at the World Masters Games, and has a PhD in exercise science from the University of Melbourne. Telford was the first sport scientist appointed to the Australian Institute of Sport, where he trained Rob de Castella, Lisa Ondieki, Martin Dent and Carolyn Schuwalow.
Now in his early seventies, Telford is whippet-thin and moves fluidly. He coaches a small squad of runners, which I occasionally join. He’s deeply engaged with research, with a strong focus on randomised trials. It’s not to build up his résumé, but because Telford cares deeply about understanding which programs work and which do not.
The way in which Telford’s school sport program worked provides insights into how to run an effective randomised trial. As we’ve seen, it isn’t always necessary for the control group to get nothing at all, with many medical randomised trials testing a new drug against the best available alternative. Comparing two treatments reduces the chance of study participants trying to switch groups, helps assuage political concerns and is ethically preferable in most instances.
The same approach can be taken in the case of non-medical research. When Telford began his 2005 trial, working in partnership with the Australian Institute of Sport and the Australian National University, everyone recognised that it would be wrong to deny children in the control group any access to school sport. So rather than comparing exercise to no exercise, the study compared high-quality physical education with regular school sports programs. And because children might have felt cheated if they saw their classmates receiving better sports programs, the randomisation was done between schools, rather than between pupils in the same school.
After extensive conversations with school principals, Telford and his colleagues chose twenty-nine Canberra primary schools, and randomly divided them into two groups – literally drawing school names out of a hat. Thirteen of the schools were selected to receive physical education instruction from specialist teachers, who worked with classroom teachers to provide a daily exercise program of balance, coordination and games. In the sixteen control group schools, students still did physical education with their regular classroom teachers, but the sessions were fewer, shorter and less physically demanding. So rather than comparing the impact of exercise with no exercise, the randomised trial became a comparison of occasional sport with quality training.
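As a purely illustrative sketch of that school-level (‘cluster’) randomisation – with made-up school names, since the real list isn’t reproduced here – the draw amounts to shuffling the schools and splitting them into two groups:

```python
import random

# Hypothetical names standing in for the twenty-nine Canberra primary schools.
schools = [f"School {i:02d}" for i in range(1, 30)]

random.seed(2005)        # illustrative seed only, so the split is reproducible
random.shuffle(schools)  # the electronic equivalent of drawing names out of a hat

treatment_schools = schools[:13]  # specialist-led daily physical education
control_schools = schools[13:]    # regular sport sessions with classroom teachers

print(f"Treatment group ({len(treatment_schools)} schools): {treatment_schools}")
print(f"Control group ({len(control_schools)} schools): {control_schools}")
```

Randomising whole schools rather than individual pupils also means the analysis has to allow for the fact that children in the same school resemble one another – one reason cluster trials need a reasonable number of schools, not just a reasonable number of children.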
After four years, children randomised into the treatment group had less body fat, lower cholesterol levels and better maths scores.61 Researchers are now following these children into adulthood, to see how doing more exercise as a child may affect their overall wellbeing. Ultimately, Telford would like to see the children in the study followed up into their retirement years. He knows he won’t be around to see the results, but feels it’s vital that we learn about the long-term effects of quality school sports programs.
Done well, randomised trials make the world a better place. As it happens, Dick Telford’s hunch about the benefits of quality school sports programs seems to be paying off. But even if the results had dashed his hopes, the study would still be valuable, as it would add to our stock of knowledge, perhaps inspiring other researchers to pursue different approaches. And even aside from the results, there’s a girl alive today in Canberra who beat leukaemia, perhaps only because she was lucky enough to be part of Dick Telford’s randomised trial.
12
WHAT’S THE NEXT CHANCE?
In his book on the history of science, David Wootton marks how we have progressed intellectually by setting out the beliefs of most well-educated Europeans in 1600.1 The leading minds of that time believed in werewolves, in witches who could summon storms, and in unicorns. They believed that dreams could predict the future, and that the sun revolved around the earth. They believed mice were spontaneously created within straw piles, and that rainbows were a sign from God. They believed that a victim’s body would bleed when the murderer was nearby. In Shakespeare’s day, these weren’t fringe notions – they were what the best-informed people of that era understood to be true.