by Andrew Leigh
Among the powerful beliefs of the age was alchemy – the notion that base metals such as lead could be turned into precious metals such as gold. For millennia, alchemy had occupied a significant portion of all scientific research efforts. Even Isaac Newton spent more time on alchemy than on physics, prompting Keynes to suggest that Newton was ‘not the first of the age of reason. He was the last of the magicians.’
What saw off alchemy was not a culture of experimentation. Quite the contrary: alchemists had been doing experiments for centuries. The critical shift was the movement from secretive, badly designed experiments to experiments that were rigorous and publicly reported. As David Wootton observes:
What killed alchemy was the insistence that experiments must be openly reported in publications which presented a clear account of what had happened, and they must then be replicated, preferably before independent witnesses. The alchemists had pursued a secret learning, convinced that only a few were fit to have knowledge of divine secrets and that the social order would collapse if gold ceased to be in short supply . . . Esoteric knowledge was replaced by a new form of knowledge, which depended both on publication and on public or semi-public performance. A closed society was replaced by an open one.2
By 1750, well-educated Europeans no longer believed in alchemy – nor, for that matter, in witches, unicorns or werewolves. Today, experimentation and the open publication of results are why most of us can confidently reject hundreds of ideas that seem intuitively appealing – including phrenology, iridology, astrology, reiki, telepathy, water dowsing, dianetics and tongue maps. It’s why a majority of people believe in evolution in most advanced countries, including the United Kingdom, France, Germany, Japan, Denmark and Spain (though not the United States or Turkey).3 The scientific revolution not only transformed the way we view the world around us, but also underpinned massive improvements in medical research, increasing the length and quality of our lives.
Alas, too many areas of life – from business to policymaking – still look worryingly like alchemy. When the basis for judgement is a low-quality evaluation and the results are kept secret from the world, the process starts to look more like the search for the philosopher’s stone than rigorous analysis. When 1100 of the world’s top executives were asked to describe their decision-making process, fewer than one in three said that they placed most reliance on data and analysis.4 British economist Tim Harford is critical of politicians who ‘use statistics like a stage magician uses smoke and mirrors’.5 At its worst, he says, this can be like the ‘bullshitter’ described by philosopher Harry Frankfurt: a person who is worse than a liar because they do not even care about the truth – they are indifferent to whether their statements are true or false.6 Sure, it’s possible to lie with statistics – but even easier to lie without them.
Bringing randomised trials to policy involves what psychologist Donald Campbell called ‘the experimenting society’. Campbell envisaged this as ‘an honest society, committed . . . to self-criticism . . . It will say it like it is, face up to the facts, be undefensive.’7 Such a society ‘will be a nondogmatic society . . . The scientific values of honesty, open criticism, experimentation, willingness to change once-advocated theories in the face of experimental and other evidence.’8 As we have seen, this approach is epitomised by the founders of TOMS, who donated 60 million pairs of shoes to children in developing nations, encouraged a randomised evaluation and then changed their philanthropic approach in response to the disappointing results.
It’s not always easy to follow the scientific path. As physicist Richard Feynman once observed, ‘The first principle is that you must not fool yourself – and you are the easiest person to fool.’9 Great scientists present all the evidence, not just the data that supports their pet theories. The best scientists publish results regardless of how they turn out. Feynman contrasted scientific integrity with ‘cargo cult science’. Like the Pacific Islanders who once built sham runways in the hope of attracting cargo planes, bad science might look like real science. It might even produce temporary moments of fame and excitement. But its results will eventually be junked.
‘Just as randomised trials revolutionised medicine in the twentieth century,’ argue economists Esther Duflo and Michael Kremer, ‘they have the potential to revolutionise social policy in the twenty-first.’10 As British Nudge Unit head David Halpern puts it: ‘We need to turn public policy from an art to a science.’11 This means paying more attention to measurement, and admitting that our intuition might be wrong. Randomised trials flourish where modesty meets numeracy.
One of the big thinkers of US social policy, senator Daniel Patrick Moynihan, recognised that evaluations often produce results which are solid rather than stunning. When faced with a proposed new program, Moynihan was fond of quoting Rossi’s Law – evaluation researcher Peter Rossi’s wry observation that the expected impact of any large-scale social program is zero. Moynihan whimsically called Judith Gueron, the pioneer of randomised social policy trials, ‘Our Lady of Modest but Positive Results’.12
Blockbuster movies are filled with white knights and magic bullets, moon shots and miracles. Yet in reality most positive change doesn’t happen suddenly. From social reforms to economic change, our best systems have evolved gradually. Randomised trials put science, business and government on a steady path to improvement. Like a healthy diet, the approach succeeds little by little, through a series of good choices. The incremental approach won’t remake the world overnight, but it will over a generation.13
The best medical thinkers embody this modest approach. As one medical dean told his first-year class on day one: ‘Half of what we teach you here is wrong. Unfortunately, we do not know which half.’14 David Sackett, a pioneer of evidence-based medicine, wrote: ‘The first sin committed by experts consists in adding their prestige and their position to their opinions, which give the latter far greater persuasive power than they deserve on scientific grounds alone.’15 Judah Folkman, once one of the world’s top cancer researchers, observed, ‘I learn more from my failures than from my successes.’16
The same holds in business. In advanced countries, more than half of all start-ups fail within five years.17 Venture capital investors make most of their returns from a small number of their firms. Changing market conditions undoubtedly play a part, but the best firms aren’t just lucky – they’re also more adept at creating a cycle of rigorous testing and improvement. As one academic study observes, ‘Entrepreneurship is fundamentally about experimentation because the knowledge required to be successful cannot be known in advance or deduced from some set of first principles.’18 Intuit founder Scott Cook aims to create a company that’s ‘buzzing with experiments’, and in which ‘failing is perfectly fine’.19 Whatever happens, Cook tells his staff, ‘you’re doing right because you’ve created evidence, which is better than anyone’s intuition’. Journalist Megan McArdle argues that America’s economic success is rooted in ‘failing well’, through institutions that encourage risk-taking, forgiveness and learning from mistakes.20
Policy is replete with examples in which ‘expert’ judgement is revealed to be at odds with the data. For example, when considering whether to build a new rail line or road, governments typically commission projections of how many people will use it. But when researchers go back years later to see how many people actually used the project, it turns out that road traffic projections exceed the number of cars that use the road, and rail patronage projections overestimate the number of passengers.21 In the case of rail, the expert projections were particularly flawed. Nine out of ten forecasts overestimate usage, with the average projection being wrong by a factor of two. As we saw from the street-paving experiment in the Mexican city of Acayucan, it is actually possible to run randomised trials of infrastructure provision. But even if governments choose not to drive down that road, it’s vital that we use evidence to build a better feedback loop.
Just as modesty is a great ally of randomised trials, overconfidence can be their enemy. The more certain experts are of their skill and judgement, the less likely they are to use data. And yet we know from a range of studies that overconfidence is a common trait. Eighty-four per cent of Frenchmen think that they are above-average lovers.22 Ninety-three per cent of Americans think they are better-than-average drivers.23 Ninety-seven per cent of Australians rate their own beauty as average or better than average.24 In human evolution, overconfidence has proven to be a successful strategy.25 In our own lives, excess confidence can provide a sense of resilience – allowing us to take credit for successes while avoiding blame for failures.26
The problem is that we live in a world in which failure is surprisingly common. In medicine, we saw that only one in ten drugs that looks promising in lab tests ends up getting approval. In education, we saw that only one-tenth of the randomised trials commissioned by the US What Works Clearinghouse produced positive effects. In business, just one-fifth of Google’s randomised experiments helped them improve the product. Rigorous social policy experiments find that only a quarter of programs have a strong positive effect. Once you raise the evidence bar, a consistent finding emerges: most ideas that sound good don’t actually work in practice. As randomised trials take off in new areas – such as law and anti-terrorism – they may up-end conventional wisdom there too.27
In the end, good evaluation is nothing less than the search for truth. As Einstein famously put it, ‘I want to know the thoughts of God. Everything else is details.’ If there is a judgement day, I’m guessing that everyone who’s ever struggled to put together a good evaluation will take the opportunity to step up and ask the Almighty: ‘So, tell me, did it work or not?’
In the Woody Allen movie Annie Hall, two characters are arguing about the views of media theorist Marshall McLuhan. Suddenly McLuhan steps into the scene and tells one of them he’s absolutely wrong. The other declares: ‘Boy, if life were only like this!’ For many important questions, randomised trials are as close as we’ll come to that Annie Hall moment of truth.
For those at the cutting edge of research, a central challenge is effectively melding theory with randomised evaluations, to build more accurate models of human behaviour. Sure, there will always be a place for testing whether people are more likely to open letters in red or blue envelopes. But the most valuable randomised trials are those that provide deeper insights. Discussing what he has learnt from running randomised trials in Liberia, Chris Blattman reflects that ‘instead of asking, “Does the program work?”, I should have asked, “How does the world work?”’28 By testing fundamental assumptions, Blattman argues, it is possible to produce insights that generalise across programs.
In a similar vein, economists Jens Ludwig, Jeffrey Kling and Sendhil Mullainathan use the example of understanding ‘broken windows policing’, a strategy which focuses on addressing low-level offences (such as fare evasion, littering or minor property damage) as a way of reducing more serious crime.29 The trio suggest that most researchers would probably set out to evaluate broken windows policing by identifying a subset of cities and randomly instituting broken windows policing strategies in half of them. But if we want to understand the fundamentals, they argue, a better approach would be to buy a few dozen used cars, break the windows in half of them, park them in randomly selected neighbourhoods and see if more serious crimes increase in response.30 They call the policing experiment a policy evaluation, and the car experiment a mechanism evaluation – because it goes to the deeper question of whether broken windows increase violent crime. Both kinds of randomised trials can be useful. A police chief might only care about whether the policy works, but social researchers should focus on experiments which provide the deepest insights.
Around the world, there are many creative ways that randomised trials are being institutionalised. In 2005 Mexico created the National Council for Evaluation of Social Development Policy, an autonomous body charged with building the evidence base on what works to reduce poverty. Like the Nudge Units that have been established at the heart of governments in many advanced nations, Mexico’s national council reflects that country’s goal of being a leader among developing nations in running randomised trials.
Another promising way of encouraging randomised trials is to promise more money for ideas that succeed. In 2010 entrepreneur Maura O’Neill and development academic Michael Kremer persuaded the US Agency for International Development to create a division called ‘Development Innovation Ventures’.31 Founded on the principle of ‘scaling proven successes’, the program operates a three-tiered funding process. The first round offers funding of up to US$150,000. If a project shows evidence of success – often through a randomised trial – it can be eligible to move up to the second round, with up to US$1.5 million on offer. Prove success in the second round, and the idea moves into the third round, eligible for up to US$15 million in funding from Development Innovation Ventures.
In federal systems, another practical way that governments have encouraged randomised trials is by the national government building randomised trials into state grants programs. This has become commonplace in US federal legislation.32 For example, the Second Chance Act, dealing with strategies to facilitate prisoner re-entry into the community, sets aside 2 per cent of program funds for evaluations that ‘include, to the maximum extent possible, random assignment . . . and generate evidence on which re-entry approaches and strategies are most effective’. The No Child Left Behind Act calls for evaluation ‘using rigorous methodological designs and techniques, including control groups and random assignment, to the extent feasible, to produce reliable evidence of effectiveness’. Legislation to improve child development via home visits directs the Department of Health and Human Services to ‘ensure that States use the funds to support models that have been shown in well-designed randomized controlled trials, to produce sizeable, sustained effects on important child outcomes such as abuse and neglect’.
Charitable foundations have a vital role to play too. In the United Kingdom, about a hundred education-related randomised trials are underway, mostly conducted by the Education Endowment Foundation.33 A key contribution of the foundation is not only to find out what works, but also to help people sort through the available studies. The Education Endowment Foundation ranks research findings from 5 (a randomised trial with strong statistical power and low attrition) to 0 (a study with no comparison group). Like the evidence hierarchy I proposed in Chapter 1, the foundation’s rating system is a simple way to sum up the reliability of a particular evaluation.34 By putting randomised trials at the top, it creates an additional incentive to raise the evidence bar. Several US foundations, including the Edna McConnell Clark Foundation, Results for America, the Laura and John Arnold Foundation, and Bloomberg Philanthropies, are taking a similar approach, focusing on funding randomised trials – or programs that have been proven in randomised trials to be effective.
When it comes to caring about ends over means, few people can beat US paediatrician David Olds. Olds began developing his nurse–family partnership program in the 1970s. For the next twenty years, he used randomised trials to refine the program. In 1996 Olds started rolling out the program across communities. But even now – decades after he began creating the program – Olds wants to see it put to the test. Specifically, anyone outside the United States who wants to license the nurse–family partnership program must agree to perform a randomised evaluation. After all, the impact of home visits might differ in Britain, the Netherlands or Canada. As Olds sums up his philosophy: ‘I want to solve a problem, not promote a program.’35
*
In 2008, people who had previously given to the development charity Freedom from Hunger received a letter asking for another donation.36 Each letter told the story of a poor Peruvian widow named Rita, and then asked for support in one of two ways. Half the letters said, ‘In order to know that our programs work for people like Rita, we look for more than anecdotal evidence. That is why we have coordinated with independent researchers to conduct scientifically rigorous impact studies of our programs.’ The other half simply asserted, ‘But Freedom from Hunger knows that women like Rita are ready to end hunger in their own families and in their communities.’
In effect, the economists behind the mailing were running a randomised trial to test whether donors cared that a program was backed by randomised trials. On average, they found no impact – or, as they summed it up, no effect of effectiveness on donors. But when the results were broken down, the researchers found that including information on impact raised donation rates among large donors, while decreasing generosity among small donors. They concluded that among those who were simply looking for a warm glow, mentioning evaluation raised the spectre that not all aid might be effective. But among altruists, knowing that a program had a large impact made it more attractive.
The lesson of the Freedom from Hunger study is that we don’t just need more randomised trials – we also need to do a better job of demanding strong evidence. The more we ask the question ‘What’s your evidence?’, the more likely we are to find out what works – and what does not. Scepticism isn’t the enemy of optimism: it’s the channel through which our desire to solve big problems translates into real results. If we let our curiosity roam free, we might be surprised how much we can learn about the world, one coin toss at a time.
TEN COMMANDMENTS FOR RUNNING YOUR OWN RANDOMISED TRIAL
Conducting a successful randomised trial requires an unusual mix of talents. The best randomistas possess technical skills, operational wisdom, political savvy and courage.1
Ready to try it? Here are ten steps you should consider.
1. Decide what you want to test.
The simplest approach is to test a new intervention against a control group that gets nothing. Other studies run a horserace between two or more interventions. Factorial randomised trials test multiple interventions in combination. For example, a program to support self-employment might offer training, a cash grant, both or neither. If an intervention has an immediate impact, you might even be able to turn it on and off at random intervals. For example, to test whether a bedtime routine reduces insomnia, randomly assign it across half your evenings for the next month, then use a smartphone app to measure the quality of each night’s slumber.
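To make that last example concrete, here is a minimal sketch in Python of how such an on–off trial might be randomised and analysed. The thirty-night window, the assignment of the routine to half the evenings, and the sleep scores are all illustrative assumptions – in a real trial the scores would come from whatever your app actually records.

```python
# A minimal sketch (illustrative only): randomly assign a bedtime routine to
# half of the next 30 evenings, then compare average sleep scores between
# 'routine' nights and 'no routine' nights.
import random
import statistics

nights = list(range(1, 31))            # the next 30 evenings
random.shuffle(nights)
routine_nights = set(nights[:15])      # treatment: evenings with the routine

# In a real trial, each night's sleep score would come from your smartphone
# app; the values below are invented placeholders.
sleep_scores = {night: random.uniform(5, 9) for night in range(1, 31)}

treated = [score for night, score in sleep_scores.items() if night in routine_nights]
control = [score for night, score in sleep_scores.items() if night not in routine_nights]

print(f"Average sleep score with routine:    {statistics.mean(treated):.2f}")
print(f"Average sleep score without routine: {statistics.mean(control):.2f}")
```

With only thirty nights, any difference will be noisy; the point is simply that the random assignment, not the analysis, is what makes the comparison fair.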