by Andrew Leigh
I was struck by Garner’s willingness to let the data speak – a contrast to the brash overconfidence of many people who have created social programs. His study had cleared eight ethics committees, he told me, and had been approved because it answered a genuinely open question in the medical literature.
When the results were published, they showed that whether the coin came up heads (paramedics and ambulance) or tails (trauma physician and helicopter), there was no significant difference in the odds that a patient would survive.8 Ideally, the researchers would have liked more patients in the study, but they were forced to stop the trial when the state government changed its protocol. Confident that physicians provided better care than paramedics, the government decided to dispatch physicians to as many head injury patients as possible.9 With their own study cut short, the researchers are urging further randomised trials of head injury treatment, so that we can get a better answer about what works.
I’ve thought a lot about the ethics of the Head Injury Retrieval Trial. My brother Tim crashed his motorbike during that time, suffering a serious head injury (thankfully, he recovered without any lasting damage). There’s no way for me to know whether Tim was part of Alan Garner’s study – whether the paramedics who treated him were sent because of the toss of a coin – but I’d be comfortable if he had been part of the research. I know that the best way to improve emergency services is through gathering rigorous evidence about treatment methods.
I’m not alone in this attitude. Surveys of politicians in Australia and the United Kingdom reveal that about seven out of ten are supportive of controlled experiments, and most believe that randomised trials will become more common in future.10 Phil Ames and James Wilson, who conducted the Australian surveys, note that only one-tenth of politicians in these countries are concerned about the cost of randomised trials. Correctly, most politicians aren’t fearful that quality evaluation will bust the budget.
When parliamentarians are probed on their misgivings, the chief concern is fairness. Half of Australian politicians and one-third of British politicians worry that randomised trials are unfair.11 Reacting to the British survey results, medical writer Ben Goldacre concludes: ‘We need to get better at helping them to learn more about how randomised controlled trials work . . . Many members of parliament say they’re worried that randomised controlled trials are “unfair”, because people are chosen at random to receive a new policy intervention: but this is exactly what already happens with “pilot studies”, which have the added disadvantage of failing to produce good quality evidence on what works, and what does harm.’12
Rejecting randomised trials on the grounds of unfairness also seems at odds with the fact that lotteries have been used in advanced countries to allocate school places, housing vouchers and health insurance, to determine ballot order, and to decide who gets conscripted to fight in war. After World War II, the Australian government even used a lottery to assign taxi licences among returned servicemen.13 Indeed, one of the political appeals of lotteries is equity: before the drawing, everyone has the same chance of being picked. Yet, oddly, the ethical bar seems higher in the case of randomised trials than lotteries.
Other countries face even bigger challenges in running experiments. For example, the French constitution demands that all citizens be treated equally. This meant that the government had to change the constitution before France could conduct randomised trials.14
In my experience, most people are morally comfortable with a website that randomly tweaks its layout, a supermarket that randomly changes the shelf placement of its products, or a political campaign that randomly chooses which people to telephone. But when it comes to providing financial incentives to bribe a driving licence tester, or studying how best to treat an unconscious motorbike rider, the ethical questions become more difficult. In these cases, randomistas must think hard about the morality of their research. In the words of Elizabeth Linos, the head of the US Behavioural Insights Team, ‘it is important to clarify that we take ethics seriously, and mean it – not just to persuade’.15
For research conducted by universities, a key ethical safeguard is the requirement for studies to be reviewed by ethics committees, also known as institutional review boards. Ethical review was partly spurred by the Nuremberg Doctors’ Trial of 1946–47, in which Nazi doctors were convicted of performing experiments on concentration camp prisoners without their consent. In the wake of those revelations, the World Medical Association adopted the 1964 Declaration of Helsinki, which recommended informed consent and ethical review. Another atrocious breach came to light in 1972, when it emerged that the US Public Health Service had spent four decades studying about 400 African-American men with syphilis in Tuskegee, Alabama, while deliberately withholding treatment from them.
In 1978 the US government’s Belmont Report set out three principles that would guide ethical review: respecting all people; maximising the benefits and minimising the risks; and doing justice between groups of people. Australia, Canada and other countries now apply similar principles to their ethics review processes.16
My own experience of seeking ethical clearance at the Australian National University was a positive one. Although the committee members were not experts in my research area, they had thought carefully about how to reduce the costs to participants and how to ensure consent wherever possible.
Increasingly, social science experiments today are adopting processes that have long been standard in medicine. If the stakes are high, researchers should engage an expert panel to periodically monitor the trial and call it quits if the results are unequivocal. These oversight committees, also called ‘data and safety monitoring boards’, sometimes stop experiments because the treatment group is doing far better than the control group. However, they can also step in if the treatment group is unexpectedly doing worse than the control group.
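As a rough illustration of how such a board operates, here is a small Python simulation. The survival rates, thresholds and schedule of interim looks are all invented; real boards use formal group-sequential stopping rules, but the spirit is the same: peek at the accumulating data at intervals, and stop only if the gap between the groups is overwhelming.

```python
# Toy interim-monitoring rule in the spirit of a data and safety monitoring
# board. All survival rates and thresholds are invented; real boards use
# formal group-sequential stopping rules.
import numpy as np

rng = np.random.default_rng(4)
true_lift = 0.15           # invented: treatment lifts survival by 15 points
looks, per_look = 5, 200   # interim analysis after every 200 patients per arm

treat_outcomes, control_outcomes = [], []
for look in range(1, looks + 1):
    treat_outcomes.extend(rng.random(per_look) < 0.60 + true_lift)
    control_outcomes.extend(rng.random(per_look) < 0.60)
    t, c = np.mean(treat_outcomes), np.mean(control_outcomes)
    n = len(treat_outcomes)
    se = np.sqrt(t * (1 - t) / n + c * (1 - c) / n)
    z = (t - c) / se
    print(f"Look {look}: survival gap = {t - c:+.3f}, z = {z:.2f}")
    if abs(z) > 3.0:       # stop early only on overwhelming evidence
        print("The board stops the trial: the results are unequivocal.")
        break
```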
Social scientists are also following medical researchers in comparing new interventions against an appropriate baseline. If an effective treatment already exists, the right thing to do will often be to pit a new treatment against the standard treatment, rather than against a ‘do nothing’ approach.
We have already seen a range of examples in which ethical concerns can be addressed by careful study design. If it is impractical to deliver a program to all communities at once, then why not randomise the rollout so that we can learn something about its impact? In the case of conditional cash transfers in Mexico, village schools in Afghanistan and biometric identifiers in India, everyone eventually received the program. The only difference was that rather than letting political or bureaucratic factors determine the rollout schedule, the schedule was explicitly randomised, and the results were used to inform policy.
Another approach, which can be used to evaluate ongoing programs, is to let the control group keep accessing the service while trying to increase the take-up rate among the treatment group. This is known as an ‘encouragement design’. In one such study, the University of Wollongong wanted to test the value of its student support services, but without barring any students from signing up if they were keen. The evaluators randomly sent text messages and emails to some students, informing them that they could win a $1000 retail voucher if they attended a peer-assisted study session.17 As a result of this encouragement, students in the treatment group attended an average of 30 minutes more of study sessions than students in the control group. The researchers could then use this randomly induced difference to test the impact of the program on student performance. They concluded that getting extra support services made no significant difference to final grades.
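To see the mechanics of an encouragement design, here is a minimal sketch in Python. Everything in it is simulated – the attendance boost, the effect on grades and the sample size are invented for illustration, not taken from the Wollongong study. Because the nudge is random, any difference in grades between the two groups can only have come through the extra attendance it induced.

```python
# Sketch of an encouragement design analysis on simulated data.
# All numbers are invented; nothing comes from the Wollongong study.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Randomised encouragement: half the students get the text/email nudge.
encouraged = rng.integers(0, 2, size=n)

# Hours of study sessions attended: anyone may attend, but the
# encouragement raises attendance (here by about 0.5 hours on average).
attendance = rng.exponential(1.0, size=n) + 0.5 * encouraged

# Final grade: depends on attendance (true effect = 2 points per hour).
grade = 60 + 2.0 * attendance + rng.normal(0, 10, size=n)

# Wald (instrumental variables) estimator: the randomly induced difference
# in grades, scaled by the randomly induced difference in attendance.
gap_grade = grade[encouraged == 1].mean() - grade[encouraged == 0].mean()
gap_attend = attendance[encouraged == 1].mean() - attendance[encouraged == 0].mean()
print(f"Estimated effect per extra hour attended: {gap_grade / gap_attend:.2f}")
```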
It’s vital to think carefully about the ethics of every randomised trial. Yet history reminds us that failing to conduct a rigorous evaluation can sometimes be the most unethical approach. In the 1950s, West German scientists discovered a drug that helped to inhibit morning sickness. Soon, it was being taken by pregnant women in more than forty countries, with the manufacturer assuring them that the drug ‘can be given with complete safety to pregnant women and nursing mothers without adverse effect on mother or child’.
But when the manufacturers sought market approval in the United States, they struck a snag. Frances Kelsey, a newly hired employee at the Food and Drug Administration, noticed that the drug seemed to paralyse the peripheral nerves in some patients. Before the product could be approved, she requested further studies on its side effects, to prove that it did not affect the developing embryo. The frustrated manufacturer hoped to have the product on the American market as quickly as possible. In Kelsey’s recollection, they told her she was ‘depriving people of this thing’.18 The company appealed to her superiors to overrule Kelsey’s decision. But the management of the Food and Drug Administration backed her call for proper clinical trials.
In 1961 new research emerged showing that the ‘wonder drug’ was causing babies to be born limbless, or with their arms and legs as short stumps. In countries where it had been approved for sale – including the United Kingdom, Germany and Canada – more than 10,000 babies were born with deformities. Only about half survived. Yet thanks to Frances Kelsey’s demand for evidence, virtually no American babies were affected, because thalidomide was never approved for sale. Today, the Food and Drug Administration bestows an annual ‘Kelsey Award’ for excellence and courage in protecting public health.
Frances Kelsey’s actions are a reminder that ethical concerns about randomisation cut both ways. If the intervention helps, then a randomised trial leaves the treatment group better off than the control group. But if the intervention does harm, then failing to evaluate it properly can leave everyone worse off. While a rigorous demand for evidence saved thousands of American babies from the harm of thalidomide, there was no approval process required before governments began rolling out Scared Straight – the program we now know increased delinquency rates. If Frances Kelsey had been in charge of properly evaluating Scared Straight before it commenced, fewer lives would have been blighted by crime and imprisonment.
The best randomistas are passionate about solving a social problem, yet sceptical about the ability of any particular program to achieve its goals. Launching an evaluation of her organisation’s flagship program, Read India, Rukmini Banerji told the audience: ‘And of course [the researchers] may find that it doesn’t work. But if it doesn’t work, we need to know that. We owe it to ourselves and the communities we work with not to waste their and our time and resources on a program that does not help children learn. If we find that this program isn’t working, we will go and develop something that will.’19
Not everyone shares Banerji’s openness to high-quality evaluation. When researcher Tess Lea sought to run a randomised trial of an online literacy tool in the Northern Territory, she hoped to improve reading standards among Indigenous Australians. But the evaluation of the ‘ABRACADABRA’ reading program was criticised on the basis that Indigenous children might not learn well with computers, that Indigenous children should not be tested using non-Indigenous tests, that the program was designed by Canadians, and that ‘the proposition to pursue experimental research was inherently racist’.20 The program turned out to improve student literacy, but Lea stated publicly that she would never again attempt a randomised trial in Indigenous education.21 It’s an unhappy result, given that less than a tenth of Indigenous programs have been subjected to an evaluation of any kind, let alone a randomised trial.22
The contrast between Rukmini Banerji’s support for evaluating her own program and the opposition that Tess Lea faced reminds us of just how important it is to distinguish the means from the ends: to be committed to solving the problem, rather than wedded to any particular program. That scepticism is captured in ‘Rossi’s Law’ (named after sociologist Peter Rossi), which states: ‘The better designed the impact assessment of a social program, the more likely is the resulting estimate of net impact to be zero.’23 Rossi’s Law does not mean we should give up hope of changing the world for the better. But we ought to be sceptical of anyone peddling panaceas. Recognising that many social programs are flawed should lead to more rigorous evaluation and patient sifting through the evidence until we find a program that works.
In some cases, ethical concerns are grounded in strong science. For example, the evidence of a link between smoking and lung cancer is so strong that it would be unethical to randomise participants to receive free cigarettes. But in other cases, ethical objections turn out to be a smokescreen – used merely to defend ineffective programs from appropriate scrutiny.
Archie Cochrane, one of the pioneers of medical randomised trials, once came up with a novel trick to unmask such concerns. Presenting the results of a randomised evaluation of coronary care units, Cochrane faced an audience of cardiologists who had vehemently opposed the use of home care over hospital care. As economics writer Tim Harford tells the story, the study’s early findings favoured home care, but Cochrane mischievously switched the results. When shown results indicating that hospitals were safer than home care, the cardiologists demanded that his ‘unethical’ study stop immediately. ‘He then revealed the truth and challenged the cardiologists to close down their own hospital units without delay. There was dead silence.’24
*
As anyone who has eaten cafeteria food knows, things that work well on a small scale are not necessarily so tasty on a larger scale.25 Anytime we’re hoping to learn something from randomised trials, we need to consider whether the intervention is a boutique program or a mass-market one. For example, many early childhood randomised trials involve highly trained teachers working with extremely disadvantaged toddlers. But as these programs scale up, they are likely to recruit teachers with lower qualifications and less experience, and children from more affluent backgrounds. It would be a mistake to think that the huge benefit–cost ratio of Perry Preschool (which returned $7 to $12 in benefits for every $1 of spending) would necessarily translate to a population-wide program.
Something else also happens when a trial is scaled up: we begin to see whether its successes are coming at the expense of those outside the program. For example, suppose researchers designed a program aimed at teaching teenagers to confidently make eye contact in job interviews. It might be the case that the program helped its participants find jobs, but only at the expense of other jobseekers. Or it might be that the program actually increased the overall employment rate. Running a randomised trial with a few hundred participants would give us what economists call the ‘partial equilibrium’ effect. But only by randomising across labour markets – for example, by randomly choosing to run the program in some cities and not in others – could we gauge the ‘general equilibrium’ effects. For example, the partial equilibrium effect might look at whether a program helps its participants skip a few places up the queue, while the general equilibrium impact is whether it makes the whole line move faster. It’s encouraging if a program helps its participants, but it’s even better if these gains don’t come at someone else’s expense.
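A toy simulation makes the distinction concrete. In the sketch below (Python, with invented numbers), employers fill a fixed number of vacancies, so the program shuffles the hiring queue without lengthening it: the treated group does better, the control group does worse, and overall employment is unchanged.

```python
# Toy displacement simulation: a fixed number of vacancies, so a program
# that boosts its participants' interview scores reshuffles the queue
# without creating jobs. All numbers are invented for illustration.
import numpy as np

rng = np.random.default_rng(1)
n_seekers, n_jobs = 1_000, 300

score = rng.normal(0, 1, size=n_seekers)            # baseline employability
treated = rng.integers(0, 2, size=n_seekers) == 1   # randomised program
score = score + 0.5 * treated                       # program boosts scores

# Employers fill a fixed number of vacancies from the top of the queue.
hired = np.zeros(n_seekers, dtype=bool)
hired[np.argsort(-score)[:n_jobs]] = True

# 'Partial equilibrium' comparison: treated vs control employment rates.
print(f"Treated employment: {hired[treated].mean():.1%}")
print(f"Control employment: {hired[~treated].mean():.1%}")
# 'General equilibrium' check: total employment is fixed at n_jobs.
print(f"Overall employment: {hired.mean():.1%} (unchanged by the program)")
```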
Good randomistas also think carefully about what those in the control group are doing. In medical research, the answer can be as simple as ‘getting the placebo drug’. But in social programs, people who miss out on the treatment may go looking for something similar. Until recently, randomised evaluations of the Head Start early childhood program tended to produce fairly small impacts – considerably less than the effects measured from the first preschool demonstration programs, such as Perry Preschool.26
The difference was in the control group. In the early 1960s, when the Perry Preschool study took place, there were no other preschool options for low-income families. But in recent decades, early childhood programs have proliferated across US cities. So while the older Perry Preschool results show the impact of preschool compared with parental care, the newer Head Start results show the impact of one preschool compared with another preschool.27 Realising that the control group had sought out other preschool options revealed that Head Start’s true benefit–cost ratio was almost twice as large as had been previously estimated.28 Atoms in a laboratory experiment don’t mind if they get put in the control test tube – but humans who get put in the control group may go looking for another option.
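Here is a small Python sketch of why such substitution matters. The effect size and substitution rate are invented; the point is simply that when some of the control group finds a similar service elsewhere, the measured treatment–control gap understates the program’s benefit over doing nothing.

```python
# Control-group substitution: if people assigned to control find a similar
# service elsewhere, the trial measures preschool-vs-preschool rather than
# preschool-vs-nothing. All parameters are invented.
import numpy as np

rng = np.random.default_rng(2)
n = 20_000
true_effect = 10.0        # benefit of preschool over parental care (invented)

treat = rng.integers(0, 2, size=n) == 1
# Suppose 40% of control families enrol in a rival preschool program.
in_preschool = treat | (rng.random(n) < 0.40)

outcome = 50 + true_effect * in_preschool + rng.normal(0, 15, size=n)

measured = outcome[treat].mean() - outcome[~treat].mean()
print(f"True effect of preschool vs none: {true_effect:.1f}")
print(f"Measured treatment-control gap:   {measured:.1f}")  # about 40% smaller
```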
Participants who search for alternatives pose a challenge, but randomistas still start ahead of researchers using non-experimental methods, because they have a more credible counterfactual. To see this, it’s worth briefly reviewing a few ways that economists try to devise a counterfactual when they do not have a randomised evaluation.
One form of non-randomised evaluation is to study differences across regions. For example, if a policy is applied in a single state, we might use people in the rest of the country as the counterfactual group.29 That would let us ask the question: when the state policy changes, how does it affect the gap between outcomes in that jurisdiction and the rest of the nation?
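Economists call this comparison ‘difference-in-differences’. A minimal worked example, with invented numbers, shows the arithmetic: the nationwide trend is subtracted out, and whatever change remains is attributed to the policy.

```python
# Difference-in-differences sketch: one state adopts a policy; the rest of
# the country serves as the counterfactual. All numbers are invented.

# Average outcome (say, an employment rate) before and after the policy.
state_before, state_after = 60.0, 66.0
rest_before, rest_after = 58.0, 61.0

# The naive before-after change in the state mixes the policy's effect with
# the nationwide trend; subtracting the rest-of-country change removes it.
did = (state_after - state_before) - (rest_after - rest_before)
print(f"Difference-in-differences estimate: {did:.1f} points")  # 3.0
```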
Another way economists try to find a similar comparison group is to look at sharp borders. In studying the impact of school quality on house prices, we can study what happens when the school catchment boundary runs down the middle of the street.30 If people on one side have access to a more desirable public school, how does this affect house prices?
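A sketch of the idea, with made-up sale prices: restrict attention to homes within a narrow band of the boundary, where the neighbourhoods are otherwise similar, and compare average prices on each side.

```python
# Boundary-discontinuity sketch: houses just inside a sought-after school
# catchment vs houses just outside. All prices are invented.
import numpy as np

rng = np.random.default_rng(5)
n = 5_000
km = rng.uniform(-2.0, 2.0, size=n)      # signed distance to the boundary
in_catchment = km > 0

# Prices trend smoothly with location, plus a jump at the boundary itself.
price = (500_000 + 15_000 * km + 30_000 * in_catchment
         + rng.normal(0, 40_000, size=n))

# Compare only near-boundary homes, where neighbourhoods are most alike.
band = np.abs(km) < 0.25
gap = price[band & in_catchment].mean() - price[band & ~in_catchment].mean()
print(f"Estimated school-catchment premium: ${gap:,.0f}")  # near the $30,000 jump
```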
In the absence of a randomised evaluation, timing discontinuities can be used too. Attempting to estimate the impact of education on earnings, we can compare people born just before the school age cut-off with people born just afterwards.31 Suppose we compare someone born on the cut-off date for school entry with a person born the next day. If both people drop out of school at the same age, then a single day’s difference in birth timing will lead to nearly a year’s less education for the younger person.
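The arithmetic is worth spelling out. Under hypothetical rules (start school at five if born by the cut-off, otherwise wait a year; leave at sixteen), one day’s difference in birth date translates into a full year’s difference in schooling:

```python
# Worked example of the school-entry cut-off logic.
# The ages and rules are hypothetical; real school systems differ.
LEAVING_AGE = 16            # assumed minimum school-leaving age

start_age_on_cutoff = 5     # born on the cut-off date: starts school at 5
start_age_day_after = 6     # born one day later: waits a full year

years_on_cutoff = LEAVING_AGE - start_age_on_cutoff   # 11 years of schooling
years_day_after = LEAVING_AGE - start_age_day_after   # 10 years of schooling
print(f"One day's difference in birth date costs "
      f"{years_on_cutoff - years_day_after} year of schooling")
```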
Another trick researchers use is to look for naturally occurring randomness. For example, if we’re interested in the impact of economic growth on whether dictators get thrown out, we might look at changes in growth caused by annual variation in rainfall.32 Or if we wanted to see how many jobs public works programs create, we might look for instances in which the spread of investment was driven by political pork-barrelling rather than local need.33
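Rainfall works as a natural experiment because it shifts growth but is not itself caused by politics. A compact two-stage sketch in Python (all coefficients invented) shows how the instrument strips out confounding:

```python
# Two-stage least squares sketch using rainfall as an instrument for growth.
# Entirely simulated; the coefficients are invented for illustration.
import numpy as np

rng = np.random.default_rng(3)
n = 500
rain = rng.normal(0, 1, size=n)        # annual rainfall shocks
confound = rng.normal(0, 1, size=n)    # unobserved factor (e.g. conflict)
growth = 0.8 * rain - 0.5 * confound + rng.normal(0, 1, size=n)
# Leader turnover falls when growth is high; the confounder biases a naive
# regression of turnover on growth, but rainfall is untouched by it.
turnover = -0.3 * growth + 0.7 * confound + rng.normal(0, 1, size=n)

# Stage 1: predict growth from rainfall. Stage 2: use only that prediction.
stage1 = np.polyfit(rain, growth, 1)
growth_hat = np.polyval(stage1, rain)
stage2 = np.polyfit(growth_hat, turnover, 1)
print(f"IV estimate of growth's effect on turnover: {stage2[0]:.2f}")  # ~ -0.3
```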