So, on the basis of identical evidence in identical cases, a defendant could expect to walk away scot-free or be sent straight to jail, depending entirely on which judge they were lucky (or unlucky) enough to find themselves in front of.
That’s quite a blow for anyone holding on to hopes of courtroom consistency. But it gets worse. Because, not only do the judges disagree with each other, they’re also prone to contradicting their own decisions.
In a more recent study, 81 UK judges were asked whether they’d award bail to a number of imaginary defendants.10 Each hypothetical case had its own imaginary back-story and imaginary criminal history. Just like their counterparts in the Virginia study, the British judges failed to agree unanimously on a single one of the 41 cases presented to them.11 But this time, in among the 41 hypothetical cases given to every judge were seven that appeared twice – with the names of the defendants changed on their second appearance so the judge wouldn’t notice they’d been duplicated. It was a sneaky trick, but a revealing one. Most judges didn’t manage to make the same decision on the same case when seeing it for the second time. Astonishingly, some judges did no better at matching their own answers than if they were, quite literally, awarding bail at random.12
Numerous other studies have come to the same conclusion: whenever judges have the freedom to assess cases for themselves, there will be massive inconsistencies. Allowing judges room for discretion means allowing there to be an element of luck in the system.
There is a simple solution, of course. An easy way to make sure that judges are consistent is to take away their ability to exercise discretion. If every person charged with the same offence were dealt with in exactly the same way, precision could be guaranteed – at least in matters of bail and sentencing. And indeed, some countries have taken this route. There are prescriptive sentencing systems in operation at the federal level in the United States and in parts of Australia.13 But this kind of consistency comes at a price. For all that you gain in precision, you lose in another kind of fairness.
To illustrate, imagine two defendants, both charged with stealing from a supermarket. One is a relatively comfortable career criminal stealing through choice; the other is recently unemployed and struggling to make ends meet, stealing to feed their family and full of remorse for their actions. By removing the freedom to take these mitigating factors into account, strict guidelines that treat all those charged with the same crime in the same way can end up being unnecessarily harsh on some, and mean throwing away the chance to rehabilitate some criminals.
This is quite a conundrum. However you set up the system for your judges, you have to find a tricky balance between offering individualized justice and ensuring consistency. Most countries have tried to solve the dilemma by settling on a system that falls somewhere between the prescriptive extreme of US federal law and one based almost entirely on judges’ discretion – like that used in Scotland.14 Across the Western world, sentencing guidelines tend to lay down a maximum sentence (as in Ireland) or a minimum sentence (as in Canada) or both (as in England and Wales),15 and allow judges latitude to adjust the sentence up or down between those limits.
No system is perfect. There is always a muddle of competing unfairnesses, a chaos of opposing injustices. But in all this conflict and complexity an algorithm has a chance of coming into its own. Because – remarkably – with an algorithm as part of the process, both consistency and individualized justice can be guaranteed. No one needs to choose between them.
The justice equation
Algorithms can’t decide guilt. They can’t weigh up arguments from the defence and prosecution, or analyse evidence, or decide whether a defendant is truly remorseful. So don’t expect them to replace judges any time soon. What an algorithm can do, however, incredible as it might seem, is use data on an individual to calculate their risk of re-offending. And, since many judges’ decisions are based on the likelihood that an offender will return to crime, that turns out to be a rather useful capacity to have.
Data and algorithms have been used in the judicial system for almost a century, the first examples dating back to 1920s America. At the time, under the US system, convicted criminals would be sentenced to a standard maximum term and then become eligible for parolefn1 after a period of time had elapsed. Tens of thousands of prisoners were granted early release on this basis. Some were successfully rehabilitated, others were not. But collectively they presented the perfect setting for a natural experiment: could you predict whether an inmate would violate their parole?
Enter Ernest W. Burgess, a Canadian sociologist at the University of Chicago with a thirst for prediction. Burgess was a big proponent of quantifying social phenomena. Over the course of his career he tried to forecast everything from the effects of retirement to marital success, and in 1928 he became the first person to successfully build a tool to predict the risk of criminal behaviour based on measurement rather than intuition.
Using all kinds of data from three thousand inmates in three Illinois prisons, Burgess identified 21 factors he deemed to be ‘possibly significant’ in determining the chances of whether someone would violate the terms of their parole. These included the type of offence, the months served in prison and the inmate’s social type, which – with the delicacy one would expect from an early-twentieth-century social scientist – he split into categories including ‘hobo’, ‘drunkard’, ‘ne’er-do-well’, ‘farm boy’ and ‘immigrant’.16
Burgess gave each inmate a score between zero and one on each of the 21 factors. The men who got high scores (between 16 and 21) he deemed least likely to re-offend; those who scored low (four or less) he judged likely to violate their terms of release.
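If it helps to see the tally written out, here is a rough sketch in Python of a Burgess-style score. Only the idea of summing 21 zero-or-one marks and reading off a risk band comes from Burgess; the factor names in the example are invented purely for illustration.

```python
# A rough sketch of a Burgess-style score: each of 21 factors contributes
# either 0 or 1, and the total places an inmate in a risk band.
# The factor names below are invented purely for illustration.

def burgess_score(factors: dict) -> str:
    """Sum 21 binary factors and map the total to a risk band."""
    total = sum(factors.values())   # each value is 0 or 1
    if total >= 16:                 # 16-21: deemed least likely to re-offend
        return "low risk"
    if total <= 4:                  # 4 or fewer: likely to violate parole
        return "high risk"
    return "intermediate"

# Hypothetical inmate with favourable marks on 17 of the 21 factors
example = {f"factor_{i}": (1 if i < 17 else 0) for i in range(21)}
print(burgess_score(example))  # -> "low risk"
```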
When all the inmates were eventually granted their release, and so were free to violate the terms of their parole if they chose to, Burgess had a chance to check how good his predictions were. From such a basic analysis, he managed to be remarkably accurate. Ninety-eight per cent of his low-risk group made a clean pass through their parole, while two-thirds of his high-risk group did not.17 Even crude statistical models, it turned out, could make better forecasts than the experts.
But his work had its critics. Sceptical onlookers questioned how much the factors which reliably predicted parole success in one place at one time could apply elsewhere. (They had a point: I’m not sure the category ‘farm boy’ would be much help in predicting recidivism among modern inner-city criminals.) Other scholars criticized Burgess for just making use of whatever information was on hand, without investigating if it was relevant.18 There were also questions about the way he scored the inmates: after all, his method was little more than opinion written in equations. None the less, its forecasting power was impressive enough that by 1935 the Burgess method had made its way into Illinois prisons, to support parole boards in making their decisions.19 And by the turn of the century mathematical descendants of Burgess’s method were being used all around the world.20
Fast-forward to the modern day, and the state-of-the-art risk-assessment algorithms used by courtrooms are far more sophisticated than the rudimentary tools designed by Burgess. They are not only found assisting parole decisions, but are used to help match intervention programmes to prisoners, to decide who should be awarded bail, and, more recently, to support judges in their sentencing decisions. The fundamental principle is the same as it always was: in go the facts about the defendant – age, criminal history, seriousness of the crime and so on – and out comes a prediction of how risky it would be to let them loose.
So, how do they work? Well, broadly speaking, the best-performing contemporary algorithms use a technique known as random forests, which – at its heart – has a fantastically simple idea. The humble decision tree.
Ask the audience
You might well be familiar with decision trees from your schooldays. They’re popular with maths teachers as a way to structure observations, like coin flips or dice rolls. Once built, a decision tree can be used as a flowchart: taking a set of circumstances and assessing step by step what to do, or, in this case, what will happen.
Imagine you’re trying to decide whether to award bail to a particular individual. As with parole, this decision is based on a straightforward calculation. Guilt is irrelevant. You only need to make a prediction: will the defendant violate the terms of their bail agreement, if granted leave from jail?
To help with your prediction, you have data from a handful of previous offenders, some who fled or went on to re-offend while on bail, some who didn’t. Using the data, you could imagine constructing a simple decision tree by hand, like the one below, using the characteristics of each offender to build a flowchart. Once built, the decision tree can forecast how the new offender might behave. Simply follow the relevant branches according to the characteristics of the offender until you get to a prediction. Just as long as they fit the pattern of everyone who has gone before, the prediction will be right.
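Written out as code, that hand-built flowchart is nothing more than a set of nested questions. A minimal sketch, with the branching questions and thresholds entirely made up for illustration:

```python
# A hand-built decision tree as a flowchart of nested conditions.
# The questions and thresholds are invented; a real tree would be grown
# from the characteristics of previous offenders in the data.

def predict_bail_violation(age: int, prior_offences: int, employed: bool) -> bool:
    """Follow the branches and return True if we predict a bail violation."""
    if prior_offences > 3:
        if age < 25:
            return True          # young with many priors: predict violation
        return not employed      # older with many priors: hinges on employment
    if employed:
        return False             # few priors and a job: predict compliance
    return age < 21              # few priors, no job: only the youngest flagged

print(predict_bail_violation(age=22, prior_offences=5, employed=False))  # -> True
```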
But this is where decision trees of the kind we made in school start to fall down. Because, of course, not everyone does follow the pattern of those who went before. On its own, this tree is going to get a lot of forecasts wrong. And not just because we’re starting with a simple example. Even with an enormous dataset of previous cases and an enormously complicated flowchart to match, using a single tree may only ever be slightly better than random guessing.
And yet, if you build more than one tree – everything can change. Rather than using all the data at once, there is a way to divide and conquer. In what is known as an ensemble, you first build thousands of smaller trees from random subsections of the data. Then, when presented with a new defendant, you simply ask every tree to vote on whether it thinks awarding bail is a good idea or not. The trees may not all agree, and on their own they might still make weak predictions, but just by taking the average of all their answers, you can dramatically improve the precision of your predictions.
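In code, the divide-and-conquer step is exactly that: grow many small trees on random subsections of the data and let them vote. A minimal sketch using scikit-learn’s off-the-shelf random forest, with a handful of invented defendants (each row is age, number of prior offences and whether they were employed; the label is 1 if they violated bail):

```python
# A minimal sketch of the ensemble idea with scikit-learn's random forest.
# The data is invented: [age, prior offences, employed (0/1)] per defendant,
# with label 1 meaning they violated their bail conditions.
from sklearn.ensemble import RandomForestClassifier

X = [[22, 5, 0], [45, 0, 1], [31, 2, 1], [19, 4, 0], [52, 1, 1], [27, 6, 0]]
y = [1, 0, 0, 1, 0, 1]

# A thousand small trees, each grown on a random subsection of the data
forest = RandomForestClassifier(n_estimators=1000, max_depth=3, random_state=0)
forest.fit(X, y)

# Every tree votes on the new defendant; the share of trees predicting a
# violation is read off as a risk score.
new_defendant = [[24, 3, 0]]
print(forest.predict_proba(new_defendant)[0][1])  # e.g. 0.7 -> fairly high risk
```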
It’s a bit like asking the audience in Who Wants To Be A Millionaire? A room full of strangers will be right more often than the cleverest person you know. (The ‘ask the audience’ lifeline had a 91 per cent success rate compared to just 65 per cent for ‘phone a friend’.)21 The errors made by many can cancel each other out and result in a crowd that’s wiser than the individual.
The same applies to the big group of decision trees which, taken together, make up a random forest (pun intended). Because the algorithm’s predictions are based on the patterns it learns from the data, a random forest is described as a machine-learning algorithm, which comes under the broader umbrella of artificial intelligence. (The term ‘machine learning’ first came up in the ‘Power’ chapter, and we’ll meet many more algorithms under this particular canopy later, but for now it’s worth noting how grand that description makes it sound, when the algorithm is essentially the flowcharts you used to draw at school, wrapped up in a bit of mathematical manipulation.) Random forests have proved themselves to be incredibly useful in a whole host of real-world applications. They’re used by Netflix to help predict what you’d like to watch based on past preferences;22 by Airbnb to detect fraudulent accounts;23 and in healthcare for disease diagnosis (more on that in the following chapter).
When used to assess offenders, they can claim two huge advantages over their human counterparts. First, the algorithm will always give exactly the same answer when presented with the same set of circumstances. Consistency comes guaranteed, but not at the price of individualized justice. And there is another key advantage: the algorithm also makes much better predictions.
Human vs machine
In 2017, a group of researchers set out to discover just how well a machine’s predictions stacked up against the decisions of a bunch of human judges.24
To help them in their mission, the team had access to the records of every person arrested in New York City over a five-year period between 2008 and 2013. During that time, three-quarters of a million people were subject to a bail hearing, which meant easily enough data to test an algorithm on a head-to-head basis with a human judge.
An algorithm hadn’t been used by the New York judicial system during these cases, but looking retrospectively, the researchers got to work building lots of decision trees to see how well one could have predicted the defendants’ risk of breaking bail conditions at the time. In went the data on an offender: their rap sheet, the crime they’d just committed and so on. Out came a probability of whether or not that defendant would go on to violate the terms of their bail.
In the real data, 408,283 defendants were released before they faced trial. Any one of those was free to flee or commit other crimes, which means we can use the benefit of hindsight to test how accurate the algorithm’s predictions and the humans’ decisions were. We know exactly who failed to appear in court later (15.2 per cent) and who was re-arrested for another crime while on bail (25.8 per cent).
Unfortunately for the science, any defendant deemed high risk by a judge would have been denied bail at the time – and hence, on those cases, there was no opportunity to prove the judges’ assessment right or wrong. That makes things a little complicated. It means there’s no way to come up with a cold, hard number that captures how accurate the judges were overall. And without a ‘ground truth’ for how those defendants would have behaved, you can’t state an overall accuracy for the algorithm either. Instead, you have to make an educated guess on what the jailed defendants would have done if released,25 and make your comparisons of human versus machine in a bit more of a roundabout way.
One thing is for sure, though: the judges and the machine didn’t agree on their predictions. The researchers showed that many of the defendants flagged by the algorithm as real bad guys were treated by the judges as though they were low risk. In fact, almost half of the defendants the algorithm flagged as the riskiest group were given bail by the judges.
But who was right? The data showed that the group the algorithm was worried about did indeed pose a risk. Just over 56 per cent of them failed to show up for their court appearances, and 62.7 per cent went on to commit new crimes while out on bail – including the worst crimes of all: rape and murder. The algorithm had seen it all coming.
The researchers argued that, whichever way you use it, their algorithm vastly outperforms the human judge. And the numbers back them up. If you’re wanting to incarcerate fewer people awaiting trial, the algorithm could help by consigning 41.8 per cent fewer defendants to jail while keeping the crime rate the same. Or, if you were happy with the current proportion of defendants given bail, then that’s fine too: just by being more accurate at selecting which defendants to release, the algorithm could reduce the rate of skipping bail by 24.7 per cent.
These benefits aren’t just theoretical. Rhode Island, where the courts have been using these kinds of algorithms for the last eight years, has achieved a 17 per cent reduction in prison populations and a 6 per cent drop in recidivism rates. That’s hundreds of low-risk offenders who aren’t unnecessarily stuck in prison, hundreds of crimes that haven’t been committed. Plus, given that it costs over £30,000 a year to incarcerate one prisoner in the UK26 – while in the United States spending a year in a high-security prison can cost about the same as going to Harvard27 – that’s hundreds of thousands of pounds of taxpayers’ money saved. It’s a win–win for everyone.
Or is it?
Finding Darth Vader
Of course, no algorithm can perfectly predict what a person is going to do in the future. Individual humans are too messy, irrational and impulsive for a forecast ever to be certain of what’s going to happen next. Algorithms might give better predictions than people do, but they will still make mistakes. The question is, what happens to all the people whose risk scores are wrong?
There are two kinds of mistake that the algorithm can make. Richard Berk, a professor of criminology and statistics at the University of Pennsylvania and a pioneer in the field of predicting recidivism, has a noteworthy way of describing them.
‘There are good guys and bad guys,’ he told me. ‘Your algorithm is effectively asking: “Who are the Darth Vaders? And who are the Luke Skywalkers?”’
Letting a Darth Vader go free is one kind of error, known as a false negative. It happens whenever you fail to identify the risk that an individual poses.
Incarcerating Luke Skywalker, on the other hand, would be a false positive. This is when the algorithm incorrectly identifies someone as a high-risk individual.
These two kinds of error, false positive and false negative, are not unique to recidivism. They’ll crop up repeatedly throughout this book. Any algorithm that aims to classify can be guilty of these mistakes.
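Counting the two kinds of error is simple once you have predictions and outcomes side by side. A minimal sketch with invented numbers, just to pin down the definitions:

```python
# Tallying the two kinds of error from invented predictions and outcomes.
predicted = [1, 0, 1, 1, 0, 0, 1, 0]  # 1 = flagged as a Darth Vader
actual    = [1, 0, 0, 1, 1, 0, 0, 0]  # 1 = actually went on to re-offend

# False positive: flagged as high risk but didn't re-offend (a jailed Luke Skywalker)
false_positives = sum(p == 1 and a == 0 for p, a in zip(predicted, actual))
# False negative: not flagged but did re-offend (a Darth Vader let go free)
false_negatives = sum(p == 0 and a == 1 for p, a in zip(predicted, actual))

print(false_positives, false_negatives)  # -> 2 1
```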
Berk’s algorithms claim to be able to predict whether someone will go on to commit a homicide with 75 per cent accuracy, which makes them some of the most accurate around.28 When you consider how free we believe our will to be, that is a remarkably impressive level of accuracy. But even at 75 per cent, that’s a lot of Luke Skywalkers who will be denied bail because they look like Darth Vaders from the outside.
The consequences of mislabelling a defendant become all the more serious when the algorithms are used in sentencing, rather than just decisions on bail or parole. This is a modern reality: recently, some US states have begun to allow judges to see a convicted offender’s calculated risk score while deciding on their jail term. It’s a development that has sparked a heated debate, and not without cause: it’s one thing calculating whether to let someone out early, quite another to calculate how long they should be locked away in the first place.
Part of the problem is that deciding the length of a sentence involves consideration of a lot more than just the risk of a criminal re-offending – which is all the algorithm can help with. A judge also has to take into account the risk the offender poses to others, the deterrent effect the sentencing decision will have on other criminals, the question of retribution for the victim and the chance of rehabilitation for the defendant. It’s a lot to balance, so it’s little wonder that people raise objections to the algorithm being given too much weight in the decision. Little wonder that people find stories like that of Paul Zilly so deeply troubling.29