by Hannah Fry
Paul Zilly was convicted of stealing a lawnmower. He stood in front of Judge Babler in Barron County, Wisconsin, in February 2013, knowing that his defence team had already agreed to a plea deal with the prosecution. Both sides had agreed that, in his case, a long jail term wasn’t the best course of action. He arrived expecting the judge to simply rubber-stamp the agreement.
Unfortunately for Zilly, Wisconsin judges were using a proprietary risk-assessment algorithm called COMPAS. As with the Idaho budget tool in the ‘Power’ chapter, the inner workings of COMPAS are considered a trade secret. Unlike the budget tool, however, the COMPAS code still isn’t available to the public. What we do know is that the calculations are based on the answers a defendant gives to a questionnaire. This includes questions such as: ‘A hungry person has a right to steal, agree or disagree?’ and: ‘If you lived with both your parents and they separated, how old were you at the time?’30 The algorithm was designed with the sole aim of predicting how likely a defendant would be to re-offend within two years, and in this task it had achieved an accuracy rate of around 70 per cent.31 That is, it would be wrong for roughly three in every ten defendants. None the less, it was being used by judges in their sentencing decisions.
Zilly’s score wasn’t good. The algorithm had rated him as a high risk for future violent crime and a medium risk for general recidivism. ‘When I look at the risk assessment,’ Judge Babler said in court, ‘it is about as bad as it could be.’
After seeing Zilly’s score, the judge put more faith in the algorithm than in the agreement reached by the defence and the prosecution, rejected the plea bargain and doubled Zilly’s sentence from one year in county jail to two years in a state prison.
It’s impossible to know for sure whether Zilly deserved his high-risk score, although a 70 per cent accuracy rate seems a remarkably low threshold to justify using the algorithm to over-rule other factors.
Zilly’s case was widely publicized, but it’s not the only example. In 2003, Christopher Drew Brooks, a 19-year-old man, had consensual sex with a 14-year-old girl and was convicted of statutory rape by a court in Virginia. Initially, the sentencing guidelines suggested a jail term of 7 to 16 months. But, after the recommendation was adjusted to include his risk score (not built by COMPAS in this case), the upper limit was increased to 24 months. Taking this into account, the judge sentenced him to 18 months in prison.32
Here’s the problem. This particular algorithm used age as a factor in calculating its recidivism score. Being convicted of a sex offence at such a young age counted against Brooks, even though it meant he was closer in age to the victim. In fact, had Brooks been 36 years old (and hence 22 years older than the girl) the algorithm would have recommended that he not be sent to prison at all.33
These are not the first examples of people trusting the output of a computer over their own judgement, and they won’t be the last. The question is, what can you do about it? The Supreme Court of Wisconsin has its own suggestion. Speaking specifically about the danger of judges relying too heavily on the COMPAS algorithm, it stated: ‘We expect that circuit courts will exercise discretion when assessing a COMPAS risk score with respect to each individual defendant.’34 But Richard Berk suggests that might be optimistic: ‘The courts are concerned about not making mistakes – especially the judges who are appointed by the public. The algorithm provides them a way to do less work while not being accountable.’35
There’s another issue here. If an algorithm classifies someone as high risk and the judge denies them their freedom as a result, there is no way of knowing for sure whether the algorithm was seeing their future accurately. Take Zilly. Maybe he would have gone on to be violent. Maybe he wouldn’t have. Maybe, being labelled as a high-risk convict and being sent to a state prison set him on a different path from the one he was on with the agreed plea deal. With no way to verify the algorithm’s predictions, we have no way of knowing whether the judge was right to believe the risk score, no way of verifying whether Zilly was in fact a Vader or a Skywalker.
This is a problem without an easy solution. How do you persuade people to apply a healthy dose of common sense when it comes to using these algorithms? But even if you could, there’s another problem with predicting recidivism. Arguably the most contentious of all.
Machine bias
In 2016, the independent online newsroom ProPublica, which first reported Zilly’s story, looked in detail at the COMPAS algorithm and reverse-engineered the predictions it had made on the futures of over seven thousand real offenders in Florida,36 in cases dating from 2013 or 2014. The researchers wanted to check the accuracy of COMPAS scores by seeing who had, in fact, gone on to re-offend. But they also wanted to see if there was any difference between the predicted risks for black and white defendants.
While the algorithm doesn’t explicitly include race as a factor, the journalists found that not everyone was being treated equally within the calculations. Although the chances of the algorithm making an error were roughly the same for black or white offenders overall, it was making different kinds of mistakes for each racial group.
If you were one of the defendants who didn’t get into trouble again after their initial arrest, a Luke Skywalker, the algorithm was twice as likely to mistakenly label you as high risk if you were black as it was if you were white. The algorithm’s false positives were disproportionately black. Conversely, of all the defendants who did go on to commit another crime within the two years, the Darth Vaders, the white convicts were twice as likely to be mistakenly predicted as low risk by the algorithm as their black counterparts. The algorithm’s false negatives were disproportionately white.
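For readers who like to see the mechanics, here is a minimal sketch of the kind of check being described, with made-up numbers; the group labels, the data and the figures below are purely illustrative, not ProPublica’s actual data or code.

```python
# Illustrative sketch (not ProPublica's actual code) of the check described above:
# split the defendants by group and measure the two kinds of mistake separately.

def error_rates(records):
    """records: list of (group, labelled_high_risk, reoffended_within_two_years)."""
    rates = {}
    for group in {g for g, _, _ in records}:
        rows = [(hi, re) for g, hi, re in records if g == group]
        skywalkers = [hi for hi, re in rows if not re]   # didn't re-offend
        vaders     = [hi for hi, re in rows if re]       # did re-offend
        rates[group] = {
            # innocent people wrongly labelled high risk
            'false_positive_rate': sum(skywalkers) / len(skywalkers),
            # re-offenders wrongly labelled low risk
            'false_negative_rate': sum(not hi for hi in vaders) / len(vaders),
        }
    return rates

# Made-up numbers, chosen so both groups see the same overall accuracy (70%)
# but very different kinds of mistakes.
data = (
    [('black', True,  False)] * 20 + [('black', False, False)] * 30 +   # 50 non-re-offenders
    [('black', True,  True )] * 40 + [('black', False, True )] * 10 +   # 50 re-offenders
    [('white', True,  False)] * 10 + [('white', False, False)] * 40 +
    [('white', True,  True )] * 30 + [('white', False, True )] * 20
)
print(error_rates(data))
# black: FPR 0.4, FNR 0.2    white: FPR 0.2, FNR 0.4
```

On numbers like these, both groups would be shown exactly the same overall accuracy, even though the innocent members of one group are twice as likely to be wrongly flagged.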
Unsurprisingly, the ProPublica analysis sparked outrage across the US and further afield. Hundreds of articles were written condemning the use of faceless calculations in human justice and the reliance on imperfect, biased algorithms in decisions that can have such a dramatic impact on someone’s future. Many of the criticisms are difficult to disagree with – everyone deserves fair and equal treatment, regardless of who assesses their case, and the ProPublica study doesn’t make things look good for the algorithm.
But let’s be wary, for a moment, of our instinct to throw these ‘imperfect’ algorithms away. Before we dismiss their use in the justice system altogether, it’s worth asking: what would you expect an unbiased algorithm to look like?
You’d certainly want it to make equally accurate predictions for black and white defendants. It also seems sensible to demand that what counts as ‘high risk’ should be the same for everyone. The algorithm should be equally good at picking out the defendants who are likely to re-offend, whatever racial (or other) group they belong to. Plus, as ProPublica pointed out, the algorithm should make the same kind of mistakes at the same rate for everyone – regardless of race.
None of these four statements seems like a particularly grand ambition. But there is, nevertheless, a problem. Unfortunately, some kinds of fairness are mathematically incompatible with others.
Let me explain. Imagine stopping people in the street and using an algorithm to predict whether each person will go on to commit a homicide. Now, since the vast majority of murders are committed by men (in fact, worldwide, 96 per cent of murderers are male)37, if the murderer-finding algorithm is to make accurate predictions, it will necessarily identify more men than women as high risk.
Let’s assume our murderer-detection algorithm has a prediction rate of 75 per cent. That is to say, three-quarters of the people the algorithm labels as high risk are indeed Darth Vaders.
Eventually, after stopping enough strangers, you’ll have 100 people flagged by the algorithm as potential murderers. To match the perpetrator statistics, 96 of those 100 will necessarily be male. Four will be female. There’s a picture below to illustrate. The men are represented by dark circles, the women shown as light grey circles.
Now, since the algorithm predicts correctly for both men and women at the same rate of 75 per cent, one-quarter of the females, and one-quarter of the males, will really be Luke Skywalkers: people who are incorrectly identified as high risk, when they don’t actually pose a danger.
Once you run the numbers, as you can see from the second image here, more innocent men than innocent women will be incorrectly accused, just by virtue of the fact that men commit more murders than women.
This has nothing to do with the crime itself, or with the algorithm: it’s just a mathematical certainty. The outcome is biased because reality is biased. More men commit homicides, so more men will be falsely accused of having the potential to murder.fn2
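To make the arithmetic explicit, here is a minimal sketch that simply redoes the numbers from the example above; nothing beyond the 96/4 split and the 75 per cent figure is assumed.

```python
# Redoing the arithmetic of the example above: 100 people flagged as high risk,
# split 96/4 between men and women, with 75% of flags correct in both groups.

flagged = {'men': 96, 'women': 4}   # matches the 96 per cent perpetrator statistic
precision = 0.75                    # share of flags that are correct, same for each group

for group, n in flagged.items():
    falsely_accused = round((1 - precision) * n)   # the flagged 'Luke Skywalkers'
    print(f'{group}: {falsely_accused} of the {n} people flagged are actually innocent')

# men: 24 of the 96 people flagged are actually innocent
# women: 1 of the 4 people flagged are actually innocent
```

The same one-in-four error rate lands on twenty-four innocent men but only one innocent woman.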
Unless the fraction of people who commit crimes is the same in every group of defendants, it is mathematically impossible to create a test which is equally accurate at prediction across the board and makes false positive and false negative mistakes at the same rate for every group of defendants.
Of course, African Americans have been subject to centuries of terrible prejudice and inequality. Because of this, they continue to appear disproportionately at lower levels on the socio-economic scale and at higher levels in the crime statistics. There is also some evidence that – at least for some crimes in the US – the black population is disproportionately targeted by police. Black and white Americans use marijuana at roughly the same rate, for instance, and yet arrest rates for African Americans can be up to eight times higher.38 Whatever the reason for the disparity, the sad result is that rates of arrest are not the same across racial groups in the United States. Blacks are re-arrested more often than whites. The algorithm is judging them not on the colour of their skin, but on the all-too-predictable consequences of America’s historically deeply unbalanced society. Until all groups are arrested at the same rate, this kind of bias is a mathematical certainty.
That is not to completely dismiss the ProPublica piece. Its analysis highlights how easily algorithms can perpetuate the inequalities of the past. Nor is it to excuse the COMPAS algorithm. Any company that profits from analysing people’s data has a moral responsibility (if not yet a legal one) to come clean about its flaws and pitfalls. Instead, Equivant (formerly Northpointe), the company that makes COMPAS, continues to keep the insides of its algorithm a closely guarded secret, to protect the firm’s intellectual property.39
There are options here. There’s nothing inherent in these algorithms that means they have to repeat the biases of the past. It all comes down to the data you give them. We can choose to be ‘crass empiricists’ (as Richard Berk puts it) and follow the numbers that are already there, or we can decide that the status quo is unfair and tweak the numbers accordingly.
To give you an analogy, try doing a Google Images search for ‘maths professor’. Perhaps unsurprisingly, you’ll find the vast majority of images show white middle-aged men standing in front of chalk-covered blackboards. My search returned only one picture of a woman in the top twenty images, which is a depressingly accurate reflection of reality: around 94 per cent of maths professors are male.40 But however accurate the results might be, you could argue that using algorithms as a mirror to reflect the real world isn’t always helpful, especially when the mirror is reflecting a present reality that only exists because of centuries of bias. Now, if it so chose, Google could subtly tweak its algorithm to prioritize images of female or non-white professors over others, to even out the balance a little and reflect the society we’re aiming for, rather than the one we live in.
It’s the same in the justice system. Effectively, using an algorithm lets us ask: what percentage of a particular group do we expect to be high risk in a perfectly fair society? The algorithm gives us the option to jump straight to that figure. Or, if we decide that removing all the bias from the judicial system at once isn’t appropriate, we could instead ask the algorithm to move incrementally towards that goal over time.
There are also options in how you treat defendants with a high-risk score. In bail, where the risk of a defendant failing to appear at a future court date is a key component of the algorithm’s prediction, the standard approach is to deny bail to anyone with a high-risk score. But the algorithm could also present an opportunity to find out why someone might miss a court date. Do they have access to suitable transport to get there? Are there issues with childcare that might prevent them attending? Are there societal imbalances that the algorithm could be programmed to alleviate rather than exacerbate?
The answers to these questions should come from the forums of open public debate and the halls of government rather than the boardrooms of private companies. Thankfully, calls are getting louder for a regulatory body to oversee the algorithms industry. Just as the US Food and Drug Administration does for pharmaceuticals, it would test accuracy, consistency and bias behind closed doors, and would have the authority to approve or deny the use of a product on real people. Until then, though, it’s incredibly important that organizations like ProPublica continue to hold algorithms to account. Just as long as accusations of bias don’t end in calls for these algorithms to be banned altogether. At least, not without thinking carefully about what we’d be left with if they were.
Difficult decisions
This is a vitally important point to pick up on. If we threw the algorithms away, what kind of a justice system would remain? Because inconsistency isn’t the only flaw from which judges have been shown to suffer.
Legally, race, gender and class should not influence a judge’s decision. (Justice is supposed to be blind, after all.) And yet, while the vast majority of judges want to be as unbiased as possible, the evidence has repeatedly shown that they do indeed discriminate. Studies within the US have shown that black defendants, on average, will go to prison for longer,41 are less likely to be awarded bail,42 are more likely to be given the death penalty,43 and once on death row are more likely to be executed.44 Other studies have shown that men are treated more severely than women for the same crime,45 and that defendants with low levels of income and education are given substantially longer sentences.46
Just as with the algorithm, it’s not necessarily explicit prejudices that are causing these biased outcomes, so much as history repeating itself. Societal and cultural biases can simply arise as an automatic consequence of the way humans make decisions.
To explain why, we first need to understand a few simple things about human intuition, so let’s leave the courtroom for a moment while you consider this question:
A bat and a ball cost £1.10 in total.
The bat costs £1 more than the ball.
How much does the ball cost?
This puzzle, made famous by the Nobel Prize-winning psychologist Daniel Kahneman in his bestselling book Thinking, Fast and Slow,47 illustrates an important trap we all fall into when thinking.
This question has a correct answer that is easy to see on reflection, but also an incorrect answer that immediately springs to mind. The answer you first thought of, of course, was 10p.fn3
Don’t feel bad if you didn’t get the correct answer (5p). Neither did 71.8 per cent of judges when asked.48 Even those who do eventually settle on the correct answer have to resist the urge to go with their first intuition.
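For the record, the algebra behind the correct answer takes only two lines. Writing the ball’s price in pounds as x, the bat costs x + 1, and the two conditions give:

\[
x + (x + 1) = 1.10 \quad\Longrightarrow\quad 2x = 0.10 \quad\Longrightarrow\quad x = 0.05,
\]

that is, a 5p ball and a £1.05 bat. The intuitive answer fails the second condition: a 10p ball and a £1 bat do total £1.10, but then the bat is only 90p dearer than the ball, not £1.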
It’s the competition between intuition and considered thought that is key to our story of judges’ decisions. Psychologists generally agree that we have two ways of thinking. System 1 is automatic, instinctive, but prone to mistakes. (This is the system responsible for the answer of 10p jumping to mind in the puzzle above.) System 2 is slow, analytic, considered, but often quite lazy.49
If you ask a person how they came to a decision, it is System 2 that will articulate an answer – but, in the words of Daniel Kahneman, ‘it often endorses or rationalizes ideas and feelings that were generated by System 1’.50
In this, judges are no different from the rest of us. They are human, after all, and prone to the same whims and weaknesses as we all are. The fact is, our minds just aren’t built for robust, rational assessment of big, complicated problems. We can’t easily weigh up the various factors of a case and combine everything together in a logical manner while blocking the intuitive System 1 from kicking in and taking a few cognitive short cuts.
When it comes to bail, for instance, you might hope that judges were able to look at the whole case together, carefully balancing all the pros and cons before coming to a decision. But unfortunately, the evidence says otherwise. Instead, psychologists have shown that judges are doing nothing more strategic than going through an ordered checklist of warning flags in their heads. If any of those flags – past convictions, community ties, prosecution’s request – are raised by the defendant’s story, the judge will stop and deny bail.51
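To caricature that mental checklist in code: the sketch below walks through an ordered list of flags and stops at the first one that is raised, never weighing the case as a whole. The particular flags, their order and the field names are illustrative only, loosely based on the examples in the paragraph above rather than on any real courtroom procedure.

```python
# Caricature of the sequential 'warning flag' heuristic described above.
# The flags, their order and the field names are illustrative only.

def bail_decision(defendant):
    """Deny bail as soon as any warning flag is raised; otherwise grant it."""
    warning_flags = [
        ('past convictions',      defendant.get('has_past_convictions', False)),
        ('weak community ties',   not defendant.get('has_community_ties', True)),
        ("prosecution's request", defendant.get('prosecution_opposes_bail', False)),
    ]
    for name, raised in warning_flags:
        if raised:
            return f'deny bail ({name})'   # stop at the first raised flag
    return 'grant bail'                    # no flag raised: the rest of the case is never weighed

print(bail_decision({'has_past_convictions': True}))
# -> deny bail (past convictions)
```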
The problem is that so many of those flags are correlated with race, gender and education level. Judges can’t help relying on intuition more than they should; and in doing so, they are unwittingly perpetuating biases in the system.
And that’s not all. Sadly, we’ve barely scratched the surface of how terrible people are at being fair and unbiased judges.
If you’ve ever convinced yourself that an extremely expensive item of clothing was good value just because it was 50 per cent off (as I regularly do), then you’ll know all about the so-called anchoring effect. We find it difficult to put numerical values on things, and are much more comfortable making comparisons between values than just coming up with a single value out of the blue. Marketers have been using the anchoring effect for years, not only to influence how highly we value certain items, but also to control the quantities of items we buy. Like those signs in supermarkets that say ‘Limit of 12 cans of soup per customer’. They aren’t designed to ward off soup fiends from buying up all the stock, as you might think. They exist to subtly manipulate your perception of how many cans of soup you need. The brain anchors with the number 12 and adjusts downwards. One study back in the 1990s showed that precisely such a sign could increase the average sale per customer from 3.3 tins of soup to seven.52