The Tiger That Isn't, Page 17, by Andrew Dilnot

That all sounds unlikely. Is there a more plausible explanation? One simple possibility is that having an illness and having it diagnosed are different (not everyone finds their way to the doctor at the same point). Perhaps it is not that three times as many have the illness in the US, but that nearly three times as many are diagnosed.

It's among such insidious differences that comparison comes unstuck. Rudy's comparison was ingenious, but innumerate. Bump up the number of diagnoses, when deaths remain about the same, et voilà, there is your massively higher 'survival' rate. This graph says only a little about the effectiveness of the treatment for prostate cancer in the two countries; it hints at much more about the trend in the US to early diagnosis.

Figure 10 Prostate cancer incidence and mortality per 100,000 males per year

In fact, the US has genuine reason to feel satisfied, beating the UK on most international comparisons of cancer treatment, so far as those comparisons can be trusted. Even this one suggests that fewer people die from prostate cancer in the US than in the UK: 26 per 100,000 men, compared with 28 per 100,000. Not twice as good, as Rudy claimed, nor anything like it, but better, and that result might owe something to those higher rates of diagnosis and the fashion in the US for health screening from a younger age. Though it might also be because survival is defined as living for five years beyond diagnosis. So if people are diagnosed earlier, they probably have more years left to live anyway and so appear to have a better survival rate even if the doctors do nothing.
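The arithmetic behind that inflated 'survival' rate can be sketched in a few lines. This is only a rough illustration: the mortality figures (26 and 28 per 100,000) come from the text, but the incidence figures are hypothetical, chosen so that the US has nearly three times as many diagnoses, and the "crude survival share" is a made-up shorthand, not the five-year survival statistic itself.

```python
def crude_survival_share(incidence_per_100k, deaths_per_100k):
    """Share of diagnosed cases not matched by a death in the same year.

    A deliberately crude stand-in for a survival rate: it only shows how
    the ratio moves when diagnoses rise and deaths stay roughly the same.
    """
    return 1 - deaths_per_100k / incidence_per_100k

# Deaths per 100,000 men are taken from the text.
# Incidence values are HYPOTHETICAL, picked so US diagnoses are ~2.8x the UK's.
uk_share = crude_survival_share(incidence_per_100k=50, deaths_per_100k=28)
us_share = crude_survival_share(incidence_per_100k=140, deaths_per_100k=26)

print(f"UK crude share: {uk_share:.0%}")
print(f"US crude share: {us_share:.0%}")
```

Almost all of the gap between the two printed figures comes from the denominator, the number of men diagnosed, and hardly any from the number who die.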

A convoluted argument, you would probably agree. The ifs and buts pile up, different cultural practices raise unanswerable questions. But that is the point. Comparison is seldom straightforward once you start to dig a little.

All in all, since there's some evidence to suggest that Americans really do seem to have more cancer than others, even after adjusting for the propensity to early diagnosis, the American performance might well be better than most other countries, but probably only a little better. Though, incidentally, diagnosis is not always a blessing. If it leads to treatment, the side effects can include infertility, impotence and incontinence. Since more people die with prostate cancer than from it, doing nothing will in many cases do no harm, and might even prevent some.

'Eight out of ten survive', 'four out of five prefer', 'one in four this', '99 per cent that'… all apparently simple forms of counting turned into a comparison by words such as 'unlike over there, where only 70 per cent …' etc.

But '8 out of 10' what? Out of all those who have prostate cancer, or only those whose cancer comes to the attention of a doctor? Rudy's comparison fails because he picks his survivors from different groups, the more frequently diagnosed US group with the less frequently diagnosed UK group. It's a tasty little ploy, though whether in this case accidental or deliberate, who knows?

Often it takes a moment to work out where the fault lies in a bogus comparison, but some are so brazen as to affront public intelligence, making it hard to fathom how the perpetrators thought they would ever get away with it.

So it was after September 2003, when a teenager, Peter Williams, attacked Victor Bates with a crowbar in the family jeweller's shop while his accomplice shot and killed Mr Bates' wife, Marian, as she shielded her daughter. Twenty days earlier Williams had been released from a young offenders' institution. Twenty months later he was convicted as an accomplice to murder. He was supposedly under a curfew order and electronically tagged at the time of the murder but, in the short time since his release, had breached the order repeatedly.

In autumn 2006 it was revealed in reports by the National Audit Office and the House of Commons Public Accounts Committee that, since 1999, convicts on tag, as they are called, had committed 1,000 violent crimes and killed five people. Tagging was denounced in parts of the media as an ineffective, insecure alternative to prison, putting the public at risk and used only, it was alleged, because in comparison to prison it was cheap.

Williams had been under an intensive surveillance and supervision order requiring him to be tagged. The more common use of tagging is for what are known as home detention curfews, designed for non-violent criminals and allowing them to be released from prison up to four and a half months early.

The defence of tagging was brisk, assertive and dependent entirely on spurious comparison. It was argued that of the 130,000 people who had been through the scheme, the proportion who had committed offences while tagged was around 4 per cent, comparing extremely well, said the Home Secretary and, in separate interviews, a Home Office minister and a former chief inspector of prisons, to a recidivism rate for newly released untagged prisoners of around 67 per cent. Tagging was thus a beacon of success and, while every offence caused grave concern, the scheme merited praise not censure.

Comparison is a fundamental tool of measurement and thence judgement. If we want to know the quality of A, we compare it with B. In the case of criminal justice, the comparison is generally between the statistical consequences of the alternatives: what would have happened, how much more or less crime would there have been if, instead of B, we tried C?

But comparison runs into all manner of booby traps, accidental and deliberate. The principal problem is well known: any comparison, to be legitimate, must be of like with like. And this particular comparison – of reoffending by convicts on tag with reoffending by others – was a model of what to watch out for, of how not to do it, or at least how not to do it honestly; an object lesson in the many ways of bogus comparison.

Let us attempt some definitional clarity. Who, when and what were the two groups being compared?

First, 'who'? The former prisoners and the tagged were not of the same kind, the tagged having been judged suitable for tagging because, prison governors decided, they were least likely to reoffend. Others were considered more risky. So was it tagging, or the selection of people to be tagged, that produced a lower offending rate? The claim of success for the scheme when the people differed so much, and in fact were chosen precisely because they differed, was, to put it politely, lacking rigour.

Second, 'when'? The period of tagging in which the new offences were committed was never more and often less than four and a half months; the period in which the Home Office counts new offences by untagged former prisoners is two years, more than five times as long. This is the second reason we might expect to see a difference in how much reoffending each group commits during the measured period, which has nothing whatsoever to do with the merits of tagging itself.
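One rough way to see how much the mismatched periods matter is to put both figures on a per-month footing. The sketch below assumes, unrealistically, that offending happens at a constant monthly rate, and it treats every tagged period as the full four and a half months, which flatters tagging since many were shorter. It uses only the 4 per cent and 67 per cent figures from the text, and it does nothing about the 'who' and 'what' problems.

```python
import math

def monthly_rate(proportion_offending, months_observed):
    """Constant monthly rate implied by a cumulative proportion.

    ASSUMPTION: a constant hazard, so that
    proportion = 1 - exp(-rate * months).
    """
    return -math.log(1 - proportion_offending) / months_observed

# Figures from the text: 4% within at most 4.5 months on tag,
# 67% within two years for untagged former prisoners.
tagged_rate = monthly_rate(0.04, 4.5)
untagged_rate = monthly_rate(0.67, 24)

raw_ratio = 0.67 / 0.04                     # the headline comparison
rate_ratio = untagged_rate / tagged_rate    # same figures, per month

print(f"Headline gap: {raw_ratio:.1f}x")
print(f"Per-month gap under the constant-rate assumption: {rate_ratio:.1f}x")
```

Even on these generous assumptions the apparent advantage shrinks considerably once the two periods are made comparable, and what remains still cannot be credited to tagging rather than to the selection of who was tagged.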

And third, 'what'? If you want to compare the success of tagging with its alternative, you have to be clear that the alternative to being tagged is to be in prison. It is not, the Home Office take note, to be at liberty. Either you are let out early wearing a tag, or you are not let out at all. These are the two groups that should be compared since these are the alternatives. The latter, being in prison, oddly enough, commits rather little crime against the public (though more against their guards and each other).

The comparison was an open and shut statistical felony, and even prompted an unusually forthright attack from the normally reserved Royal Statistical Society. Either we must compare those who were formerly tagged with those formerly prisoners, but both now free, or we must compare the different ways in which both groups are still serving sentence: the currently tagged with those currently prisoners.

Strange to tell, but the Home Office had not attempted to identify and measure recidivism specifically among people who had once been tagged but were no longer, so we had no idea, and nor did the Home Office ministers, or the former chief inspector of prisons, whether tagging worked better than prison at preventing reoffending once sentence had been served.

Tagging might well be sensible; the basis on which it was defended was anything but: a comparison of two different categories of offender across two different periods in two rather obviously contrasting places, alike only to the most superficial and thoughtless examination, where, despite all these differences, it was claimed to be the scheme that made the difference to their offending rates. And this from ministers in a department of state that also runs the police and courts and might, we hope, have some grasp of what constitutes evidence.

With comparison, all the definitional snags of counting are multiplied massively, since we define afresh with every comparison we make. To repeat the well-known essence of the problem: are we comparing like with like in all respects that matter? The comparison of schools, hospitals, police forces, councils, or any of the multitudes of ranked and performance-assessed ought first to be an equal race. But it rarely is, and seldom can be. Life is altogether messier than that, the differences always greater than anticipated, more numerous and significant. So we have to decide, before ignoring these differences, if we are happy with the compromise and the rough justice this implies. The exercise might still be worth it but, before making that call, it pays to understand the trade-offs. Even where intentions are good, the process is treacherous.

The big surprise of the radical shift to public policy through comparison in league tables, performance indicators and the like, an explosion of comparison unparalleled in British administrative history, has been discovering how the categories for comparison seem to multiply. Things just won't lie down and be counted under what politicians hoped would be one heading, but turn out to be complicated, manifold and infernally out of kilter. Counting in such circumstances is prone to an overwhelming doubt: what is really being counted?

For example, the government began by treating all schools as much the same thing when it put them into league tables.

Today those league tables include an elaborate and, for most parents, impenetrable calculation that tries to adjust the results for every single school in the country according to the characteristics of the pupils in it. Though comparison starts by claiming to be a test of merit, it invariably collapses into a spat about underlying differences.

In 1992 school performance tables were introduced in the United Kingdom. Did the government expect still to be making fundamental revisions fifteen years later? Almost certainly it did not. In 2007 performance tables for schools faced their third substantial reform, turning upside down the ranking – and the apparent quality of education on offer – of some schools. Without any significant change in their examination results, many among the good became poor and, among the struggling, some were suddenly judged to excel. Out went the old system of measurement, in came the new. Overnight the public, which for several years had been told one thing, was told another. The government called it a refinement.

The history of school league tables (we offer a much-compressed version below) is a fifteen-year lesson in the pitiless complexity of making an apparently obvious measurement in the service of what seemed a simple political ambition: let's tell parents how their local schools compare. At least, 'simple' is how it struck most politicians at the time. One conclusion could be that governments are also prone to failures to distinguish abstraction from real life, still insisting that counting is child's play.

The first league tables in 1992 were straightforward: every school was listed together with the number of its children who passed five GCSEs at grade C or above. Though this genuinely had the merit of simplicity, it was also soon apparent that schools with a more academically able intake achieved better results, and it wasn't at all clear what, if anything, a school's place in the tables owed to its teaching quality.

For those schools held up as shining examples of the best, this glitch perhaps mattered little. For those fingered as the worst, particularly those with high numbers of special-needs pupils or pupils whose first language was not English, it felt like being obtusely condemned by official insensitivity, and was infuriating.

What's more, the results for any one school moved from year to year often with pronounced effects on the school's league-table position. Professor Harvey Goldstein, formerly of the Institute of Education, now at Bristol University, told us: 'You cannot be very precise about where a school is in a league-table or a ranking. Because you only have a relatively small number of pupils in any one year that you are using to judge a school, there is a large measure of uncertainty – what we call an uncertainty interval – surrounding any numerical estimate you might give. And it turns out that those intervals are extraordinarily large; so large, in fact, that about two thirds to three quarters of all secondary schools, if you are judging by GCSE or A-level results, cannot be separated from the overall national average. In other words there's nothing really that you can say about whether the school is above or below the average of all pupils.'
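A back-of-envelope version of the uncertainty interval Professor Goldstein describes can be sketched as follows. This uses the textbook normal approximation for a proportion, not whatever method he and his colleagues actually applied, and the cohort size, pass rate and national average are all hypothetical numbers chosen for illustration.

```python
import math

def approx_interval(pass_rate, cohort_size, z=1.96):
    """Approximate 95% interval for a school's underlying pass rate.

    Uses the normal approximation p +/- z * sqrt(p * (1 - p) / n).
    """
    half_width = z * math.sqrt(pass_rate * (1 - pass_rate) / cohort_size)
    return pass_rate - half_width, pass_rate + half_width

# HYPOTHETICAL figures: a year group of 150 pupils, 55% of whom pass,
# set against a national average of 50%.
NATIONAL_AVERAGE = 0.50
low, high = approx_interval(pass_rate=0.55, cohort_size=150)

# If the national average falls inside the interval, this year's result
# cannot separate the school from the average.
distinguishable = not (low <= NATIONAL_AVERAGE <= high)

print(f"Interval: {low:.1%} to {high:.1%}")
print(f"Distinguishable from national average: {distinguishable}")
```

With a single year group of this size, a school five percentage points above the average still has an interval that straddles it, which is the sense in which most schools 'cannot be separated' from the national figure.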

So the tables were comparing schools that were often unalike in kind, then straining the data to identify differences that might not exist. They were counting naively and comparing the tally with carelessness as to what they were really counting. Some schools, conscious of the effect of the tables on their reputation – whether deserved or not – began playing the system, picking what they considered easier subjects, avoiding mathematics and English, even avoiding pupils – if they could – whom they feared might fail, and concentrating on those who were borderline candidates, while neglecting the weakest and strongest for whom effort produced little reward in the rankings.

So the comparison, which had by now been the centre-piece of two governments' education policies, was revised, to show how much pupils in each school had improved against a benchmark of performance when they were aged 11. This was an attempt to measure how much value the school added to whatever talent the pupil arrived with. But the so-called value-added tables were nothing of the kind and unworthy of the name. (David Blunkett, who was Education Secretary at the time, described them to us as 'unsatisfactory'.) The benchmark used at age 11 was an average of all pupils at each grade. Many selective schools were able to cream off the pupils above the average of their grade and appear, once those pupils reached 16 and were measured again, to have added a huge amount of value to them. In fact the value was there from the start. These tables, misleading and misnamed as they were, were published for four years.

Then another revision was announced, this time to require school results to include mathematics and English in the subjects taken at GCSE, which in one case caused a school in east London to slump from 80 per cent of pupils achieving five passes to a success rate of 26 per cent.

Then there was a third major revision, known as contextual value added (CVA), which admitted the weaknesses of ordinary value added and aimed to address them by making allowance for all sorts of factors outside the school's control which are thought to lower performance – factors such as coming from a poorer background, a first language other than English, having special needs, being a boy, and half a dozen others. CVA also set pupils' performance against a more accurate benchmark of their own earlier ability.

In 2006, prior to the full introduction of CVA in early 2007, a sample of schools was put through the new calculation. What did the change in counting do to their position in the tables? One school, Kesteven and Grantham Girls School, went from 30th in the raw GCSE tables to 317th out of the 370 sampled. Another, St Albans C of E School in Birmingham, travelled in the opposite direction from 344th to 16th. Parents could be forgiven for wondering, in light of all this, what the comparisons of the past fifteen years had told them.

And there, so far, ends the history, but not the controversy. The CVA tables – complicated and loaded with judgements – have moved far from the early ideal of transparent accountability. It also turns out that the confidence intervals (how big the range of possible league-table positions for any school must be before we are 95 per cent sure that the correct one is in there) are still so large that we cannot really tell most of the schools apart, even though they will move round with much drama from one year to the next in the published tables. And in thinking about value added, it has dawned on (nearly) everyone that most schools are good at adding different kinds of value – some for girls, some for boys; some for high achievers, some for low; some in physics, some in English – but that the single number produced for each school can only be an average of all those differences. Few parents, however, have a child who is so average as to be 50 per cent boy and 50 per cent girl.

One significant repair looks like an accident; three wholesale reconstructions in fifteen years and you would want your money back. Unless, of course, a fair comparison was not really what they were after, but rather a simple signal to say which schools already had the most able children.

Some head teachers report great benefits from the emphasis on performance measurement brought about by league tables, particularly with the new concentration on measures of value added. They have felt encouraged to gather and study data about their pupils and use this to motivate and discuss with them how they might improve. They pay more attention, they say, to the individual's progress, and value the whole exercise highly. That must be welcome and good.

And it would be absurd to argue against data. But it is one thing to measure, quite another to wrench the numbers to a false conclusion. Ministers often said that league tables should not be the only source of information about a school, but it is not clear in what sense they contributed anything to a fair comparison of school performance or teaching quality. Make a comparison blithely, too certain of its legitimacy, and we turn information into a lottery. As Einstein is often quoted as saying, 'information is not knowledge'.
