Humble Pi

Page 11

by Matt Parker

A lot of people did know why, though. That comment quickly disappeared from the online version, with a footnote explaining that ‘A number of readers have since noted that 256 is one of the most important numbers in computing.’ I feel sorry for whoever was staffing their Twitter account that afternoon.

I call this the ‘brick wall solution’. If you’re in a WhatsApp group with 256 people (you and 255 friends) and you try to add a 257th person, you will simply be stopped from doing so. But given you’re pretty much claiming to have 255 better friends than them, they’re probably a tenuous enough associate that they’re not going to take it personally. The threat of a roll-over error is also why the game of Minecraft has a maximum height limit of 256 blocks. Which is an actual brick-wall solution.

A different way to deal with roll-overs is to loop around so that 00000000 follows 11111111. This is exactly what happens in Civilization and on Swiss railways. But in both of those cases there were unintended knock-on effects. Computers just blindly follow the rules they are given and do the ‘logical’ thing, with no regard for what may be the ‘reasonable’ thing. This means that writing computer code involves trying to account for every possible outcome and making sure the computer has been told what to do. Yes, programming requires being numerate, but in my opinion it is the ability to think logically through scenarios which most unites programmers with mathematicians.

The programmers behind the original arcade version of Pac-Man had set the level number to be stored as an 8-digit binary number which would loop around when it rolled over. But they forgot to follow through all the consequences of that decision, and a convoluted chain of computer glitches are initiated on level 256, causing the game to fall apart. Not that it’s a big loss: even having 255 working levels feels a bit like overkill, seeing how most people only ever see the first one. But for those with time and coins to spend, there are hundreds of levels to explore (admittedly, they are all identical, apart from the behaviour of the ghosts). That said, my best is level 7. I need to up my game.

Level 256 is all out of whack. Or, strictly speaking, out of waka waka waka.

The game does not fail on level 256 because it cannot store the level number. As always, programmers start counting from zero, so level 1 is stored as index 0, level 2 is index 1, and so on (I’ll use ‘index’ to refer to the number stored, as opposed to the actual level number). Level 256 is stored as index 255, which is 11111111 in binary. No problem. Even moving on to level 257 would just roll the index over to zero and drop Pac-Man back into the first maze. The game should be playable for ever. So why does level 256 break?

The problem is the fruit. To add some variety to Pac-Man’s diet of dots and ghosts there are eight different types of fruit dropped in twice per level (including a bell and a key which Pac-Man seems to eat with the same ease as he does an apple or a strawberry). Each level is assigned a specific fruit, which is shown at the bottom of the screen, along with Pac-Man’s recent fruit consumption. It is this ornamental fruit parade that causes the complete meltdown of the game.

Digital space was at such a premium in old computer systems that there are only three numbers stored in the game of Pac-Man as you play: what level you are on, how many lives remain and what your score is. Everything else is wiped clean between levels. At every level you are playing against ghosts with amnesia who have no recollection of the hours you have already been doing battle. So the game needs to be able to reconstruct from scratch what fruit Pac-Man must have consumed recently. There is only room to depict seven pieces of fruit, so the game needs to show the fruit from the current level and up to six levels before that (depending on how many levels have been played).

In the computer memory there is a menu of the fruit and the order in which it can appear. So, if the level is below 7, it draws as many pieces of fruit as the level number (above that, it draws the most recent seven). The problem occurs when the code takes the level index and converts it into a level number by adding one. Level 256 is index 255, which gets increased by one to be … level 0. Zero is below seven, so it tries to draw as many pieces of fruit as the level number. Which would be fine if it drew zero pieces of fruit but, sadly, it draws first and counts second. The code would draw fruit and then subtract one from the level number until it hit zero.

draw fruit

subtract 1 from level number

stop if level number is zero

otherwise KEEP ON FRUITING

This is not actual Pac-Man computer code. But you get the idea.

The computer is now going to try to draw 256 pieces of fruit instead of the normal seven or fewer. Well, I say ‘fruit’, but the fruit menu runs out after only twenty rows. For the twenty-first piece of fruit the code looks at the next piece of the computer’s memory and tries to interpret it as a piece of fruit. It then keeps rolling through the memory as if it were some exotic table of alien fruit and draws it all as best it can. Some of this does match other symbols to be displayed in the game and so, as well as colourful noise, the screen is filled with letters and punctuation marks.

Because of a quirk of the Pac-Man coordinate system, after the fruit fills the bottom of the screen right to left, it then moves to the top-right corner of the screen and starts filling the screen column by column. By the time 256 pieces of fruit have been drawn, half the screen is completely covered. Unbelievably, the game then starts to play the level, but the system does not complete a level until Pac-Man has eaten 244 dots. On this last, broken level, the mutant fruit has obliterated loads of the dots, so Pac-Man can never eat the required 244 dots and is doomed to wander what is left of his broken maze until boredom sets in and he succumbs to the ghosts pursuing him. Which, coincidentally, is almost exactly how a lot of programmers feel as they try to finish writing their code.

Deadly code

The most dangerous 256 error I have found so far occurred in the Therac-25 medical radiation machine. This was designed to treat cancer patients with bursts of either an electron beam or intense X-rays. It was able to achieve both types of radiation from the one machine by either producing a low-current electron beam which the patient was directly exposed to, or a high-current electron beam which was aimed at a metal plate to produce X-rays.

The danger was that the beam of electrons required to produce X-rays was so powerful that it could do severe damage to a patient if it hit them directly. So if the electron beam’s power was increased, it was vital to make sure the metal target and a collimator (a filter to shape the X-ray beam) had been placed in between the electron beam and the patient.

For this, and a host of other safety reasons, the Therac-25 looped through a piece of set-up code, and only if all the systems are verified as being in the correct settings could the beam be turned on. The software had a number stored with the catchy name of Class3 (that’s just how creative programmers can be when naming their variables). Only after the Therac-25 machine had verified that everything was safe would it set Class3 = 0.

To make sure that it was checked every time, the set-up loop code would add one to Class3 at the beginning of each loop so it started at non-zero. A subroutine with the slightly better name of Chkcol would activate whenever Class3 was not zero and then check the collimator: after the collimator (and the target metal) was checked and seen to be in the right place, Class3 could be set to zero and the beam could be fired.

Unfortunately, the Class3 number was stored as an 8-digit binary number which would roll over back to zero after it had maxed out. And the set-up loop would be running over and over while waiting for everything to be ready, incrementing Class3 each time it ran. So every 256th time the set-up loop ran, Class3 would be set to zero, not because the machine was safe but merely because the value had rolled over from 255 back to zero.

This means that roughly 0.4 per cent of the time a Therac-25 machine would skip running Chkcol because Class3 was already set to zero, as if the collimator had already been checked and verified as being in the correct position. For a mistake
with such deadly consequences, 0.4 per cent is a terrifying amount of time.

On 17 January 1987 in Yakima Valley Memorial Hospital in Washington State, US (now Virginia Mason Memorial), a patient was due to receive eighty-six rads from a Therac-25 machine (rads is an antiquated unit of radiation absorption). Before the patient was to receive their dose of X-rays, however, the metal target and collimator had been moved out of the way so the machine could be aligned using normal visible light. They were not put back.

The operator hit the ‘set’ button on the machine at the exact moment Class3 had rolled over to zero, Chkcol was not run and the electron beam fired with no target or collimator in place. Instead of 86 rads, the patient may have received around 8,000 to 10,000 rads. He died in April that year from complications because of this radiation overdose.

The fix to the software was disturbingly simple: the set-up loop was rewritten so it would set Class3 to a specific non-zero value each time instead of incrementing its previous value. It’s a sobering thought that neglecting the way computers keep track of numbers can result in preventable deaths.

Things computers do not excel at

What is 5 − 4 − 1? It’s not a trick question: the answer is zero; and it’s not always as easy as it looks. Excel can get this wrong. The system of binary digits used by computers to store numbers in digital memory not only causes roll-over errors but can break even the easiest-looking maths.

If I change that 5 − 4 − 1 to be 0.5 − 0.4 − 0.1, the correct answer is still zero, but the version of Excel I’m using thinks that it is −2.77556E-17. And while −0.0000000000000000277556 may not be exactly zero, it is still exceedingly close to zero. So Excel is doing something right. But something is also going fundamentally wrong.

Ah, well, this is awkward.

In short, some numbers cause different base-system grief. Our human base-10 numbers are terrible at dealing with thirds. But we’ve become used to it and can compensate. Quick maths: what’s 1 − 0.666666 − 0.333333? Your instinct may be that it is zero, because 1 − ⅔ − ⅓ = 0. But those digits do not actually represent ⅔ and ⅓ because, in their true form, they require infinitely many sixes and threes. The real answer is 0.000001, which is slightly non-zero because I had only limited space for the decimal versions of ⅔ and ⅓. If you add 0.666666 and 0.333333 you get only 0.999999, not 1.

Binary has the same problem trying to store some fractions. Adding 0.4 to 0.1 does not give you 0.5 in binary, but rather:

0.4 = 0 . 0 1 1 0 0 1 1 0 0 1 1 0 0 …

0.1 = 0 . 0 0 0 1 1 0 0 1 1 0 0 1 1 …

0.4 + 0.1 = 0 . 0 1 1 1 1 1 1 1 1 1 1 1 1 …

A computer cannot store the infinitely many digits of the binary versions of 0.1 and 0.4, so their total is just short of ½.fn1 But just as humans we have become accustomed to the limitations of base-10, computers have been programmed to correct the mistakes introduced by binary calculations.

If you just enter = 0.5 − 0.4 − 0.1 into Excel, it will get it right. It knows that the total of 0.0111111 … should be exactly ½. However, if you enter = (0.5 − 0.4 − 0.1)*1, that freezes the error in place. Excel does not check for these sorts of error during the calculation, only at the end. By making the final step an innocuous multiplication by one, we’ve lulled Excel into a false sense of security. And so it doesn’t scrutinize the answer before releasing it for us to see.

The programmers of Excel claim they are not directly to blame. They adhere to the Institute of Electrical and Electronics Engineers’ standards for arithmetic done by computers, with only a few minor variations in how they handle unusual cases. The IEEE set out standard 754 in 1985 (most recently updated in 2008) to agree how computers should ideally deal with the limitations of doing maths with finite-precision binary numbers.fn2

Because it is baked into the standards, you will see the same kind of problem popping up whenever you get a computer to do some maths for you. Including in a modern phone. Imagine you’re planning a schedule. What would you do if you needed to know how many fortnights there are in seventy-five days? Most people would reach for a calculator app. But I can guarantee that you are better at solving this problem than a calculator.

Grab your phone and open the calculator app. If you enter 75 ÷ 14, the answer 5.35714286 … will appear instantly on the screen. So seventy-five days is just over five fortnights. To work out how many extra days there are, subtract five and multiply the remaining 0.35714286 … of a fortnight by fourteen. What your calculator now shows you is wrong.

On some phones you will be looking at the answer 5.00000001, or similar. Other phones give things like 4.9999999994 as the result. iPhone owners will see the correct answer, 5, but this gives it no right to feel smug; tilt the iPhone sideways so it goes into scientific calculator mode and in older versions of iOS the full answer will be revealed: 4.9999999999. I’ve just fired up the calculator program on my computer and it gives an answer of 5.00000000000004. Because of the limitations of binary, computers are consistently close, but not quite. Like any food product with ‘diet’ in the title, it’s always a bit off.

The dangers of truncation

In the life-and-death theatre of war, any simple mistake can result in the loss of many lives. And while wars are inextricably entangled with politics, I think we can still objectively examine how otherwise small maths errors can have disastrous results in terms of the cost in human life. Even a maths mistake as small as a 0.00009536743164 per cent error.

On 25 February 1991, in the First Gulf War a Scud missile was fired at US army barracks near Dhahran, Saudi Arabia. This was not a surprise for the US Army, who had set up a ‘Patriot missile defense system’ to detect, track and intercept any such missiles. Using radar, the Patriot would detect an incoming missile, calculate its speed and use that to track its movements until a counter-missile could be fired to destroy it. Except a mathematical oversight in the Patriot code meant that it missed.

Originally designed as a portable system to intercept enemy planes, the Patriot battery had been updated in time for the Gulf War so that it could defend against the much faster Scud missiles, which could travel at blistering speeds of around 6,000km/h. The Gulf War Patriots were also placed in static positions instead of being moved around a lot like they had been designed to do.

Remaining stationary meant that the Patriot systems were not routinely turned on and off (this, as we have already seen, can lead to some issues with internal timekeeping). The system used a 24-digit binary number (3 bytes) to store the time in tenths of a second since it was last turned on, meaning it could run for 19 days, 10 hours, 2 minutes and 1.6 seconds before there would be a roll-over error. Which must have seemed a long time when they were being designed.

The problem was how that number of tenths of a second was converted into a floating-point value of exact seconds. The maths for this is easy enough: you multiply by 0.1 to effectively divide by ten. But the Patriot system stored 1/10 as a 24-bit binary number, creating exactly the same problem Excel has when you subtract 0.4 and 0.1 from 0.5: it’s off by a tiny amount.

0.00011001100110011001100 (base-2) = 0.099999904632568359375 (base-10)

0.1 − 0.099999904632568359375 = 0.000000095367431640625

This is the absolute error per 0.1 second, an error of 0.000095367431640625 per cent.

An error of 0.000095 per cent may not feel like much; it is only off by one part in a million. And when the time value is small, the error is also small. But the problem with a percentage error is that, as the value gets bigger, the error grows with it. The longer the Patriot system was running, the larger the time value became and the bigger the error accumulated was. When the Scud missile was launched that day the nearby Patriot system had been on for about a hundred continuous hours, roughly 360,000 seconds. That’s about a third of a million seconds. So the error was about a third of a second.

A third of second does not feel very long until you’re tracking a missile going 6,000km/h. In a third of a second a Scud
missile can move more than 500 metres. It is very hard to track and intercept something which is half a kilometre away from where you expect it to be.

The Patriot system was unable to stop the Scud missile and it hit the US base, killing twenty-eight soldiers and injuring about a hundred other people.

It’s yet another costly lesson in the importance of knowing the limits of binary numbers. But this time there is an added lesson when it comes to fixing mistakes. When the system was upgraded to track the much faster Scud missile, the time conversion method had been upgraded as well, but not consistently. Some time conversions in the system still used the old method.

Ironically, if the system had consistently been off from the correct time, it could still have worked okay. Tracking a missile requires accurate tracking of time differences, so a consistent error would cancel out. But now different parts of the system were using different levels of precision in their conversion and a discrepancy slipped in. The incomplete upgrade is why the system could not track the incoming missile.

Even more depressing is that the US Army knew about this problem and, on 16 February 1991, it had released a new version of the software to fix it. As this would take a while to distribute to all the Patriot systems, a message was also sent out to warn Patriot users not to let the system run continuously for long periods of time. But what constituted a ‘long period of time’ was not specified. As well as the mathematical problems, those twenty-eight deaths were also the result of poorly fixed code and the lack of a message simply saying to restart once a day.

‹ Prev Next ›