Having already outlined how bugs are created in the previous part of our story, it’s now time to look at the positive side – software failure does not have to be the norm.
Bugs can be avoided, and there are ways to do that. All successful software projects seem to have something in common: a successful project can be seen as a tripod, and all the legs must be in place for the tripod to stand firmly.
In any project, there are four interdependent factors: cost, quality, speed, and risk. Of these four, in any given project, only two can be fully achieved, leaving the other two to be managed. The most important of the four are risk and quality: the system must work and must successfully meet the requirements. That leaves speed (time) and cost (money) as the factors to be adjusted.
There is no doubt that the larger the software, the more complex it is. The greater complexity comes from the fact that no one on the project really understands all the interacting parts of the whole or has the ability to test them. So, the larger the software, the more complex it is, both in its static elements (the hardware) and its dynamic elements (the interactions).
Pressman said: “Exhaustive testing presents certain logical problems… Even a small 100-line program with some nested paths and a single loop executing less than twenty times may require 10^14 possible paths to be executed… To test all of these 100 trillion paths, assuming each could be evaluated in a millisecond, would take 3170 years.”
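Pressman’s arithmetic is easy to verify with a quick back-of-the-envelope check in Python:

```python
# Back-of-the-envelope check of Pressman's figure:
# 10^14 execution paths, one path evaluated per millisecond.
paths = 10 ** 14
ms_per_year = 1000 * 60 * 60 * 24 * 365   # milliseconds in a (non-leap) year
years = paths / ms_per_year
print(int(years))   # 3170 - matching Pressman's estimate
```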
Software is pervasive in the modern world, and yet we are often unaware of its presence until problems arise. Because it is a purely intellectual product, it is among the most labor-intensive, complex, and error-prone technologies in human history.
Until the 1970s, programmers were quite meticulous: planning their code, rigorously checking it, providing detailed documentation, and testing exhaustively before releasing the product to customers.
But as computers became widespread, this attitude changed… Nowadays, the average programmer’s approach often amounts to hacking sessions: write a sloppy piece of code, run it through the compiler, and see what happens – a situation known as “code and fix,” where the programmer fixes errors one by one until the software compiles and behaves as expected.
As the programs grew both in size and complexity, the limits of the “code and fix” approach became evident.
That is when the big failures began to happen…
World War III?
On the night of September 26th, 1983, Soviet defense officer Stanislav Petrov literally saved the world from a third world war.
What happened: the Soviet nuclear early-warning system malfunctioned and erroneously reported that the USA had launched an attack on the country. Petrov, as he later told the Washington Post, had a “funny feeling” in his gut that the alarm was false, and further investigation indeed confirmed it. A new world war was prevented… by a gut feeling…
The trigger for the near-apocalyptic disaster was traced to a fault in software that was supposed to filter out false missile detections caused by satellites picking up sunlight reflecting off cloud tops.
More precisely, the false alarms (which had happened before) were caused by a rare alignment of sunlight on high-altitude clouds with the satellites’ Molniya orbits – an error later corrected by cross-referencing against a geostationary satellite.
Black Monday
Black Monday was the 1987 collapse on Wall Street, when 500 billion US dollars were lost in a single day. On October 19th, 1987, the Dow Jones Industrial Average plummeted 508 points, losing 22.6% of its total value, and the S&P 500 dropped by 20.4%.
This was the greatest one-day loss Wall Street had ever suffered.
A long bull market was halted by a rash of insider-trading investigations by the SEC (the Securities and Exchange Commission, a US government body) and by other market forces. As investors fled stocks in a mass exodus, computer trading programs generated a flood of sell orders, overwhelming the market, crashing the systems, and leaving investors effectively blind.
The explosion of the Ariane 5
In 1996, Europe’s newest unmanned satellite-launching rocket, the Ariane 5, was intentionally blown up seconds after taking off on its maiden flight from French Guiana. The European Space Agency estimated that the total development cost of the Ariane 5 was more than 8 billion US dollars.
Why was it blown up? The self-destruction was triggered by software trying to stuff a 64-bit number into a 16-bit space. The failure occurred only 36.7 seconds after launch, when the guidance system’s own computer tried to convert one piece of data – the sideways velocity of the rocket – from a 64-bit format to a 16-bit format. The number was too big, and an overflow error resulted.
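The failure mode is easy to reproduce. Here is a minimal Python sketch of the same narrowing conversion (the actual Ariane code was written in Ada, and the names and values below are illustrative, not taken from the flight software):

```python
INT16_MIN, INT16_MAX = -32768, 32767

def to_int16(value: float) -> int:
    """Narrow a 64-bit float to a signed 16-bit integer, as the inertial
    reference software effectively did with the sideways velocity."""
    n = int(value)
    if not INT16_MIN <= n <= INT16_MAX:
        # In the Ada original, the equivalent condition raised an unhandled
        # Operand Error, shutting down the inertial reference system.
        raise OverflowError(f"{n} does not fit in 16 bits")
    return n

print(to_int16(12_345.6))      # a value in range converts fine: 12345
try:
    print(to_int16(70_000.0))  # Ariane 5 flew a faster trajectory...
except OverflowError as err:
    print("guidance failure:", err)
```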
Mars Climate Orbiter Fail in 1998
Two spacecraft, the Mars Climate Orbiter and the Mars Polar Lander, were part of a space program that, in 1998, was supposed to study the Martian weather and climate, and the water and carbon dioxide content of the atmosphere. But a problem occurred when a navigation error caused the orbiter to fly too low in the atmosphere, where it was destroyed.
What caused the error? A subcontractor on the NASA program used imperial units (as used in the USA) rather than the NASA-specified metric units (as used in Europe). That error was definitely preventable, had proper testing against the specification been conducted.
The result? The 125-million-US-dollar spacecraft attempted to stabilize its orbit too low in the Martian atmosphere and crashed into the red planet. Since that failure, special attention has been paid to the units used.
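This class of bug is simple to illustrate. In the sketch below – hypothetical names and values, though the real interface did concern thruster impulse data – one side produces pound-force seconds while the other consumes newton seconds:

```python
LBF_S_TO_N_S = 4.448222  # newton-seconds per pound-force-second

def impulse_to_si(impulse_lbf_s: float) -> float:
    """The conversion the contractor's ground software omitted:
    pound-force seconds -> the newton seconds the spec required."""
    return impulse_lbf_s * LBF_S_TO_N_S

raw = 100.0                  # value sent over the interface (lbf*s)
misread = raw                # the consumer assumed it was already N*s
correct = impulse_to_si(raw)
print(correct / misread)     # each burn was under-reported ~4.45x
```

Because the number itself looks perfectly plausible either way, nothing crashes – the error only shows up as a slowly accumulating trajectory drift, which is why testing against the interface specification matters.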
Siemens and the Passport System
In the summer of 1999, half a million British citizens were less than happy to discover that their new passports couldn’t be issued on time, because the Passport Agency had brought in a new Siemens computer system without sufficiently testing it and training the staff first.
Hundreds of people missed their holidays and the Home Office had to pay millions in compensation, staff overtime, and umbrellas for the people queuing in the rain for passports. That is not a joke.
But why such an unexpectedly huge demand for passports? The law had recently changed to require that all children under 16 get their own passport if traveling abroad – a requirement introduced for the first time.
Introducing a new system and changing the law at the same time was not common sense.
LA Airport (LAX) flights grounded
Around 17,000 passengers were stranded at Los Angeles International Airport because of a software problem. The problem, which hit systems at the United States Customs and Border Protection agency, was a simple one, caused by a lowly, inexpensive piece of equipment.
The device in question was a network card that, instead of shutting down as it perhaps should have done, persisted in sending incorrect data out across the network. The data cascaded out until it hit the entire network at the agency and brought it to a standstill. Nobody could be authorized either to leave or to enter the USA through the airport for 8 hours.
The deadly radiation therapy – 1985
The Therac-25 medical radiation therapy machine was involved in several incidents between 1985 and 1987 in which massive overdoses of radiation were administered to patients – a side effect of the buggy software powering the device. A number of patients received up to 100 times the intended dose, and at least three died as a direct result.
The reason: a subtle bug known as a race condition. As a result of this bug, a technician could accidentally configure the Therac-25 so that the electron beam would fire in high-power mode without the proper patient shielding in place.
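The shape of the bug can be sketched in a few lines. The following is a deterministic Python illustration of a “check-then-act” race – the real Therac-25 code ran on a PDP-11 and all names here are invented – where the safety check and the firing are not atomic, so a fast operator edit can land between them:

```python
# Shared state, updated by an operator task and read by the beam task.
power = "low"
shield_in = False

def safe_to_fire() -> bool:
    # Firing is safe at low power, or at high power with the shield in.
    return power == "low" or shield_in

ok = safe_to_fire()   # check passes: power is still "low"
power = "high"        # the operator's fast keyboard edit lands here,
                      # between the check and the act
if ok:                # act on a now-stale check
    print(f"beam fired: power={power}, shield_in={shield_in}")  # overdose path

# The fix is to make the check and the act one atomic step, e.g. by
# holding a lock (threading.Lock) across both, or re-validating at fire time.
```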
An interesting observation for software testers: the condition that caused the failure occurred as the result of nonstandard user input on the keyboard. In other words, an average black-box tester running a set of ad hoc cases would likely have found the bug – and saved lives.
The failure of the Therac-25 is a standard case study in the importance of testing, and of handling safety-critical systems with care.
National Cancer Institute, Panama, 2000
In a series of accidents, therapy-planning software created by Multidata Systems International, a US company, miscalculated the proper dosage of radiation for patients undergoing radiation therapy.
Multidata’s software allowed a radiation therapist to draw on a computer screen the placement of the metal shields, called “blocks,” that are designed to protect healthy tissue from the radiation. But the software allowed the technicians to use only four shielding blocks, and the doctors in Panama were used to using five.

The doctors then discovered that they could trick the software by drawing all five blocks as a single large block with a hole in the middle. What they did not realize was that the Multidata software gave different answers in this configuration depending on how the hole was drawn: drawn in one direction, the correct dosage was calculated; in the other direction, the software recommended twice the necessary exposure.
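A plausible mechanism for such direction-dependence is the signed area of a polygon, which flips sign with the drawing (winding) direction. The Python sketch below is purely illustrative – it uses the standard shoelace formula, not Multidata’s actual algorithm – but it shows how code that sums signed areas treats a hole drawn clockwise differently from the same hole drawn counterclockwise:

```python
def signed_area(points) -> float:
    """Shoelace formula: positive for counterclockwise winding,
    negative for clockwise winding."""
    total = 0.0
    n = len(points)
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        total += x1 * y2 - x2 * y1
    return total / 2.0

outer = [(0, 0), (10, 0), (10, 10), (0, 10)]   # counterclockwise: +100
hole_ccw = [(4, 4), (6, 4), (6, 6), (4, 6)]    # counterclockwise: +4
hole_cw = list(reversed(hole_ccw))             # clockwise: -4

print(signed_area(outer) + signed_area(hole_cw))   # 96.0 (hole subtracted)
print(signed_area(outer) + signed_area(hole_ccw))  # 104.0 (hole added!)
```

Code that normalizes the winding direction (or takes absolute values where appropriate) computes the same open area no matter how the operator draws the shape – exactly the kind of invariant a tester should probe.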
At least 8 patients died, while 20 others received overdoses that caused serious health problems. The physicians, who were legally required to double-check the computer’s calculations by hand, were indicted for murder.
The machine involved was a Cobalt-60 radiation therapy unit.
St. Mary’s Mercy Medical Center kills its patients
In February 2003, a computer software blunder at St. Mary’s Mercy Medical Center in Grand Rapids, Michigan, cost the lives of 8,500 patients. Well, not really. In reality, all of the patients who had had procedures done from October 25th through December 11th of the previous year were alive and kicking. However, the glitch, attributed to the hospital’s patient-management system, notified Social Security, the patients’ insurance companies, and the patients themselves of the “unfortunate” demises.
The reason: a mapping error, according to the center’s spokesman.
World of Warcraft creates literal computer virus
On September 12, 2005, a mad scene unfolded for a bunch of geeks when a new creation, Hakkar the God of Blood, hit WoW and took the term “computer virus” literally.
Hakkar hit players with a “Corrupted Blood” virus that could instantly kill weaker characters. The virus was supposed to be contained within Hakkar’s realm, but through a programming glitch it spread to other parts of the game world, resulting in over 1,000 “deaths.”
Blizzard Entertainment, the game’s creator, eventually incorporated some quick fixes along with rolling restarts to negate the Hakkar effect.
Apple Maps go nowhere…fast
With the 2012 Apple iOS 6 update, the company decided to kick the superior Google Maps platform to the curb in favor of its own system.
Unfortunately, it did a poor job of mapping out locations, resulting in one of the most epic fails of the mobile computing movement.
The software was missing entries for entire towns, placed locations incorrectly, returned wrong locations for simple queries, showed satellite imagery obscured by clouds, and more.
This is why testing shouldn’t be ignored! And because we know examples stick with people more easily than theory, we’ve prepared even more in the next part of our Software Disasters article edition, which will be out soon. Stay tuned!