Know your history, so you don’t repeat the same mistakes
This is also valid when discussing major software failures. We hope that looking back at some of the most prominent bugs of the past decades, and what they led to, might make us even more meticulous when developing and testing software. And besides, they’re great if you want to show off your knowledge in the field… Here comes another portion of notorious software failures:
Patriot Missile System Timing Issue leads to 28 dead
One of the most tragic computer software blunders happened on February 25th, 1991, during the Gulf War.
While the Patriot Missile System was largely successful throughout the conflict, it failed to track and intercept a Scud missile that struck an American barracks.
The software had a timing delay and was not tracking the incoming missile in real time, which gave the Iraqi missile the opportunity to break through and explode before anything could be done to stop it. In all, 28 people were killed and more than 100 were injured.
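The article only speaks of a "delay"; the explanation usually given in published post-mortems is that the system counted time in tenths of a second and converted it with a truncated binary value of 0.1, so a tiny error accumulated with every tick. The sketch below reproduces that arithmetic under stated assumptions: the 23 fraction bits, the 100 hours of uptime and the Scud speed of roughly 1,676 m/s come from those outside analyses, not from this article, and the function names are made up for illustration.

```python
from fractions import Fraction

FRACTION_BITS = 23          # assumption: bits kept after the binary point
SCUD_SPEED_M_PER_S = 1676   # assumption: approximate Scud speed


def truncated_tenth(bits: int = FRACTION_BITS) -> Fraction:
    """Return 0.1 chopped (not rounded) to `bits` binary fraction bits."""
    scale = 2 ** bits
    return Fraction(scale // 10, scale)  # floor(0.1 * 2**bits) / 2**bits


def clock_drift_seconds(hours_up: int, bits: int = FRACTION_BITS) -> Fraction:
    """Accumulated clock error after `hours_up` hours of 0.1 s ticks."""
    ticks = hours_up * 3600 * 10  # ten ticks per second
    error_per_tick = Fraction(1, 10) - truncated_tenth(bits)
    return ticks * error_per_tick


if __name__ == "__main__":
    drift = clock_drift_seconds(100)
    miss = float(drift) * SCUD_SPEED_M_PER_S
    print(f"drift after 100 h: {float(drift):.4f} s")
    print(f"tracking error at Scud speed: about {miss:.0f} m")
```

Under these assumptions the clock is off by about a third of a second after 100 hours, which at Scud speed is more than half a kilometer: enough to look in the wrong part of the sky.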
Mariner I Space Probe
While transcribing a handwritten formula into navigation computer code, a programmer missed a single superscript bar. This single omission caused the navigation computer to treat normal variations as serious errors, so it wildly overreacted with corrections during launch. To be fair to the programmer, the original formula was written in pencil on a single piece of notebook paper, which is not exactly the best system for transcribing mission-critical information. The year was 1962, and little serious attention was devoted to the idea of testing, which would very likely have caught the bug…
There were actually two faults within Mariner I.
The first was a hardware problem: an antenna was underperforming, so the spacecraft had to rely on its onboard guidance system alone. And that system contained the wrongly transcribed formula.
What happened? 293 seconds into the mission that was supposed to send Mariner I to Venus, the spacecraft was so far off course that Mission Control had to destroy it over the Atlantic. The cost of the spacecraft was 18.2 million US dollars… in 1962. Five weeks later, Mariner II took over the mission.
Hartford Coliseum Collapse
The Hartford Civic Center Coliseum collapsed on January 18th, 1978, just hours after nearly 5 thousand spectators had left the place. The steel-latticed roof collapsed under the weight of snow. The total cost was 90 million US dollars: 70 million for the Coliseum itself plus 20 million in damage to the local economy.
The computer model assumed that all of the top chords were laterally braced, but in fact only the interior frames met that criterion. Dead loads were underestimated by more than 20%. When one of the supports unexpectedly buckled under the snow, it set off a chain reaction that brought down the other roof sections.
The reason? There were many conflicting accounts of the failure, including design flaws, construction errors and programming errors. The CAD programmer modelled the Coliseum incorrectly, assuming that the roof supports would only face pure compression.
US Doom days
On August 14th, 2003, the biggest power crisis in American history was actually initiated in the bowels of a Unix-based XA/21 energy-management system.
Deep in the four million lines of C code running the system, there was a race condition bug. Race conditions occur when two separate threads of an operation read and write the same shared data at the same time, so the result depends on which one gets there first. If access isn’t properly synchronized, the threads can get themselves into a self-perpetuating tangle and crash the entire system. On this occasion, data feeds from several network monitors created a “perfect storm” for the race condition and, in a matter of seconds, the incoming data overwhelmed a system that was supposed to alert controllers to problems in the electricity grid.
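To make the idea of a race condition concrete, here is a minimal Python sketch. It is a generic illustration only, not a reconstruction of the energy-management code (which was written in C); the `AlarmCounter` class and the `run_feeds` helper are hypothetical names. Several "feed" threads bump a shared counter: the unsafe version does a separate read and write, so updates can be lost when threads interleave, while the safe version guards the update with a lock.

```python
import threading


class AlarmCounter:
    """Counts pending alarm events reported by several data feeds."""

    def __init__(self) -> None:
        self.pending = 0
        self._lock = threading.Lock()

    def record_unsafe(self) -> None:
        # Read-modify-write in two steps: another thread may update
        # `pending` in between, and its update is then overwritten.
        current = self.pending
        self.pending = current + 1

    def record_safe(self) -> None:
        # The lock makes the read-modify-write a single atomic step.
        with self._lock:
            self.pending += 1


def run_feeds(record, n_feeds: int = 4, events_per_feed: int = 10_000) -> None:
    """Run `n_feeds` threads that each call `record` `events_per_feed` times."""

    def feed() -> None:
        for _ in range(events_per_feed):
            record()

    threads = [threading.Thread(target=feed) for _ in range(n_feeds)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()


if __name__ == "__main__":
    safe = AlarmCounter()
    run_feeds(safe.record_safe)
    print("with lock:", safe.pending)

    unsafe = AlarmCounter()
    run_feeds(unsafe.record_unsafe)
    print("without lock:", unsafe.pending, "(may be lower; varies per run)")
```

The locked version always ends with every event counted. The unlocked version may or may not lose updates on a given run, which is exactly what makes such bugs so hard to catch in testing.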
With the alarm system down, the doughnut-munching controllers remained unaware of relatively minor network events that soon spiraled out of control because they were not quickly resolved. Unprocessed events queued up and the primary server failed within 30 minutes, switching all operations to the backup server, which itself failed minutes later.
Oblivious to the impending nightmare, observers did nothing when a power line tripped out after making contact with an unkempt tree, which forced more power onto another overhead power line, causing that one to sag and trip out too. Within an hour, power lines and circuit breakers were tripping left, right and center, as a power surge cascaded across the northeastern states.
Tripped-out lines caused a sudden drop in demand, bringing generators offline, which immediately caused a power vacuum that was filled by currents surging in from other plants.
It was the electrical equivalent of rush hour, and a major crash was inevitable. The carnage eventually left 256 power plants offline and disrupted cellular communication and media distribution; the best form of communication was reported to be laptops using dial-up modems.
AT&T Failure
On January 15th, 1990, AT&T’s clients suffered 9 hours of downtime in which no one could make long-distance calls. It is believed that the outage happened when a software glitch managed to disable many switches throughout the network. In fact, the network had just undergone a new software installation, which was the reason for the problem.
Once the fault was located, the company installed a previous version to get the angry clients back on the line. Around 60,000 clients were affected by the problem. Curiously, AT&T at first thought it had been hacked, which is why it took 9 hours to find the fault. The company was also working with law enforcement to find the supposed hacker.
In the meantime, AT&T lost an estimated 60 million US dollars in long-distance charges from calls that did not go through. The company took a further financial hit a few weeks later when it knocked a third off its regular long-distance rates on Valentine’s Day to make amends.
The reason for the failure was a software upgrade of a switch.
Intel Pentium – sucks at long division and customer service?
In November 1994, the New York Times reported a rather embarrassing issue with the Intel Pentium chips that had affected a variety of PCs.
A number of chips were flawed in a way that prevented them from accurately handling long division: a mistake unnoticeable to the common PC user, but a big deal for scientists and engineers, who required precise calculations in their work.
The reaction of Intel was a refusal to recall the chips, stating that it would not affect that many people.
Those who needed the precise division were forced to “prove” why it was important to them.
The problem lay in a faulty math coprocessor, also known as a floating-point unit. For the regular user it was not a big deal: there was a 1-in-360-billion chance that a miscalculation would reach as high as the fourth decimal place. More likely, with odds of 9 billion to 1 against, any error would show up in the 9th or 10th decimal digit.
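For a feel of what such an error looks like, the sketch below uses the division that was widely circulated at the time as a quick test for the flaw, 4195835 / 3145727. Neither those operands nor the flawed result quoted in the code appear in this article; they are taken from the commonly reported test case and should be read as an assumption. The function names are made up for illustration.

```python
# Widely circulated test operands for the flaw (assumption, not from the article).
X = 4195835.0
Y = 3145727.0

# Quotient commonly reported from a flawed chip (assumption, quoted from memory
# of published reports; a correct FPU gives roughly 1.33382).
REPORTED_FLAWED_QUOTIENT = 1.333739068902037589


def division_residual(x: float, y: float) -> float:
    """Return x - (x / y) * y, which should be (near) zero on a correct FPU."""
    return x - (x / y) * y


def relative_error(correct: float, observed: float) -> float:
    """Relative size of the discrepancy between two results."""
    return abs(correct - observed) / abs(correct)


if __name__ == "__main__":
    correct = X / Y
    print("quotient on this machine:", correct)
    print("residual on this machine:", division_residual(X, Y))
    print("relative error of the reported flawed result:",
          relative_error(correct, REPORTED_FLAWED_QUOTIENT))
```

On a healthy processor the residual is zero or vanishingly small. The reported flawed quotient already differs from the correct one in the fourth decimal place, which is the worst case the odds above refer to.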
The one who did need that accuracy was Virginia-based professor Thomas Nicely.
He informed Intel in October 1994, then told others when Intel replied with a response amounting to little more than “Oh, that thing? We noticed in June”.
Thus, the news exploded, and a PR mess turned into a costly mop-up bill.
In January 1995, Intel announced a pretax charge of 475 million US dollars against earnings, most of which apparently stemmed from replacing flawed processors.
Knight Capital – lose it all in 30 minutes
In August 2012, Knight Capital Group, a market-making firm with a stellar reputation in its industry, blew it all in just 30 minutes.
Between 9:30 a.m. and 10:00 a.m. Eastern Time on August 1st, the company’s trading algorithms got a little buggy and decided to buy high and sell low on 150 different stocks.
By the time this was stopped, Knight Capital had lost 440 million US dollars on trades.
By comparison, its market cap was just 296 million US dollars and the loss was four times its 2011 net income.
The reason for the fault: inadequate testing of the new software.
The trading algorithm of the company was also a bit eccentric. On every stock exchange, there is a “bid” and an “ask” price. The bid price is what a buyer is offering to pay for the shares; the ask price is what a holder is willing to sell them for. There’s always a spread between the two prices, with the “ask” price being a few cents or more above the “bid”. If the stock is thinly traded, the spread between the ask and the bid is wider than what you’d see on a heavily traded stock.
Knight Capital’s software went out and bought at the “market”, meaning it paid the ask price, and then sold at the bid price, instantly. Over and over again.
One of the stocks the program was trading, electric utility Exelon, had a bid/ask spread of 15 cents. Knight Capital was trading blocks of Exelon common stock at a rate as high as 40 trades per second, and taking a 15-cent-per-share loss on each round-trip transaction.
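To see how quickly that adds up, here is a back-of-the-envelope sketch of the Exelon numbers. The 15-cent spread, the 40-per-second rate and the 30-minute window come from the text above; the block size of 100 shares is purely hypothetical, and treating each of the 40 trades per second as a full round trip at the peak rate for the whole window is a simplifying assumption, so the result is a rough upper bound for this one stock and not a reported figure. Amounts are kept in whole cents to avoid floating-point rounding.

```python
SPREAD_CENTS = 15              # from the article: Exelon bid/ask spread
ROUND_TRIPS_PER_SECOND = 40    # assumption: each quoted trade is a round trip
WINDOW_SECONDS = 30 * 60       # from the article: roughly 30 minutes
SHARES_PER_BLOCK = 100         # hypothetical block size, not from the article


def round_trip_loss_cents(spread_cents: int, shares: int) -> int:
    """Loss from buying `shares` at the ask and selling them at the bid."""
    return spread_cents * shares


def window_loss_cents(spread_cents: int, shares: int,
                      round_trips_per_second: int, seconds: int) -> int:
    """Total loss if the same round trip repeats at a constant rate."""
    per_trip = round_trip_loss_cents(spread_cents, shares)
    return per_trip * round_trips_per_second * seconds


if __name__ == "__main__":
    total = window_loss_cents(SPREAD_CENTS, SHARES_PER_BLOCK,
                              ROUND_TRIPS_PER_SECOND, WINDOW_SECONDS)
    print(f"loss per round trip: ${round_trip_loss_cents(SPREAD_CENTS, SHARES_PER_BLOCK) / 100:.2f}")
    print(f"loss over the window: ${total / 100:,.2f}")
```

Even with a modest block size, one stock alone bleeds on the order of a million dollars in half an hour under these assumptions; multiply that across 150 stocks and larger blocks and the scale of the loss becomes easier to picture.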
The reasons: accidentally activated test software that incorporated an old piece of code written 9 years earlier. On one of the 8 production servers, a technician had not installed the new code, and there was no process for a second technician to review the deployment.
Amadeus – smack airports across the world
A network failure affected the Amadeus online booking platform in September 2017.
Amadeus provides an airline check-in system, and the failure led to delays at airports around the world and booking problems for passengers. The problem occurred on Thursday, September 28th, 2017.
The airports in Frankfurt, Paris, and Zurich reported problems with their computer systems which, happily, were quickly resolved.
London’s Gatwick Airport blamed Altea, a passenger management system that underpins booking systems and is made by the Spanish software company Amadeus.
The airlines most affected by the network failure were:
Some of the airlines said they were affected by the issue for only a minute, even though it took Amadeus three and a half hours to fix the problem.
Pepsi Bottle Glitch
In 1992, Pepsi-Cola Products Philippines launched a promotion dubbed “Number Fever” and offered prizes of up to P1 million for holders of bottle caps with the number 349.
When the number 349 was announced on May 25th, 1992, thousands of people claimed the prizes, but the company refused to pay all of them, stating that the caps did not contain the proper security codes.
The prize was P1 million, or around 37,000 US dollars, per winner. The Pepsi officials claimed that a computer glitch had picked the number by mistake, but those who held bottle caps with the winning number argued that it was not their mistake and that the soft drinks company should be ordered to pay them.
The Supreme Court ruled that “the issues surrounding the 349 incident have been laid to rest and must no longer be disturbed in this decision”. Pepsi faced more than 1,000 criminal and civil suits, but most of these cases have been dismissed.
Pepsi lawyer Alexander Poblador said that the company spent more than P200 million to pay close to 500,000 non-winning claimants as a goodwill gesture, but still suffered losses when angry claimants burnt 37 company trucks.
In addition to that, a Pepsi plant in Davao City was attacked by an angry mob, who lobbed a grenade killing three people and causing the suspension of the plant’s operation.
All these history throwbacks are nice, but we still need to look to the future and check in with the present. In the next and final part of Software Disasters, we are going to take a look at quite a pressing issue in QA: the DevOps system, how it influences the software testing field, and whether it is going to transform it even more. Don’t miss the grand finale to this break-room-material-worthy series soon!