Wednesday, March 24, 2010
Release It! - Case Study: The Exception That Grounded An Airline
This was an interesting chapter in that it described a real scenario that involved the author. An airline experienced a 3-hour outage of its check-in system that, because more people had to be brought in to work through the backlog, affected most of the airline for almost 9 hours. One interesting point was the downstream effects of the outage, which ranged from union-mandated overtime to SLA breaches and potentially lost bonus money.

The team was able to bring the system back, but people wanted to know what had happened, so the author had to fly on-site to perform a post-mortem and determine what went wrong. Due to strained relations between the software vendor and the end user, source code was not made available, so decompiling the Java bytecode, combined with thread dumps and logs, provided the answer: an EJB method did not deal with an SQL exception thrown when closing a connection. This one decision caused all of the server's threads to hang waiting for a database connection that was never going to arrive. The author's emphasis was not on better testing of the system in the hope of revealing that particular oversight but, instead, on the integration of the multiple systems. His view is that the death of one system should not halt the entire solution, and he promises to provide some guidance on how to keep a complex system healthy during times of stress.
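To make the failure mode concrete, here is a minimal sketch of how an unhandled exception during cleanup can drain a connection pool. It is not the airline's code, which I have not seen. TinyPool, closeStatement, leakyLookup and safeLookup are hypothetical names, the pool is just a semaphore, and I am assuming the failing close call sat in a finally block ahead of the code that returns the connection to the pool.

```java
import java.sql.SQLException;
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

public class ConnectionLeakSketch {

    /** Hypothetical stand-in for a connection pool: one permit per connection. */
    static final class TinyPool {
        private final Semaphore permits;

        TinyPool(int size) {
            permits = new Semaphore(size);
        }

        /** Bounded wait, so an exhausted pool fails fast instead of hanging the caller. */
        boolean tryCheckout(long timeoutMillis) throws InterruptedException {
            return permits.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        }

        void checkin() {
            permits.release();
        }

        int available() {
            return permits.availablePermits();
        }
    }

    /** Hypothetical cleanup call that always fails, e.g. after the database went away. */
    static void closeStatement() throws SQLException {
        throw new SQLException("simulated failure during close");
    }

    /** Leaky version: the exception from closeStatement() skips checkin(). */
    static void leakyLookup(TinyPool pool) throws Exception {
        if (!pool.tryCheckout(50)) {
            throw new IllegalStateException("pool exhausted");
        }
        try {
            // the query would run here
        } finally {
            closeStatement(); // throws, so the next line never runs
            pool.checkin();   // connection is never returned to the pool
        }
    }

    /** Safer version: the connection goes back to the pool even if cleanup fails. */
    static void safeLookup(TinyPool pool) throws Exception {
        if (!pool.tryCheckout(50)) {
            throw new IllegalStateException("pool exhausted");
        }
        try {
            // the query would run here
        } finally {
            try {
                closeStatement();
            } catch (SQLException e) {
                // a real system would log this; the point is that it must not escape
            } finally {
                pool.checkin();
            }
        }
    }

    /** Runs the chosen version against a fresh pool of 2 and reports what is left. */
    static int availableAfter(boolean safe, int calls) {
        TinyPool pool = new TinyPool(2);
        for (int i = 0; i < calls; i++) {
            try {
                if (safe) {
                    safeLookup(pool);
                } else {
                    leakyLookup(pool);
                }
            } catch (Exception e) {
                // the caller sees a failed request; the pool state is what matters here
            }
        }
        return pool.available();
    }

    public static void main(String[] args) {
        System.out.println("leaky: " + availableAfter(false, 3)); // prints "leaky: 0"
        System.out.println("safe: " + availableAfter(true, 3));   // prints "safe: 2"
    }
}
```

The sketch uses a timed checkout so that it terminates. With a checkout that blocks indefinitely, the third caller would simply wait forever, which is the kind of hang the thread dumps revealed.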