Thursday, March 25, 2010

Release It! - Introducing Stability

Enterprise software must be cynical, meaning that it trusts no one, including itself.  Cynical software expects bad things to happen and has plans for dealing with others who fail to hold up their part of the bargain.

"A System is the complete interdependent set of hardware, applications and services required to process transactions for users." Translation: a system is everything needed to provide a working solution for the users, right down to the power supply and network cables.

A stable system is resilient to transient impulses, persistent stresses and component failures. In short, the user can still get work done despite failures in some parts of the system.  A stable architecture usually costs about the same to implement as an unstable one, so doesn't it make sense to choose the former?

An Impulse is a rapid shock to the system, such as a million hits to your website because of an article on Slashdot.  A Stress is a force applied over an extended period of time, such as ever-increasing response times from your database server or a credit card processor that can't handle your transaction load.  A Strain is a change in the shape of the system caused by Stress, such as higher RAM usage because an external service is processing requests slowly.

An enterprise system is supposed to run for a long time.  What is the working definition of a "long time"?  How about the time between code deployments?  If you deploy new code once a year, then a "long time" for you is 12 months.

Longevity tests are rarely run, but skipping them all but guarantees that bugs which only show up after long uptime will appear in production.  If possible, dedicate a machine to running a load-testing tool that drives a moderate level of transactions against the system for an extended period of time.  Make sure to include slack periods that simulate the quiet hours in the middle of the night.  If you can't set up an entire environment, test the important parts and stub the rest.  If you don't do your own longevity testing, your production system will become your longevity test and you won't be happy.
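To make the idea concrete, here is a minimal sketch of a longevity driver in Python.  It is not from the book: the function names, the request rates and the 1 a.m. to 5 a.m. slack window are all assumptions made up for illustration, and send_request stands in for whatever call exercises your system.

```python
import time
from typing import Callable


def target_rate(hour_of_day: int,
                busy_rate: float = 5.0,
                slack_rate: float = 0.5,
                slack_hours: range = range(1, 5)) -> float:
    """Requests per second to send at a given hour (illustrative numbers only)."""
    return slack_rate if hour_of_day in slack_hours else busy_rate


def run_longevity_test(send_request: Callable[[], None],
                       duration_s: float,
                       clock: Callable[[], float] = time.time,
                       sleep: Callable[[float], None] = time.sleep) -> int:
    """Send a steady, moderate load until duration_s has elapsed.

    The pace drops during the slack hours to imitate overnight lulls.
    Returns the number of requests sent.
    """
    sent = 0
    deadline = clock() + duration_s
    while clock() < deadline:
        hour = time.localtime(clock()).tm_hour
        send_request()
        sent += 1
        # Pause long enough to hold the target rate for this hour.
        sleep(1.0 / target_rate(hour))
    return sent
```

A real run would pass a duration measured in days or weeks.  The clock and sleep parameters exist only so the loop can be exercised quickly with fake time.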

Some component of the system will fail before everything else does, and that first failure is known as a Crack.  The original trigger, the way the Crack spreads to the rest of the system and the resulting damage are together known as a Failure Mode.  It is in your own best interest to try to identify the Failure Modes and create a system that can withstand them.

A Crackstopper is a software crumple zone designed to absorb the impact of a failure and keep the rest of the system safe.  This allows you to decide which parts of the system are critical and keep cracks away from them.  A software shock absorber, if you will.  Cracks propagate, so you must do your best to prevent that.  Possible Crackstoppers from the case study include TCP timeouts, better partitioning of servers, using HTTP instead of RMI, and using an asynchronous messaging protocol to decouple things.
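As a rough illustration of the first of those, a TCP timeout, here is a small Python sketch.  The function name, the default timeout and the fallback value are my own assumptions, not anything from the case study.  The point is only that a caller who refuses to wait forever cannot be dragged down by a remote service that hangs.

```python
import socket


def fetch_with_timeout(host: str, port: int, request: bytes,
                       timeout_s: float = 2.0,
                       fallback: bytes = b"") -> bytes:
    """Send a request and read one reply, giving up after timeout_s seconds.

    The timeout applies to the connect and to each send/receive call.
    Any socket error, including a timeout, yields the fallback value
    instead of blocking the caller indefinitely.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as conn:
            conn.sendall(request)
            return conn.recv(4096)
    except OSError:
        # Timeouts and refused connections are both subclasses of OSError.
        return fallback
```

Without the timeout, a server that accepts the connection and then goes silent would hold the calling thread for as long as it liked, which is exactly how a crack spreads.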

Each failure results in a chain of events, and each event in that chain can accelerate, slow or stop a crack.  Rule of thumb: tight coupling accelerates cracks, so reduce coupling to slow the spread of a crack.  Brute-force analysis of every resource call, I/O operation and external API call to identify potential failure scenarios is impractical.  Luckily, patterns exist that can help prevent cracks from propagating, and they will be presented in future chapters.
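One way to picture looser coupling is a bounded in-process queue standing in for an asynchronous messaging layer.  This is only a sketch under that assumption: the names and the tiny queue size are invented for the example, and a real system would more likely use a message broker.

```python
import queue

# A deliberately small buffer between the producer and a possibly slow consumer.
orders = queue.Queue(maxsize=2)


def submit(order: str) -> bool:
    """Hand the order to the consumer without waiting for it.

    Returns False when the buffer is full, so a slow consumer causes
    rejected work rather than a blocked producer.
    """
    try:
        orders.put_nowait(order)
        return True
    except queue.Full:
        return False
```

The producer never waits on the consumer, so a stalled consumer shows up as a False return the caller can handle rather than as a hung thread.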
