Friday, March 26, 2010

Release It! - Chapter 4.1 Integration Points

Chapter 4 is an interesting chapter and a very large one, so I'll be breaking up my posts by sub-section. The chapter identifies some common failure scenarios and labels them Antipatterns. A solution is usually recommended as well, but we fall prey to a chicken-and-egg problem: some of the solution patterns called out haven't been described yet, so you have to guess at what they might mean and come back later to see whether they make sense in the context of the Antipattern.

One factor that affects how effectively a system can be operated in the field is known as Interactive Complexity. Interactive Complexity arises in systems that have enough moving parts and internal dependencies that operators cannot keep a complete or accurate mental model of the system. Hidden linkages between components can lead to Problem Inflation, where an operator turns a dial expecting it to solve a problem but, not fully understanding the relationships between components, makes the situation worse. Tight coupling between parts of the system allows cracks (failures) to travel from one part to another, which makes diagnosis more difficult and magnifies the effects of the original crack. Several Antipatterns have been identified that can be used to spot situations likely to cause failures in the system. Things will break, and it is up to you to identify the failure points and plan for them.

Integration Points with other systems are the number one stability killer. Most integrations make use of sockets, which can cause many headaches if you don't understand some of their nuances and code defensively. For example, the time it takes to detect a connection timeout varies between OSes and is usually measured in minutes. Blocking application threads for minutes at a time while waiting for a connection to fail is a sure way to cripple a system.

When you need to see what is actually happening on the wire, one idea is to run tcpdump on the production box and analyze the capture file with Wireshark on a non-production machine. The tcpdump program is much lighter weight than Wireshark and captures the same information. Keeping a copy of The TCP/IP Guide or TCP/IP Illustrated around can help guide you through network-related failures. One thing to remember is that TCP/IP was created before firewalls existed, so there are classes of failures that come from the interaction between the firewall and the TCP/IP stack. Idle connections and firewalls are a bad mix.

HTTP client libraries can also be a problem because HTTP sits atop TCP/IP, which means it inherits the same timeout-related issues. Most HTTP libraries are not cynical and do not provide timeouts in their API, making it harder to code defensively. The Apache HTTP Client library does support timeouts and is one API that can be used in cynical programs.

Vendor APIs, such as database drivers, are another source of integration pain. Many times the server code is rock solid and cynical but the client-side APIs are not, which eventually causes your integration threads to block. Since you have no direct control over the API, all you can do is identify bugs and work with your vendors. Patterns exist to help combat integration point failures, such as Circuit Breaker and Decoupling Middleware.

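
To make the point about blocked threads concrete, here is a minimal sketch of a "cynical" socket call in Python. It is my own illustration, not code from the book: the names `call_remote` and `IntegrationPointError` and the default timeout values are assumptions chosen for the example. The idea is simply that both the connect and every later socket operation get an explicit, short timeout instead of relying on the OS default.

```python
import socket


class IntegrationPointError(Exception):
    """Hypothetical wrapper for any failure while talking to a remote system."""


def call_remote(host, port, payload, connect_timeout=2.0, read_timeout=2.0):
    """Send payload and return up to 4096 bytes of the reply.

    The connect attempt and each subsequent socket operation are bounded by
    explicit timeouts, so the calling thread is not parked for minutes.
    """
    try:
        # create_connection applies the timeout to the connect attempt.
        with socket.create_connection((host, port), timeout=connect_timeout) as sock:
            # Switch to the read timeout for the rest of the conversation.
            sock.settimeout(read_timeout)
            sock.sendall(payload)
            return sock.recv(4096)
    except OSError as exc:
        # Timeouts, refused connections and resets are all OSError subclasses.
        raise IntegrationPointError(f"{host}:{port} failed: {exc}") from exc
```

A caller would catch `IntegrationPointError` and decide what to do next (fail the request, fall back, or retry later) rather than hang. The right timeout values depend on the remote system, so treat the two-second defaults as placeholders.
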
One tactic you can use is to write a test harness that verifies your software is cynical enough. Make sure it can simulate integration and network failures. You'll also want to run your tests under load, using JMeter or some other load tester. Every integration point will fail. Count on it and be prepared. Failures never come in the form of a nicely formatted error message complete with a cause and solution. Instead, you see slow responses, hangs and the like. To debug what is going on, you'll likely have to peel away some abstractions and look deeply into the integration using decompilers and network sniffers. One sad fact is that a remote system's failure will become your problem if you are not defensive enough.
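
As a rough idea of what such a test harness might look like, the sketch below simulates one specific failure: a remote system that accepts the connection and then never answers. Everything here (`SilentServer`, `gives_up_within`, the deadlines) is hypothetical and written for this post, not taken from the book. It only covers the "hang" case; a fuller harness would also simulate refused connections, slow responses and garbage replies.

```python
import socket
import threading


class SilentServer:
    """Hypothetical harness: accepts TCP connections, then never sends a byte."""

    def __init__(self):
        self._listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        self._listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        self._listener.listen(5)
        self._listener.settimeout(0.1)  # lets the accept loop notice shutdown
        self.port = self._listener.getsockname()[1]
        self._held = []
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._accept_loop, daemon=True)

    def _accept_loop(self):
        while not self._stop.is_set():
            try:
                conn, _ = self._listener.accept()
            except socket.timeout:
                continue
            except OSError:
                break
            self._held.append(conn)  # keep the connection open and silent

    def __enter__(self):
        self._thread.start()
        return self

    def __exit__(self, *exc_info):
        self._stop.set()
        self._thread.join(timeout=1.0)
        for conn in self._held:
            conn.close()
        self._listener.close()
        return False


def gives_up_within(client_call, deadline_seconds):
    """Return True if client_call finishes (normally or with an error) in time."""

    def run():
        try:
            client_call()
        except OSError:
            pass  # failing fast is the behaviour we want to see

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    worker.join(deadline_seconds)
    return not worker.is_alive()


if __name__ == "__main__":
    def cynical_client(port):
        with socket.create_connection(("127.0.0.1", port), timeout=0.5) as sock:
            sock.settimeout(0.5)
            sock.recv(1)

    with SilentServer() as server:
        print(gives_up_within(lambda: cynical_client(server.port), 2.0))
```

The client under test runs in its own thread so that a client with no timeouts shows up as a failed check instead of hanging the whole test run.
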
