Friday, April 30, 2010

Release It! - Chapter 11.2 Virtual IP Addresses

Cluster server packages, such as HP ServiceGuard or Veritas Cluster Server, give non-clustered applications the appearance of being clustered.  They work by detecting whether an application instance is running and starting one up if needed, using Virtual IP Addresses to redirect traffic as required.  A Virtual IP Address is an IP address that can be moved from one NIC to another; some lower-layer TCP/IP manipulation has to be done to make this work.  In general, if your application calls any other service through a virtual IP, it must be prepared for the possibility that the next TCP packet isn't going to the same interface as the last one.  This can cause IOExceptions in strange places, so ruthless testing is a good idea.  See if you can configure your test environment to use Virtual IPs and repoint them to see how your system reacts.
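
Something like this is the kind of defensiveness the book is asking for.  A minimal sketch in Java; the Connection interface is a hypothetical wrapper around a socket, but the point is treating an IOException as a possibly-moved VIP and retrying once on a fresh connection:

    import java.io.IOException;

    public class VipAwareCaller {
        // Hypothetical abstraction over a socket-based service connection.
        public interface Connection {
            String send(String request) throws IOException;
            void reconnect() throws IOException;
        }

        private final Connection conn;

        public VipAwareCaller(Connection conn) { this.conn = conn; }

        public String call(String request) throws IOException {
            try {
                return conn.send(request);
            } catch (IOException e) {
                // The VIP may have just moved to another NIC, killing the
                // old TCP connection.  Open a fresh one and retry once.
                conn.reconnect();
                return conn.send(request);
            }
        }
    }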

Thursday, April 29, 2010

Release It! - Chapter 11.2 Routing

To help keep straight what each network interface is used for, keep a spreadsheet of:
  • the destination name
  • address
  • desired route
You'll need that data for firewall rules anyway.   This is a very unsexy topic but it is important, especially when you need to migrate away from existing integrations over to new ones.  People come and go but, hopefully, the spreadsheet stays.

Wednesday, April 28, 2010

Release It! - Chapter 11.1 Multihomed Servers

Fact: nearly every server in a data center will be multihomed, meaning it has multiple NICs and IP addresses and sits on multiple networks.  Typically, different networks are used for different purposes: one for production traffic and another for administration or backup.  Bonding is where multiple NICs share the same IP address and the OS takes care of routing packets properly; this is useful for load balancing.  By default, Java listens for traffic on all interfaces, so you have to be careful and use the proper APIs to restrict which interfaces your application uses.  Your application needs configuration properties that tell it which interfaces to bind to.  It makes sense to configure the development, or at least the test, environment with multihomed servers so you won't get any surprises in the field.  The virtualization environments I've seen support multiple NICs, so hardware shouldn't be an issue.
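
Here's a quick sketch of the restriction in Java.  ServerSocket has a constructor that takes a bind address; the address itself should come from a configuration property (the one below is a placeholder):

    import java.net.InetAddress;
    import java.net.ServerSocket;

    public class BoundListener {
        public static void main(String[] args) throws Exception {
            // Placeholder address -- read it from configuration in real life.
            InetAddress productionNic = InetAddress.getByName("192.168.10.42");
            // ServerSocket(port, backlog, bindAddr) listens on one NIC only,
            // unlike new ServerSocket(8080), which listens on all interfaces.
            ServerSocket socket = new ServerSocket(8080, 50, productionNic);
            System.out.println("Listening on " + socket.getLocalSocketAddress());
        }
    }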

Tuesday, April 27, 2010

Release It! - Chapter 10.4 Tune the Garbage Collector

In Java applications, garbage collection tuning is the quickest and easiest way to see some capacity improvements.  An untuned application running at production volumes and traffic will probably spend 10% of its time collecting garbage; that should be reduced to 2% or less.  You can get visibility into the garbage collector's behavior by passing the -verbose:gc argument to the JVM at start-up time.  If you are using Java 5 or later, you can use the jconsole tool that comes with the Java SDK.  Once you can see the garbage collection patterns, tuning the garbage collector is largely a matter of ensuring sufficient heap size and adjusting the ratios that control the relative sizes of the generations.  Perfectly tuned (if there is such a thing) settings for one release can be totally wrong for the next, so always retune for each release.
  • Tune the garbage collector in production - User access patterns make a huge difference in the optimal settings, so you can’t tune the garbage collector in development or QA.
  • Keep it up - You will need to tune the garbage collector after each major application release. If you have an annual demand cycle, you will also need to tune it at different times during the year, as user traffic shifts between features.
  • Don’t pool ordinary objects - The only objects worth pooling are external connections and threads. For everything else, rely on the garbage collector.
One question I have: how do you tune in a production environment without adversely impacting the business?  One idea is to observe one node in the cluster and use that data to tune all of the nodes' settings.
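
For reference, this is the sort of command line I'd expect to experiment with.  The flags are standard HotSpot options, but every value below is a made-up placeholder that would have to be derived from your own GC logs:

    # Heap size and generation ratios are placeholders -- derive real values
    # from your own verbose GC output (or jconsole on Java 5 and later).
    java -verbose:gc -XX:+PrintGCDetails \
         -Xms1024m -Xmx1024m \
         -XX:NewRatio=3 -XX:SurvivorRatio=8 \
         com.example.MyServer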

Monday, April 26, 2010

Release It! - Chapter 10.3 Precompute Content

Precomputing, at least in the context of this chapter, means "you don't have to make every web page dynamic".  The cost of generating dynamic HTML adds up and can really tax a system.  How often do those dynamic portions of the site really change?  Maybe you can rebuild a web page once an hour instead of on every hit and still keep the site fresh.  Precomputing content isn't free, however.  It requires storage space for each piece of computed content, and there is some runtime cost to mapping an identifier to a file and reading the file.  For commonly used content, this cost might motivate you to cache the content itself in memory.  The cost of generating the content mainly occurs when the content changes, so if the content gets used many times before it changes, precomputing it is worthwhile.  Personalization works against precomputed content.  If entire pages are personalized, then precomputed content is impossible; on the other hand, if just a few fragments or sections are personalized, the majority of the page can be precomputed with a "punch out" for the personalized content.  Precomputed content does not need to be an all-or-nothing approach.  Some high-traffic areas of the site can be precomputed, while less frequently visited pages remain fully dynamic.  A small sketch follows the tip below.
  • Precompute content that changes infrequently - Any content that you present many times before it changes could be precomputed to save time during request handling. Factor the cost of generating the content out of individual requests and into the deployment process.
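
Here's a minimal sketch of the "rebuild once an hour" idea, assuming a render(key) method that does the expensive dynamic generation; the cache directory and the one-hour window are invented for illustration:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;

    public class PrecomputedContent {
        private static final long MAX_AGE_MS = 60L * 60 * 1000; // rebuild hourly
        private final Path cacheDir = Paths.get("/var/cache/pages"); // placeholder

        public String page(String key) throws IOException {
            Path file = cacheDir.resolve(key + ".html");
            // Serve the precomputed file if it is still fresh.
            if (Files.exists(file) && System.currentTimeMillis()
                    - Files.getLastModifiedTime(file).toMillis() < MAX_AGE_MS) {
                return new String(Files.readAllBytes(file), "UTF-8");
            }
            String html = render(key);                  // the expensive part
            Files.write(file, html.getBytes("UTF-8"));  // pay once per hour, not per hit
            return html;
        }

        private String render(String key) {
            return "<html>...</html>"; // hypothetical page generation
        }
    }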

Sunday, April 25, 2010

Release It! - Chapter 10.2 Use Caching Carefully

Caching can be a powerful response to a performance problem. It can reduce the load on the database server and cut response times to a fraction of what they would be without caching. When misused, however, caching can create new problems. The maximum memory usage of all application-level caches should be configurable. No matter what memory size you set on the cache, you need to monitor hit rates for the cached items to see whether most items are being used from cache. If hit rates are very low, then the cache is not buying any performance gains and might actually be slower than not using the cache. It’s also wise to avoid caching things that are cheap to generate. In Java, caches should be built using SoftReference objects to hold the cached item itself. In extreme cases, it might be necessary to move to a multilevel caching approach. In this approach, you keep the most frequently accessed data in memory but use disk storage for a secondary cache. Precomputing results can reduce or eliminate the need for caching. Finally, any cache presents a risk of stale data. Every cache should have an invalidation strategy to remove items from cache when their source data changes. The strategy you choose can have a major impact on your system’s capacity.
  • Limit cache sizes - Unbounded caches consume memory that is better spent handling requests. Holding every object you’ve ever loaded in memory doesn’t do the users any good.
  • Build a flush mechanism - Whether it’s based on the clock, the calendar, or an event on the network, every cache needs to be flushed sooner or later. A cache flush can be expensive, though, so consider limiting how often a cache flush can be triggered, or you just might end up with attacks of self-denial.
  • Don’t cache trivial objects - Not every domain object and HTML fragment is worth caching. Seldom-used, tiny, or inexpensive objects aren’t worth caching: the cost of bookkeeping and reduced free memory outweighs the performance gain.
  • Compare access and change frequency - Don’t cache things that are likely to change before they get used again.
The advice to monitor cache hits is a great one.  I've worked on systems where a developer swears that caching XYZ will save so much time but could never prove it because the home-grown caching solution didn't provide visibility into the cache.  I'm thinking that using an established caching solution, such as Ehcache, is a good place to start since they often provide the features the book suggests.
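
Here's a bare-bones sketch of the two suggestions together: SoftReference values so the garbage collector can reclaim entries under memory pressure, plus the hit/miss counters that home-grown cache lacked.  A real product like Ehcache gives you this (and eviction, and size limits) out of the box:

    import java.lang.ref.SoftReference;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.atomic.AtomicLong;

    public class MonitoredCache<K, V> {
        // Note: a real cache would also bound the number of entries.
        private final ConcurrentHashMap<K, SoftReference<V>> map =
                new ConcurrentHashMap<K, SoftReference<V>>();
        private final AtomicLong hits = new AtomicLong();
        private final AtomicLong misses = new AtomicLong();

        public V get(K key) {
            SoftReference<V> ref = map.get(key);
            V value = (ref == null) ? null : ref.get(); // null if GC reclaimed it
            if (value == null) {
                misses.incrementAndGet();
            } else {
                hits.incrementAndGet();
            }
            return value;
        }

        public void put(K key, V value) {
            map.put(key, new SoftReference<V>(value));
        }

        // If this stays low, the cache is costing more than it saves.
        public double hitRate() {
            long h = hits.get(), m = misses.get();
            return (h + m == 0) ? 0.0 : (double) h / (h + m);
        }
    }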

Friday, April 23, 2010

Release It! - Chapter 10.1 Pool Connections

Resource pools can dramatically improve capacity because they eliminate connection setup time.  Establishing a new database connection requires a TCP connection, database authentication, and database session setup; taken together, this can easily take 400 to 500 milliseconds.  About the only thing more expensive than creating a database connection is starting a new thread.  Connection pool sizing is a vital issue: an undersized connection pool leads to resource pool contention, while an oversized connection pool can cause excess stress on the database servers.  You must monitor your connection pools for contention, or this capacity enhancer will quickly become a killer.
  • Pool connections - Connection pooling is basic. There’s no excuse not to do it.  It probably makes sense to use an established library instead of coding a pooling implementation by hand.
  • Protect request-handling threads - Do not allow callers to block forever. Make sure that any checkout call has a timeout and that the caller knows what to do when it doesn’t get a connection back.
  • Size the pools for maximum throughput - Undersized resource pools lead to contention and increased latency. This defeats the purpose of pooling the connections in the first place. Monitor calls to the connection pools to see how long your threads are waiting to check out connections.
In some application servers, you can configure the database pool to verify that the connection is still valid prior to handing it over to the client.  It results in an extra network call but the overhead is probably better than handing over a busted connection.
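
A stripped-down sketch of both ideas -- a checkout that times out instead of blocking forever, and a test-on-borrow using Connection.isValid (standard as of JDBC 4).  In real life, use an established pool rather than anything like this:

    import java.sql.Connection;
    import java.sql.SQLException;
    import java.util.concurrent.BlockingQueue;
    import java.util.concurrent.LinkedBlockingQueue;
    import java.util.concurrent.TimeUnit;

    public class TinyPool {
        private final BlockingQueue<Connection> idle = new LinkedBlockingQueue<Connection>();

        public TinyPool(java.util.Collection<Connection> connections) {
            idle.addAll(connections);
        }

        public Connection checkout(long timeoutMillis) throws SQLException, InterruptedException {
            // Never block forever: a request-handling thread must get an answer.
            Connection conn = idle.poll(timeoutMillis, TimeUnit.MILLISECONDS);
            if (conn == null) {
                throw new SQLException("No connection available within " + timeoutMillis + "ms");
            }
            // Test-on-borrow: one extra round trip, but better than a dead connection.
            if (!conn.isValid(2 /* seconds */)) {
                conn.close();
                throw new SQLException("Pooled connection was no longer valid");
            }
            return conn;
        }

        public void checkin(Connection conn) {
            idle.offer(conn);
        }
    }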

Release It! - Chapter 9 Capacity Antipatterns

Hardware and bandwidth are expensive so use them efficiently.

Resource Pool Contention
  • eliminate contention under normal loads
  • if possible, size resource pools to the request thread pool - if there is a resource for every thread then there should not be any contention.
  • prevent vicious cycles - resource contention causes transactions to take longer.  Slower transactions cause more resource contention.
  • watch for Blocked Threads - the pool capacity problem can quickly become a stability problem if your threads start blocking while waiting for a resource.
Excessive JSP Fragments
  • JSPs can cause permgen memory problems because each page's class gets loaded into memory and never leaves
  • don't use code for content - if the content is static, then use HTML.  Only use JSP for dynamic content where you need code to make it work.
AJAX Overkill
AJAX clients can overwhelm non-Google-sized servers because requests arrive far more frequently than a human could generate them.  Done right, AJAX can reduce bandwidth; done wrong, it can crash your system.
  • Interaction Design - try to use AJAX for interactions that represent a single task in the user's mind, such as sending an e-mail.  This can eliminate serving up multiple pages to walk through a work flow.
  • Request Timings - see if your AJAX libraries allow you to modify the interval between auto-complete requests.
  • Session Thrashing - make sure to configure your web back end for session affinity so that you don't needlessly migrate sessions.
  • Response Formatting - don't send back HTML pages or fragments.  HTML is verbose and chews up bandwidth.  Instead, send back the data and dynamically update the page on the client (sketched after this list).  Use JSON instead of XML; it is easier to parse and less verbose.
  • avoid needless requests - don't use polling for auto-completion.  Instead, send the request when the field actually changes.
  • respect session architecture - make sure your AJAX requests include a session id cookie or query parameter.  If you don't, the application server will create a new, wasted session for each AJAX request.
  • minimize the size of replies - return the least amount of data.  Avoid returning HTML and hand back JSON or XML instead.
  • increase the size of your web tier - AJAX is chatty, so make sure your web tier can handle the additional traffic.
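
Here's a sketch of the "send back data, not markup" advice as a servlet.  The hand-rolled JSON (no escaping!) and the suggestion list are purely illustrative -- a real app would use a JSON library:

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class SuggestServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            // Respect the session architecture: don't force-create a wasted
            // session for every autocomplete call.
            req.getSession(false);

            String prefix = req.getParameter("q");
            resp.setContentType("application/json");
            // Lean data payload; the client updates the page from it.
            resp.getWriter().write(
                    "{\"q\":\"" + prefix + "\",\"suggestions\":[\"alpha\",\"alpine\"]}");
        }
    }
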
Overstaying Sessions
Java's default session timeout of 30 minutes is overkill.  Analyze your traffic and try to determine the average real session time.  Set your session timeout to one standard deviation above that average.  The best bet is to avoid sessions altogether.  If the session is serving as a cache and you can easily recreate its contents, throw it away at the end of the operation.  A small servlet-API sketch follows this list.
  • curtail session retention - keep sessions in memory for as short a time as reasonable.
  • remember that users don't understand sessions - users understand automatic logout for security reasons.  They don't understand their shopping cart getting emptied because they took longer than 30 minutes to complete their transaction.  Things should not disappear because the user went away for a cup of coffee.
  • keep keys, not whole objects - keys are smaller and consume less memory. Keep whole objects only if you use SoftReferences.
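
Two of those points in servlet-API terms.  setMaxInactiveInterval is the standard call; the 19-minute figure is a made-up example of "average plus one standard deviation", and cartId is an invented attribute:

    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpSession;

    public class SessionHygiene {
        public void configure(HttpServletRequest req) {
            HttpSession session = req.getSession();
            // Derived from traffic analysis (say, a 12-minute average with a
            // 7-minute standard deviation), not the 30-minute default.
            session.setMaxInactiveInterval(19 * 60); // seconds

            // Keep keys, not whole objects: store the cart id and reload the
            // cart from the database when needed.
            session.setAttribute("cartId", Long.valueOf(42L)); // placeholder id
        }
    }
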
Wasted Space in HTML
The more HTML you transmit, the more it costs the system.
  • Whitespace - whitespace costs money.  Try putting in an interceptor to filter out whitespace (see the filter sketch after this list).
  • Expensive Spacer Images - that little 1-pixel transparent GIF chews up resources because of multiplier effects.  Size things using HTML instead.
  • Excess HTML Tables - using Tables for formatting instead of CSS has a tremendously negative effect on capacity.
  • omit needless characters
  • remove whitespace - computers don't need nice formatting so don't pay for it
  • replace spacer images with non-breaking spaces or CSS
  • replace HTML tables with CSS layout - CSS files are loaded once, tables are loaded every time.
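
Here's a sketch of that whitespace interceptor as a servlet Filter.  The response-buffering wrapper is the standard trick; the single regex that collapses whitespace between tags is naive (it would mangle <pre> blocks, for one) and the content-length math assumes a single-byte encoding, so treat this strictly as a starting point:

    import java.io.CharArrayWriter;
    import java.io.IOException;
    import java.io.PrintWriter;
    import javax.servlet.*;
    import javax.servlet.http.HttpServletResponse;
    import javax.servlet.http.HttpServletResponseWrapper;

    public class WhitespaceFilter implements Filter {
        public void doFilter(ServletRequest req, ServletResponse res, FilterChain chain)
                throws IOException, ServletException {
            BufferingWrapper wrapper = new BufferingWrapper((HttpServletResponse) res);
            chain.doFilter(req, wrapper);
            // Naive: collapses inter-tag whitespace; breaks <pre> and inline JS.
            String compact = wrapper.toString().replaceAll(">\\s+<", "><");
            res.setContentLength(compact.length());
            res.getWriter().write(compact);
        }

        public void init(FilterConfig config) {}
        public void destroy() {}

        private static class BufferingWrapper extends HttpServletResponseWrapper {
            private final CharArrayWriter buffer = new CharArrayWriter();
            private final PrintWriter writer = new PrintWriter(buffer);
            BufferingWrapper(HttpServletResponse response) { super(response); }
            @Override public PrintWriter getWriter() { return writer; }
            @Override public String toString() { writer.flush(); return buffer.toString(); }
        }
    }
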
The Reload Button
Impatient users will hammer the reload button, piling more load onto an already slow site.  The only defense is to make the reload button irrelevant by making your site fast enough.

Handcrafted SQL
Developer-crafted SQL is usually bad and is difficult to find and tune.  Handcrafted SQL is unpredictable, while ORM-generated SQL is not.
  • minimize hand crafted SQL
  • see whether the DBA laughs at your queries - don't put it into production if she does
  • verify gains against real data - try your SQL against a production-sized data set.  Just because it is fast in development doesn't mean it'll be fast in production.
Database Eutrophication
Eutrophication is a fancy term for "lethal sludge build up".
  • Indexing - in general, any column that is a target of an association should be indexed.  Keep the DBA in the loop as development proceeds to help with evolving the database schema.
  • Partitioning - a vendor specific way of organizing the tables on disk
  • Historical Data - reporting and ad hoc analysis should not be done on the production database.  Think about a multi-level storage scheme so that historical data doesn't build up as crud in the production db.
  • create indexes: it's not just the DBA's responsibility - you know how your app works and the types of access it makes; use that knowledge to improve the db's performance.
  • purge sludge - old data slows down queries and inserts, so try to get it off of the production servers.
  • keep reports out of production - don't jeopardize production with expensive queries.  A data warehouse schema is different from an OLTP schema anyway.
Integration Point Latency
Performance problems for individual users become capacity problems for the entire system.  Expose yourself to latency as seldom as possible - avoid chatty protocols.

Cookie Monsters
HTTP cookies can be easily abused.  Storing a serialized object graph as a cookie is a bad idea; there are security and code issues as well as capacity issues.  Cookies are meant to be small (less than 100 bytes), probably because they are sent with each request -- just enough to hold a session identifier.  Constantly sending 400K worth of cookies with each request is a drain on resources, including bandwidth and CPU.  Cookies are useful, but be mindful that the client can lie, send back stale or broken data, or not send back any cookies at all.
  • serve small cookies - use cookies for identification, not objects (sketched below).
  • keep session data on the server, where it can't be tampered with by a hacker.
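
The "serve small cookies" point in code -- the cookie carries nothing but an opaque identifier, and the real state lives server-side.  The cookie name is a placeholder, and setHttpOnly requires Servlet 3.0:

    import javax.servlet.http.Cookie;
    import javax.servlet.http.HttpServletResponse;

    public class IdentificationOnly {
        public void addSessionCookie(HttpServletResponse resp, String sessionId) {
            // Just an opaque identifier -- a few dozen bytes, not a serialized
            // object graph.  All real session state stays on the server.
            Cookie cookie = new Cookie("SESSIONID", sessionId); // placeholder name
            cookie.setPath("/");
            cookie.setHttpOnly(true); // Servlet 3.0+; keeps scripts away from it
            resp.addCookie(cookie);
        }
    }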

Thursday, April 22, 2010

Release It! - Chapter 8 Capacity

Performance measures how fast the system processes a single transaction.  Customers care about throughput or capacity.  End users care about performance because they want their needs met as quickly as possible.  Throughput describes the number of transactions a system can process in a given time span.  Scalability, in this context, describes how throughput changes under varying loads; a graph of requests per second versus response times measures scalability.  Capacity is the maximum throughput a system can sustain for a given workload.

A single constraint determines the system's capacity; the first constraint to hit its ceiling will limit the system.  Understanding capacity requires systems thinking -- the ability to think in terms of dynamic variables, change over time, and interrelated connections.  Consider the system as a whole and look for driving variables -- usually things outside of your control, such as user demand.  Following variables move in response to driving variables; examples include CPU usage, I/O rates, and network bandwidth.  Load and stress testing, along with data analysis, can help correlate following variables to driving variables.  You can look at the system as a whole and run your driving-variable/following-variable analysis at each layer.  The constraint will end up being the following variable that reaches its limit first.  Until the constraint fails, you will see a strong correlation between the driving variable and the constraint.  Once you identify the constraint, either increase the resource or decrease its usage.

Be aware that stability issues, such as Cascading Failures between layers, can be mistaken for capacity issues.

Successful systems will outgrow their current resources.  You can scale horizontally or vertically but you need to decide which is best for your system. 

Myths About Capacity
  • CPU Is Cheap: CPU cycles = clock time = latency = users waiting around.  Over time and billions of transactions, wasted time becomes wasted resources and money.  Adding CPUs can also get expensive, especially if you have to add a new chassis.
  • Storage Is Cheap: Storage is a service, not just a disk drive.  There are interconnects, backups, redundant copies, etc.  You also have to account for the number of servers involved in the scaling architecture -- 1TB times the number of nodes in the cluster, for example.  Local storage might cost $1/GB but managed storage might be $7/GB.  Know your numbers.
  • Bandwidth Is Cheap: Dedicated connection versus a burstable connection.  Just like with CPU and Storage, you have to account for multiplier effects.  The more junk in your web pages, the more you have to move over the network, process and pay for.
Tips:
  • always look for multiplier effects -- they will dominate your costs
  • understand the effects of one layer on another
  • improving non-constraint metrics will not improve capacity
  • try to do most of the work when nobody is waiting for it
  • place safety limits on everything - timeouts, memory, connections, etc.
  • protect request-handling threads
  • monitor capacity on a continual basis -- any change can affect scalability and performance, and changes in user demand change the workload.

Wednesday, April 21, 2010

Release It! - Chapter 7 Case Study: Trampled by Your Own Customers

In this chapter, the author describes a death-march web site project that crashed 25 minutes after going live.  The application was written to pass QA's tests, not to go into production.  There were 500 integration points, and every single configuration file was written for the integration environment, complete with its hostnames, ports and passwords.  Some code assumed QA's network topology rather than production's.  They tested the application the way it was meant to be used, not the way it would be used in the real world: shopbots, search engines and other non-human users had an impact on the system, and no safeties were built in to cut the system off from bad actors.  Over time, they adapted the system to meet demand using fewer resources than they initially went live with.  Getting there, however, was a long and painful process.

Tuesday, April 20, 2010

Release It! - Chapter 6.0 Stability Summary

The essence of keeping your systems up and running is to trust no one.  Even if you think there is a one-in-a-million chance of an API failing a particular way, given enough transactions you are going to see it happen.  It is your job to expect that your integration points will disappoint you and to create a system that can deal with that scenario.  If you do your job right, customers won't complain about downtime; they'll complain about something else.

If you think of your system architecture using the Hexagonal model, the Ports and Adapters are the points where you are most likely to encounter instability.  My incoming Ports should use Fail Fast, Handshaking and Timeouts to ensure I don't adversely affect callers into my system.  The inbound Adapters should probably use Circuit Breaker, Timeout and Handshaking to cope when the system gets sick.  My outbound Adapters should also use Circuit Breaker, Timeout and Handshaking to verify that the services I need, such as database access, are available and able to handle my requests.  My point is that the Hexagonal model seems like a natural way of looking at things if you want to build a system that can survive in the wild.

Monday, April 19, 2010

Release It! - Chapter 5.8 Decoupling Middleware

Middleware is software that integrates systems that were never meant to be integrated.  It integrates by passing data between two systems, but it also decouples the callers into the middleware from the details of the integrated systems.  Middleware implemented in terms of synchronous calls is simpler to write but has the problem of blocking systems as they wait for responses.  Asynchronous middleware is harder to write, but you don't have the "hurry up and wait" issues.  Middleware is usually expensive, and changing your mind from a synchronous model to an asynchronous one is costly.
  • decide at the last responsible moment - try the other stability patterns first.  Decoupling Middleware is an architectural decision that ripples to all parts of the system, and it is likely an irreversible one.
  • avoid many failure modes through total decoupling - the more you decouple servers, layers and applications the more adaptive your system will be.
  • learn many architectures, and choose among them -- find the best architecture for the problem at hand.

Sunday, April 18, 2010

Release It! - Chapter 5.7 Test Harness

It is really difficult to get Integration Points to fail in the ways you need them to when testing, especially when you are trying to test abnormal, should-never-happen conditions.  A Test Harness should be devious and attempt to leave scars on the calling system.  It substitutes for the services at any Integration Point.  The author provides a whole set of network-related scenarios that the Test Harness should support, such as setting up a socket connection but never sending any data.  A Test Harness runs as its own server and is free to do whatever it needs to in order to simulate the numerous ways a service can break.  Don't be tempted to add failure-simulation code to your application.  One idea is to have the Test Harness listen on different ports for different bad behavior -- port 80 might be the "send back one packet of data every 30 seconds" port while 22 might be the "accept connections but never reply" port.  This allows for reuse of a test harness.  You probably want your Test Harness to log what it is doing in case the app you are testing dies silently.  The Test Harness can be as flexible and pluggable as you want and is likely a good use of company resources.
  • emulate out-of-spec failures - try and fake all of the messy, real-world failures that probably haven't been accounted for in your APIs.
  • stress the caller - throw slow responses, no responses or garbage and see how your application behaves.
  • leverage shared harnesses for common failures -- you don't need a separate harness for each application because many of the failure modes you'll be testing apply to many types of applications.
  • supplement, don't replace, other testing methods - unit, integration and functional tests verify functional behavior.  The Test Harness verifies non-functional behavior while maintaining isolation from the remote systems.
I wonder if open source test harnesses exist?  It would be nice not to have to start from scratch, but then again, a harness is going to be specific to your application.
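
As a flavor of how little code a devious harness needs, here's a sketch of the "accept connections but never reply" behavior; the port number is arbitrary:

    import java.net.ServerSocket;
    import java.net.Socket;
    import java.util.ArrayList;
    import java.util.List;

    // One misbehavior per port: this one accepts the TCP connection and then
    // never sends a byte, so a caller without a read timeout blocks forever.
    public class DeadAirHarness {
        public static void main(String[] args) throws Exception {
            List<Socket> victims = new ArrayList<Socket>(); // keep sockets open
            ServerSocket server = new ServerSocket(10042);  // arbitrary port
            System.out.println("dead-air harness on port " + server.getLocalPort());
            while (true) {
                Socket victim = server.accept();
                System.out.println("hooked " + victim.getRemoteSocketAddress());
                victims.add(victim); // never read, never write, never close
            }
        }
    }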

Saturday, April 17, 2010

Release It! - Chapter 5.6 Handshaking

Handshaking is signaling between devices that regulates communication between them.  Handshaking is found in most low-level protocols but not in the higher-level ones; HTTP has a variety of status codes, but most developers just look for 200 OK.  Handshaking is about letting the server protect itself by throttling its own workload -- the server should have a way to reject work if it cannot handle it.  You can achieve this by combining load balancers and web servers, where the web server returns a 503 NOT AVAILABLE response when the load balancer makes its "are you still alive" requests.  In SOA, you might want to provide a "health check" service that clients can call before trying the real service; you get good handshaking but double the number of calls.  Handshaking can help when Unbalanced Capacities trigger Slow Responses because the server can tell the client that it can't meet its SLA and that the client should back off.  If you can't use Handshaking, then Circuit Breaker might work.  Your best bet is to build handshaking into any custom protocols that you invent.  A health-check sketch follows the tips below.
  • create cooperative demand control - both the client and server sides need to understand and respect the Handshaking if it is to work
  • consider health checks - if you can't tweak the protocol then maybe you can add a health check service.  You need to weigh the cost of the extra calls against the service failing.
  • build Handshaking into your own low-level protocols - if you make your own socket protocol, make sure to add Handshaking into it so the endpoints can notify others when they can't accept more work.
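
A sketch of the health-check flavor of handshaking: the load balancer (or a client) probes this endpoint, and the server sheds load by answering 503 when it judges itself too busy.  The isHealthy() logic is a placeholder -- real checks would look at pools, queue depths, breaker states, and so on:

    import java.io.IOException;
    import javax.servlet.http.HttpServlet;
    import javax.servlet.http.HttpServletRequest;
    import javax.servlet.http.HttpServletResponse;

    public class HealthCheckServlet extends HttpServlet {
        @Override
        protected void doGet(HttpServletRequest req, HttpServletResponse resp)
                throws IOException {
            if (isHealthy()) {
                resp.setStatus(HttpServletResponse.SC_OK);
                resp.getWriter().write("OK");
            } else {
                // Tells the load balancer to stop sending work our way.
                resp.sendError(HttpServletResponse.SC_SERVICE_UNAVAILABLE);
            }
        }

        private boolean isHealthy() {
            // Placeholder heuristic only.
            return Runtime.getRuntime().freeMemory() > 32L * 1024 * 1024;
        }
    }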

Friday, April 16, 2010

Release It! - Chapter 5.5 Fail Fast

Waiting around for a slow failure response is a huge waste of time.  If the system can predict that an operation will fail, it is better to tell the caller now so its resources don't get tied up.  A load balancer knows whether servers are available, so configure it to return immediately with a resource-unavailable error rather than queueing the request and waiting around for a server to free up.  Services can check the state of resource pools or Circuit Breakers prior to use and Fail Fast if the call would fail anyway.  Check for resource availability prior to the start of a transaction.  Very basic parameter checking in a servlet can be useful if it avoids pulling in resources for a transaction that is just going to fail with validation errors.  Report system failures (resources unavailable) differently from application failures (invalid formatting of a date).  You don't want to trip a Circuit Breaker because a user entered bad data multiple times, but you do if there is no disk space left.  A short sketch follows the tips below.
  • avoid Slow Responses and Fail Fast - if your system can't meet its SLA, let callers know right away rather than waiting for a timeout.
  • reserve resources, verify Integration Points early - try to allocate and verify important resources prior to doing any work.  Grab that huge buffer you need and verify all the Circuit Breakers are reporting ok.
  • user input validation - do basic user input validation prior to reserving resources.  Don't bother checking out a database connection just to find out a required parameter is missing from the call.
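
A short sketch of the advice above: validate input before reserving anything, then verify resources before starting work.  The semaphore stands in for a resource pool and the flag for a Circuit Breaker state, both hypothetical:

    import java.util.concurrent.Semaphore;
    import java.util.concurrent.atomic.AtomicBoolean;

    public class FailFastGate {
        private final Semaphore dbConnections = new Semaphore(20); // stand-in pool
        private final AtomicBoolean breakerOpen = new AtomicBoolean(false);

        public void handle(Request request) {
            // 1. Cheap validation first: reserve nothing for doomed work.
            if (request == null || request.accountId() == null) {
                throw new IllegalArgumentException("accountId is required"); // application failure
            }
            // 2. Verify resources before starting the transaction.
            if (breakerOpen.get() || !dbConnections.tryAcquire()) {
                throw new IllegalStateException("resources unavailable");    // system failure
            }
            try {
                process(request);
            } finally {
                dbConnections.release();
            }
        }

        interface Request { String accountId(); }

        private void process(Request r) { /* the real work */ }
    }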

Thursday, April 15, 2010

Release It! - Chapter 5.4 Steady State

Every time a human touches a server is an opportunity for a mistake.  Try to keep people off of the production system by striving to make the system run without human intervention.  A primary reason humans log into a system is to purge unwanted resources, such as log files or history tables, so automate the purge process.  For every mechanism that accumulates a resource, some other mechanism must recycle that resource.  Common types of sludge that can build up:
  • Old Data - cleaning out obsolete data from db tables is a good thing but requires careful attention to make sure that integrity is maintained.
  • Log Files - can fill up disks and are mostly useless.  Don't leave log files on production; copy them somewhere else for analysis.  Use a RollingFile appender and rotate the logs by size.  Find a way to purge logs or they are sure to be the cause of a support call.
  • In-Memory Caching - make sure to use some form of cache invalidation. Memory caches lead to memory leaks which lead to crashes.
Tips:

  • avoid fiddling - human intervention leads to problems so eliminate the need for recurring human intervention.
  • purge data with application logic - letting a DBA write your purge scripts puts the app at risk because they don't know your ORM tool or your application logic.  It is usually better to do it yourself (a batch-delete sketch follows this list).
  • limit caching - cap the amount of RAM a cache can consume
  • roll the logs - keep a limited number of logs and rotate them based on size.  Any logs that need to be retained should be copied off of the server.
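
And a sketch of the purge tip; the table and column names are invented, and the LIMIT clause is MySQL-flavored (other databases spell batch deletes differently).  Something like this would run from a scheduler, not a human at a console:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.SQLException;
    import java.sql.Timestamp;

    public class NightlyPurge {
        // Delete in small batches so the purge never holds long locks.
        public int purgeOldAuditRows(Connection conn, Timestamp cutoff) throws SQLException {
            PreparedStatement stmt = conn.prepareStatement(
                    "DELETE FROM audit_log WHERE created_at < ? LIMIT 1000"); // invented schema
            stmt.setTimestamp(1, cutoff);
            int total = 0;
            int deleted;
            do {
                deleted = stmt.executeUpdate();
                total += deleted;
            } while (deleted > 0); // keep nibbling until nothing is left
            stmt.close();
            return total;
        }
    }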

Wednesday, April 14, 2010

Release It! - Chapter 5.3 Bulkheads

On ships, bulkheads partition the hull into sections that can be sealed off from the rest of the ship, so the ship can stay afloat even with a hull breach.  The same idea can be applied to software: stay afloat even if part of the system has been damaged.  In software this is done via redundancy -- multiple instances of an application server running on multiple pieces of hardware.  You can also partition your system by function -- one set of servers for flight check-in and another set to purchase tickets.  Scheduled maintenance might be another reason to use Bulkheads, since you can selectively turn off and update discrete portions of the system and still process transactions.  Virtualization is a tool you can use to partition your system and still allow for the ebb and flow of demand; some companies are using Amazon's EC2 to handle seasonal traffic, essentially renting resources just to cover their temporary needs.  Examining the business cost of a down piece of system functionality can help guide where Bulkheads make sense.  Redundancy has its costs, so only pay to Bulkhead what is really important to the business.  You can also consider a CPU Bulkhead where specific threads are bound to specific CPUs.  That way, if a bad piece of code pegs a CPU, other CPUs might still be available to do work because they have been targeted with a different workload.
  • save part of the ship - Bulkheads partition capacity as a way to preserve partial system functionality when bad things happen.
  • decide whether to accept less efficient use of resources - partitioned systems need more reserved, but probably unused, capacity.  If everything is pooled together, you might need less total reserved capacity.
  • pick a useful granularity - you can partition thread pools in an application, CPUs in a server, or servers in a cluster (a thread-pool sketch follows this list).
  • very important with shared services models - if you are a SOA provider and your services go down, Chain Reactions will occur and things will come to a halt.  Use Bulkheads to reduce the issue.
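
A sketch of the thread-pool granularity; the two functions and the pool sizes are invented.  The point is that a flood of one kind of work exhausts its own bulkhead without starving the other:

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;

    public class Bulkheads {
        // Separate fixed-size pools: ticket purchases can exhaust their own
        // threads without touching the ones reserved for check-in.
        private final ExecutorService checkInPool  = Executors.newFixedThreadPool(20);
        private final ExecutorService purchasePool = Executors.newFixedThreadPool(30);

        public void submitCheckIn(Runnable work)  { checkInPool.submit(work); }
        public void submitPurchase(Runnable work) { purchasePool.submit(work); }
    }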

Tuesday, April 13, 2010

Release It! - Chapter 5.2 Circuit Breaker

A software Circuit Breaker acts like an electrical one: if the system the Circuit Breaker bridges gets sick, the breaker opens and interaction with the sick system is prohibited.  As the Circuit Breaker is used, it keeps track of failures to the bridged system.  If a failure threshold is reached, the Circuit Breaker opens and access to the faulty system is prevented.  Software Circuit Breakers, unlike electrical ones, can be configured to retry a call to the sick system to see if it has recovered.  If it has, the Circuit Breaker is closed and traffic flows normally.  If not, the Circuit Breaker remains open until it is time to attempt another check on the sick system.  It is a good idea to throw an exception from the Circuit Breaker that lets the caller know the failure is due to the breaker tripping, giving the caller the opportunity to apply different logic in that scenario.  Tripped Circuit Breakers will degrade your system, so it is important to discuss what should be done in that scenario.  Operations will surely want to know when a breaker is tripped, so make sure that state changes are logged and targeted to them.  You should probably also provide a way to query or monitor the breaker's state.  Keeping track of tripped breakers is a good way to monitor changes over time at an Integration Point, and you have some ammunition with a vendor if you can cite specific data points.  It is also useful to allow a manual way to trip or reset a Circuit Breaker.  Circuit Breakers guard against Integration Points, Cascading Failures, Unbalanced Capacities and Slow Responses.
  • don't do it if it hurts - when an Integration Point becomes problematic, stop calling it.
  • use together with Timeouts - using a Timeout helps to identify a problem with an Integration Point.  That information can be used to trigger a Circuit Breaker.
  • expose, track and report state changes - tripping a Circuit Breaker always indicates a serious problem and should be visible to operations. Circuit Breaker activity should be reported, trended and correlated.
I think that is a great idea and one I had never considered before.  I'm thinking that Spring managed Circuit Breakers can be monitored and controlled via JMX, giving the operations folks the visibility they need.
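
For the record, here's a toy version of the state machine (the threshold and retry window are invented numbers).  A production breaker would add the logging, trending and JMX hooks discussed above:

    import java.util.concurrent.Callable;
    import java.util.concurrent.atomic.AtomicInteger;

    public class CircuitBreaker {
        private static final int FAILURE_THRESHOLD = 5;    // invented
        private static final long RETRY_AFTER_MS = 30000L; // invented

        private final AtomicInteger failures = new AtomicInteger();
        private volatile long openedAt = 0; // 0 == closed

        public <T> T call(Callable<T> protectedCall) throws Exception {
            if (openedAt != 0 && System.currentTimeMillis() - openedAt < RETRY_AFTER_MS) {
                // Distinct exception type so callers can apply different logic.
                throw new CircuitBreakerOpenException();
            }
            try {
                T result = protectedCall.call(); // normal call, or half-open probe
                failures.set(0);                 // success closes the breaker
                openedAt = 0;
                return result;
            } catch (Exception e) {
                if (failures.incrementAndGet() >= FAILURE_THRESHOLD) {
                    openedAt = System.currentTimeMillis(); // trip -- report this!
                }
                throw e;
            }
        }

        public static class CircuitBreakerOpenException extends Exception {}
    }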

Monday, April 12, 2010

Release It! - Chapter 5.1 Use Timeouts

This is the chapter I've been really looking forward to.  We've seen some of the ways you can hose a system from a stability standpoint; now let's see how we can remedy some of those situations.  This is a long chapter, so each pattern will be broken out into a separate post.

Modern systems rely heavily on the network, and networks break.  Waiting for an answer that is never going to come is not a wise move.  I like this tagline: "Hope is not a design method."  Make sure your code doesn't wait around forever for an answer to its request.  Ensure that any resource pool implementation that blocks a thread until a resource is available has a timeout enabled.  In Java, always use the forms of the concurrency APIs that take a timeout, never the no-arg ones.  Creating reusable code that deals with the sticky issues around thread blocking and timeouts is desirable, not to mention good programming; that way, a particular set of thread interactions is understood and shared throughout the system.  Use QueryObject and Gateway to encapsulate database access logic, making it easier to apply Circuit Breaker.  Some code attempts to retry immediately after a failure but, generally speaking, that is not a wise thing to do.  Networks and servers don't heal quickly, and making a client wait is usually not a good thing.  A better tactic is to return a result, which might be an error code or an indicator that you've queued up the request for retry at a future time; making the client wait will likely cause a cascading failure as its callers sit around waiting for their answers.  Store-and-Forward is generally a robust solution to timeouts, but each application has its own definition of "fast enough" which you need to account for.  Timeouts and Circuit Breakers are a good combination because the Circuit Breaker can trip if timeouts become the norm instead of the exception.  Timeouts coupled with Fail Fast are another common combination: Timeout protects you against somebody else's failure, while Fail Fast is used to report to your callers why you can't complete their request.  Timeouts can also play a role in Unbounded Result Sets, in that it might take too much time to load those million records you accidentally asked for.
  • apply Timeout to Integration Points, Blocked Threads, and Slow Response to avert Cascading Failures
  • apply Timeout as a way to recover from unexpected failures.  Sometimes you can't know the precise cause of the failure, but you need to give up and move on.
  • consider delayed retries.  Immediate retries are likely to fail and end up delaying the layer calling you.  Queueing up the work and trying again later is usually a better alternative.
In the past, I've written utility objects intended for use throughout the system but I never crafted them with an eye towards system stability.  It makes sense to encapsulate all the gory details around network timeouts and retries into a single place.  I guess this is another reason to try and keep your code DRY.  I'll be interested to see what sort of Java idioms emerge from the combination of Timeouts and Circuit Breaker.
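
Here's the "always use the timeout form" advice in concrete terms; the two-second budget is an invented SLA and remoteQuoteService is a stand-in for a network call:

    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.TimeoutException;

    public class TimedCall {
        private final ExecutorService executor = Executors.newFixedThreadPool(10);

        public String fetchQuote(final String symbol) throws Exception {
            Future<String> future = executor.submit(new Callable<String>() {
                public String call() throws Exception {
                    return remoteQuoteService(symbol); // hypothetical network call
                }
            });
            try {
                // Bounded wait -- never the no-arg future.get().
                return future.get(2, TimeUnit.SECONDS);
            } catch (TimeoutException e) {
                future.cancel(true); // interrupt the hung worker
                // Report an error or queue for retry instead of blocking callers.
                throw new Exception("quote service did not answer within SLA", e);
            }
        }

        private String remoteQuoteService(String symbol) throws Exception {
            return "42.00"; // placeholder
        }
    }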

Sunday, April 11, 2010

Release It! - Chapter 4.11 Unbounded Result Sets

Unless you explicitly limit the size of your result sets, you will run out of memory at some point.  QA tends to have much smaller data sets than production, which makes finding issues around large result sets difficult.  A good API will allow for limiting of result sizes.  Objects that contain collections of other objects are a typical place where large result sets can overwhelm the system.  There are ways to get a database to limit results, but they are less than perfect and vendor specific.  Unbounded Result Sets are a common cause of slow responses.  A JDBC example follows the tips below.

  • use realistic data volumes -- use production sized data sets in QA.  What happens when a million rows are returned?
  • don't rely on data producers -- make sure you limit the results and don't rely on the producers of the data to always keep a reasonably sized data set.  Things will change and you will break.
  • put limits into other application-level protocols - web service, RMI and XML-RPC calls are all vulnerable to returning huge collections of objects, which consume tons of memory.
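
In JDBC, the limiting API is right on Statement; the 1,000-row cap and the schema are invented for the example:

    import java.sql.Connection;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;
    import java.sql.SQLException;

    public class BoundedQuery {
        public int processOrders(Connection conn, long customerId) throws SQLException {
            PreparedStatement stmt = conn.prepareStatement(
                    "SELECT id FROM orders WHERE customer_id = ?"); // invented schema
            stmt.setLong(1, customerId);
            stmt.setMaxRows(1000);   // hard cap: the driver silently drops the rest
            stmt.setFetchSize(100);  // stream in chunks instead of all at once
            ResultSet rs = stmt.executeQuery();
            int n = 0;
            while (rs.next()) {
                n++; // process each row
            }
            rs.close();
            stmt.close();
            return n;
        }
    }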

Sunday, April 4, 2010

Release It! - Chapter 4.10 SLA Inversion

The best SLA you can hope to achieve is the worst SLA that any of your dependent services provides -- DNS, credit card processing, etc.  A chain is only as strong as its weakest link, and the number of dependent services also affects the SLA calculation (a worked example follows the tips below).  SLA Inversion is when a system that must provide high availability depends on systems of lower availability.  One solution is to decouple from the services such that your system can continue to function despite the loss of a dependent service; try Decoupling Middleware or Circuit Breakers.  Another solution is to carefully craft your SLAs.  Do not promise a blanket 99.9% uptime -- instead, break it down by specific services.  That way, you can provide higher SLAs for services that do not depend on other services and lower SLAs for those that do.
  • don't make empty promises - if you have an SLA inversion then you cannot exceed what your dependencies support
  • examine every dependency - look in unexpected places, like network infrastructure such as DNS, SMTP, message queues, etc.
  • decouple SLAs - maintain your service even when your dependencies go down.  If you fail whenever they do, your availability will always be less than theirs.
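
To make the math concrete, a quick worked example with made-up numbers: a service with three hard dependencies, each independently 99.5% available, can promise at best about

    0.995 x 0.995 x 0.995 = 0.985, i.e. roughly 98.5% availability,

before counting its own failures.  Promising 99.9% on top of those dependencies is exactly the kind of empty promise the book warns about.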

Saturday, April 3, 2010

Release It! - Chapter 4.9 Slow Responses

Slow Responses are harmful because they tie up resources on both the front-end and the back-end and are usually the result of high demand on the system. Memory leaks can also cause slow responses as the JVM works harder to manage memory. WANs can also cause slow responses due to network congestion.  Typically, hand-rolled low-level socket code is responsible for slow responses, so be careful when dealing with socket code.  Slow responses tend to result in a Cascading Failure as the effects are felt between layers.  Having your system monitor its performance can give you a heads up when SLAs aren't being met.

  • slow response times trigger Cascading Failures as the slowness is felt upstream
  • slow responses cause more traffic on websites because frustrated users pound on the reload button
  • consider Fail Fast - in a system that monitors itself, you might elect to send an error response instead of letting things slow down once a certain threshold is crossed.
  • hunt for memory leaks or resource contention - waiting for database connections causes slow responses.  Slow responses aggravate contention because more threads are vying for resources that are being released at a slower rate.  Memory leaks eat up CPU cycles as the JVM works to manage memory.
  • inefficient low-level protocol implementations can cause network stalls, resulting in slow responses.  Consider using a proven library before rolling your own.

Friday, April 2, 2010

Release It! - Chapter 4.8 Unbalanced Capacities

If one layer has more capacity than a layer it calls into, it can overwhelm it.  It is probably impractical to have each layer supplied with similar capacities but you can make some design decisions to help deal with the situation.  The front-end can use Circuit Breaker to relieve pressure on the back-end when things get slow.  The back-end can use Handshaking to notify the front end to throttle back its requests.  You can also use Bulkhead to reserve capacity on the back-end to deal with other transaction types.  QA usually doesn't have the budget for lots of servers so Unbalanced Capacities is usually not detected during normal testing.  Using a test harness that can mimic a back-end system struggling under load can help to verify that the front-end system can degrade gracefully.  On the back-end try and do some analysis to see exactly how unbalanced things might get.  Stress test your back end with loads that approach the front-end's maximum capacity -- and pick an expensive transaction. The back-end should slow down but should be able to recover once the load returns to normal.
  • examine server and thread counts - check the ratio of front-end to back-end servers, along with the number of threads each side can handle, in production vs. QA.
  • stress both sides of an interface - make sure the back-end can handle sudden bursts of traffic and make sure the front-end can deal with slow or dropped calls.

Thursday, April 1, 2010

Release It! - Chapter 4.7 Scaling Effects

Point-to-Point communication is a primary area where you will see scaling effects.  The number of connections grows with the square of the number of instances (n servers need n(n-1)/2 connections), and that number gets big quickly.  Testing these failure modes at production scale is next to impossible.  Alternatives you can try:

  • UDP broadcasts - inefficient, since the whole network hears all messages
  • UDP multicasting - more efficient because only interested parties get messages
  • publish/subscribe messaging - more infrastructure needed
  • message queues - more infrastructure needed
Do the simplest thing that will work.  

Shared resources are another area where Scaling Effects come into play.  If the service is redundant and non-exclusive, then you are okay -- just add more servers if needed.  Exclusive access is problematic: request queues back up waiting for their turn, and the situation gets worse as more transactions are attempted.  Eventually the backlog fills up, requests are dropped at the TCP/IP layer, and then things get really ugly.  Shared-nothing architectures make it more difficult to fail over -- somebody has to migrate the user's session to another server, which is likely a shared resource.  One compromise is to reduce the fan-in (the number of servers calling into a shared resource), perhaps having servers pair up for fault tolerance instead of everybody knowing about everyone else.
  • pay attention to the differences in QA and Production environments - things work fine on a small scale but melt in a large one
  • watch out for point-to-point communications - it scales badly but might work if you know the number of servers will remain small.
  • watch out for shared resources - they bottleneck, restrain capacity and are a stability threat.  Stress test shared resources heavily and make sure clients will keep working if the resource slows or fails altogether.