Modern programming languages can be used in large systems partly because of their ability to deal with concurrency. Unfortunately, concurrency also provides another way for a system to fail: Navel Gazing. Navel Gazing describes the situation where all the threads are sitting around waiting for some impossible event, which means that even though the runtime hasn't crashed, your system isn't doing any work. There are four major issues around the problem:
- error conditions and exceptions create too many possible paths to test
- unexpected interactions can introduce problems in previously safe code
- timing is crucial: the problem only manifests when threads interact in a particular order, so you usually see it under high load, when concurrent requests are more likely
- developers rarely, if ever, test their code against 10,000 concurrent users
From the business perspective, if the system can't generate revenue then the system is dead; there is no arguing over whether it has crashed, hung or is just having a bad day. Supplement internal monitoring (log files, port monitors, etc.) with external monitoring: a client program sitting outside the data center running synthetic transactions, an analog for a real live user.
The upshot is that it is difficult to get concurrent code right. Your best shot is to use carefully crafted code, preferably proven libraries, especially the concurrency objects in Java 5 and above (the java.util.concurrent package). Rolling your own pooling or caching library is expensive to get right, so don't do it; stand on the shoulders of giants instead. Try to avoid synchronizing methods on your domain objects, which is a bad design smell. Instead, ensure that each thread gets its own copy of the object. This helps to ensure that your code will work in a clustered environment and reduces collisions between threads, which improves throughput.
Thread blocking issues are hard to spot. Third-party client libraries are notorious for causing blocking issues and are typically opaque. Write learning tests that try to break the library in various ways: tie up connections, run it concurrently on lots of threads, reduce the amount of available memory, etc. Protect your code calling the library. Use timeouts if available. If the API doesn't provide timeouts natively, run the API calls on a thread pool so that you can enforce timeouts yourself. Be aware that if you start using worker threads to run API calls, you'll need a firm grasp of concurrency in your programming language. Try to beat up the vendor for a fix before insulating your Integration Point with worker threads. Blocked or slow responding threads typically appear around Integration Points and can form a feedback loop that can quickly result in a cascading failure.
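The advice about running a library call on a thread pool so that you control the timeout can be sketched roughly as follows. This is a minimal illustration, not a definitive implementation: the class name BoundedCall, the pool size of 4, and the fallback value are all assumptions made for the example, and the lambdas in main stand in for a real vendor API call.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical wrapper: runs an opaque library call on a worker thread
// so the caller never waits longer than its own timeout.
class BoundedCall {
    // A small fixed pool (size is an assumption) caps how many threads
    // a misbehaving library can tie up.
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    /** Returns the call's result, or fallback if it takes longer than timeoutMillis. */
    <T> T call(Callable<T> vendorCall, long timeoutMillis, T fallback)
            throws ExecutionException, InterruptedException {
        Future<T> future = pool.submit(vendorCall);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // Interrupt the worker. A library blocked in non-interruptible
            // code may ignore this, which is why the pool is bounded.
            future.cancel(true);
            return fallback;
        }
    }

    void shutdown() {
        pool.shutdownNow();
    }

    public static void main(String[] args) throws Exception {
        BoundedCall guard = new BoundedCall();
        try {
            // A fast call returns its own result.
            String fast = guard.call(() -> "ok", 500, "fallback");
            // A slow call (simulated with sleep) is abandoned after 100 ms.
            String slow = guard.call(() -> {
                Thread.sleep(5_000);
                return "late";
            }, 100, "fallback");
            System.out.println(fast + " " + slow); // prints "ok fallback"
        } finally {
            guard.shutdown();
        }
    }
}
```

Note the trade-off: the caller is protected, but a worker that ignores interruption stays stuck, so a bounded pool can itself fill up. That is one reason to push the vendor for native timeouts first.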
Blocked threads are the cause of a high proportion of system failures. Scrutinize resource pools and make sure they are configured for concurrent access. Blocked database connection pools can lead to blocked threads, incorrect exception handling and cascading failures. Use timeouts so no thread waits forever. Don't use the no-arg wait() method; use the form that accepts a timeout instead. Use proven libraries: writing correct concurrent code is hard, so leverage the work others have done before you. Beware of code you cannot see. Test and review third-party client libraries because they will fail, and it is best if you have an idea of how.
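As an illustration of preferring the timed form of wait(), here is a small sketch of a condition that threads wait on with a deadline. The class TimedGate and its method names are hypothetical, invented for this example. In real code one of the java.util.concurrent classes (for instance a CountDownLatch with its timed await) would usually be the better choice, in line with the advice to use proven libraries.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical example class: a gate that threads wait on, with a deadline.
class TimedGate {
    private final Object lock = new Object();
    private boolean open = false;

    /** Waits up to timeoutMillis for the gate to open; returns false on timeout. */
    boolean awaitOpen(long timeoutMillis) throws InterruptedException {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        synchronized (lock) {
            // Loop to re-check the condition after every wakeup,
            // since wait() can return spuriously.
            while (!open) {
                long remainingNanos = deadline - System.nanoTime();
                if (remainingNanos <= 0) {
                    return false; // give up instead of blocking forever
                }
                // wait(0) means "wait forever", so never pass less than 1 ms.
                long millis = Math.max(1, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
                lock.wait(millis);
            }
            return true;
        }
    }

    /** Opens the gate and wakes every waiting thread. */
    void open() {
        synchronized (lock) {
            open = true;
            lock.notifyAll();
        }
    }

    public static void main(String[] args) throws InterruptedException {
        TimedGate gate = new TimedGate();
        System.out.println(gate.awaitOpen(50)); // nobody opens the gate: prints "false"
        gate.open();
        System.out.println(gate.awaitOpen(50)); // already open: prints "true"
    }
}
```

The caller gets a boolean back and has to decide what a timeout means (retry, fail the request, report an error), which is the point: the thread gets a chance to do something other than wait.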