Search This Blog

Wednesday, March 31, 2010

Release It! - Chapter 4.6 Attacks of Self-Denial

A Self-Denial Attack describes a situation where the system, or its users, conspire against it.  The "select offer" sent to 10,000 users is likely to reach 1 million users due to network effects, such as people tweeting a discount code while trying to score that $100 iPad.  Deep links that bypass the front layer of your web site also contribute to Self-Denial, because the deeper links aren't likely to have any special provision for extra traffic the way a special landing page might.  Sites go dark because they can't handle the unexpected load. In a horizontal architecture, it is possible that a single rogue server can bring down an entire layer.  A single server that manages cache coherency, or a lone lock manager, is an example of where the failure of one machine can hose the entire layer.  Strive for a "shared-nothing" architecture.  If that is not possible, look to Decoupling Middleware or hot standby techniques.  Design in a fallback protocol to deal with the situation where the shared resource is not available.  Dedicate a portion of your system to a new promotion so that if it melts down, the remaining portion of the system can still function. Remember to fail fast when something goes dark so that calling layers don't hang waiting for an answer that is never going to come.  Communication is a prime defense against a coming surge of transactions based on a promotion or campaign.

Keep lines of communication open.  Since Self-Denial Attacks originate from within the organization, you can't account for campaigns unless you know about them.  Never use deep links; use a landing page instead, which can direct traffic as needed.  Watch out for embedded session ids in URLs, which can strain the system as it tries to manage all those sessions. Protect shared resources, and figure out how not to make them a single point of failure.  Expect rapid redistribution of any cool or valuable offer -- retweets and bargain-hunting sites can multiply the expected traffic.
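The fallback idea above can be sketched in a few lines of Java. This is just an illustration of the principle, not code from the book, and all the names are made up: try the shared resource first, and if it is dark, fail over quickly to a degraded-mode answer instead of hanging the caller.

```java
import java.util.concurrent.Callable;

public class WithFallback {
    // Try the shared resource first; if it is unavailable, fail over to
    // a degraded-mode answer instead of hanging the caller.
    public static <T> T call(Callable<T> primary, Callable<T> fallback) {
        try {
            return primary.call();
        } catch (Exception resourceUnavailable) {
            try {
                return fallback.call();
            } catch (Exception e) {
                throw new IllegalStateException("primary and fallback both failed", e);
            }
        }
    }
}
```

For a promotion, the fallback might serve a static "offer details coming soon" page so the rest of the site keeps breathing while the promotion partition melts down.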

Tuesday, March 30, 2010

Release It! - Chapter 4.5 Blocked Threads

Modern programming languages can be used in large systems partly because of their ability to deal with concurrency.  Concurrency, unfortunately, provides another way for a system to fail: Navel Gazing.  Navel Gazing describes the state where all the threads are sitting around waiting for some impossible event, which means that even though the runtime hasn't crashed, your system isn't doing any work.  There are four major issues around the problem:
  • error conditions and exceptions create too many possible paths to test
  • unexpected interactions can introduce problems in previously safe code
  • timing around thread interactions is crucial to manifesting this type of problem, so you usually see it under high load, where concurrent requests are more likely
  • developers never test their code against 10,000 concurrent users
From the business perspective, if the system can't generate revenue then the system is dead -- there's no arguing over whether it has crashed, hung or is just having a bad day.  Supplement internal monitoring (log files, port monitors, etc.) with external monitoring (a client program sitting outside the data center running synthetic transactions -- an analog for a real live user).  The upshot is that it is difficult to get concurrent code right.  Your best shot is to use carefully crafted code, preferably proven libraries, especially the concurrency objects in Java 5 and above. Rolling your own pooling or caching library is expensive to get right, so don't do it -- stand on the shoulders of giants instead.  Try to avoid synchronizing methods on your domain objects -- it is a bad design smell. Instead, ensure that each thread gets its own copy of the object.  This helps to ensure that your code will work in a clustered environment and reduces collisions between threads, which improves throughput.  In short, thread blocking issues are hard to spot.  Third-party client libraries are notorious for causing blocking issues and are, typically, opaque.  Write learning tests that try to break the library in various ways.  Tie up connections, run it concurrently on lots of threads, reduce the amount of available memory, etc.  Protect your code that calls the library. Use timeouts if available.  Use a thread pool that runs the API calls so that you can control timeouts if the API doesn't provide them natively.  Be aware that if you start using worker threads to run API calls, you'll need a firm grasp of concurrency in your programming language.  Try to beat up the vendor before insulating your Integration Point with worker threads.  Blocked or slow-responding threads typically appear around Integration Points and can form a feedback loop that quickly results in a cascading failure.

Blocked threads are the cause of a high proportion of system failures.  Scrutinize resource pools and make sure they are configured for concurrent access. Blocked database connection pools can lead to blocked threads, incorrect exception handling and cascading failures.  Use timeouts so no thread waits forever.  Don't use the no-argument wait() method; use the form that accepts a timeout instead. Use proven libraries.  Writing correct concurrent code is hard, so leverage the work others have done before you. Beware of code you cannot see -- test and review third-party client libraries because they will fail, and it is best if you have an idea of how.
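The advice about wrapping an opaque library in your own thread pool so that you, not the vendor, control the timeout can be sketched like this (a minimal illustration using the Java 5 concurrency objects; the class and method names are made up):

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class GuardedCall {
    // Daemon threads, so a permanently stuck worker can't keep the JVM alive.
    private static final ExecutorService POOL = Executors.newFixedThreadPool(4, r -> {
        Thread t = new Thread(r);
        t.setDaemon(true);
        return t;
    });

    // Run an opaque third-party call on a worker thread so the caller
    // can enforce its own timeout, even if the library offers none.
    public static String callWithTimeout(Callable<String> vendorCall, long millis)
            throws Exception {
        Future<String> future = POOL.submit(vendorCall);
        try {
            return future.get(millis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the stuck worker
            throw new Exception("integration point timed out");
        }
    }

    public static void main(String[] args) throws Exception {
        // A fast call succeeds; a hung call fails fast instead of
        // blocking the request thread for minutes.
        System.out.println(callWithTimeout(() -> "ok", 1000));
        try {
            callWithTimeout(() -> { Thread.sleep(60_000); return "never"; }, 200);
        } catch (Exception e) {
            System.out.println("caught: " + e.getMessage());
        }
    }
}
```

Note the caveat from the chapter still applies: once worker threads are in play, you own all the concurrency headaches that come with them.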

Monday, March 29, 2010

Release It! - Chapter 4.4 Users

Every user consumes system resources, so you need to know how your system reacts to abnormally high demand.  Try to keep user sessions as small as possible.  It is probably better to re-execute an operation than it is to cache data for 30 minutes that is never accessed and puts the system at risk.  Using Java's SoftReferences can be a good compromise but requires more complex coding.  Be aware that some users are more expensive than others because they do more -- like buy stuff.  There is no direct defense against expensive users, but you can test to ensure your system can handle double the current number of expensive users (which implies that you have a way of figuring out how many expensive users you have).  Web sessions are the weak spot of web applications, so be aware of them.
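A rough sketch of the SoftReference compromise mentioned above: the cached value can be reclaimed by the garbage collector under memory pressure, in which case we simply re-execute the lookup. The names here, including the loadFromDatabase stand-in, are illustrative, not from the book:

```java
import java.lang.ref.SoftReference;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class SoftCache {
    private final Map<String, SoftReference<Object>> cache = new ConcurrentHashMap<>();

    public void put(String key, Object value) {
        cache.put(key, new SoftReference<>(value));
    }

    // If the GC has reclaimed the entry (or it was never cached),
    // re-execute the lookup rather than risk the heap.
    public Object get(String key) {
        SoftReference<Object> ref = cache.get(key);
        Object value = (ref == null) ? null : ref.get();
        if (value == null) {
            value = loadFromDatabase(key);
            put(key, value);
        }
        return value;
    }

    // Stand-in for the real, re-executable operation.
    private Object loadFromDatabase(String key) {
        return "loaded:" + key;
    }
}
```

The key property is that the session never becomes the "database of record": everything in it can be purged and rebuilt.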

Some users are bad, either by accident or by design.  Some users may deviate from the expected workflow and stress your system.  Some users are actually programs, such as spiders or robots, that can stress your system.  Legitimate search engines will honor a robots.txt file, which allows you to control access to your site.  Others won't, so you have two choices: add firewall rules to prevent access from unwanted IP blocks, or create a "terms of service" agreement and sic lawyers on the offending parties.

Remember that users consume memory, so make sure that sessions are used as caches and not the "database of record", so that they may be safely purged when memory becomes scarce. Some users are weird and will do strange things which cannot be defended against.  Malicious users exist, and you can help your cause by knowing your network design and keeping all your software patched and up to date.  Users gang up on you when your site becomes the focal point of sudden interest, like when it is referenced by Slashdot.  Use special stress tests to ensure your system can handle the extra load.

Sunday, March 28, 2010

Release It! - Chapter 4.3 Cascading Failures

A Cascading Failure is when a crack in one layer triggers a crack in a calling layer.  The failure jumps between layers when bad behavior in the caller gets triggered by a problem in the called layer.  Resource pools often get exhausted in this scenario, and Integration Points without timeouts are a sure way to cause a Cascading Failure.  Cascading Failures are crack accelerators, so preventing them is very important.  Stop cracks from spanning layers and make sure calling layers can still function even after a lower layer goes dark.  Examine your resource pools to ensure they are safe.  Safe pools always limit the time a thread can wait to check out a resource.  Defend against Cascading Failures by using the Circuit Breaker and Timeout patterns.  Circuit Breakers prevent call-outs to sick layers, and Timeouts ensure that you can return from a call to a sick layer.
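A minimal Circuit Breaker sketch in Java, assuming a simple consecutive-failure threshold. This is an illustration of the idea, not the book's implementation, and the threshold/retry policy is deliberately naive:

```java
import java.util.concurrent.Callable;

public class CircuitBreaker {
    private final int threshold;         // consecutive failures before opening
    private final long retryAfterMillis; // how long to stay open
    private int failures = 0;
    private long openedAt = 0;

    public CircuitBreaker(int threshold, long retryAfterMillis) {
        this.threshold = threshold;
        this.retryAfterMillis = retryAfterMillis;
    }

    public synchronized <T> T call(Callable<T> target) throws Exception {
        if (failures >= threshold) {
            if (System.currentTimeMillis() - openedAt < retryAfterMillis) {
                // Open: fail fast instead of calling out to the sick layer.
                throw new IllegalStateException("circuit open: failing fast");
            }
            failures = threshold - 1; // half-open: permit one trial call
        }
        try {
            T result = target.call();
            failures = 0;             // success closes the circuit
            return result;
        } catch (Exception e) {
            failures++;
            if (failures >= threshold) {
                openedAt = System.currentTimeMillis();
            }
            throw e;
        }
    }
}
```

While the circuit is open, callers get an immediate error they can handle, instead of tying up a thread waiting on a layer that is already in trouble.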

Saturday, March 27, 2010

Release It! - Chapter 4.2 Chain Reactions

Horizontal scaling is scaling by adding additional servers.  Vertical scaling is scaling via larger boxes.  Horizontal scaling uses load balancers to provide fault tolerance.  Horizontally scaled systems can exhibit a failure mode known as a Chain Reaction.  The essence of the failure is that when a single node goes dark, the extra load on the remaining servers causes them to fail, typically due to a resource leak or a load-related bug.  Since all servers have the same bug, they will eventually fail in the same load-related way, forming a Chain Reaction.  As more and more servers fail, the remaining servers fail faster and faster until the entire layer is dead.  The only way to stop the cascade is to fix the leak.  You can use the Bulkhead pattern to group your servers into partitions, splitting one big chain reaction into separate, smaller ones, which might give you enough time to bring up another set of servers.  Accept this: one down server places the remaining servers in jeopardy.  A dead layer will then endanger the dependent layers.  Hunting for resource leaks is important because resource leaks are a primary killer of systems. You must also hunt for obscure timing bugs by load testing your system.  If you use the Bulkhead pattern on the server side and the Circuit Breaker pattern on the client side, you can prevent chain reactions from taking out entire layers.
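One way to build a client-side bulkhead is to cap the number of concurrent calls into a partition with a semaphore, so one sick partition can't soak up every request thread in the layer. A sketch of that idea, illustrative rather than from the book:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.Semaphore;

public class Bulkhead {
    private final Semaphore permits;

    public Bulkhead(int maxConcurrentCalls) {
        this.permits = new Semaphore(maxConcurrentCalls);
    }

    // Reject immediately when the partition is saturated, so a slow or
    // dying partition can't tie up the caller's entire thread pool.
    public <T> T call(Callable<T> target) throws Exception {
        if (!permits.tryAcquire()) {
            throw new IllegalStateException("bulkhead full: rejecting call");
        }
        try {
            return target.call();
        } finally {
            permits.release();
        }
    }
}
```

You would give each server group its own Bulkhead instance, so exhausting the permits for one partition leaves the others untouched.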

Friday, March 26, 2010

Release It! - Chapter 4.1 Integration Points

Chapter 4 is an interesting and very large chapter, so I'll be breaking my posts up into the chapter's sub-sections.  In this chapter, some common scenarios are identified and labeled as Antipatterns.  Most times a solution is also recommended, but we fall prey to a chicken-and-egg problem: some of the solution patterns called out haven't been described yet, so you have to guess at what they might mean and come back later to see if things make sense in the context of the Antipattern.

One factor that affects a system's ability to be effectively operated in the field is known as Interactive Complexity.  Interactive Complexity comes from systems that have enough moving parts and internal dependencies that operators cannot keep a complete or accurate mental model of the system.  Hidden linkages between components can lead to Problem Inflation, where an operator turns a dial thinking that it will solve a problem but instead, due to the operator's failure to fully understand all the relationships between components, makes the situation worse.  Tight coupling between parts of the system allows cracks to travel to other parts of the system, making diagnosis more difficult and increasing the effects of the crack. Several Antipatterns have been identified which can be used to spot situations that are likely to cause failures in the system.  Things will break, and it is up to you to identify failure points and plan for them.

Integration Points with other systems are the number one stability killer.  Most integrations make use of sockets, which can be the cause of many headaches if you don't understand some of the nuances and code defensively.  For example, the amount of time it takes to detect a connection timeout varies between OSes and is usually measured in minutes.  Blocking application threads for minutes at a time waiting for a connection failure is a sure way to cripple a system.  One idea is to run tcpdump on the production box and analyze the capture file with Wireshark on a non-production box; tcpdump is much lighter weight than Wireshark and captures the same information.  Keeping a copy of The TCP/IP Guide or TCP/IP Illustrated around can help guide you through network-related failures.  One thing to remember is that TCP/IP was created in a time before firewalls existed, so there are classes of failures around the interaction between the firewall and the TCP/IP stack.  Idle connections and firewalls are a bad mix.  HTTP client libraries can also be a problem because HTTP sits atop TCP/IP, which means it also has timeout-related issues.  Most HTTP libraries are not cynical and do not provide timeouts in their APIs, making it harder to code defensively.  The Apache HTTP Client library does support timeouts and is one API that can be used in cynical programs.  Vendor APIs, such as database drivers, are another source of integration pain.  Many times the server code is rock solid and cynical but the client-side APIs are not, which eventually causes your integration threads to block. Since you have no direct control over the API, all you can do is identify bugs and work with your vendors.  Patterns exist to help combat integration point failures, such as Circuit Breaker and Decoupling Middleware.
One tactic you can use is to write a test harness that verifies your software is cynical enough.  Make sure it can simulate integration and network failures.  You'll also want to run your tests under load, using JMeter or some other load tester.  Every integration point will fail.  Count on it and be prepared.  Failures never come in the form of a nicely formatted error message complete with a cause and solution.  Instead, you see slowness in responses, hanging, etc.  In order to debug what is going on, you'll likely have to peel away some abstractions and look deeply into the integration using decompilers and network sniffers.  One sad fact is that a remote system's failure will become your problem if you are not defensive enough.
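The defensive socket handling described above boils down to always setting your own connect and read timeouts instead of trusting the OS defaults, which can be minutes long. A Java sketch (the timeout values are arbitrary, not recommendations from the book):

```java
import java.net.InetSocketAddress;
import java.net.Socket;

public class CynicalConnect {
    // Open a socket with explicit timeouts so a dead or firewalled host
    // fails in seconds instead of whatever the OS default is.
    public static Socket open(String host, int port) throws Exception {
        Socket socket = new Socket();
        socket.connect(new InetSocketAddress(host, port), 2_000); // 2s to connect
        socket.setSoTimeout(5_000); // 5s max per blocking read
        return socket;
    }
}
```

The setSoTimeout() call matters as much as the connect timeout: it covers the firewall-dropped-my-idle-connection case, where the connection looks healthy but a read would otherwise block forever.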

Thursday, March 25, 2010

Release It! - Introducing Stability

Enterprise software must be cynical, meaning that it trusts no one, including itself.  Cynical software expects bad things to happen and has plans to deal with others failing to hold up their part of the bargain.

"A System is the complete interdependent set of hardware, applications and services required to process transactions for users." Translation: a system is everything needed to provide a working solution for the users, right down to the power supply and network cables.  A stable system is resilient to transient impulses, persistent stresses or component failures. In short,  the user can still get work done despite failures in some parts of the system.  A stable architecture costs the same to implement as an unstable one so doesn't it make sense to chose the former? An Impulse is a rapid shock to the system, such as a million hits to your website because of an article on Slashdot.   A Stress is force applied over an extended period of time, such as ever increasing response time from your database server or a credit card processor who can't handle your transaction load.  A Strain is a change of the shape of the system due to Stress, such as higher RAM usage due to the slow processing of an external service.

An enterprise system is supposed to run for a long time.  What is the working definition of "long time"?  How about the time between code deployments?  If you deploy new code once a year, then a long time for you is 12 months.

Longevity tests are typically not run, but skipping them guarantees that bugs will appear.  If possible, dedicate a machine that runs a load testing tool which hits the system with a medium level of transactions for an extended period of time.  Make sure to put in slack periods to simulate the slow stretches in the middle of the night.  If you can't set up an entire environment, test the important parts and stub the rest.  If you don't do your own longevity testing, your production system will become your longevity test, and you won't be happy.

Some component of the system will fail before everything else does, and that component is known as a Crack.  The original trigger, plus the way the Crack spreads to the rest of the system, coupled with the resulting damage, is known as a Failure Mode.  It is in your own best interest to try to identify the Failure Modes and create a system that can withstand them.

A Crackstopper is a software crumple zone designed to absorb the impact of failure and keep the rest of the system safe.  This allows you to decide what parts of the system are critical and keep cracks away from them.  A software shock absorber, if you will.  Cracks propagate, so you must do your best to prevent that.  Examples of possible Crackstoppers in the case study include TCP timeouts, partitioning servers better, using HTTP instead of RMI, or using an asynchronous messaging protocol to decouple things.

Each failure results in a chain of events, and each event in that chain can accelerate, slow or stop a crack.  Rule of thumb: tight coupling accelerates cracks, so reduce coupling to retard the spreading of a crack. Brute-force analysis of every resource call, I/O operation or external API call in order to identify potential failure scenarios is impractical.  Luckily, patterns do exist that can help prevent cracks from propagating, and they will be presented in future chapters.

Wednesday, March 24, 2010

Release It! - Case Study: The Exception That Grounded An Airline

This was an interesting chapter in that it described a real scenario that involved the author.  An airline experienced a 3-hour downtime of their check-in system that, due to the need to involve more humans to deal with the backlog, impacted most of the airline for almost 9 hours.  One interesting point was to note the downstream effects of the outage, which ranged from union-mandated overtime, to SLA contract breaches, to potential bonus money being lost.  The team was able to bring the system back, but people wanted to know what happened, so the author had to fly on-site to perform a post-mortem and determine what went wrong.  Due to strained relations between the software vendor and the end-user, source code was not made available, so decompilation of the Java byte code, coupled with thread dumps and logs, provided the answer: an EJB method had chosen not to deal with an SQLException when closing a connection.  This one decision caused all of the server's threads to hang waiting for a database connection that was never going to succeed.  The author's emphasis was not on better testing of the system in the hope of revealing that particular oversight but, instead, on the integration of the multiple systems.  His view is that the death of one of the systems should not halt the entire solution, and he promises to provide some guidance on how to keep a complex system healthy during times of stress.

Tuesday, March 23, 2010

Release It! - Introduction

The primary notion that software developers have to embrace is that architectural and design decisions are really financial decisions.  Any time a system is in an unusable state usually means lost revenue.  Software spends most of its life operating in a data center and not in development, so it makes sense to optimize for operations and not development.  Developers are often taught what a system should do, such as making sure an SSN is properly formatted, and not what a system should not do, like running out of memory on a constant basis and forcing a nightly reboot of the servers.  Most software is designed to pass QA's tests, not to survive in the wild for the next 10 years, which often leads to financial impacts.

The manufacturing domain has the notion of design for manufacturability (DFM) which is the idea of designing products that can be manufactured at low cost with high quality.  Software's idea of DFM might be called design for production.

Decisions made early in a system's development have the most impact on availability, capacity and flexibility.  Unfortunately, the early decisions are often the least informed.  The goal is to make software that is fast and cheap to build, good for users and cheap to operate.  In order to accomplish this, you need to use improved architectural and design techniques.  These techniques need to account for the fact that the scope of software has changed over the years: 25,000 concurrent users and 99.99% uptime on commodity hardware, instead of a couple hundred users running on mainframes.

Architectures can be split into two groups: Ivory Tower or Pragmatic.  Ivory Tower architectures focus on the end state of the system and are less worried about the messy details.  Ivory Tower architects tend not to listen to coders and probably don't listen to users either.  Ivory Tower architects tend to issue "one size fits all" edicts, such as "you should always code the presentation tier using JSF" or "EJBs should always be used for business logic".  

Pragmatic architectures tend to use lower level abstractions and focus on the dynamic nature of the system.  "How can we update the system without forcing a reboot?" or "How can we make sure the system runs in a lower bandwidth environment?"

Many systems are designed in an artificially clean environment and don't account for real-world settings.  Designers need to reorient their thinking: release 1.0 is not the end of the project but, instead, is the beginning of the software's long life.

Monday, March 22, 2010

Release It! - Preface

Michael T. Nygard is the author of Release It!: Design and Deploy Production-Ready Software, which looks to be an interesting software book.  As I read it, I plan to summarize my understanding of each chapter in the hopes that I'll better understand and retain the information.  This post covers the preface.

The preface gives us a "lay of the land" and sets expectations.  The book is aimed at anyone creating an enterprise-class application.  What is an enterprise-class application?  Any application where, if it goes down, the company loses money.  I suspect that Twitter, Facebook and many parts of Google would be considered enterprise-class.  If GMail stops working then Google can't serve its ads to the millions of people who use it.  Probably not good.  If you think about your own application, do you wish to avoid those late-night support calls?  Is it your hope that the operations people can properly deploy your application into their data centers and keep it tuned and humming?  If any of these apply to your application, perhaps this book can help you.  The author hopes that the reader will come to the realization that no matter how much you plan, bad things will happen to your software when it is released into the wild, and you need to account for those things.

The book is broken up into four sections, with each section prefaced by a real-world case study.  The names in the studies have been changed to protect the innocent, but the money lost and the pain that was felt remain.

Section one revolves around keeping the system alive.  Nobody is going to worry about paying for new features in your next release if they have to reboot the system every day just to keep it breathing.  The system must be stable.

Section two addresses capacity.  What does capacity mean?  How can it be optimized over time?  What are some design patterns and anti-patterns that affect capacity?

Section three discusses how to make your system more easily deployable into the data center.  Hardware, software and networking have changed over the years, and we need to be aware of those changes in order to make deployment more palatable to the operations folk.

Finally, section four is about obtaining intel from your running system.  How do you gather relevant data?  What do you do with that data?

Saturday, March 13, 2010

How do I upgrade the RAM in my Acer Aspire Revo?

I finally got around to throwing another 1GB stick of RAM into the machine I've been using as my media center.  The scariest part was trying to figure out how to crack open the case without breaking anything.  Luckily, netbooked on YouTube has posted a video on how to open the thing up.  Once you know the secret, it is pretty simple.  I cracked open the case to see exactly what type of RAM was already installed and then ordered a similar stick from Crucial.  The Revo can take up to 4 GB of RAM, but that means throwing away the existing stick and spending more money than I wanted.  Adding the 1GB stick only cost me $25 USD, so despite the fact that some people are of the opinion that the extra RAM is a waste of money, I went ahead and did it anyway.  With the additional RAM I was able to dedicate 512 MB to the video card.  I haven't played with the system enough to know if video performance has improved any, but I'll report any findings in a future post.  All in all, the upgrade was inexpensive and painless.

Friday, March 12, 2010

Is Adobe Flash as evil as Apple claims?

I was listening to This Week in Android, and they talked about a version of Android in development that runs, wait for it, Adobe Flash.  The application they've been showcasing is Farmville, which made me chuckle a bit.  Is Farmville a harmless time waster or digital crack that has the potential to turn the entire world into obsessed farming zombies?  I digress.  The interesting point in the discussion is that Flash did not consume resources, including battery life, nearly as much as Apple would have you believe.  It could be that the newer handsets have better chipsets -- they are putting 1GHz chips into phones now that are capable of running Quake 3 -- or that Android manages resources better than the iPhone OS.  In short, we should start to see if Adobe Flash is the mobile vampire that Apple claims it is.  My money says it isn't.

Monday, March 8, 2010

Why should Apple tell me what I can and cannot buy?

I don't develop mobile applications, but if I did I'd spend some serious time thinking before committing to the iPhone platform.  I know Apple is the market leader and their stuff, for the most part, just works, but I'm disliking Apple more and more each day.  Today, I learned that they are pulling network scanning applications from the app store.  Last month Apple pulled the boobie apps from the store.  Parents don't want their children downloading software with half-naked women onto their iPods, which I completely understand.  Today, they removed a type of tool that I personally use -- something that can detect if there is an open wifi hotspot in the area.  I'm sure Apple will issue a press release explaining why they did it and will apologize to any developers that may have been affected, but I'm no longer interested in hearing Apple's rhetoric.  As a developer, you have to pony up for an Apple laptop if you are going to develop on their platform -- period.  How would you feel if you bought one of their expensive computers just so you could create your own application in the hopes that you might make a few bucks, and Apple deemed your app inappropriate for their platform?  No warning, no guidelines, no apologies -- your application is no longer welcome in the app store.  As a developer I would be furious.  As a consumer, I'm not too happy either.  Apple is now telling me what I can and cannot purchase.  I don't like the Big Brother aspect of the company, and I'm very likely to buy an Android phone when I finally decide to buy a smartphone.  I'm hoping that Android continuously eats away at Apple's market share and forces them to be a little less controlling and let consumers make their own purchasing decisions.  Thanks for letting me blow off a little steam.  I feel better now.

Saturday, March 6, 2010

What does "Hello World" look like in Google App Engine?

Google's App Engine is a platform-as-a-service cloud offering that is free for most uses.  The documentation is pretty clear, but I found that some of the steps did not work out of the box for me.  For that reason, I'm going to take some time and document the steps I used to get a "Hello World" web application pushed out into Google's cloud.

Context: all I want to do for now is to be able to build and deploy the simplest web application possible.  I'm using version 1.3.1 of the App Engine SDK, JDK 1.6.0_17,  Ant 1.7.1, IntelliJ IDEA 9.0.1 Ultimate Edition and Ubuntu Linux 9.10.

Step 1: get yourself a Google App Engine (GAE) account and an application id.  These steps are nicely documented on the Google App Engine Experiments blog, and I suggest you follow the directions there for getting yourself set up.

Step 2: get a project template in place.  Luckily, the GAE SDK provides a nice template complete with "Hello World" code.  Copy the contents of appengine-java-sdk-1.3.1/demos/new_project_template to your working directory. 

Step 3: edit the build.xml file so that it points to your installation of the GAE SDK.  The Ant property you are looking for is appengine.sdk.

Step 4: edit src/WEB-INF/appengine-web.xml so that the application tag contains your application id.  If you miss this step, your application will not deploy.

Step 5: build the project.  All you do is type ant and wait a couple of seconds.  You'll see messages about classes being enhanced with DataNucleus, which is a framework that sits in front of various persistence engines.  I'm guessing GAE uses this to more easily support JPA and JDO.

Step 6: run a quick test of the web app.  To do this, we'll launch the bundled web container via ant runserver.  If everything is working, you should be able to point your browser at http://localhost:8080/ and see a nice little "Hello App Engine" message.  When you are done, ctrl-c the server to shut it down (I haven't figured out a nice way to shut it down yet).

Step 7: deploy the WAR into the cloud.  The Ant file has an update target, but that failed for me.  ant update resulted in "Your authentication credentials can't be found and may have expired. Please run appcfg directly from the command line to re-establish your credentials."  I found a thread that seems to indicate that the cookie used to store credentials expires after 24 hours.  This is what I did to make things work: run the appcfg script in /opt/appengine-java-sdk-1.3.1/bin/ with update www once, after which ant update worked for the rest of the day.  To verify that my app was uploaded, I visited the application's URL and was pleased to see the app running as expected.  IntelliJ IDEA has support for GAE, and I verified that I can use it to upload the application as well.

It is fairly obvious that the GAE build system expects a certain directory structure layout and performs byte code manipulations -- it isn't just a simple matter of compiling .java files and packaging them up into a WAR file.  To help me learn how the build system works, I plan on porting the build to use Gradle.  I also have a Spring MVC application based on Spring 3.0.0 that I'm going to port over to GAE to learn about the restrictions GAE imposes on your application.  Expect posts on those two topics in the not so distant future.

Friday, March 5, 2010

How do I automount my virtual box shared folder in an Ubuntu guest?

I'm a big fan of VirtualBox and I run a whole bunch of virtual machines based on Ubuntu.  One of vbox's features is to allow the virtual machine access to the host's file system.  I always forget how to do it and have to troll the web for a solution.  I've done this enough times that I figured I should save myself some time and outline the steps here:
  • mkdir share -- this directory will be the mount point
  • sudo mount -t vboxsf -o uid=1000,gid=1000 sharename mountpoint -- the user id and group id should be adjusted to match your account's numbers
  • put this line into /etc/rc.local: mount -t vboxsf -o uid=1000,gid=1000 sharename mountpoint -- this will remount the share during boot up.  Notice how we didn't have to specify sudo in this case, since rc.local runs as root.
Again, adjust the user and group ids to match your account's ids. Many thanks to this blog post, which gave me most of the information I needed.