A single constraint determines the system's capacity: the first constraint to hit its ceiling limits the whole system. Understanding capacity requires systems thinking -- the ability to think in terms of dynamic variables, change over time, and interrelated connections. Consider the system as a whole and look for driving variables -- usually things outside of your control, such as user demand. Following variables move in response to driving variables; examples include CPU usage, I/O rates, and network bandwidth. Load and stress testing, along with data analysis, can help correlate following variables to driving variables. You can run this driving variable/following variable analysis on the system as a whole and at each layer. The constraint will end up being one of the following variables that reaches its limit. Until the constraint fails, you will see a strong correlation between the driving variable and the constraint. Once you identify the constraint, either increase the resource or decrease its usage.
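As a rough illustration of the driving variable/following variable analysis, the sketch below correlates each following variable with a driving variable and ranks the strongly correlated ones by remaining headroom. The sample data, the resource limits, the 0.9 correlation threshold, and the function names are all assumptions made up for this example, not values from a real load test.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a flat series carries no correlation signal
    return cov / sqrt(var_x * var_y)


def rank_constraint_candidates(driving, following, limits, min_corr=0.9):
    """Return (name, correlation, headroom) tuples, least headroom first.

    Only following variables that track the driving variable strongly
    (correlation >= min_corr) are treated as candidates.
    """
    candidates = []
    for name, samples in following.items():
        corr = pearson(driving, samples)
        if corr >= min_corr:
            # Headroom: fraction of the limit still unused at peak load.
            headroom = 1.0 - max(samples) / limits[name]
            candidates.append((name, corr, headroom))
    return sorted(candidates, key=lambda c: c[2])


# Hypothetical load-test samples, one per test step.
requests_per_sec = [100, 200, 300, 400, 500]  # driving variable
following = {
    "cpu_pct": [12, 23, 35, 46, 58],
    "disk_iops": [400, 810, 1190, 1600, 2010],
    "net_mbps": [40, 41, 39, 40, 42],
}
# Hypothetical ceilings for each resource.
limits = {"cpu_pct": 100, "disk_iops": 2200, "net_mbps": 1000}

ranked = rank_constraint_candidates(requests_per_sec, following, limits)
for name, corr, headroom in ranked:
    print(f"{name}: correlation={corr:.3f}, headroom={headroom:.1%}")
```

With these made-up numbers, `disk_iops` ranks first because it tracks demand closely and has the least headroom, while `net_mbps` is excluded because it barely moves with demand. In practice the limits themselves usually have to be discovered through stress testing.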
Be aware that stability problems, such as Cascading Failures between layers, can look like capacity problems, so it is easy to confuse the two.
Successful systems will outgrow their current resources. You can scale horizontally or vertically, but you need to decide which is best for your system.
Myths About Capacity
- CPU Is Cheap: CPU cycles = clock time = latency = users waiting around. Over time and billions of transactions, wasted time becomes wasted resources and money. Adding CPUs can get expensive, especially if you have to add a new chassis.
- Storage Is Cheap: Storage is a service, not just a disk drive. There are interconnects, backups, redundant copies, etc. You also have to account for the number of servers involved in the scaling architecture: 1TB times the number of nodes in the cluster, for example. Local storage might cost $1/GB but managed storage might be $7/GB. Know your numbers.
- Bandwidth Is Cheap: Consider a dedicated connection versus a burstable connection. Just like with CPU and Storage, you have to account for multiplier effects. The more junk in your web pages, the more you have to move over the network, process, and pay for.
- always look for multiplier effects -- they will dominate your costs
- understand the effects of one layer on another
- improving non-constraint metrics will not improve capacity
- try to do most of the work when nobody is waiting for it
- place safety limits on everything - timeouts, memory, connections, etc.
- protect request-handling threads
- monitor capacity on a continual basis -- any change can affect scalability and performance. A change in user demand changes the workload.
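To make the multiplier effect from the storage myth concrete, here is a small back-of-the-envelope calculation. The $1/GB and $7/GB prices and the 1TB-per-node figure come from the notes above; the node count and the number of redundant copies are assumptions chosen only for illustration.

```python
def storage_cost(gb_per_node, nodes, copies, price_per_gb):
    """Total cost = per-node size x node count x copies x unit price."""
    return gb_per_node * nodes * copies * price_per_gb


# Assumed figures for illustration only.
GB_PER_NODE = 1000  # 1 TB per node, treating 1 TB as 1000 GB
NODES = 20          # hypothetical cluster size
COPIES = 3          # hypothetical: primary plus two redundant/backup copies

local = storage_cost(GB_PER_NODE, NODES, COPIES, price_per_gb=1)
managed = storage_cost(GB_PER_NODE, NODES, COPIES, price_per_gb=7)
print(f"local: ${local:,}  managed: ${managed:,}")
```

Under these assumptions, "1TB of storage" becomes 60,000 GB of provisioned capacity once nodes and copies are multiplied in, and the gap between local and managed pricing grows by the same factor.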