Monday, May 17, 2010

Release It! - 17.6 Standards, De Jure and De Facto

SNMP
MIBs are tough to write and tougher to get operations to import. Could try to use the Java JMX-to-SNMP connector.

CIM
Replaces SNMP but not widely supported.

JMX
A great way for Java applications to provide visibility.  Spring makes it easy to expose POJOs as MBeans.  JMX also makes administration scriptable, and scriptable administration is gold to an operations person.
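Here is a minimal sketch of what that looks like with Spring's JMX annotations; the class, attribute, and objectName below are my own illustration, not from the book:

import org.springframework.jmx.export.annotation.ManagedAttribute;
import org.springframework.jmx.export.annotation.ManagedOperation;
import org.springframework.jmx.export.annotation.ManagedResource;
import org.springframework.stereotype.Component;

@Component
@ManagedResource(objectName = "myapp:type=Monitoring,name=RequestStats")
public class RequestStats {

    private volatile long pageRequests;

    // Read-only JMX attribute: total page requests since startup.
    @ManagedAttribute(description = "Total page requests since startup")
    public long getPageRequests() {
        return pageRequests;
    }

    // An operation that operators can invoke from JConsole or a script.
    @ManagedOperation(description = "Reset the request counter")
    public void reset() {
        pageRequests = 0;
    }

    // Called from the application's request-handling path.
    public void recordRequest() {
        pageRequests++;   // good enough for a sketch; use AtomicLong in real code
    }
}

With <context:mbean-export/> in the Spring configuration, the annotated bean is registered with the platform MBean server automatically.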

What to Expose
Ideally, you should expose every state variable in the application but that isn't practical.  Try these for starters:
  • traffic indicators - Page requests total, page requests, transaction counts, concurrent sessions
  • resource pool health - Enabled state, total resources, resources checked out, high-water mark, number of resources created, number of resources destroyed, number of times checked out, number of threads blocked waiting for a resource, number of times a thread has blocked waiting.
  • database connection health - Number of SQLExceptions thrown, number of queries, average response time to queries
  • integration point health - State of circuit breaker, number of timeouts, number of requests, average response time, number of good responses, number of network errors, number of protocol errors, number of application errors, actual IP address of the remote endpoint, current number of concurrent requests, concurrent request high-water mark
  • cache health - Items in cache, memory used by cache, cache hit rate, items flushed by garbage collector, configured upper limit, time spent creating items
    All counters have a time component, such as "within the last 10 minutes" (see the sketch below for one way to get that).
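One way to get that time component is a counter that only reports events inside a sliding window. This is just a sketch under my own assumptions; the class name and the eviction strategy are mine, not the book's:

import java.util.ArrayDeque;
import java.util.Deque;

public class WindowedCounter {

    private final long windowMillis;
    private final Deque<Long> timestamps = new ArrayDeque<>();

    public WindowedCounter(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    // Record one event, e.g. a page request or an SQLException.
    public synchronized void increment() {
        timestamps.addLast(System.currentTimeMillis());
    }

    // Count of events within the window; suitable as a read-only JMX attribute.
    public synchronized int getCountInWindow() {
        long cutoff = System.currentTimeMillis() - windowMillis;
        while (!timestamps.isEmpty() && timestamps.peekFirst() < cutoff) {
            timestamps.removeFirst();
        }
        return timestamps.size();
    }
}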
JMX and SNMP Together
You can bridge the JMX and SNMP worlds; AdventNet appears to be a leader in this area.  SNMP is tree-based while JMX is object-based, but it is possible to make them work together.

Operations Database
An Operations Database accumulates status and metrics from all the servers, applications, batch jobs, and feeds that make up the extended system.  The OpsDB contains the data you will need to look for correlations and for capacity planning.  You can see what "normal" looks like for your system.  A suggested OpsDB object model is presented (sketched in code after the list):
  • Feature - unit of business-significant functionality -- same features mentioned in the SLA
  • Node - one of the "pieces" that comprise a feature, such as a web server, firewall, database, etc.  Don't model everything, just what you think is relevant.
  • Observation Type - name and sub-type of the Observation, useful in reporting.
  • Observation - a single data point obtained from a Node.
  • Measurement - a performance statistic from a Node, such as resource pool high water mark, available memory, etc. It is a type of Observation.
  • Event - a discrete occurrence at a point in time. It is a type of Observation.
  • Status - an important state change, such as a Circuit Breaker going from closed to open. It is a type of Observation.
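A rough Java rendering of that object model might look like the following. The field choices are my guesses; the book describes the model conceptually rather than as code:

import java.time.Instant;
import java.util.ArrayList;
import java.util.List;

class Feature {                          // unit of business-significant functionality, as named in the SLA
    String name;
    List<Node> nodes = new ArrayList<>();    // the pieces that make the feature work
}

class Node {                             // web server, firewall, database, etc.
    String name;
}

class ObservationType {                  // name and sub-type, used for reporting
    String name;
    String subType;
}

abstract class Observation {             // a single data point obtained from a Node
    Node node;
    ObservationType type;
    Instant observedAt;
}

class Measurement extends Observation {  // a performance statistic, e.g. pool high-water mark
    double value;
}

class Event extends Observation {        // a discrete occurrence
    String description;
}

class Status extends Observation {       // an important state change, e.g. Circuit Breaker closed -> open
    String oldState;
    String newState;
}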
Feeding the Database
Provide an API into the OpsDB but keep it simple.  The OpsDB is not a critical system component, so don't stress your system by making it wait on an underperforming OpsDB.  Adjust any of your system's scripts or batch files so that they write to the OpsDB.  You can also write a JMX MBean to feed the database with data samples.
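One way to do the MBean-driven feeding: a background task reads a JMX attribute and records it as a Measurement. OpsDbGateway, the object name, and the attribute name are stand-ins I made up for whatever simple API the OpsDB actually exposes:

import java.lang.management.ManagementFactory;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import javax.management.MBeanServer;
import javax.management.ObjectName;

public class OpsDbSampler {

    public interface OpsDbGateway {                        // keep the OpsDB API simple
        void recordMeasurement(String node, String type, double value);
    }

    private final MBeanServer mbeanServer = ManagementFactory.getPlatformMBeanServer();
    private final ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    private final OpsDbGateway opsDb;

    public OpsDbSampler(OpsDbGateway opsDb) {
        this.opsDb = opsDb;
    }

    public void start() {
        scheduler.scheduleAtFixedRate(this::sample, 0, 5, TimeUnit.MINUTES);
    }

    private void sample() {
        try {
            ObjectName pool = new ObjectName("myapp:type=ResourcePool,name=jdbc");
            Number checkedOut = (Number) mbeanServer.getAttribute(pool, "ResourcesCheckedOut");
            opsDb.recordMeasurement("app-server-1", "pool.checkedOut", checkedOut.doubleValue());
        } catch (Exception e) {
            // Swallowed on purpose: the OpsDB is not a critical component,
            // so a failure here must never hurt the application.
        }
    }
}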

Using the Operations Database
In this section, the book adds new objects to the model.

  • ExpectationType - paired with an ObservationType.  It defines the characteristics of an Expectation.
  • Expectation - an allowed range, time frame, acceptable status, deadline, etc.  Anything that is "normal" and expected for a particular metric.  A violation will trigger an alert.  You can use historical data to fine-tune these values (see the sketch after this list).
    Make sure to keep the OpsDB in top shape.  Letting cruft build up will slow it down and stress your system.
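Here is a sketch of how an Expectation might be checked against an incoming Measurement. The class shape is my assumption layered on the model above; only the idea of "an allowed range whose violation triggers an alert" comes from the book:

public class RangeExpectation {

    private final String observationType;   // paired with an ObservationType
    private final double min;
    private final double max;

    public RangeExpectation(String observationType, double min, double max) {
        this.observationType = observationType;
        this.min = min;
        this.max = max;
    }

    public boolean appliesTo(String type) {
        return observationType.equals(type);
    }

    // True when the value is outside the "normal" range and an alert should fire.
    public boolean isViolatedBy(double value) {
        return value < min || value > max;
    }
}

A nightly job could recompute min and max from the historical Observations in the OpsDB, which is the "use historical data to fine-tune these values" part.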
Supporting Processes
You need an effective feedback mechanism, or collecting and reporting this data is a waste of money.

Keys to Observation:
  • Every week, review the past week’s problem tickets. Look for recurring problems and those that consume the most time. Look for particular subsystems that cause a lot of problems or a development team (if there is more than one). Look for problems related to a particular third party or integration point.
  • Every month, look at the total volume of problems. Consider the distribution of problem types. The overall trend should be a decrease in severity as serious problems are corrected. There should also be an overall decrease in volume. (There will be a sawtooth pattern as new code releases introduce new problems.)
  • Either daily or weekly, look for exceptions and stack traces in log files. Correlate these to find the most common sources of exceptions. Consider whether these indicate serious problems or just gaps in the code’s error handling.
  • Review help desk calls for common issues. They can point toward user interface improvements as well as places where the system needs to be more robust.
  • If there are too many tickets and help desk calls to review thoroughly, look for the top categories. Also sample tickets randomly to find the things that make you go “hmmm.”
  • Every four to six months, recheck that old correlations still hold true.
  • At least monthly, look at data volumes and query statistics.
  • Check the database server for the most expensive queries. Have the query plans changed for any of these? Has a new query hit the most expensive list? Either of these changes could indicate an accumulation of data somewhere. Do any of the most common queries cause a table scan? That probably indicates a missing index.
  • Look at the daily and weekly envelope of demand (driving variables) and system metrics. Are traffic patterns changing? If you suddenly see that a popular time is dropping in popularity, it probably indicates that the system is too slow at those times. Is there a plateau in the driving variables? That indicates some limiting factor, probably responsiveness of the system.
    If a metric stops being useful, stop tracking it.  The system keeps changing and so must your view into it.
For each metric being reviewed, consider each of the following. How does it compare to the historical norms? (This is easy if the OpsDB has enough data to start forming expectations.) If the metric continues its recent trend, what happens to other correlated metrics? How long could the trend continue, and what limiting factor will kick in? What will result from that limiting factor?

I wonder if somebody has already written the code to implement this model?
