Transparency refers to the qualities that allow operators, developers, and business sponsors to gain understanding of the system’s historical trends, present conditions, instantaneous state, and future projections. Transparent systems communicate, and in communicating, they train their attendant humans. Without transparency, the system will drift into decay, functioning a bit worse with each release. Systems can mature well if, and only if, they have some degree of transparency.
Historical records have to be stored somewhere for a period of time. The historical perspective is best served by a database: the OpsDB. The OpsDB can be used to investigate anomalies or trends. Because it contains system- and business-level metrics, it can be used to identify correlations in time and across layers. Because it can be used to discover new and interesting relationships, the historical data should be broadly available through tools such as Microsoft Access and Microsoft Excel.
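To make that concrete, one OpsDB row might look like the sketch below. The `Observation` fields and the `sameHour` query are illustrative assumptions, not a prescribed schema; a real OpsDB would live in a relational database so that desktop tools can query it directly.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.List;

// Hypothetical OpsDB row: one observation of one metric at one point in time.
// Field names are illustrative; a real schema would also identify host and node.
record Observation(Instant recordedAt, String source, String metricName, double value) {}

class OpsDb {
    // Stand-in for a real database query. Pulling all observations from the
    // same hour of day is the kind of slice that reveals daily rhythms and
    // cross-layer correlations.
    static List<Observation> sameHour(List<Observation> rows, int hourOfDay) {
        return rows.stream()
                .filter(o -> o.recordedAt().atZone(ZoneOffset.UTC).getHour() == hourOfDay)
                .toList();
    }
}
```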
Predicting the Future
Good predictive models are expensive to build. It’s possible to develop “good enough” models by finding correlations in past data, which can then be used—within a certain domain of applicability—to make predictions. These correlative models can be built into spreadsheets to allow less technical users to perform “what if” scenarios. Remember, an application release can alter or invalidate the correlations on which the projections are built.
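As a sketch of such a correlative model, the least-squares fit below pairs one historical metric against another; the suggested pairing (say, page views per hour against CPU utilization) is an assumption for illustration, and the fit is only trustworthy inside the range of data it was built from.

```java
// Least-squares fit of y = a + b*x over paired historical samples,
// e.g. x = page views per hour, y = CPU utilization. Valid only within
// the domain of the data used to build it.
class CorrelativeModel {
    final double intercept, slope;

    CorrelativeModel(double[] x, double[] y) {
        int n = x.length;
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int i = 0; i < n; i++) {
            sx += x[i];
            sy += y[i];
            sxx += x[i] * x[i];
            sxy += x[i] * y[i];
        }
        slope = (n * sxy - sx * sy) / (n * sxx - sx * sx);
        intercept = (sy - slope * sx) / n;
    }

    // "What if" projection: predicted y for a hypothetical load x.
    double predict(double x) {
        return intercept + slope * x;
    }
}
```

The same arithmetic fits in a spreadsheet, which is exactly why these models travel well to less technical users, and why a release that changes the underlying correlation silently breaks them.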
“Present status” describes the overall state of the system. This is not so much about what it is doing as what it has done. It should include the state of each piece of hardware and every application server, application, and batch job.

Events are point-in-time occurrences. Some indicate normal, or even required, occurrences, while others indicate abnormalities of concern.

Parameters are continuous metrics or discrete states that can be observed about the system. This is where transparency is most vital. Applications that reveal more of their internal state provide more accurate, actionable parameters. For continuous metrics, a handy rule-of-thumb definition for nominal would be “the mean value for this time period plus or minus two standard deviations.”
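That rule of thumb can be computed directly from the stored history for the matching time period. The class and method names below are ours, not a standard API:

```java
// Nominal band for a continuous metric: the mean of historical values
// for this time period, plus or minus two standard deviations.
class NominalBand {
    final double low, high;

    NominalBand(double[] history) {
        double mean = 0;
        for (double v : history) mean += v;
        mean /= history.length;
        double var = 0;
        for (double v : history) var += (v - mean) * (v - mean);
        double stdDev = Math.sqrt(var / history.length);
        low = mean - 2 * stdDev;
        high = mean + 2 * stdDev;
    }

    boolean isNominal(double value) {
        return value >= low && value <= high;
    }
}
```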
The Infamous Dashboard
The present status of the system is obviously amenable to a dashboard presentation. (It practically defines a dashboard.) The dashboard should be broadly visible; projecting it on a wall in the lunchroom isn’t out of the question. The more people who understand the system’s normal daily behavior, the better. Most systems have a daily rhythm of expected events; their execution falls in the category of “required expected events.” The dashboard should be able to represent those expected events: whether or not they’ve occurred, and whether they succeeded or failed.
Green - all must be true:
* All expected events have occurred.
* No abnormal events have occurred.
* All metrics are nominal.
* All states are fully operational.
Yellow - at least one is true:
* An expected event has not occurred.
* At least one abnormal event of medium severity has occurred.
* One or more parameters is above or below nominal.
* A noncritical state is not fully operational. (For example, a circuit breaker has cut off a noncritical feature.)
Red - at least one is true:
* A required event has not occurred.
* At least one abnormal event of high severity has occurred.
* One or more parameters is far above or below nominal.
* A critical state is not at its expected value. (For example, “accepting requests” is false when it should be true.)
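The rules above roll up mechanically into a single traffic-light status. The boolean and severity inputs in this sketch are simplified stand-ins for real event and parameter feeds:

```java
enum Status { GREEN, YELLOW, RED }

// Rolls the green/yellow/red rules into one status. Red conditions are
// checked first, then yellow; green means nothing abnormal at all.
class Dashboard {
    static Status rollUp(boolean missedExpectedEvent, boolean missedRequiredEvent,
                         int worstAbnormalSeverity,   // 0 = none, 1 = medium, 2 = high
                         boolean paramOutsideNominal, boolean paramFarOutsideNominal,
                         boolean noncriticalStateDegraded, boolean criticalStateWrong) {
        if (missedRequiredEvent || worstAbnormalSeverity >= 2
                || paramFarOutsideNominal || criticalStateWrong) {
            return Status.RED;
        }
        if (missedExpectedEvent || worstAbnormalSeverity == 1
                || paramOutsideNominal || noncriticalStateDegraded) {
            return Status.YELLOW;
        }
        return Status.GREEN;
    }
}
```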
Instantaneous behavior is the realm of monitoring systems. This is also the realm of thread dumps. Frameworks such as JMX also enable a view into instantaneous behavior, because they allow administrators to view the internals of a running application.
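For instance, Java’s standard `java.lang.management` API exposes the same kind of instantaneous state a thread dump shows. This sketch snapshots every live thread’s name and state through the platform’s `ThreadMXBean`:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;

// Programmatic equivalent of a thread dump: list each live thread's
// name and state via the platform ThreadMXBean.
class ThreadSnapshot {
    static String dump() {
        ThreadMXBean threads = ManagementFactory.getThreadMXBean();
        StringBuilder out = new StringBuilder();
        // false, false: skip lock ownership details for a lightweight snapshot
        for (ThreadInfo info : threads.dumpAllThreads(false, false)) {
            out.append(info.getThreadName())
               .append(" [").append(info.getThreadState()).append("]\n");
        }
        return out.toString();
    }
}
```

The same MBean server that backs this call is what JMX consoles attach to remotely, which is what makes the internals of a running application visible without stopping it.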