If you want to avoid tight coupling to a particular monitoring tool or framework, then log files are the way to go. Nothing is more loosely coupled than log files; every framework or tool that exists can scrape log files. Log files, however, are badly abused.
Configuration
Make the location of your log files configurable. Operations is going to want to specify where they live.
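As a rough illustration, here is a minimal Python sketch of a configurable log location. The setting name `APP_LOG_DIR`, the file name `app.log`, and the logger name `app` are all made up for this example; a real system would read whatever configuration source operations already uses.

```python
import logging
import os
import tempfile
from pathlib import Path


def configure_logging(env=os.environ):
    """Attach a file handler whose directory comes from configuration."""
    # APP_LOG_DIR is a hypothetical setting name; fall back to the temp dir.
    log_dir = Path(env.get("APP_LOG_DIR", tempfile.gettempdir()))
    log_dir.mkdir(parents=True, exist_ok=True)
    log_path = log_dir / "app.log"

    logger = logging.getLogger("app")
    logger.addHandler(logging.FileHandler(log_path))
    logger.setLevel(logging.INFO)
    return logger, log_path
```

Operations can then point the application at any directory they like without a code change.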
Logging Levels
Logging should be targeted to operations, not development; operations will spend far more time with the logs than you will. Anything WARN or above should be seen by operations and warrant a phone call or at least some poking around into the system's current status. ERROR should be reserved for really bad stuff, such as a Circuit Breaker tripping. An NPE (NullPointerException) might not be worthy of an ERROR level message, depending on the context in which it was thrown.
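To make the level choice concrete, here is a small Python sketch. The `CircuitBreaker` class is a toy written for this example, not a real library, and the threshold and logger name are arbitrary. A single failed call is logged quietly, while the breaker opening is logged at ERROR because someone in operations needs to act on it.

```python
import logging

logger = logging.getLogger("orders")  # hypothetical component name


class CircuitBreaker:
    """Toy breaker: opens after `threshold` recorded failures."""

    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0
        self.open = False

    def record_failure(self, exc):
        if self.open:
            return  # already reported; do not spam the log
        self.failures += 1
        if self.failures >= self.threshold:
            self.open = True
            # Operations must act: the dependency is now cut off.
            logger.error("Circuit breaker OPEN after %d failures", self.failures)
        else:
            # One failure is handled by the caller; nobody needs a phone call.
            logger.info("Call failed, breaker still closed: %s", exc)
```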
Catalog of Messages
Externalize your log messages into a properties file and abstract your logging sub-system to make use of it. This lets you catalog every error that operations might see, so they can look each one up in a knowledge base.
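Here is one way the catalog idea might look, sketched in Python rather than with a Java properties file. The message codes, templates, and the `log_coded` helper are invented for illustration; the catalog text is inlined so the example is self-contained, but in practice it would live in an external file.

```python
import logging

# Hypothetical catalog contents, in key=value form like a properties file.
CATALOG_TEXT = """
# code=message template
ORD-1001=Order service unavailable; circuit breaker opened for {service}
ORD-1002=Payment retry limit reached for order {order_id}
"""


def load_catalog(text):
    """Parse key=value lines into a dict, skipping blanks and # comments."""
    catalog = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        code, _, template = line.partition("=")
        catalog[code.strip()] = template.strip()
    return catalog


CATALOG = load_catalog(CATALOG_TEXT)


def log_coded(logger, level, code, **fields):
    """Log a cataloged message, prefixed with its code for lookup."""
    message = CATALOG[code].format(**fields)
    logger.log(level, "%s %s", code, message)
    return f"{code} {message}"
```

Because every message goes through the catalog, the list of codes doubles as the index for the knowledge base.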
Human Factors
Log files are for humans, not machines. Humans are good at visual pattern matching so make the format of your logs uniform and simple so that operations can quickly spot something important and act on it. Time stamp, error code, message level, component and message details are all useful things to present.
Messages should include an identifier that can be used to trace the steps of a transaction. This might be a user’s ID, a session ID, a transaction ID, or even an arbitrary number assigned when the request comes in. When it’s time to read 10,000 lines of a log file (after an outage, for example), having a string to grep will save tons of time.
Interesting state transitions should be logged, even if you plan to use SNMP traps or JMX notifications to inform monitoring about them. Logging the state transitions takes a few seconds of additional coding, but it leaves options open downstream. Besides, the record of state transitions will be important during post-mortem investigations.
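A uniform line format with a traceable identifier could be sketched like this in Python. The column order, the `txn=` prefix, the `-` placeholder for a missing code, and the `TxnAdapter` name are my own assumptions, not a prescribed layout.

```python
import io
import logging

# Fixed columns: timestamp, level, component, error code, transaction ID, details.
LOG_FORMAT = (
    "%(asctime)s %(levelname)-8s %(name)s %(code)s txn=%(txn_id)s %(message)s"
)


def make_logger(component, stream):
    """Build a logger for one component that writes uniform lines to `stream`."""
    logger = logging.getLogger(component)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(logging.Formatter(LOG_FORMAT))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


class TxnAdapter(logging.LoggerAdapter):
    """Stamp every record with the bound transaction ID and a default code."""

    def process(self, msg, kwargs):
        per_call = kwargs.get("extra") or {}
        # "-" keeps the code column present when a message has no code.
        kwargs["extra"] = {"code": "-", **self.extra, **per_call}
        return msg, kwargs
```

A request handler would wrap its logger once, for example `log = TxnAdapter(make_logger("checkout", stream), {"txn_id": "7f3a9c"})`, and then log state transitions with `log.info("state change: PENDING -> AUTHORIZED")`. Every line for that request carries the same `txn=7f3a9c` string, which is the thing to grep for after an outage.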
Logging is a place where I've struggled over the years. What is too much? What is too little? Is it on by default, or only when things start acting weird? I think this advice is sound, specifically the point about logging only what is intended for operations. I've been experimenting with a logging implementation that tries to enforce these recommendations, and I'm liking what I've seen thus far.