At a high level, monitoring falls into two categories: proactive and reactive. Reactive monitoring tells you about an event after it has taken place, while proactive monitoring gives you a heads-up some time before the event is likely to occur. Reactive monitoring is generally low-hanging fruit once you’ve got a monitoring framework in place that fits your environment. You can set up alerts that a service is no longer available, a filesystem has filled up, etc. If you’re paying attention, you can often build new proactive alerts based on the reactive alerts that you’ve had to address.
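To make the distinction concrete, here is a minimal sketch in Python using the filesystem example. The function names, the 95% threshold, the 24-hour warning window, and the straight-line growth estimate are all assumptions chosen for illustration; a real monitoring framework would supply its own checks and a better forecast.

```python
import shutil


def current_used_fraction(path="/"):
    """Return the fraction of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def reactive_disk_alert(used_fraction, threshold=0.95):
    """Reactive: fire only once the filesystem is already (nearly) full.

    The 0.95 threshold is an illustrative assumption.
    """
    return used_fraction >= threshold


def hours_until_full(samples):
    """Estimate hours until the filesystem is full.

    `samples` is a list of (hour, used_fraction) pairs, oldest first.
    This draws a straight line through the first and last sample, which
    is a deliberately crude assumption. Returns None when there is not
    enough data or usage is flat or shrinking.
    """
    if len(samples) < 2:
        return None
    (t0, u0), (t1, u1) = samples[0], samples[-1]
    if t1 <= t0:
        return None
    rate = (u1 - u0) / (t1 - t0)  # fraction of the disk consumed per hour
    if rate <= 0:
        return None
    return (1.0 - u1) / rate


def proactive_disk_alert(samples, warn_hours=24):
    """Proactive: fire when the projected time-to-full is within `warn_hours`."""
    eta = hours_until_full(samples)
    return eta is not None and eta <= warn_hours
```

The reactive check needs only the current reading, which is why it is the easy one to set up. The proactive check needs history, so it only becomes possible once you are already collecting the metric over time.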
Good proactive monitoring grows out of root cause analysis. By paying attention to the log messages and performance metrics that you collected before and during an event, you can often create proactive alerts that clue you in to a problem before it becomes serious. It’s also important to sift through this data with team members so you can share knowledge and troubleshooting techniques. Different folks interpret error messages differently (and not always correctly). Here are some tips for making the knowledge transfer and analysis go smoothly:
- Draw out a timeline of events. Start the timeline in the middle of the whiteboard because you’ll likely go back farther to find the root cause than you might initially think.
- If possible, get a meeting room with a projector so that everyone can see what you see.
- Gather as much data as possible from the time period in question, including data from related systems. The logs are generally your primary source of troubleshooting information.
- Divide and conquer the log review process among team members in the room (if possible).
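The timeline and data-gathering tips above can be partly automated. The following is a rough sketch, assuming each log line starts with an ISO 8601 timestamp followed by a space and the message; real logs vary widely in format, and the function names and source labels here are hypothetical.

```python
from datetime import datetime


def parse_line(source, line):
    """Split 'YYYY-MM-DDTHH:MM:SS message' into (timestamp, source, message).

    The leading ISO 8601 timestamp is an assumed log format.
    """
    stamp, _, message = line.strip().partition(" ")
    return datetime.fromisoformat(stamp), source, message


def build_timeline(logs_by_source):
    """Merge log lines from several systems into one list sorted by time.

    `logs_by_source` maps a system name to a list of its log lines.
    Blank lines are skipped.
    """
    events = [
        parse_line(source, line)
        for source, lines in logs_by_source.items()
        for line in lines
        if line.strip()
    ]
    return sorted(events, key=lambda event: event[0])


def print_timeline(events):
    """Print one event per line, oldest first, for review on a shared screen."""
    for stamp, source, message in events:
        print(f"{stamp.isoformat()}  [{source}]  {message}")
```

Seeing entries from related systems interleaved in time order often shows that the first sign of trouble appeared earlier, and on a different system, than where the outage was noticed.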
Making group analysis a cultural norm can really help your team deal with future problems faster. You can learn more from failure than from success.