So, our previous post defined monitoring for us. As a brief recap, the key takeaway was that good monitoring allows a human to derive insights about your business’s operation, in order to prevent or minimize damage.
In this post we will discuss the indicator, a basic and intuitive building block of monitoring systems. Chances are your monitoring strategy will include indicators, and you should keep them in mind when designing your code and components.
An idealized indicator is simply a binary value. On vs Off. Valid vs Invalid. Running vs Stopped. The forces of good uptime vs the forces of evil downtime, now showing at a cinema near you. Jokes aside, this seems very intuitive — if we have an attribute that we know for sure is supposed to be ‘good’ for a well-operating system and ‘bad’ for one which isn’t, it only makes sense to look at it, right?
In fact, the venerable and well-known Nagios monitoring system is focused on indicators. Indicators are the first-class residents of that system, called ‘checks’. Nagios ‘checks’ that our indicator has the expected value and aggregates that into a nice dashboard, ostensibly giving us a bird’s-eye view of all the relevant indicators for our system.
How’s that for validation of our approach? We can construct a dashboard and never fear a problem again.
Or can we?
Do ideal indicators exist?
Now, sharp readers will probably have noticed an unexpected but important word earlier on: ‘idealized’. If only things were really that simple. In fact, Nagios checks are not binary values but quaternary: OK, WARNING, CRITICAL, UNKNOWN.
Let’s look at ‘unknown’ for a moment and see what it teaches us. Real-world indicators are not idealized, so they may fail to produce a consistent answer, or any answer at all. Like any other system, they are prone to malfunctions and technological limitations. We don’t actually have a check/indicator for whether a host is ‘up’; we approximate it using a technical measure (e.g. a ping request). This gap between idealized indicators and real-world ones is what mandates an extra state such as ‘unknown’. A well-constructed indicator minimizes that gap. A human operator should ideally be trained on the implementation of the indicator and its limits.
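To make the ‘unknown’ state concrete, here is a minimal Python sketch of a host-up indicator. The numeric values follow the usual Nagios plugin return codes (0 to 3); everything else is our own illustration: `CheckState`, `host_up_indicator` and the `probe` callable are hypothetical names, and `probe` stands in for whatever technical measure you use (a ping, a TCP connect and so on).

```python
from enum import IntEnum
from typing import Callable


class CheckState(IntEnum):
    """The four check states, numbered like Nagios plugin return codes."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def host_up_indicator(probe: Callable[[], bool]) -> CheckState:
    """Approximate 'is the host up?' using a technical probe.

    `probe` is a hypothetical callable that returns True if the host
    answered and False if it did not.
    """
    try:
        answered = probe()
    except Exception:
        # The measurement itself failed, so we can claim neither up nor down.
        return CheckState.UNKNOWN
    return CheckState.OK if answered else CheckState.CRITICAL
```

The point of the sketch is the `except` branch: a broken probe tells us about the probe, not about the host, so reporting either OK or CRITICAL there would be a guess.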
And yet, ‘warning’ is still unexpected, in a sense. It defies our concept of an indicator. Looking even further, we notice things like flap detection, comments, raw plugin output… possibly an indication that we’ve missed something big.
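As an aside, flap detection is a good example of a feature that only makes sense because real indicators are noisy. The sketch below is a deliberately simplified illustration in Python and not Nagios’s actual algorithm: it just measures how often consecutive check results differ. The function names and the 0.5 threshold are assumptions made for this example.

```python
from typing import Sequence


def state_change_ratio(history: Sequence[str]) -> float:
    """Fraction of consecutive results in `history` that differ."""
    if len(history) < 2:
        return 0.0
    changes = sum(1 for prev, cur in zip(history, history[1:]) if prev != cur)
    return changes / (len(history) - 1)


def is_flapping(history: Sequence[str], threshold: float = 0.5) -> bool:
    """Treat an indicator as flapping when it changes state too often.

    The threshold is an arbitrary value chosen for illustration.
    """
    return state_change_ratio(history) >= threshold
```

An indicator that alternates between OK and CRITICAL on every check is technically binary at each moment, yet it gives the operator no clean answer.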
Could we have been wrong in how we approached this? Perhaps indicators aren’t even really a thing, but simply a specific case of a metric?
Now, that’s a trick question because we haven’t even gotten to discussing metrics! We’ll get to that in the next post. But intuitively, we’ll just say that a metric is a general quantifiable observable, not limited to two/three/X states.
Maybe it’s time to go back to our definition of monitoring and see what serves us best.
If a good indicator is meant to cleanly separate good from bad, what are we supposed to do at the sight of a warning, but scratch our heads?
There is clear value in the binary, unambiguous data. It’s easy to turn into information, insights and wisdom. Metrics are obviously a different beast — in return for better expressivity, we have more ambiguity.
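One way to see where ‘warning’ comes from is to derive an indicator from a numeric value using thresholds. The Python sketch below is only an illustration under our own assumptions: `classify` is a hypothetical helper, and the threshold values in the usage note are made up.

```python
def classify(value: float, warn: float, crit: float) -> str:
    """Collapse a numeric reading into a discrete state using two thresholds."""
    if crit < warn:
        raise ValueError("crit must be greater than or equal to warn")
    if value >= crit:
        return "CRITICAL"
    if value >= warn:
        # The in-between band: not clearly good, not clearly bad.
        return "WARNING"
    return "OK"
```

With, say, `warn=80` and `crit=90` for disk usage in percent, a reading of 85 lands in the WARNING band. That band is exactly the ambiguity we pay for the extra expressivity.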
That’s enough metrics for now before actually discussing them in their own blog post. Let’s get back to Nagios.
So basically Nagios implements non-ideal indicators… and additional non-indicator monitoring features which, at the very least, help with the shortcomings. It’s actually quite common in the industry to have convenience features bolted on (or scope creep, if you’re cynical enough).
If ideal indicators did exist, would they have sufficed?
But setting aside those limitations, which would probably be inherent to any other form of monitoring as well, wouldn’t indicators be sufficient for everything?
The answer is unfortunately ‘no’, for at least these four reasons:
- First, not everything can be made into clean and nice indicators. Some things are vaguer than others. Some failures are intermittent or partial. Full coverage is also likely not feasible given the endless number of possible failure states.
- Second, indicators are a post-factum mechanism. Even idealized, once they have triggered, it means we’re in the problem zone. Since fixing problems is never a zero-time, zero-effort endeavor, we see considerable value in predicting trouble to begin with, something indicators are not as well-suited to.
- Third, knowing something broke is half the battle. Fixing it is the other 80%. Indicators have their limits in helping us figure out how it broke and how to fix that. They’re simply not expressive enough.
- Fourth, modern tech systems change. A lot. Their state changes, we deploy or upgrade the system, a third party we interact with does the same… and any of these may influence our indicators and expectations (e.g. we may expect a certain component to be unavailable or degraded during an upgrade). These aren’t indicators we need to observe. These are events and we’ll discuss those later too.
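The fourth point can be illustrated with a small Python sketch in which a known event, here a planned maintenance window, changes what we expect from an indicator. All names (`MaintenanceWindow`, `should_alert`) are hypothetical and the model is intentionally minimal.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable


@dataclass(frozen=True)
class MaintenanceWindow:
    """A planned event during which a component is expected to be degraded."""
    component: str
    start: datetime
    end: datetime

    def covers(self, component: str, at: datetime) -> bool:
        return component == self.component and self.start <= at < self.end


def should_alert(
    component: str,
    healthy: bool,
    at: datetime,
    windows: Iterable[MaintenanceWindow],
) -> bool:
    """Alert on an unhealthy indicator unless a known event explains it."""
    if healthy:
        return False
    return not any(w.covers(component, at) for w in windows)
```

Note that the indicator itself did not change here; what changed is our expectation of it, which is information the indicator alone cannot carry.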
Are we ready to summarize yet? Almost.
A moment before we do, I would like to use this chance to review a specific specialized type of indicator, called a synthetic transaction.
Platforms like Nagios come with a wide library of checks, but for the most part they are specific component/silo checks. “Does this HTTP server respond?”, “Is this process up?”, “Does the DB return a ping?” and so forth. But for years now our systems have been at least multi-tier, if not distributed. This kind of bird’s-eye view is not fully compatible with the view down on earth, where our users live and interact with the system.
The solution is quite intuitive: let’s put ourselves in the user’s shoes and see if the system responds as expected. One could imagine this as a user doing our specific bidding and reporting back whether it worked. This has all the technical qualities we would expect of an indicator, with the advantage that it checks our system from a user-like perspective and is end-to-end.
Why is it only user-like? Because our synthetic user will not share the same state or behaviour as the user and we do not want to influence the actual state of our users, just our simulation. So synthetics have their limits, but they offer great advantages in constructing one’s array of indicators towards having assurance that our system is operating as it should.
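A synthetic transaction can be sketched as an ordered list of user-like steps that collapses into a single pass/fail result. The Python below is a minimal illustration under our own assumptions: `run_synthetic_transaction`, `SyntheticResult` and the step callables are hypothetical, and each step is expected to raise an exception on failure.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# A step is a name plus a callable that raises if the step fails.
Step = Tuple[str, Callable[[], None]]


@dataclass(frozen=True)
class SyntheticResult:
    ok: bool
    failed_step: Optional[str] = None
    error: Optional[str] = None


def run_synthetic_transaction(steps: Sequence[Step]) -> SyntheticResult:
    """Run the steps in order and stop at the first failure."""
    for name, action in steps:
        try:
            action()
        except Exception as exc:
            # Record which user-visible step broke, not just that something did.
            return SyntheticResult(ok=False, failed_step=name, error=str(exc))
    return SyntheticResult(ok=True)
```

In practice the steps might be something like "log in", "add an item to the cart" and "check out", run against a dedicated test account so that the simulation does not touch real users’ state. The `ok` flag is the indicator; the name of the failed step is a small bonus for the operator.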
Phew, that was a lot of theory. Is it finally time for some takeaways?
- Try to pick indicators that offer the smallest gap between your desired definition of ‘up’ and their ability to report it.
- Document and train operators on such gaps and implementation limits.
- Try to pick indicators that are as discrete as possible and leave as little room as possible for interpretation.
- If you end up scratching your head a lot while looking at your indicators, revise and iterate.
- Remember that indicators aren’t predictive by nature and need to be complemented with other measures/systems.
- Remember that system state changes and you need to make sure your indicators are fresh and react to it. It’s always an ongoing process.
- Consider supporting synthetic transactions in your indicator strategy, in particular if you have a complex and distributed end-to-end system.
Once you’ve done that, you will hopefully be rewarded with a system that gives you confidence in the correct functioning of your business.
Photo credits: Ben Schumin / CC-BY-SA-3.0, Nagios reviewed under fair use