Monitoring theory, from scratch — Indicators and synthetics

Indicators

So, indicators.

An indicator you do not want to see light up.

Do ideal indicators exist?

Sharp readers will probably have noticed an unexpected but important word in the previous paragraph: ‘idealized’. If only things were really that simple. In fact, Nagios checks are not binary values but quaternary: OK, WARNING, UNKNOWN, CRITICAL.

The Nagios monitoring system
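To make the four states concrete, here is a minimal sketch of a threshold-style check. The numeric exit codes follow the Nagios plugin convention (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). The names `classify` and `format_status`, the thresholds, and the disk usage reading are all hypothetical and only there for illustration.

```python
import enum
from typing import Optional


class State(enum.IntEnum):
    """Exit codes used by the Nagios plugin convention."""
    OK = 0
    WARNING = 1
    CRITICAL = 2
    UNKNOWN = 3


def classify(value: Optional[float], warn: float, crit: float) -> State:
    """Map a measurement to a state; higher values are assumed to be worse."""
    if value is None:
        # The check could not obtain a measurement at all.
        return State.UNKNOWN
    if value >= crit:
        return State.CRITICAL
    if value >= warn:
        return State.WARNING
    return State.OK


def format_status(name: str, value: Optional[float], state: State) -> str:
    """Build a one-line, human-readable status message."""
    shown = "n/a" if value is None else f"{value:.1f}"
    return f"{name} {state.name} - value={shown}"


def main() -> int:
    # Hypothetical reading: 91.5% disk usage, warn at 80, critical at 90.
    reading = 91.5
    state = classify(reading, warn=80.0, crit=90.0)
    print(format_status("DISK", reading, state))
    return int(state)

# A real plugin would end with: raise SystemExit(main())
```

Note that UNKNOWN is not a point on the scale between OK and CRITICAL. It means the check itself could not tell, which is exactly the kind of gap an operator needs to know about.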

If ideal indicators did exist, would they have sufficed?

But setting aside those limitations, which would probably be inherent to any other form of monitoring as well, wouldn’t indicators be sufficient for everything?

  • First, not everything can be made into nice, clean indicators. Some things are vaguer than others. Some failures are intermittent or partial. Full coverage is also unlikely to be feasible given the endless number of possible failure states.
  • Second, indicators are a post-factum mechanism. Even an idealized indicator, once triggered, means we’re already in the problem zone. Since fixing problems is never a zero-time, zero-effort endeavor, we see considerable value in predicting trouble before it happens, something indicators are not well suited to.
  • Third, knowing something broke is half the battle. Fixing it is the other 80%. Indicators have their limits in helping us figure out how it broke and how to fix that. They’re simply not expressive enough.
  • Fourth, modern tech systems change. A lot. Their state changes, we deploy or upgrade the system, a third party we interact with does the same… all of these may influence our indicators and expectations (e.g. we may expect a certain component to be unavailable or degraded during an upgrade). These aren’t indicators we need to observe. These are events, and we’ll discuss those later too.

Synthetics

Before we do, I would like to take this chance to review a specialized type of indicator, called a synthetic transaction.
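As a rough illustration, a synthetic transaction can be thought of as a scripted sequence of steps that mimics a real user and reports whether the whole flow succeeded. The sketch below assumes that reading of the term. The runner `run_synthetic` and the shop-like steps (`login`, `add_to_cart`, `checkout`) are hypothetical stand-ins; a real probe would call the actual system with a dedicated test account.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Step = Tuple[str, Callable[[], None]]


@dataclass
class Result:
    ok: bool
    failed_step: Optional[str]
    elapsed_seconds: float


def run_synthetic(steps: List[Step]) -> Result:
    """Run each step in order; stop at the first one that raises."""
    start = time.monotonic()
    for name, action in steps:
        try:
            action()
        except Exception:
            return Result(False, name, time.monotonic() - start)
    return Result(True, None, time.monotonic() - start)


# Hypothetical stand-ins for real calls against the system under test.
cart: List[str] = []


def login() -> None:
    pass  # e.g. authenticate with a dedicated test account


def add_to_cart() -> None:
    cart.append("test-item")


def checkout() -> None:
    if not cart:
        raise RuntimeError("cart is empty")
    cart.clear()


result = run_synthetic([
    ("login", login),
    ("add_to_cart", add_to_cart),
    ("checkout", checkout),
])
print("OK" if result.ok else f"CRITICAL: step '{result.failed_step}' failed")
```

Reporting the name of the failing step, not just a pass or fail, gives the operator a first hint about where in the end-to-end flow to look.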

Summary/Takeaways

Phew, that was a lot of theory. Is it finally time for some takeaways?

  • Try to pick indicators that offer the smallest gap between your desired definition of ‘up’ and their ability to report it.
  • Document and train operators on such gaps and implementation limits.
  • Try to pick indicators that are as discrete as possible and leave as little room as possible for interpretation.
  • If you find yourself scratching your head a lot while looking at your indicators, revise and iterate.
  • Remember that indicators aren’t predictive by nature and need to be complemented with other measures/systems.
  • Remember that system state changes, and you need to make sure your indicators stay fresh and react to it. It’s always an ongoing process.
  • Consider supporting synthetic transactions in your indicator strategy, in particular if you have a complex and distributed end-to-end system.

Gil Bahat (she/her)


A Gil, of all trades. DevOps roles are often called “a one man show”. As it turns out, I’m not a man and never was. Welcome to this one (trans) woman show.