Monitoring theory, from scratch — metrics and thresholds

In our previous post, we dived into the lovely world of indicators and explored their value but also their limits. These limits clearly led us towards observable values that go beyond indicators. That’s also pretty intuitive: we have everyday examples of metrics.

Metrics

If indicators were binary observable values, metrics are observables with a broader, more complex value.

Perhaps an example would work well for us here. The amount of fuel left in your car is one that comes to mind. This is a quantifiable value (liters or gallons) that can predict a problem (running out of fuel). In addition, it’s great at illustrating the limits of indicators (by the time you’ve run out of fuel, you have a bigger problem to deal with).

But… we do have an indicator for it, don’t we? It’s even called an indicator, the ‘low fuel’ warning light. It lights up when fuel availability is below a certain threshold.

But is it too low to reach the nearby gas station? Are we in the Australian outback or in the middle of civilization with countless gas stations nearby?

That word again, “warning”.

And a new one to consider, ‘threshold’.

Hmm…

Now back to our tech world. The availability of resources (such as processing power, memory, disk and network) is an obvious candidate for metric collection and consumption. Resources may be general (such as the above) or particular to a service (e.g. a specific database’s cache memory).

These are important to understanding our system’s stability and cannot be expressed as clear indicators. This hints at the difficulty of leveraging them: they require analysis and thresholds to turn into actionable insights.
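To make the role of thresholds concrete, here is a minimal sketch of how a raw metric sample could be turned into an indicator-like state. The `Threshold` class, the `classify` function and the 80%/95% disk limits are all hypothetical names and values chosen for illustration; real limits depend on the system being monitored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threshold:
    """Hypothetical warning/critical limits for a 'higher is worse' metric."""
    warning: float
    critical: float


def classify(value: float, threshold: Threshold) -> str:
    """Turn a raw metric sample into an indicator-like state."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return "ok"


# Assumed limits for disk utilization, in percent; tune these per system.
disk_threshold = Threshold(warning=80.0, critical=95.0)

for sample in (42.0, 83.5, 97.2):
    print(f"disk at {sample}% -> {classify(sample, disk_threshold)}")
```

Just like the fuel light, the same sample can be harmless on one system and alarming on another, which is why the limits are passed in rather than hard-coded.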

So, we need to understand which metrics to collect and why, which metrics to surface to the operator and why, when to surface them, and how to analyze them.

Unfortunately, there is no silver bullet here. Systems differ a lot in nature, and analyzing them in this respect is close to an art.

But even in art, there are guidelines. I’d like to suggest some core ideas which I’ve found beneficial. As much as I’d love to take personal credit for them, they are usually embodied in the monitoring systems I’ve used (such as Prometheus with Grafana in the following screenshot).

Prometheus collects, Grafana graphs for easier consumption

So here are our takeaways for this post:

  1. Iterate and perform RCAs (root cause analyses: in a nutshell, investigations into what went wrong and how to do better in the future). The key to a good monitoring system is (surprisingly) monitoring its own ability to deliver value. After every incident, ask yourself some questions: “Has my monitoring system served me well? Do I need more metrics? Fewer metrics? Aggregates? Dashboards?”
    If your iteration process is good, then your system will likely be on an improving trajectory.
  2. Collect more, display less, and display on demand. This core idea relies on the human factor (kudos to our first post for pointing that one out). Over-collection can be a problem, since collection is never perfect and never free (it costs resources and affects the monitored system), but overloading your operator is worse. If something has been collected but not displayed to the operator, then it does not overload their analysis capacity. Allowing a deep dive on demand seems like the best of both worlds: the data is there when needed, but it does not get in the way.
  3. Match the data aggregate (dashboard) to expected research paths. This seems trivial. If you expect to research a specific problem, create an aggregated view of the data pertaining to that problem, e.g. a database-specific dashboard. Then again, a dashboard may also match a specific interaction path, e.g. one that focuses on diagnosing a particular business process rather than a silo.
  4. Offer additional data aggregates. Metrics are obtained by sampling a value, but analysis often looks at averages, trends and outliers. Statistics give us a way to summarize our collection of samples into these: average, median, 95th percentile, 99th percentile and so forth. They are extremely useful.
  5. I wish I could state some powerful, global truth about thresholds and alerting, but it still feels like a dark balancing art. Alert too much and you wear out your operators and erode their trust in the system’s output; alert too little and you miss critical alerts.
    But maybe I can say something: modern alternative approaches around events and transactions may lessen the need for such thresholds. YMMV and I’d love feedback on this in the comments.
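As a small illustration of the aggregates from takeaway 4, here is a sketch that summarizes a handful of samples using Python’s standard `statistics` module. The `summarize` helper and the latency values are made up for the example, and the "inclusive" quantile method is just one reasonable choice; monitoring systems may compute percentiles differently.

```python
import statistics


def summarize(samples: list[float]) -> dict[str, float]:
    """Summarize raw metric samples into a few common aggregates."""
    # quantiles(n=100) returns 99 cut points: index 94 is the 95th
    # percentile and index 98 is the 99th percentile.
    cuts = statistics.quantiles(samples, n=100, method="inclusive")
    return {
        "average": statistics.fmean(samples),
        "median": statistics.median(samples),
        "p95": cuts[94],
        "p99": cuts[98],
    }


# Hypothetical request latencies in milliseconds, with one outlier.
latencies_ms = [12.0, 15.0, 11.0, 14.0, 13.0, 12.0, 16.0, 13.0, 14.0, 250.0]

summary = summarize(latencies_ms)
for name, value in summary.items():
    print(f"{name}: {value:.2f} ms")
```

Note how the single outlier pulls the average far above the median, while the high percentiles make it visible. That is the kind of difference that makes it worth offering more than one aggregate.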

So, once you’ve determined your metrics and collected them, you should have better coverage of problems in your system (coverage that indicators alone could not provide) and some ability to forecast problems (which is great). Your operators will likely be busier analyzing things and wearier for it, so treat them well, like the unsung heroes they are.

Phew, that feels so comprehensive already! But we’re far from done. Next, we’ll be looking at events and transactions.

Image credit: prometheus.io, fair-use under review clause.

A Gil, of all trades. DevOps roles are often called “a one man show”. As it turns out, I’m not a man and never was. Welcome to this one (trans) woman show.