Monitoring theory, from scratch — metrics and thresholds

In our previous post, we have dived into the lovely world of indicators and have explored its value but also its limits. These limits have clearly led us towards observable values that are past indicators. That’s also pretty intuitive, we have everyday examples of metrics.


If indicators were binary observables values, metrics are observables with a broader, more complex value.

Prometheus collects, grafana graphs for easier consumption
  1. Collect more, display less and on-demand. This core idea relies on the human factor (kudos to our first post for pointing that one out). Over-collection can be a problem since collection isn’t idealized and isn’t free (both on resource cost and impact on the monitored system) but overloading your operator is worse. If something has been collected but not displayed to the operator, then it does not overload their analysis capacity. Allowing a deep-dive on demand seems like the best of both worlds — it’s there when needed, but it does not disturb too much.
  2. Match the data aggregate (dashboard) to expected research paths. This seems trivial. If one would research a specific problem, create an aggregated view of data pertaining to that expected problem — e.g. a database-specific dashboard. Then again, it may also match a specific interaction path — e.g. a dashboard that focuses on diagnosing a particular business process rather than a silo.
  3. Offer additional data aggregates: Metrics are obtained by sampling a value. But analysis often looks at things like averages, trends and outliers. Statistics offer us a way of extrapolating our collection of samples to these: average, median, 95th percentile, 99th percentile and so forth. They are extremely useful.
  4. I wish I could say some powerful, global truth about thresholds and alerting, but it still feels like a dark balancing art — swinging between wearing out your operator and eroding their trust in the system’s output and missing out on key critical alerts.
    But maybe I can say something — that modern alternative approaches around events and transactions may lessen such a need. YMMV and I’d love feedback on this in the comments.

A Gil, of all trades. DevOps roles are often called “a one man show”. As it turns out, I’m not a man and never was. Welcome to this one (trans) woman show.