Monitoring theory, from scratch — metrics and thresholds

Metrics

Prometheus collects the metrics; Grafana graphs them for easier consumption.
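As a deliberately minimal sketch of that split, here is roughly what the collection side can look like in Python with the official prometheus_client library. The metric names, the port and the fake workload are my own illustrative assumptions, not anything prescribed by Prometheus:

    # A minimal, illustrative collection sketch using the official Python
    # prometheus_client library. Metric names, port and the fake workload
    # are assumptions made for this example.
    import random
    import time

    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS_TOTAL = Counter("app_requests_total", "Requests handled")
    REQUEST_LATENCY = Histogram("app_request_latency_seconds", "Request latency in seconds")

    def handle_request():
        # Time the "work" and count the request once it completes.
        with REQUEST_LATENCY.time():
            time.sleep(random.uniform(0.01, 0.2))  # stand-in for real work
        REQUESTS_TOTAL.inc()

    if __name__ == "__main__":
        # Prometheus scrapes http://localhost:8000/metrics on its own schedule.
        start_http_server(8000)
        while True:
            handle_request()

Grafana is then pointed at Prometheus as a data source and graphs queries over these series, e.g. the request rate over time or percentiles derived from the latency histogram.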
  1. Iterate and perform RCAs (root cause analyses: in a nutshell, investigations into what went wrong and how to do better in the future). The key to a good monitoring system is (surprisingly) monitoring its own ability to deliver value. After every incident, ask yourself some questions: “Has my monitoring system served me well? Do I need more metrics? Fewer metrics? Aggregates? Dashboards?”
    If your iteration process is good, your monitoring system will likely keep improving over time.
  2. Collect more, display less and on demand. This core idea relies on the human factor (kudos to our first post for pointing that one out). Over-collection can be a problem, since collection isn’t ideal and isn’t free (it has both a resource cost and an impact on the monitored system), but overloading your operator is worse. If something has been collected but not displayed to the operator, it does not overload their analysis capacity. Allowing a deep dive on demand seems like the best of both worlds: the data is there when needed, but it does not disturb too much.
  3. Match the data aggregate (dashboard) to expected research paths. This seems trivial: if you expect to research a specific problem, create an aggregated view of the data pertaining to that problem, e.g. a database-specific dashboard. A dashboard may also match a specific interaction path, e.g. one that focuses on diagnosing a particular business process rather than a single technology silo.
  4. Offer additional data aggregates. Metrics are obtained by sampling a value, but analysis often looks at things like averages, trends and outliers. Statistics offers us ways of summarizing our collection of samples into exactly these: average, median, 95th percentile, 99th percentile and so forth. They are extremely useful (see the first sketch after this list).
  5. I wish I could offer some powerful, global truth about thresholds and alerting, but it still feels like a dark balancing art: swing too far one way and you wear out your operator and erode their trust in the system’s output; swing too far the other and you miss key critical alerts (a small sketch of one common softening technique, alerting only on sustained breaches, follows this list).
    What I can say is that modern alternative approaches built around events and transactions may lessen the need for static thresholds. YMMV, and I’d love feedback on this in the comments.
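To make item 4 a bit more tangible, here is a tiny sketch of those aggregates computed over a handful of made-up latency samples with Python’s standard statistics module (in practice you would have far more samples, and your metrics backend would usually compute these for you):

    # Illustrative aggregates over hypothetical latency samples (seconds).
    # The values are made up; real analysis would use far more data points.
    import statistics

    samples = [0.021, 0.019, 0.250, 0.024, 0.022, 0.018, 0.950, 0.023, 0.020, 0.025]

    mean = statistics.mean(samples)      # the average is skewed by the two outliers
    median = statistics.median(samples)  # the median stays close to "typical"

    # quantiles(n=100) returns 99 cut points; index 94 ~ p95, index 98 ~ p99.
    cuts = statistics.quantiles(samples, n=100)
    p95, p99 = cuts[94], cuts[98]

    print(f"mean={mean:.3f}s median={median:.3f}s p95={p95:.3f}s p99={p99:.3f}s")

The gap between the mean and the median (or the p99) is often the first hint that outliers, rather than the typical case, are what hurts users.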
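And for item 5, one common way to soften a static threshold is to require the condition to hold for some duration before firing (Prometheus expresses this with the “for” clause on an alerting rule). A rough, hypothetical sketch of the idea, with made-up threshold and evaluation counts:

    # Hypothetical "sustained breach" alert check: only fire after the threshold
    # has been exceeded for several consecutive evaluations, so brief spikes
    # don't page anyone. Threshold and counts are made up for illustration.
    from collections import deque

    THRESHOLD = 0.5          # e.g. p95 latency above 500 ms ...
    REQUIRED_BREACHES = 3    # ... for 3 evaluation intervals in a row

    recent = deque(maxlen=REQUIRED_BREACHES)

    def evaluate(p95_latency: float) -> bool:
        """Return True when an alert should fire."""
        recent.append(p95_latency > THRESHOLD)
        return len(recent) == REQUIRED_BREACHES and all(recent)

    # The first two breaches stay quiet; the third consecutive one fires.
    for value in (0.7, 0.8, 0.9):
        print(evaluate(value))  # False, False, True

It doesn’t solve the balancing act, but it trades a little detection latency for a lot less noise.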
