Monitoring theory, from scratch — The definitions

Welcome!

This series of posts is designed to let you, the reader, broaden your perspectives about the world of tech monitoring by way of a “from scratch” exploration.

As a ‘foundations’ series, this should hopefully prove useful to you regardless of your specific technical domain, even if the focus will be online software systems.

If you’re an ops person or an architect (or manage them), this should help you evaluate your existing monitoring situation and perhaps gain some more ideas and insights.

If you’re a developer (or manage developers), this should help you empower your ops team, get better and more accurate bug reports and analysis, and improve overall system quality, not to mention the emergencies and on-call duties in which you may personally take part.

Even if you’re a seasoned DevOps engineer, this may help you ratify and rationalize what you intuitively feel is right or wrong about monitoring as you know it.

Monitoring — it’s not rocket science! Except when it actually is…

So, before we begin… What even is monitoring?

We could turn to Google for that, but that wouldn’t really count towards our “from scratch” commitment.

Intuitively, something we’re not looking at or observing cannot be called ‘monitored’.

  • Monitoring is about observing.

Okay, that’s a start. But right now, our definition includes watching the clouds pass by. While entertaining, this isn’t likely to be really productive. Unless you’re into weather or agriculture.

Hmm…

Is this monitoring? Or is it simply a random, unrelated stock photo?

  • Monitoring is about observing things important to your business.

“That seems more prudent”, we say, as we install a camera at the front door of our business. Should angry customers come banging at the door, we know something is wrong with our website.

But alas, something doesn’t feel quite right.

We then turn to xkcd as we mull the following words of wisdom:

“Strictly speaking, it’s better than the alternative, yet clearly someone is doing their job horribly wrong.”

Hmm…

  • Monitoring is about efficiently and promptly observing things important to your business.

Now, that’s better. Telling us our system has been inoperative for a whole week probably fails to count as monitoring. But let’s raise the bar even further. Let’s try to figure out what qualifies as good monitoring. For that, we shall turn to another source of internet wisdom: inspirational quotes.

“Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom.” — Clifford Stoll

Observing things important to our business will definitely yield data. What we really want is to use our (newly acquired) wisdom to prevent damage and fix whatever damage already happened.

Good monitoring will give us just that. Instead of handing us raw data, it should guide us to the source of the problem, hopefully with the right insights about how to fix it. This also exposes another aspect of monitoring for us to consider: the human factor (that’s the ‘we’ part).

  • Good monitoring is monitoring which allows a human to derive insights about your business’s operation, in order to prevent or minimize damage.

Now that’s really lovely, but it’s still a bit too broad. It fails to guide us as to what kind of insights we’d be looking at (especially considering a tech business, as this is a tech blog after all).

So, what could go wrong with a tech system?

  • It could do things it wasn’t supposed to do, or fail to do the things it was. Let’s call this one ‘correctness’.
  • Maybe it’s trying to do the right things, but does not meet the service level it’s supposed to, like being too slow. I’d like to call this one ‘smoothness’, based on the expected experience.
  • Maybe it’s doing the right things smoothly, but costing us too much money while doing so. Most businesses like to earn money eventually and don’t operate on an infinite budget. I’d like to call this one ‘efficiency’.
  • Another important bit, which is unfortunately often neglected, is keeping the trust placed in us. A system really should be secure. I’d like to vote for ‘security’.
  • And, while optional for some, abiding by the law that governs your business should be in. Let’s throw in ‘compliance’.

And last but not least, we should do at least a cursory examination of what would ruin our day as we try to figure out if something is really a problem.

Is this a problem? Or is it simply a random, vaguely related meme template?

As with any attempt to discern a truth (“is this a problem?”), we’d be looking at false positives and false negatives.

A false negative in our context means there was a problem, but we failed to see it or we failed to properly assist a human to derive the right insight.

A false positive in our context means there was no problem, or perhaps a minor problem, and we caused a human to derive the wrong insight.
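To make these two definitions concrete, here is a minimal Python sketch. It is only an illustration: the names `Check`, `classify` and `tally` are made up for this post, and it assumes we can somehow know, after the fact, whether there really was a problem worth a human’s attention (a minor problem that did not deserve the fuss counts as “no real problem” here).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Check:
    """One monitoring verdict, compared with what was really going on.

    Both fields are hypothetical and exist only for this illustration.
    """
    alerted: bool       # did monitoring lead a human to believe there was a problem?
    real_problem: bool  # was there actually a problem worth acting on?


def classify(check: Check) -> str:
    """Name the outcome of a single check."""
    if check.alerted and check.real_problem:
        return "true positive"
    if check.alerted and not check.real_problem:
        return "false positive"   # a human was led to the wrong insight
    if not check.alerted and check.real_problem:
        return "false negative"   # a problem existed but nobody saw it
    return "true negative"


def tally(checks) -> dict:
    """Count how many checks fall into each outcome."""
    counts = {
        "true positive": 0,
        "false positive": 0,
        "false negative": 0,
        "true negative": 0,
    }
    for check in checks:
        counts[classify(check)] += 1
    return counts
```

Feeding a week of past alerts (and missed incidents) through something like `tally` gives a rough feel for how often monitoring cried wolf versus how often it stayed silent when it should not have.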

And thus, we end up with the following three definitions to serve us for the rest of the series:

  • Monitoring is about efficiently and promptly observing things important to your business.
  • Good monitoring is monitoring that allows a human to derive insights about your business’s operation, in order to prevent or minimize damage.
  • Good tech monitoring is monitoring that allows a human to ensure your business operates correctly, smoothly, efficiently, securely and in compliance while reducing false positives and false negatives to a minimum.

In our next posts, we’ll use this as a basis for examining the key concepts of observables in a monitored tech system — indicators and synthetics, metrics and thresholds, events, transactions and business process monitors.

Picture credits: wikimedia, unsplash, meme fair use

A Gil, of all trades. DevOps roles are often called “a one man show”. As it turns out, I’m not a man and never was. Welcome to this one (trans) woman show.