Monitoring theory, from scratch — business process monitoring
In our last post, it seemed like we had full error coverage within our grasp.
Why aren’t we done? What is it that stops us from calling out every error our application makes as an event and raising it?
The amorphous nature of the term ‘error’.
True, some things are pretty much obvious errors. This usually applies to technical errors such as the infamous ‘500’ HTTP return code.
But our systems don’t just have to measure up to technical standards. They have to be correct from a user’s perspective. What if our login button just turned invisible, for example?
Computers and computerized systems may not register this as an error. But our users will have a very hard time logging in. Eventually, this will come up.
Business process monitoring is about catching that eventuality.
Now, we could imagine these as events and metrics and in a sense they are, e.g. “How many successful logins have taken place in the last hour?”
But they behave very differently from all our previous technical metrics because they are so amorphous, dependent on other factors, and have different people involved in their creation and maintenance.
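To make the “successful logins in the last hour” question concrete, here is a minimal sketch. Everything in it is an assumption made for illustration: the `(timestamp, succeeded)` event shape, the function names, and the `max_drop` tolerance are hypothetical, and a real threshold would have to come from the people who own the business metric.

```python
from datetime import datetime, timedelta


def successful_logins_last_hour(events, now):
    """Count successful login events in the hour ending at `now`.

    `events` is a hypothetical list of (timestamp, succeeded) tuples.
    """
    window_start = now - timedelta(hours=1)
    return sum(
        1
        for timestamp, succeeded in events
        if succeeded and window_start < timestamp <= now
    )


def needs_review(current_count, baseline_count, max_drop=0.5):
    """Flag the metric for human review when it falls well below baseline.

    `max_drop` is an assumed, business-specific tolerance: the largest
    fractional drop we accept before asking an analyst to take a look.
    """
    if baseline_count <= 0:
        return False  # no baseline to compare against
    return current_count < baseline_count * (1 - max_drop)


# Usage with made-up data: one recent success, one failure, one stale success.
now = datetime(2024, 1, 1, 12, 0)
events = [
    (now - timedelta(minutes=5), True),
    (now - timedelta(minutes=30), False),
    (now - timedelta(hours=2), True),
]
count = successful_logins_last_hour(events, now)
flagged = needs_review(count, baseline_count=10)
```

Note that the function only asks for a review. It does not declare an error, because the number alone cannot tell us why it dropped.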
Going back to our invisible login button example: a drop in logins could just as well mean that a competing product has just launched and our users are migrating in droves, or that a story in the press frightened them. Analyzing such ambiguous inputs, in the entirely different domain they reside in (business analytics), is almost another discipline altogether.
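One rough way to narrow down the ambiguity is to compare the change in logins with the change in login page views. This is only a sketch under assumptions: the function name, the labels it returns, and the `tolerance` value are all hypothetical, and the result is a hint about where to start looking, not a diagnosis.

```python
def classify_login_drop(views_now, logins_now, views_before, logins_before,
                        tolerance=0.2):
    """Give a rough hint about where to look when logins drop.

    Compares the current period with an earlier baseline period.
    `tolerance` is an assumed fraction of change we treat as normal noise.
    """
    if logins_before <= 0 or views_before <= 0:
        return "no_baseline"
    login_change = logins_now / logins_before
    view_change = views_now / views_before
    if login_change >= 1 - tolerance:
        return "no_drop"
    if view_change < 1 - tolerance:
        # Fewer people arrive at all: competition, press, seasonality...
        return "fewer_visitors"
    # People still arrive but do not log in: suspect our own product.
    return "visitors_not_logging_in"


# Usage with made-up numbers.
same_traffic = classify_login_drop(1000, 50, 1000, 400)
less_traffic = classify_login_drop(300, 120, 1000, 400)
```

With steady page views and collapsing logins, the invisible button becomes a prime suspect. With both falling together, the question goes back to the business side.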
And yet, we have a goal: to make sure our system is operating according to all the terms we’ve set in our first post. It might not be easy, but our product managers, business analysts and data scientists must now work in tandem to complete the picture.
A monitoring system is never complete without an analyst who flags a potential decline in business indicators as a sign of potential business failures, and without operations staff taking their seat in business-equivalent RCAs.
Business process monitoring is hypothetically the holy grail: we directly measure what we care about (our business). But as we’ve already learned along the way, it’s too vague compared to indicators, requires too much analysis compared to metrics, and lacks the discreteness of all technical measures. It’s also too slow to manifest. And yet it remains a crucial last resort.
While this lies outside what is typically considered the DevOps domain, I have taken part in such RCAs and done my best to maintain an understanding of the business to make sure this base is covered. It’s a process I wholeheartedly recommend.
Another personal favorite of mine is the topic of email deliverability monitoring, which by definition mixes tech and business together in an inseparable fashion. All in all, however, aside from what’s been said already, this really is a custom-made suit for your business.
The key takeaway:
Focus on making sure the process exists and that operations staff takes at least some part in understanding the core business and business-impacting events.