Monitoring theory, from scratch — events and transactions.
In our previous posts, we’ve covered indicators and metrics. These were attributes of a system which we could define and observe. But we’ve also noticed that they have major limitations around analysis. Indicators supposedly circumvent this by virtue of their definition, and metrics by relying on the operator’s analysis capability, but surely there is something that carries more context and detail. Once a problematic metric is flagged, we need to look at the trail of happenings that preceded it. In fact, perhaps just knowing which events occurred before the metric was flagged would have sufficed and been more efficient.
A system has all sorts of stuff happening in it all the time. It could be as detailed as ‘register AX changed its value to 0’ or ‘a packet has been received’ or it could be as high-level as ‘a customer has just logged in’. Some of these might be expected, some of these might not be (or perhaps expected but not welcome, in the case of nasty third-party integrations). By directly observing these, we don’t need to wait for them to trigger a metric change.
Congratulations, we’ve just re-invented logging! Well, sort of.
As with metrics, collecting each and every event in the system is hopelessly impossible. We need to look at which events might matter to us and start collecting them. But now we have extended ability and context to make sense of them, particularly because we can collect additional data surrounding the event of interest, such as the code path in which it took place.
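To make "an event plus its surrounding context" a bit more concrete, here is a minimal sketch using Python's standard `logging` module. It emits one JSON object per event and attaches the function and line where the event was raised. The logger name `events_demo`, the `context` field, and the `customer_logged_in` event name are all made up for illustration; your own events and fields will differ.

```python
import io
import json
import logging


class JsonEventFormatter(logging.Formatter):
    """Render each log record as a single JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        event = {
            "event": record.getMessage(),
            "level": record.levelname,
            # Code path context: where the event was emitted.
            "function": record.funcName,
            "line": record.lineno,
            # Caller-supplied context; "context" is a hypothetical field name.
            "context": getattr(record, "context", {}),
        }
        return json.dumps(event)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON events to the given stream."""
    logger = logging.getLogger("events_demo")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonEventFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger


def login(logger: logging.Logger, user_id: str) -> None:
    # The event we decided matters: a customer has just logged in.
    logger.info("customer_logged_in", extra={"context": {"user_id": user_id}})


buffer = io.StringIO()
login(build_logger(buffer), "u-123")
record = json.loads(buffer.getvalue())
print(record)
```

Because every event is structured, it can later be filtered or aggregated by field rather than by grepping free text.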
Sounds perfect. Up until your operators are overwhelmed by millions of log lines scrolling by their terminals.
Again, knowing what to log and when to log it is an art. In particular, the people involved in monitoring might not be the people defining what is logged and how. Logging has developed a lot from the early days of Unix system logging facilities and the Syslog protocol, and yet the demand from the market is for ever-improving systems. A little tiny company called Splunk owes a lot of its $22.0B market cap (at time of writing) to innovation in the field of logging.
But, no silver bullet again for us. Bummer. We’ll have to make do with some recommendations. As before, these are my own takes on the matter and more in the form of general rules of thumb rather than specifics (and again, learned from using logging systems such as Splunk and Coralogix). And now, takeaway time again!
- Iterate and perform RCAs. Haven’t we heard that one before? Your logging should look more like a living organism than a stone. Ask yourself if a log could have been helpful in diagnosing a problem. Do I need more logging? Less logging? Better filtering? Better aggregation?
I’d also like to add — be brave and send logging patches for underlying products or your cloud provider. Be a good netizen. Thank you.
- Collect more, display less. That one also rings a familiar bell. Allowing an operator to access the data they need for a particular analysis but not always displaying it can be the best of both worlds. As a particular recommendation over the lessons of metrics, abilities such as template analysis and aggregation of repeat messages can be of great assistance in separating the wheat from the chaff, offering both a metric view of repeating events as well as the ability to filter items of interest.
Then again, considering the number of things that can possibly be collected, a tasteful limit is in order. Usually, intuition coupled with an RCA process will suffice.
- Context matters. A lot. This is a new development over our experiences with metrics and thresholds. We can now say things that are far more specific — including things like stack traces, line of code triggering the event, some particular environment state, particular metrics or indicators… this extended richness and expressivity should be tapped in event collection. This is something to consider in your RCA process as you improve upon your log collection.
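The "collect more, display less" idea can be sketched with a toy version of template analysis: mask the parts of a message that vary (numbers, addresses) and count how often each resulting template repeats. This is only an illustration under my own assumptions; the masking patterns and sample log lines below are invented, and commercial products use far more sophisticated techniques.

```python
import re
from collections import Counter

# Tokens that tend to vary between otherwise identical messages.
# These patterns are illustrative assumptions, not a complete list.
VARIABLE_PATTERNS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<ip>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<hex>"),
    (re.compile(r"\b\d+\b"), "<num>"),
]


def to_template(message: str) -> str:
    """Replace variable tokens with placeholders to recover the template."""
    for pattern, placeholder in VARIABLE_PATTERNS:
        message = pattern.sub(placeholder, message)
    return message


def summarize(messages) -> Counter:
    """Count how many raw messages collapse into each template."""
    return Counter(to_template(m) for m in messages)


# Invented sample data.
raw_logs = [
    "login failed for user 1041 from 10.0.0.7",
    "login failed for user 2210 from 10.0.0.9",
    "login failed for user 87 from 192.168.1.4",
    "cache miss for key 0x1f3a",
    "payment service timeout after 3000 ms",
]

summary = summarize(raw_logs)
for template, count in summary.most_common():
    print(f"{count:>3}  {template}")
```

The operator sees a handful of templates with counts (a metric view of repeating events) while the raw lines remain available for drilling down.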
As if that’s not overwhelming already, there is more. It probably shouldn’t come as much of a surprise that, somewhat similar to synthetics, there is a form of aggregated, distributed event collection: transaction tracing.
Transaction tracing, at its core, attempts to look at user interactions with the system as events. This has merit — users are the ones experiencing our system as well as usually the ones breaking it by trying to use it. It only makes sense to examine events from the user’s perspective. But a user transaction (let’s say a login event) can span multiple internal components. Transaction tracing aggregates that into a single view and may be able to collect further context such as related metrics and logs.
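As a rough sketch of the aggregation idea, assume every component tags the events it emits with a shared trace ID and the ID of the event that caused it. Collecting those events and grouping them by parent gives a single tree per user transaction. The `Span` type, the helper functions, and the component names below are hypothetical and do not reflect how any particular tracing product works.

```python
import uuid
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Span:
    """One unit of work done by one component on behalf of a transaction."""
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    component: str
    operation: str


def start_span(spans: List[Span], trace_id: str, component: str,
               operation: str, parent: Optional[Span] = None) -> Span:
    """Record a span in the (in-memory) collector and return it."""
    span = Span(trace_id, uuid.uuid4().hex,
                parent.span_id if parent else None, component, operation)
    spans.append(span)
    return span


def build_tree(spans: List[Span], trace_id: str) -> Dict[Optional[str], List[Span]]:
    """Group the spans of one trace by parent so they can be walked as a tree."""
    children: Dict[Optional[str], List[Span]] = defaultdict(list)
    for span in spans:
        if span.trace_id == trace_id:
            children[span.parent_id].append(span)
    return children


def render(children, parent_id: Optional[str] = None, depth: int = 0) -> List[str]:
    """Return an indented, depth-first listing of the trace."""
    lines: List[str] = []
    for span in children.get(parent_id, []):
        lines.append("  " * depth + f"{span.component}: {span.operation}")
        lines.extend(render(children, span.span_id, depth + 1))
    return lines


collected: List[Span] = []
login_trace = uuid.uuid4().hex
root = start_span(collected, login_trace, "api-gateway", "POST /login")
auth = start_span(collected, login_trace, "auth-service", "verify_password", parent=root)
start_span(collected, login_trace, "user-db", "SELECT user", parent=auth)
start_span(collected, login_trace, "session-store", "create_session", parent=root)
# A span from an unrelated transaction, which must not appear in the login view.
start_span(collected, uuid.uuid4().hex, "api-gateway", "GET /health")

tree_lines = render(build_tree(collected, login_trace))
print("\n".join(tree_lines))
```

In a real system the spans would be shipped from separate processes to a collector, and each one would also carry timing, logs, and metrics.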
Now that really is close to perfection, if done right. We can immediately determine which interaction broke, where it broke, and how it broke, with the relevant metadata and related events all laid out in tree form. This concept has also helped facilitate small companies such as New Relic and its $4.0B market cap (at time of writing). I’ll include a screenshot from two such favorite products (Lumigo and Epsagon) which have a cloud/serverless focus, as I think they’re worth a thousand blog words:
Note how this view traces through code paths and different components (both cloud and our own) while aggregating relevant log entries and localized metrics and indicators. Bravo.
Transaction tracing has its real-world limits, of course. These are complex and potentially expensive products and tracing is neither computationally free nor perfectly accurate. Then again, I trust that it’s bound to make your developers and operators happier.
Looks like we’ve just won another takeaway: you should evaluate a transaction tracing solution based on its accuracy, performance impact, and pricing if implemented in your environment. That’s the ‘done right’ we’ve been looking for when it comes to transaction tracing.
So now, we finally have all the tools we need to construct a comprehensive monitoring solution for our system to make sure our business is operating as it should with confidence.
I mean, really, is there anything else we haven’t looked into?
Turns out there is. Tune in for the last (phew!) part about business process monitoring.
Photo credits: Pexels. Coralogix, Lumigo, Epsagon reviewed under fair use