How much instrumentation is enough / when am I done?
I wrote this sometime in early 2024 as a combination pep talk and call to action for an engineering team whose services had varying levels of observability. It didn’t have any identifying markers and I love a good pep talk, so read on.
On maybe his first or second show back hosting The Daily Show, Jon Stewart had a rant about “protecting democracy”, describing it like this:
… the work of making this world resemble one that you would prefer to live in is a lunchpail fucking job, day in and day out where thousands of committed, smart and dedicated people bang on closed doors and pick up those that are fallen and grind away on issues until they get a positive result. Even then, have to stay on to make sure those results hold. The good news is I’m not saying you don’t have to worry about who wins the election; I’m saying you have to worry about every day before it and every day after, forever. Although, on the plus side, I’m told the sun will run out of hydrogen.
That is very much the same as building an observable system. Engineering is a lunchpail job. It requires motivated individuals to imagine a better world and also to work hard to make that imagined world real.
No Free Lunch
Industry has often landed on the terms “observability continuum” or “observability journey”, but let’s be clear: there are no endpoints (not observable/observable, start/end) here. The instrumentation you currently have might be enough to explain things you’ve experienced up until now. It’s not enough for tomorrow, though: you’ll be iterating on and growing your system, and organizational practices surrounding observability often drift. You have to keep working at it.
There’s a term in bike racing: “the washing machine effect”. It’s meant to convey just how turbulent being in the bunch is. There are constant shifts in pace, obstacles in the road, corners, and other riders employing their own strategies and tactics. This term has been co-opted by leadership coaches because – and this is where it applies here – the saying goes “if you’re not moving forward, you’re moving backward.”
This is a great way to think about observability: You have several engineers working on your system. They’re all making changes. You also hopefully have many, many users using your system, with more onboarding all the time. Users are great at finding ways to break (or even just to use) your system that you never imagined. Lovely little chaos monkeys <3
Nearly every change (not just code!) that affects a system either makes the system more observable or makes it less observable. Some changes manage to do both at the same time, but we’ll imagine those as two smaller changes: one increasing and one decreasing. If you aren’t moving forward (working to make the system more observable), you are moving backward (allowing the system to become less observable).
Start From Failures
The easiest way to make sure you’re putting in the work to move forward is pretty simple, and works from day 1: Whenever your system fails, check whether your observability tooling can help you understand how it got into that state. If there are holes, plug them. If there are no holes, work on making it easier to understand the data you do have. If you had to bounce between traces and logs, work on reducing that bouncing around (e.g., by copying data that currently lives only in logs onto spans).
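To make the “less bouncing around” idea concrete, here is a minimal sketch in plain Python. The `Span` class is a hypothetical stand-in for whatever span object your tracing library gives you, and the field names (`tenant_id`, `retry_count`, `cache_hit`) are invented for illustration. The only point is that context which used to live in a log line gets attached to the span that covers the same work.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Span:
    """Hypothetical stand-in for a tracing library's span object."""
    name: str
    attributes: Dict[str, Any] = field(default_factory=dict)

    def set_attribute(self, key: str, value: Any) -> None:
        self.attributes[key] = value


def handle_request(span: Span, tenant_id: str, retry_count: int, cache_hit: bool) -> None:
    # Before: these values were only written to a log line, so an
    # investigation had to jump from the trace to the logs to find them.
    # After: the same values ride along on the span itself.
    span.set_attribute("tenant_id", tenant_id)
    span.set_attribute("retry_count", retry_count)
    span.set_attribute("cache_hit", cache_hit)


span = Span(name="handle_request")
handle_request(span, tenant_id="t-123", retry_count=2, cache_hit=False)
print(span.attributes)
```

With a real tracing library the shape is usually the same: find the span that is already open for the work, and set attributes on it at the spot where you would otherwise have written the log line.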
Your goal isn’t just to introduce a signal that will tell you what happened. Your goal should also be to introduce a variety of signals that will help you figure out how or why it happened. Not just a span error “The database query timed out”, but also database lock time per query shape. If you focus only on the what, it leads inexorably to an ever-growing list of out-of-date dashboards, alerts, and runbooks. Adding signals for how, why, and to whom will help you later on with further investigations. It also aids in building up and maintaining intuition about the system and its changing emergent properties.
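Here is a rough sketch of what recording the why next to the what could look like, using only the Python standard library. Everything named here is an assumption for illustration: `query_shape`, `record_query`, the `db.query_shape` and `db.lock_wait_ms` attribute names, and the plain dict standing in for a span’s attributes. The literal-stripping regexes are deliberately naive, too.

```python
import re
from collections import defaultdict
from typing import Any, Dict, List


def query_shape(sql: str) -> str:
    """Collapse literals so queries that differ only in values share a shape."""
    shape = re.sub(r"'[^']*'", "?", sql)    # string literals
    shape = re.sub(r"\b\d+\b", "?", shape)  # numeric literals
    return re.sub(r"\s+", " ", shape).strip().lower()


# Lock wait times (milliseconds) grouped by query shape: the "why" signal.
lock_wait_ms: Dict[str, List[float]] = defaultdict(list)


def record_query(sql: str, lock_wait: float, timed_out: bool,
                 span_attrs: Dict[str, Any]) -> None:
    shape = query_shape(sql)
    # Why: how long did this shape of query wait on locks?
    lock_wait_ms[shape].append(lock_wait)
    span_attrs["db.query_shape"] = shape
    span_attrs["db.lock_wait_ms"] = lock_wait
    # What: the error itself, recorded alongside the context above.
    if timed_out:
        span_attrs["error"] = "The database query timed out"


slow_attrs: Dict[str, Any] = {}
record_query("SELECT * FROM orders WHERE id = 42", 1850.0, True, slow_attrs)
record_query("SELECT * FROM orders  WHERE id = 7", 3.0, False, {})
print(slow_attrs)
print(dict(lock_wait_ms))
```

Both queries land under the same shape, so a timeout on one of them can be read against the lock wait history of every query like it, rather than as an isolated error.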
The Work is Worth It
Observability requires work, but it’s work worth doing. Advocate for it. Make time for it. Be kind to your future self, to your coworkers, and to your users. Keep moving forward, lunchpail in hand. At least until the sun runs out of hydrogen. <3