Magic Numbers and second-guessing SLOs - why is 96% better than 95%?
I’ve had a half-written draft of this post sitting in a folder for the last six months and I’ve not been able to shake the underlying doubt, so I’m going to publish it and see what the feedback teaches me. But first, the heresy: Service Level Objectives make me uncomfortable.
I have no issue with the idea that you need some form of measurement and tracking to ensure you’re maintaining an acceptable level of service. But when I read posts on SLOs, or watch recorded conference sessions, the concept seems to imply some rigour and a background process for determining the numbers to work towards. In practice, those numbers feel decoupled from any hard details and often come across as either a guesstimate or just a Current Representation of Actual Percentages.
Maybe it’s the industry I’m currently in and the unique position it affords. As a Government department we have no explicit per-transaction financial goals. Cheaper and quicker is better, but if our serving cost goes over, say, 5p per page, no one loses their job or their brand new Mercedes bonus pot. A few of my previous roles were incredibly financially focused, and those companies knew how much they should be making every minute, broken down by day and hour usage trends. I can see how service level objectives work in that world, and I’d be a lot more comfortable with the concept’s adoption if every use had something like this behind it:
Let’s assume we work for ${ecommerce}.com and we always exactly hit our SLO in terms of successful requests and never go above or below it. Hopefully it’ll be quite easy to pull together our opening position.
- Our SLO is 95%
- We get 100,000 visits a day
- That’s 95,000 successful visits
- We make a sale in 5% of visits
- Which means we make 4,750 sales in a day
- Our average sale is 100 USD so we make 475,000 USD a day
With this kind of detail I can see how SLOs can help inform the discussion:
- Our new SLO will be 97%
- That’s 97,000 successful visits
- We still make a sale in 5% of visits
- Which means we make 4,850 sales in a day
- Our average sale is still 100 USD but now we make 485,000 USD a day
In this scenario, if we raised our SLO, and our game to match it, we would make an extra 10,000 USD each day. These numbers make it easier to build a case and decide how much engineering time and financial outlay we are willing to commit to possibly making that extra money, and they show what we lose if we backslide in availability.
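As a rough sketch, the arithmetic in the two lists can be written down as a small Python function. The function name and its parameters are made up for illustration. It keeps the same simplifying assumptions as the example: we hit the SLO exactly, and the conversion rate and average sale don't change when availability does.

```python
def daily_revenue_usd(
    visits_per_day: int,
    slo_percent: int,
    conversion_percent: int,
    average_sale_usd: int,
) -> int:
    """Daily revenue for the toy ${ecommerce}.com model (hypothetical helper).

    Integer arithmetic keeps the figures exact for whole-number inputs
    like the ones used in the post.
    """
    successful_visits = visits_per_day * slo_percent // 100
    sales = successful_visits * conversion_percent // 100
    return sales * average_sale_usd


# The two scenarios from the lists above.
current = daily_revenue_usd(100_000, 95, 5, 100)
proposed = daily_revenue_usd(100_000, 97, 5, 100)

print(f"95% SLO: {current:,} USD a day")
print(f"97% SLO: {proposed:,} USD a day")
print(f"Difference: {proposed - current:,} USD a day")
```

Swapping in your own visit counts, conversion rate, and sale value is the interesting part. The function only does the multiplication.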
Unfortunately very few of the SLO posts and talks I’ve seen give any kind of breakdown like this. Instead they seem to rely on either picking an acceptable amount of downtime and encoding that or taking the current uptime percentage and elevating it beyond happenstance into a guiding principle. I guess my question would be: “If you changed your SLO value up or down 1-5 percentage points, what does that mean? What happens?”
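For the toy ${ecommerce}.com figures, at least, that question can be answered mechanically. The sketch below sweeps the SLO five percentage points either side of 95% and prints the change in daily revenue. The constants come from the example; the function names are hypothetical, and the model still assumes that conversion rate and average sale stay fixed as availability moves, which a real business would need to check.

```python
# Figures from the ${ecommerce}.com example in this post.
VISITS_PER_DAY = 100_000
CONVERSION_PERCENT = 5
AVERAGE_SALE_USD = 100
BASELINE_SLO_PERCENT = 95


def revenue_at(slo_percent: int) -> int:
    """Daily revenue in USD if we hit exactly slo_percent (hypothetical helper)."""
    successful_visits = VISITS_PER_DAY * slo_percent // 100
    sales = successful_visits * CONVERSION_PERCENT // 100
    return sales * AVERAGE_SALE_USD


def revenue_deltas(baseline_percent: int, max_shift: int = 5) -> dict[int, int]:
    """Map each SLO within max_shift points of the baseline to its revenue change."""
    baseline_revenue = revenue_at(baseline_percent)
    low = baseline_percent - max_shift
    high = min(baseline_percent + max_shift, 100)  # an SLO cannot exceed 100%
    return {slo: revenue_at(slo) - baseline_revenue for slo in range(low, high + 1)}


deltas = revenue_deltas(BASELINE_SLO_PERCENT)
for slo, delta in deltas.items():
    print(f"{slo}% SLO: {delta:+,} USD a day versus {BASELINE_SLO_PERCENT}%")
```

Under these assumptions each percentage point is worth 5,000 USD a day, which is one concrete answer to why 96% would be better than 95%.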
Maybe I’m looking for more than Service Level Objectives are offering. Or maybe the early adopters are all finding their way through it, and the pioneering posts are deliberately simplified, either to encourage people to take a look or to hide the complexity of actual numbers that have no context outside of their home organisation. But considering all the cheerleading, I’d expect there to be more than just a helpful baseline that people can have in the back of their mind. So, where does that leave me? Continually questioning the process while trying to write enough SLOs that I either develop Stockholm syndrome or it finally clicks and that little nagging voice in the back of my head stops saying it’s just reliability theatre.