When it comes to your SLA, no service has 100% uptime. Things need maintenance, updates take time, and there will be periods where something, somewhere has gone wrong. We know this, and we think we accept it. But do we?
Many services are expected to come with some form of Service Level Agreement (SLA), and defining one for the services you provide or consume is a challenge for every organization on the planet, no matter how large. Google offers an SLA on their Cloud Compute platform. It's complicated, but in essence you can get refunds on your monthly spend for outage events that add up to more than about four minutes a month, which works out to an uptime of 99.99%.
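For what it's worth, the arithmetic behind that figure is straightforward. Here's a quick sketch, assuming a 30-day month:

```python
# Back-of-the-envelope downtime budget, assuming a 30-day month.
MINUTES_PER_MONTH = 30 * 24 * 60  # 43,200 minutes

def downtime_budget_minutes(uptime_pct: float) -> float:
    """Minutes of allowed downtime per month at a given uptime target."""
    return MINUTES_PER_MONTH * (1 - uptime_pct / 100)

print(f"99.99%: {downtime_budget_minutes(99.99):.1f} min/month")  # ~4.3
print(f"99.9%:  {downtime_budget_minutes(99.9):.1f} min/month")   # ~43.2
```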
That seems fair, right?
Well, look at the fine print and it goes on to put the onus on the user: you have to notify them, and you have to provide logs showing that there was a problem and how long it lasted in order to get your credit.
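If you do end up chasing that credit, the evidence usually comes down to a pass over your own request logs to bound when the errors started and stopped. A minimal sketch, assuming a hypothetical list of (timestamp, status) records rather than any particular logging product:

```python
from datetime import datetime

# Hypothetical log records (timestamp, HTTP status) pulled from your own
# access logs; the format here is an assumption, not any real GCP schema.
records = [
    ("2021-06-01T10:02:11Z", 502),
    ("2021-06-01T10:03:40Z", 502),
    ("2021-06-01T10:09:05Z", 200),
]

def parse(ts: str) -> datetime:
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")

# Treat 5xx responses as the outage signal and bound the error window.
error_times = [parse(ts) for ts, status in records if status >= 500]
if error_times:
    start, end = min(error_times), max(error_times)
    print(f"Errors from {start} to {end} ({end - start})")
```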
It also clearly states that it applies to the backend instances. So what if the problem is elsewhere in the delivery chain? I only raise this because we're into Day 2 of a problem with the Google Load Balancer that is affecting about 1.5% of all the calls we're making through the Google network.
But it's more complicated than that. It's affecting something like 10% of the calls made from locations in India and Singapore while working just fine pretty much everywhere else.
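Which is why a single global error rate hides the story. Breaking your own request logs down by region, as in the sketch below (the region names and record format are assumptions, not a real GCP log schema), is what actually surfaces a problem like this:

```python
from collections import defaultdict

# Hypothetical per-request records (region, HTTP status), illustrative only.
requests = [
    ("asia-south1", 502), ("asia-south1", 200), ("asia-southeast1", 502),
    ("europe-west1", 200), ("us-east1", 200), ("asia-south1", 200),
]

totals = defaultdict(int)
errors = defaultdict(int)
for region, status in requests:
    totals[region] += 1
    if status >= 500:
        errors[region] += 1

for region in sorted(totals):
    rate = 100 * errors[region] / totals[region]
    print(f"{region}: {rate:.1f}% errors over {totals[region]} calls")
```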
So, how do you measure what matters? And, when there’s actual money on the line, how do you prove what you need to know?