There was a widely reported outage affecting Microsoft services on January 25. But as with most issues in API-powered services, the press reporting, the provider’s own reporting, and the monitoring data don’t always align.
Our canary service Serinus, currently in open beta on Twitter, first reported the problem at 07:18 UTC with a concern notification, suggesting something was going wrong. This escalated to a major outage impacting Asia, and then it started spreading.
But first: Kudos to Microsoft
They announced there were issues within 10 minutes on their Twitter channel and in their admin centers for Microsoft 365.
That’s atypical. We often see a 30–40-minute delay between our service detecting a problem and the problems being honestly reported by the provider.
One of the key benefits of using a solution like Serinus and APImetrics is that you’re always going to be ahead of the pack when it comes to notifications, as a group of admins in Australia noted on Twitter overnight.
Which Microsoft APIs were impacted?
This was an interesting outage in that it was intermittent. It was possible to use some of the API services during the issue, although we did experience connection errors and timeouts.
We also noticed that our monitoring of the O365 APIs (these being the deprecated APIs that are now ‘fronted’ by the Microsoft Graph APIs) fared worse than the MS Graph endpoints.
While there have been some valid complaints that the Office endpoints are faster to use than the Graph endpoints, it is clear that the marginal extra latency of the Graph endpoints (a few tens of milliseconds per call) is made up for in service resilience.
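To make the comparison between endpoint families concrete, here is a minimal sketch of how probe results could be bucketed and turned into a failure rate per endpoint. It is not how APImetrics or Serinus works internally. The `ProbeResult` type, the `classify` and `failure_rates` helpers, and the "graph" / "o365" labels are all hypothetical, and the sample data is made up for illustration rather than taken from the outage.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional


@dataclass
class ProbeResult:
    endpoint: str                # hypothetical label, e.g. "graph" or "o365"
    status: Optional[int]        # HTTP status, or None if no response arrived
    error: Optional[str] = None  # "connection" or "timeout" when status is None


def classify(result: ProbeResult) -> str:
    """Bucket one probe as ok, timeout, connection_error or http_error."""
    if result.status is None:
        return "timeout" if result.error == "timeout" else "connection_error"
    return "ok" if 200 <= result.status < 400 else "http_error"


def failure_rates(results):
    """Return {endpoint: fraction of probes that did not succeed}."""
    totals, failures = Counter(), Counter()
    for r in results:
        totals[r.endpoint] += 1
        if classify(r) != "ok":
            failures[r.endpoint] += 1
    return {ep: failures[ep] / totals[ep] for ep in totals}


# Made-up sample data for illustration only, not measurements from the outage.
sample = [
    ProbeResult("graph", 200),
    ProbeResult("graph", None, "timeout"),
    ProbeResult("graph", 200),
    ProbeResult("graph", 200),
    ProbeResult("o365", None, "connection"),
    ProbeResult("o365", None, "timeout"),
    ProbeResult("o365", 200),
    ProbeResult("o365", 503),
]
print(failure_rates(sample))
```

Separating timeouts and connection errors from HTTP errors matters in an intermittent outage, because a single "up or down" flag hides the fact that some calls are still getting through.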
How long were the Microsoft APIs impacted?
At APImetrics, we saw that the issues with the APIs that underpin Microsoft services had resolved themselves by 09:33 UTC, just over 3 hours later, although the reporting suggests the outage was just over 4 hours.
This raises some interesting questions about how the apps that use the API services interact with those APIs and with the greater Microsoft ecosystem that exists outside of APIs. Did that wider ecosystem simply take a little longer to recover, or are people just rounding up the length of the outage?
We would love to hear more.
Top Tips
- Don’t rely on company service pages – follow @serinusmonitor
- Monitor the services that are critical to you using APImetrics
- Outages are rarely across all systems and all endpoints – some will be impacted more than others. Know your exposure
- Be prepared for a slow return to normal. In this case, we saw the APIs return to normal faster than the services
- Check clouds and regions – you might be able to quickly redeploy a service to avoid a regional outage and minimize disruption
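As a rough illustration of the last tip, the sketch below ranks regions by recent pass rate so you can see where a redeploy might land safely. The `healthy_regions` function, the 0.95 threshold, and the region names are assumptions made for this example, and the numbers are invented rather than taken from our monitoring.

```python
def healthy_regions(pass_rates, threshold=0.95):
    """Return regions whose recent pass rate meets the threshold, best first."""
    ok = [(rate, region) for region, rate in pass_rates.items() if rate >= threshold]
    # Sort by pass rate descending, then by name for a stable order on ties.
    return [region for rate, region in sorted(ok, key=lambda p: (-p[0], p[1]))]


# Invented pass rates per (hypothetical) region over a recent window.
recent = {"asia-east": 0.40, "europe-west": 0.97, "us-east": 0.99}
print(healthy_regions(recent))
```

A real decision would also weigh data residency, cost and how quickly your deployment tooling can move, so treat this as a starting point rather than a failover policy.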
Don’t hesitate
APIs impact all parts of your business operations and everything you build. If you don’t have a meaningful, trusted view of the performance and quality issues that could hit you, then you’re not really on top of your APIs.
The APImetrics lifecycle governance platform offers a range of cloud-independent features for all types of API users and API Product Managers. From performance and Service Level Agreement checks through to schema validation and regional performance analysis, we have you covered.
We’ll even set things up for you for a trial.
Contact us for a demo today or dive in yourself.