The Microsoft Azure AD outage and what we can learn from it
Microsoft made headline news on Monday as Microsoft 365 services, Teams and various Azure services went down for several hours. The public root cause analysis ("Authentication errors across multiple Microsoft services (Tracking ID LN01-P8Z)") shows some striking similarities between this outage and the authentication issue that took Google services down in December 2020, which I already discussed in a post on this site.
As most of the world is either migrating or has already migrated its services onto the top three cloud infrastructure platforms (AWS, GCP and Azure), it's worthwhile to take a closer look at what we can learn from the incident.
Global single points of failure: Authentication and authorization
Cloud infrastructure services are usually deployed in multiple regions and availability zones, the idea being that those instances can operate independently, so that issues in one instance will not affect any of the others. While also deployed globally, authentication services are an exception to that rule: security policies require that they deliver consistent answers on a global scale, so any issue has an immediate global effect. And nearly every service depends to some degree on authentication services, in order to allow user logins or verify trust relationships with other services.
That means that issues in authentication services such as Active Directory (AD) have a massive "blast radius": if anything goes wrong in those services, applications around the globe will be affected. It's not a coincidence that Google's issue in December 2020 also started in a user authentication service.
Security goals and resilience goals can be at odds
Given the importance of AD, it's interesting that Microsoft currently only gives an availability SLA of 99.9% (allowing about 43 minutes of downtime per month), with the free tier of Azure AD getting no SLA at all. But at least Microsoft has recognized that the ability to log in and verify tokens is critical base functionality, and will raise the SLA for these scenarios to 99.99% as of April 1st, 2021:
"the most critical promise of our service is ensuring that every user can sign in to the apps and services they need without interruption."
Even with a better SLA, technology can always fail. Is it worth considering a backup authentication mechanism? Of course, allowing user logins when the central user directory is down is an extremely difficult proposition from a security perspective. But Microsoft itself is also putting an emergency mechanism in place: "In that September incident we also referred to our rollout of Azure AD backup authentication"
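To put the two SLA figures mentioned above in perspective, here is a minimal sketch of the downtime arithmetic. It assumes a 30-day month and treats the SLA as a plain uptime percentage; the function name allowed_downtime_minutes is made up for this illustration, and real SLA terms may measure availability differently.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Downtime budget in minutes for a given uptime percentage.

    Assumption: the SLA is a simple uptime percentage over a month of
    `days` days. Actual SLA definitions may differ.
    """
    total_minutes = days * 24 * 60  # 43,200 minutes in a 30-day month
    return total_minutes * (1 - sla_percent / 100)


if __name__ == "__main__":
    for sla in (99.9, 99.99):
        budget = allowed_downtime_minutes(sla)
        print(f"{sla}% uptime allows {budget:.2f} minutes of downtime per month")
```

Under these assumptions, moving from 99.9% to 99.99% shrinks the monthly downtime budget from roughly 43 minutes to a little over 4 minutes, which an outage lasting several hours would exceed many times over.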
Something is always in a state of migration
In the discussion of Google's outage report, I had already noted:
"In an always-on environment, there's really no alternative to migrating systems partially, in several steps, while keeping the system running at all times. Of course, that approach comes with its own risks."
In Microsoft's statement, the root cause is described like this:
"a particular key was marked as "retain" for longer than normal to support a complex cross-cloud migration. This exposed a bug where the automation incorrectly ignored that "retain" state, leading it to remove that particular key"
Complex cross-cloud migrations are hopefully not the normal state of AD operations, but extraordinary circumstances are clearly more likely to expose hidden bugs, such as the incorrect processing of the "retain" state in Microsoft's key removal process. Given that something is always in a state of migration at any given moment, the software development process must pay more attention to edge cases and special circumstances that might occur only in a transitory period like a migration. In this case, it's probably fair to say that no automated tests were in place to check whether keys in "retain" state for longer than a certain time period were handled correctly.
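As an illustration of the kind of edge-case test meant here, the following is a minimal sketch. It is in no way Microsoft's actual implementation: the SigningKey class, the keys_to_remove function and the 30-day MAX_KEY_AGE are all hypothetical, invented only to show a test that pins down what should happen to a key that stays in "retain" state far longer than usual.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Hypothetical rotation policy: keys older than this are normally removed.
MAX_KEY_AGE = timedelta(days=30)


@dataclass(frozen=True)
class SigningKey:
    """Hypothetical model of a signing key tracked by rotation automation."""
    key_id: str
    created_at: datetime
    retain: bool = False  # True means: do not remove, regardless of age


def keys_to_remove(keys, now, max_age=MAX_KEY_AGE):
    """Return the ids of keys that are past max_age and not marked "retain"."""
    return [
        key.key_id
        for key in keys
        if now - key.created_at > max_age and not key.retain
    ]


def test_long_retained_key_survives_rotation():
    """A key retained far beyond the normal lifetime must never be removed."""
    now = datetime(2021, 3, 15)
    keys = [
        SigningKey("fresh", now - timedelta(days=1)),
        SigningKey("expired", now - timedelta(days=45)),
        SigningKey("retained-long", now - timedelta(days=400), retain=True),
    ]
    assert keys_to_remove(keys, now) == ["expired"]


if __name__ == "__main__":
    test_long_retained_key_survives_rotation()
    print("retain-state edge case handled")
```

The point of the test is the third key: it is far older than anything the automation would normally encounter, which is exactly the situation a long-running migration creates and a routine test suite tends to miss.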
It's not the first time
"A previous Azure AD incident occurred on September 28th, 2020 and both incidents are in the class of risks that will be prevented once the multi-phase SDP effort is completed"
It's good to hear that Microsoft is working on increased fail-safety for Azure AD, but given that both major incidents belong to the same class of risks, does that mean that something like this could happen again before the SDP project is complete in mid-2021?