The great Google outage of 2020 and what we can learn from it
The fact that a Google outage is newsworthy says a lot about Google's stability in general: Google has been operating its complex systems with high uptime and strong performance. Still, even Google goes down occasionally, and the "great" Google outage of December 14th, 2020 made headline news. The incident report, which Google published to its credit, holds lessons for most IT teams, even though only Google operates at Google scale. Let's take a closer look at the report to see what Google and other organizations that build and operate software-based systems have in common.
Something is always in a state of migration
From the incident report: "A change was made in October to register the User ID Service with the new quota system, but parts of the previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0".
In an always-on environment, there's really no alternative to migrating systems partially, in several steps, while keeping the system running at all times. Of course, that approach comes with its own risks: coexistence of the old and the new can be problematic. As responsibilities of the old system are gradually shifted to the new one, clear communication and monitoring are needed to make sure that everyone is using the authoritative system. Maybe keep an inventory of ongoing migration plans, and review it regularly as part of your risk management activities?
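To make that concrete, here is a minimal sketch of such an inventory in Python; the fields, names, and dates are illustrative assumptions, not anything taken from Google's report:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Migration:
    """One in-flight migration, tracked for regular risk review."""
    name: str
    old_system: str              # still holds some responsibilities
    new_system: str              # the authoritative system going forward
    started: date
    target_completion: date
    remaining_steps: list[str]   # what is left before the old system can go away

def overdue(migrations: list[Migration], today: date) -> list[Migration]:
    """Migrations past their target date that still have steps left."""
    return [m for m in migrations if m.remaining_steps and today > m.target_completion]

# Hypothetical entry, loosely modeled on the quota-system change described in the report.
inventory = [
    Migration(
        name="quota-system-cutover",
        old_system="legacy quota reporting",
        new_system="new quota system",
        started=date(2020, 10, 1),
        target_completion=date(2020, 12, 1),
        remaining_steps=["decommission legacy usage reporting"],
    ),
]

for m in overdue(inventory, today=date(2020, 12, 14)):
    print(f"Review needed: {m.name} still depends on {m.old_system}")
```

Reviewing such a list in a recurring risk meeting is low-tech, but it keeps half-finished migrations from quietly dropping off everyone's radar.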
Global single points of failure
"The quota for the account database was reduced, which prevented the Paxos leader from writing. Shortly after, the majority of read operations became outdated which resulted in errors on authentication lookups"
Even when deployed independently in separate regions, some systems (in particular, authentication components) need to synchronize across the globe, and that makes them prone to causing global (as opposed to regional) outages, since data issues can quickly spread across all regional instances. This is something they have in common with name resolution (DNS) and load balancing, other examples of centrally controlled, distributed, and historically accident-prone components. With large-scale distributed databases and storage systems (BigTable, S3, …) readily available in public clouds, it's tempting to put all the data in one pot and access it from multiple regionally deployed instances of a service. But centralized data leads to a larger "blast radius" for data issues than isolated data stores. Consider reducing that risk by sharding data across multiple independent data stores.
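A minimal sketch of that idea, assuming a simple hash-based routing layer (the store names are made up, and plain dictionaries stand in for real database clients): a misconfiguration or bad write in one store then only affects the keys routed to it, rather than every region at once.

```python
import hashlib

# Hypothetical: three independent data stores, each with its own quota,
# replication setup and failure domain.
STORES = {
    "store-a": {},
    "store-b": {},
    "store-c": {},
}

def store_for(key: str) -> str:
    """Deterministically map a key (e.g. a user ID) to one independent store."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    names = sorted(STORES)
    return names[int(digest, 16) % len(names)]

def write(key: str, value: str) -> None:
    STORES[store_for(key)][key] = value

def read(key: str):
    return STORES[store_for(key)].get(key)

write("user-123", "profile-data")
print(store_for("user-123"), read("user-123"))
```

A production system would use consistent hashing (so adding a store does not remap most keys) and accept that cross-shard operations become harder; that is the price of a smaller blast radius.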
Security goals and resilience goals can be at odds
"For security reasons, this service will reject requests when it detects outdated data."
Scalable and resilient systems usually need to tolerate inconsistent responses, since they will cache service responses and serve stale data in case the original service is not available. In most cases, accepting inconsistent or incorrect responses is a small price to pay for better availability and performance of the overall system. Not so when there are conflicting security requirements: as a general rule, very secure systems need to be very consistent, and that makes them inherently less scalable and resilient. It would be interesting to look at possible trade-offs in Google's case: how much less secure would the authentication system be if it served outdated data for a longer period of time, or if it were built as a federation of individual authentication systems which, even in the absence of a central system to coordinate updates, could keep on serving authentication requests?
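To make the trade-off tangible, here is a hedged sketch (not Google's design) of an authentication cache with a configurable staleness bound: within the bound it keeps serving cached lookups when the backend fails, beyond it it fails closed, as the report describes.

```python
import time

class AuthCache:
    """Illustrative only: serve cached auth lookups up to max_staleness seconds old."""

    def __init__(self, backend, max_staleness: float):
        self.backend = backend        # callable: user_id -> auth record
        self.max_staleness = max_staleness
        self._cache = {}              # user_id -> (record, fetched_at)

    def lookup(self, user_id: str):
        try:
            record = self.backend(user_id)
            self._cache[user_id] = (record, time.monotonic())
            return record
        except Exception:
            cached = self._cache.get(user_id)
            if cached is not None:
                record, fetched_at = cached
                if time.monotonic() - fetched_at <= self.max_staleness:
                    return record     # accept bounded staleness for availability
            raise                     # beyond the bound, fail closed
```

The `max_staleness` knob is exactly where the security goal and the resilience goal meet: the longer the bound, the more resilient and the less strictly consistent the system becomes.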
Finding the needle in your haystack of metrics
"Parts of the previous quota system were left in place which incorrectly reported the usage for the User ID Service as 0".
There usually isn't a shortage of monitoring data that could point out accidents about to happen; rather, there is a shortage of attention that can be given to each individual metric. In this case, it's very likely that a certain metric (the quota usage for the User ID Service) suddenly dropped to zero nearly two months before the incident happened. As such data is captured and fed into central monitoring systems, the challenge for Google (as well as for any other organization that operates complex technology) is to keep track of the semantics and relevance of the data, and to have processes and systems in place that detect anomalies and raise them to the attention of humans or systems that can assess and fix the problem.
There is a lot of industry interest in AIOps and anomaly detection at the moment, and undoubtedly this instance of a time series that suddenly flatlines to zero could have been caught with a fairly simple anomaly detection system. But the harder problem to solve is that of relevance and dependencies: how would an AIOps system develop an understanding of the relevance of a "usage quota" metric, and could it predict the impact of that metric on other metrics and outcomes, such as the quota allocated once the migration grace period expires?
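The easy half of that problem is indeed easy; a sketch of a detector for a series that flatlines to zero against a clearly nonzero baseline might look like this (the window size and threshold are arbitrary choices here):

```python
def flatlined_to_zero(series: list[float], recent_window: int = 24,
                      baseline_min: float = 1.0) -> bool:
    """Flag a metric whose recent values are all zero although its earlier
    baseline was clearly nonzero, e.g. a usage counter that silently stopped."""
    if len(series) <= recent_window:
        return False
    baseline, recent = series[:-recent_window], series[-recent_window:]
    baseline_avg = sum(baseline) / len(baseline)
    return baseline_avg >= baseline_min and all(v == 0 for v in recent)

# Hypothetical hourly usage samples: healthy traffic, then a sudden drop to zero.
usage = [120.0, 115.0, 130.0] * 20 + [0.0] * 24
print(flatlined_to_zero(usage))   # True
```

The hard half, knowing that this particular flatline will cause authentication to fail two months later, is exactly what such a simple detector cannot tell you.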
Keeping emergency communication systems fully independent
"CloudSupport's internal tools were impacted, which delayed our ability to share outage communications with customers […]. Customers were unable to create orview cases in the Cloud Console."
There are many examples of organizations having to take to Twitter to explain to customers why their services were down, and it is obviously less convenient to decouple internal ticketing and communication systems from, say, the customer database than to keep them integrated. Still, the rise of externally hosted "status page" apps shows that most organizations have realized the benefits of being able to keep up some kind of organized information flow even during a major incident.
Arguably, the recovery wouldn't have been any faster, and the impact on customers wouldn't have been any smaller, if Google had been able to tell them that there was an incident, or if customers had been able to report the incident themselves. Still, it's good practice to keep incident support systems (as well as any internal communication tools) as independent as possible from other production systems, and to consider them part of the emergency communication setup that needs to stay up even in the most dire circumstances.
Humans always need time to resolve an incident
"This was detected by automated alerts for capacity at 2020-12-14 03:43 US/Pacific, and for errors with the User ID Service starting at 03:46, which paged GoogleEngineers at 03:48 within one minute of customer impact. At 04:08 the root cause and a potential fix were identified, which led to disabling the quota enforcement in one datacenter at 04:22."
While 36 minutes is a remarkably fast incident recovery time for an issue rooted so deeply in a complex system, it takes time to get people's attention, for them to get an overview of impact and possible root causes, to organize as a team, to test hypotheses, and to develop and decide on remediation plans. That's the case even in the best companies. Once automated recovery procedures fail, human-driven incident resolution is going to take at least half an hour - better to plan for it.
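For perspective, the arithmetic on the timestamps quoted above (the timestamps come from the report; the breakdown is just illustration):

```python
from datetime import datetime

timeline = {
    "capacity_alert": "03:43",
    "user_id_errors": "03:46",
    "engineers_paged": "03:48",
    "root_cause_identified": "04:08",
    "quota_enforcement_disabled": "04:22",
}
t = {name: datetime.strptime(ts, "%H:%M") for name, ts in timeline.items()}

print("errors -> mitigation:", t["quota_enforcement_disabled"] - t["user_id_errors"])          # 0:36:00
print("page -> root cause:  ", t["root_cause_identified"] - t["engineers_paged"])              # 0:20:00
print("root cause -> fix:   ", t["quota_enforcement_disabled"] - t["root_cause_identified"])   # 0:14:00
```

Even with near-instant detection and paging, diagnosis and mitigation together took well over half an hour; that is the floor to plan for.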
Keeping systems safe
With more knowledge of Google's internal systems and processes, there would be a lot more to learn from this incident. Coming from the company that invented SRE and the blameless postmortem, it's no surprise that the incident report mentions various improvements that need to be made. I'm confident that Google will take the lessons from the incident seriously.
Does your organization take the time to study incident root causes and learn from them?