
Learn from these Microsoft Azure outage postmortem takeaways

Microsoft has shed more light on last week's major Azure outage, in postmortems that generally confirm what everyone already knew: a storm near the Azure South Central US region knocked cooling systems offline and shut down systems that took days to recover because of issues with the cloud platform's architecture.

But the reports also illuminate the scope of the hardware damage, the infrastructure dependencies that crippled services, and plans to increase resiliency for customers.

What we know now

The storm damaged hardware. Multiple voltage surges and sags in the utility power supply caused part of the data center to transfer to generator power and knocked the cooling system offline, despite the existence of surge protectors, according to Microsoft's overall root-cause analysis (RCA). A thermal buffer in the cooling system eventually depleted and temperatures rose quickly, which triggered an automated shutdown of systems.

But that shutdown wasn’t soon enough. “A significant number of storage servers were damaged, as well as a small number of network devices and power units,” according to the company.

Microsoft will now look for more environmentally resilient storage hardware designs, and try to improve its software to help automate and accelerate recovery efforts.

Microsoft wants more zone redundancy. Earlier this year, Microsoft introduced Azure Availability Zones, defined as one or more physical data centers in a region, each with independent power, cooling and networking. AWS and Google already broadly offer these zones, and Azure provides zone-redundant storage in some regions, but not in South Central US.
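For teams in regions where it is available, zone-redundant storage is an account-level setting. Here's a minimal sketch of creating a ZRS storage account with the Azure SDK for Python, assuming the azure-identity and azure-mgmt-storage packages are installed; the subscription ID, resource group, account name and region are placeholders, and exact method names can vary across SDK versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.storage import StorageManagementClient

client = StorageManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",       # placeholder
)

# Standard_ZRS replicates data across availability zones within a region,
# so losing a single zone (or data center) does not lose the data.
poller = client.storage_accounts.begin_create(
    resource_group_name="my-resource-group",   # placeholder
    account_name="myzrsaccount",               # placeholder, globally unique
    parameters={
        "location": "centralus",               # a region that offers zones
        "kind": "StorageV2",
        "sku": {"name": "Standard_ZRS"},       # zone-redundant SKU
    },
)
account = poller.result()                      # wait for provisioning
print(account.name, account.sku.name)
```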

For Visual Studio Team Services (VSTS), this was the worst outage in its seven-year history, according to the team's postmortem, written by Buck Hodges, VSTS director of engineering. VSTS customers are hosted in 10 regions globally, including the affected one, and many of those regions don't have availability zones. Going forward, Microsoft will enable VSTS to use availability zones and move the service into regions that support them, though it won't move out of geographies where customers have specific data sovereignty requirements.

Service dependencies hurt everyone. Various Azure infrastructure and systems dependencies harmed services outside the region and slowed recovery efforts:

  • The Azure South Central US region is the primary site for Azure Service Manager (ASM), which customers typically use for classic resource types. ASM does not support automatic failover, so ASM requests everywhere experienced higher latencies and failures.
  • Authentication traffic from Azure Active Directory automatically rerouted to other regions, which triggered throttling mechanisms and created latency and timeouts for customers in other regions.
  • Many VSTS services depend on other services hosted in the affected region, which led to slowdowns and inaccessibility for several related services.
  • Dependencies on Azure Active Directory and platform services affected Application Insights, according to the group’s postmortem.

Microsoft will review these ASM dependencies, and determine how to migrate services to Azure Resource Manager APIs.
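As a rough illustration of the target model, here's a minimal sketch of an ARM-style call with the Azure SDK for Python, assuming the azure-identity and azure-mgmt-resource packages are installed; it simply creates a resource group through the Resource Manager APIs rather than the classic ASM path, and all names are placeholders. This is not Microsoft's migration plan, just what code written against ARM looks like.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

client = ResourceManagementClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",        # placeholder
)

# Create (or update) a resource group through the ARM APIs; the classic
# ASM model it replaces has no automatic failover, per Microsoft's RCA.
group = client.resource_groups.create_or_update(
    "my-resource-group",                        # placeholder
    {"location": "southcentralus"},
)
print(group.id)
```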

Time to rethink replication options? The VSTS team further explained its failover options: wait for recovery, or access data from a read-only backup copy. The latter option would restore access sooner but mean some data loss, and because the copy is read-only, users of services such as Git, Team Foundation Version Control and Build would be unable to check in, save or deploy code.

Synchronous replication ideally prevents data loss in failovers, but in practice it's hard to do. All services involved must be ready to commit data and respond at any point in time, and that's not possible, the company said.
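To make the trade-off concrete, here's an illustrative sketch, not Azure code: a synchronous write is acknowledged only after every replica commits, so one slow or unreachable replica stalls all writes, while an asynchronous write is acknowledged once the primary commits, leaving in-flight records at risk on failover. The Replica class and queue here are hypothetical stand-ins.

```python
import queue

class Replica:
    """Hypothetical stand-in for a storage endpoint."""
    def __init__(self, name):
        self.name = name
        self.log = []

    def commit(self, record):
        # In reality this is a network round trip plus a disk write,
        # either of which can stall or fail.
        self.log.append(record)

def write_sync(primary, secondaries, record):
    # Synchronous: block until every replica has committed. No data is
    # lost on failover, but the slowest replica gates every write.
    primary.commit(record)
    for s in secondaries:
        s.commit(record)
    return "acknowledged: all replicas hold the record"

def write_async(primary, replication_queue, record):
    # Asynchronous: acknowledge once the primary commits. A background
    # replicator drains the queue, so records still queued when the
    # primary fails are lost.
    primary.commit(record)
    replication_queue.put(record)
    return "acknowledged: secondary may lag behind"

primary = Replica("south-central-us")
secondary = Replica("north-central-us")
print(write_sync(primary, [secondary], {"id": 1}))
print(write_async(primary, queue.Queue(), {"id": 2}))
```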

Lessons learned? Microsoft said it will reexamine asynchronous replication, and explore active geo-replication for Azure SQL and Azure Storage to asynchronously write data to primary and secondary regions and keep a copy ready for failover.

The VSTS team also will explore how to let customers choose a recovery method based on whether they prioritize faster recovery and productivity over potential data loss. The system would indicate whether the secondary copy is up to date, and customers would manually reconcile any lost data once the primary data center is back up and running.
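One hypothetical shape for that choice, sketched below: surface how far the read-only secondary lags, and let the customer decide between failing over now or waiting for the primary. The SecondaryCopy class and the staleness threshold are illustrative, not a VSTS API.

```python
from datetime import datetime, timedelta, timezone

class SecondaryCopy:
    """Illustrative read-only secondary with a replication watermark."""
    def __init__(self, last_replicated_at):
        self.last_replicated_at = last_replicated_at

def choose_recovery(secondary, prefer_fast_recovery,
                    max_staleness=timedelta(minutes=5)):
    # Show the customer how stale the secondary is, then honor their
    # preference: productivity now, or zero data loss later.
    lag = datetime.now(timezone.utc) - secondary.last_replicated_at
    if lag <= max_staleness:
        return f"fail over: secondary is effectively current (lag {lag})"
    if prefer_fast_recovery:
        return (f"fail over: accept ~{lag} of lost changes, "
                "reconcile once the primary recovers")
    return "wait for the primary region to recover (no data loss)"

copy = SecondaryCopy(datetime.now(timezone.utc) - timedelta(minutes=12))
print(choose_recovery(copy, prefer_fast_recovery=True))
```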
