Microsoft's struggles to recover from its most severe Azure outage in years should serve as a reminder that IT shops must take disaster recovery into their own hands.
Severe weather in the early hours of Sept. 4 caused a power spike at Microsoft's Azure South Central U.S. region in San Antonio. That surge hit the cooling systems, and the resulting rise in temperature triggered automatic hardware shutdowns. Nearly three dozen cloud services, as well as the Azure status page, bore the brunt of the storm.
Microsoft restored power later in the day. But even up to two days later, some services were still not operational. The company said U.S. customers may encounter "data access and latency issues." And as of 7 p.m. EST on Sept. 6, Microsoft had yet to declare an ETA for total recovery.
Microsoft declined to comment, and it pointed to the Azure support Twitter feed for updates. The Azure status page was updated every few hours, with promises to continue to restore and improve service availability. As of 5 p.m. EST, the company said the majority of services were available.
Eye on Microsoft's 'infrastructure dependencies'
A source with direct knowledge of Azure's infrastructure said the problem stemmed largely from extensive failover to Azure data centers, driven both by internal Azure systems and by customers' attempts to redeploy services manually.
Visual Studio Team Services (VSTS) relies heavily on the Azure South Central U.S. region, according to the product group's social media updates. The outage's scope also extended to customers outside that particular region.
Microsoft said it will likely publish a postmortem about this Azure outage shortly, but a corrective action plan (CAP) could take several weeks, the Azure infrastructure source said. Cloud providers have revealed more of late in their incident postmortems, as Microsoft itself exemplified earlier this year. However, cloud providers are typically not transparent about describing the inner workings of their infrastructure.
The VSTS team also described unspecified "internal infrastructure dependencies" that made the outage worse. Both the domain name system and Azure Active Directory are widely distributed across all data centers, the Azure infrastructure source said.
It's unclear to what extent Microsoft understands those unspecified internal systems' dependencies and their potential impact. Microsoft will watch the ripple effect of Azure Active Directory dependencies as a result of this outage, the source said. "I am curious to see what the CAP will be for that," he said.
At about 5 p.m. EST on Sept. 6, Microsoft acknowledged more issues with Azure Active Directory that cause "intermittent client-side authorization failures" for multiple services, though it was unclear if this was related to the outage.
Azure catches up with availability zones
Public cloud disruptions for extended periods are rare, but they happen. In the past year, Microsoft has stumbled in Europe, AWS had an S3 outage, and Google had occasional hiccups. But it's been nearly three years since an Azure incident of this magnitude.
Much of the problem lies in how Microsoft has built out its public cloud architecture, where most Azure regions consist of a single data center, said Lydia Leong, an analyst at Gartner.
By contrast, AWS and Google deploy multiple availability zones in a region, and any hit to a regional data center won't wipe out the whole region. AWS said it has 55 availability zones in its 18 geographic regions, with at least three physically separate data center locations in each region.*
Earlier this year, Microsoft opened up its first Azure Availability Zones with three physically separate locations, but this level of resiliency is unavailable for most regions, including the South Central U.S. region. Beyond the task of building and connecting more data centers, Microsoft must also modify its software to accommodate a multi-availability-zone architecture, Leong said.
Single data centers concentrate the risk of failures from many events, but providers have options to mitigate problems. Data center power availability or quality events are not uncommon, but multiple utility power lines, isolation transformers or voltage suppressors can address these issues.
Redundancy for control systems is also fairly common, said Chris Brown, CTO of the Uptime Institute in Seattle. However, the biggest mitigation plan is on-site personnel to provide manual intervention when something goes down. Keep in mind that this Azure outage occurred at 2:30 a.m. local time.
"A lot of hyperscalers run very dense workloads, and temperatures can rise rapidly and reduce the time to respond," Brown said. "The biggest protection from that is trained, capable personnel on site."
Take control of your cloud reliability
For customers, it's another reminder that it's their responsibility to ensure their workloads run smoothly in the cloud and not to rely entirely on the provider for disaster recovery. That means building resiliency into their applications and deploying across multiple regions so that one region's failure doesn't take the workload down.
"Disaster recovery is not something the provider does in the cloud," Gartner's Leong said. "If you want to run in a single data center [cloud region], it's just like on premises: [You need] DR recovery."
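As a rough illustration of what application-level resiliency can look like, the minimal Python sketch below tries a list of regions in priority order and falls back to the next one when a call fails. It is an assumption-laden example, not anything Microsoft or Gartner prescribes: the function names, the region labels and the `fake_request` stand-in are all hypothetical, and a real deployment would also need replicated data, health checks and DNS or traffic-manager failover.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when none of the configured regions could serve the request."""


def call_with_failover(
    regions: Sequence[str],
    request: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each region in priority order; return (region, response)."""
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return region, request(region)
        except Exception as exc:  # a real client would catch narrower errors
            errors[region] = exc
    raise AllRegionsFailed(errors)


def fake_request(region: str) -> str:
    """Hypothetical stand-in for a service call; the primary region is down."""
    if region == "south-central-us":
        raise ConnectionError("region unavailable")
    return "ok from " + region


if __name__ == "__main__":
    # The primary region fails, so the call is served by the secondary.
    print(call_with_failover(["south-central-us", "east-us"], fake_request))
```

The request function is passed in as a parameter so the failover logic can be exercised without touching a network. With only one region in the list, the same outage surfaces as `AllRegionsFailed`, which is the single-data-center case Leong describes.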
* Changed after publication as region, not zone.