
Azure outage spotlights cloud infrastructure choices

One difference in Microsoft's cloud infrastructure design may have contributed to the extended outage this week in an Azure region that hit a large number of customers.

Microsoft's struggles to recover from its most severe Azure outage in years should serve as a reminder that IT shops must take disaster recovery into their own hands.

Severe weather in the early hours of Sept. 4 caused a power spike at Microsoft's Azure South Central U.S. region in San Antonio. The surge hit the cooling systems, and the subsequent rise in temperatures triggered automatic hardware shutdowns. Nearly three dozen cloud services, as well as the Azure status page, bore the brunt of the storm.

Microsoft restored power later in the day, but some services were still not fully operational two days later. The company said U.S. customers may encounter "data access and latency issues," and as of 7 p.m. EST on Sept. 6, Microsoft had yet to give an ETA for total recovery.

Microsoft declined to comment and instead pointed to the Azure support Twitter feed for updates. The Azure status page was updated every few hours with promises to continue to restore and improve service availability. As of 5 p.m. EST, the company said the majority of services were available.

Eye on Microsoft's 'infrastructure dependencies'

A source with direct knowledge of Azure's infrastructure said the prolonged recovery was largely due to extensive failover load on other Azure data centers, from both internal Azure systems and customers' manual attempts to redeploy services.

Visual Studio Team Services (VSTS) relies heavily on the Azure South Central U.S. region, according to the product group's social media updates. The outage's scope also extended to customers outside that particular region.

Microsoft will likely publish a postmortem about this Azure outage shortly, but a corrective action plan (CAP) could take several weeks to produce, the Azure infrastructure source said. Cloud providers have revealed more of late in their incident postmortems, as Microsoft itself exemplified earlier this year. However, cloud providers are typically not transparent about the inner workings of their infrastructure.

The VSTS team also cited unspecified "internal infrastructure dependencies" that made the outage worse. Both the Domain Name System (DNS) and Azure Active Directory are widely distributed across all data centers, the Azure infrastructure source said.

It's unclear to what extent Microsoft understands those unspecified internal systems' dependencies and their potential impact. Microsoft will watch the ripple effect of Azure Active Directory dependencies as a result of this outage, the source said. "I am curious to see what the CAP will be for that," he said.

At about 5 p.m. EST on Sept. 6, Microsoft acknowledged additional issues with Azure Active Directory that caused "intermittent client-side authorization failures" for multiple services, though it was unclear whether these were related to the outage.

Azure catches up with availability zones

Extended public cloud disruptions are rare, but they happen. In the past year, Microsoft has stumbled in Europe, AWS had an S3 outage, and Google had occasional hiccups. But it's been nearly three years since an Azure incident of this magnitude.

Much of the problem lies in how Microsoft has built out its public cloud architecture, in which most Azure regions consist of a single data center, said Lydia Leong, an analyst at Gartner.


By contrast, AWS and Google deploy multiple availability zones in a region, and any hit to a regional data center won't wipe out the whole region. AWS said it has 55 availability zones in its 18 geographic regions, with at least three physically separate data center locations in each region.*

Earlier this year, Microsoft opened its first Azure Availability Zones with three physically separate locations, but this level of resiliency is unavailable in most regions, including South Central U.S. Beyond building and connecting more data centers, Microsoft must also modify its software to accommodate a multi-availability-zone architecture, Leong said.
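The zone-aware software changes Leong describes boil down to spreading state across failure domains. A minimal sketch, purely illustrative and using no Azure API (the zone names, replica counts and quorum size are assumptions), of why a three-zone spread survives the loss of any single zone:

```python
# Illustrative only: model three availability zones and spread replicas
# round-robin so that losing any one zone still leaves a quorum alive.
ZONES = ["zone-1", "zone-2", "zone-3"]

def place_replicas(replica_count):
    """Assign replicas to zones round-robin."""
    placement = {zone: [] for zone in ZONES}
    for i in range(replica_count):
        placement[ZONES[i % len(ZONES)]].append(f"replica-{i}")
    return placement

def survives_zone_loss(placement, lost_zone, quorum):
    """True if enough replicas remain outside the lost zone."""
    alive = sum(len(reps) for zone, reps in placement.items() if zone != lost_zone)
    return alive >= quorum

placement = place_replicas(3)
# With one replica per zone, losing any single zone leaves 2 of 3 alive.
print(all(survives_zone_loss(placement, z, quorum=2) for z in ZONES))  # True
```

A single-data-center region is the degenerate case: every replica shares one failure domain, so a single building-level event can take out the entire quorum at once.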

Single data centers concentrate the risk of failure from many types of events, but providers have options to mitigate problems. Power availability and quality incidents are not uncommon in data centers, but multiple utility power lines, isolation transformers or voltage suppressors can address them.

Redundancy for control systems is also fairly common, said Chris Brown, CTO of the Uptime Institute in Seattle. However, the biggest mitigation plan is on-site personnel to provide manual intervention when something goes down. Keep in mind that this Azure outage occurred at 2:30 a.m. local time.

"A lot of hyperscalers run very dense workloads, and temperatures can rise rapidly and reduce the time to respond," Brown said. "The biggest protection from that is trained, capable personnel on site."

Take control of your cloud reliability

For customers, it's another reminder that it's their responsibility to ensure their workloads run smoothly in the cloud, rather than rely entirely on the provider for disaster recovery. That means building resiliency into their applications and adding multiregion deployments for redundancy.

"Disaster recovery is not something the provider does in the cloud," Gartner's Leong said. "If you want to run in a single data center [cloud region], it's just like on premises: [You need] DR recovery."

Ed Scannell, senior executive editor, contributed to this story.

* Changed after publication to region, not zone.


Join the conversation



What cloud services do you use that require high availability?
Please tell me how I was supposed to build resiliency into VSTS and add multiregion availability when it was down across the board. I fail to see how this falls on the customer.
It's down, fingers point to Microsoft and we wait and wait and wait....
Agree. This is a real problem. People think that because you move to the cloud, DR is inherent, and that is a big mistake. Likewise, many cloud providers today sell DR as an added benefit, and that is misleading. DR is specific to each business, and it is up to you to address it.

Your systems should have specific recovery time objectives (RTOs) and recovery point objectives (RPOs) that are defined by and for your business given a disaster. Be sure to communicate the RTOs and RPOs to the cloud provider and ensure they can meet your requirements. You should also ask how often they perform DR failover testing in the cloud and whether you can see the results from their last DR test. In healthcare, per HIPAA, they should be performing this testing at least once a year.

If the cloud is down and will not be back in time to satisfy the RTO of your most critical systems, you must declare a disaster, invoke your DR plans and recover elsewhere. Cloud is just another platform you leverage for resiliency, and you must account for it experiencing a disaster. DR and resiliency are all about not placing all your eggs in one basket.
Thank you for calling this out. Resource group provisioning, configuration etc. was down across the board. I fail to see how anything the customer could have done would have changed this scenario.
You're welcome. The customer must think about whether they are using the best platform(s) and brainstorm how to best protect their business when disasters like this strike. Clearly, they should now add this risk to their business disaster risk profile. DR is about recovering from the unexpected, no matter what.

If the customer had a DR strategy that leveraged Azure as their current environment but could fail over to something like Amazon for DR to protect "critical systems," could that have been leveraged? There may be a lot of complexity here, but I believe this is how organizations should be thinking when it comes to business survival.

If you have a data center, your DR planning worst-case scenario is loss of the entire data center. If your "data center" is Azure, Amazon or any of the other cloud providers, you must plan for the same worst case. The key is to have DR plans and infrastructure in place that have been tested at least once a year, so you know they work. Obviously there is added cost, and leadership must determine whether to invest to mitigate this risk. These are just some thoughts I have on this; hopefully they are useful.