
IT pros disappointed in Microsoft response to Azure outage

Microsoft’s slow response to the recent Azure outage left some users wondering if they should entrust critical business data to the cloud environment.

Many IT pros affected by the global Azure outage this week were disappointed in how long it took Microsoft to deliver an explanation, and some are rethinking their use of Microsoft's cloud services.

The Azure outage, which was caused in part by a performance update to Azure Storage, took some websites down for up to 10 hours in some areas and cut some users off from a variety of Microsoft’s own services, including Visual Studio, Virtual Machines and Search.

Despite the problem being solved and explained, some users gave the company failing grades on crisis management.

"Our site went offline, as did about every other site on Azure, for four hours," said Ray Suelzer, senior Web developer and data guru with Making Change at WalMart, a worker advocacy group. "We lost huge amounts of traffic to our page, and that obviously sucks. But, what I find even worse about this whole experience is how poorly the crisis was managed by Microsoft."

Suelzer and others were at a loss to explain why one of the largest high-tech companies in the world didn’t use its Twitter account to tell the thousands of affected businesses worldwide what the status of the event was, or simply to reply to their messages.

Instead, Microsoft directed people to a status page that said "everything looks good" or "we are having issues, but aren’t sure if they are impacting customers," Suelzer said.

"Even more alarming is the fact that nearly 24 hours after this major outage, I have yet to receive an email from Microsoft explaining the situation," Suelzer said.

The outage gave another user, who is thinking of migrating to Office 365 early next year, pause.

"Microsoft may have ambitions for hosting other people’s businesses, and could have the compute and storage muscle to back that up, but this outage makes me hesitate a bit," said one IT manager with a New York City-based retailer of children’s clothing. "I couldn’t afford to be down for a day and with little explanation to pass on to my customers and suppliers.”

Because many users have Service Level Agreements (SLAs) as part of their Azure contracts, some think they should be compensated to cover the negative impact the outage caused to their business. In their explanation of what caused the outage, Microsoft officials made no mention of such compensation. But a source close to the company said company officials were at least considering a business credit.

"Some internally have argued in favor of a service credit with this sort of out outage because of its business impact," the source said. "Others don’t think the company should be on hook for this given the amount of time most companies were down. Upper management hasn’t decided yet."

Microsoft had difficulty giving users timely and detailed updates on the status of services such as Azure Storage Services because the Service Health Dashboard and Management Portal were themselves unavailable. In his blog post, Microsoft Azure Corporate Vice President Jason Zander pointed out that with the Service Health Dashboard down, updates were not communicated for about the first three hours of the outage.

Some downstream support tools that depend on the Service Health Dashboard were also affected. This limited users’ ability to create new support cases, as well as Microsoft’s ability to update affected users, Zander wrote.

What caused the global Azure outage?

In explaining the cause of the outage, Zander said the company tested the performance enhancement update for Azure Storage for several weeks, and the update produced measurable improvements.

But as the company rolled the update out across its storage service, it uncovered a technical issue that sent blob front ends into an infinite loop, something that didn’t surface during testing. This left the front ends unable to take on additional traffic, which in turn caused a number of issues for other services built on top of Azure Storage.

"Once we detected this issue, the change was rolled back promptly, but a restart of the storage front ends was required in order to fully undo the update. Once the mitigation steps were deployed, most of our customers started seeing the availability improvement across the affected regions," Zander wrote in his blog.

Other Microsoft services that were knocked offline included SQL Import/Export, Websites, Azure Search, Service Bus, Virtual Network, Active Directory, StorSimple and Azure Backup services.


Join the conversation



In 2015, I'm surprised that we're not seeing more end-user companies using tools like "Chaos Monkey" or all of the Simian Army tools - Clouds will fail, just like Data Centers fail, but now there are more ways to prepare yourself for various failures. 
They want us to use the newest technologies, and we adapt to them and move on. Then they fail with issues and leave us hanging. When our day-to-day business is not possible, it makes us consider another choice. It also damages reputations on both sides: their services and ours as well.
I am still seeing major issues with Microsoft's cloud. A number of my systems are consistently losing the connection to it, and when that happens the file gets corrupted. This has to do with Office storing data to the cloud and also the credentials that the systems use.
Thanks for sharing your experiences about the difficulties with Azure. I am curious, have any of you experienced damage to the Windows registries as a result of an outage, particularly the last one which was nasty? And if so, were you able to repair the damage yourself or did you need to call Microsoft tech support in?