Many IT pros impacted by the global Azure outage this week were disappointed in how long it took Microsoft to deliver an explanation, and some are re-thinking their use of Microsoft's cloud services.
The Azure outage, which was caused in part by a performance update to Azure Storage, took some websites down for up to 10 hours in some areas and denied some users access to a variety of Microsoft’s own services, including Visual Studio, Virtual Machines and Search.
Despite the problem being solved and explained, some users gave the company failing grades on crisis management.
"Our site went offline, as did about every other site on Azure, for four hours," said Ray Suelzer, senior Web developer and data guru with Making Change at WalMart, a worker advocacy group. "We lost huge amounts of traffic to our page, and that obviously sucks. But, what I find even worse about this whole experience is how poorly the crisis was managed by Microsoft."
Suelzer and others were at a loss to explain why one of the largest high-tech companies in the world didn’t use its Twitter account to tell the thousands of affected businesses worldwide the status of the event, or simply to reply to their messages.
Instead, Microsoft directed people to a status page that said "everything looks good" or "we are having issues, but aren’t sure if they are impacting customers," Suelzer said.
"Even more alarming is the fact that nearly 24 hours after this major outage, I have yet to receive an email from Microsoft explaining the situation," Suelzer said.
The outage gave pause to another user, who is thinking of migrating to Office 365 early next year.
"Microsoft may have ambitions for hosting other people’s businesses, and could have the compute and storage muscle to back that up, but this outage makes me hesitate a bit," said one IT manager with a New York City-based retailer of children’s clothing. "I couldn’t afford to be down for a day and with little explanation to pass on to my customers and suppliers."
Because many users have Service Level Agreements (SLAs) as part of their Azure contracts, some think they should be compensated for the negative impact the outage had on their business. In their explanation of what caused the outage, Microsoft officials made no mention of such compensation. But a source close to the company said company officials were at least considering a service credit.
"Some internally have argued in favor of a service credit with this sort of outage because of its business impact," the source said. "Others don’t think the company should be on the hook for this given the amount of time most companies were down. Upper management hasn’t decided yet."
What made it difficult for Microsoft to give users a timely and detailed update on the status of services such as Azure Storage was the unavailability of the Service Health Dashboard and Management Portal. In his blog post, Microsoft Azure Corporate Vice President Jason Zander pointed out that with the Service Health Dashboard down, updates were not communicated for about the first three hours of the outage.
Some downstream support tools that depend on the Service Health Dashboard were also affected. This limited users’ ability to create new support cases, as well as Microsoft’s ability to update users who were impacted, Zander wrote.
What caused the global Azure outage?
In explaining the cause of the outage, Zander said the company had tested the performance update for Azure Storage for several weeks, and that it had produced measurable improvements.
But as the company rolled the update out across its storage service, it uncovered a technical issue that sent blob front ends into an infinite loop – something that didn’t surface during testing. This left the front ends unable to take on additional traffic, which in turn caused problems for other services built on top of Azure Storage.
"Once we detected this issue, the change was rolled back promptly, but a restart of the storage front ends was required in order to fully undo the update. Once the mitigation steps were deployed, most of our customers started seeing the availability improvement across the affected regions," Zander wrote in his blog.
Other Microsoft services that were knocked offline included SQL Import/Export, Websites, Azure Search, Service Bus, Virtual Network, Active Directory, StorSimple and Azure Backup Services.