Microsoft recently came clean about the root cause of its Windows Azure outage and promised a 33% reduction in many customers’ monthly bills. While this generally impressed analysts, partners
The company fixed the Azure service disruptions relatively quickly and released a "root cause analysis" on March 9, more than one week after the Feb. 29 Azure outage. Microsoft made efforts to compensate customers beyond what its service-level agreements (SLAs) promise, but some complained those moves were a bit tardy.
Further, one source with access to the Azure team and knowledge of what has become known as the "Leap Year Outage" said it has cost Microsoft some credibility -- and money.
"Azure is losing some customers [over the outage]," the source, who requested anonymity, said, adding that the outage was "not crippling" for users.
But most industry watchers say the outage wasn’t significant enough to cause customers to jump ship.
Roger Jennings, a Windows Azure MVP and developer, said the outage affected only one of his demo apps, which was down for just 35 minutes. "I doubt if Microsoft lost any customers over the leap day outage; the same is true for Amazon with their last extended outage," Jennings added.
One analyst agrees with Jennings.
"I haven't heard of any customers leaving due to this," said Rob Sanfilippo, research vice president at Directions on Microsoft, an independent analysis firm based in Kirkland, Wash.
At time of publication, Microsoft had not responded to questions about whether the company lost customers because of the outage.
The Azure outage began late in the afternoon of Feb. 28 (00:00 Greenwich Mean Time on Feb. 29) when an SSL security certificate that had not been properly coded to deal with the extra day in the month failed, causing a rolling service outage. In response, Microsoft technicians disabled cloud management services globally to keep customers from damaging running processes.
The upshot was that virtual machines (VMs) that were already running continued to run but could not be managed, and new VMs could not be started. Technicians got most of the systems repaired within about 12 hours but, rushing to put a fix in place, they inadvertently caused a secondary outage that stretched over 24 hours for some customers in three major sub-regions -- North Central U.S., South Central U.S., and North Europe.
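To illustrate how a single line of date arithmetic can trip on a leap day, here is a minimal Python sketch. It is not Microsoft's code: the function names are hypothetical, and it assumes an expiry date is computed by bumping the year on the start date, a common pattern that fails on Feb. 29 because the following year has no such date.

```python
from datetime import date


def naive_one_year_later(start: date) -> date:
    """Hypothetical buggy version: keep month and day, add one to the year.

    Raises ValueError when start is Feb. 29, since the next year
    is not a leap year and has no Feb. 29.
    """
    return start.replace(year=start.year + 1)


def safe_one_year_later(start: date) -> date:
    """Hypothetical fixed version: fall back to Feb. 28 when needed."""
    try:
        return start.replace(year=start.year + 1)
    except ValueError:
        # Only Feb. 29 can fail here; clamp to the last valid day of February.
        return start.replace(year=start.year + 1, day=28)


if __name__ == "__main__":
    leap_day = date(2012, 2, 29)
    print(safe_one_year_later(leap_day))
    try:
        naive_one_year_later(leap_day)
    except ValueError as err:
        print("naive version failed:", err)
```

On any other day the two functions agree; only a Feb. 29 start date exposes the difference, which is why a bug of this kind can sit unnoticed for years.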
Ten days later, Bill Laing, corporate vice president of Microsoft's Server and Cloud Division, posted the incident post mortem on the Windows Azure Team Blog. Besides providing a blow-by-blow description of the incident, including the human errors, and discussing the steps Microsoft is taking to avoid future problems, he also announced the customer rebates.
"We have decided to provide a 33% credit to all customers of Windows Azure Compute, Access Control, Service Bus and Caching for the entire affected billing month(s) for these services, regardless of whether their service was impacted," Laing said.
That's a fair offer which, while it doesn't reimburse customers for lost income, is at least more than the payouts guaranteed under Azure's SLA terms, Sanfilippo said. "No data was lost, but it's embarrassing that one line of code could cause an outage."
Anecdotally at least, some customers bear out that assertion.
"We did not lose any income, data or receive any complaints from clients," said John Anastasio, partner and CTO at KGS Buildings, an Azure customer, adding that he was generally "open minded" about the outage.
Mark Eisenberg, director at Microsoft Silver Partner Fino Consulting, said most cloud customers recognize that cloud computing is still nascent and, as early adopters, tend to be more forgiving of outages.
"Coming clean after the fact was the right thing to do," Eisenberg said. "It was just a bad day."
Stuart J. Johnston is Senior News Writer for SearchCloudComputing.com. Contact him at firstname.lastname@example.org.