Cloud computing outages: What can we learn?

When cloud services go down, the IT community takes notice. Find out what went wrong in the most high-profile cloud computing outages and how each company resolved its issues.

When it comes to cloud computing, users and providers alike are still exploring the unforeseen while adapting to these new infrastructures. Inevitably, issues with downtime and unexpected outages will occur. Even the biggest and best cloud companies have seen their best-laid plans go to waste as their services go dark for hours at a time.

In response, this guide on cloud computing outages should shed some light on what's going wrong and how IT managers and users can learn from each incident. We've ranked the outages according to their severity. Yellow is low, orange is medium, and red...well, if you hit red, you'd better have good customer service.

Microsoft | Rackspace | Salesforce.com | Heroku | Terremark | Intuit | Amazon


Even during a beta period, a cloud service can suffer an unexpected outage. Microsoft found this out in March of 2009, when Azure went dark for 22 hours. Only applications being tested out during the trial period were impacted, so nothing major was lost.

The Azure outage came quite early in cloud's growth, but IT managers already knew that planning for disasters and downtime in the cloud was a wise first step. Still, with Azure in its infancy, no one knew then how much impact cloud computing would have on IT, or how much impact outages would have on confidence in the cloud.



Rackspace, the hoster-turned-cloud provider, suffered a major cloud outage in June of 2009, when a breaker flipped, a line of generator backups failed and several racks of servers went down. And yes, that's as bad as it sounds.

To the company's credit, it updated its blog throughout and went to great pains to tweet about the entire experience, but its detractors responded in kind with an oft-used #rackspacefail Twitter hashtag.


But when Rackspace suffered its next big outage in November of 2009, there was no such response. In fact, when one of its customers was given a chance to publicly slander the provider post-outage, he instead described it as "no big deal." That means Rackspace either lucked out or continued to provide adequate updates and quick fixes.

Sachin Agarwal, co-founder of blogging service provider Posterous, spoke about the incident after the outage took his business offline for 15 to 20 minutes. Agarwal was not upset, however, saying that Rackspace is "very transparent" and dealt with the issue immediately.

In this case, a mild outage brought about reassurance and a good PR bump for the company. If no serious data is lost and the service returns quickly, a happy customer usually remains happy. For all the talk of "100% uptime," most users don't seem to be thrown off by an occasional minor incident. As long as they don't pile up, of course...
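To put "100% uptime" in perspective, the arithmetic behind an availability figure is simple. The following is a minimal sketch, assuming a 30-day measurement window (an assumption, not a term from any provider's service agreement) and using two outage durations quoted in this guide. The function name is hypothetical.

```python
def availability_percent(downtime_minutes: float, period_days: int = 30) -> float:
    """Return the percentage of the period during which the service was up."""
    total_minutes = period_days * 24 * 60
    if not 0 <= downtime_minutes <= total_minutes:
        raise ValueError("downtime must fall within the measurement period")
    return 100.0 * (1 - downtime_minutes / total_minutes)


# Outage durations taken from this guide, converted to minutes.
outages = {
    "Posterous on Rackspace (upper estimate, 20 minutes)": 20,
    "Azure beta (22 hours)": 22 * 60,
}

for name, minutes in outages.items():
    print(f"{name}: {availability_percent(minutes):.3f}% over 30 days")
```

Under these assumptions, a 20-minute outage still leaves a month at roughly 99.95% availability, while a 22-hour outage drops it to about 96.9%. That gap goes some way toward explaining why a short, well-communicated incident rarely costs a provider its customers.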



In January of 2010, almost all of Salesforce.com's 68,000 customers suffered at least an hour of downtime.

The company reported a "systemic failure" in its data center, with everything, including backup, going down for a brief period. The outage shed some unwanted light on Salesforce.com's lock-in policy: Force.com, its Platform as a Service (PaaS) offering, cannot be used outside of Salesforce.com. So when Salesforce.com has issues, so does Force.com. This can get a bit sticky when service is interrupted for an extended period of time.

The outage didn't hurt the company much, though; its VMforce collaboration with VMware stirred up quite a buzz in the spring, and Marc Benioff bragged not a month past the outage that Salesforce.com was the "biggest enterprise cloud computing company." We don't think they're sweating it too much.



Heroku, a PaaS for the Ruby programming language, had an estimated 44,000 running applications stall when its $20,000 worth of high-capacity Amazon EC2 instances went down in January of 2010.

Amazon had the instances up again in an hour, but Heroku product developer Oren Teich had already learned his lesson. Heroku was running all of its instances in a single availability zone, which left the service exposed to complete disruption if that zone failed, and the lack of established best practices for cloud computing meant an outage like this baffled the company.

"For us, there's the stuff you plan for and then there's the stuff you don't even know about," Teich said.

Heroku learned a lesson the hard way, and while Teich couldn't fault Amazon's support post-outage, he figured out that caution is the prime directive when dealing with cloud services.
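The single-zone risk Heroku ran into can be illustrated with a small simulation. This is a minimal sketch, not Heroku's or Amazon's actual tooling: the zone names, instance names and both helper functions are hypothetical, and no cloud API is called. It assigns instances to zones round-robin and reports what fraction keeps running if one zone goes dark.

```python
from itertools import cycle


def place_instances(instance_ids, zones):
    """Assign instances to zones round-robin; returns {instance_id: zone}."""
    if not zones:
        raise ValueError("at least one zone is required")
    return dict(zip(instance_ids, cycle(zones)))


def surviving_fraction(placement, failed_zone):
    """Fraction of instances still running if failed_zone goes dark."""
    if not placement:
        return 0.0
    alive = sum(1 for zone in placement.values() if zone != failed_zone)
    return alive / len(placement)


# Hypothetical instance and zone names, for illustration only.
instances = [f"app-{n}" for n in range(12)]
single_zone = place_instances(instances, ["zone-a"])
multi_zone = place_instances(instances, ["zone-a", "zone-b", "zone-c"])

print(surviving_fraction(single_zone, "zone-a"))  # nothing survives
print(surviving_fraction(multi_zone, "zone-a"))   # two thirds survive
```

With everything in one zone, a single zone failure takes out every instance. Spread across three zones, the same failure leaves two thirds of the instances running, which is the difference between a full outage and degraded service.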



Back in March, VMware partner Terremark put the future of vCloud Express in peril after a seven-hour outage that one affected customer said was caused by "connectivity loss." The outage only hit a reported 2% of Terremark users, but those whose services went down expressed extreme displeasure at how the provider handled the situation.

John Kinsella, founder of Terremark customer Protected Industries, called his provider a "mom-and-pop hosting company" while ranting about how the outage left him cold. In fact, Kinsella even unfavorably compared Terremark's response to Amazon's, which is telling when you factor in Amazon's early struggles with status reports and service alerts.

Of course, after months of vCloud Director hype and the excitement over its unveiling at VMworld 2010, the Terremark outage appears to have left behind little fallout.



When Intuit's online accounting and development services crashed in June of this year, the company was left scratching its head. Every aspect of Intuit's online presence, including its main website, was out for almost two full days, and customers were amazed that such a comprehensive outage could occur in this era of backup plans and disaster recovery tools.

But that wasn't all. Roughly one month later, Intuit's QuickBooks Online service went down after a power failure. This particular outage was only a few hours, but two relatively high-profile periods of downtime in 30 days is pretty substantial in the early days of cloud computing.

But even though some users were calling for "pitchforks and torches" to be brandished, Intuit is still operating with four million customers and continuing its evolution into a PaaS and Web services provider. The company doesn't have the stature, or the post-outage uproar, of an Amazon or a Rackspace; it's mostly known for Quicken.



But all other cloud computing outages are child's play when compared to an Amazon Web Services outage. The granddaddy of all cloud service providers, Amazon has suffered its fair share of service interruptions and bona fide disasters in the last several years.

A freak accident in June of 2009 left some customers without Amazon EC2 service for five hours, but most of them seemed to chalk it up to growing pains. This oddly cheerful response didn't last, of course, after a distributed denial-of-service attack and a lengthy email blackout made Amazon's disaster response coordination and customer relations appear to be lacking.


Another freak accident involving a lightning storm near Amazon's Virginia data center brought systems down for six hours but also showed the company's evolution; Amazon scored major points with Jim Melvin, president of Apparent Networks. He gave Amazon high marks for response time, indicating that the company might be learning its lesson when it comes to outages.


But as cloud computing continues to evolve and expand, issues are bound to persist. A series of seemingly unrelated incidents in May, again at Amazon's Virginia data center, led to three different outages over the span of a week. The first involved an uninterruptible power supply (UPS) failing to switch to backup power and knocking out a rack of servers; the second occurred four days later when a power distribution panel short-circuited and knocked out service for eight hours. Finally, a vehicle hit a utility pole two days later and cut off power to the data center for a half hour. Unrelated or not, minor or otherwise, three outages in such a short time frame is a big deal for any provider.


But throughout all this, most customers seemed to keep an open mind towards Amazon Web Services. They accepted that the complexity of Amazon's technologies would lead to unexpected issues, and more importantly, they referenced the value of working in Amazon's reasonably priced cloud environment.

Amazon, in turn, lived up to their faith and showed its maturity by offering a picture-perfect response to an April 2010 outage. A lengthy blog post was released, the AWS status page was updated periodically with information, and a wrap-up message examined the reason behind the outage and how it was resolved.



As several cloud computing customers noted during the aforementioned incidents, outages such as these frequently occur in company data centers. The difference is that those involve in-house, well-understood technologies, not popular, expanding, relatively unknown entities like cloud computing.

The cloud isn't perfect, and more outages will occur down the line. All the top companies can do is study what went wrong and correct the issues, lest a young upstart come along with a better track record and usurp their position as leading cloud providers.

