Last week's Amazon Web Services outage has exposed a deep divide between what users have come to expect from online services and what cloud computing is able to deliver.
A dramatic scenario that was a microcosm of the regulation-necessitating telecom crises of the last century, Amazon's lengthy Elastic Block Store (EBS) outage took out not only popular websites like Reddit but also an array of online service providers like Heroku that do not have alternative sources to draw on.
"The problem is we have no alternative location to host our database or anything else," said Lee Buescher, CEO of Architectural Overflow, a firm that sells blueprints and architectural planning services to builders and architects around the country.
Buescher built his business on Heroku, which provides an application platform that doesn't require him to run or maintain his own servers. He said that, until last week, the convenience and the cost were both worth it.
Now his confidence is badly shaken; Buescher says he's still bullish on cloud but mistakenly assumed that Heroku, recently bought by Salesforce.com for $212 million, had its ducks in a row when it came to operations. But the cloud development platform is just as beholden to a crisis at its own provider, Amazon Web Services (AWS), as he is to Heroku.
"I assumed incorrectly they knew what they were doing," Buescher said. "I've never been out that long. I host other [Ruby on Rails] applications and we have never had outages."
He added that he sympathizes with Heroku's troubles and has some empathy for AWS operators that had to fix the cascade of failures, since he has a background in data center operations. At the same time, this is not something that should have transpired.
"I have no question those guys were losing a lot of sleep, so I feel bad for them," he said. "But at the same time, I'm paying for it, so yeah, I expect it to work."
Buescher said he's actively looking at alternatives to Heroku, including hosting his own servers and a disaster recovery plan for his applications, but he's primarily waiting for Heroku to announce how it's going to prevent this from happening again. He's most concerned about the lack of communication.
Buescher is a user of many online services, such as ERP Software as a Service (SaaS) NetSuite and collaboration service 37signals.com, but they've shown a dramatically different level of consideration for users when things go pear-shaped.
"They have an outage and they're communicating, even though their stuff isn't business critical," he said.
Amazon users express outage-related outrage
Buescher (and many others, judging by comments and reactions on Amazon support forums and social media sites) is most frustrated by the lack of consideration and outreach from Amazon during the outage. He said everyone is going to have an outage -- that's life -- but the feeling of uncertainty and lack of communication are deeply demoralizing. He added that he didn't really understand why the levels of customer response he got from his other service providers, and the levels of response he tried to give his own customers, didn't apply to AWS.
That uncertainty led some AWS customers to resort to grotesque tricks to get attention and support: one user posted that the outage was a potentially life-threatening medical emergency for hundreds of cardiac patients.
"Sorry, I could not get through in any other way," read the post. The user said they ran three instances on AWS to monitor ECG signals for hundreds of home patients and hadn't been able to see those signals since April 21.
The shocking post drew an intense barrage of criticism from other users for such an astounding display of negligence. Medical IT systems, especially those that may be involved in a life-threatening event, are held to very high standards for redundancy and availability; if ever found out, the user could have been criminally negligent. That user later backed off the claims that the monitoring was critical or that patients could have been put in danger, and then disappeared, but the drama highlights the desperation users felt during the outage due to the lack of response from AWS. AWS staff did not respond to the user's original post.
Others used Amazon's forums to discuss alternative providers and services like Rackspace and Azure. The consensus was that there were viable options that might include the missing element of properly handled business support.
"Feature-wise, Azure is getting close to AWS," said Stephane Legay, CTO of online collaboration service NextSlide. "They've got queuing, storage, CDN, relational DB as a service, distributed caching as a service, compute nodes, EBS-equivalent (Azure drive) and now video streaming figured out, and Microsoft support is usually pretty damn good."
Almost a week after the outage began, confusion and disappointment over how AWS handled it are still a problem. Amazon posted somewhat-detailed regular reports on its status page (which runs from Amazon.com, by the way; the retail giant's website doesn't run on the AWS cloud), but has not made a public statement about the problem that, during its 12-hour peak, might have knocked out thousands of websites and businesses.
Thorsten von Eicken, CTO and co-founder of cloud management service provider RightScale, criticized the status updates in a blog post. He said, in part, that they led RightScale to make suboptimal decisions about how to respond to the crisis.
"In hindsight, we should have intentionally failed-over our master database, which was operating in the 'impacted availability zone,' early on," he said, but Amazon did not provide enough details to make that determination.
Amazon's cloud outage in detail
The outage began April 21 at around 1:30 AM PT (8:30 AM GMT) and knocked out the EBS services at Amazon's flagship data center in Ashburn, Virginia. Somewhere around 12 hours later, AWS said that all its Availability Zones but one had been restored and were operating normally. By 7 PM on April 24, it said the service was stable and the recovery process was ongoing, but it wasn't until April 25 that AWS marked the US-EAST Region as operating correctly (although there is an ongoing snarl in the Relational Database Service).
There's a minuscule number of Relational Database Service users as compared to EBS; why was it so important? EBS is an add-on feature to Amazon's primary compute and storage clouds that users really like, but it adds a potentially fatal weakness into the AWS environment.
EBS provides stateful virtual storage volumes that users can keep even as they start and stop virtual machine instances. It's much more useful to a user than Amazon's Simple Storage Service (S3); as far as an operating system is concerned, EBS works just like a hard drive. Users fire up a virtual machine instance and connect their floating EBS volumes, as if they were plugging in an external hard drive. Instances can be started with plenty of local storage, but that disappears when the instance disappears, making scalability problematic for many types of applications.
Consequently, EBS has become a staple of EC2 users as a kind of replacement for the storage area network (SAN). But, as many have pointed out, cloud computing is not a traditional data center in design or practice. Replicating old architectures on something like AWS means that unexpected weaknesses are exposed. Numerous Amazon-based-or-bred cloud veterans and providers have reminded the IT world of this fact, but it's apparently not an easy lesson.
Look before you leap into the cloud
"EBS is not a SAN," wrote Stephen Nelson-Smith, technical director at IT solutions provider Atalanta Systems. He said users had to see EBS for what it was: a mockup of traditional, highly available fiber-backed SANs. It runs entirely over the network; if the network is saturated, your storage is not available. He added that sites like Reddit, hit hard by the outage, had a massive, improper and dangerous overreliance on EBS that came around to bite them. Other victims, like Heroku, apparently just had the bad luck to have their instances in the worst-hit part of the infrastructure.
Nelson-Smith said users had to look at everything Amazon offered for reliability services and think for themselves when it came to expectations of service, something echoed by many.
"In short, if your systems failed in the Amazon cloud this week, it wasn't Amazon's fault," said George Reese, CTO of cloud management toolmaker EnStratus, in a blog post. "You either deemed an outage of this nature an acceptable risk or you failed to design for Amazon's cloud computing model."
Reese said that proper use of AWS requires that you "design for failure," i.e., expect and anticipate disasters, and do not expect AWS to help you with any issues.
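As a rough illustration of what "design for failure" can mean in application code, the following Python sketch tries an operation against a list of zones in order and moves on when one is unreachable. It is only a sketch under stated assumptions: the zone names, `call_with_failover` and `make_fake_store` are hypothetical names invented for this example, the store is simulated, and nothing here is an AWS API or the approach of any company quoted in this article.

```python
from typing import Callable, Dict, Sequence, Set


class AllZonesFailed(Exception):
    """Raised when every zone in the preference list has failed."""


def call_with_failover(zones: Sequence[str],
                       operation: Callable[[str], str]) -> str:
    """Run `operation` against each zone in order; return the first success.

    A ConnectionError from one zone is treated as expected, not exceptional:
    it is recorded and the next zone is tried.
    """
    errors: Dict[str, Exception] = {}
    for zone in zones:
        try:
            return operation(zone)
        except ConnectionError as exc:
            errors[zone] = exc  # remember the failure, keep going
    raise AllZonesFailed(f"no zone answered: {sorted(errors)}")


def make_fake_store(down: Set[str]) -> Callable[[str], str]:
    """Build a simulated read operation (hypothetical, for illustration).

    Zones listed in `down` raise ConnectionError; all others answer.
    """
    def read(zone: str) -> str:
        if zone in down:
            raise ConnectionError(f"{zone} unreachable")
        return f"data served from {zone}"
    return read


if __name__ == "__main__":
    zones = ["zone-a", "zone-b", "zone-c"]  # made-up zone names
    # zone-a is simulated as down, so the read falls through to zone-b.
    print(call_with_failover(zones, make_fake_store({"zone-a"})))
```

The point of the sketch is the shape of the control flow: a failed zone is an ordinary branch in the code path, and only the loss of every zone surfaces as an error to the caller.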
There were several notable exceptions to the outage. Many very large consumers of Amazon's cloud services -- such as Priceline, Netflix, SmugMug and Zynga -- did not suffer any serious downtime. They may have noticed a few hitches and some alarming performance logs, but all remained functional throughout.
SmugMug's Don MacAskill wrote a high-level review of why the company was able to withstand the blow. First, he said, his company spread its service across Amazon's Availability Zones; second, it assumes service failures will happen; and "third, we don't use EBS."
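The first of those points, spreading a service across Availability Zones, can be sketched as a simple placement check. The Python below is a minimal illustration and not SmugMug's actual method: `spread_across_zones` and `survives_zone_loss` are hypothetical helpers, the zone names are made up, and real capacity planning would involve far more than counting instances.

```python
from collections import Counter
from itertools import cycle
from typing import List, Sequence


def spread_across_zones(instance_count: int, zones: Sequence[str]) -> List[str]:
    """Assign instances to zones round-robin; returns one zone per instance."""
    zone_cycle = cycle(zones)
    return [next(zone_cycle) for _ in range(instance_count)]


def survives_zone_loss(placement: Sequence[str], required: int) -> bool:
    """Check whether losing the most heavily loaded zone still leaves
    at least `required` instances running elsewhere."""
    per_zone = Counter(placement)
    worst_loss = max(per_zone.values(), default=0)
    return len(placement) - worst_loss >= required


if __name__ == "__main__":
    zones = ["zone-a", "zone-b", "zone-c"]  # made-up zone names
    placement = spread_across_zones(6, zones)
    # Six instances over three zones: any single zone failure leaves four.
    print(placement, survives_zone_loss(placement, required=4))
    # All six in one zone: a single zone failure leaves nothing.
    print(survives_zone_loss(["zone-a"] * 6, required=1))
```

A check like this makes the trade-off explicit: surviving the loss of a zone means paying for enough instances outside it to carry the load.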
So why the fresh excitement about what is apparently a settled debate on how to properly use cloud computing? The audience has grown massively; cloud tracking website CloudSleuth estimates that more than a third of all the Web traffic in the world now passes through or touches a resource operated by AWS. It's all administered, however, by individual, self-service AWS consumers that were vulnerable to this outage.
If past outages are any indication, this won't slow adoption of Amazon Web Services or cloud computing one bit. But it remains a forceful reminder that users need to understand that cloud is a very different, very new and occasionally very uncertain way of doing things, and deploying to the cloud has to reflect that.
AWS has declined to comment thus far, except to say it is preparing a detailed postmortem. Heroku has just posted its own postmortem that reads in part: "HEROKU TAKES 100% OF THE RESPONSIBILITY FOR THE DOWNTIME AFFECTING OUR CUSTOMERS." It also states that the platform will be prioritizing backup and availability across multiple regions and reducing EBS usage.
UPDATE: AWS has released an explanation and a timeline for the days-long failure in one of its Availability Zones. A backup router was improperly configured during routine upgrades, and when the primary router was taken offline for the upgrade, it caused a cascading failure of the EBS system.
AWS says that the outage was corrected quickly, but the recovery took so long because it was out of fresh hard drive space on its storage arrays and it did not want to follow standard procedure of re-using existing capacity until it was sure data was not going to be lost. A fraction of the affected EBS volumes may not be recoverable. It is looking both to increase the spare, overhead capacity needed for disaster recovery and to add more automation to its upgrade procedures to avoid a repeat.
It has also promised 10 days' worth of service credit for EBS for affected users and offered an apology.
Carl Brooks is the Senior Technology Writer for SearchCloudComputing.com. Contact him at firstname.lastname@example.org.