Ruby on Rails Platform-as-a-Service startup Heroku started off the new year with a nasty surprise. Without warning on January 2, all of the specialized, high-capacity Amazon EC2 instances that run its popular application and development service disappeared in the blink of an eye. Twenty-two virtual machines, all high-memory m2.2xlarge instances costing approximately $20,000 per month in hosting fees, suddenly vanished, leaving Heroku's estimated 44,000 running applications in the lurch.
Amazon blamed a routing device in its Virginia data center, and the service was back up in an hour. But Oren Teich, Heroku's product developer, said this is an example of one of the many important lessons new ventures and businesses need to study before they decide to work entirely in the cloud. Traditional contingency planning doesn't go far enough, he said: expect the unexpected.
"[Normally] you need to assume that anything can fail -- where we didn't go far enough was to assume that everything can fail," he said.
Teich said that while Heroku had designed for redundant servers and failover capacity, this was a novel kind of blackout for a hosting provider. A server failing was normal, he said, but it was unheard of for a whole class of resources to suddenly vanish.
Heroku had recently moved its front-end servers onto the high-memory m2.2xlarge instances, and some of those instances were already running "core back-end stuff."
Teich also said that all of Heroku's m2.2xlarge instances were running in a single availability zone, which was a mistake. He stressed that Heroku had failover built in already -- if 21 instances had failed instead of 22, or if it had spread instances across several zones, "we wouldn't be talking [about the outage]," he said.
Nevertheless, on Friday, January 2, every m2.2xlarge instance in that availability zone suddenly vanished, despite all other types of EC2 instances running as normal. That's unheard of in traditional hosting. It would be like every server with a given amount of RAM suddenly shutting down, regardless of operating system, age, brand, hardware or location in the data center, with no effect on its neighbors.
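Teich's point about availability zones can be illustrated with a minimal sketch. It assumes a failure domain is the pair of zone and instance type, which is how this outage behaved; the zone names and both helper functions are hypothetical and are not taken from Heroku's or Amazon's tooling.

```python
from collections import Counter
from itertools import cycle, product


def spread_placements(count, zones, instance_types):
    """Assign `count` instances round-robin over every (zone, type) pair."""
    slots = cycle(product(zones, instance_types))
    return [next(slots) for _ in range(count)]


def worst_case_survivors(placements):
    """Instances left if the most-populated (zone, type) group vanishes."""
    groups = Counter(placements)
    return len(placements) - max(groups.values())


# Hypothetical zone names, chosen only for illustration.
zones = ["zone-a", "zone-b", "zone-c"]

# All 22 instances in one zone, as in the outage described above.
single = spread_placements(22, ["zone-a"], ["m2.2xlarge"])
# The same 22 instances spread across three zones.
spread = spread_placements(22, zones, ["m2.2xlarge"])

print(worst_case_survivors(single))  # 0
print(worst_case_survivors(spread))  # 14
```

With everything in one zone, losing that zone's m2.2xlarge class leaves nothing running. Spread over three zones, the worst single loss still leaves 14 of the 22 instances.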
"For us, there's the stuff you plan for and then there's the stuff you don't even know about," Teich said.
An event like this was an "unknown unknown" that nobody planned for because nobody imagined it. He chalked it up to the learning process and pointed out that everybody using Amazon Web Services was flying by the seat of their pants at least part of the time.
"It's not like there's 'best practices' for cloud computing yet," he said.
EC2 expert understands cloud issues
"I sympathize with Oren!" said EC2 expert and consultant Shlomo Swidler. "It's not easy to imagine completely new ways for things to fail, especially things as complex as [EC2 and AWS]."
Swidler said more novel problems were bound to occur, but unanticipated hiccups would become rarer over time as users pooled their experiences -- typical for technologies with an enthusiastic community. Until then, however, and especially in high-availability systems like Heroku's PaaS, the risks of a new frontier remained.
"We'll all learn to consider those new failure modes in our designs. Until then, the early adopters should be aware that they're accepting a certain risk," Swidler said.
On the other hand, Amazon appears to have learned from past missteps. Teich said he couldn't fault the support given by AWS, a different story from the one others have told in the past. Even though the fault lay with Amazon's operation, AWS staff contacted Heroku and arranged for engineers from both organizations to piece together the incident and work out preventive measures.
"They actually reached out to us," Teich said.
Teich added that despite incidents like these, which show that Amazon has its share of quirks, the ease and flexibility of the service more than make up for it.
A 15-person startup like Heroku could never support its thousands of users on a measly few million in venture capital with traditional hosting. Along with the cost benefits of using AWS, Heroku gets to blaze a trail for next-generation platforms by discovering problems traditional hosting doesn't have.
"The big lesson is that no matter how smart you are, it'll happen to you," he said.
Carl Brooks is the Technology Writer at SearchCloudComputing.com. Contact him at firstname.lastname@example.org.