A five-hour disruption within one of Amazon.com's data centers last week left some customers high and dry but not seriously disgruntled. In what Amazon characterized as a freak accident, lightning struck a facility and damaged the power supply to several racks of servers, downing a number of Amazon Web Services (AWS) Elastic Compute Cloud (EC2) server instances. But AWS users appeared to take it in stride.
Outage not crippling for users
Wilson said the outages were undesirable but not catastrophic for his business. JamLegend is an online social gaming site where users compete to play songs with their keyboard, similar to the video game Guitar Hero. A year old, JamLegend boasts almost half a million users and is still in beta. Wilson said the company's main Web server was down for several hours, and the site was unavailable for about 40 minutes.
Wilson's only gripe was that he always feels left in the dark when outages occur. He described another outage in which he wasn't notified that one of his instances was running on degraded hardware until after his own troubleshooting had failed. "Why let us boot an instance on failed hardware? And why didn't we receive an email when the hardware failed?" he said.
Eric Hammond, the VP of technology at Campus Explorer Inc. and an AWS user who was unaffected by the outage, said that Amazon needs to educate its customers about exactly how it enables large-scale, fault-tolerant architectures better and cheaper than anybody else.
"If your vendor provides five disks to put in a RAID and you put all your data on a single disk, you'll look silly when you complain RAID doesn't work because the disk failed," he wrote in an email. Hammond added, "Amazon's SLA [service-level agreement] doesn't kick in until two whole data centers in the same region are completely unavailable. A single data center wiped off the globe is not a problem. Customers have to think bigger and use the tools provided."
Other users were similarly unfazed by the issue. One did not even consider it an outage, while others said they were more likely to experience outages through their own misuse of the service than from anything Amazon did.
Customers "have an absolutely legitimate desire for transparency," said Adam Selipsky, the VP of product management and developer relations at AWS. He cited a status page for AWS services that tracks a month's worth of data on incidents and outages and an AWS forum thread on the incident. He expressed a strong desire to give customers as much information as possible "without hurting them in the end" by exposing security risks in Amazon's infrastructure.
Amazon has been close-mouthed about its cloud, refusing to divulge the number, type or exact location of the servers it uses. Selipsky did confirm, however, that lightning struck an Amazon facility and damaged an uninterruptible power supply, which in turn caused a power distribution unit to fail, darkening several racks of servers.
"We had not seen that failure mode before," Selipsky said, explaining that backup power during the storm worked as designed, and the component failure happened as the building returned to mains power. Pressed on whether or not the building was properly grounded against lightning, he said he did not know exactly how the damage occurred but that Amazon facilities were "very much designed in light of 'best practices.'"
He said that Amazon had far better uptime and redundancy systems than the vast majority of its customers could afford, and that he was proud of the company's record. According to Selipsky, new services from Amazon like CloudWatch were designed to help users avoid potential disruptions by automatically re-provisioning servers across different Availability Zones when a failed instance is detected.
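The idea Selipsky describes, spreading instances across zones and replacing failed ones elsewhere, can be illustrated with a small simulation. This is only a sketch under stated assumptions: it does not call any Amazon API, and the `Instance` class, the function names and the zone labels `zone-a` and `zone-b` are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    # Hypothetical stand-in for a server instance; not an AWS object.
    name: str
    zone: str
    healthy: bool = True


def fail_zone(fleet, zone):
    """Simulate an outage by marking every instance in `zone` unhealthy."""
    for inst in fleet:
        if inst.zone == zone:
            inst.healthy = False


def is_serving(fleet):
    """The application is considered up while any instance is healthy."""
    return any(inst.healthy for inst in fleet)


def reprovision(fleet, healthy_zones):
    """Return a new fleet in which each unhealthy instance is replaced
    by a fresh one placed round-robin in the surviving zones."""
    if not healthy_zones:
        raise ValueError("no surviving zone to re-provision into")
    replaced, moved = [], 0
    for inst in fleet:
        if inst.healthy:
            replaced.append(inst)
        else:
            target = healthy_zones[moved % len(healthy_zones)]
            replaced.append(Instance(inst.name, target))
            moved += 1
    return replaced


# One fleet keeps everything in a single zone; the other spreads out.
single = [Instance("web-1", "zone-a"), Instance("web-2", "zone-a")]
spread = [Instance("web-1", "zone-a"), Instance("web-2", "zone-b")]

fail_zone(single, "zone-a")
fail_zone(spread, "zone-a")

single_up = is_serving(single)   # every instance was in the failed zone
spread_up = is_serving(spread)   # web-2 in zone-b is still healthy
recovered = reprovision(spread, ["zone-b"])
```

In this toy model the single-zone fleet stops serving as soon as its zone fails, while the spread fleet stays up and can then be restored to full strength in the surviving zone.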
"We had many customers whose instances went down last week, whose applications were unaffected because they were running in one or more Availability Zones," he said. Selipsky defined an availability zone as two physical locations that would not go down under the same disaster scenario. "It's not axiomatic," he said, but broadly true that Availability Zones are located in different data centers.
When it comes to outages in the cloud, Amazon is in good company. In May, Google's main site and all its online services went down across the U.S., causing consternation for several hours. In January, Salesforce.com suffered an outage, and in March, Microsoft's cloud development platform, Azure, went down for almost 24 hours.
"What we'll wind up finding out is that [Amazon is] still learning" about managing and selling public compute resources, said IDC analyst Frank Gens. He said that much of the publicity over outages among various cloud outlets is growing pains and that, for practical purposes, Amazon is a good buy.
"For 80% of shops, Amazon or Google is a step up in terms of availability. Not everyone is in the top 5% or 10%" of high availability, highly secure data centers, he said. Even though Amazon lacks the transparency and guarantees of other providers, it's still the head of the pack in pricing and in customers, and the outage is unlikely to change that.