In what has become a regular occurrence, Amazon Web Services experienced an outage in its U.S. East Region Monday.
This outage, which caused problems with Elastic Block Store and Relational Database Service, took out a number of highly visible customers, including popular websites such as Flipboard, Foursquare and Pinterest.
Thus, yet again, Amazon's failures were high profile, the kind of thing that has led to its reputation -- and by extension, the reputation of public cloud in general -- as unreliable among some enterprise observers.
But in this case, other industry watchers say that what might look to some like a fundamental lack of reliability actually stemmed from choices some Amazon customers made.
Building in redundancy is the only way to protect business assets in an outage like Monday's, but that decision is made on a case-by-case basis, according to Kent Langley, vice president for Amazon partner SolutionSet, which had several customers affected by the outage.
"[It] requires implementation of [a] disaster recovery and business continuity plan and implementation, which is often ignored or deemed overly expensive and an unacceptable cost relative to risk," he said. "Only a very small number of our clients are usually willing to maintain a multi-region deployment for their businesses."
And in fairness, there may be a very good reason for this. After all, not being able to pin on Pinterest or check in on Foursquare is hardly a crisis of national security; for a lean, mean Web business, building in costly enterprise-style redundancy is almost certainly not worth it.
Take Yipit, for example, a Web company that filters daily deals from sites like Groupon. The company's primary database was affected by the most recent outage, leaving the company to restore from backup Monday night.
This was a direct result of a decision the company made not to architect its database to span multiple availability zones, according to Andrew Gross, developer operations for the company.
"The past few outages, we did not get hit, and we knew we had pretty much dodged a bullet because at this point we didn't feel it was worthwhile to spend the extra engineering effort to try and get something that could potentially avoid this," Gross said. "We're just kind of dealing with it and accepting it as a fact of life."
For organizations such as Foursquare and Pinterest, it might also be more lucrative to prioritize new features over redundancy, according to Damian Bramanis, director of advisory services for Sentinus, a cloud computing consultancy located in Perth, Australia.
"I wouldn't be surprised if this was a conscious choice," he said.
But the fact that some Web-based businesses don't prioritize redundancy the way enterprises do doesn't mean adequate redundancy can't be built in where necessary. Nor are there many excuses for building truly critical applications in a single Availability Zone, particularly in the U.S. East Region, which has been the epicenter of most Amazon failures for the last five years.
"Quality, reliability and security is a joint responsibility between the enterprise and the cloud service provider," said Bramanis. "Amazon … [has] made options available for building best-practice, reliable, fault tolerant service, [and] it is up to the enterprise to make use of those options."
"At this point, I'm tempted to start pontificating about human nature," said Carl Brooks, analyst for Boston-based Tier1 Research, of the continued outages associated with Amazon single availability zone failures and the perceptions to which they lead.
"Most people recognize how to consume responsibly," Brooks said. "Some don't, and they get all the attention after they wrap their server around a telephone pole, metaphorically speaking."