If there was one major lesson learned from last month’s AWS outage, it’s that cloud failures are going to happen.
IT pros and other experts say it’s best to plan for them so you can respond quickly when they do occur.
Depending on the criticality of the application and the data, those plans could range from running redundant servers in the customers' own data center to setting up networks using multiple cloud providers.
"Prepare for failure," said David Blinder, founder and CTO of LiveFamily, a Facebook app for genealogical research and a division of Bellevue, Wash.-based Intelius Inc.
LiveFamily, which runs on Amazon Web Services (AWS) infrastructure, was hit by two June outages, although neither was catastrophic. That was partly because the company uses RightScale Inc.’s Cloud Management package, which reroutes workloads and network traffic to different cloud vendors if the situation warrants it and the customer is willing to pay.
However, providing high levels of resiliency is not free. Experts caution that customers need to decide up front which applications are truly critical.
IBM’s Business Continuity and Resilient Services Group quizzes prospective customers about what they deem critical, prior to setting up hosting services, said Rich Cocchiara, a distinguished engineer at IBM.
"[We say] 'Let’s determine the level of service you need to have, service-level objectives and agreements that give you the kind of service that you want,'" Cocchiara said. "And, by the way, not all business processes and applications are created equal."
Protecting against system outages can be as simple as keeping private cloud equipment in the customers' data center. It might require setting up a mirror site in a different AWS Availability Zone, or it could be as complex as running multiple cloud platforms.
"We were [impacted], but our caching system saved us by switching back to local [processing and storage]," said Colin Dean, president of the Pittsburgh LAN Coalition, a group that organizes video gaming events. "Having some kind of failsafe is ideal, even if your site fails over to a spare."
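Dean does not describe how his group's caching system works, so the following is only a minimal sketch of the general idea, assuming a service that tries its cloud-hosted source first and serves the last locally stored copy when that source is unreachable. The function name `fetch_with_local_fallback` and its parameters are hypothetical and do not come from the Pittsburgh LAN Coalition.

```python
from typing import Callable, Dict


def fetch_with_local_fallback(
    key: str,
    fetch_remote: Callable[[str], str],
    local_cache: Dict[str, str],
) -> str:
    """Try the cloud-hosted source first; on failure, serve the last local copy.

    `fetch_remote` is a hypothetical callable standing in for a real request
    to a cloud service. It is expected to raise an exception during an outage.
    """
    try:
        value = fetch_remote(key)
    except Exception:
        # Remote source is unavailable: fall back to the local copy.
        # Raises KeyError if nothing was ever cached for this key.
        return local_cache[key]
    # Refresh the local copy on every successful remote read.
    local_cache[key] = value
    return value
```

In this sketch the local copy is only as fresh as the last successful remote read, which is the usual trade-off of this kind of failsafe.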
AWS outages magnify the importance of cloud
Both of AWS’ June outages were due to electrical failures. The second began when massive electrical storms on the East Coast triggered a previously unforeseen problem in the electrical backup systems. The resulting outage took down one of the company's Availability Zones in the US East-1 Region, AWS' largest, which has at least 10 data centers.
For several hours during the night of June 29 into June 30, a number of large sites such as Pinterest, Netflix and Instagram were unavailable.
Additionally, the outage caused problems that made AWS' "control plane" bog down -- compounding the issue.
For its part, Amazon has said it will recertify or replace the backup generation equipment that failed to perform properly, as well as adjust hardware parameters such as how long to wait for power fluctuations to stabilize before cutting over to generator power.
AWS said this latest outage affected a "significant" number of customers, although only one, a cloud-based dating site, WhatsYourPrice.com, said that it was switching cloud providers in light of these failures.
AWS did not disclose how many customers or users were hit by the outage.
LiveFamily had planned ahead and was running AWS instances in different Availability Zones but, with the compounding problems, "got bitten a little," Blinder said.
RightScale Cloud Management, Opscode Chef simplify recovery
Besides providing other cloud automation features, products such as RightScale's Cloud Management and Opscode Inc.'s Chef can help simplify customers' recovery from outages.
"Failure is going to happen," said Jesse Robbins, chief community officer and co-founder of Seattle-based Opscode, which fields the Chef cloud infrastructure automation products.
Like RightScale, Chef supports a variety of cloud platforms, including AWS, OpenStack and Microsoft Windows Azure, and the company just disclosed support for Google Compute Engine.
"Tools like Chef [let you] automatically failover to another cloud provider or provide your own," Robbins added.
"If you do the upfront work right, a failure is an incident and an emergency but not a disaster."
Jeremy Przygode, co-founder and CEO of Los Angeles-based Stratalux Inc., an AWS reseller, is an Opscode customer. Stratalux, which provides cloud-based managed services, also had customers affected by the late June outage, but Przygode takes the outages in stride.
"Things break," Przygode said.
Running multiple clouds -- complex but effective
For those who really can't afford downtime, tools such as those from RightScale, Opscode, and others enable customers to run multiple clouds, but it's complicated.
"It's really hard to do," said Kyle Hilgendorf, principal research analyst at Stamford, Conn.-based researcher Gartner Inc. "You have to keep mirrored copies of the exact apps stack with another provider and then you have to figure out how to fail over if one of these goes down."
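The failover decision Hilgendorf describes can be illustrated with a short sketch. It assumes the mirrored stacks already exist at each provider and that some health probe is available for each one; it does not show how RightScale or Chef actually do this. The function `pick_active_provider` and the provider names are hypothetical.

```python
from typing import Callable, Optional, Sequence


def pick_active_provider(
    providers: Sequence[str],
    is_healthy: Callable[[str], bool],
) -> Optional[str]:
    """Return the first provider, in priority order, that passes its health check.

    `is_healthy` is a hypothetical probe (for example, an HTTP check against
    the mirrored application stack). Returns None if every provider is down.
    """
    for name in providers:
        try:
            if is_healthy(name):
                return name
        except Exception:
            # A probe that raises is treated the same as a failed check.
            continue
    return None


# Hypothetical status table standing in for real health probes.
status = {"primary-cloud": False, "secondary-cloud": True}
active = pick_active_provider(
    ["primary-cloud", "secondary-cloud"],
    lambda name: status[name],
)
```

Choosing where to send traffic is the easy part. The hard work Hilgendorf points to, keeping the second stack an exact and current mirror, sits outside this sketch.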
In fact, many experts argue that implementing mission-critical applications in the cloud can actually help prevent outages.
"My background tells me cloud is absolutely a tool in the arsenal for protecting against these types of outages -- just by its very nature it affords companies certain abilities that they didn’t have before," IBM's Cocchiara said.
"So cloud makes it not only affordable but because it is also relying on multiple cloud centers, gives users the ability to decide what level of risk they want to take," Cocchiara added.
Stuart J. Johnston is Senior News Writer for SearchCloudComputing.com. Contact him at email@example.com.