Cloud outage figures from this year show more mature public clouds are better equipped to avoid outages, but there are some surprises.
Cloud vendors have invested heavily, in both money and planning, in adding resiliency to their platforms. Discounting the low-end public clouds, uptime has improved considerably, with one major exception, said David Linthicum, senior vice president of Cloud Technology Partners, a cloud consulting firm based in Boston.
"Even though [public cloud providers] are expanding quickly, they seem to be smarter at operating their business, with the possible exception of Microsoft, which has made some dumb mistakes," Linthicum said.
Among major public cloud vendors, Amazon EC2 has maintained the best uptime over the past year, with a total downtime of 2.43 hours across all regions, according to CloudHarmony, a company based in Laguna Beach, Calif., that conducts independent, third-party monitoring of cloud vendor uptime.
Microsoft Azure, which had a highly publicized cross-region outage on Nov. 18, had the most downtime in compute services among major vendors at nearly 40 hours, according to CloudHarmony.
"Some services have been around for longer and are a bit more stable than others because they've been through their rough periods and ironed out the kinks more than the others," said Jason Read, founder of CloudHarmony.
The uptime improvements can be attributed to experience, additional data centers for failover, more automation, better internal communication and an improved ability to spot patterns that lead to outages, Linthicum said.
Providers spend a lot of money to maintain service and have become more proactive because a string of outages will be top of mind as enterprises consider total cost of ownership when shopping for cloud.
AWS on top
Amazon Web Services (AWS) has had its share of high-profile outages in recent years, but this year, all has been quiet on that front, according to partners.
"We've seen some service impacts and some slowness, but no downtime with any of our customers that I'm aware of," said Kris Bliesner, CTO of 2nd Watch, Inc., an Amazon partner and cloud consultancy in Liberty Lake, Wash.
The company had planned to build an application that could serve as an early warning system for its customers when there was an outage. But those plans have fallen to the bottom of the development priority list, Bliesner said.
"We just don't see as many outages anymore," Bliesner said.
In part, this may be because AWS has developed expertise in designing highly reliable infrastructure at scale and has already gone through the kind of growing pains that are now affecting less mature cloud providers, Bliesner said.
That was one of the messages from AWS vice president and distinguished engineer James Hamilton, who presented on Amazon's innovation at scale at this year's re:Invent conference.
Amazon has come to design its own networking, storage and server equipment, which has lowered costs and improved reliability, Hamilton said.
"Enterprises give lots of complicated requirements to networking equipment producers who aggregate all of these complicated requirements into tens of millions of lines of code that can't be maintained, and that's what gets delivered," Hamilton said in his presentation. "We don't use all that stuff … the answer to why our gear is more reliable is that we didn't take on as hard a problem."
Amazon is also "religious" about improving its infrastructure monitoring metrics on a weekly basis, which has improved reliability, Hamilton said. AWS' Availability Zone (AZ) system connects multiple data centers within AZs, which are synchronously mirrored for high availability. Services such as the Relational Database Service (RDS) also offer multi-AZ replication, which increases the number of redundant locations where data is stored.
AWS customers have learned from experience about creating more resilient applications too. When RDS was first introduced, 26% of customers used multi-AZ replication. That number is now 40%, and the goal is for it to be 70%, according to Hamilton.
Newer databases in the AWS cloud, such as Aurora, offer even more resiliency thanks to an overhaul of the underlying storage engine, which exists in Aurora separately from the main database and can be recovered quickly in the event of a failure. Aurora also replicates data across three AZs, keeping six copies in total.
Amazon's data center design has also been refined to offer optimal reliability, according to Hamilton. Data centers hold a maximum of 50,000 to 80,000 servers.
"We can easily build bigger, but … as they get bigger there's a risk -- if something goes wrong, the loss is too big," Hamilton said.
Because AWS has learned from experience how to optimize availability as it scales up, AWS competitors that entered the infrastructure as a service market later may still see the kind of highly publicized outages Amazon used to have, according to Bliesner.
"At some point, if Azure or Google want to compete, they're going to have to make a scalability leap, and are customers at higher risk for outages during that scale up?" he said. "My guess is they are."
Amazon is behind Google's cloud in one area, however. Google Cloud Storage had eight outages for a total of 14.23 minutes, while Amazon S3 had 22 outages for 2.66 hours, according to CloudHarmony.
Cascading errors happen, and when major outages hit public cloud vendors, it's typically due to human error, rather than a hardware infrastructure failure, said Jonah Kowall, an analyst with Gartner, Inc., based in Stamford, Conn.
"They take all the best practices to try to avoid these issues, but [outages] are going to happen with a complex system that is undergoing change," Kowall said. Enterprises typically move slower because they over-engineer their infrastructure and processes, Kowall said. Cloud offers a bit of a "catch-22" because the attraction is speed and agility, but smaller, less thoroughly vetted update cycles can come with bugs that create problems for customers, he added.
Scheduled reboots are often a cause of downtime for compute, which can be an indication of poorly managed infrastructure, Read said.
"Vendors have outages," Read said. "The good ones are going to do thorough investigation of what the root causes were and either change policies or software to make sure those exact set of events don't occur again."
And learning from those mistakes often helps across platforms, according to Paul Voccio, Rackspace Hosting vice president of software development.
"As the industry matures, everyone is learning from each other how to operate their servers at scale and in a supportable manner," Voccio said.
Cloud Downtime 2014
AWS EC2: 2.43 hours
CenturyLink Cloud Servers: 26.02 hours
CloudSigma: 7.48 hours
Google Compute Engine: 4.43 hours
Joyent Cloud: 2.6 hours
Microsoft Azure: 39.63 hours
Rackspace Cloud: 7.52 hours
Source: CloudHarmony's CloudSquare
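To put those hours in perspective, the figures can be converted into availability percentages. The Python sketch below is illustrative only. It assumes the CloudHarmony totals cover a 365-day year (8,760 hours) and treats each service as a single always-on endpoint. Because the totals are summed across all regions, actual per-region availability would be higher than these numbers suggest. The function names are hypothetical.

```python
# Rough conversion of annual downtime hours into availability percentages.
# Assumption: a 365-day measurement period (8,760 hours); the source only
# says "over the past year".
HOURS_PER_YEAR = 365 * 24

# Compute downtime in hours, as listed in the table above.
DOWNTIME_HOURS = {
    "AWS EC2": 2.43,
    "CenturyLink Cloud Servers": 26.02,
    "CloudSigma": 7.48,
    "Google Compute Engine": 4.43,
    "Joyent Cloud": 2.6,
    "Microsoft Azure": 39.63,
    "Rackspace Cloud": 7.52,
}


def availability_percent(downtime_hours, period_hours=HOURS_PER_YEAR):
    """Return the share of the period, in percent, that the service was up."""
    if not 0 <= downtime_hours <= period_hours:
        raise ValueError("downtime must be between 0 and the period length")
    return 100.0 * (1.0 - downtime_hours / period_hours)


def rank_by_downtime(table):
    """Return (vendor, hours) pairs sorted from least to most downtime."""
    return sorted(table.items(), key=lambda item: item[1])


if __name__ == "__main__":
    for vendor, hours in rank_by_downtime(DOWNTIME_HOURS):
        print(f"{vendor}: {hours} h down, {availability_percent(hours):.2f}% available")
```

By this rough measure, 2.43 hours of downtime works out to about 99.97% availability, and 39.63 hours to about 99.55%.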
At Rackspace's San Antonio headquarters, Voccio has two large screens at his desk to monitor the company's public cloud metrics. And while other emerging areas of cloud get more attention, few things get as much focus internally as maintaining uptime.
"The customers do expect us to be up and running all the time," Voccio said. "We take it very seriously."
Rackspace, which claims 99.999% uptime across all data centers since 2009, has weekly meetings to discuss system performance and to ensure scheduled maintenance windows don't conflict. The company has emphasized resiliency and redundancy in each data center, and it has learned that isolating clusters is critical to quickly diagnosing problems and ensuring they don't cascade to the rest of the data center, Voccio said.
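For a sense of what a "five nines" figure implies, an uptime percentage can be turned into a yearly downtime budget. The sketch below is a back-of-the-envelope calculation, not a description of how Rackspace measures uptime. It assumes a 365-day year, and the function name is hypothetical. The claim refers to data centers, so it is not directly comparable to third-party measurements of a compute service.

```python
# Downtime budget implied by an uptime percentage.
# Assumption: a 365-day year (525,600 minutes).
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(uptime_percent, period_minutes=MINUTES_PER_YEAR):
    """Return the minutes of downtime permitted at the given uptime level."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return (100.0 - uptime_percent) / 100.0 * period_minutes


if __name__ == "__main__":
    for level in (99.9, 99.99, 99.999):
        print(f"{level}% uptime allows {allowed_downtime_minutes(level):.2f} minutes per year")
```

At 99.999%, the budget is a little over five minutes a year.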
Rackspace's compute cloud was down for 7.52 hours across all regions over the past year, according to CloudHarmony. The company came under the spotlight when it had to conduct a reboot due to a Xen hypervisor bug and faced criticism over its handling of the situation.
It's hard to tell customers there is an issue that has to be resolved but that the company can't discuss it with them because of an embargo, Voccio said.
Rackspace often cites its so-called fanatical support as a differentiator, but Voccio says he tells his coworkers it's even better when customers get what they want without having to turn to the support staff.
"While, yes, we're there to answer the phones, most customers would prefer not to call," Voccio said.
Transparency still a hurdle
Providers do offer several weeks' worth of uptime information on their websites, but none of the vendors contacted by TechTarget provided year-over-year figures.
Vendors hesitate to disclose information and some won't mention when little blips in the system or partial outages occur, Read said. There also can be issues with the reliability of status pages, which is compounded when vendors host their own sites and an outage wipes out the monitoring dashboard for customers.
"Part of the issue we see is many of the prominent enterprise cloud providers limit what you can do to verify if they're working properly, and that's especially true of SaaS," Kowall said.
Most people try to emulate a user by setting up software that logs in and performs a couple of actions every few minutes from multiple places around the world to ensure functionality, but vendors don't like that because it puts extra load on the system, Kowall said. Vendors try to limit that through contracts and, most likely, they don't want people to hold them accountable for stability, which is concerning, he added.
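The kind of synthetic check Kowall describes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any vendor's or monitoring product's method: it only fetches a URL rather than logging in and performing actions, it runs from one location rather than several, and the function names and the five-minute interval are hypothetical.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Make one request and return (ok, latency_seconds).

    A 2xx or 3xx response counts as success. Network errors, timeouts
    and HTTP error responses count as failure.
    """
    start = time.monotonic()
    try:
        with opener(url, timeout=timeout) as response:
            ok = 200 <= response.status < 400
    except (urllib.error.URLError, OSError):
        ok = False
    return ok, time.monotonic() - start


def run_checks(url, rounds, interval_seconds=300, probe_fn=probe, sleep=time.sleep):
    """Probe the URL `rounds` times, pausing between attempts."""
    results = []
    for attempt in range(rounds):
        results.append(probe_fn(url))
        if attempt < rounds - 1:
            sleep(interval_seconds)
    return results


def success_rate(results):
    """Return the fraction of probes that succeeded."""
    if not results:
        raise ValueError("no results to summarize")
    return sum(1 for ok, _ in results if ok) / len(results)
```

A caller would pass the address of a service it is permitted to test, for example `run_checks(url, rounds=12)` for an hour of checks at the assumed interval. Every probe adds load to the target, which is the concern vendors raise, so the interval and scope should stay within whatever the provider's contract allows.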
"You have to negotiate with them what you're allowed to do within their system," Kowall said.
Vendors should be doing more to drive transparency down the cloud layer so customers know what's going on with the system, Rackspace's Voccio said.
"Customers want to see lower down the stack," Voccio said. "It creates hesitancy so we're exploring ways to provide more transparency for the entire stack."
Google declined to be interviewed for this report, but provided a statement saying the company is committed to making Google Cloud Platform reliable.
Representatives for Microsoft declined to comment for this report.