Cloud outage audit update: The challenges with uptime

Public cloud vendors improved uptime percentages in 2015, while customers have learned to better handle cloud outages and what works best in their environments.

A cloud outage gets plenty of headlines, but the reality of how it impacts customers is much more nuanced.

Users increasingly are finding ways to get around outages that occur with their providers -- or at least coming to terms with the reality that no public cloud will be close to 100% available without some heavy lifting. Meanwhile, potential customers remain keen to compare vendors' uptime consistency despite the limitations of publicly available data.

CloudHarmony, owned by analyst firm Gartner, is one of the more prominent sites to track uptime among public cloud vendors. It uses a simple, nonperformance-related methodology to track vendor availability across regions by provisioning services with providers in four categories: infrastructure as a service, compute and storage, content delivery network and domain name system.

The fairest comparison among the platforms CloudHarmony tracks is uptime percentage, because over a full year, outage totals include scheduled maintenance windows, said Jason Read, CloudHarmony founder. Some vendors also have far more regions than others, meaning a provider such as Amazon could have more outage time collectively, but customers would see better overall availability because the downtime is spread across data centers.

Generally, top vendors saw better uptimes in 2015 compared with 2014. Microsoft garnered the biggest improvement due in large part to not repeating the massive late-year outage that skewed results in 2014. Azure Virtual Machines uptime improved from 99.9339% to 99.9937%, according to CloudHarmony.

"[Cloud vendors] are continually learning from their mistakes, and over time, their services improve," Read said.

Among the biggest public cloud vendors, Amazon had the best uptime percentage for compute at 99.9985% with Elastic Compute Cloud, while CenturyLink had the worst at 99.9647%, according to CloudHarmony. Google Cloud Storage had the best storage uptime at 99.9998%.
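Percentages this close to 100 are hard to compare by eye, so it can help to translate them into minutes of downtime per year. The following is a minimal sketch of that arithmetic, not CloudHarmony's methodology: the function name is made up for illustration, and it assumes each figure applies to a full 365-day year for a single monitored service.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; assumes a non-leap year


def downtime_minutes_per_year(uptime_percent: float) -> float:
    """Convert an annual uptime percentage into minutes of downtime."""
    return (100.0 - uptime_percent) / 100.0 * MINUTES_PER_YEAR


# Uptime figures quoted in the article, attributed to CloudHarmony.
figures = [
    ("Azure Virtual Machines, 2014", 99.9339),
    ("Azure Virtual Machines, 2015", 99.9937),
    ("Amazon EC2, 2015", 99.9985),
    ("CenturyLink compute, 2015", 99.9647),
]

for label, pct in figures:
    print(f"{label}: {downtime_minutes_per_year(pct):.1f} minutes")
```

Under those assumptions, Azure's improvement works out to roughly 347 minutes of downtime in 2014 versus about 33 minutes in 2015, while the 2015 gap between Amazon EC2 and CenturyLink is about 8 minutes versus about 186 minutes.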

Still, Read is quick to note that the CloudHarmony data shouldn't be treated as a holistic picture of cloud uptime, but rather a "measurement of network availability from the outside." Not all services are tracked, so some outages, such as the cascading AWS database as a service event in September, either don't register or show less of an impact than what users observed.

"[The] cloud is a very complex, intricate set of services that should be working together in various ways," Read explained. "It's just impossible to represent availability for all the infinite possible ways you might be using cloud."

Some vendors underreport outages, while others, such as CenturyLink, offer a tremendous amount of data and push transparency even with the smaller outages that customers might not notice, Read said.

Sometimes, outages aren't the fault of the cloud provider. The most severe, frequent and underreported outages are network-related, he added.

Uptime for Google Compute Engine actually fell slightly in 2015, from 99.9859% to 99.9796%, according to CloudHarmony. While Google didn't provide specific figures, company officials claimed they achieved considerable reliability improvements in 2015, according to internal metrics.

Tracking cloud uptime is but one measurement of the complexity of network, compute and storage all working properly, said Ben Treynor Sloss, vice president of engineering at Google. Customers could be using 20 instances at any given time, and one of those could have a maintenance issue without impacting the overall performance of the application. If that lagging instance is the one service being tracked, unavailability would be over-reported at the server level and availability underreported at the system level.

Other issues, such as autoscaling and load balancing, come into play as service demand rises and falls throughout the day, and can considerably impact performance of larger applications, he added.

"It's still better than having no data, but it isn't sufficient to actually tell you what the customer is going to be experiencing," Sloss said.
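Sloss' point about a single tracked instance can be illustrated with a little arithmetic. The sketch below is a simplified model, not how Google or CloudHarmony measures anything: the function name, the 60-minute outage, the 30-day period and the 15-instance capacity threshold are all assumptions chosen for illustration.

```python
def probe_vs_system(instances: int, outage_minutes: float,
                    period_minutes: float, min_healthy: int) -> tuple[float, float]:
    """Compare availability seen by a probe with availability of the application.

    Models one instance failing for `outage_minutes` while the others stay up.
    Returns (probe_availability, system_availability) as fractions of the period.
    """
    # An outside probe that happens to watch the failing instance sees the full outage.
    probe_availability = 1.0 - outage_minutes / period_minutes

    # The application is down only if too few healthy instances remain.
    healthy_during_outage = instances - 1
    app_down_minutes = outage_minutes if healthy_during_outage < min_healthy else 0.0
    system_availability = 1.0 - app_down_minutes / period_minutes
    return probe_availability, system_availability


# Hypothetical scenario: 20 instances, one down for an hour in a 30-day month,
# and the application needs at least 15 healthy instances to serve traffic.
probe, system = probe_vs_system(20, 60, 30 * 24 * 60, 15)
print(f"probe sees {probe:.4%}, application sees {system:.4%}")
```

In this scenario the probe records an hour of downtime even though the application never dropped below its required capacity, which is the gap between server-level and system-level availability that Sloss describes.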

Becoming savvier about risks, tradeoffs

Most cloud vendors' service-level agreements (SLAs) include the same 99.95% uptime guarantee. Private data centers are not immune to outages, but the lack of control and insufficient availability from public clouds deter some IT shops. That's also partly why most public cloud workloads aren't used for production or mission-critical applications.

CityMD, an urgent care organization with 50 facilities in New York and New Jersey, uses Google for email collaboration and hosts some of its external websites with AWS, but none of its infrastructure is in the public cloud.

"We are almost a 24/7 shop, so the uptime is a big priority for us," said Robert Florescu, vice president of IT at CityMD. "It was more cost-efficient to put our infrastructure in the data center."

IT pros are at different maturity stages with the cloud, so there's still some fear of unavailability, said Kyle Hilgendorf, an analyst with Gartner. While most are not putting their most important assets in the public cloud, they are confident in vendors' track records and are willing to be proactive in their design.

"They so desire cutting-edge features and services that [public clouds] offer that they can't build in their own data centers that they're willing to take some of the risk," Hilgendorf said.

Very few customers seek higher-level uptime guarantees than public cloud vendors' typical SLAs, because the benefit would be marginal compared with the additional cost and engineering required, Sloss said.

"When you actually have to make tradeoffs -- and to be fair to Amazon, we're all in the same boat -- with any of us, if you want to go beyond running in a single zone, then you're going to have to do additional work," Sloss said.

Still, cloud vendors go to great lengths to avoid disruption. At the most recent re:Invent, the annual Amazon Web Services user conference, Jerry Hunter, vice president of infrastructure, discussed the extent to which Amazon has gone to control uptime, including steps to improve building design and energy usage. The company also built its own servers, network gear and data center equipment to reduce potential breakdowns, and improved automation to reduce human errors and help increase reliability.

AWS has purpose-built networks in each data center, with two physically separated fiber paths extending out from each facility. Amazon has also built its own Internet backbone to improve Web performance and monitoring capabilities.

Amazon used input from the retail side of the business in its fulfillment centers and learned how to modify processes and workflows, Hunter said. "We've turned our supply chain from [a] potential liability into a strategic advantage," he said.

Several years ago, there were considerable discrepancies between vendors, but by now, all the big providers have industrialized and automated their processes to close the gaps, said Melanie Posey, an analyst with IDC in Framingham, Mass. Customers also have put in more redundancies to redirect users and processes to another region if there's an outage in their primary data center.

How customers are becoming more savvy

IT pros who use Amazon are not only increasingly savvy about strategies to protect themselves from downtime, but they're also dismissive of those who decry the cloud based on availability.

For example, after an AWS outage last September, IT pros said the most disruption came from people who saw it as evidence that cloud computing is inherently riskier than on-premises deployments. One user compared it to a single accident on a freeway leading drivers to conclude that highways in general are less safe than surface roads.

"Despite the inconvenience and all the press attention, you'd be hard-pressed to find corporate customers or consumer end users who are so fed up with the AWS outage that they would abandon their cloud services," wrote AWS expert and TechTarget contributor Jeff Kaplan shortly after the September issues.

Proactive users should take advantage of multiple availability zones and multiple regions, if possible, to protect against natural disaster or a region going down, Hilgendorf said. To take it a step further, the best practice to handle cascading software bugs or expiring certificates is to use multiple vendors.

"All of this comes down to how much cost are you willing to absorb for the insurance of maybe protecting against the low likelihood of an unavailability event," Hilgendorf said.
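The tradeoff Hilgendorf describes can be put in rough numbers. The sketch below estimates the availability of a deployment that stays up as long as at least one zone is up. It uses the 99.95% SLA figure cited earlier purely as an illustrative input, and it assumes zone failures are fully independent, which is a strong assumption: cascading software bugs or expiring certificates can take down several zones at once, which is why using multiple vendors is suggested for those cases. The function name is hypothetical.

```python
def combined_availability(per_zone_percent: float, zones: int) -> float:
    """Availability (%) when the service survives as long as one zone is up.

    Assumes every zone has the same availability and fails independently.
    """
    p_down = 1.0 - per_zone_percent / 100.0  # chance a single zone is down
    p_all_down = p_down ** zones             # chance every zone is down at once
    return (1.0 - p_all_down) * 100.0


for zones in (1, 2, 3):
    print(f"{zones} zone(s): {combined_availability(99.95, zones):.7f}%")
```

Under the independence assumption, a second zone moves the theoretical figure from 99.95% to about 99.999975%, while a third adds very little on top of that. Each step still carries extra cost and engineering work, so the model only frames the question of how much insurance is worth paying for.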

A more reactive stance involves management and monitoring with tools such as CloudWatch, though this will help more with unavailability of a specific application than of the provider, Hilgendorf explained.

"Most of the unavailability events we hear from customers are, 'Oh, we screwed our application up, we misconfigured something,' or, 'We had a developer go in and change a setting, and everything went awry,'" he said.

Additionally, there are resources such as CloudTrail for logging API requests and other changes to do root-cause analysis.

Any cloud service could go down at any given moment, but given the number of tools made available to users now, in most situations, the fault probably lies with the customer, Hilgendorf said.
