For years, a cloud SLA has been a formal pact between cloud providers and their users. But to many organizations, these contracts remain a mixed blessing.
On one hand, service-level agreements (SLAs) establish trust; users want assurance that their provider will guarantee a certain level of service. Just as important, they want assurance that their provider will pay if those guarantees aren't met.
But on the other hand, many cloud experts and users agree that understanding or negotiating an SLA can be a daunting -- and often confusing -- task. And even as more applications move to the cloud, some users still feel cloud SLAs aren't cutting it.
One of the biggest qualms IT pros have with cloud provider SLAs is that they're too cut-and-dried. Many SLA terms address cloud availability only in terms of a service being "up" or "down," and don't account for the various performance levels in between.
"The reality of any of these clouds is that they are not either up or down -- they are always in some state of in-between," said a senior technologist at a major Los Angeles-based film production company that uses a mix of cloud services, including those from HP and OpenStack.
"There is always something broken, or there is always something degraded," he said. "If a [cloud vendor] can deliver the service, even if it's slow or intermittent, then they consider it up."
SLAs should include more granular terms around concepts such as provider response and recovery time, said Jason Andersen, VP of business line management at Stratus Technologies, a provider of fault-tolerant servers and software to cloud vendors and telecom carriers, based in Maynard, Mass.
"Some customers are starting to realize, 'Well, wait a second, I don't want it to just have a binary, red-light, green-light kind of thing,'" Andersen said. "If it's somewhere in-between, and it's a yellow light, we need to do something about that."
Time to resolution -- or the time it takes a cloud provider to fully resolve an issue with its service -- is an SLA term that cloud users should especially seek out and scrutinize, said Aaron Tantleff, partner and intellectual property lawyer at Boston-based Foley & Lardner LLP, during a cloud meet-up this month. He also agreed that an SLA should define cloud service availability in much more granular terms than "up" versus "down."
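One way to picture a more granular availability definition is a three-state classification rather than a binary one. The Python sketch below is illustrative only: the thresholds and the names `ProbeResult`, `Health` and `classify` are assumptions made for this example, not terms from any provider's SLA.

```python
from dataclasses import dataclass
from enum import Enum


class Health(Enum):
    GREEN = "up"
    YELLOW = "degraded"
    RED = "down"


@dataclass
class ProbeResult:
    success_rate: float    # fraction of requests that succeeded, 0.0 to 1.0
    p95_latency_ms: float  # 95th-percentile response time in milliseconds


# Hypothetical thresholds; real values would have to be negotiated in the SLA.
MIN_SUCCESS_UP = 0.999
MIN_SUCCESS_DEGRADED = 0.90
MAX_LATENCY_UP_MS = 500.0


def classify(result: ProbeResult) -> Health:
    """Map a probe result to up, degraded or down."""
    if result.success_rate < MIN_SUCCESS_DEGRADED:
        return Health.RED
    if (result.success_rate < MIN_SUCCESS_UP
            or result.p95_latency_ms > MAX_LATENCY_UP_MS):
        return Health.YELLOW
    return Health.GREEN
```

Under these assumed thresholds, a service that answers every request but takes two seconds to do so would register as degraded rather than up.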
"Of all the major data points, [time to resolution] is probably the one that's the least talked about, but the one that's the most useful," Tantleff said.
Cloud credits come up short
Another common issue with cloud SLAs is the concept of a cloud provider reimbursing a user with credits in the event of a service outage. While sometimes useful, these credits -- which users can generally apply to their cloud bill or toward the purchase of another cloud service -- rarely make up for the financial or business loss that can occur after an outage, said the senior technologist at the film production company, who asked to remain anonymous.
"We have had experience -- mostly in dev/test environments -- where if something is out for a long time, you get some sort of credit against your bill or some discount on part of the service, but that's not very satisfying," he said. "It's not about the money. It's about the ability to deliver a service."
While most major cloud providers offer these credits, they hardly make up for the business disruption that an outage can cause, said Jason Read, founder of CloudHarmony, based in Laguna Beach, Calif., a company that conducts independent monitoring of cloud vendor uptime and was recently acquired by Gartner.
"Usually, the type of credit that you are going to get from an SLA-based policy is just going to be, really, a token of the impact to your business," Read said. "It's not going to amount to much compared to the possible hardship that your business has to endure during the downtime."
Because of this model, and the fact that cloud giants such as Amazon Web Services and Google continue their race to the bottom in terms of cloud service pricing, users are eventually going to base their cloud provider decisions less on cost, and more on reliability, said the senior technologist.
"It's going to be all about service delivery and uptime," he said.
Cloud users need to step up to the plate
Regardless of their provider's SLA, cloud users must take steps to protect their business and mission-critical applications in the event of an outage. And one of the best ways to do that is to build cloud applications that are resilient, fault-tolerant and designed specifically for the cloud.
"Your application has to be architected and deployed in such a way that you can have application uptime and service uptime, even though you are going to have component outages," said the senior technologist.
Cloud SLAs should include terms that span the complete "fault management cycle" of a cloud service, Andersen suggested. This includes five key phases of cloud service failure:
- Realizing there's an issue
- Locating the issue
- Isolating the issue
- Recovering from the issue
- Full repair of the issue
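The five phases can be recorded as timestamps on an incident, which makes each one measurable. The sketch below is a hypothetical illustration: the `Phase` names, the `phase_durations` helper and the sample times are assumptions made for this example, not part of any vendor's tooling.

```python
from datetime import datetime, timedelta
from enum import Enum
from typing import Dict


class Phase(Enum):
    DETECTED = 1   # realizing there's an issue
    LOCATED = 2    # locating the issue
    ISOLATED = 3   # isolating the issue
    RECOVERED = 4  # recovering from the issue
    REPAIRED = 5   # full repair of the issue


def phase_durations(started: datetime,
                    reached: Dict[Phase, datetime]) -> Dict[Phase, timedelta]:
    """Time taken to reach each phase, measured from the previous milestone."""
    durations = {}
    previous = started
    for phase in Phase:  # Enum members iterate in definition order
        durations[phase] = reached[phase] - previous
        previous = reached[phase]
    return durations


# Made-up incident: the fault begins at 09:00 and is fully repaired at 13:00.
started = datetime(2024, 1, 1, 9, 0)
reached = {
    Phase.DETECTED: datetime(2024, 1, 1, 9, 5),
    Phase.LOCATED: datetime(2024, 1, 1, 9, 20),
    Phase.ISOLATED: datetime(2024, 1, 1, 9, 30),
    Phase.RECOVERED: datetime(2024, 1, 1, 10, 0),
    Phase.REPAIRED: datetime(2024, 1, 1, 13, 0),
}
durations = phase_durations(started, reached)
time_to_resolution = sum(durations.values(), timedelta())
```

In this made-up incident the service recovers after one hour, but the time to resolution is four hours, a gap that an SLA defined only by "up" versus "down" would not capture.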
Designing applications for redundancy and fault tolerance is also crucial for another reason: Many major cloud provider SLAs are dependent on it. For instance, some providers reserve their highest availability and uptime levels -- or certain credits -- strictly for resilient and cloud-native applications. These applications can often span multiple availability zones or multiple virtual machines.
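A rough calculation shows why spanning multiple zones matters. The sketch assumes that each zone fails independently and has the same availability, which real deployments will not strictly satisfy; the function names and the figures used are illustrative, not drawn from any provider's SLA.

```python
def combined_availability(zone_availability: float, zones: int) -> float:
    """Probability that at least one of `zones` independent zones is up."""
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("zone_availability must be between 0 and 1")
    if zones < 1:
        raise ValueError("zones must be at least 1")
    # The application is down only when every zone is down at the same time.
    return 1.0 - (1.0 - zone_availability) ** zones


def monthly_downtime_minutes(availability: float, days: int = 30) -> float:
    """Expected minutes of downtime over a month of `days` days."""
    return (1.0 - availability) * days * 24 * 60
```

With a hypothetical 99% per-zone availability, a single zone implies about 432 minutes of downtime in a 30-day month, while two independent zones imply roughly 4.3 minutes. Correlated failures, such as a region-wide outage, would erode that gain.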
"You really need to design for failure," Read said. "Take advantage of a lot of the capabilities that cloud providers give you to deploy redundant, fault-tolerant applications using multiple servers, for example, or multiple virtual servers."
In addition to building resiliency into their applications, cloud users should proactively monitor the health and performance of their cloud apps. And to do this, they shouldn't rely solely on their provider's monitoring tools, which may not provide enough detail. In-house or third-party cloud monitoring tools, such as Cloudyn or RightScale, are often needed instead.
"One of the challenges with SLAs is how do I measure them? Google and Amazon, in particular, aren't very transparent with what caused an outage and how widespread it is," the senior technologist said. "So, one of the things we do is use external service health providers to check that our services are available to outside [clients]."
Cloud users definitely have to "stay on top" of their cloud applications' performance, Read said, especially since many providers are slow to update the status pages for their services.
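Results collected by an in-house or third-party monitor can be turned into a user's own record of when a service was unavailable. The following is a minimal sketch, assuming the samples are already sorted by time; the `Sample` type and the `outage_windows` function are hypothetical names, and the code does not call any particular monitoring product.

```python
from datetime import datetime
from typing import List, Optional, Tuple

Sample = Tuple[datetime, bool]  # (time of check, whether the check succeeded)


def outage_windows(samples: List[Sample]) -> List[Tuple[datetime, datetime]]:
    """Return (first failed check, next successful check) pairs."""
    windows = []
    outage_start: Optional[datetime] = None
    for when, ok in samples:
        if not ok and outage_start is None:
            outage_start = when  # first failure opens a window
        elif ok and outage_start is not None:
            windows.append((outage_start, when))  # first success closes it
            outage_start = None
    if outage_start is not None:
        # Still failing at the last sample; close the window there.
        windows.append((outage_start, samples[-1][0]))
    return windows
```

Because the windows are built from checks made outside the provider's network, they do not depend on how quickly the provider updates its own status page.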
Lastly, it's important for cloud users to take application and service monitoring into their own hands, because some cloud providers aren't liable for a service failure until a user reports it to them, said Foley & Lardner's Tantleff.
"In most cases, they know that your service is out, but they don't have any liability to you until you tell them, 'By the way, I'm out,'" Tantleff said. "So the question then is, 'What are you doing to monitor your service and what are you doing to make sure you're available?'"
Kristin Knapp is site editor for SearchCloudComputing. Contact her at email@example.com or follow @kknapp86 on Twitter.