Evaluating a cloud computing SLA? Look at app workflows

Cloud expert Tom Nolle gives advice on evaluating a cloud computing SLA, including common mistakes, app and hybrid cloud workflow issues, response-time SLAs and more.

Evaluating or writing a service level agreement (SLA) for your cloud service is a lot more complicated than writing one for simple connection services, such as virtual private networks (VPNs). To get cloud computing SLA evaluations right, understand the elements of a cloud experience and who will actually provide them. Look for the app workflows, because critical application issues can kill even a good SLA. Also, be sure that you have a practical verification and remediation approach.

Public cloud computing services offer incredible agility and efficiency within their scope, but just how broad that scope is depends on service cost, as well as on availability and performance. This tip provides information on the common mistakes and best practices for evaluating cloud computing SLAs. Topics covered include response-time SLAs, getting guarantees from network and cloud providers, hybrid cloud SLA issues and more.

Follow the app workflow

The most critical mistake buyers of cloud services make with cloud computing SLAs is forgetting that all applications are really workflows. A request is passed over a network connection from a user to an application, often one made up of multiple components. That request can then cause work to flow to other components -- within the cloud or back into the data center -- and multiple accesses to databases that could be located in or outside the cloud. Eventually, the response is returned, via the network, to the user.

SLAs aren't useful if they focus on only one piece of this workflow, such as the piece related to public cloud hosting. If any part of the workflow is interrupted, the application fails. If performance degrades anywhere along the flow, the application's quality of experience suffers. It does no good to tighten the boundaries of performance or availability in the cloud when every other segment is only loosely guaranteed.
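To make the point concrete, here is a minimal sketch of how availability compounds along a serial workflow. It assumes that segment failures are independent and that every segment must be up for the application to work. The segment names and availability figures are hypothetical and chosen only for illustration.

```python
from math import prod

def end_to_end_availability(segments):
    """Availability of a serial workflow: every segment must be up at once.

    Assumes segment failures are independent of one another.
    """
    return prod(segments.values())

# Hypothetical availability figures, for illustration only.
workflow = {
    "user_lan": 0.999,
    "access_network": 0.995,
    "cloud_hosting": 0.9999,
    "cloud_to_datacenter_link": 0.998,
    "data_center": 0.999,
}

overall = end_to_end_availability(workflow)
weakest = min(workflow, key=workflow.get)

print(f"end-to-end availability: {overall:.4%}")
print(f"weakest segment: {weakest}")
```

Under these assumptions the end-to-end figure is always lower than the weakest segment's, so a very tight cloud hosting guarantee does little if the access network is only loosely guaranteed.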

Get SLA guarantees from all players

Another problem with evaluating cloud computing SLAs is the failure to get guarantees from all the relevant players. The cloud workflow will usually involve a minimum of three players -- the worker's own local networking, the access network provider that gets workers to the cloud, and the cloud provider. It may also involve your company's data center -- for network and hosting -- and a different network provider that supplies the cloud-to-data-center connection. Providers usually can't write or accept SLAs that cover workflow pieces they're not involved with. You'll need to either get one of them to act as a "prime contractor," for which it will charge a fee, or obtain or write a separate SLA for each player involved.

The network connections are usually the biggest problem in SLAs because, in most cases, the cloud provider doesn't provide the network services, except within the cloud itself. You'll need to write an SLA for network services if you want stringent SLAs. So first check whether your cloud provider offers a VPN or works with VPN providers you could use. In many cases, you'll still need to use the Internet to get users connected, but a VPN will give you a solid network boundary where you can expect to have guarantees.

Consider hybrid cloud workflows

"Border crossings" in hybrid clouds also create SLA issues. Workflows follow paths determined by application and business logic, and if these paths make multiple and variable crossings between the data center and the cloud, the performance and availability risk will rise. Your cloud provider can't hit a moving target of workflow patterns when guaranteeing performance or availability, so try to ensure that you don't introduce significant variables to the workflow where you expect firm guarantees. If you do, you'll have to write a very detailed and complex SLA to address all of the variables, and many providers simply won't accept it.

Define cloud computing SLA boundaries

The final issue in SLAs is detection of violations, as well as the penalties and processes of remediation. It is very unlikely that either you or your cloud provider -- or other network partners -- will accept an SLA based on one party's measurement of conditions. Good SLAs define a point of measurement at a boundary where each party to the agreement can make independent measurements for condition verification. Your own SLA should identify those points, the measurements to be taken and the measured conditions that will be considered a violation.
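One way to picture independent verification at a shared boundary is the sketch below. It assumes each party records a response-time reading for the same interval at the agreed measurement point. The function name, the tolerance rule and the three outcome labels are hypothetical; a real SLA would spell out its own rules.

```python
def classify_interval(customer_ms, provider_ms, threshold_ms, tolerance_ms):
    """Classify one measurement interval from two independent readings.

    Hypothetical rule: readings that differ by more than the tolerance are
    treated as disputed; otherwise a violation requires both readings to
    exceed the agreed threshold.
    """
    if abs(customer_ms - provider_ms) > tolerance_ms:
        return "disputed"   # the two parties' measurements disagree
    if min(customer_ms, provider_ms) > threshold_ms:
        return "violation"  # both parties measured a breach
    return "ok"

# Hypothetical readings in milliseconds against a 200 ms threshold.
print(classify_interval(250, 245, threshold_ms=200, tolerance_ms=20))
print(classify_interval(250, 120, threshold_ms=200, tolerance_ms=20))
print(classify_interval(150, 155, threshold_ms=200, tolerance_ms=20))
```

The design choice here is that neither party's number alone triggers a violation, which mirrors the idea that an SLA based on one party's measurement is unlikely to be accepted.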

Availability and performance violations for packet network and cloud services are usually based on a fairly long reporting period -- for example, outages per week or month. It's best to have downtime-per-interval agreements rather than simple fault counts, because the latter won't account for the mean time to repair. Response-time SLAs are harder to write due to the difficulty in measuring response times correctly. If you include response times in your SLA, take time to specify exactly how both parties are going to measure them.
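The difference between a fault count and downtime per interval can be seen in a small sketch. The two outage logs below are hypothetical: both providers have three faults in a 30-day period, but their repair times differ widely.

```python
from datetime import datetime, timedelta

def downtime_summary(outages, period):
    """Return (fault_count, total_downtime, availability) for one period.

    `outages` is a list of (start, end) datetimes that fall inside the period.
    """
    total = sum((end - start for start, end in outages), timedelta())
    availability = 1 - total / period
    return len(outages), total, availability

period = timedelta(days=30)
t0 = datetime(2024, 1, 1)  # hypothetical start of the reporting period

# Hypothetical logs: same number of faults, very different time to repair.
provider_a = [(t0 + timedelta(days=d), t0 + timedelta(days=d, minutes=5))
              for d in (3, 12, 20)]
provider_b = [(t0 + timedelta(days=d), t0 + timedelta(days=d, hours=4))
              for d in (3, 12, 20)]

for name, log in (("A", provider_a), ("B", provider_b)):
    count, total, availability = downtime_summary(log, period)
    print(f"provider {name}: {count} faults, {total} down, {availability:.3%} available")
```

A fault-count clause would treat these two providers identically, while a downtime-per-interval clause separates 15 minutes of downtime from 12 hours.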

Remediation or penalties are always a sticky point. Many users think they can get compensation for business loss if an SLA is violated -- what's usually called consequential damages. That's extremely rare and expensive; you'd be better off engineering your applications for high availability than trying to negotiate such an agreement.

Users report that the most helpful penalty in an SLA is an escalation clause. If an SLA failure occurs, the failure should result in a notification to the provider's operations center. If there is no resolution in a specified time or if the event frequency exceeds a threshold, then a notification should be sent up the provider's management chain -- with the higher level becoming responsible for checking the situation and personally contacting you with status updates and remedies. This guarantees senior management attention to your problems, which reduces the chances you'll have problems in the first place.
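The escalation rule described above can be sketched as a simple check. The policy fields, threshold values and return labels are hypothetical; an actual clause would define its own time limits, frequency window and management tiers.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class EscalationPolicy:
    resolve_within: timedelta  # longest an incident may stay unresolved
    max_incidents: int         # incidents tolerated inside the window
    window: timedelta          # look-back window for the frequency check

def escalation_target(policy, opened_at, now, incident_times):
    """Decide who should be notified about a still-open SLA failure."""
    overdue = now - opened_at > policy.resolve_within
    recent = [t for t in incident_times if now - t <= policy.window]
    too_frequent = len(recent) > policy.max_incidents
    if overdue or too_frequent:
        return "management_chain"
    return "operations_center"

# Hypothetical policy: 4 hours to resolve, more than 3 incidents in 30 days escalates.
policy = EscalationPolicy(timedelta(hours=4), 3, timedelta(days=30))
now = datetime(2024, 1, 31, 12, 0)

print(escalation_target(policy, now - timedelta(hours=1), now, [now]))
print(escalation_target(policy, now - timedelta(hours=6), now, [now]))
```

Either condition alone triggers escalation in this sketch, matching the "no resolution in a specified time or event frequency exceeds a threshold" wording.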

Financial penalties in SLAs should be limited to the cost of service during the outage period as a baseline, with perhaps a rebate of service costs over the entire measurement interval if the outage is severe. Your chances of getting such a penalty clause will depend on the size of your contract and your potential as a future customer.
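As a rough illustration of such a penalty clause, the sketch below pro-rates the service fee for the outage period and refunds the whole interval once an outage passes a severity threshold. The function name, parameters, fee and threshold are all hypothetical.

```python
def service_credit(period_fee, period_hours, outage_hours, severe_after_hours=None):
    """Credit owed for one measurement interval.

    Baseline: the cost of service during the outage (fee pro-rated by time).
    If a severity threshold is set and reached, rebate the whole interval.
    """
    if severe_after_hours is not None and outage_hours >= severe_after_hours:
        return float(period_fee)
    return period_fee * outage_hours / period_hours

# Hypothetical contract: 7,200 per 30-day (720-hour) interval.
print(service_credit(7200, 720, outage_hours=3))
print(service_credit(7200, 720, outage_hours=30, severe_after_hours=24))
```

The baseline credit is small relative to the fee, which is one reason the escalation clause tends to matter more to buyers than the financial penalty.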
