Nearly all cloud computing users think service-level agreements (SLAs) are important for the cloud, but most admit they don't have a handle on enforcing one. Without proper consideration and tools, even a good SLA will likely fail because you won't know it's being violated or why. To make the right cloud computing SLA decisions, divide your cloud into responsibility zones, use analytics to set baseline conditions for application behavior, and treat SLA failures as "projects" with their own flow.
Drawing cloud responsibility zones
One of the challenges with a cloud computing SLA is that the experience delivered by a cloud application is the sum of the performance of three or more entities. Figuring out which one is causing a problem can be difficult, so the first task in creating an SLA decision framework for the cloud is to develop a simple entity map that shows who provides each portion of the cloud service and where each portion hands off to the next.
A typical cloud application starts with user-owned facilities that provide user connection. This can be a mobile device or it may be an entire company network. From this user-supplied piece, the cloud application connects through a WAN, usually the Internet, and from that to the cloud provider's infrastructure. Some users employ VPN services for cloud access from fixed sites, and others may have more than one cloud provider, so it's possible that there will be more than the three standard responsibility zones in your own cloud.
Cloud applications generate workflows that move across these zones, and you'll want to understand exactly how that movement takes place for each type of cloud application you run. It should be possible for you to say, based on the name of an application, just how work flows to fulfill its users' needs. That workflow is the basis for your SLA decisions.
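As a minimal sketch of such an entity map, the Python below records who owns each zone and the order in which an application's work crosses them. The zone names, owners and the "order-entry" application are hypothetical placeholders, not part of any real deployment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    name: str   # what the zone is
    owner: str  # who is responsible for it


# Hypothetical zones matching the three standard responsibility zones.
USER_NET = Zone("user network", "your IT team")
WAN = Zone("WAN/Internet", "ISP")
CLOUD = Zone("cloud infrastructure", "cloud provider")

# Application name -> ordered list of zones its workflow crosses.
WORKFLOWS = {
    "order-entry": [USER_NET, WAN, CLOUD],
}


def boundaries(app):
    """Return the zone-to-zone handoff points for an application."""
    zones = WORKFLOWS[app]
    return [(a.name, b.name) for a, b in zip(zones, zones[1:])]
```

Calling boundaries("order-entry") lists the two handoff points where monitoring would be placed. A VPN or a second cloud provider would simply add entries to the workflow list.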
For good SLA management and policy decisions, you need to measure the behavior of each of the suppliers in the cloud zones of your applications. You should always start with mechanisms to measure response time and move from that to measuring conditions at the boundary points of your zones.
End-to-end response time measurement is best done at the point of user connection so you can read the full response time. In some cases this means building response time monitoring into the application itself, although often the TCP/IP software for a device provides some of that data through a management interface.
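Where response time monitoring has to be built into the application, one simple approach is to wrap each request in a timer. The sketch below assumes the request can be expressed as a Python callable; fake_request is a hypothetical stand-in for a real call to a cloud application.

```python
import time


def timed_call(request_fn, *args, **kwargs):
    """Run request_fn and return (result, elapsed_seconds)."""
    start = time.perf_counter()  # monotonic, high-resolution clock
    result = request_fn(*args, **kwargs)
    return result, time.perf_counter() - start


def fake_request():
    """Hypothetical stand-in for a real cloud application request."""
    time.sleep(0.01)
    return "ok"


if __name__ == "__main__":
    result, elapsed = timed_call(fake_request)
    print(f"{result} in {elapsed * 1000:.1f} ms")
```

Because the timer runs at the point of user connection, the elapsed value covers every zone the request crosses, which is what an end-to-end figure should capture.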
For the zone-boundary monitoring mission, some form of traffic or protocol monitoring is hard to beat. These tools place probes, either software or hardware elements, at various points in the network, and they let a central management console view the packet traffic, using deep packet inspection to sort it by application.
Avoid network analytic pitfalls
One big mistake users make at this point is getting focused on monitoring without knowing what's good or bad. A network management system (NMS) may collect data in a repository naturally (OpenNMS does this, for example). This data collection allows you to run queries to analyze performance and conditions over time and set baselines for normal behavior as well as thresholds for what you'd consider to be SLA violations. If your management system doesn't provide a repository, you'll want to add network analytics tools to gather and correlate management data and set your performance baselines.
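To illustrate what setting a baseline and a threshold can look like, here is a minimal sketch that works on response time samples pulled from a repository. The rule used, mean plus three standard deviations, is an assumption chosen for illustration; your SLA may call for a percentile or a fixed contractual limit instead.

```python
import statistics


def baseline_and_threshold(samples, k=3.0):
    """Return (baseline, threshold) for a list of response times.

    The baseline is the mean; the threshold is k sample standard
    deviations above it. Requires at least two samples.
    """
    mean = statistics.mean(samples)
    stdev = statistics.stdev(samples)
    return mean, mean + k * stdev


def violations(samples, threshold):
    """Return the samples that exceed the threshold."""
    return [s for s in samples if s > threshold]
```

Fed a history of normal measurements, baseline_and_threshold gives you numbers to compare new measurements against, so a reading is judged good or bad rather than just collected.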
Network analytics can be a strong foundation for decisions around service-level agreements for cloud computing. Make sure that the tool has provisions for adding cloud performance data obtained from the cloud management system APIs to network data obtained from your own NMS. If you have a VPN or a hybrid cloud with a large data center component, it can even be smart to look first at tools from your primary network vendor. These can help you maintain your own IT and network infrastructure performance, and will also help manage cloud SLA-based decisions.
Of course, it all comes down to how SLA errors are detected. A good system provides for three inputs into the cloud SLA decision process. The first is subjective user reports of poor performance; the second is a detected end-to-end response time problem for one or more applications; and the third is a report of a specific problem at a zone boundary. In all cases, you should first assess the impact of the problem and then target possible contributors to it.
Your workflow-zone map will let you see whether there's a general problem with several applications at a zone boundary point, or with one application only. In the former case you are probably experiencing a network or cloud infrastructure problem, and in the latter a cloud application problem. For an infrastructure problem, use monitoring tools to examine all the zone boundaries in the affected workflows to see where the problem is occurring. It should show up as a longer delay between two zone boundary points or as packet loss at a boundary. Your traffic probes will usually identify either of these faults.
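The several-applications-versus-one test can be written down as a first-pass heuristic. The sketch below assumes your monitoring already produces a list of (application, boundary) pairs that are outside their baselines; the application and boundary names in the usage note are hypothetical.

```python
from collections import defaultdict


def classify_problem(degraded):
    """Classify each degraded boundary by how many applications it affects.

    degraded: iterable of (application, boundary) pairs that are
    currently outside their baseline.
    Returns {boundary: "infrastructure" or "application"}.
    """
    apps_by_boundary = defaultdict(set)
    for app, boundary in degraded:
        apps_by_boundary[boundary].add(app)
    return {
        boundary: "infrastructure" if len(apps) > 1 else "application"
        for boundary, apps in apps_by_boundary.items()
    }
```

For example, if "crm" and "erp" are both degraded at a "wan->cloud" boundary while only "mail" is degraded at "user->wan", the first boundary is flagged as a likely infrastructure problem and the second as a likely application problem. The result is a starting point for investigation, not a diagnosis, particularly where only one application uses a boundary.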
Handle cloud computing SLA decisions like a project
If there is a problem, the remediation should be treated as a small project, with a project manager and a fixed set of tasks, usually called the escalation procedure. Some users even employ simple project management or fault tracking tools to follow cloud computing SLA issues from detection to resolution. Fault tracking tools intended for software projects can sometimes serve, and some network analytics tools include at least an option for fault tracking.
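If you don't have a fault tracking tool at hand, even a very small record of each incident can enforce the fixed sequence of tasks. The sketch below is one possible shape for it; the step names in ESCALATION_STEPS are hypothetical and should be replaced by your own escalation procedure.

```python
from dataclasses import dataclass, field

# Hypothetical escalation procedure; substitute your own fixed task list.
ESCALATION_STEPS = (
    "detected",
    "impact assessed",
    "zone isolated",
    "supplier notified",
    "resolved",
)


@dataclass
class SlaIncident:
    application: str
    project_manager: str
    step_index: int = 0
    history: list = field(default_factory=list)

    @property
    def step(self):
        """Name of the current escalation step."""
        return ESCALATION_STEPS[self.step_index]

    def advance(self, note):
        """Record a note for the current step and move to the next one."""
        if self.step_index >= len(ESCALATION_STEPS) - 1:
            raise ValueError("incident is already resolved")
        self.history.append((self.step, note))
        self.step_index += 1
        return self.step
```

Each call to advance moves the incident one step forward and keeps a note of what was done, so the path from detection to resolution can be reviewed afterward.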
Taking an organized approach to a cloud computing SLA and the decisions that come out of its enforcement is critical if the SLA and the services it covers are to be successful. If you start your deliberations with plans to support SLA decisions, you'll have better service experiences overall.