Cloud outages: What to do when cloud services and apps fail

Learn how to identify cloud outages quickly and best practices for resolutions.

When it comes to cloud service outages, the question is not if, but when. That's why users of cloud applications and services, and their software architects, must design ways to exert control when disaster strikes. The best response is to turn off the affected cloud service, either automatically or manually, until the issue is resolved, according to Michael Kopp, technical product manager at ruxit, a division of Dynatrace based in Waltham, Mass. The trick is identifying a crisis quickly enough to minimize damage.

Cloud service and application problems typically cause applications to stop working or slow down. "The truth is that most companies are ill prepared for this or do not even notice it until customers are calling, or it is mentioned on Twitter," said Kopp, who worked for Compuware before Dynatrace separated into its own company.

Typically, software architects handle a cloud outage by either automatically or manually turning off the service in question until the issue is resolved, Kopp said. Tools used for turning off cloud apps can include a user experience management (UEM) tool, which reports third-party and application-level response times for all users currently engaging with the application; a synthetic monitoring tool, which covers global points of presence and alerts application managers even when no users are actively using the impacted service; and a performance analytics, or outage, tool, which uses big data analytics to determine whether a particular cloud service is down and which region is affected.

Michael Kopp

Kopp offers advice on when, why and how to build in redundancy and automate turning off cloud services and applications in this interview.

In which situations would a cloud customer need to turn off a cloud application or service?

Michael Kopp: If a third-party or auxiliary internal service caused deterioration of the customer experience, and it was not part of the application flow, it should be turned off automatically and flagged. Turn off recommendations, analytics, ads, news tickers and so on.

Detection requires a UEM solution to be in place. [UEM] together with an outage analyzer ... also allows you to identify whether the problem is geographically localized or more general; Facebook might not be reachable in a specific region, or the recommendation service is down in Europe but not in the US. If a geographic region can be determined, then the service could be turned off for the affected region only. After this, the issue would be flagged and the responsible party informed. Then APM data could be used to analyze and quickly fix the issue.
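As a rough illustration of turning a service off for the affected region only, the sketch below keeps a short window of response times per region and disables the service where the median is too high. The class name `RegionalSwitch`, the 2,000 ms threshold and the window size are assumptions made for this example; they do not come from Kopp or from any particular UEM product.

```python
from collections import defaultdict, deque
from statistics import median


class RegionalSwitch:
    """Hypothetical helper: disables an auxiliary service per region when it is slow."""

    def __init__(self, threshold_ms=2000.0, window=5):
        self.threshold_ms = threshold_ms  # assumed limit, not a recommended value
        self.window = window
        self.samples = defaultdict(lambda: deque(maxlen=window))
        self.flagged = []  # regions to report to the responsible party

    def record(self, region, response_ms):
        """Store one response-time measurement, e.g. taken from a monitoring feed."""
        self.samples[region].append(response_ms)
        if self.is_disabled(region) and region not in self.flagged:
            self.flagged.append(region)

    def is_disabled(self, region):
        """Disabled once a full window of samples has a median above the threshold."""
        recent = self.samples[region]
        return len(recent) == self.window and median(recent) > self.threshold_ms


switch = RegionalSwitch(threshold_ms=2000.0, window=3)
for ms in (2500, 3100, 2800):  # slow responses seen in Europe
    switch.record("eu", ms)
for ms in (120, 140, 110):     # healthy responses seen in the US
    switch.record("us", ms)

print(switch.is_disabled("eu"), switch.is_disabled("us"))  # True False
print(switch.flagged)                                      # ['eu']
```

Because a disabled region stops producing real-user samples, a setup like this would likely rely on synthetic checks to keep feeding `record` so the region can recover.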

How common is it for organizations to spread apps across cloud service providers in order to ensure service reliability? What are some best practices for this approach?

Kopp: This approach is not yet very common, as it involves a higher level of investment in terms of R&D, operations and also money and legal, [such as] different contracts. However, there are several third-party services that nowadays do that. So-called cloud brokers or cloud managers enable the customer to build the application in relative independence of a particular cloud provider and distribute an application across multiple providers. We see this most often in larger global companies that need many different points of presence and therefore cannot rely on a single cloud provider. A best practice is to either use one of these cloud brokers or managers or to use a software management [tool] to abstract you from the differences of the cloud providers.

In all cases, [protecting against cloud outages] requires APM to support global and cross-cloud platform applications. [This allows] intercommunication between components in different clouds. This is [important] as data ingress has a dollar value attached in most clouds and should therefore be minimized.

When building or selecting applications for the cloud, what architectural options can be used for maintaining control or on/off capabilities? What are best practices for architecting apps in a way that makes it possible for application managers to turn off a service that causes a problem?

Kopp: Features should clearly be separated into main and auxiliary features. Auxiliary features should be hooked into the application flow via dependency injection and a proxy (the software concept, not the network feature). The proxy can then be designed to autonomously decide whether to call the service or not. At the same time, this proxy can expose an option to the operations console enabling the application manager to manually turn that feature off or on. This way the turn on/off functionality is clearly separate -- separation of concerns -- from the feature of the auxiliary service. In addition, the proxy can be standardized across many services, making it easier for management and APM solutions to automatically turn features on and off.
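One way to picture the proxy Kopp describes is the minimal sketch below. The name `FeatureProxy`, the failure limit and the slowness threshold are hypothetical choices for illustration, not part of any product he mentions. The proxy is injected into the main flow, decides on its own whether to call the auxiliary service, and exposes `turn_off` and `turn_on` for an operations console.

```python
import time


class FeatureProxy:
    """Hypothetical proxy around an auxiliary feature such as recommendations."""

    def __init__(self, name, call, fallback, max_failures=3, slow_after_s=1.0):
        self.name = name
        self._call = call
        self._fallback = fallback          # what the main flow gets when the feature is off
        self.max_failures = max_failures   # assumed limit before the proxy stops calling
        self.slow_after_s = slow_after_s   # assumed threshold for a "slow" call
        self.failures = 0
        self.manual_off = False

    def turn_off(self):
        """Manual switch for the operations console."""
        self.manual_off = True

    def turn_on(self):
        """Re-enable the feature and forget earlier failures."""
        self.manual_off = False
        self.failures = 0

    @property
    def enabled(self):
        return not self.manual_off and self.failures < self.max_failures

    def __call__(self, *args, **kwargs):
        if not self.enabled:
            return self._fallback
        start = time.monotonic()
        try:
            result = self._call(*args, **kwargs)
        except Exception:
            self.failures += 1             # errors count against the feature
            return self._fallback
        if time.monotonic() - start > self.slow_after_s:
            self.failures += 1             # slow answers count as well
        else:
            self.failures = 0              # a healthy call resets the counter
        return result


calls = []


def flaky_recommendations(product):
    calls.append(product)
    raise TimeoutError("recommendation service timed out")


def render_page(product, recommend):
    # The main flow receives the proxy by injection and never sees the outage.
    return {"product": product, "recommendations": recommend(product)}


recommend = FeatureProxy("recommendations", flaky_recommendations, fallback=[])
pages = [render_page("book", recommend) for _ in range(5)]
print(len(calls), recommend.enabled)  # 3 False
```

In this sketch the proxy stays off after three consecutive failures until someone calls `turn_on`; a production version might instead retry after a cool-down period.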

What are design best practices for protecting against cloud services outages?

Kopp: Customers that run [applications] in public clouds typically design their application to either be distributed across multiple zones or automatically fail over into another zone. In other words, build redundancy into the application in order to avoid outages.

The key is an application design that can not only turn off services, but also has a defined way of working if that service is not there. The other key bit is that production is never exactly like dev or test, so we need to have full visibility and at the same time a safe way to react automatically to slowdown or availability events. On the other side of things, problems in production often cannot be easily reproduced; therefore, we need a way to analyze the issue without reproduction. These points are the domain of APM.
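The two ideas above, failing over to another zone and having a defined way of working when a service is gone, can be sketched together. The function `call_with_failover` and the zone handlers are hypothetical stand-ins written for this example; real zone clients would be provider-specific. Errors are collected rather than discarded so the incident can be analyzed later without reproducing it.

```python
def call_with_failover(zones, request, default):
    """Try each (name, handler) pair in order.

    Returns (zone_name, result, errors). If every zone fails, zone_name is
    None and result is the predefined default.
    """
    errors = {}
    for zone_name, handler in zones:
        try:
            return zone_name, handler(request), errors
        except Exception as exc:
            errors[zone_name] = repr(exc)  # keep evidence for later analysis
    return None, default, errors


def zone_a(request):
    raise ConnectionError("zone-a unreachable")


def zone_b(request):
    return f"handled {request} in zone-b"


zone, result, errors = call_with_failover(
    [("zone-a", zone_a), ("zone-b", zone_b)],
    "order-42",
    default="service unavailable",
)
print(zone, "|", result)  # zone-b | handled order-42 in zone-b

# With every zone down, the caller still gets the defined default behavior.
zone, result, _ = call_with_failover([("zone-a", zone_a)], "order-43", default="service unavailable")
print(zone, "|", result)  # None | service unavailable
```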

About the expert:
Michael Kopp has 12 years of experience as an architect and developer in Java/JEE and C++. He currently works as a technical product manager at ruxit, a division of Dynatrace, defining product requirements and developing customer-centric solutions. He was formerly a technology strategist and evangelist at Compuware's APM Center of Excellence. Before joining Compuware, he was chief architect at GoldenSource.

Update: This article was updated on December 23, 2015, to state that Dynatrace is Michael Kopp's current employer. 
