Limiting the effects of cloud computing outages

After making the case for cloud based on high availability, IT pros are upset by cloud outages. With planning, there are ways to craft a strategy.

Contrary to popular belief, cloud services actually fail more often than internal data center facilities do. The cloud isn't inherently unreliable, but like all forms of IT, cloud services have to be selected carefully and managed to achieve specific reliability and availability goals. The steps can be contractual, technical or may even involve rethinking your application architectures. Without careful consideration, you may get less from your cloud services than you expected.

SLAs mitigate risk of using cloud providers' data centers

Protecting against cloud outages starts by assessing the reliability of cloud providers' data centers. The majority of cloud providers have a small number of data centers, often only one, and these data centers are subject to the same kinds of failures as an enterprise data center. The most publicized cloud failures occur when an entire cloud data center fails, usually because of a natural disaster. To protect yourself in case of failure, you'll either have to ask for specific data center configuration information or obtain an availability guarantee from your provider.

For server, storage and network reliability, the best strategy is to negotiate a service-level agreement (SLA) specifying the availability guarantee and the time to restore service if it's lost. It's important to understand whether a cloud data center is located in an area where natural disasters like hurricanes or blizzards are common. Also find out if the data center has backup power and whether there's a backup data center that can pick up the load.
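When comparing availability guarantees, it can help to translate the percentage into the downtime it actually permits. The following is a minimal sketch of that arithmetic; the function name and the sample percentages are illustrative assumptions, not figures from any provider's SLA.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(availability_pct: float,
                           period_hours: float = HOURS_PER_YEAR) -> float:
    """Return the downtime an availability guarantee permits over a period."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability must be a percentage between 0 and 100")
    return period_hours * (1 - availability_pct / 100)


# Illustrative guarantee levels only; check the figures in your own contract.
for pct in (99.0, 99.9, 99.99):
    print(f"{pct}% availability allows about "
          f"{allowed_downtime_hours(pct):.2f} hours of downtime per year")
```

An availability figure says nothing about how long a single outage may last, which is why the time to restore service belongs in the SLA as a separate term.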

The backup data center must be located in a different region from the primary one, so it's unlikely to be affected by the same conditions, and it must have enough capacity to handle failover of cloud applications. Since few providers maintain enough backup data center capacity for 100% failover of a primary data center, the SLA should indicate how failover is managed.

It may be necessary to pay for priority in this situation. If your cloud service includes geographic diversity to support a distributed user population, your own diverse facilities may provide some protection against a cloud provider failure; check your contract carefully to ensure there's enough capacity to handle the additional load.

Network performance -- or lack thereof -- leads to cloud outages

The most common cause of cloud failures is usually not the cloud at all, but the network. The majority of cloud computing applications are accessed via the Internet, and Internet availability problems create most cloud computing outages. The only ways to address this are to move off the Internet to a virtual private network (VPN) or virtual local area network service, or to secure multiple Internet service providers (ISPs) for sites accessing cloud applications. Moving to a VPN may be a good option if security and compliance issues can be addressed and contracted for by the provider. It will likely involve a special charge, unless the cloud provider already uses the carrier providing your VPN.
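To see why the access network tends to dominate, consider the standard availability arithmetic: components in series multiply, so the weakest link drags the total down, while redundant paths fail only when every path fails. The sketch below uses made-up availability figures and assumes failures are independent, which real networks do not guarantee.

```python
from math import prod


def series_availability(*components: float) -> float:
    """Availability of a chain in which every component must be up."""
    return prod(components)


def parallel_availability(*paths: float) -> float:
    """Availability of redundant paths, assuming independent failures."""
    return 1 - prod(1 - p for p in paths)


# Hypothetical figures: a 99.95% cloud service reached over a 99.5% link.
cloud = 0.9995
single_isp = 0.995

one_link = series_availability(cloud, single_isp)
two_links = series_availability(cloud,
                                parallel_availability(single_isp, single_isp))

print(f"Cloud over one ISP:  {one_link:.4%}")
print(f"Cloud over two ISPs: {two_links:.4%}")
```

With one link, end-to-end availability falls below that of the link itself; with two independent links, it approaches the availability of the cloud service.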

With Internet service costs falling for small businesses, it's practical to provide a branch office with two ISPs. However, ensure there are no common points of failure between the two ISPs. Peering points and interconnection "hotels" are often shared among providers. Even common access wiring between the ISPs will defeat the benefits of having dual network connections.
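One way to audit a dual-ISP design is to list the physical elements each connection depends on and look for overlap. The sketch below assumes each ISP can tell you which conduits, central offices and interconnection facilities its service uses; the element names are hypothetical.

```python
def shared_failure_points(path_a: set[str], path_b: set[str]) -> set[str]:
    """Return the elements that both access paths depend on."""
    return path_a & path_b


# Hypothetical inventories gathered from each provider.
isp_a_path = {"building-conduit-1", "central-office-north", "carrier-hotel-x"}
isp_b_path = {"building-conduit-1", "central-office-south", "carrier-hotel-x"}

shared = shared_failure_points(isp_a_path, isp_b_path)
if shared:
    print("Common points of failure:", ", ".join(sorted(shared)))
else:
    print("No shared elements found in the listed paths.")
```

An empty result only shows that the listed elements differ; it cannot prove the paths are diverse if a provider's inventory is incomplete.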

Cloud application resilience must be addressed

If cloud data center and cloud network failures have been addressed, the next question is the resilience of the applications themselves. The greatest problems in managing high availability and cloud services involve both database access and reliable transaction processing.

When a data center fails, the data stored there is unavailable, even if another data center can back up the applications using the data. Unless application data is maintained in "hot standby" form in multiple locations, a failure will result in loss of data access, which makes other redundancy measures largely ineffective. This same problem exists for internal data center backup, so companies who have provided their own data center redundancy may find the same procedures will work in the cloud. This is less of a technical strategy than a financial one, though; the cost of maintaining redundant data in the cloud is higher because of cloud storage and access charges. A better solution may be to house all your data on-premises in a high availability, protected data center, and access it from multiple cloud locations.

The best availability management will have to be integrated with the applications themselves. Any time database updates are made to multiple copies at the same time there's a risk of loss of data integrity if a failure occurs during the update process. Online transaction processing systems usually include a "two-phase commit" process to back out transactions that don't update all database copies successfully. Sometimes network failure leaves even single-database updates in an uncertain state. It's essential to review application designs to ensure failures of the network or of a data center where databases are stored won't create a risk of contaminated or inconsistent data.
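The idea behind a two-phase commit can be sketched in a few lines: every database copy first stages the update and votes, and the change is applied only if all copies vote yes. The `Replica` class below is a hypothetical in-memory stand-in, not any database product's API, and the sketch omits the durable logging and timeout handling a production coordinator needs.

```python
class Replica:
    """Hypothetical in-memory stand-in for one database copy."""

    def __init__(self, name: str, fail_on_prepare: bool = False):
        self.name = name
        self.fail_on_prepare = fail_on_prepare  # simulates an unreachable copy
        self.data = {}
        self._staged = None

    def prepare(self, key, value) -> bool:
        """Phase 1: stage the update and vote yes, or vote no."""
        if self.fail_on_prepare:
            return False
        self._staged = (key, value)
        return True

    def commit(self) -> None:
        """Phase 2: make the staged update visible."""
        if self._staged is not None:
            key, value = self._staged
            self.data[key] = value
            self._staged = None

    def rollback(self) -> None:
        """Discard the staged update."""
        self._staged = None


def two_phase_commit(replicas, key, value) -> bool:
    """Apply an update to all replicas or to none of them."""
    prepared = []
    for replica in replicas:
        if replica.prepare(key, value):
            prepared.append(replica)
        else:
            # One "no" vote aborts the transaction everywhere.
            for done in prepared:
                done.rollback()
            return False
    for replica in replicas:
        replica.commit()
    return True


primary, standby = Replica("primary"), Replica("standby")
unreachable = Replica("remote", fail_on_prepare=True)

print(two_phase_commit([primary, standby], "balance", 100))
print(two_phase_commit([primary, unreachable], "balance", 250))
print(primary.data)
```

The second update is backed out because one copy could not prepare, so the primary keeps its earlier value. A failure during the commit phase is the harder case: a copy that has voted yes but not yet heard the outcome is left in doubt, which is the kind of uncertain state a design review should look for.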

It's not unreasonable to expect cloud applications to be as reliable as on-premises applications, or more so. But reliability, and the specific goals you set for it, are likely to cost you. Remember to consider reliability costs when building your cloud business case, or you may find your applications will have to trade ad hoc between reliability and cost.

About the author
Tom Nolle is president of CIMI Corporation, a strategic consulting firm specializing in telecommunications and data communications since 1982.
