The cloud is supposed to be a way of improving reliability for IT, but there are all too many stories of cloud provider failures that seem to prove otherwise. We now know that the cloud can be a backup resource and also that it may need backup itself. Application design has to navigate this apparent paradox, and architects can deal with cloud availability risks through effective application componentization, selective redundancy, cloud bursting and reliable transaction processing techniques.
Any IT infrastructure will fail at some predictable rate, including cloud infrastructure. As enterprises build more hybrid applications and rely on the cloud more to back up data center resources, they often build complex systems whose overall reliability is lower than that of the data center. When two components of an application are both required to be available for the application to run, the combined availability is lower than that of either component individually. That's why the first step in building cloud failure-proof applications is the careful design of the component structure and the application workflow.
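The arithmetic behind that claim can be sketched in a few lines. The availability figures below are illustrative assumptions, not measurements, and the calculation assumes the two hosting points fail independently.

```python
from math import prod


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes failures are independent, so the probabilities multiply.
    """
    return prod(availabilities)


# Hypothetical figures chosen only for illustration.
data_center = 0.999
cloud = 0.995

combined = serial_availability([data_center, cloud])
print(f"data center alone: {data_center:.6f}")
print(f"cloud alone:       {cloud:.6f}")
print(f"both required:     {combined:.6f}")
```

Because each factor is below 1, the product is always lower than the weakest component, which is why every additional required hosting point pulls overall availability down.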
The starting point is to make sure all workflow processes, component directories and nonduplicated database elements are hosted on high-availability infrastructure, which will often mean keeping them in-house. "What you cannot replicate, you must harden" is the watchword here, and that includes redundant power and careful analysis of your network connections' reliability.
To maximize availability in the componentization of application functionality, be aware that increasing the number of separately hosted components will typically lower availability, so it may be wise to limit componentization, even at the expense of runtime composability. Applications can be developed to modular software standards, but the deployed elements of the application can contain multiple components and so will require fewer hosting points, which means better reliability.
Reducing the number of components won't maintain operation if the hosting resources for a component fail. The notion of selective redundancy means that some components can be replicated on different host platforms (even different clouds, or cloud plus data center) to ensure that at least one copy is available at any given time. The "selective" point here is that there are only specific points in a workflow where multiple pathways for processing can be coordinated correctly without loss of state or possible collisions in updating databases. Find these places and use redundancy to improve availability, but remember that you may have to configure a cloud service to take advantage of geographic diversity and that if your cloud data center and prime data center are in the same metro area, they're not fully redundant.
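The benefit of replicating a component can be estimated with the complementary calculation: the component is down only when every copy is down at once. This is a rough sketch with an assumed availability figure, and it holds only if the copies fail independently, which is exactly what a shared metro area undermines.

```python
def redundant_availability(availability, copies):
    """Availability of a component replicated on `copies` independent hosts.

    The component is unavailable only if all copies fail together.
    """
    return 1 - (1 - availability) ** copies


# Hypothetical availability of a single hosting point.
single_host = 0.995

one_copy = redundant_availability(single_host, 1)
two_copies = redundant_availability(single_host, 2)
print(f"one copy:   {one_copy:.6f}")
print(f"two copies: {two_copies:.6f}")
```

If both copies depend on the same power grid or network path, their failures are correlated and the real figure will be lower than this estimate.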
If the interfaces between the application components being considered for selective redundancy are stateless, in the style of Representational State Transfer (REST), then paralleling multiple components is relatively straightforward because the client/user element can be relied on to maintain state in case of a failover. Where stateful service-oriented architecture/Simple Object Access Protocol interfaces are used, simply replicating a component may not be a solution, because a failure will almost surely break at least the current session or transaction. If this is a problem, then consider the reliable-processing suggestion later in this tip.
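A minimal sketch of the stateless case follows. The replica names, the request shape and the `ReplicaDown` exception are all hypothetical; the point is only that the request carries the full state, so any surviving copy can serve it.

```python
class ReplicaDown(Exception):
    """Raised by a replica that cannot serve requests (hypothetical)."""


def make_replica(name, up=True):
    """Build a stateless handler: its answer depends only on the request."""
    def handle(request):
        if not up:
            raise ReplicaDown(name)
        return {"served_by": name, "total": sum(request["cart"])}
    return handle


def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response."""
    last_error = None
    for replica in replicas:
        try:
            return replica(request)
        except ReplicaDown as err:
            last_error = err
    raise RuntimeError("all replicas unavailable") from last_error


# The client holds the state (the cart) and resends it after a failure.
replicas = [make_replica("cloud-a", up=False), make_replica("data-center")]
result = call_with_failover(replicas, {"cart": [5, 7]})
print(result)
```

With a stateful interface the second replica would have no record of the session, which is why replication alone does not help in that case.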
One useful way to spot examples for selective redundancy is to consider cloud bursting, the use of the cloud to offload work during peak-load periods. The practices that allow cloud bursting to work are the same as those needed to manage selective redundancy for availability purposes; they include a form of load balancing and a point of aggregation and serialization where updates to databases are managed. There are techniques available (domain name system, for example) to implement distributed load balancing without deploying a single point of failure, but in many cases, both the load-balancing and serialization elements of your application will have to be deployed as high-availability to ensure the cloud-bursting/redundancy process improves rather than reduces availability overall.
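The two elements named above, a load balancer and a single point where database updates are serialized, can be illustrated with a small single-process sketch. The worker functions, the request fields and the round-robin policy are assumptions made for the example; a real deployment would spread these across hosts.

```python
import itertools
import queue


def data_center_worker(request):
    """Hypothetical worker in the data center; emits an update record."""
    return (request["account"], request["amount"])


def cloud_worker(request):
    """Hypothetical burst worker in the cloud; same logic, different host."""
    return (request["account"], request["amount"])


def burst_dispatch(requests, workers):
    """Round-robin requests across workers and collect their update records."""
    updates = queue.Queue()
    for request, worker in zip(requests, itertools.cycle(workers)):
        updates.put(worker(request))
    return updates


def apply_serially(updates, database):
    """Single aggregation point: apply updates one at a time, in queue order."""
    while not updates.empty():
        account, amount = updates.get()
        database[account] = database.get(account, 0) + amount
    return database


requests = [
    {"account": "A", "amount": 10},
    {"account": "B", "amount": 5},
    {"account": "A", "amount": -3},
]
pending = burst_dispatch(requests, [data_center_worker, cloud_worker])
database = apply_serially(pending, {})
print(database)
```

Only `apply_serially` touches the database, so the workers can be replicated freely, but that one function (like the dispatcher) is the element that has to be hosted on high-availability infrastructure.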
Most application architects know that transaction state management through a failover can be extremely complex and is likely to require a special workflow design to accommodate something often called reliable transaction processing or reliable session handling. Whenever an application involves a coordinated series of steps, it requires that the context of the process (the state) be maintained if a failover is needed. Otherwise, the activity will lose context, and that might leave databases (and the user) in an uncertain state.
In classical online transaction processing (OLTP), the management of parallel databases for higher availability is accomplished through synchronized updates and multiphase commits. Sometimes it's not possible to achieve the desired level of availability with a single high-availability component or to do "stateless" load balancing among copies of a component. In those cases, application-level state synchronization such as multiphase commit can provide a way to recover from a cloud failure without loss of transaction or database integrity. However, this increases both the cost and the complexity of the application.
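As a rough illustration of the idea, here is a simplified two-phase commit (the most common multiphase variant) over in-memory participants. The `Participant` class and its methods are hypothetical, and the sketch omits the durable logging and timeout handling that a production coordinator needs.

```python
class Participant:
    """Hypothetical database copy that can stage an update before applying it."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.data = {}
        self.pending = None

    def prepare(self, update):
        """Phase 1: stage the update and vote yes, or vote no."""
        if not self.can_commit:
            return False
        self.pending = dict(update)
        return True

    def commit(self):
        """Phase 2: make the staged update permanent."""
        self.data.update(self.pending)
        self.pending = None

    def rollback(self):
        """Discard the staged update."""
        self.pending = None


def two_phase_commit(participants, update):
    """Apply `update` to every participant, or to none of them."""
    prepared = []
    for participant in participants:
        if participant.prepare(update):
            prepared.append(participant)
        else:
            for staged in prepared:
                staged.rollback()
            return False
    for participant in participants:
        participant.commit()
    return True


healthy = [Participant("data-center"), Participant("cloud")]
ok = two_phase_commit(healthy, {"order-1": "paid"})

degraded = [Participant("data-center"), Participant("cloud", can_commit=False)]
failed = two_phase_commit(degraded, {"order-2": "paid"})
print(ok, failed)
```

The second run shows the property that matters for recovery: when one copy cannot commit, no copy is changed, so the databases never diverge.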
Session activities (including video and voice connections) can sometimes be made more reliable by providing a database logging process that records process state, so that a new component invoked because of a failure can restore state and reduce the impact on the user. These enhancements may be critical for collaborative applications, unified communications, etc., and they are facilitated by the growing trend to integrate session communications with Web applications (via Web Real-Time Communications, for example).
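The logging approach can be sketched with Python's built-in `sqlite3` module. The table layout, function names and session fields are assumptions for illustration, and an in-memory database stands in for what would need to be a highly available shared store.

```python
import json
import sqlite3


def open_log(path=":memory:"):
    """Open the state log; a real deployment would use a shared, durable store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS session_log "
        "(session_id TEXT, step INTEGER, state TEXT)"
    )
    return conn


def record_state(conn, session_id, step, state):
    """Append the session's state after each completed step."""
    conn.execute(
        "INSERT INTO session_log VALUES (?, ?, ?)",
        (session_id, step, json.dumps(state)),
    )
    conn.commit()


def restore_state(conn, session_id):
    """Return the most recently logged state, or None if nothing was logged."""
    row = conn.execute(
        "SELECT state FROM session_log WHERE session_id = ? "
        "ORDER BY step DESC LIMIT 1",
        (session_id,),
    ).fetchone()
    return json.loads(row[0]) if row else None


log = open_log()
record_state(log, "call-42", 1, {"participants": ["ana"], "muted": False})
record_state(log, "call-42", 2, {"participants": ["ana", "raj"], "muted": True})

# A replacement component started after a failure picks up the latest state.
recovered = restore_state(log, "call-42")
print(recovered)
```

The replacement component resumes from the last logged step, so the user loses at most the work done since that step rather than the whole session.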
Many users think that the cloud offers automatic diversity and failure protection, but the fact is that even the limited facilities for availability management that are available from cloud providers may not actually protect you from cloud failures. A little advance planning in application design can go a long way in creating cloud applications that manage availability as well as any IT deployment can manage it, and that's the only approach enterprises will accept in the long run.