As with most magic, once you know how it's done, achieving the highest availability in the cloud seems mundane -- but that doesn't mean it's easy. Getting maximum uptime in the cloud requires new ways of building and managing applications. It requires thinking in terms of services, making them highly distributed, revisiting monolithic data models and embracing new testing models.
In part one, we discussed the old way of delivering highly available applications, which meant mainly relying on expensive and highly redundant infrastructure. But the old way doesn't work with cloud computing. Instead, companies must know the four rules of delivering highly available applications.
The first rule was to think of services, not applications, following the Netflix model of delivering apps. Rules two through four delve even deeper into this evolving model of cloud-native applications.
Rule Two: Distribute, balance and scale
Once you have your services defined and built, the next step is to think about running multiple instances of each across multiple data centers. Each instance should be able to operate independently, serving up requests from any other authorized service.
With multiple instances of each service running, distribute requests among them using load balancing. The load balancer can detect when an instance is no longer available -- for example, because a server or some other part of the system has failed -- and stop routing requests to it. Compared with the old "active-active" model and its all-or-nothing approach, distributing service requests this way requires far less investment in redundant hardware.
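To make the idea concrete, here is a minimal sketch of a round-robin balancer that skips instances marked as unavailable. The class name, method names and instance labels are all hypothetical, and a real load balancer would discover failures through its own health checks rather than being told about them.

```python
class RoundRobinBalancer:
    """Hypothetical balancer: rotates over instances, skipping unhealthy ones."""

    def __init__(self, instances):
        self.instances = list(instances)
        self.healthy = set(self.instances)
        self._next = 0  # index of the next instance to try

    def mark_down(self, instance):
        # In practice a failed health check would trigger this.
        self.healthy.discard(instance)

    def mark_up(self, instance):
        # Only known instances can be put back into rotation.
        if instance in self.instances:
            self.healthy.add(instance)

    def pick(self):
        # Try each instance at most once per call, starting where we left off.
        for _ in range(len(self.instances)):
            candidate = self.instances[self._next]
            self._next = (self._next + 1) % len(self.instances)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy instances available")
```

With three instances and one marked down, requests simply alternate between the remaining two; nothing else in the system has to change.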
Due to the "micro" nature of a service, a single instance failing has a very small blast radius. If one out of 10 instances fails, capacity drops by 10%, and the remaining nine can absorb the load as long as they are running with a little headroom.
Finally, demand should drive the number of deployed instances of any service. More instances are needed at peak hours than off-peak. Given the metered pay-per-use model of cloud, it no longer makes sense to provision more than necessary and run up your bill. Autoscaling ensures that the number of service instances running matches the demand at that time.
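The arithmetic behind autoscaling can be sketched in a few lines. This is an illustration only: the function name, the per-instance capacity figure and the minimum and maximum bounds are assumptions, and real autoscalers typically work from metrics such as CPU or request rate with cooldown periods added.

```python
import math


def desired_instances(requests_per_sec, capacity_per_instance,
                      min_instances=2, max_instances=50):
    """Return how many instances to run for the current demand.

    All parameters are illustrative assumptions, not values from any
    particular cloud provider.
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    # Round up so that demand never exceeds provisioned capacity.
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    # Keep a floor for redundancy and a ceiling to cap the bill.
    return max(min_instances, min(max_instances, needed))
```

At 900 requests per second and 100 requests per second per instance, this asks for nine instances; off-peak demand falls back to the floor of two rather than to zero, so a single failure never takes the service down.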
Rule Three: Don't let the database PWN you
When you decompose an application into services, it's critical to decompose the data. A single large database with hundreds of tables, indexes and complex queries can be fatal to a distributed cloud environment. Your big, highly denormalized and finely tuned Oracle database could be an albatross that inhibits distributed services -- and thereby availability.
Many data models can be divided into multiple independent databases that are replicated and distributed throughout the cloud. There is a natural relationship among customers, addresses, orders, payments and other data. However, most systems don't need to maintain all of this in a single transaction-consistent instance.
Revisiting the Netflix model, it isn't critical for the "Recently Watched" list to be 100% consistent across the many copies that maintain and serve it. Within a minute or two, it will catch up -- but nobody is going to stop using Netflix if it's not perfect all the time. Many copies of this data that are more or less in sync at any given moment are good enough.
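A toy model shows what "more or less in sync" can look like. In this sketch, which is an assumption for illustration and not a description of how Netflix stores its data, each replica keeps a version number per key, and a periodic sync pass copies the newest version everywhere (a simple last-write-wins scheme). The `Replica` class, the `sync` function and the key name are hypothetical.

```python
class Replica:
    """Hypothetical data copy holding key -> (version, value)."""

    def __init__(self, name):
        self.name = name
        self.data = {}

    def write(self, key, value, version):
        # Last write wins: accept only versions newer than what we hold.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return entry[1] if entry else None


def sync(replicas):
    """One sync pass: every replica receives the newest version of each key."""
    merged = {}
    for replica in replicas:
        for key, (version, value) in replica.data.items():
            if key not in merged or version > merged[key][0]:
                merged[key] = (version, value)
    for replica in replicas:
        for key, (version, value) in merged.items():
            replica.write(key, value, version)


east, west = Replica("east"), Replica("west")
west.write("recently_watched", ["Show B", "Show A"], version=2)
stale = east.read("recently_watched")   # east has not caught up yet
sync([east, west])
fresh = east.read("recently_watched")   # now matches west
```

Between the write and the sync, a reader hitting the east copy sees an out-of-date list. For this kind of data, that brief window is the price paid for keeping every copy independently available.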
Rule Four: Embrace chaos
Now that you've gone all in on services, distributed them to the four corners of the globe, load-balanced and autoscaled them, as well as tamed your database monster, you're left with a complex system. Instead of inline code dependencies on other code, you have services that are dependent on each other to be reliable and predictable in their behavior.
What happens if a service is available, but returning junk? What happens if some odd occurrence causes a service to go crazy and flood others with requests that clog up the system? What happens if the load balancer fails?
Well, you should go find out. The only way to really do that is to cause the chaos and watch what happens. Introduce randomness generated by a "chaos service," such as Netflix's Simian Army, and randomly kill services, servers or even regions to see what happens. But do it in your QA environment first, of course. When your system surprises you -- and it will -- learn, troubleshoot and fix the underlying cause. Over a short period of time, your system will become solid and will improve continuously from then on.
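The smallest possible version of such an experiment can be simulated before touching any real environment. The sketch below is not the Simian Army or its API; the function names, the instance labels and the capacity numbers are all made up for illustration. It randomly "kills" instances from a pool and checks whether the survivors can still carry the load.

```python
import random


def run_chaos_round(instances, kill_count, rng):
    """Pick `kill_count` random victims; return (victims, survivors)."""
    kill_count = min(kill_count, len(instances))
    victims = rng.sample(sorted(instances), kill_count)
    survivors = [i for i in instances if i not in victims]
    return victims, survivors


def can_serve(survivors, capacity_per_instance, demand):
    """True if the surviving instances have enough total capacity."""
    return len(survivors) * capacity_per_instance >= demand


rng = random.Random(42)  # seeded so a surprising run can be replayed
pool = [f"svc-{n}" for n in range(10)]
victims, survivors = run_chaos_round(pool, kill_count=1, rng=rng)
ok_with_headroom = can_serve(survivors, capacity_per_instance=100, demand=850)
ok_at_full_load = can_serve(survivors, capacity_per_instance=100, demand=950)
```

Losing one of 10 instances is harmless when the pool was running at 85% of capacity, but not when it was running at 95%. That is exactly the kind of surprise the experiment is meant to surface before production does.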