At a recent Gartner conference, a keynoter said, "If you're in the public cloud, you're going to be dealing with failure. Learn to deal with it. Design for failure." Any advice on how to approach failure by design?
Netflix is famous for making "design for failure" a popular catchphrase. Netflix officials have described failure by design as a feature, rather than a bug.
The company started off by trying to catch failing servers or broken code, but eventually realized that trying to prevent every failure isn't the solution; after all, things will always go wrong. What matters instead is designing for failure, so that failures have minimal impact -- or no impact at all.
One great way to approach failure by design: Make sure you have no single points of failure. It's important to constantly test your production systems, ensuring that if one server dies, it's not going to create major issues. There are several ways you can do that. The most important: Make sure that you can automatically detect a problem, and then automatically repair it if possible.
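As a rough illustration of "automatically detect, then automatically repair," here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function `check_and_repair`, the server names, and the two callbacks are hypothetical stand-ins for whatever health probe and provisioning call your own environment provides.

```python
from typing import Callable, List


def check_and_repair(
    servers: List[str],
    is_healthy: Callable[[str], bool],
    launch_replacement: Callable[[str], str],
) -> List[str]:
    """Return the server pool after replacing every server that fails its health check."""
    repaired = []
    for name in servers:
        if is_healthy(name):
            repaired.append(name)
        else:
            # Detection and repair happen in the same pass, with no human involved.
            repaired.append(launch_replacement(name))
    return repaired


# Hypothetical stand-ins for a real health probe and a real provisioning call.
down = {"web-2"}
pool = check_and_repair(
    ["web-1", "web-2", "web-3"],
    is_healthy=lambda name: name not in down,
    launch_replacement=lambda name: name + "-replacement",
)
print(pool)  # ['web-1', 'web-2-replacement', 'web-3']
```

In a real deployment a loop like this would run on a schedule, and the pool would hold more than one server per role so that no single failure takes the service down while the replacement starts.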
Netflix has open-sourced several pieces of software that help with designing for public-cloud failure. These tools work by deliberately making your servers fail. The first was Chaos Monkey, which randomly kills Amazon Web Services Elastic Compute Cloud (better known as EC2) instances to test whether apps will survive such failures. The idea is that every instance you run in AWS should be in an Auto Scaling group, so when one goes down, another automatically replaces it.
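The interplay between random instance kills and automatic replacement can be simulated in a few lines of Python. This is a toy model only, not Chaos Monkey's code and not the AWS API: the class `AutoScalingGroupSim`, the function `chaos_round`, and the instance IDs are all invented for illustration.

```python
import random
from typing import List


class AutoScalingGroupSim:
    """Toy stand-in for an auto-scaling group: keeps `desired` instances running."""

    def __init__(self, desired: int) -> None:
        self.desired = desired
        self._next_id = 0
        self.instances: List[str] = []
        self.reconcile()

    def reconcile(self) -> None:
        # Launch replacements until the group is back at its desired size.
        while len(self.instances) < self.desired:
            self.instances.append(f"i-{self._next_id:04d}")
            self._next_id += 1

    def terminate(self, instance_id: str) -> None:
        self.instances.remove(instance_id)


def chaos_round(group: AutoScalingGroupSim, rng: random.Random) -> str:
    """Kill one randomly chosen instance, as a Chaos Monkey-style test would."""
    victim = rng.choice(group.instances)
    group.terminate(victim)
    return victim


rng = random.Random(42)  # seeded so runs are repeatable
group = AutoScalingGroupSim(desired=3)
killed = []
for _ in range(5):
    killed.append(chaos_round(group, rng))
    group.reconcile()  # the group replaces the lost instance

print("killed:", killed)
print("running:", group.instances)
```

After every round the group is back at three instances, which is the property such a test is meant to confirm: losing any one instance changes which servers are running, not how many.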
Netflix also offers an open-source management and deployment tool called Asgard, a dashboard app that you can use to get an overview of your system architecture as a whole. It's available free to AWS users.
The biggest thing to keep in mind: You can't really test system architecture on a small scale. To really test whether a system can handle the full load it's going to get, you've got to go full scale. Good luck!