'Failure by design': Advice for surviving disaster in the public cloud

In this expert answer, contributor Chris Moyer offers advice on taking a 'failure by design' approach to prevent public-cloud catastrophe.

At a recent Gartner conference, a keynoter said, "If you're in the public cloud, you're going to be dealing with failure. Learn to deal with it. Design for failure." Any advice on how to approach failure by design?

Netflix is famous for making "design for failure" a popular catchphrase. Netflix officials have described failure by design as a feature, rather than a bug.

Netflix started off by trying to catch failing servers or broken code, but eventually realized that the goal isn't to prevent every failure; after all, things will always go wrong. What matters instead is designing for failure so that failures have minimal impact -- or no impact at all.

One great way to approach failure by design: Make sure you have no single points of failure. It's important to constantly test your production systems, ensuring that if one server dies, it's not going to create major issues. There are several ways you can do that. The most important: Make sure that you can automatically detect a problem, and then automatically repair it if possible.
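To make the "detect, then repair" idea concrete, here is a minimal sketch in Python. It is an illustration only: the `Server` class, the `probe` and `restart` functions, and the server names are hypothetical stand-ins, not part of any Netflix or AWS tool. In a real system the probe might be an HTTP health check and the repair might replace the instance entirely.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Server:
    """Hypothetical stand-in for a real host."""
    name: str
    healthy: bool = True
    restarts: int = 0


def check_and_repair(
    servers: List[Server],
    probe: Callable[[Server], bool],
    repair: Callable[[Server], None],
) -> List[str]:
    """Probe every server once and repair the ones that fail.

    Returns the names of the servers that needed repair.
    """
    repaired = []
    for server in servers:
        if not probe(server):
            repair(server)
            repaired.append(server.name)
    return repaired


def probe(server: Server) -> bool:
    # Assumption: a real probe would be a network health check with a timeout.
    return server.healthy


def restart(server: Server) -> None:
    # Assumption: a real repair would restart a process or replace the host.
    server.healthy = True
    server.restarts += 1


if __name__ == "__main__":
    fleet = [Server("web-1"), Server("web-2", healthy=False), Server("web-3")]
    print("repaired:", check_and_repair(fleet, probe, restart))
```

Running a loop like this on a schedule, and alerting only when the repair itself fails, is one way to keep humans out of routine recovery.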

Netflix has open-sourced several pieces of software that help with designing for public-cloud failure. These tools help by making sure that your servers fail regularly, so you find out how your system copes before a real outage does it for you. Netflix started with Chaos Monkey, which randomly kills Amazon Web Services Elastic Compute Cloud (better known as EC2) instances to test whether apps will survive such failures. The idea is that every instance you run in AWS should be in an auto-scaling group, so when one goes down, another automatically replaces it.
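The interplay between random termination and automatic replacement can be sketched with a toy simulation. This is not Chaos Monkey's code and it makes no AWS calls; the `AutoScalingGroup` class, the `unleash_monkey` function and the instance IDs are hypothetical, and the group simply tops itself back up to a fixed desired capacity.

```python
import itertools
import random
from typing import List, Optional


class AutoScalingGroup:
    """Toy model of a group that keeps a fixed number of instances running."""

    def __init__(self, desired_capacity: int) -> None:
        self.desired_capacity = desired_capacity
        self._ids = itertools.count(1)
        self.instances: List[str] = []
        self.reconcile()

    def reconcile(self) -> None:
        # Launch replacements until the group is back at desired capacity.
        while len(self.instances) < self.desired_capacity:
            self.instances.append(f"i-{next(self._ids):04d}")

    def terminate(self, instance_id: str) -> None:
        self.instances.remove(instance_id)


def unleash_monkey(group: AutoScalingGroup, rng: random.Random) -> Optional[str]:
    """Terminate one randomly chosen instance and return its ID."""
    if not group.instances:
        return None
    victim = rng.choice(group.instances)
    group.terminate(victim)
    return victim


if __name__ == "__main__":
    rng = random.Random(42)  # seeded so repeated runs pick the same victims
    group = AutoScalingGroup(desired_capacity=3)
    for _ in range(5):
        victim = unleash_monkey(group, rng)
        group.reconcile()
        print(f"killed {victim}, group now {group.instances}")
```

The point of the exercise is the invariant: after every kill and reconcile, the group is back at its desired capacity. An instance that lives outside such a group has nothing to replace it.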

Netflix also offers an open-source management and deployment tool called Asgard, a dashboard app that you can use to get an overview of your system architecture as a whole. It's available free to AWS users.

The biggest thing to keep in mind: You can't really test system architecture on a small scale. To really test whether a system can handle the full load it's going to get, you've got to go full scale. Good luck!
