Netflix Chaos Monkey tool protects against cloud failure, outages

Netflix releases the first of an 'army' of test tools meant to proactively break your cloud before an outage does.

Netflix's "Chaos Monkey," a cloud-testing technology, ensures that the company's service remains up and running on Amazon Web Services -- even during outages that affect parts of the public cloud infrastructure that its service runs on.

The company recently released the Chaos Monkey code to the open source community under the Apache 2.0 license via GitHub, a move that's being hailed by cloud experts and other observers -- and one that portends the eventual arrival of a veritable army of testing tools based on the same concept.

"I can only imagine that Chaos Monkey becoming available for a more general public will result in nothing but better results for AWS customers," said Kyle Hilgendorf, principal research analyst at Gartner Inc. in Stamford, Conn.

The Netflix Chaos Monkey tool allows cloud customers to proactively launch attack code against their own infrastructure to cause failures. Those deliberate failures let engineers find and fix weaknesses before the same problems occur on their own.

Some cloud service providers said they have already encountered cases of their customers using Chaos Monkey to help bulletproof their AWS-based infrastructure and test code for hardening their apps.

That may be because of the similarities between streaming massive amounts of data interactively in the cloud -- as Netflix does -- and handling other workloads that are appropriate to the same model, such as social computing and collaboration in a corporate cloud environment.

"The short form is that tools like this help folks design more robust infrastructure in the public cloud," said Matthew Gerber, CEO of IT-Lifeline Inc., a disaster recovery provider based in Liberty Lake, Wash., whose service is built on AWS.

Planet of the Apps

There is also value in Chaos Monkey for cloud application development.

"This will be a great tool for developers to use to aid in the construction of resilient Web applications [because] when a website is down, a company is losing money or trust," said Colin Dean, president of the Pittsburgh LAN Coalition, a group that organizes video gaming events and uses AWS.

In fact, Netflix says it has an entire "Simian Army" of testing applications in the works, including a "Latency Monkey" that introduces delays to simulate service degradation, a "Conformity Monkey" that shuts down instances that don't conform to best practices, a "Janitor Monkey" that cleans up by releasing unused resources, a "Chaos Gorilla" that simulates outages of entire Amazon Availability Zones, and others.

Netflix said that since no single component can guarantee 100% uptime, its goal was to design a cloud architecture where the individual components could fail without making the entire system unavailable.
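The reasoning behind that goal can be made concrete with a little arithmetic. The sketch below is illustrative only and is not taken from Netflix: it assumes each instance fails independently and with the same probability, which real cloud outages often violate, and the function name and the 99% figure are hypothetical.

```python
def service_availability(instance_availability: float, replicas: int) -> float:
    """Probability that at least one of `replicas` independent instances is up."""
    if not 0.0 <= instance_availability <= 1.0:
        raise ValueError("instance_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    # The service is down only if every replica is down at the same time.
    probability_all_down = (1.0 - instance_availability) ** replicas
    return 1.0 - probability_all_down


if __name__ == "__main__":
    # Hypothetical figure: each instance is up 99% of the time.
    for replicas in (1, 2, 3):
        print(replicas, f"{service_availability(0.99, replicas):.6f}")
```

Under those assumptions, a second redundant instance lifts availability from 99% to 99.99%. The gain only materializes if the system really does keep working when one instance disappears, which is the property a tool like Chaos Monkey is meant to test.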

Chaos Monkey has terminated some 65,000 instances running in production and test environments over the past year, according to a Netflix blog post.

Additionally, Netflix said it is queuing other monkeys for release, likely beginning with Janitor Monkey.

"Any kind of toolset that helps customers build more reliable infrastructure is a good thing," Gerber said.

How Chaos Monkey works

The testing methodology is based on the idea of creating havoc deliberately: the tool causes arbitrary outages during business hours, while IT staff are available to handle them. It proactively breaks running cloud components to expose weak spots so they can be fixed in advance of an actual outage.

Hilgendorf said that Chaos Monkey does not attempt to maliciously affect AWS infrastructure. It simply stops and kills, at random, customers' own instances launched from Amazon Machine Images, or their storage volumes, so that customers can be assured that their design can handle random failures.
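The approach described above, random termination restricted to staffed hours, can be sketched in a few lines. This is a minimal in-memory simulation, not Netflix's code and not a call into any AWS API: the group names, instance IDs and the weekday 09:00 to 15:00 window are all assumptions made for illustration, and removing an ID from a list stands in for a real terminate request.

```python
import random
from datetime import datetime, time
from typing import Dict, List, Optional


def within_business_hours(now: datetime) -> bool:
    """True on weekdays between 09:00 and 15:00 (an assumed window)."""
    return now.weekday() < 5 and time(9, 0) <= now.time() < time(15, 0)


def run_chaos(
    groups: Dict[str, List[str]],
    now: datetime,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Remove one random instance from each non-empty group.

    Does nothing outside business hours. Returns the IDs that were removed.
    """
    if rng is None:
        rng = random.Random()
    if not within_business_hours(now):
        return []
    terminated = []
    for instance_ids in groups.values():
        if not instance_ids:
            continue
        victim = rng.choice(instance_ids)
        instance_ids.remove(victim)  # stand-in for a real terminate request
        terminated.append(victim)
    return terminated


if __name__ == "__main__":
    # Hypothetical fleet: two groups of interchangeable instances.
    fleet = {"web": ["i-web-1", "i-web-2", "i-web-3"], "api": ["i-api-1", "i-api-2"]}
    killed = run_chaos(fleet, datetime(2012, 8, 6, 10, 0), random.Random(42))
    print("terminated:", killed)
    print("remaining:", fleet)
```

Gating on the clock is the important design choice here: a failure injected at 10 a.m. on a Monday reaches engineers who are at their desks, whereas the same failure at 3 a.m. on a Saturday would not.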

How do cloud providers feel about customers destructively testing their cloud apps? Neither Amazon nor Microsoft responded to requests for comment about the impact on their respective cloud platforms.

Netflix said Chaos Monkey could be made to run on other cloud platforms.
