Chaos Engineering

by:

Ali Basiri, Niosha Behnam, Ruud de Rooij, Lorin Hochstein, Luke Kosewski, Justin Reynolds, and Casey Rosenthal,

Netflix

This is an excerpt of the article published in the July 2016 edition of Computing Now at https://www.computer.org/cms/Computer.org/ComputingNow/issues/2016/07/mso2016030035.pdf

Modern software-based services are implemented as distributed systems with complex behavior and failure modes. Chaos engineering uses experimentation to ensure system availability. Netflix engineers have developed principles of chaos engineering that describe how to design and run experiments.

THIRTY YEARS AGO, Jim Gray noted that “A way to improve availability is to install proven hardware and software, and then leave it alone.”1 For companies that provide services over the Internet, “leaving it alone” isn’t an option. Such service providers must continually make changes to increase the service’s value, such as adding features and improving performance. At Netflix, engineers push new code into production and modify runtime configuration parameters hundreds of times a day. (For a look at Netflix and its system architecture, see the sidebar.) Availability is still important; a customer who can’t watch a video because of a service outage might not be a customer for long.

But to achieve high availability, we need to apply a different approach than what Gray advocated. For years, Netflix has been running Chaos Monkey, an internal service that randomly selects virtual-machine instances that host our production services and terminates them.2 Chaos Monkey aims to encourage Netflix engineers to design software services that can withstand failures of individual instances. It’s active only during normal working hours so that engineers can respond quickly if a service fails owing to an instance termination.
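The selection logic described above can be illustrated with a minimal Python sketch. This is not Netflix's implementation: the function names, the 9:00 to 17:00 weekday window, and the terminate callback are all assumptions made for illustration, and the sketch leaves out scheduling, opt-outs, and any real cloud API.

```python
import random
from datetime import datetime, time

# Hypothetical window; the article says only "normal working hours".
WORK_START = time(9, 0)
WORK_END = time(17, 0)


def within_working_hours(now: datetime) -> bool:
    """Return True on a weekday between WORK_START and WORK_END."""
    return now.weekday() < 5 and WORK_START <= now.time() < WORK_END


def pick_victim(instances, now, rng=random):
    """Pick one instance at random; return None if idle hours or no instances."""
    if not instances or not within_working_hours(now):
        return None
    return rng.choice(list(instances))


def run_once(instances, terminate, now, rng=random):
    """Terminate one randomly chosen instance through the supplied callback.

    `terminate` is a stand-in for whatever actually shuts an instance down.
    """
    victim = pick_victim(instances, now, rng)
    if victim is not None:
        terminate(victim)
    return victim
```

Calling run_once(["i-a", "i-b"], terminated.append, datetime.now()) with a plain list named terminated records the chosen instance ID instead of terminating anything, which keeps the sketch safe to run. Outside the assumed window the call returns None and does nothing.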

Chaos Monkey has proven successful; today all Netflix engineers design their services to handle instance failures as a matter of course.

That success encouraged us to extend the approach of injecting failures into the production system to improve reliability. For example, we perform Chaos Kong exercises that simulate the failure of an entire Amazon EC2 (Elastic Compute Cloud) region. We also run Failure Injection Testing (FIT) exercises in which we cause requests between Netflix services to fail and verify that the system degrades gracefully.3 Over time, we realized that these activities share underlying themes that are subtler than simply “break things in production.” We also noticed that organizations such as Amazon,4 Google,4 Microsoft,5 and Facebook6 were applying similar techniques to test their systems’ resilience. We believe that these activities form part of a discipline that’s emerging in our industry; we call this discipline chaos engineering. Specifically, chaos engineering involves experimenting on a distributed system to build confidence in its capability to withstand turbulent conditions in production. These conditions could be anything from a hardware failure, to an unexpected surge in client requests, to a malformed value in a runtime configuration parameter. Our experience has led us to determine principles of chaos engineering (for an overview, see http://principlesofchaos.org), which we elaborate on here.
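The idea behind a failure-injection exercise, making a request between services fail and checking that the caller degrades gracefully, can be sketched in a few lines of Python. This is an illustrative toy and not the FIT platform: the recommendations scenario, the fallback titles, and every name below are hypothetical.

```python
class InjectedFailure(Exception):
    """Raised in place of the real call when a failure is injected."""


# Hypothetical generic response served when the dependency is unavailable.
FALLBACK_TITLES = ["popular-title-1", "popular-title-2"]


def call_with_injection(call, inject_failure):
    """Run `call`, or raise InjectedFailure instead if injection is enabled."""
    if inject_failure:
        raise InjectedFailure("simulated dependency failure")
    return call()


def get_recommendations(fetch_personalized, inject_failure=False):
    """Return personalized titles, or the generic fallback if the call fails."""
    try:
        return call_with_injection(fetch_personalized, inject_failure)
    except Exception:
        # Graceful degradation: serve a generic list rather than an error.
        return FALLBACK_TITLES
```

An experiment in this style calls get_recommendations with inject_failure=True and checks that the result is the fallback list rather than an unhandled exception. The fallback path is the behavior under test, so an exception escaping to the caller would count as a failed experiment.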