Chaos Monkey

Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure.[1] It works by intentionally disabling computers in Netflix's production network to test how remaining systems respond to the outage. Chaos Monkey is now part of a larger suite of tools called the Simian Army designed to simulate and test responses to various system failures and edge cases.

Concept

In software development, a system's ability to tolerate failures, remain resilient, and deliver an adequate quality of service is often specified as a requirement. However, development teams often fail to meet these requirements because of short deadlines or a lack of domain knowledge.

While overseeing Netflix's migration to the cloud in 2011,[1][2] Greg Orzell had the idea to change this paradigm by setting up a tool that would cause breakdowns in their production environment, the environment used by Netflix customers. This proposal changed their assumptions from a model where there would be no breakdowns to a model where breakdowns were certain, ensuring that built-in resilience was an obligation rather than an option:

"At Netflix, our culture of freedom and responsibility led us not to force engineers to design their code in a specific way. Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme. We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services."[3]

Regularly "killing" random instances made it possible to test a redundant architecture and verify that a server failure did not noticeably affect end users. The name Chaos Monkey is explained in the book "Chaos Monkeys" by Antonio Garcia Martinez:[4]

"Imagine a monkey entering a 'data center', one of those 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips out cables, destroys devices, and overturns everything within reach. The challenge for IT managers is to design the information system they are responsible for so that it keeps working despite these monkeys, since no one ever knows when they will arrive or what they will destroy."

The code behind Chaos Monkey was released by Netflix in 2012 under an Apache 2.0 license.[5][6]

Different variants of the Simian Army

The Simian Army[6][7] is a suite of tools developed by Netflix to test the reliability, security, or resiliency of its Amazon Web Services infrastructure and includes the following tools:[8]

Chaos Monkey

The first tool developed by Netflix. It randomly selects instances in the production environment and deliberately takes them out of service.

Chaos Gorilla

At the very top of the Simian Army hierarchy, Chaos Gorilla takes down an entire Amazon Availability Zone.[9]

Latency Monkey

Introduces communication delays to simulate degradation or outages in a network.

Doctor Monkey

Performs health checks, monitoring performance metrics like CPU load to detect unhealthy instances for root-cause analysis and eventual fixing or retirement.

Janitor Monkey

Identifies and disposes of unused resources to avoid waste and clutter.

Conformity Monkey

Determines whether an instance is nonconforming by applying a set of rules to it. If any rule finds that the instance does not conform, the monkey sends an email notification to the owner of the instance.

Security Monkey

Derived from Conformity Monkey, this tool searches for and disables instances that have known vulnerabilities or improper configurations.[10]

10-18 Monkey

Detects localization and internationalization (l10n-i18n) problems in software serving customers across different geographic regions.
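As a rough illustration of the random-termination idea behind Chaos Monkey, the following Python sketch simulates a fleet in memory and disables at most one running instance per group. It is not Netflix's implementation: the `Instance` class, the `pick_victims` and `unleash` functions, the group names, and the `probability` parameter are all hypothetical, and "termination" here only flips a flag instead of calling a cloud provider API.

```python
import random
from dataclasses import dataclass


@dataclass
class Instance:
    """A simulated server; all fields are illustrative assumptions."""
    instance_id: str
    group: str
    running: bool = True


def pick_victims(instances, probability, rng):
    """Pick at most one running instance per group.

    Each group is selected with the given probability, then one of its
    running instances is chosen uniformly at random.
    """
    by_group = {}
    for inst in instances:
        if inst.running:
            by_group.setdefault(inst.group, []).append(inst)

    victims = []
    for group in sorted(by_group):  # sorted for reproducible runs
        if rng.random() < probability:
            victims.append(rng.choice(by_group[group]))
    return victims


def unleash(instances, probability=1.0, rng=None):
    """Disable the chosen instances and return them."""
    rng = rng or random.Random()
    victims = pick_victims(instances, probability, rng)
    for inst in victims:
        # Stand-in for a real "terminate instance" call to a cloud API.
        inst.running = False
    return victims


# Two hypothetical groups with three instances each.
fleet = [Instance(f"{g}-{n}", g) for g in ("api", "web") for n in range(3)]
killed = unleash(fleet, probability=1.0, rng=random.Random(42))
survivors = [inst.instance_id for inst in fleet if inst.running]
```

In this sketch each group loses one instance and keeps two, which is the situation a redundant architecture is expected to absorb without affecting end users.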

Chaos Monkey and DevOps

The DevOps movement pays special attention to the safe operation of computer systems, so that teams can maintain a sufficient level of confidence despite frequent releases. As part of the DevOps toolchain, Chaos Monkey addresses the need for continuous testing.

Chaos Monkey and the other Simian Army tools follow the "design for failure" pattern:[11] a computer application must be able to withstand the failure of any underlying software or hardware component.
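One common way to apply this pattern is client-side failover: when a component fails, the caller moves on to a redundant one. The Python sketch below is only an assumption-laden illustration, not code from Netflix or from the cited article. The names `call_with_failover`, `AllReplicasFailed`, `healthy`, and `broken` are hypothetical, and replicas are modeled as plain functions.

```python
class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def call_with_failover(replicas, request):
    """Try each replica in order and return the first successful response."""
    errors = []
    for replica in replicas:
        try:
            return replica(request)
        except ConnectionError as exc:
            # A failed replica is expected, not exceptional: record and move on.
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed")


def healthy(request):
    """A replica that answers normally."""
    return f"ok:{request}"


def broken(request):
    """A replica whose instance has been taken out of service."""
    raise ConnectionError("instance terminated")


# The first replica is down, yet the caller still gets an answer.
result = call_with_failover([broken, healthy], "GET /titles")
```

The request succeeds as long as at least one replica is up. Only when every replica fails does the caller see an error, which is the behavior a tool like Chaos Monkey is meant to exercise.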

Related projects

Chaos Engineering

Chaos Engineering is the discipline of experimenting on a distributed system to build confidence in the system's ability to withstand turbulent production conditions.[12] A community has formed around the principles published at http://principlesofchaos.org/, an initiative started by Netflix.[13]
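A chaos experiment is commonly structured as: check that the system is in its normal "steady state", inject a failure, check the steady state again, and then undo the failure. The Python sketch below shows that loop on a toy in-memory system. The `run_experiment` function, the `servers` dictionary, and the assumption that the service needs two healthy servers are hypothetical choices made for illustration only.

```python
def run_experiment(steady_state, inject_failure, rollback):
    """Run one chaos experiment and report the outcome as a string."""
    if not steady_state():
        # Do not add failures to a system that is already unhealthy.
        return "aborted"
    inject_failure()
    try:
        return "passed" if steady_state() else "failed"
    finally:
        # Always restore the system, whatever the outcome.
        rollback()


# Toy system: three servers, each marked healthy (True) or not.
servers = {"a": True, "b": True, "c": True}


def steady_state():
    """Assumed hypothesis: the service works with at least two healthy servers."""
    return sum(servers.values()) >= 2


def kill_one():
    servers["a"] = False


def restore():
    servers["a"] = True


outcome = run_experiment(steady_state, kill_one, restore)
```

A "failed" outcome would indicate that the system does not tolerate the injected failure and needs more redundancy or automation before the experiment is repeated.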

Facebook Storm

To prepare for the loss of a data center, Facebook regularly tests the resilience of its infrastructure to extreme events. Known as Project Storm, the program simulates massive data center failures.[14]

Days of Chaos

Inspired by AWS GameDays,[15] volunteer application teams at Voyages-sncf.com took part in a Day of Chaos to test the resilience of their applications. Every 30 minutes, operators simulated failures in pre-production. Teams earned points based on detections, diagnoses, and resolutions. This type of gamified event helps introduce development teams to the concept of resilience.[16]

The concept was presented at the 2017 DevOps REX conference[17] and is described on the site http://days-of-chaos.com, which aims to collect other such experiments.

Chaos Toolkit

The Chaos Toolkit was born from the desire to simplify access to the discipline of Chaos Engineering and to demonstrate that the experimentation approach can be applied at different levels: infrastructure, platform, and application. The Chaos Toolkit is an open-source tool, licensed under Apache 2, released in October 2017.[18]

Notes and references

  1. "The Netflix Simian Army". Medium. 2011-07-19. https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116. 
  2. Orzell, Gregory S. & Yury Izrailevsky, "Validating the resiliency of networked applications", A1 patent US20120072571 A1
  3. "Netflix Chaos Monkey Upgraded". Medium. 2016-10-19. https://medium.com/netflix-techblog/netflix-chaos-monkey-upgraded-1d679429be5d. 
  4. "Mais qui sont ces singes du chaos ?" (in fr-FR). 2017-07-25. https://15marches.fr/business/singes-du-chaos. 
  5. "Netflix libère Chaos Monkey dans la jungle Open Source - Le Monde Informatique" (in fr). LeMondeInformatique. http://www.lemondeinformatique.fr/actualites/lire-netflix-libere-chaos-monkey-dans-la-jungle-open-source-49940.html. 
  6. "SimianArmy: Tools for keeping your cloud operating in top form. Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures". Netflix, Inc. 2017-10-20. https://github.com/Netflix/SimianArmy. Retrieved 2017-10-21. 
  7. SimianArmy: Tools for keeping your cloud operating in top form. Chaos Monkey is a resiliency tool that helps applications tolerate random instance failures, Netflix, Inc., 2017-11-07, https://github.com/Netflix/SimianArmy, retrieved 2017-11-07 
  8. SemiColonWeb (2015-12-08). "Infrastructure : quelles méthodes pour s'adapter aux nouvelles architectures Cloud ? - D2SI Blog" (in fr-FR). D2SI Blog. http://blog.d2-si.fr/2015/12/08/architecture-cloud/. 
  9. (in en) The Netflix Simian Army, https://medium.com/netflix-techblog/the-netflix-simian-army-16e57fbab116, retrieved 2017-12-12 
  10. https://github.com/Netflix/security_monkey
  11. "The Great Patterns of the Web - Design for failure | OCTO Talks!" (in en-US). https://blog.octo.com/design-for-failure/. Retrieved 2017-10-22. 
  12. "Principles of Chaos Engineering". http://principlesofchaos.org/. Retrieved 2017-10-21. 
  13. "Chaos Engineering by Netflix". https://www.infoq.com/en/news/2014/10/netflix-chaos-engineering. Retrieved 2017-10-22. 
  14. Hof, Robert (2016-09-11), "Interview: How Facebook's Project Storm Heads Off Data Center Disasters" (in en), Forbes, https://www.forbes.com/sites/roberthof/2016/09/11/interview-how-facebooks-project-storm-heads-off-data-center-disasters/#3bc1f4f64875, retrieved 2017-10-21 
  15. SemiColonWeb (2016-07-04). "GameDay AWS: test the resilience of your applications Cloud" (in en-US). http://blog.d2-si.fr/2016/07/04/aws-gameday/. Retrieved 2017-10-21. 
  16. "DevOps: feedback from Voyages-sncf.com - Blog du Moderator" (in fr-FR), Moderator's Blog, 2017-03-17, https://www.blogdumoderateur.com/devops-retour-experience-voyages-sncf-com/, retrieved 2017-10-21 
  17. "Days of Chaos: the development of the devops culture at Voyages-Sn ..." (in en). 2017-10-03. https://en.slideshare.net/devopsrex/days-of-chaos-the-development-of-culture-devops-your-voyagessncfcom-laid-. 
  18. Miles, Russ (2017-10-06). "Introducing and Extending the Chaos Toolkit". https://medium.com/russmiles/introducing-and-extending-the-chaos-toolkit-ddfa142acc2b. Retrieved 2017-10-23.