Chaos Engineering – withstanding turbulent conditions in production

4.7.2018 | 12 minutes reading time

When you were a child, did you deliberately break or dismantle things in order to understand how they work? We all did – although some people have a greater urge to destroy than others. Today we call it Chaos Engineering.

As developers, one of our primary goals is to build stable, secure and bug-free software that neither deprives us of sleep nor keeps us from new and exciting topics. To accomplish these and other goals, we write unit and integration tests, which alert us to unexpected behavior and ensure that the scenarios we test do not lead to errors. Today’s architectures, however, contain many components that can’t all be fully covered by unit and integration tests. Servers and components whose existence we’re not even aware of still manage to drag our entire system into the abyss.

You don’t choose the moment, the moment chooses you!
You only choose how prepared you are when it does.
Fire Chief Mike Burtch

Introduction

In recent years, Netflix has been one of the drivers behind Chaos Engineering and has contributed significantly to its growing importance in distributed systems. Kyle Kingsbury, a distributed systems researcher, takes a slightly different approach: he verifies the promises made by vendors of distributed databases, queues and other distributed systems. With his tool Jepsen, he probes the behavior of these systems and occasionally comes to frightening conclusions. You can find a very impressive talk on this topic on YouTube.

In this article I hope to give you a simple and descriptive introduction to the world of Chaos Engineering. As a neat side-effect, Chaos Engineering will allow you to personally meet all of your colleagues within a short time – whether you want to or not! (But only if you do it wrong.)

Do not underestimate the social aspect of Chaos Engineering; it is not just about destroying something, but also about bringing the right people together and jointly pursuing the goal of creating stable and fault-tolerant software.

Basics

When we develop new software or extend existing software, we harden our implementation through various forms of tests. We are often pointed to the test pyramid, which illustrates what kinds of tests we should write and to what extent.

Test Pyramid

The test pyramid illustrates a dilemma: the higher a test scenario sits in the pyramid, the more effort, time and cost it requires.

Unit tests

When creating unit tests, we write test cases to check the expected behavior. The component under test is isolated from all its dependencies, and we keep their behavior under control with the help of mocks. Tests of this kind cannot guarantee that the component is free of errors. If the developer of the module made a logic error when implementing the component, the same error will also find its way into the tests, regardless of whether the tests or the code were written first. One way to mitigate this is Extreme Programming, in which the developers continuously take turns writing the tests and implementing the functionality.
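To make the idea concrete, here is a minimal sketch of such a unit test, written in Python for brevity. The `ProductService` class and its warehouse client are hypothetical names chosen for illustration; they do not come from a real code base.

```python
import unittest
from unittest.mock import Mock


class ProductService:
    """Hypothetical component under test (illustration only)."""

    def __init__(self, warehouse_client):
        self.warehouse_client = warehouse_client

    def is_available(self, product_id):
        # The only dependency is the warehouse client, which tests replace with a mock.
        return self.warehouse_client.stock(product_id) > 0


class ProductServiceTest(unittest.TestCase):
    def test_available_when_stock_is_positive(self):
        warehouse = Mock()
        warehouse.stock.return_value = 3  # the mock's behavior is fully under our control
        service = ProductService(warehouse)

        self.assertTrue(service.is_available("sku-1"))
        warehouse.stock.assert_called_once_with("sku-1")

    def test_unavailable_when_stock_is_zero(self):
        warehouse = Mock()
        warehouse.stock.return_value = 0
        self.assertFalse(ProductService(warehouse).is_available("sku-1"))
```

Note that the mock only encodes what the developer assumes the warehouse will return. If that assumption is wrong, both tests still pass, which is exactly the limitation described above.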

Unit Test

Integration tests

In order to allow the developers and stakeholders to spend more free time and relaxed weekends with family and friends, we write integration tests after the unit tests. These test the interaction of individual, interdependent components. Ideally, integration tests run automatically once the unit tests have passed.

Integration Test

Thanks to high test coverage and automation, we achieve a very stable state of our application. But who doesn’t know that unpleasant feeling on the way to the most beautiful place in the world? I mean production, where our software has to show how good it really is. Only under real conditions can we see how all the individual components of the overall architecture behave. Modern microservice architectures have only reinforced this unpleasant feeling.

Erosion of the software architecture

In the age of loosely coupled microservices, we arrive at software architectures that can be summarized under the umbrella term ‘distributed systems’. It’s easy to understand individual systems – and they can be deployed and scaled quickly – but in most cases this leads to an architecture like the following one:

Architecture Erosion

I love architecture diagrams, since they give us a clear and abstract view of the software being developed. However, they are also able to hide all the evil pitfalls and mistakes. They are especially good at obscuring the underlying layers and hardware. In real-life production environments, the following architectural diagram is closer to the status quo:

Architecture Erosion real

The load balancer doesn’t know all instances of the gateway or cannot reach them in the network due to a firewall rule. Several applications are dead, but service discovery doesn’t notice the failure. Additionally, service discovery can’t synchronize and delivers varying results. The load cannot be distributed due to missing instances, and this leads to an increased load on the individual nodes. And anyway – why does the twelve-hour batch have to run during the day and why does it need twelve hours?!

I’m sure you’ve had similar experiences. You probably know what it means to deal with defective hardware, faulty virtualization, incorrectly configured firewalls or tedious coordination within your company.

At times, statements like the following come up during discussion: “There is no chaos here, everything runs its usual course.” It may be hard to believe, but an entire industry lives off selling us ticket systems so we can control and document the chaos that exists. The following quote from a movie that my children enjoy describes our everyday life well:

Chaos is the engine that drives the world

API heaven vs backend hell

APIs suggest an intact world where we get exactly what we need via simple API calls, with well-defined inputs and outputs. We use myriad APIs to protect us from direct interaction with hell (backend/API implementation). Countless layers of abstractions are implemented in which Hades, god of the underworld, is doing his mischief. He makes sure that not a single API call ever comes back unscathed. Okay, I exaggerate, but you know what I’m getting at.

Netflix shows what this can lead to, so let’s take a quick look at their architecture to better understand the potential complexity of modern microservice architectures. The following picture is from 2013:

Netflix Architecture 2013

Reprinted from A. Tseitlin, “Resiliency through failure”, QCon NY, 2013

This makes it all the more impressive how well Netflix’s architecture works and how it reacts to all kinds of errors. If you watch a talk by a Netflix developer, you will come across the statement “nobody knows how and why this works”. This insight brought Chaos Engineering to life at Netflix.

Don’t try this at home

Before you start your first chaos experiments, make sure that your services already apply resilience patterns and can deal with the errors that may occur.

Chaos Engineering doesn’t cause problems. It reveals them.
Nora Jones
Senior Chaos Engineer at Netflix

As Nora Jones from Netflix rightfully points out, Chaos Engineering is not about creating chaos, but about preventing it. So, if you want to begin your chaos experiments, start small and ask yourself the following questions in advance:

Decision Helper – inspired by Russ Miles

Once again, it makes no sense to commence engineering chaos if your infrastructure – and especially your services – are not prepared for it. Heed this very important instruction, and we’ll now begin our journey into Chaos Engineering.

Rules of Chaos Engineering

Talk to your colleagues about the planned chaos experiments in advance!
If you know your chaos experiment will fail, don’t do it!
Chaos shouldn’t come as a surprise; your aim is to prove the hypothesis.
Chaos Engineering helps you understand your distributed systems better.
Limit the blast radius of your chaos experiments.
Always be in control of the situation during the chaos experiment!

Principles of Chaos Engineering

In Chaos Engineering, you should go through the five phases described below and keep control of your experiment at all times. Start small, and keep the potential blast radius of your experiments small as well. Simply pulling a plug somewhere and seeing what happens has absolutely nothing to do with Chaos Engineering! We don’t cause uncontrolled chaos; we actively fight to prevent it.
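One simple way to keep the blast radius small is to attack only a bounded share of the available instances. The following Python sketch illustrates that idea; `pick_targets`, `max_fraction` and the instance names are hypothetical and not part of any chaos tool mentioned in this article.

```python
import math
import random


def pick_targets(instances, max_fraction=0.25, rng=None):
    """Pick a random subset of instances to attack.

    At most `max_fraction` of the instances are selected, but always at
    least one so that the experiment can run at all.
    """
    if not 0 < max_fraction <= 1:
        raise ValueError("max_fraction must be in the range (0, 1]")
    if not instances:
        return []
    rng = rng or random.Random()
    count = max(1, math.floor(len(instances) * max_fraction))
    return rng.sample(list(instances), count)
```

With eight gateway instances and the default of 25%, two instances are attacked and six are left alone. Passing a seeded `random.Random` makes the selection reproducible, which helps when an experiment has to be repeated.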

I warmly recommend the site PrinciplesOfChaos.org and the free ebook “Chaos Engineering” by Casey Rosenthal, Lorin Hochstein, Aaron Blohowiak, Nora Jones and Ali Basiri.

Phases of Chaos Engineering

Steady state

It is essential to define metrics which give you a reliable statement about the overall state of your system. These metrics must be continuously monitored during the chaos experiments. As a nice side effect, you can also monitor these metrics outside of your experiments.

Metrics can be either technical or business metrics – I’d say that business metrics outweigh technical metrics. Netflix monitors the number of successful clicks to start a video during a chaos experiment; this is their core metric and it comes from the business domain. Customers not being able to start videos has a direct effect on customer satisfaction. If you run an online shop, for example, the number of successful orders or the number of articles placed in the shopping basket would be important business metrics.
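A steady-state check on a business metric can be sketched in a few lines of Python. The function names, the tolerance of two percentage points and all numbers in the usage note are assumptions made for illustration; sensible values depend entirely on your own system.

```python
def success_rate(successes, attempts):
    """Share of successful business events, e.g. completed orders per checkout attempt."""
    if attempts <= 0:
        raise ValueError("no attempts observed; the steady state cannot be evaluated")
    return successes / attempts


def within_steady_state(observed, baseline, tolerance=0.02):
    """True if the observed rate deviates from the baseline by at most `tolerance`."""
    return abs(observed - baseline) <= tolerance
```

Suppose that 970 of 1000 orders normally succeed, giving a baseline of `success_rate(970, 1000)`. If only 915 of 1000 succeed during an experiment, `within_steady_state(success_rate(915, 1000), success_rate(970, 1000))` returns `False`, and the experiment should be stopped and investigated.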

Hypothesis

Think in advance about what should happen, and then prove it through your experiment. If your hypothesis is invalidated, you must locate the error based on the findings and bring it up with your team or company. This is sometimes the hardest part – definitely avoid finger-pointing and scapegoating! As a chaos engineer, your goal is to understand how the system behaves and to present this knowledge to the developers. This is why it’s important to get everyone on board early and to let them participate in your experiments.

Real-world events

What awaits us in real life? What new mistakes could happen, and which ones have already ruined our previous weekends? These and other questions must be asked and tested for in a controlled experiment.

Potential examples include:

Failure of a node in a Kafka cluster
Dropped network packets
Hardware errors
Insufficient max-heap-size for the JVM
Increased latency
Malformed responses

You can extend the list however you like and it will always be closely linked to the chosen architecture. Even if your application is not hosted at one of the well-known cloud providers, things will go wrong in your own company’s data center. I strongly suspect you could tell me a thing or two about it!
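Two of the events listed above, increased latency and failing calls, can be simulated with a small wrapper around a function call. The Python sketch below is an illustration only; `with_chaos`, `InjectedFault` and all parameter names are hypothetical and do not belong to an existing library.

```python
import random
import time


class InjectedFault(Exception):
    """Raised in place of the real call when a failure is injected."""


def with_chaos(func, *, latency_s=0.0, failure_rate=0.0, rng=None, sleep=time.sleep):
    """Wrap `func` so that every call is delayed and a share of the calls fails.

    latency_s    -- artificial delay added before each call, in seconds
    failure_rate -- probability between 0.0 and 1.0 that a call raises InjectedFault
    rng, sleep   -- injectable for deterministic tests
    """
    rng = rng or random.Random()
    name = getattr(func, "__name__", "call")

    def wrapper(*args, **kwargs):
        if latency_s > 0:
            sleep(latency_s)  # simulate increased latency
        if rng.random() < failure_rate:
            raise InjectedFault(f"injected failure in {name}")
        return func(*args, **kwargs)

    return wrapper
```

A client call could then be wrapped as `slow_stock = with_chaos(fetch_stock, latency_s=0.8)`, where `fetch_stock` stands for whatever function normally talks to the remote service. Because the wrapper is applied explicitly, the fault stays limited to the calls you chose to wrap.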

Chaos experiment example

Say we have several microservices connected via a REST API and using service discovery. As a safeguard, the ‘Product Service’ maintains a local cache of the stock from the ‘Warehouse Service’. Data from the cache should always be delivered if the Warehouse Service has not replied within 500 ms. In the Java environment, we can achieve this behavior with Hystrix or resilience4j, for example. Thanks to these libraries, we can implement fallbacks and other resilience patterns very easily and effectively.
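The fallback behavior described above can be sketched in plain Python. This is not how Hystrix or resilience4j are implemented or used; it only illustrates the pattern of a time limit with a cache fallback. The names `stock_with_fallback` and `fetch_remote` as well as the dict-based cache are assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

# Shared pool: a pool created per call inside a `with` block would wait for
# the slow call to finish on exit and thereby defeat the time limit.
_executor = ThreadPoolExecutor(max_workers=4)


def stock_with_fallback(product_id, fetch_remote, cache, timeout_s=0.5):
    """Return (stock, source), where source is "remote" or "cache"."""
    future = _executor.submit(fetch_remote, product_id)
    try:
        stock = future.result(timeout=timeout_s)
    except FutureTimeout:
        # No reply within the time limit: serve the cached value instead.
        # A product that is missing from the cache raises KeyError here,
        # so the fallback itself can fail.
        return cache[product_id], "cache"
    cache[product_id] = stock  # refresh the cache after every successful call
    return stock, "remote"
```

Two details matter for a chaos experiment. The timed-out call keeps running in the background and occupies a worker thread, and the fallback only helps for products that are already in the cache. Both are assumptions worth testing rather than trusting.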

Environment

Chaos Experiment

Below you will find the information required for a successful experiment.

Target: Warehouse Service
Experiment Type: Latency
Hypothesis: Due to the increased latency when the Warehouse Service is called, 30% of the requests are delivered using the local cache of the Product Service.
Blast Radius: Product Service & Warehouse Service
Status – Before: OK
Status – After: ERROR
Finding: Product Service fallback failed and led to exceptions, since the cache could not answer all requests.

As you can see from the result, we used resilience patterns but still had errors. These errors must be fixed, and the fix must be verified by running the experiment again.
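An experiment sheet like the one above can also be kept as structured data, so that results stay comparable between runs. The following Python sketch is one possible shape; the class, its field names and the request counts in the usage note are illustrative assumptions. Only the values for target, type, hypothesis, blast radius and finding are taken from the example.

```python
from dataclasses import dataclass


@dataclass
class ChaosExperiment:
    target: str
    experiment_type: str
    hypothesis: str
    blast_radius: list
    status_before: str = "OK"
    status_after: str = "NOT RUN"
    finding: str = ""

    def conclude(self, total_requests, failed_requests, finding=""):
        """Record the outcome; any failed request marks the run as ERROR."""
        if total_requests <= 0:
            raise ValueError("the experiment must observe at least one request")
        self.status_after = "OK" if failed_requests == 0 else "ERROR"
        self.finding = finding
        return self.status_after


experiment = ChaosExperiment(
    target="Warehouse Service",
    experiment_type="Latency",
    hypothesis="30% of the requests are delivered using the local cache",
    blast_radius=["Product Service", "Warehouse Service"],
)
```

Calling `experiment.conclude(1000, 12, "Product Service fallback failed")` with these made-up request counts sets the status to `ERROR`. This sketch only checks for failed requests; verifying the expected 30% cache share would need an additional measurement.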

Automated experiments

Chaos Engineering must be practiced continuously, because your systems are constantly changing: new versions go into production, hardware is replaced, firewall rules are adapted and servers are restarted. The elegant way is to establish the culture of Chaos Engineering in your company and in the minds of its people. Netflix achieved this by letting the Simian Army loose on the products of their developers – in production! At some point they decided to do this only during working hours, but they still do it.
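Restricting automated attacks to working hours is easy to express as a guard that a scheduler checks before each run. The Python sketch below assumes working hours from 09:00 to 17:00, Monday to Friday; these times are an assumption for illustration and not Netflix’s actual schedule.

```python
import datetime as dt


def attacks_allowed(now, start=dt.time(9, 0), end=dt.time(17, 0)):
    """True from Monday to Friday between `start` (inclusive) and `end` (exclusive)."""
    is_weekday = now.weekday() < 5  # Monday is 0, Sunday is 6
    return is_weekday and start <= now.time() < end
```

An automated experiment would call `attacks_allowed(dt.datetime.now())` first and skip the run if it returns `False`, so that failures surface while the people who can fix them are at their desks.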

Environment

For your first chaos experiments, please choose an environment that is identical to the one in production. You will not gain any meaningful insights in a test environment where nothing is going on and that does not correspond to the setup of the platform in production. Once you’ve gained initial insights and made improvements, you can move on to production and carry out your experiments there as well. The aim of Chaos Engineering is to run it in production, always being in control of the situation and keeping customers unaffected.

First steps into Chaos Engineering

At first, it was difficult for me to internalize and explore the ideas of Chaos Engineering in everyday life. I am not a senior chaos engineer working at Netflix, Google, Facebook or Uber. The customers I support with my expertise are just beginning to grasp and implement the principles and designs of microservices. However, what I almost always found was Spring Boot! Sometimes stand-alone, sometimes packed in a Docker container. My demos, which I use in my talks at conferences, always included at least one Spring Boot application. This led to the birth of the Chaos Monkey for Spring Boot, which makes it possible to attack existing Spring Boot applications without modifying a single line of code.

Chaos Monkey for Spring Boot

All relevant information about how Chaos Monkey for Spring Boot can help you on your way to a stable Spring Boot infrastructure can be found on GitHub.

Chaos Monkey for Spring Boot

Conclusion

I hope I was able to give you a basic understanding of the ideas and principles behind Chaos Engineering. This topic is very important and we have a lot of catching up to do, even if we don’t work at Netflix and Co. I love my job, but even more, I love my personal life with family and friends. I don’t want to spend endless evenings and weekends fixing Prio 1 tickets; Chaos Engineering gives us confidence in our system’s ability to withstand the turbulent conditions in production.

Blog author

Benjamin Wilms
