Chaos Engineering
Most teams test that their software works when everything is healthy. Chaos engineering tests the opposite: it deliberately breaks things to see whether your system survives. The goal is not to cause damage but to find weak spots before your users do, while you are watching, in daylight, with a way to stop. If your service is going to fall over when a server dies, you want to discover that during a controlled experiment, not at 3am during a real outage.
What chaos engineering actually is
Chaos engineering is the practice of running controlled experiments on a running system to learn how it behaves under failure. The idea was made famous by Netflix, which built a tool called Chaos Monkey: a program that randomly terminates virtual machine instances in its production environment during working hours. The logic is simple but bold: if a random machine can disappear at any moment and nobody notices, then the system is genuinely resilient (able to keep working when parts of it fail). If something breaks, the team learns about it on a Tuesday afternoon instead of during a holiday weekend.
This matters because real distributed systems fail in ways you cannot predict by reading code. A database connection times out, a disk fills up, a network link gets slow, a dependency goes away. Chaos engineering forces these events to happen on your terms so you can fix the cracks they reveal.
Chaos engineering is not “randomly breaking production and hoping.” Done wrong, it is just an outage you caused yourself. Every experiment needs a hypothesis, a small blast radius, and an instant stop button. Earn the right to test in production by first testing in staging.
The hypothesis-driven experiment
A good chaos experiment is a scientific experiment, not a tantrum. It follows four steps.
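- Define steady state. Decide what “healthy” looks like as a measurable number, for example “99% of requests return in under 300ms” or “checkout success rate stays above 99.5%”. The quantity you measure (request latency, checkout success rate) is your SLI (Service Level Indicator, the actual measured signal of health); the threshold is the target that signal must keep meeting during the experiment.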
- Form a hypothesis. Write down what you believe will happen. For example: “If one of our three web servers dies, the load balancer (a server that spreads traffic across many backends) will route around it and steady state will hold.”
- Inject the failure. Cause the real-world event — kill a process, add latency, fill a disk.
- Compare and learn. Did steady state hold? If yes, you have proof of resilience. If no, you have found a bug worth fixing.
The key word is hypothesis. You are not breaking things to break them; you are testing a specific belief about your system and either confirming it or being proven wrong.
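The steady-state check from the first step can be reduced to a small script that you run before and after the injection. The sketch below is one way to do it in shell, assuming latencies are available as whole milliseconds, one per line (for example, extracted from an access log). The function names `p99` and `steady_state_holds` and the 300 ms limit are illustrative, not part of any tool.

```shell
#!/bin/sh
# Sketch: check the steady state "99% of requests return in under 300ms".
# Input (assumed format): one integer latency in milliseconds per line on stdin.

# p99: print the 99th-percentile latency using the nearest-rank method.
p99() {
    sort -n | awk '
        { v[NR] = $1 }
        END {
            if (NR == 0) exit 1                 # no samples, so no verdict
            rank = int((NR * 99 + 99) / 100)    # ceil(NR * 0.99)
            print v[rank]
        }'
}

# steady_state_holds LIMIT_MS: succeed if the p99 of stdin is below LIMIT_MS.
steady_state_holds() {
    limit_ms=$1
    observed=$(p99) || return 2
    echo "p99=${observed}ms limit=${limit_ms}ms"
    [ "$observed" -lt "$limit_ms" ]
}
```

Called as `steady_state_holds 300 < latencies.txt` (a hypothetical file), it exits with status 0 when the hypothesis held, so it can gate the next step of an experiment.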
Common experiments
These are the failures teams inject most often. You can run all of them on a single Ubuntu server to learn the mechanics before going near production.
| Experiment | What it simulates | When to use it |
|---|---|---|
| Kill a process / instance | A server or container crashing | Verify failover and auto-restart work |
| Add network latency | A slow database or slow API | Test timeouts and retry logic |
| Drop network packets | A flaky network link | Check the app degrades, not collapses |
| Fill the disk | /var or /tmp running out of space | Confirm alerts fire and the app survives |
| Burn CPU | A noisy neighbour or runaway job | Test autoscaling and throttling |
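The next two sections walk through the first two rows. For the disk and CPU rows, a minimal sketch might look like the following. The helper names `fill_disk` and `burn_cpu` are hypothetical, and the sketch assumes the standard `dd` and `timeout` utilities found on Ubuntu. Each helper is bounded by a size or a duration so the injection cannot run away.

```shell
#!/bin/sh
# Sketch: bounded disk-fill and CPU-burn injections (helper names are illustrative).

# fill_disk FILE SIZE_MB: write SIZE_MB mebibytes of zeros to FILE.
# Deleting FILE is the abort; point FILE at the filesystem under test.
fill_disk() {
    dd if=/dev/zero of="$1" bs=1048576 count="$2" 2>/dev/null
}

# burn_cpu SECONDS: keep one core busy, then stop automatically.
# timeout(1) kills the busy loop after SECONDS, which acts as a built-in stop.
burn_cpu() {
    status=0
    timeout "$1" sh -c 'while :; do :; done' || status=$?
    # timeout exits with 124 when it had to end the command, the expected path here.
    [ "$status" -eq 124 ]
}
```

For example, `fill_disk /tmp/chaos-fill.bin 500` writes 500 MiB and `rm /tmp/chaos-fill.bin` undoes it, while `burn_cpu 30` loads a single core for 30 seconds. Start several copies in the background to load more cores.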
Killing a process
The simplest experiment is turning something off. On Ubuntu, suppose your app runs under systemd (the standard service manager on Ubuntu that starts and restarts background programs).
# Check the service state and note its main PID
sudo systemctl status myapp.service
Output:
● myapp.service - My web application
Loaded: loaded (/etc/systemd/system/myapp.service; enabled)
Active: active (running) since Mon 2026-06-15 09:14:02 UTC
Main PID: 1843 (node)
Now kill it abruptly, the way a crash would, and watch what happens:
sudo kill -9 1843
sleep 2
sudo systemctl status myapp.service
Output:
Active: active (running) since Mon 2026-06-15 09:21:48 UTC
Main PID: 2104 (node)
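A new PID (process ID, the number Linux gives each running program) means systemd restarted the app automatically. That only happens if the unit file sets a restart policy that covers being killed by a signal, such as Restart=always or Restart=on-failure; the default, Restart=no, leaves the service down. You can check the policy with systemctl show -p Restart myapp.service. If the app did not come back, you just found a real gap.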
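The manual check above can be scripted so that it is repeatable. This is a sketch, assuming a systemd unit named `myapp.service` as in the example; the function names `main_pid` and `verify_restart` and the default five-second wait are illustrative choices. It reads the main PID with `systemctl show`, kills it with SIGKILL, and compares the PID afterwards.

```shell
#!/bin/sh
# Sketch: kill a systemd service's main process and confirm systemd restarts it.

# main_pid UNIT: print the unit's current main PID (0 if it is not running).
main_pid() {
    systemctl show --property=MainPID --value "$1"
}

# verify_restart UNIT [WAIT_SECONDS]: succeed if the unit comes back with a new PID.
verify_restart() {
    unit=$1
    wait_s=${2:-5}
    old_pid=$(main_pid "$unit")
    if [ -z "$old_pid" ] || [ "$old_pid" -eq 0 ]; then
        echo "$unit is not running; nothing to kill" >&2
        return 2
    fi
    sudo kill -9 "$old_pid"     # the injection: an abrupt crash
    sleep "$wait_s"             # give systemd time to restart the unit
    new_pid=$(main_pid "$unit")
    echo "old PID $old_pid, new PID ${new_pid:-none}"
    [ -n "$new_pid" ] && [ "$new_pid" -ne 0 ] && [ "$new_pid" -ne "$old_pid" ]
}
```

Running `verify_restart myapp.service` prints both PIDs and exits non-zero when the service stayed down, which makes the result easy to record alongside the hypothesis.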
Adding network latency
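To simulate a slow network, use tc (traffic control, the standard Linux tool from the iproute2 package for shaping network behaviour) with its netem queueing discipline. Add 200 milliseconds of delay to every outgoing packet on the main network interface, assumed here to be eth0 (run ip -br link to find the real name on your machine):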
# Add 200ms delay to all egress traffic on eth0
sudo tc qdisc add dev eth0 root netem delay 200ms
Run your normal health check while the delay is active and watch whether timeouts and retries behave. Then always remove it — this is your stop button:
# Undo the delay (the abort)
sudo tc qdisc del dev eth0 root netem
The tc ... del line is the most important command in the whole experiment. Before you ever run an injection, know exactly how to reverse it, and have that command ready in another terminal. An experiment you cannot stop is an outage.
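One way to make the stop button hard to forget is to let the shell press it for you. The sketch below wraps the two `tc` commands from above so the delay is removed when the wrapped command finishes, fails, or is interrupted. `with_latency` is a hypothetical helper name, and the interface and delay are whatever you pass in.

```shell
#!/bin/sh
# Sketch: add netem delay, run a check, and always remove the delay afterwards.

# with_latency IFACE DELAY COMMAND...: run COMMAND while IFACE has DELAY added.
# The body runs in a subshell so the traps are scoped to this one experiment.
with_latency() (
    iface=$1
    delay=$2
    shift 2
    sudo tc qdisc add dev "$iface" root netem delay "$delay" || exit 1
    # Register the abort only after a successful add, so a qdisc this
    # function did not create is never deleted.
    trap 'sudo tc qdisc del dev "$iface" root netem' EXIT
    trap 'exit 130' INT TERM    # turn Ctrl-C into a normal exit so EXIT fires
    "$@"
    exit $?
)
```

A call might look like `with_latency eth0 200ms curl -fsS --max-time 5 http://localhost:8080/health`, where the health URL is an assumption. The exit status is that of the wrapped command, so a failed health check is still visible after the cleanup has run.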
Blast radius — doing it safely
The blast radius is how much of your system an experiment can affect if it goes wrong. The golden rule of chaos engineering is: start with the smallest possible blast radius and grow it only after each step succeeds.
A safe progression looks like this:
1. Run the experiment in staging, not production.
2. In production, target ONE server out of many.
3. Limit it to a fraction of traffic (e.g. 1% of users).
4. Run during business hours, with the team watching.
5. Only widen the radius after the small test passes.
6. Keep the abort/stop command one keystroke away.
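Some of these rules can be enforced by a pre-flight check rather than by memory. The following is a sketch only: the function names, the Monday to Friday 09:00 to 17:00 window, and the percentage cap are assumptions chosen for illustration, and your own limits will differ.

```shell
#!/bin/sh
# Sketch: refuse to start an experiment that breaks the blast-radius rules.

# radius_ok TARGETS TOTAL MAX_PERCENT: succeed if TARGETS/TOTAL <= MAX_PERCENT.
radius_ok() {
    [ "$2" -gt 0 ] && [ $(( $1 * 100 )) -le $(( $2 * $3 )) ]
}

# business_hours DAY HOUR: succeed on Mon-Fri (1-5) from 09:00 up to 16:59.
# DAY is the output of `date +%u` (1 = Monday); HOUR is the output of `date +%H`.
business_hours() {
    hour=${2#0}                 # strip one leading zero: "09" becomes "9"
    [ "$1" -le 5 ] && [ "${hour:-0}" -ge 9 ] && [ "${hour:-0}" -lt 17 ]
}

# preflight TARGETS TOTAL MAX_PERCENT: run both checks against the current time.
preflight() {
    radius_ok "$1" "$2" "$3" || { echo "blast radius too large" >&2; return 1; }
    business_hours "$(date +%u)" "$(date +%H)" || { echo "outside business hours" >&2; return 1; }
}
```

With 20 servers and a 5% cap, `preflight 1 20 5` passes during the working day and `preflight 2 20 5` is refused, so the check can sit in front of whatever command starts the injection.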
Tooling helps you stay safe at scale. Modern options in 2026 include Chaos Mesh and LitmusChaos for Kubernetes (a system for running containers across many machines), and Gremlin as a commercial platform with built-in safety guards. These let you scope an experiment to specific pods, set a time limit, and halt everything instantly.
When to use chaos engineering — and when not to
Chaos engineering is worth it once your system is distributed, important, and already monitored. You need good dashboards and alerts first, because an experiment is useless if you cannot see its effect.
Do not start here if you are a small app on a single server with no monitoring, or if you already have a long list of known reliability bugs — fix the failures you already know about before hunting for new ones. Chaos engineering is for proving resilience you believe you have, not for skipping the basics.
Best practices
- Always start with a hypothesis. If you cannot say what you expect to happen, you are not ready to run the experiment.
- Monitor first, chaos second. You need dashboards and alerts in place before any injection, or you will learn nothing.
- Keep the blast radius tiny at first. One server, a little traffic, business hours — then grow.
- Have an abort button ready. Know and pre-stage the exact command that stops the experiment instantly.
- Begin in staging. Earn your way to production experiments by proving the process works in a safe environment.
- Automate once it is stable. Tools like Chaos Mesh let you run experiments continuously, the way Netflix runs Chaos Monkey.
- Write down what you learned. Every weakness found is a ticket to fix, the same as a real incident.