Vocabulary for Chaos Engineers
Essential chaos engineering vocabulary: blast radius, steady state, experiment hypothesis, failure injection, game day, turbulence, and more explained with examples.
Chaos engineering is the discipline of deliberately introducing failures into a system to discover weaknesses before they cause unplanned outages. It was pioneered by Netflix and has become a core practice in site reliability engineering. The vocabulary of chaos engineering is specific and precise — mastering it will help you design experiments, communicate risk to stakeholders, and participate in game days with confidence.
Chaos Engineering
Chaos engineering is the practice of running controlled experiments on a system in order to build confidence in the system’s ability to withstand turbulent, unexpected conditions in production.
“Chaos engineering is not about breaking things randomly — it is about running principled experiments that test specific hypotheses about system resilience.” “We introduced chaos engineering after our third unexpected database failover caused an outage. We needed to know whether our recovery mechanisms actually worked.”
Steady State
The steady state is the normal, healthy operating condition of a system, defined by measurable metrics such as error rate, throughput, and latency. A chaos experiment tests whether the system maintains steady state during a disruption or returns to it afterwards.
“Before running an experiment, we define our steady state: p99 latency under 200ms, error rate below 0.1%, and throughput above 1,000 requests per second.” “A successful chaos experiment is one where the system maintains or returns to steady state despite the injected failure. If it does not, we have found a weakness.”
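A steady state like the one quoted above can be written down as explicit thresholds and checked against measured values. The following is a minimal sketch, assuming the metrics have already been collected elsewhere; the names `Metrics`, `SteadyState`, `violations`, and `holds` are hypothetical and not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Metrics:
    """One measurement window (hypothetical structure)."""
    p99_latency_ms: float
    error_rate: float       # fraction of failed requests: 0.001 means 0.1%
    throughput_rps: float


@dataclass(frozen=True)
class SteadyState:
    """Thresholds taken from the example: p99 < 200 ms, errors < 0.1%, > 1,000 rps."""
    max_p99_latency_ms: float = 200.0
    max_error_rate: float = 0.001
    min_throughput_rps: float = 1000.0

    def violations(self, m: Metrics) -> list[str]:
        """Return a human-readable entry for every threshold that is not met."""
        problems = []
        if m.p99_latency_ms >= self.max_p99_latency_ms:
            problems.append(f"p99 latency {m.p99_latency_ms} ms is not under {self.max_p99_latency_ms} ms")
        if m.error_rate >= self.max_error_rate:
            problems.append(f"error rate {m.error_rate:.4f} is not below {self.max_error_rate:.4f}")
        if m.throughput_rps <= self.min_throughput_rps:
            problems.append(f"throughput {m.throughput_rps} rps is not above {self.min_throughput_rps} rps")
        return problems

    def holds(self, m: Metrics) -> bool:
        """True when every threshold is satisfied."""
        return not self.violations(m)
```

Returning the list of violations rather than a bare boolean makes it easier to report which part of the steady state was lost during an experiment.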
Experiment Hypothesis
An experiment hypothesis in chaos engineering follows the format: “If we inject X failure, the system will maintain steady state because of Y safeguard.” The hypothesis is specific, falsifiable, and tied to measurable outcomes.
“Our hypothesis: if we terminate 50% of the instances in the payment service, the load balancer will route traffic to the remaining instances within 30 seconds and the error rate will not exceed 1%.” “We were wrong about our hypothesis. When we terminated the instances, the health checks took 90 seconds to kick in — far longer than we assumed.”
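Writing the hypothesis as data makes it harder to reinterpret after the fact. This is a rough sketch under the assumption that recovery time and peak error rate are measured by some other means; the `Hypothesis` class and its field names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """A falsifiable statement with measurable limits (hypothetical structure)."""
    failure: str                 # the X in "if we inject X"
    safeguard: str               # the Y in "because of Y"
    max_recovery_seconds: float
    max_error_rate: float        # fraction: 0.01 means 1%

    def statement(self) -> str:
        return (
            f"If we {self.failure}, the system will maintain steady state "
            f"because of {self.safeguard}."
        )

    def is_supported(self, recovery_seconds: float, peak_error_rate: float) -> bool:
        """Compare observed values against the limits agreed before the experiment."""
        return (
            recovery_seconds <= self.max_recovery_seconds
            and peak_error_rate <= self.max_error_rate
        )


payment_failover = Hypothesis(
    failure="terminate 50% of the instances in the payment service",
    safeguard="the load balancer rerouting traffic to the remaining instances",
    max_recovery_seconds=30.0,
    max_error_rate=0.01,
)
```

With the observation from the second quote, `payment_failover.is_supported(recovery_seconds=90.0, peak_error_rate=0.005)` evaluates to `False`: the 90-second health check delay alone is enough to refute the hypothesis. The 0.005 error rate in that call is an invented value for illustration.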
Failure Injection
Failure injection is the act of deliberately introducing a fault into a system — such as killing a process, introducing network latency, filling a disk, or corrupting a configuration file. Failure injection tools include Chaos Monkey, Gremlin, and LitmusChaos.
“We use Gremlin to inject network latency between the order service and the inventory service. We simulate 500ms of added latency and observe whether the order service’s circuit breaker triggers correctly.” “Failure injection should be gradual. Start with a small blast radius in a non-production environment before moving to production.”
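Tools such as Gremlin inject faults at the infrastructure level. The idea can also be illustrated in-process with a small decorator that delays calls to a dependency. The sketch below is not the Gremlin API; `inject_latency` and `check_inventory` are hypothetical names, and the `sleep` parameter exists only so the delay can be stubbed out in tests.

```python
import functools
import random
import time


def inject_latency(delay_seconds, probability=1.0, rng=None, sleep=time.sleep):
    """Decorator that delays a fraction of calls by a fixed amount.

    probability is the share of calls to delay (1.0 delays every call).
    """
    rng = rng or random.Random()

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            # rng.random() is in [0.0, 1.0), so probability=1.0 always delays
            # and probability=0.0 never does.
            if rng.random() < probability:
                sleep(delay_seconds)
            return func(*args, **kwargs)

        return wrapper

    return decorator


@inject_latency(0.5)  # 500 ms of added latency, as in the example above
def check_inventory(sku):
    """Stand-in for a call from the order service to the inventory service."""
    return {"sku": sku, "in_stock": True}
```

Lowering `probability` is one way to keep the injection gradual: only a share of calls sees the added delay while the caller's timeout and circuit breaker behaviour is observed.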
Blast Radius
Blast radius refers to the scope of impact — how many users, services, or components will be affected if a particular failure occurs. Controlling the blast radius is a safety practice in chaos engineering: start small and expand incrementally.
“We limit the blast radius of our initial experiments to 5% of production traffic. If we see unexpected behaviour, we stop before it affects more users.” “The blast radius of a database failure is larger than the blast radius of a single application instance failure — we run database experiments with much more caution.”
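One common way to hold an experiment to a fixed share of traffic is to hash a stable identifier into buckets. The sketch below assumes each request carries a user ID; the function name `in_blast_radius` and the `experiment` label are hypothetical.

```python
import hashlib


def in_blast_radius(user_id: str, percent: float, experiment: str = "latency-test") -> bool:
    """Return True if this user falls inside the experiment's blast radius.

    The same user always lands in the same bucket for a given experiment,
    so the affected cohort stays stable between requests.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100  # percent=5 selects buckets 0..499
```

Because the bucket for a user does not change, raising `percent` from 5 to 10 keeps the original cohort and adds new users to it, which matches the practice of expanding the blast radius incrementally.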
Game Day
A game day is a scheduled, facilitated chaos engineering exercise in which an engineering team deliberately triggers failures and practises responding to them — similar to a fire drill.
“We run a quarterly game day where we simulate a range of failure scenarios: availability zone failure, external API outage, and database leader election. Each team must detect, respond, and restore within the agreed SLA.” “Game days expose gaps in runbooks, alerting, and incident response processes that would otherwise only surface in a real incident — at the worst possible moment.”
Turbulence
Turbulence is a metaphor used in chaos engineering for the unpredictable conditions of production: unexpected traffic spikes, hardware failures, network partitions, and software bugs that combine in ways not anticipated in testing.
“Our staging environment is stable. Our production environment is turbulent. Chaos engineering helps us close the gap between what we think will happen and what actually happens.”
Abort Conditions
Abort conditions (or stop conditions) are pre-defined thresholds that automatically halt a chaos experiment when the system degrades beyond an acceptable limit. They are a safety mechanism that prevents controlled experiments from becoming uncontrolled outages.
“We set an abort condition: if the error rate exceeds 5%, the experiment is automatically halted and the injected failure is reversed.” “Never run a chaos experiment without abort conditions. The point is to find weaknesses — not to cause an incident.”
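The abort logic can be sketched as a loop that watches the error rate and always reverses the injected failure on the way out. This is a simplified illustration, not the behaviour of any specific tool: `run_experiment`, `inject`, and `rollback` are hypothetical names, and the error-rate samples are assumed to come from an existing monitoring system.

```python
from typing import Callable, Iterable


def run_experiment(
    inject: Callable[[], None],
    rollback: Callable[[], None],
    error_rates: Iterable[float],
    abort_error_rate: float = 0.05,  # 5%, as in the example above
) -> str:
    """Inject a failure, watch the error rate, and always reverse the failure."""
    inject()
    try:
        for rate in error_rates:
            if rate > abort_error_rate:
                return "aborted"  # stop reading samples as soon as the threshold is crossed
        return "completed"
    finally:
        # Runs on normal completion, on abort, and if reading a sample raises.
        rollback()
```

Putting the rollback in a `finally` block means the failure is reversed even when the monitoring source itself raises an error part-way through the experiment.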
Observability in Chaos Engineering
Chaos experiments are meaningless without observability — the ability to measure the system’s behaviour during and after the experiment. You need metrics, logs, and traces to verify whether the hypothesis held.
“Before running the experiment, ensure your observability stack is in place. If you cannot measure the steady state, you cannot determine whether the experiment passed or failed.”
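As a small illustration of what measuring involves, the sketch below derives p99 latency and error rate from raw request records using the nearest-rank percentile method. In practice a metrics backend would do this; the record shape `(latency_ms, succeeded)` and the function names are assumptions made for the example.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("cannot compute a percentile of no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def summarise(requests):
    """Summarise a window of (latency_ms, succeeded) request records."""
    latencies = [latency for latency, _ in requests]
    failures = sum(1 for _, succeeded in requests if not succeeded)
    return {
        "p99_latency_ms": percentile(latencies, 99),
        "error_rate": failures / len(requests),
    }
```

Computing the same summary for the windows before, during, and after the failure injection gives the three data points needed to say whether steady state was maintained or recovered.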
Practical Phrases for Chaos Engineers
- “Let’s define the steady state before we design the experiment.”
- “The hypothesis is: if we inject 200ms of latency on the payment gateway, the checkout flow degrades gracefully with a user-visible warning rather than a silent failure.”
- “We’ll start with a blast radius of 10% of traffic and expand if the system holds.”
- “The game day is scheduled for next Thursday. Teams should review their runbooks beforehand.”
- “The abort condition is: total error rate exceeds 2%. If we hit that, we stop immediately.”
- “The experiment passed — the system maintained steady state throughout the failure injection period.”
Chaos engineering vocabulary reflects the discipline’s philosophy: systematic, scientific, and safety-first. Mastering these terms will help you design rigorous experiments, communicate risk to product and business stakeholders, and build genuinely resilient systems that survive the turbulence of real-world production environments.