Vocabulary for Chaos Engineering: 28 Terms Every SRE Should Know
Learn the essential English vocabulary of chaos engineering — steady state, blast radius, game day, fault injection, hypothesis, and more for SRE and platform teams.
Chaos engineering is the practice of deliberately introducing failures into a system to discover weaknesses before they manifest as incidents. It is one of the most intellectually rich disciplines in site reliability engineering — and it comes with its own precise vocabulary that you need to master to participate in experiments, write runbooks, and communicate findings.
This guide covers the 28 most important chaos engineering terms with clear definitions, usage examples, and the context needed to use them accurately in technical conversations.
The Philosophy in One Sentence
“Break things deliberately, in a controlled way, before reality breaks them for you.”
Chaos engineering is not about randomly destroying infrastructure. It is a scientific discipline built around forming hypotheses, designing controlled experiments, and learning from the results.
Core Concepts
1. Steady State
Steady state is the most important concept in chaos engineering: the measurable, normal behaviour of a system when everything is working correctly. It is defined before an experiment begins and serves as the baseline for comparison.
Steady state is typically expressed in terms of observable metrics:
- Request success rate above 99.9%
- P99 latency below 300ms
- Order throughput above 500 orders/minute
Usage: “Before we ran the experiment, we established steady state: the checkout service was processing 600 orders/minute with a 99.95% success rate.”
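A steady-state definition like the one above can be written down as explicit thresholds and checked in code. The following is a minimal sketch in Python; the `SteadyState` class, the `within_steady_state` function, and the metric key names are illustrative assumptions, not part of any particular tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Thresholds describing normal behaviour (values from the example list)."""
    min_success_rate: float = 0.999        # request success rate above 99.9%
    max_p99_latency_ms: float = 300.0      # P99 latency below 300 ms
    min_orders_per_minute: float = 500.0   # throughput above 500 orders/minute


def within_steady_state(metrics: dict, steady: SteadyState = SteadyState()) -> bool:
    """Return True if every observed metric is inside its threshold.

    `metrics` is assumed to be a dict with the three hypothetical keys below,
    for example as read from a monitoring system.
    """
    return (
        metrics["success_rate"] > steady.min_success_rate
        and metrics["p99_latency_ms"] < steady.max_p99_latency_ms
        and metrics["orders_per_minute"] > steady.min_orders_per_minute
    )
```

A check like this would typically run before the experiment, to confirm the baseline holds, and again throughout the observation window.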
2. Hypothesis
The prediction you make before running a chaos experiment. A well-formed hypothesis follows the pattern:
“We believe that [system] will maintain [steady state metric] even when [failure condition is introduced], because [reasoning about resilience mechanism].”
Example hypothesis: “We believe the API gateway will maintain a 99.5% success rate even when one of the three backend nodes is terminated, because the load balancer will route traffic to the remaining healthy nodes within 10 seconds.”
3. Fault Injection
The deliberate introduction of a failure into a system component. This can be done at many levels:
- Network level: packet loss, latency, bandwidth throttling
- Process level: killing a process or container
- Resource level: CPU stress, memory pressure, disk fill
- Application level: returning error responses from a dependency
Usage: “We injected a 500ms latency fault into calls from the recommendation service to the product catalogue.”
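At the application level, a fault can be as simple as a wrapper that delays or fails a call. The sketch below shows one hypothetical way to do this in Python. The `inject_fault` decorator and its parameters are invented for illustration; dedicated tools usually inject faults at the network or infrastructure layer instead.

```python
import random
import time
from functools import wraps


def inject_fault(latency_s=0.0, error_rate=0.0, rng=random.random, sleep=time.sleep):
    """Decorator that adds latency and/or random errors to a function call.

    `rng` and `sleep` are injectable so the behaviour can be tested
    without real randomness or real waiting.
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            if latency_s > 0:
                sleep(latency_s)                  # latency fault
            if rng() < error_rate:
                raise ConnectionError("injected fault")  # error fault
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Decorating a client function with `@inject_fault(latency_s=0.5)` would approximate the 500ms latency fault in the usage example above.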
4. Blast Radius
The scope of impact of an experiment — how many users, services, or systems could be affected if the hypothesis is wrong. Before any chaos experiment, you must define and constrain the blast radius.
Usage: “We limited the blast radius to 5% of production traffic using a feature flag, so the experiment affected no more than 2,000 users.”
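One common way to limit an experiment to a fixed share of traffic is deterministic bucketing: hash each user ID and include only users whose bucket falls below the chosen percentage. This is a sketch under assumptions; the function name, the salt, and the use of SHA-256 are illustrative choices, not how any specific feature-flag product works.

```python
import hashlib


def in_blast_radius(user_id: str, percent: float, salt: str = "experiment-1") -> bool:
    """Return True if this user falls inside the experiment's blast radius.

    The same user always gets the same answer for a given salt, so the
    affected population stays stable for the duration of the experiment.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # buckets 0..9999
    return bucket < percent * 100                        # 5% -> buckets 0..499
```

Changing the salt for each experiment avoids exposing the same users to every test.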
5. Game Day
A scheduled, collaborative chaos engineering session where a team deliberately runs failure scenarios to test the system’s resilience and the team’s response. Often includes engineers, SREs, and sometimes business stakeholders.
Usage: “We ran a game day on Thursday — we simulated a full regional outage and practiced the runbook for failover to our backup region.”
6. Experiment
A structured test that introduces a fault, observes the system’s response, and compares it against the hypothesis. A chaos experiment is not a random act of destruction — it is a scientific trial.
A complete experiment record includes:
- Hypothesis
- Steady state definition
- Fault injection method
- Observation window
- Results and conclusion
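The elements of an experiment record listed above map naturally onto a small data structure. The sketch below is one possible shape, not a standard schema; the class, its field names, and the simplification that every steady-state metric is a minimum value are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ExperimentRecord:
    hypothesis: str
    steady_state: dict           # metric name -> minimum acceptable value
    fault_injection: str         # how the fault is introduced
    observation_window_s: int    # how long the system is observed
    results: dict = field(default_factory=dict)
    conclusion: str = ""

    def conclude(self, observed: dict) -> str:
        """Store observed metrics and compare them with the steady state."""
        self.results = observed
        held = all(
            observed[name] >= minimum
            for name, minimum in self.steady_state.items()
        )
        self.conclusion = "hypothesis held" if held else "hypothesis refuted"
        return self.conclusion
```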
Experiment Design Vocabulary
7. Abort Condition
A pre-defined threshold that triggers automatic termination of the experiment if the system degrades beyond an acceptable level. Essential for safety.
Example: “Abort condition: if the error rate exceeds 2%, halt the experiment and restore the network immediately.”
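An abort condition like this one can be enforced by a small watchdog loop that polls the error rate while the fault is active. The following is a minimal sketch; `read_error_rate` and `stop_fault` are hypothetical callables that would be supplied by your monitoring and fault-injection tooling.

```python
import time


def run_with_abort(read_error_rate, stop_fault, max_error_rate=0.02,
                   checks=10, interval_s=1.0, sleep=time.sleep):
    """Watch the error rate during an experiment and abort if it is too high.

    Returns "aborted" if the threshold was exceeded, otherwise "completed".
    The fault is always stopped before returning, whatever the outcome.
    """
    try:
        for _ in range(checks):
            if read_error_rate() > max_error_rate:
                return "aborted"
            sleep(interval_s)
        return "completed"
    finally:
        stop_fault()
```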
8. Rollback Plan
The steps taken to restore the system to steady state if an experiment goes wrong or the abort condition is triggered.
Usage: “The rollback plan was simple: remove the latency rule from the network policy and verify recovery within 60 seconds.”
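One way to make sure the rollback always runs is to tie it to the fault itself, so that the fault cannot be applied without its cleanup. The sketch below uses a Python context manager for this; `injected_fault`, `apply`, and `rollback` are hypothetical names chosen for illustration.

```python
from contextlib import contextmanager


@contextmanager
def injected_fault(apply, rollback):
    """Apply a fault for the duration of a `with` block, then roll it back.

    `rollback` runs even if the code inside the block raises an exception.
    """
    apply()
    try:
        yield
    finally:
        rollback()
```

With this shape, the experiment body sits inside `with injected_fault(add_latency_rule, remove_latency_rule):`, where both callables are placeholders for whatever your tooling provides.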
9. Turbulence
A general term for the disruptive conditions introduced during a chaos experiment. Some chaos engineering platforms use this term specifically.
10. Failure Mode
A specific way in which a system or component can fail. Chaos engineering systematically explores different failure modes.
Common failure modes in distributed systems:
- Service unavailability (service is down)
- Slow response (latency spike)
- Partial failure (some instances fail)
- Data corruption
- Network partition (nodes cannot reach each other, which can lead to a split-brain scenario)
11. Mean Time to Recovery (MTTR)
The average time it takes a system to recover after a failure, measured across multiple incidents or experiment runs. Chaos experiments often measure recovery time after an injected fault to assess the effectiveness of automatic recovery mechanisms.
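As a worked example, MTTR is the arithmetic mean of the individual recovery durations. The sketch below assumes each incident is recorded as a pair of timestamps in seconds; the function name and the data shape are illustrative.

```python
from statistics import mean


def mttr_seconds(incidents):
    """Mean time to recovery, given (failed_at, recovered_at) pairs in seconds."""
    return mean(recovered_at - failed_at for failed_at, recovered_at in incidents)
```

For two incidents that took 30 and 90 seconds to recover, the MTTR is 60 seconds.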
12. Graceful Degradation
The ability of a system to continue providing reduced but acceptable service when some components fail, rather than failing completely.
Usage: “When we killed the recommendation service, the homepage degraded gracefully — users saw a static product list rather than an error page.”
13. Fallback
A secondary mechanism activated when the primary fails. Examples: a cached response, a static default, a simpler alternative service.
Usage: “The search service has a fallback to cached results — when we injected a 100% error rate into the search backend, the fallback activated within 2 seconds.”
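A cached-results fallback of the kind described in this usage example might look like the sketch below. The function name, the plain dict used as a cache, and the returned source label are assumptions made for illustration.

```python
def search_with_fallback(query, backend, cache):
    """Try the live backend first; fall back to cached results on failure.

    `backend` is any callable taking a query; `cache` is a plain dict here.
    Returns a (results, source) tuple so callers can see which path was used.
    """
    try:
        results = backend(query)
        cache[query] = results           # refresh the cache on success
        return results, "live"
    except Exception:                    # broad on purpose: any backend failure
        if query in cache:
            return cache[query], "cached"
        return [], "empty"               # last resort: a static default
```

Returning the source alongside the results makes it easy to measure during an experiment how often the fallback was actually used.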
Infrastructure and Network Terms
14. Chaos Monkey
The original chaos engineering tool created by Netflix. It randomly terminates virtual machine instances in production to ensure services can tolerate instance failure. The name has become a generic term for any tool that randomly kills infrastructure.
15. Network Partitioning
Simulating a scenario where parts of a distributed system cannot communicate with each other — a common real-world failure mode in cloud environments.
Usage: “We simulated a network partition between the primary database and its replicas to test our automatic failover procedure.”
16. Latency Injection
Adding artificial delay to network calls or service responses to test how systems behave under slow dependencies.
Usage: “We injected 2 seconds of latency into the payment provider calls to verify that our timeout and circuit breaker settings were correct.”
17. CPU Stress / Memory Pressure
Artificially consuming CPU cycles or memory on a host to simulate resource exhaustion and observe how the system responds — does it degrade, fail, or recover?
18. DNS Failure
Simulating DNS resolution failures to test whether services fail gracefully when they cannot resolve the address of a dependency.
Resilience Patterns (Vocabulary You Will Discuss During Experiments)
19. Circuit Breaker
A pattern that monitors calls to a dependency and “opens” (stops allowing calls) when failures exceed a threshold, preventing cascading failures. Named after the electrical device.
Usage: “The circuit breaker for the recommendation service opened after 50% of calls failed, but the experiment showed that our 5-second timeout was too long: callers waited the full 5 seconds on each failing request before the breaker tripped.”
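To make the open and closed states concrete, here is a minimal circuit breaker sketch in Python. It is a simplification under stated assumptions: it opens after a number of consecutive failures rather than a failure percentage, and the class name, parameters, and `CircuitOpenError` are invented for illustration rather than taken from a specific library.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; call skipped")
            # Reset period elapsed: let one trial call through ("half-open").
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open if a trial call failed.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0                # success closes the circuit again
        self.opened_at = None
        return result
```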
20. Retry with Exponential Backoff
A pattern where failed requests are retried after progressively longer delays, reducing load during partial outages.
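The delay schedule is easiest to see in code. In the sketch below, the wait doubles after each failed attempt up to a cap; the function name, default values, and injectable `sleep` are illustrative assumptions.

```python
import time


def retry_with_backoff(func, max_attempts=5, base_delay_s=0.1,
                       max_delay_s=5.0, sleep=time.sleep):
    """Call `func`, retrying on failure with exponentially growing delays.

    Delays are base, 2*base, 4*base, ... capped at `max_delay_s`.
    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            sleep(min(max_delay_s, base_delay_s * 2 ** attempt))
```

Production implementations commonly add random jitter to each delay so that many clients do not retry at the same moment; that is omitted here to keep the schedule easy to read.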
21. Bulkhead
A pattern that isolates components so that failure in one does not exhaust resources in another. Named after the watertight compartments in a ship.
Usage: “The bulkhead pattern prevented the slow search service from consuming all thread pool capacity — the checkout flow remained unaffected.”
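A simple way to sketch a bulkhead is a cap on concurrent calls to one dependency, so that a slow service can occupy only its own slots. The example below uses a semaphore for this; the `Bulkhead` class and `BulkheadFullError` are hypothetical names, and real systems often use separate thread or connection pools instead.

```python
import threading


class BulkheadFullError(Exception):
    """Raised when the compartment has no free slots."""


class Bulkhead:
    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func, *args, **kwargs):
        # Reject immediately instead of queueing, so callers fail fast
        # rather than piling up behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError("no capacity left for this dependency")
        try:
            return func(*args, **kwargs)
        finally:
            self._slots.release()
```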
22. Timeout
The maximum time a service will wait for a response before giving up and returning an error. Chaos experiments frequently reveal incorrectly configured timeouts.
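The sketch below shows one way to enforce a timeout around an arbitrary call in Python using the standard library. `call_with_timeout` and `slow_dependency` are hypothetical names; `slow_dependency` stands in for a dependency with injected latency.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def slow_dependency(delay_s):
    """Hypothetical stand-in for a dependency that responds slowly."""
    time.sleep(delay_s)
    return "response"


def call_with_timeout(func, timeout_s, *args):
    """Run `func(*args)` and raise FutureTimeout if it exceeds `timeout_s`.

    Note: the worker thread is not killed on timeout; the caller simply
    stops waiting for it.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(func, *args)
    try:
        return future.result(timeout=timeout_s)
    finally:
        pool.shutdown(wait=False)
```

In a latency-injection experiment, the interesting question is whether `timeout_s` is short enough that callers give up before their own callers time out.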
23. Cascading Failure
A scenario where the failure of one component causes dependent components to fail in a chain reaction. Chaos engineering is specifically designed to discover and prevent cascading failures.
Observability Terms (Used During and After Experiments)
24. Observability
The ability to understand the internal state of a system from its external outputs — metrics, logs, and traces. Without observability, chaos experiments cannot be evaluated.
25. Canary Analysis
Comparing the behaviour of a small subset of traffic (the “canary”) against the baseline during an experiment. Allows detection of regressions with minimal user impact.
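In its simplest form, the comparison is a difference in error rates between the canary and the baseline. The sketch below is deliberately naive and assumes a fixed tolerance; the function name and the default threshold are illustrative, and real canary analysis tools typically apply statistical tests across several metrics.

```python
def canary_regressed(baseline_errors, baseline_total,
                     canary_errors, canary_total, tolerance=0.005):
    """Return True if the canary's error rate exceeds the baseline's
    by more than `tolerance` (an absolute difference, e.g. 0.005 = 0.5 points).
    """
    if baseline_total <= 0 or canary_total <= 0:
        raise ValueError("both groups need at least one request")
    baseline_rate = baseline_errors / baseline_total
    canary_rate = canary_errors / canary_total
    return (canary_rate - baseline_rate) > tolerance
```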
26. Error Budget
The amount of downtime or failure that a service level objective (SLO) permits over a given period. A 99.9% availability SLO, for example, leaves an error budget of 0.1%. Chaos experiments should ideally be run when the error budget is healthy.
Usage: “We paused game day activities because only 20% of our error budget for the month remained, and running experiments would have risked breaching the SLO.”
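The arithmetic behind an error budget is short enough to show directly. This sketch assumes a request-based SLO; the function name and parameters are illustrative.

```python
def error_budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent (negative if overspent).

    `slo` is the target success ratio, e.g. 0.999 for 99.9%.
    """
    if not 0 < slo < 1:
        raise ValueError("slo must be between 0 and 1, exclusive")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo) * total_requests
    return 1 - failed_requests / allowed_failures
```

With a 99.9% SLO and 1,000,000 requests in the period, 1,000 failures are allowed; after 800 failures, about 20% of the budget remains.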
27. Post-Experiment Review
The structured debrief after a chaos experiment — what was the hypothesis, what happened, what was learned, and what changes will be made. Similar in structure to a blameless post-mortem.
28. Resilience Score
A quantitative measure of how well a system or service handled a given failure scenario. Not a standard metric, but used informally by many teams to track improvement over time.
Key Takeaways
- Steady state is the baseline — define it in measurable terms before any experiment.
- A hypothesis is required: chaos engineering is science, not random destruction.
- Blast radius must be explicitly constrained — start small, in staging, with abort conditions.
- Game days bring the team together to practice failure response as a collaborative exercise.
- Core resilience vocabulary — circuit breaker, bulkhead, fallback, graceful degradation — describes the mechanisms chaos experiments are designed to test.
- Always plan the rollback before running the experiment, not after something goes wrong.
Chaos engineering makes your systems more reliable by making failure a first-class concern. Knowing this vocabulary lets you lead those conversations and write the reports that drive real improvements.