Chaos Engineering English: Vocabulary for GameDays and Resilience Testing
Learn the English vocabulary chaos engineers use — steady-state hypothesis, blast radius, fault injection, GameDays — with example sentences and writing patterns.
Chaos engineering is the practice of deliberately introducing failures into a system to discover weaknesses before they cause real incidents. Teams that practise chaos engineering have a specific vocabulary — and writing a clear experiment design document requires knowing it. This guide covers the core terms and communication patterns used in chaos engineering.
Core Chaos Engineering Vocabulary
| Term | Definition |
|---|---|
| Steady-state hypothesis | A description of the system’s normal behaviour that you expect to hold true during the experiment |
| Blast radius | The scope of potential impact if the experiment causes unexpected harm |
| Fault injection | Deliberately introducing failures (latency, errors, resource exhaustion) into a system |
| GameDay | A scheduled event where a team runs chaos experiments together, often in production |
| Abort condition | A pre-defined trigger that stops the experiment if things go wrong unexpectedly |
| Hypothesis | Your prediction of how the system will behave under the introduced fault |
| Observability | The ability to understand the internal state of a system from its external outputs |
| Resilience | The ability of a system to absorb failures and continue to function |
| Fallback | A backup behaviour that activates when the primary path fails |
The Steady-State Hypothesis
The steady-state hypothesis is the most important concept in chaos engineering. Before running any experiment, you must define what “normal” looks like — so you can tell whether the experiment has caused a problem.
Format: The steady-state hypothesis should be measurable and observable.
Good examples:
- “The checkout API returns HTTP 200 for 99.9% of requests with a p99 latency below 200 ms.”
- “The recommendation service returns a non-empty list for at least 95% of requests.”
Weak examples:
- “The system works normally.” (not measurable)
- “Users can complete purchases.” (not directly observable without instrumentation)
Describing Blast Radius
The blast radius determines how cautious your experiment needs to be. Narrow the blast radius before expanding.
| Blast radius | Meaning |
|---|---|
| Single instance | Only one service instance is affected |
| Availability zone | An entire data centre zone is simulated as failed |
| Production traffic | Real user traffic is affected |
| Read-only traffic | Only read operations are affected, writes are protected |
Phrases for blast radius discussions:
- “We’ll start with a blast radius of a single instance and expand to an AZ once we’re confident in the steady-state hypothesis.”
- “The blast radius must be limited to internal traffic for the first run to protect paying customers.”
Writing an Experiment Design Document
A chaos experiment design document typically includes:
- Objective — Why are you running this experiment?
- Steady-state hypothesis — What does normal look like?
- Fault to be injected — What failure will you introduce?
- Blast radius — What is the maximum potential scope of impact?
- Abort conditions — When will you stop the experiment?
- Observations — What metrics will you monitor?
- Expected outcome — What do you predict will happen?
- Actual outcome — What actually happened? (Filled in after the experiment)
Sample experiment objective: “Validate that the payment service degrades gracefully when the downstream fraud-detection service becomes unavailable, by verifying that transactions fall back to a safe-hold queue rather than failing with a 500 error.”
GameDay Vocabulary
A GameDay is a collaborative exercise where a team runs chaos experiments together. The language used before, during, and after a GameDay follows specific patterns.
Pre-GameDay:
- “We’ll be conducting a GameDay on Thursday — the target is the recommendation pipeline.”
- “Please ensure your on-call phone is active and that you’ve reviewed the abort conditions.”
During the GameDay:
- “Fault injection has been initiated. Monitoring dashboards are live.”
- “We’ve hit an abort condition — stopping the experiment now.”
- “The system is recovering as expected — steady-state metrics are returning to baseline.”
Post-GameDay:
- “The hypothesis held — fallback behaviour activated correctly within 3 seconds.”
- “We identified a previously unknown failure mode: the circuit breaker was not configured on the cache layer.”
Example Sentences
- “The steady-state hypothesis for this experiment is that the API gateway maintains a p99 latency below 400 ms and a success rate above 99% under normal load.”
- “We will limit the blast radius to a single availability zone and exclude production checkout traffic for the first iteration of this experiment.”
- “The abort condition is triggered if the error rate on any customer-facing service exceeds 5% for more than 30 seconds.”
- “During the GameDay, fault injection revealed that the retry logic in the notification service was not respecting the circuit breaker state, resulting in a thundering herd during recovery.”
- “The experiment confirmed our hypothesis: when the inventory service is unavailable, the product page degrades gracefully by showing cached stock information rather than returning an error page.”