Chaos Engineering English: Vocabulary for GameDays and Resilience Testing

Learn the English vocabulary chaos engineers use — steady-state hypothesis, blast radius, fault injection, GameDays — with example sentences and writing patterns.

Chaos engineering is the practice of deliberately introducing failures into a system to discover weaknesses before they cause real incidents. Teams that practise chaos engineering have a specific vocabulary — and writing a clear experiment design document requires knowing it. This guide covers the core terms and communication patterns used in chaos engineering.

Core Chaos Engineering Vocabulary

TermDefinition
Steady-state hypothesisA description of the system’s normal behaviour that you expect to hold true during the experiment
Blast radiusThe scope of potential impact if the experiment causes unexpected harm
Fault injectionDeliberately introducing failures (latency, errors, resource exhaustion) into a system
GameDayA scheduled event where a team runs chaos experiments together, often in production
Abort conditionA pre-defined trigger that stops the experiment if things go wrong unexpectedly
HypothesisYour prediction of how the system will behave under the introduced fault
ObservabilityThe ability to understand the internal state of a system from its external outputs
ResilienceThe ability of a system to absorb failures and continue to function
FallbackA backup behaviour that activates when the primary path fails

The Steady-State Hypothesis

The steady-state hypothesis is the most important concept in chaos engineering. Before running any experiment, you must define what “normal” looks like — so you can tell whether the experiment has caused a problem.

Format: The steady-state hypothesis should be measurable and observable.

Good examples:

  • “The checkout API returns HTTP 200 for 99.9% of requests with a p99 latency below 200 ms.”
  • “The recommendation service returns a non-empty list for at least 95% of requests.”

Weak examples:

  • “The system works normally.” (not measurable)
  • “Users can complete purchases.” (not directly observable without instrumentation)

Describing Blast Radius

The blast radius determines how cautious your experiment needs to be. Narrow the blast radius before expanding.

Blast radiusMeaning
Single instanceOnly one service instance is affected
Availability zoneAn entire data centre zone is simulated as failed
Production trafficReal user traffic is affected
Read-only trafficOnly read operations are affected, writes are protected

Phrases for blast radius discussions:

  • “We’ll start with a blast radius of a single instance and expand to an AZ once we’re confident in the steady-state hypothesis.”
  • “The blast radius must be limited to internal traffic for the first run to protect paying customers.”

Writing an Experiment Design Document

A chaos experiment design document typically includes:

  1. Objective — Why are you running this experiment?
  2. Steady-state hypothesis — What does normal look like?
  3. Fault to be injected — What failure will you introduce?
  4. Blast radius — What is the maximum potential scope of impact?
  5. Abort conditions — When will you stop the experiment?
  6. Observations — What metrics will you monitor?
  7. Expected outcome — What do you predict will happen?
  8. Actual outcome — What actually happened? (Filled in after the experiment)

Sample experiment objective: “Validate that the payment service degrades gracefully when the downstream fraud-detection service becomes unavailable, by verifying that transactions fall back to a safe-hold queue rather than failing with a 500 error.”

GameDay Vocabulary

A GameDay is a collaborative exercise where a team runs chaos experiments together. The language used before, during, and after a GameDay follows specific patterns.

Pre-GameDay:

  • “We’ll be conducting a GameDay on Thursday — the target is the recommendation pipeline.”
  • “Please ensure your on-call phone is active and that you’ve reviewed the abort conditions.”

During the GameDay:

  • “Fault injection has been initiated. Monitoring dashboards are live.”
  • “We’ve hit an abort condition — stopping the experiment now.”
  • “The system is recovering as expected — steady-state metrics are returning to baseline.”

Post-GameDay:

  • “The hypothesis held — fallback behaviour activated correctly within 3 seconds.”
  • “We identified a previously unknown failure mode: the circuit breaker was not configured on the cache layer.”

Example Sentences

  1. “The steady-state hypothesis for this experiment is that the API gateway maintains a p99 latency below 400 ms and a success rate above 99% under normal load.”
  2. “We will limit the blast radius to a single availability zone and exclude production checkout traffic for the first iteration of this experiment.”
  3. “The abort condition is triggered if the error rate on any customer-facing service exceeds 5% for more than 30 seconds.”
  4. “During the GameDay, fault injection revealed that the retry logic in the notification service was not respecting the circuit breaker state, resulting in a thundering herd during recovery.”
  5. “The experiment confirmed our hypothesis: when the inventory service is unavailable, the product page degrades gracefully by showing cached stock information rather than returning an error page.”