English for Chaos Engineers: Vocabulary for Resilience Testing

Learn the vocabulary chaos engineers use in standups, GameDays, and post-mortems — from fault injection and blast radius to Chaos Monkey, Gremlin, and MTTR measurement.

Chaos engineering is the discipline of deliberately breaking systems to discover weaknesses before they manifest as outages. It is practised at companies such as Netflix, Amazon, and Google, and increasingly at other engineering organisations that are serious about reliability. The vocabulary of chaos engineering is precise and intentional: using it correctly signals that you understand not just the tools but the underlying philosophy.

The Core Concepts

Chaos engineering is defined as the discipline of experimenting on a system in order to build confidence in its ability to withstand turbulent conditions in production. The key word is “experiment” — chaos engineering is not random destruction. It is a scientific process with hypotheses, controls, and measured outcomes.

The process begins with defining the steady state — the normal, measurable behaviour of your system under typical load. This might be p99 latency under 200ms, or an error rate below 0.1%, or a specific throughput level. Without a defined steady state, you cannot measure whether an experiment caused a meaningful change.
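
As a rough illustration, a steady-state check can be reduced to a function that compares observed metrics with agreed limits. The sketch below is a minimal example, assuming you already have per-request latencies and an error count for one observation window. The function names and the nearest-rank percentile method are illustrative choices, not part of any chaos tool, and the default thresholds simply reuse the figures above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def in_steady_state(latencies_ms, error_count,
                    p99_limit_ms=200.0, error_rate_limit=0.001):
    """Return True if p99 latency and error rate are both within limits.

    Defaults mirror the examples in the text: p99 under 200 ms and an
    error rate below 0.1%. Both limits are assumptions to adjust.
    """
    total = len(latencies_ms)
    if total == 0:
        return False  # no traffic observed, so steady state cannot be confirmed
    p99 = percentile(latencies_ms, 99)
    error_rate = error_count / total
    return p99 < p99_limit_ms and error_rate < error_rate_limit
```

Running the same check before, during, and after an experiment gives you a consistent way to say whether the system left its steady state.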

A chaos experiment is a single, controlled test of a specific failure mode. It has four components: a hypothesis (“if we kill one pod in the auth service, the login flow will remain available via the remaining replicas”), the method (how you will inject the failure), the blast radius (which users, services, or regions will be affected), and the rollback plan (how you will restore normal conditions if something goes wrong unexpectedly).
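
One way to keep these four components together is to record each experiment as a small data structure. This is only a sketch: the class name, field names, and the `missing_fields` helper are hypothetical, chosen to match the components listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChaosExperiment:
    """The four components of a chaos experiment (names are illustrative)."""
    hypothesis: str
    method: str
    blast_radius: str
    rollback_plan: str

    def missing_fields(self):
        """Return the names of components that were left blank."""
        return [name for name, value in vars(self).items() if not value.strip()]


experiment = ChaosExperiment(
    hypothesis="If we kill one pod in the auth service, the login flow "
               "will remain available via the remaining replicas.",
    method="Terminate one auth pod",
    blast_radius="One pod in the staging cluster",
    rollback_plan="",  # deliberately left empty to show the check below
)
```

Here `experiment.missing_fields()` returns `["rollback_plan"]`, which is a cheap reminder that an experiment without a rollback plan is not ready to run.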

Blast radius is one of the most important terms in chaos engineering. It is the scope of impact an experiment is allowed to have, and it is set deliberately. Engineers say: “We’re limiting the blast radius to 5% of traffic in one region before we expand the experiment scope.”
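
To make "5% of traffic in one region" concrete, the sketch below shows one common way to select a stable slice of users: hash an identifier into buckets and include only the lowest buckets. The function name, parameters, and the choice of SHA-256 are assumptions for illustration. Real platforms provide their own targeting controls.

```python
import hashlib


def in_blast_radius(user_id, region, target_region, percent):
    """Decide whether a request falls inside the experiment's blast radius.

    Only requests in `target_region` are eligible, and of those roughly
    `percent` percent of users are selected. Hashing the user ID keeps the
    decision stable, so the same user is always in or always out.
    """
    if region != target_region:
        return False
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
    return bucket < percent * 100
```

Expanding the experiment scope then means raising `percent` or adding regions, rather than changing the fault itself.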

Fault Injection Techniques

Fault injection is the act of introducing failures into a system. Common types:

  • Latency injection — adding artificial delay to network calls to simulate slow dependencies (“We’re injecting 500ms of latency on the database connection to see how the API behaves.”)
  • Network partition test — cutting network connectivity between components to simulate split-brain scenarios
  • Pod killing (in Kubernetes) — randomly terminating pods to verify that the deployment handles restarts gracefully
  • Resource exhaustion — consuming CPU, memory, or disk to simulate contention
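
Latency injection, the first technique in the list above, is the easiest to sketch in code. The example below is a minimal in-process version, assuming you can wrap the call you want to slow down. The decorator name, the `enabled` switch, and `query_database` are hypothetical, and dedicated tools usually inject delay at the network level instead.

```python
import functools
import time


def inject_latency(delay_seconds, enabled=lambda: True):
    """Decorator that sleeps before calling the wrapped function.

    `enabled` is a callable checked on every call, so the fault can be
    switched off quickly if the experiment has to be rolled back.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled():
                time.sleep(delay_seconds)  # the injected fault
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(0.5)  # 500 ms, as in the example sentence above
def query_database(user_id):
    # Hypothetical stand-in for a real database call.
    return {"user_id": user_id}
```

Keeping the fault behind a switch is a design choice worth copying: the rollback plan becomes "turn the flag off" rather than "redeploy".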

Tools include Chaos Monkey (Netflix’s original tool, which randomly terminates instances), Gremlin (a commercial platform with a wide range of attack types and a safety-first design), and LitmusChaos (a CNCF project for Kubernetes-native chaos experiments).

GameDays and Hypothesis-Based Testing

A GameDay is a scheduled chaos exercise where the engineering team runs a set of experiments together, often involving SREs, developers, and on-call responders. GameDays are used to test incident response processes, not just technical resilience. You might say: “The GameDay last quarter revealed that our runbooks were outdated — the team couldn’t recover the service within the SLO window.”

Hypothesis-based testing means every chaos experiment starts with a written hypothesis about expected system behaviour. This is what distinguishes chaos engineering from simply breaking things. For example: “Our hypothesis is that disabling one availability zone will not push the user-facing error rate above 1%, because our load balancer is configured for multi-AZ failover.”

Failure mode refers to the specific way a component can fail — timeouts, crashes, corrupted responses, resource exhaustion. Good chaos experiments enumerate the failure modes relevant to each dependency.

Redundancy testing validates that your redundancy mechanisms — replicas, fallback services, circuit breakers — actually work as designed under real failure conditions.
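
A circuit breaker with a fallback is a typical target for redundancy testing. The sketch below is a deliberately simplified breaker, assuming a threshold of consecutive failures and no automatic reset or half-open state. The class and method names are illustrative and do not come from any particular library.

```python
class CircuitBreaker:
    """Opens after `failure_threshold` consecutive failures.

    Simplified for illustration: once open it stays open, whereas
    production breakers normally retry the primary after a cool-down.
    """

    def __init__(self, failure_threshold=3):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0

    @property
    def is_open(self):
        return self.consecutive_failures >= self.failure_threshold

    def call(self, primary, fallback):
        """Call `primary`, or `fallback` if the breaker is open or primary fails."""
        if self.is_open:
            return fallback()  # protect the failing dependency from more load
        try:
            result = primary()
        except Exception:
            self.consecutive_failures += 1
            return fallback()
        self.consecutive_failures = 0  # any success resets the count
        return result
```

A redundancy test against this breaker would inject a primary that always fails and then verify two things: callers still get the fallback result, and the primary stops being called once the threshold is reached.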

Measurement and Post-Mortems

Resilience is the system’s ability to absorb failures and recover to its steady state. Chaos engineering measures resilience quantitatively. MTTR (Mean Time to Recovery) is the average time it takes the system to return to normal after a failure — chaos experiments provide a controlled environment to measure and improve this metric.
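
The MTTR calculation itself is a plain average of recovery durations. The sketch below assumes each experiment run is recorded as a pair of timestamps, the moment the failure was injected and the moment the steady state was restored. The function name and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average recovery duration over (failed_at, recovered_at) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum(
        (recovered_at - failed_at for failed_at, recovered_at in incidents),
        timedelta(),
    )
    return total / len(incidents)


# Illustrative timestamps from three hypothetical experiment runs.
runs = [
    (datetime(2024, 1, 10, 9, 0), datetime(2024, 1, 10, 9, 4)),    # 4 minutes
    (datetime(2024, 1, 17, 9, 0), datetime(2024, 1, 17, 9, 6)),    # 6 minutes
    (datetime(2024, 1, 24, 9, 0), datetime(2024, 1, 24, 9, 8)),    # 8 minutes
]
mttr = mean_time_to_recovery(runs)  # timedelta of 6 minutes
```

Because chaos experiments give you exact start times, the measurements tend to be cleaner than those reconstructed from real incidents.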

In standups, chaos engineers speak concisely: “The latency injection experiment showed the retry logic is causing a thundering herd — we’re adding jitter to the backoff.” In post-mortems, they write: “The experiment revealed that our circuit breaker threshold was set too high — it did not trip until 80% of requests were failing, by which point the cascading failure had already propagated.”
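
The "jitter" in that standup quote refers to randomising retry delays so that clients do not all retry at the same instant. The sketch below shows one widely used variant, often called full jitter, where each delay is drawn uniformly between zero and an exponentially growing ceiling. The function name and default values are assumptions for illustration.

```python
import random


def backoff_with_jitter(attempt, base_seconds=0.1, cap_seconds=10.0, rng=random):
    """Delay in seconds before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap_seconds`, and the
    actual delay is a random value below that ceiling, so retries from
    many clients spread out instead of arriving together.
    """
    ceiling = min(cap_seconds, base_seconds * (2 ** attempt))
    return rng.uniform(0, ceiling)
```

Without the random component every client would wait exactly the same time after a failure, which is what produces the thundering herd in the first place.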

Next Steps

If you have not run a chaos experiment before, start with a small blast radius: inject 200ms of latency on one internal API call in a staging environment and observe the behaviour of the dependent service. Write the experiment up, in English, as a one-page document with hypothesis, method, blast radius, rollback plan, and expected outcome. The discipline of writing it forces clarity of thought.