Advanced 12 terms

Chaos Engineering

Vocabulary for designing and running chaos experiments to verify system resilience.


Steady State /ˈstedi steɪt/

The normal, measurable behaviour of a system when it is functioning correctly, defined by specific metrics (e.g. p99 latency, error rate, throughput). Chaos experiments check that the system holds this state during a disruption, or returns to it once the disruption ends.

"Before injecting any faults, we defined our steady state: error rate < 0.1%, p99 latency < 250ms. After the experiment, we confirmed the system returned to these values within 90 seconds of fault removal."
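
A steady-state definition like this can be written down as an executable check. The sketch below is illustrative only: the `SteadyState` class, the function names, and the nearest-rank percentile method are assumptions, and only the two thresholds come from the example above.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyState:
    """Hypothetical container for the thresholds quoted in the example."""
    max_error_rate: float = 0.001  # error rate < 0.1%
    max_p99_ms: float = 250.0      # p99 latency < 250 ms


def p99(latencies_ms):
    """Nearest-rank 99th percentile of a non-empty list of latencies."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = max(1, math.ceil(99 * len(ordered) / 100))
    return ordered[rank - 1]


def in_steady_state(latencies_ms, errors, total, spec=SteadyState()):
    """True when both the error rate and the p99 latency are within the thresholds."""
    if total <= 0:
        raise ValueError("total must be positive")
    error_rate = errors / total
    return error_rate < spec.max_error_rate and p99(latencies_ms) < spec.max_p99_ms
```

Running the same check before fault injection and again after fault removal gives a yes/no answer to whether the system has returned to steady state.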

Chaos Hypothesis /ˈkeɪɒs haɪˈpɒθɪsɪs/

A falsifiable statement predicting that the system will maintain its steady state when a specific adverse condition is introduced. Forms the scientific basis of a chaos experiment.

"Our chaos hypothesis: 'If one of three availability zones loses network connectivity, the load balancer will route traffic to the remaining zones and the error rate will stay below 1%.' The experiment proved the hypothesis false — a misconfigured health check caused 15% of requests to fail."
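
What makes a hypothesis falsifiable is that it names a measurable threshold in advance. A minimal sketch, assuming a hypothetical `Hypothesis` record and `evaluate` helper (neither belongs to any particular chaos tool):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    condition: str         # the adverse condition being introduced
    max_error_rate: float  # the steady-state bound we predict will hold


def evaluate(hypothesis, failed_requests, total_requests):
    """Compare the observed error rate with the predicted bound."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    observed = failed_requests / total_requests
    return {
        "condition": hypothesis.condition,
        "observed_error_rate": observed,
        "holds": observed < hypothesis.max_error_rate,
    }


az_loss = Hypothesis(
    condition="one of three availability zones loses network connectivity",
    max_error_rate=0.01,  # "error rate will stay below 1%"
)
```

With the figures from the example, `evaluate(az_loss, 150, 1000)` reports an observed error rate of 15% and `holds` as false, which is the useful outcome: the experiment found a weakness.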

Blast Radius /blɑːst ˈreɪdiəs/

The scope of potential harm from a chaos experiment that goes wrong, or from a production failure, measured in affected users, services, or data. Minimising blast radius is essential when running experiments in production.

"We start every new chaos experiment with a blast radius of one instance in one availability zone, serving 2% of traffic. Only after repeated successes do we expand to larger blast radii."
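
One way to hold an experiment to a fixed share of traffic is to bucket requests by a stable hash, so the same request id always lands on the same side of the line. This is a sketch under assumptions: `in_blast_radius`, `next_stage`, and the 10% and 25% stages are hypothetical, and only the 2% starting point comes from the example above.

```python
import hashlib

# Assumed expansion steps, in percent of traffic.
STAGES_PERCENT = [2.0, 10.0, 25.0]


def in_blast_radius(request_id: str, percent: float) -> bool:
    """Deterministically place roughly `percent`% of request ids in the experiment."""
    digest = hashlib.sha256(request_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 10_000  # 0..9999
    return bucket < percent * 100


def next_stage(current_percent: float, consecutive_successes: int, required: int = 3) -> float:
    """Expand to the next stage only after `required` successful runs in a row."""
    if consecutive_successes < required:
        return current_percent
    larger = [p for p in STAGES_PERCENT if p > current_percent]
    return larger[0] if larger else current_percent
```

Hashing rather than random sampling keeps the affected population stable across a run, which makes it easier to say exactly which users were exposed.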

Fault Injection /fɔːlt ɪnˈdʒekʃən/

Deliberately introducing failures — such as latency, packet loss, CPU pressure, disk errors, or process crashes — into a running system to observe how it responds.

"We used fault injection to simulate 500ms of added latency on calls from the checkout service to the inventory service — the circuit breaker opened as expected, returning cached stock data instead of failing."
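
At application level, latency injection can be as small as a wrapper around the outbound call. The decorator below is a hypothetical sketch, not the API of any fault-injection tool; `inject_latency`, its parameters, and `get_stock` are invented for illustration.

```python
import functools
import random
import time


def inject_latency(delay_s, probability=1.0, enabled=lambda: True,
                   sleep=time.sleep, rng=random.random):
    """Delay calls to the wrapped function by `delay_s` seconds.

    `enabled` is a kill switch checked on every call; `sleep` and `rng`
    are injectable so the wrapper can be exercised without real waiting.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if enabled() and rng() < probability:
                sleep(delay_s)  # the injected fault: added latency
            return func(*args, **kwargs)
        return wrapper
    return decorator


@inject_latency(delay_s=0.5, enabled=lambda: False)  # switched off by default
def get_stock(sku):
    """Stand-in for the checkout-to-inventory call in the example."""
    return {"sku": sku, "in_stock": 3}
```

The `enabled` switch matters as much as the fault itself: an experiment should be possible to stop immediately.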

Game Day /ɡeɪm deɪ/

A planned, structured exercise where a team deliberately runs failure scenarios against their systems — often with the broader team observing — to test resilience, incident response, and runbooks simultaneously.

"The game day simulated a primary database failure during peak traffic hours. We discovered the automated failover worked, but the app's connection pool didn't drain properly — 30 seconds of errors before recovery. We fixed it before it happened in production."

Circuit Breaker /ˈsɜːkɪt ˈbreɪkər/

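A resilience pattern that monitors calls to a downstream service and, after a threshold of failures, 'opens' the circuit: requests fail fast instead of waiting for timeouts, which gives the downstream service time to recover. After a cool-down period, the breaker typically lets a trial request through (the 'half-open' state) and closes again if it succeeds.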

"The circuit breaker opened after 5 consecutive failures calling the recommendations service — subsequent requests received a cached fallback response in 2ms instead of timing out after 30 seconds each."
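
The state machine behind this behaviour is small. The class below is a minimal single-threaded sketch, assuming a threshold of 5 consecutive failures (from the example) and a 30-second cool-down (an assumption); the half-open trial is the common variant of the pattern, and production libraries add locking, metrics, and per-exception policies.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self._clock = clock
        self._failures = 0      # consecutive failures seen so far
        self._opened_at = None  # None means the circuit is closed

    @property
    def state(self):
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout_s:
            return "half-open"  # cool-down elapsed: allow a trial call
        return "open"

    def call(self, func, fallback):
        """Run `func`, or return `fallback()` without calling it while open."""
        if self.state == "open":
            return fallback()  # fast-fail: the downstream is not touched
        try:
            result = func()
        except Exception:
            self._failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            return fallback()
        self._failures = 0  # any success resets the count and closes the circuit
        self._opened_at = None
        return result
```

While the circuit is open, `call` returns the fallback without invoking `func` at all, which is where the difference between a 2 ms cached response and a 30-second timeout comes from.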

Graceful Degradation /ˈɡreɪsfʊl ˌdeɡrəˈdeɪʃən/

A design principle where a system continues to provide core functionality with reduced features or quality when some components fail, rather than failing completely.

"When the personalisation service was down, the homepage gracefully degraded to showing trending content instead of personalised recommendations — users still had a functional experience, just not a tailored one."
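
In code, graceful degradation usually means separating what the page cannot live without from what it can. A sketch, assuming a hypothetical `render_page` helper in which each optional section carries its own degraded variant:

```python
def render_page(core, optional):
    """Build a page from one required part and several optional sections.

    `core` is a callable whose failure is allowed to propagate, because
    without it there is no page. `optional` maps a section name to a
    (primary, degraded) pair of callables.
    """
    page = {"core": core(), "degraded": []}
    for name, (primary, degraded) in optional.items():
        try:
            page[name] = primary()
        except Exception:
            page[name] = degraded()        # reduced quality, still functional
            page["degraded"].append(name)  # record it so the degradation is visible
    return page
```

Recording which sections were degraded keeps the reduced experience observable, so a quiet fallback does not hide an outage.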

Fallback /ˈfɔːlbæk/

An alternative response or behaviour that a system uses when the primary path fails — such as returning cached data, a default value, or a simplified response.

"The pricing service fallback returns the last known price from a local cache if the pricing API is unreachable — orders can still be placed, and prices are reconciled in a background job."
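
A last-known-value fallback of this kind might look like the sketch below. `PricingClient` and the injected `fetch_price` callable are hypothetical; the returned flag marks a price as stale so that a background job can reconcile it later.

```python
class PricingClient:
    def __init__(self, fetch_price):
        self._fetch_price = fetch_price  # callable that queries the pricing API
        self._last_known = {}            # local cache: sku -> last good price

    def get_price(self, sku):
        """Return (price, is_stale); raise if the API is down and nothing is cached."""
        try:
            price = self._fetch_price(sku)
        except (ConnectionError, TimeoutError):
            if sku not in self._last_known:
                raise  # no fallback available for this sku
            return self._last_known[sku], True
        self._last_known[sku] = price  # refresh the cache on every success
        return price, False
```

Returning the staleness flag instead of hiding it lets the caller decide whether a stale value is acceptable for the operation at hand.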

Bulkhead Pattern /ˈbʊlkhæd ˈpætən/

A resilience pattern that isolates components of a system into separate resource pools (thread pools, connection pools) so that a failure or overload in one does not exhaust resources needed by others.

"After the reporting service exhausted the shared database connection pool and took down the entire API, we applied the bulkhead pattern: separate connection pools per service tier. Reporting can now saturate its pool without affecting transactional queries."
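
The isolation can be illustrated with one bounded semaphore per tier standing in for a dedicated connection pool. This is a sketch, not a pool implementation: the `Bulkhead` class, the tier names, and the limits are assumptions.

```python
import threading
from contextlib import contextmanager


class BulkheadFullError(Exception):
    """Raised when a compartment has no free slots."""


class Bulkhead:
    def __init__(self, name, max_concurrent):
        self.name = name
        self._slots = threading.BoundedSemaphore(max_concurrent)

    @contextmanager
    def acquire(self):
        # Reject immediately rather than queue, so overload stays visible.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFullError(f"{self.name} bulkhead is full")
        try:
            yield
        finally:
            self._slots.release()


# Separate compartments per tier: reporting cannot consume transactional slots.
reporting = Bulkhead("reporting", max_concurrent=2)
transactional = Bulkhead("transactional", max_concurrent=8)
```

A caller wraps each query in `with reporting.acquire():` or `with transactional.acquire():`; when reporting is saturated, only reporting callers see `BulkheadFullError`.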

Monkey Testing /ˈmʌŋki ˈtestɪŋ/

Testing by feeding random, unexpected inputs or actions into a system to discover edge cases and failure modes. The name comes from the image of a monkey randomly pressing buttons; the term is now also applied to injecting random failures into infrastructure.

"We run monkey testing on our message queue consumers by randomly killing consumer processes — this revealed that duplicate message handling was incomplete, causing occasional double-charges."
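
The failure mode in this example can be reproduced in a few lines. The simulation below is a toy model, not a real queue client: a consumer is "killed" at random after charging but before acknowledging, so the message is redelivered. All names and the in-memory de-duplication set are assumptions; a real consumer would persist processed ids.

```python
import random
from collections import deque


def simulate(order_ids, kill_probability, idempotent, seed=0):
    """Return {order_id: times_charged} after random consumer kills."""
    if not 0 <= kill_probability < 1:
        raise ValueError("kill_probability must be in [0, 1)")
    rng = random.Random(seed)  # seeded so a failing run can be replayed
    queue = deque(order_ids)
    charges = {}
    processed = set()
    while queue:
        order_id = queue[0]  # peek: the message stays queued until acknowledged
        if not (idempotent and order_id in processed):
            charges[order_id] = charges.get(order_id, 0) + 1
            processed.add(order_id)
        if rng.random() < kill_probability:
            continue  # consumer killed before the ack, so the message is redelivered
        queue.popleft()  # ack
    return charges
```

With `idempotent=False`, any kill between charge and ack produces a double charge; with `idempotent=True`, every order is charged exactly once however often the consumer dies.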

Chaos Monkey /ˈkeɪɒs ˈmʌŋki/

Netflix's original chaos engineering tool, which randomly terminates virtual machine instances in production to ensure that the system tolerates instance failure. It inspired the broader Simian Army suite of tools and helped establish chaos engineering as a discipline.

"Netflix's Chaos Monkey runs routinely in production, during working hours when engineers are on hand to respond. The philosophy: if your system can't survive random instance termination on an ordinary day, it will definitely fail when a real incident happens at the worst possible time."

Resilience /rɪˈzɪliəns/

A system's ability to absorb disruptions, adapt to adverse conditions, and recover to its steady state. It is often measured by time to detect (TTD), time to mitigate (TTM), mean time to recovery (MTTR), and the blast radius of failures.

"Chaos engineering improved our system's resilience measurably: over one quarter, mean time to recovery fell from 12 minutes to 3 minutes once we had fixed the issues our experiments exposed."
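
These measures are simple averages over incident timestamps. A sketch, assuming a hypothetical `Incident` record with times in seconds and with every interval measured from the start of the incident (definitions vary between teams):

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Incident:
    started_s: float    # when the fault began
    detected_s: float   # when monitoring or a person noticed it
    mitigated_s: float  # when user impact was stopped
    recovered_s: float  # when the system was back in steady state


def summarise(incidents):
    """Mean time to detect, mitigate, and recover, in seconds."""
    if not incidents:
        raise ValueError("need at least one incident")
    return {
        "mean_ttd_s": mean(i.detected_s - i.started_s for i in incidents),
        "mean_ttm_s": mean(i.mitigated_s - i.started_s for i in incidents),
        "mttr_s": mean(i.recovered_s - i.started_s for i in incidents),
    }
```

Computing the same summary before and after a round of fixes is what allows a claim such as "recovery time fell from 12 minutes to 3 minutes" to be checked.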
