Chaos engineering is the practice of running thoughtful, controlled experiments on a system to build confidence in its resilience. Rather than waiting for a 3 a.m. incident to reveal that a service has no fallback, you deliberately introduce a failure (kill a node, add latency, drop a dependency) and observe whether the system degrades gracefully. The practice was pioneered at Netflix with Chaos Monkey. The goal is not chaos for its own sake but turning unknown failure modes into known, tested ones.
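As a rough illustration of "drop a dependency and observe whether the system degrades gracefully", here is a minimal Python sketch. Every name in it (fetch_recommendations, with_fault, the fallback list) is hypothetical and invented for the example. Real fault injection usually happens at the infrastructure or network layer rather than inside a function wrapper.

```python
import random


class DependencyError(Exception):
    """Raised when the (simulated) downstream dependency fails."""


def fetch_recommendations(user_id):
    # Hypothetical downstream call; stands in for a real network request.
    return [f"item-{user_id}-{i}" for i in range(3)]


def with_fault(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls raise DependencyError."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise DependencyError("injected failure")
        return func(*args, **kwargs)
    return wrapper


def recommendations_with_fallback(fetch, user_id):
    """Graceful degradation: serve a generic list if the dependency fails."""
    try:
        return fetch(user_id)
    except DependencyError:
        return ["popular-1", "popular-2", "popular-3"]


# Inject a failure on every call and confirm the caller still gets an answer.
always_failing = with_fault(fetch_recommendations, 1.0, random.Random(0))
result = recommendations_with_fallback(always_failing, user_id=7)
```

If recommendations_with_fallback had no except branch, the injected failure would propagate to the user, which is exactly the kind of missing fallback an experiment is meant to surface.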
What is a "steady-state hypothesis" in a chaos experiment?
Every well-formed chaos experiment starts by defining the steady state: a measurable output that indicates the system is healthy, for example "orders complete at ≥99% success and p99 latency <300 ms." This becomes the hypothesis: "When we inject failure X, the steady state will hold." You then inject the failure and check whether the metrics stay within bounds. If they do, you have evidence of resilience; if they do not, you have found a real weakness to fix. Defining steady state in business or user terms (not just infrastructure metrics such as CPU utilization) keeps experiments meaningful.
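The example steady state above (success rate at or above 99%, p99 latency under 300 ms) can be written as an executable check. This is a minimal sketch: the SteadyState class, the holds function, and the nearest-rank percentile method are assumptions chosen for illustration, and in practice the measurements would come from your monitoring system.

```python
import math
from dataclasses import dataclass


@dataclass
class SteadyState:
    min_success_rate: float    # e.g. 0.99
    max_p99_latency_ms: float  # e.g. 300


def p99(latencies_ms):
    """99th percentile using the nearest-rank method."""
    if not latencies_ms:
        raise ValueError("no latency samples")
    ordered = sorted(latencies_ms)
    rank = math.ceil(99 * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def holds(steady, outcomes, latencies_ms):
    """Return True if the observed window satisfies the steady state.

    outcomes: list of booleans, True for each successful order.
    latencies_ms: list of request latencies in milliseconds.
    """
    if not outcomes:
        raise ValueError("no outcome samples")
    success_rate = sum(outcomes) / len(outcomes)
    return (success_rate >= steady.min_success_rate
            and p99(latencies_ms) < steady.max_p99_latency_ms)


steady = SteadyState(min_success_rate=0.99, max_p99_latency_ms=300)
# Measurements collected while the fault was active (made-up sample data).
during_fault_ok = holds(steady, [True] * 995 + [False] * 5, [100] * 990 + [250] * 10)
```

Running the same check before the injection as well is a useful sanity step: if the steady state does not hold at baseline, the experiment cannot tell you anything.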
What does "blast radius" mean in chaos engineering?
The blast radius is the extent of potential damage from an experiment. A core principle of responsible chaos engineering is to minimize blast radius: start with the smallest possible scope (one instance, 1% of traffic, a staging environment) and expand only as confidence grows. This limits the harm if the system fails the experiment. Combined with an abort condition (a "big red button" to stop instantly if real users are harmed), controlling the blast radius is what separates disciplined chaos engineering from reckless breakage.
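The "start small, expand as confidence grows, abort on harm" pattern can be sketched as a loop over increasing scopes. The names run_staged_experiment and run_stage, the traffic fractions, and the error-rate threshold are all hypothetical. A real setup would poll the abort condition continuously during each stage rather than only once at the end of it.

```python
def run_staged_experiment(stages, run_stage, max_error_rate):
    """Expand the blast radius stage by stage, aborting on user harm.

    stages: increasing fractions of traffic to expose, e.g. [0.01, 0.05, 0.25].
    run_stage: hypothetical callable that injects the fault at the given
        fraction and returns the observed user-facing error rate.
    max_error_rate: abort condition; exceeding it stops the experiment.
    """
    completed = []
    for fraction in stages:
        error_rate = run_stage(fraction)
        if error_rate > max_error_rate:
            # The "big red button": stop and do not widen the scope further.
            return {"aborted_at": fraction, "completed": completed}
        completed.append(fraction)
    return {"aborted_at": None, "completed": completed}


def fake_run_stage(fraction):
    # Stand-in for a real injection: pretend the system copes with small
    # scopes but starts failing users once 25% of traffic is affected.
    return 0.001 if fraction < 0.25 else 0.05


outcome = run_staged_experiment([0.01, 0.05, 0.25], fake_run_stage, max_error_rate=0.01)
```

Here the experiment completes the 1% and 5% stages and aborts at 25%, so the weakness is found while most users were never exposed to it.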
What is a "GameDay" in resilience practice?
A GameDay is a planned event where engineers run failure scenarios — often in production or a production-like environment — to validate resilience and practice incident response. Beyond testing the system, GameDays test the humans and processes: Do the runbooks work? Do alerts fire? Does the on-call know what to do? Are dashboards useful under stress? Teams often run them like fire drills, sometimes with a "master of disaster" injecting surprises. The outcome is a list of action items: missing alerts, broken fallbacks, unclear runbooks.
Why is it valuable to run chaos experiments in production rather than only in staging?
Staging is a useful starting point, but it is almost never a faithful replica of production: traffic volume, data distribution, third-party integrations, and autoscaling behavior all differ, and configurations drift apart over time. Many resilience bugs only manifest at production scale or with production data. Mature chaos engineering therefore extends into production, carefully, with a small blast radius and abort conditions. The risk is managed, not eliminated: you accept a small, controlled risk now to avoid a large, uncontrolled outage later. This is the same logic as a vaccine.