Practice the vocabulary of deliberately testing a system's resilience to failure.
1 / 5
At standup, a dev mentions deliberately injecting a failure, like killing a service instance, into a production-like environment to verify the system recovers gracefully. What practice is this?
Chaos engineering deliberately injects a controlled failure, such as killing a service instance, into a production-like environment to verify that the broader system recovers gracefully instead of cascading into a larger outage. This surfaces weaknesses in the system's resilience proactively, before an unplanned incident exposes them. The practice is deliberate and planned, which distinguishes it from an accidental outage.
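The idea of killing an instance and checking that the system keeps serving can be sketched in a few lines. This is a toy in-memory simulation, not a real chaos tool: `ServicePool`, `kill`, `handle_request`, and `run_kill_experiment` are hypothetical names invented for illustration.

```python
import random


class ServicePool:
    """Toy stand-in for a load-balanced service (hypothetical, in-memory only)."""

    def __init__(self, instance_names):
        self.healthy = {name: True for name in instance_names}

    def kill(self, name):
        # The injected failure: mark one instance as down.
        self.healthy[name] = False

    def handle_request(self):
        # Route to any healthy instance; fail only if none are left.
        live = [n for n, ok in self.healthy.items() if ok]
        if not live:
            raise RuntimeError("no healthy instances")
        return random.choice(live)


def run_kill_experiment(pool, victim, requests=100):
    """Kill one instance, then return the fraction of requests still served."""
    pool.kill(victim)
    served = 0
    for _ in range(requests):
        try:
            pool.handle_request()
            served += 1
        except RuntimeError:
            pass
    return served / requests
```

With three instances, killing one should leave the success rate unchanged; with a single instance, the same injection takes the service down, which is the kind of weakness the experiment is meant to surface.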
2 / 5
During a design review, the team wants to define upfront what normal, healthy system behavior looks like, so a chaos experiment can clearly detect a deviation from it. Which practice supports this?
Establishing a steady-state hypothesis before running an experiment defines what normal, healthy system behavior looks like, typically in terms of measurable outputs such as error rate or latency, so a deviation caused by the injected failure can be clearly detected and measured. Without a defined baseline, it is hard to tell whether an observed effect was caused by the injected failure or is just normal variation. This upfront hypothesis is a foundational step in a well-designed chaos experiment.
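A steady-state hypothesis can be written down as explicit thresholds and checked both before and during the injection. The sketch below assumes two illustrative metrics (error rate and p99 latency); the names `SteadyState` and `evaluate` and the threshold values in the usage note are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SteadyState:
    """Thresholds that define 'normal' for this experiment (assumed metrics)."""
    max_error_rate: float
    max_p99_latency_ms: float

    def holds(self, error_rate, p99_latency_ms):
        return (error_rate <= self.max_error_rate
                and p99_latency_ms <= self.max_p99_latency_ms)


def evaluate(hypothesis, before, during):
    """Compare metrics sampled before and during the injected failure."""
    # If the baseline is already violated, the experiment cannot attribute
    # any effect to the injected failure.
    if not hypothesis.holds(**before):
        return "invalid: system was not in steady state before injection"
    if hypothesis.holds(**during):
        return "hypothesis held"
    return "deviation detected"
```

For example, with `SteadyState(max_error_rate=0.01, max_p99_latency_ms=300)`, a baseline sample of `{"error_rate": 0.002, "p99_latency_ms": 180}` followed by `{"error_rate": 0.08, "p99_latency_ms": 950}` during the injection yields "deviation detected".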
3 / 5
In a code review, a dev notices the chaos experiment is scoped to a small percentage of traffic in a controlled environment, with an automated mechanism to immediately stop the experiment if impact grows too large. What does this represent?
A blast radius limit scopes an experiment to a small, controlled portion of traffic and pairs that scope with an automated mechanism that halts the experiment immediately if its impact grows larger than expected. Running an experiment against all production traffic with no scoping risks turning a controlled learning exercise into an actual widespread outage. Scoping and automated abort are what make chaos engineering safe enough to run responsibly in a real production environment.
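The two safeguards described here, traffic scoping and automated abort, can be sketched as follows. This is a minimal illustration under stated assumptions: requests carry an integer ID, impact is observed as a per-interval error rate, and the names `in_blast_radius` and `run_with_abort` are hypothetical.

```python
def in_blast_radius(request_id, percent):
    """Deterministically select `percent` out of every 100 request IDs."""
    return request_id % 100 < percent


def run_with_abort(error_rates, abort_threshold):
    """Step through observed per-interval error rates and stop as soon as
    one exceeds the abort threshold.

    Returns (aborted, intervals_run).
    """
    for interval, rate in enumerate(error_rates, start=1):
        if rate > abort_threshold:
            # Automated abort: halt the experiment at the first breach.
            return True, interval
    return False, len(error_rates)
```

In a real setup the abort branch would also roll back the injected failure; here it only reports when the experiment would have been stopped.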
4 / 5
An incident report shows a chaos experiment was run without an automated abort mechanism, and an injected failure cascaded into a broader outage that took far longer to recover from than intended. What practice would prevent this?
Defining a scoped blast radius and an automated abort mechanism before running an experiment against a live system ensures that an injected failure that starts cascading unexpectedly is stopped quickly rather than spiraling into a larger outage. Assuming a failure will always stay naturally contained ignores exactly the kind of cascading behavior chaos engineering exists to uncover. This safeguard is what separates a responsibly run chaos experiment from a reckless one.
5 / 5
During a PR review, a teammate asks why the team deliberately injects failures into a production-like environment instead of relying solely on unit and integration tests to verify system resilience. What is the reasoning?
Unit and integration tests typically run in isolation and can't fully capture how a complete, real system with all its actual dependencies behaves under a genuine failure condition. Deliberately injecting a failure into a production-like environment reveals real cascading effects and recovery behavior that those isolated tests can't reach. The tradeoff is the added operational risk, and the care required, to run this kind of experiment safely against a production-like system.