5 exercises — choose the best-structured answer to Chaos Engineering interview questions. Focus on steady-state hypotheses, blast radius, game days, and cascading failure testing.
What separates good from great chaos engineering answers
Be falsifiable: hypotheses need exact numbers — latency thresholds, percentiles, time windows
Slow is worse than dead: latency injection reveals more than kill injection
Human behaviour matters: game days test people and process, not just systems
Monitoring blind spots: an undetectable failure is more dangerous than an unpreventable one
1 / 5
The interviewer asks: "What is a steady-state hypothesis and why must you define one before running a chaos experiment?" Which answer is the most precise?
Option B is the strongest: gives a concrete, fully-formed example of a hypothesis with exact numbers (p99 latency, percentile threshold, window size), explains its dual purpose (baseline measurement AND experiment guard), and adds the operationally crucial point about CI automation requiring binary pass/fail. This last insight — that a vague hypothesis can't be automated — shows production experience. Option A describes the concept correctly but incompletely. Option C re-phrases the question as the answer. Option D is accurate but reads like a definition without the nuance of why the precision of the hypothesis matters.
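The point about binary pass/fail can be made concrete with a small sketch. Everything here is hypothetical illustration rather than a real tool's API: the class name `SteadyStateHypothesis`, its fields, and the nearest-rank percentile method are all assumptions. The idea is only that a hypothesis stated with exact numbers reduces to a function returning True or False.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SteadyStateHypothesis:
    """Hypothetical shape: 'p<percentile> latency stays under <threshold_ms> over the window'."""
    percentile: float    # e.g. 99.0
    threshold_ms: float  # e.g. 300.0
    window_s: int        # e.g. 300; the caller supplies samples from this window

    def holds(self, latencies_ms: list[float]) -> bool:
        """Return True only if the observed percentile is under the threshold."""
        if not latencies_ms:
            return False  # assumption: no data counts as a failure, not a pass
        ordered = sorted(latencies_ms)
        # Nearest-rank percentile: smallest sample with at least p% of samples at or below it.
        rank = math.ceil(self.percentile * len(ordered) / 100.0)
        return ordered[max(rank, 1) - 1] < self.threshold_ms


hypothesis = SteadyStateHypothesis(percentile=99.0, threshold_ms=300.0, window_s=300)
healthy = [100.0] * 99 + [400.0]       # one slow outlier in 100 samples
degraded = [100.0] * 98 + [400.0] * 2  # two slow samples push p99 over the limit
```

Measured before the experiment, `holds` gives the baseline; measured during it, the same call acts as the guard. A pipeline step could fail the build whenever it returns False, which is not possible with a hypothesis phrased as "the system should stay healthy".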
2 / 5
The interviewer asks: "How do you scope a chaos experiment to limit blast radius?" Choose the most operationally mature answer.
Option B is the strongest: it structures blast radius scoping across four explicit dimensions (environment, user scope, magnitude, abort conditions), gives a concrete example for each (canary segment, feature flags, single AZ injection, automated kill switches), and adds the critical counter-intuitive point that staging-only chaos gives false confidence. Option A is the most common naive answer and is explicitly contradicted by Option B. Option C names tools (Chaos Mesh, Gremlin) but reduces blast radius scoping to a tool configuration problem. Option D describes a gradual increase but misses the systematic four-dimension framework and the abort condition mechanism. Four-dimension frameworks are memorable and show structured thinking.
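As a rough sketch of the four dimensions written down as data, the snippet below models a scope with an automated abort check. The names `BlastRadius` and `should_abort`, the field choices, and the single error-rate abort condition are assumptions made for illustration; they are not taken from Chaos Mesh, Gremlin, or any other tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BlastRadius:
    """Hypothetical record of the four scoping dimensions."""
    environment: str       # e.g. "production"
    user_fraction: float   # share of traffic exposed, greater than 0 and at most 1
    magnitude: str         # e.g. "one AZ, 200 ms added latency"
    max_error_rate: float  # abort condition: stop if the error rate exceeds this

    def __post_init__(self) -> None:
        # Reject scopes that expose no users or more than all of them.
        if not 0.0 < self.user_fraction <= 1.0:
            raise ValueError("user_fraction must be in (0, 1]")


def should_abort(scope: BlastRadius, observed_error_rate: float) -> bool:
    """Kill-switch check: True means the experiment should be stopped now."""
    return observed_error_rate > scope.max_error_rate


scope = BlastRadius(
    environment="production",
    user_fraction=0.01,
    magnitude="one AZ, 200 ms added latency",
    max_error_rate=0.02,
)
```

In practice the abort check would be polled against live metrics for the duration of the experiment, so the stop decision does not depend on someone watching a dashboard.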
3 / 5
The interviewer asks: "What is a game day and how do you run one effectively?" Which answer demonstrates real facilitation experience?
Option B is the strongest: introduces a three-phase structure (design → execution → retrospective), explains scenario selection criteria (plausible + high impact + never exercised), adds the critical non-obvious insight that a good facilitator observes human behaviour not just dashboards, provides a three-category retrospective taxonomy (system / process / knowledge gaps), and ends with the most memorable principle: a game day without written action items with owners produces learning but no improvement. Option A describes the concept but has no methodology. Option C is accurate but procedural — a checklist not a framework. Option D names Chaos Monkey but does not explain facilitation methodology.
4 / 5
The interviewer asks: "How do you test for cascading failures in a microservices architecture?" Choose the strongest answer.
Option B is the strongest: it introduces dependency graph mapping as the prerequisite, makes the key distinction between latency injection and kill injection (slow dependencies are more dangerous than dead ones), explains the mechanism precisely (thread pools and connection queues are exhausted before circuit breakers trip), names specific tools (Toxiproxy, Resilience4j), lists three specific validations (circuit breaker threshold, timeout enforcement, graceful degradation), and ends with the most sophisticated insight: retry storms act as a cascade amplifier. Option A is too simplistic. Option C names the correct tools but lacks the analytical depth. Option D mentions Istio fault injection correctly but misses the latency versus kill distinction and the retry storm risk.
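The arithmetic behind "slow is worse than dead" can be sketched with Little's law (average requests in flight = arrival rate x time each request spends in the system). The function names, the pool size, and the traffic figures below are illustrative assumptions, and the retry calculation is a worst-case bound that assumes every hop retries independently.

```python
def threads_in_use(requests_per_s: float, latency_s: float) -> float:
    """Little's law: average concurrent requests = arrival rate x time in system."""
    return requests_per_s * latency_s


def pool_exhausted(requests_per_s: float, latency_s: float, pool_size: int) -> bool:
    """True when the average demand for threads exceeds the caller's pool."""
    return threads_in_use(requests_per_s, latency_s) > pool_size


def retry_amplification(attempts_per_hop: int, hops: int) -> int:
    """Worst-case calls reaching the deepest service if every hop makes this many attempts."""
    return attempts_per_hop ** hops


# Assumed numbers: 200 requests/s into a caller with a 100-thread pool.
normal = pool_exhausted(200, 0.25, pool_size=100)  # 50 threads busy
slowed = pool_exhausted(200, 2.0, pool_size=100)   # 400 threads needed
storm = retry_amplification(attempts_per_hop=3, hops=3)
```

A killed dependency usually fails fast, so threads are released almost immediately; a dependency that answers in two seconds holds every thread for two seconds, which is why the caller runs out of capacity without any single request looking like an error. The retry figure shows how three attempts per hop across three hops can multiply load on the deepest service by up to 27.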
5 / 5
The interviewer asks: "How do you decide which failure scenarios are worth experimenting on?" Which answer shows the most strategic prioritisation?
Option B is the strongest: names the two axes of the risk matrix explicitly, gives concrete sources for likelihood estimation (incident history, architecture risk review, cloud provider failure patterns like AZ outages), adds the non-obvious insight about low-likelihood, high-blast-radius scenarios (most dangerous because never exercised), and most distinctively adds the monitoring blind spot dimension — a failure you cannot detect is more dangerous than one you cannot prevent. The closing point about producing a ranked backlog rather than an ad hoc list shows process maturity. Option A is correct but relies on informal intuition. Option C describes the same risk matrix but without the monitoring blind spot insight. Option D is purely reactive (incidents drive prioritisation) and misses proactive architecture risk analysis.
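One way to turn the risk matrix into a ranked backlog is sketched below. The 1 to 5 scales, the `BLIND_SPOT_WEIGHT` multiplier, and the example scenarios are all assumptions chosen for illustration; the only idea taken from the explanation above is that likelihood, impact, and detectability together determine the ordering.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    likelihood: int   # 1 (rare) to 5 (frequent), e.g. from incident history
    impact: int       # 1 (minor) to 5 (severe), i.e. blast radius
    detectable: bool  # would current monitoring alert on this failure?


# Assumed weighting: an undetectable failure counts double.
BLIND_SPOT_WEIGHT = 2


def score(scenario: Scenario) -> int:
    """Likelihood x impact, boosted when monitoring would not catch the failure."""
    base = scenario.likelihood * scenario.impact
    return base if scenario.detectable else base * BLIND_SPOT_WEIGHT


def ranked_backlog(scenarios: list[Scenario]) -> list[Scenario]:
    """Highest-scoring scenarios first."""
    return sorted(scenarios, key=score, reverse=True)


backlog = ranked_backlog([
    Scenario("pod-kill", likelihood=4, impact=2, detectable=True),    # score 8
    Scenario("slow-db", likelihood=3, impact=3, detectable=True),     # score 9
    Scenario("az-outage", likelihood=1, impact=5, detectable=False),  # score 10
])
```

With these assumed inputs the rare but undetectable AZ outage ranks first, ahead of failures that happen more often but would be noticed. A plain likelihood x impact product would have placed it last.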