Tip 4: Frame experiments scientifically: hypothesis → inject failure → observe → compare to steady state → conclude
1 / 5
The interviewer asks: "What is a steady-state hypothesis in chaos engineering and why is it the starting point for every experiment?" Which answer best demonstrates chaos engineering methodology?
Option B defines the concept precisely (measurable metrics with thresholds), gives a concrete example, and explains four reasons why it is foundational, including the automated abort condition, which is the most production-critical aspect. Key structure: measurable normal behaviour (specific metric + threshold + time window) → reasons: deviation detection requires a baseline; missing observability is exposed; the result is a binary pass/fail; the hypothesis doubles as an automated abort condition → without it, an experiment is uncontrolled breakage, not science. Option A is vague (a "guess"). Option C describes runbook documentation, not a measurable hypothesis. Option D confuses the hypothesis with the expected outcome.
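A steady-state hypothesis of this shape (metric + threshold + time window) can be written down as data and checked mechanically. The following is a minimal Python sketch, not a reference implementation: the class name, the `error_rate` metric, and the 1% threshold are illustrative assumptions, and real samples would come from your monitoring system.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SteadyStateHypothesis:
    """Measurable normal behaviour: metric + threshold + time window."""
    metric: str          # hypothetical metric name, e.g. "error_rate"
    max_value: float     # the metric must stay at or below this value
    window_samples: int  # number of most-recent samples forming the window

    def holds(self, samples: Sequence[float]) -> bool:
        """Binary pass/fail: every sample in the window respects the threshold."""
        window = samples[-self.window_samples:]
        if len(window) < self.window_samples:
            # Too little data is itself a finding: observability is missing.
            raise ValueError(
                f"need {self.window_samples} samples of {self.metric}"
            )
        return all(value <= self.max_value for value in window)


def should_abort(hypothesis: SteadyStateHypothesis,
                 samples: Sequence[float]) -> bool:
    """Automated abort condition: stop once steady state is violated."""
    return not hypothesis.holds(samples)


# Illustrative values only (assumed, not taken from a real system).
hypothesis = SteadyStateHypothesis("error_rate", max_value=0.01, window_samples=3)
print(hypothesis.holds([0.002, 0.004, 0.003]))        # baseline before injection
print(should_abort(hypothesis, [0.003, 0.02, 0.05]))  # samples during injection
```

The same object serves three of the four purposes named above: it is the baseline, it gives a binary result, and it is the abort condition. The `ValueError` path illustrates the fourth, since a hypothesis that cannot be evaluated reveals a monitoring gap.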
2 / 5
The interviewer asks: "What is blast radius in chaos engineering and how do you minimise it when running experiments in production?" Which answer best demonstrates safe chaos engineering practice?
Option B defines blast radius as a scope concept (not just server count), provides six concrete minimisation strategies, and frames it as a risk management parameter. Key structure: blast radius = maximum impact scope → minimise: one instance/1% traffic → time-boxed with auto-rollback → feature-flag/header traffic isolation → automated chaos platform rollback → off-peak timing → graduated escalation (staging → 1% → 10% → full) → treat it as risk management, not just size. Option A is accurate but shallow. Option C reduces blast radius to server count. Option D focuses on the approval process, not technical minimisation.
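Graduated escalation with a time box and automatic rollback can be sketched as a small loop. This is a hedged illustration in Python: the stage labels, the `inject`, `rollback`, and `is_healthy` callbacks, and the use of a fixed number of health checks as the time box are all assumptions, not the interface of any particular chaos platform.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Stage = Tuple[str, float]  # (label, fraction of traffic exposed)

# Hypothetical plan mirroring staging -> 1% -> 10% -> full.
DEFAULT_STAGES: List[Stage] = [
    ("staging", 1.00),
    ("prod-1pct", 0.01),
    ("prod-10pct", 0.10),
    ("prod-full", 1.00),
]


def run_escalation(
    stages: Sequence[Stage],
    inject: Callable[[str, float], None],
    rollback: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    checks_per_stage: int = 3,
) -> Tuple[List[str], Optional[str]]:
    """Run stages in order; return (stages passed, stage aborted at or None)."""
    passed: List[str] = []
    for name, fraction in stages:
        inject(name, fraction)
        try:
            # Time box: a fixed number of health checks per stage.
            for _ in range(checks_per_stage):
                if not is_healthy(name):
                    return passed, name  # stop escalating immediately
        finally:
            # Rollback always runs, whether the stage passed or aborted.
            rollback(name)
        passed.append(name)
    return passed, None
```

Putting the rollback in `finally` means the blast radius is withdrawn even when a stage aborts or a callback raises, and a failed stage prevents any wider stage from ever being injected.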
3 / 5
The interviewer asks: "Walk me through designing a chaos experiment for a payment service that depends on an external payment gateway." Which answer best demonstrates experiment design skill?
Option B applies all six chaos engineering steps with domain-specific content: a measurable steady-state hypothesis with thresholds, three failure scenario types (outage, latency, error), tooling (Gremlin/Toxiproxy), specific safeguards, and observation metrics. Key structure: steady-state hypothesis (error rate <5%, checkout >85%, 30s recovery) → three failure types: total outage + latency injection + error response → Gremlin/Toxiproxy → 5% traffic + auto-abort on SLO breach → observe: error rate + P99 + circuit breaker state + retry count → conclude. Option A describes ad hoc testing without a hypothesis or safeguards. Option C is staging-only testing (valid, but not production chaos). Option D is a unit test, not a chaos experiment, because it does not test production behaviour.
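The experiment shape described here (three failure types, 5% of traffic exposed, error rate checked against a 5% threshold) can be simulated in-process to show the reasoning. This Python sketch is an assumption-laden toy: the gateway is a stub, the latency figures and the `with_fallback` flag are hypothetical, and it does not use Gremlin or Toxiproxy, which would inject faults at the network level in a real run.

```python
from enum import Enum
from typing import Dict, Union


class Fault(Enum):
    NONE = "none"
    OUTAGE = "outage"    # total gateway outage
    LATENCY = "latency"  # slow responses that exceed the client timeout
    ERROR = "error"      # gateway answers with an error response


class GatewayError(Exception):
    pass


def call_gateway(fault: Fault, timeout_s: float = 1.0) -> float:
    """Stub gateway call: returns latency in seconds or raises GatewayError."""
    if fault is Fault.OUTAGE:
        raise GatewayError("connection refused")
    if fault is Fault.ERROR:
        raise GatewayError("HTTP 503")
    latency = 3.0 if fault is Fault.LATENCY else 0.1  # assumed figures
    if latency > timeout_s:
        raise GatewayError("timeout")
    return latency


def run_experiment(
    fault: Fault,
    with_fallback: bool,
    requests: int = 1000,
    exposed_every: int = 20,  # every 20th request = 5% of traffic
    max_error_pct: int = 5,
) -> Dict[str, Union[str, int, bool]]:
    """Inject `fault` into 5% of requests and test 'error rate < 5%'."""
    errors = 0
    for i in range(requests):
        injected = fault if i % exposed_every == 0 else Fault.NONE
        try:
            call_gateway(injected)
        except GatewayError:
            # Hypothetical fallback: queue the payment instead of failing it.
            if not with_fallback:
                errors += 1
    return {
        "fault": fault.value,
        "errors": errors,
        # Integer comparison avoids floating-point edge cases at the threshold.
        "hypothesis_holds": errors * 100 < max_error_pct * requests,
    }


for fault in (Fault.OUTAGE, Fault.LATENCY, Fault.ERROR):
    print(run_experiment(fault, with_fallback=False))
    print(run_experiment(fault, with_fallback=True))
```

In this toy model, exposing 5% of traffic with no fallback produces an error rate of exactly 5%, which fails a strict "<5%" hypothesis. The service only stays inside its steady state when something absorbs the gateway failure, and that is the kind of conclusion the experiment is meant to produce.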
4 / 5
The interviewer asks: "What is the difference between chaos engineering and traditional testing?" Which answer best demonstrates chaos engineering conceptual clarity?
Option B defines both precisely and draws four specific contrasts. Key structure: traditional testing: known scenarios + deterministic + component scope + assert outcomes + isolation; chaos: unknown failure conditions + emergent interactions + system-level + steady-state hypothesis + production-like environment + discovers unknown unknowns → they complement, not replace. Option A is directionally correct on environment but misses all conceptual differences. Option C incorrectly associates chaos engineering with machine learning. Option D mischaracterises both (traditional testing is often automated; chaos does not replace QA).
5 / 5
The interviewer asks: "What is a game day in chaos engineering and how do you run one effectively?" Which answer best demonstrates operational chaos maturity?
Option B describes a complete game day with pre-game planning, roles, scenario types, live execution discipline, abort criteria, and a structured retrospective with outcome artefacts. Key structure: pre-game: scenarios + roles (injector + observability + IC + safety controller) + dashboard prep → scenarios: known weaknesses + novel failures → live: one failure at a time + real-time documentation → abort criteria + safety controller authority → post-game: timeline review + gap analysis + action items → outcome: prioritised resilience backlog → cadence: quarterly + continuous CI chaos. Option A describes the concept correctly but gives no structure. Option C reduces game day to automated tests without the organisational exercise component. Option D describes a tabletop exercise, which is valuable but differs from a game day because no failure is actually injected.