Design steady state hypotheses, manage blast radius, run game days, use Chaos Mesh in Kubernetes, and distinguish chaos from fault injection testing.
1 / 5
What is a steady state hypothesis in chaos engineering?
Steady state hypothesis: a measurable statement of what normal operation looks like, which you expect to remain true while a fault is injected. Without a defined steady state, you cannot tell whether chaos caused an impact. Define it with observable metrics (Prometheus queries, SLI values) that characterise normal operation. Verify the hypothesis before the experiment (confirming the system is healthy), then verify it again during and after the chaos injection (confirming resilience).
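The check-before, inject, check-again sequence can be sketched in a few lines of Python. This is a minimal illustration, not a real chaos tool: `Probe`, `steady_state_holds`, and `run_experiment` are hypothetical names, and the metric values come from an in-memory dict standing in for Prometheus queries or SLI readings.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Probe:
    """One observable metric plus the range that counts as 'normal' (hypothetical type)."""
    name: str
    read: Callable[[], float]  # returns the current metric value
    low: float
    high: float

    def holds(self) -> bool:
        return self.low <= self.read() <= self.high


def steady_state_holds(probes: List[Probe]) -> bool:
    """The hypothesis holds only if every probe is within its normal range."""
    return all(p.holds() for p in probes)


def run_experiment(probes: List[Probe],
                   inject: Callable[[], None],
                   rollback: Callable[[], None]) -> str:
    # 1. Verify the system is healthy before touching anything.
    if not steady_state_holds(probes):
        return "skipped: system not healthy before injection"
    # 2. Inject the fault, then re-check the same hypothesis.
    inject()
    try:
        if steady_state_holds(probes):
            return "passed: steady state held"
        return "failed: steady state violated"
    finally:
        # 3. Always undo the fault, whatever the outcome.
        rollback()


# Stand-in metrics; a real setup would query a monitoring system instead.
metrics = {"success_rate": 0.999, "p99_latency_ms": 180.0}
probes = [
    Probe("success_rate", lambda: metrics["success_rate"], 0.995, 1.0),
    Probe("p99_latency_ms", lambda: metrics["p99_latency_ms"], 0.0, 300.0),
]


def inject_latency() -> None:
    metrics["p99_latency_ms"] = 450.0  # simulated effect of the fault


def remove_latency() -> None:
    metrics["p99_latency_ms"] = 180.0


print(run_experiment(probes, inject_latency, remove_latency))
```

The thresholds here are illustrative; in practice they would come from the service's own SLOs.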
2 / 5
What is blast radius in chaos engineering and how should it be managed?
Blast radius: the set of hosts, services, and users that an experiment can affect. Keep it as small as the experiment allows. For example, a chaos experiment that kills a random node in production should be scoped to a single availability zone, with automated rollback if the steady state hypothesis fails. Use feature flags or canary deployments to limit the percentage of traffic exposed. Start in staging, then graduate to a small production subset.
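One way to picture blast radius control is as a filter applied before any target is chosen. The sketch below is an assumption-laden illustration: `Node` and `select_targets` are hypothetical names, and the node inventory is hard-coded rather than read from a cloud or cluster API.

```python
import random
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Node:
    name: str
    zone: str


def select_targets(nodes: List[Node], zone: str, max_fraction: float,
                   rng: random.Random) -> List[Node]:
    """Pick random victims, restricted to one zone and capped at a fraction of it."""
    if not 0.0 < max_fraction <= 1.0:
        raise ValueError("max_fraction must be in (0, 1]")
    candidates = [n for n in nodes if n.zone == zone]
    if not candidates:
        return []
    # Cap the number of targets; always allow at least one so the experiment can run.
    limit = max(1, int(len(candidates) * max_fraction))
    return rng.sample(candidates, limit)


# Hypothetical inventory: four nodes in zone-a, two in zone-b.
inventory = [Node(f"a{i}", "zone-a") for i in range(4)] + \
            [Node(f"b{i}", "zone-b") for i in range(2)]

targets = select_targets(inventory, zone="zone-a", max_fraction=0.25,
                         rng=random.Random(0))
print([t.name for t in targets])
```

Passing the random generator in explicitly keeps the selection reproducible, which helps when an experiment needs to be replayed during a debrief.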
3 / 5
What are game days in chaos engineering practice?
Game days: scheduled exercises, pioneered at Amazon, that bring together on-call engineers, platform teams, and observability engineers. A facilitator injects failures (instance termination, network partition, CPU saturation) while teams respond using their standard incident process. The debrief identifies gaps in observability, runbooks, and team readiness.
4 / 5
What does Chaos Mesh provide for Kubernetes chaos engineering?
Chaos Mesh: experiments are defined as Kubernetes custom resources (CRDs), including PodChaos (kill pods), NetworkChaos (partition, latency, packet loss), StressChaos (CPU/memory pressure), and IOChaos (disk I/O errors). Because each experiment is an ordinary Kubernetes object, it can be created from a CI/CD pipeline through the Kubernetes API, enabling automated resilience testing on every deployment.
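As an illustration of what such a resource looks like, the Python sketch below builds a PodChaos manifest as a plain dict and prints it as JSON. The field names follow the `chaos-mesh.org/v1alpha1` schema as commonly documented, but should be checked against the installed Chaos Mesh version; the resource name, namespaces, and labels are hypothetical, and `pod_kill_manifest` is a helper invented for this example.

```python
import json
from typing import Dict


def pod_kill_manifest(name: str, namespace: str, target_namespace: str,
                      labels: Dict[str, str]) -> dict:
    """Build a PodChaos resource that kills one matching pod (hypothetical helper)."""
    return {
        "apiVersion": "chaos-mesh.org/v1alpha1",
        "kind": "PodChaos",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "action": "pod-kill",   # terminate the selected pod
            "mode": "one",          # pick a single pod from the matches
            "selector": {
                "namespaces": [target_namespace],
                "labelSelectors": labels,
            },
        },
    }


# All names below are assumptions for the example, not values from a real cluster.
manifest = pod_kill_manifest(
    name="kill-one-checkout-pod",
    namespace="chaos-testing",
    target_namespace="shop",
    labels={"app": "checkout"},
)
print(json.dumps(manifest, indent=2))
```

Since `kubectl` accepts JSON as well as YAML, a pipeline step could pipe this output into `kubectl apply -f -`. Selecting a single pod by label is also a simple way to keep the blast radius small.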
5 / 5
How does fault injection testing differ from traditional integration testing?
Fault injection vs integration testing: a standard integration test might verify that "the checkout service calls the payment service and records the order." A fault injection test verifies that when the payment service is slow (inject 5s delay), the checkout service returns an appropriate error to the user within its own timeout, logs correctly, and does not corrupt order state.
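The slow-payment scenario can be sketched as a small, self-contained Python test. Everything here is a stand-in: `PaymentService` and `CheckoutService` are hypothetical classes, the injected delay is 1 second rather than 5 to keep the example fast, and the timeout is implemented with a thread pool purely for illustration.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


class PaymentService:
    """Fake dependency whose latency can be injected (hypothetical)."""

    def __init__(self, delay_s: float = 0.0) -> None:
        self.delay_s = delay_s

    def charge(self, order_id: str) -> str:
        time.sleep(self.delay_s)  # the injected fault: added latency
        return "charged"


class CheckoutService:
    """Service under test: calls payment with its own timeout (hypothetical)."""

    def __init__(self, payment: PaymentService, timeout_s: float) -> None:
        self.payment = payment
        self.timeout_s = timeout_s
        self.orders = {}  # order_id -> state
        self._pool = ThreadPoolExecutor(max_workers=1)

    def checkout(self, order_id: str) -> str:
        future = self._pool.submit(self.payment.charge, order_id)
        try:
            future.result(timeout=self.timeout_s)
        except FutureTimeout:
            # Fail fast and leave order state untouched. This sketch does not
            # cancel or reconcile the in-flight charge, which a real system must.
            return "error: payment timed out"
        self.orders[order_id] = "paid"
        return "ok"


# Fault injection test: payment is slow, checkout must fail cleanly and quickly.
checkout = CheckoutService(PaymentService(delay_s=1.0), timeout_s=0.2)
start = time.monotonic()
result = checkout.checkout("order-1")
elapsed = time.monotonic() - start

assert result == "error: payment timed out"
assert "order-1" not in checkout.orders  # no corrupted or half-written order
assert elapsed < 0.9                     # returned within its own timeout, not the full delay
print(result)
```

An ordinary integration test would only exercise the `delay_s=0.0` path; the assertions on elapsed time and order state are what make this a fault injection test.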