Learn the vocabulary of system resilience engineering: fault tolerance, graceful degradation, circuit breaker behavior observed in chaos experiments, mean time between failures (MTBF), and how to discuss these concepts precisely.
1 / 5
An engineer says: 'Our system is fault tolerant to dependency failures.' What does this mean?
Fault tolerance is not immunity to failure; it is the ability to continue operating when failures occur. It requires design choices: retries, circuit breakers, fallbacks, redundancy, and graceful degradation paths. In chaos experiments, fault tolerance is verified by injecting dependency failures and confirming the system continues serving users acceptably. A 'fault tolerant' system has explicit, tested failure paths rather than relying on the hope that dependencies won't fail.
2 / 5
What is the difference between 'fault tolerance' and 'graceful degradation' in resilience vocabulary?
The distinction matters for setting expectations: 'fault tolerant to payment service failure' might mean the system serves all requests normally (using a fallback). 'Gracefully degrades when payment service fails' means the system continues running but the checkout feature returns a user-friendly error rather than crashing the whole application. Graceful degradation is often more realistic and valuable than full fault tolerance for complex systems.
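To make the contrast concrete, here is a minimal Python sketch. Everything in it is hypothetical: `charge_live`, `PaymentServiceDown`, and the deferred-charge queue used as the fallback are illustrative assumptions, not part of any real payment API. The first handler keeps serving the request through a fallback; the second lets the checkout feature fail cleanly without taking anything else down.

```python
class PaymentServiceDown(Exception):
    """Raised by the hypothetical payment client when the dependency fails."""


def charge_live(order_id: str) -> str:
    # Stand-in for a real payment call; it always fails here to simulate an outage.
    raise PaymentServiceDown("payment service unavailable")


def checkout_fault_tolerant(order_id: str, deferred_queue: list) -> dict:
    """Fault tolerance: the request still succeeds through a fallback path."""
    try:
        return {"status": "paid", "ref": charge_live(order_id)}
    except PaymentServiceDown:
        # Assumed fallback: accept the order now and charge it later.
        deferred_queue.append(order_id)
        return {"status": "accepted", "ref": None}


def checkout_degraded(order_id: str) -> dict:
    """Graceful degradation: the feature fails, but cleanly and in isolation."""
    try:
        return {"status": "paid", "ref": charge_live(order_id)}
    except PaymentServiceDown:
        return {
            "status": "unavailable",
            "message": "Checkout is temporarily unavailable. Please try again later.",
        }
```

In both cases the exception is handled on an explicit path; the difference is whether the user still gets the feature or only a friendly error.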
3 / 5
In a chaos experiment report, an engineer writes: 'The circuit breaker opened after 10 consecutive 503 responses from the cache service, preventing cascade failure to the database.' What is being described?
The report describes a circuit breaker in action. The breaker monitors the error rate or the number of consecutive failures from a dependency. When the threshold is reached (here: 10 consecutive 503s), it 'opens': calls to that dependency are rejected immediately instead of waiting for timeouts, which reduces load on the failing service and prevents cascade failures downstream. The chaos experiment verified that this mechanism triggers at the configured threshold under realistic conditions, so the result confirms the experiment's hypothesis.
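The behavior described above can be sketched as a consecutive-failure breaker. This is a simplified illustration, not any particular library's API: the names `CircuitBreaker` and `CircuitOpenError` are hypothetical, and the half-open recovery state that production breakers normally have is deliberately left out.

```python
class CircuitOpenError(Exception):
    """Raised when a call is rejected because the breaker is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold: int = 10):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.is_open = False

    def call(self, func, *args, **kwargs):
        # Open breaker: fail fast without contacting the dependency.
        if self.is_open:
            raise CircuitOpenError("circuit open: call rejected")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.failure_threshold:
                self.is_open = True
            raise
        # Any success resets the consecutive-failure count.
        self.consecutive_failures = 0
        return result
```

A chaos experiment against this sketch would inject failures into the wrapped call and assert that the breaker opens on the tenth consecutive failure and that later calls never reach the dependency.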
4 / 5
A post-mortem states: 'Our MTBF for the search service under Black Friday load is 4 hours.' What does this communicate?
MTBF = total operating time / number of failures. An MTBF of 4 hours under peak load is a reliability planning input: if the Black Friday sale runs for 12 hours, expect approximately 3 failures (12 / 4) on average. This drives decisions: Can we improve MTBF by fixing the root cause (e.g., a memory leak under high traffic)? Or do we accept the MTBF and invest in fast recovery (low MTTR) and transparent failover instead? Chaos experiments can help establish baseline MTBF under specific load conditions.
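The arithmetic above is small enough to write out as a sketch. The function names are illustrative, and `availability` relies on the common steady-state approximation MTBF / (MTBF + MTTR), which treats MTBF as mean uptime between failures; any MTTR value you pass in is your own assumption, since the post-mortem quoted above does not give one.

```python
def mtbf_hours(total_operating_hours: float, failure_count: int) -> float:
    """MTBF = total operating time / number of failures."""
    if failure_count <= 0:
        raise ValueError("MTBF is undefined without at least one observed failure")
    return total_operating_hours / failure_count


def expected_failures(window_hours: float, mtbf: float) -> float:
    """Average number of failures to plan for over a time window."""
    return window_hours / mtbf


def availability(mtbf: float, mttr: float) -> float:
    """Steady-state approximation: fraction of time the service is up."""
    return mtbf / (mtbf + mttr)


print(mtbf_hours(12, 3))         # 4.0
print(expected_failures(12, 4))  # 3.0
```

Comparing `availability` for a longer MTBF against a shorter MTTR is one way to frame the "fix the root cause or invest in fast recovery" decision.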
5 / 5
How do chaos engineering experiments help identify 'single points of failure' (SPOFs)?
SPOF discovery is one of chaos engineering's most valuable outputs: architectural diagrams often assume redundancy that was never verified to work. A chaos experiment that terminates one instance of a 'redundant' component and causes a complete outage reveals that the redundancy was not functioning (configuration error, no health check, wrong load balancer setting). SPOFs discovered in controlled chaos experiments can be fixed before they cause real production incidents.
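A toy simulation shows the shape of such an experiment. All names here (`Instance`, `LoadBalancer`, `run_termination_experiment`) are hypothetical, and the model is deliberately crude: a two-instance round-robin pool where the only misconfiguration is a missing health check. In this model the missing health check produces a 50% error rate rather than a complete outage, but the lesson is the same: the redundancy on the diagram did not protect users.

```python
class Instance:
    def __init__(self, name: str):
        self.name = name
        self.alive = True

    def handle(self, request: str) -> str:
        if not self.alive:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class LoadBalancer:
    def __init__(self, instances: list, health_checks_enabled: bool):
        self.instances = instances
        self.health_checks_enabled = health_checks_enabled
        self._next = 0

    def route(self, request: str) -> str:
        # Without health checks, dead instances stay in the rotation.
        if self.health_checks_enabled:
            pool = [i for i in self.instances if i.alive]
        else:
            pool = self.instances
        if not pool:
            raise ConnectionError("no healthy instances")
        target = pool[self._next % len(pool)]
        self._next += 1
        return target.handle(request)


def run_termination_experiment(health_checks_enabled: bool, requests: int = 10) -> float:
    """Terminate one of two instances and return the observed error rate."""
    instances = [Instance("a"), Instance("b")]
    balancer = LoadBalancer(instances, health_checks_enabled)
    instances[0].alive = False  # injected failure
    failures = 0
    for n in range(requests):
        try:
            balancer.route(f"req-{n}")
        except ConnectionError:
            failures += 1
    return failures / requests
```

The experiment's hypothesis would be "error rate stays at 0 after terminating one instance"; `run_termination_experiment(False)` refutes it, which is exactly the kind of finding that exposes a hidden SPOF.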