Learn the vocabulary of a component whose failure alone can bring down an entire system.
0 / 5 completed
1 / 5
At standup, a dev mentions a single load balancer that every request in the system passes through, with no standby instance configured, so if that one load balancer goes down, the entire system becomes unreachable. What is this kind of component called?
A single point of failure is exactly this: it is a component, like a single load balancer with no standby instance, whose failure alone can bring down the entire system, because nothing else can take over for it when it goes down. A hash collision is an unrelated hash-table concept about two keys sharing a bucket. This one-component-failure-takes-down-everything pattern is exactly why identifying and eliminating single points of failure is central to building a resilient system.
2 / 5
During a design review, the team adds a standby load balancer that automatically takes over if the primary fails, specifically because eliminating this single point of failure means one load balancer going down no longer takes the entire system offline. Which capability does this provide?
Adding a standby load balancer here provides continued availability despite a single component's failure, since a standby load balancer automatically takes over if the primary fails, instead of the whole system going offline the moment that one component fails. Running only the one load balancer with no standby means its failure alone takes the entire system down, with nothing else able to take over. This automatic-takeover-on-failure behavior is exactly why eliminating a single point of failure is the standard fix for a resilient system.
3 / 5
In a code review, a dev notices a system's architecture diagram shows every request passing through exactly one load balancer instance with no standby configured anywhere, meaning that instance's failure alone would make the entire system unreachable. What does this represent?
This is a single point of failure, since this one load balancer's failure alone can bring down the entire system, because nothing else is configured to take over for it. A cache eviction policy is an unrelated concept about discarded cache entries. This one-component-with-no-standby pattern is exactly the kind of resilience gap a reviewer flags once a system needs to stay available despite individual component failures.
4 / 5
An incident report shows the entire system went unreachable for two hours, because its one and only load balancer failed and no standby instance had ever been configured to take over automatically. What practice would prevent this?
Configuring a standby load balancer that automatically takes over if the primary fails eliminates the single point of failure that took the entire system down. Continuing to run the system with exactly one load balancer instance and no standby regardless of how long an outage lasts when that instance eventually fails is exactly what caused the multi-hour outage described in this incident. This add-a-standby approach is the standard fix once a single point of failure is confirmed capable of taking down the entire system.
5 / 5
During a PR review, a teammate asks why the team adds a standby load balancer to eliminate a single point of failure instead of simply setting up better monitoring so the team is alerted the instant the one load balancer goes down. What is the reasoning?
A standby load balancer keeps the system available automatically the moment the primary fails, with no outage window at all, while better monitoring only shortens how long it takes a human to notice and manually recover from an outage that has already started, leaving a real, if shorter, period of unreachability every time the single component fails. This is exactly why eliminating a single point of failure with redundancy is preferred over monitoring alone, which only helps respond to an outage rather than prevent it.