Vocabulary for Circuit Breakers and Resilience Patterns

Learn the essential English vocabulary for discussing circuit breakers, retries, bulkheads, and other resilience patterns in distributed systems.

Resilience patterns are how distributed systems keep one failing dependency from taking down everything connected to it. Each pattern has a specific name for a reason: mixing up “retry” and “circuit breaker” in a design discussion can lead a team to pick the wrong tool for the failure mode they’re actually worried about. This vocabulary is especially useful in architecture reviews and postmortems, where the goal is to name exactly which protective mechanism was missing.

Foundational Concepts

1. Cascading failure

A failure that spreads from one component to the components it depends on or that depend on it, often because none of them was protected against a failing neighbor.

Usage: “The database slowdown became a cascading failure because every service kept retrying failed queries aggressively, which only added more load to an already struggling database.”

2. Circuit breaker

A pattern that stops sending requests to a dependency after its failure rate crosses a threshold, giving it time to recover instead of continuing to send traffic it can’t handle.

Usage: “The circuit breaker tripped open after this dependency’s error rate crossed 50%, which stopped us from adding more load to something already failing.”

3. Half-open state

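The intermediate state a circuit breaker enters once it has been open for a set cooldown period, where it allows a small number of test requests through to check whether the dependency has recovered. If those requests succeed, the breaker closes and normal traffic resumes; if they fail, it opens again.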

Usage: “The breaker moved to half-open and let a handful of requests through — since those succeeded, it closed again and resumed normal traffic.”
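
The closed, open, and half-open states described above can be sketched as a small state machine. This is a minimal illustration, not a production implementation: the class name CircuitBreaker, the parameters failure_threshold, window, and cooldown, and the choice to let a single trial request decide the half-open outcome are all assumptions made for the example.

```python
import time


class CircuitBreaker:
    """Minimal three-state breaker: closed -> open -> half-open -> closed."""

    def __init__(self, failure_threshold=0.5, window=10, cooldown=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold  # trip at or above this failure rate
        self.window = window                        # number of recent calls tracked
        self.cooldown = cooldown                    # seconds to stay open before a trial
        self.clock = clock                          # injectable so tests can fake time
        self.results = []                           # recent outcomes, True means failure
        self.state = "closed"
        self.opened_at = None

    def call(self, func):
        if self.state == "open":
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: request rejected")
            self.state = "half-open"  # cooldown elapsed, allow a trial request

        try:
            result = func()
        except Exception:
            self._record(failure=True)
            raise
        self._record(failure=False)
        return result

    def _record(self, failure):
        if self.state == "half-open":
            # Assumption: one trial decides. Success closes, failure reopens.
            if failure:
                self._trip()
            else:
                self.state = "closed"
                self.results = []
            return
        self.results = (self.results + [failure])[-self.window:]
        rate = sum(self.results) / self.window
        if len(self.results) == self.window and rate >= self.failure_threshold:
            self._trip()

    def _trip(self):
        self.state = "open"
        self.opened_at = self.clock()
```

While the breaker is open, callers get an immediate error instead of waiting on the dependency, which is also what makes it a fail-fast mechanism. This sketch is not thread-safe; a real service would typically guard the state with a lock or use an established library.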

4. Bulkhead (bulkhead isolation)

A pattern that isolates resources (like thread pools or connection pools) per dependency, so that one failing dependency exhausting its resources can’t starve requests to unrelated dependencies.

Usage: “Without bulkhead isolation, this slow third-party API was consuming every available thread in our shared pool, which is why unrelated requests started timing out too.”
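
One common way to build a bulkhead is a per-dependency cap on concurrent calls. The sketch below uses a semaphore for that cap; the Bulkhead class, the dependency names, and the limits are hypothetical values chosen for illustration.

```python
import threading


class Bulkhead:
    """Caps concurrent calls to one dependency; rejects instead of queueing."""

    def __init__(self, max_concurrent):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, func):
        # Non-blocking acquire: if every slot is busy, reject right away.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("bulkhead full: request rejected")
        try:
            return func()
        finally:
            self._slots.release()


# One bulkhead per dependency, so a slow one can only exhaust its own slots.
bulkheads = {
    "third_party_api": Bulkhead(max_concurrent=2),
    "database": Bulkhead(max_concurrent=10),
}
```

With this layout, a hung third-party API can occupy at most two callers at a time, and requests that only need the database are unaffected.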

5. Timeout

An explicit limit on how long a request is allowed to wait for a response before it’s treated as failed, preventing a slow dependency from holding resources indefinitely.

Usage: “This client had no timeout configured at all, so a hung connection to the dependency would tie up a worker thread indefinitely instead of failing fast.”
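
As a rough illustration of an explicit timeout, the sketch below uses Python's standard socket module. The function name fetch_with_timeout and the two-second default are assumptions for the example, not recommendations.

```python
import socket


def fetch_with_timeout(host, port, payload, timeout_seconds=2.0):
    """Send payload and read a reply, failing if any step exceeds the timeout."""
    # The timeout bounds the connect and stays set on the returned socket,
    # so it also applies to each later send and recv.
    with socket.create_connection((host, port), timeout=timeout_seconds) as conn:
        conn.sendall(payload)
        return conn.recv(4096)  # raises socket.timeout if no data arrives in time
```

Note that a socket timeout applies to each blocking operation separately rather than to the request as a whole, so an overall deadline would need to be tracked by the caller.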

Retry and Backoff

6. Retry (retry logic)

The practice of automatically re-attempting a failed operation, useful for transient failures but dangerous if applied carelessly during a sustained outage.

Usage: “Retry logic here has no limit, so during the outage, every client kept retrying indefinitely, which multiplied the load hitting the already-struggling service.”

7. Exponential backoff

A retry strategy where the wait time between attempts increases exponentially with each failure, reducing the load placed on a struggling dependency during sustained failures.

Usage: “We added exponential backoff to the retry logic so repeated failures space out over time instead of hammering the dependency at a constant rate.”
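
To make the growth concrete, here is a small sketch that computes a backoff schedule. The base delay, multiplier, and cap are illustrative assumptions; capping the delay is a common addition rather than part of the definition above.

```python
def backoff_delays(base=0.5, factor=2.0, cap=30.0, max_attempts=6):
    """Return the wait before each retry: base * factor**n, capped at `cap`."""
    return [min(cap, base * factor ** attempt) for attempt in range(max_attempts)]


print(backoff_delays())         # [0.5, 1.0, 2.0, 4.0, 8.0, 16.0]
print(backoff_delays(cap=5.0))  # [0.5, 1.0, 2.0, 4.0, 5.0, 5.0]
```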

8. Jitter

Randomization added to a backoff delay to prevent many clients from retrying in synchronized bursts, which would otherwise recreate the exact load spike the backoff was meant to avoid.

Usage: “Without jitter, all our clients backed off on the exact same schedule and then retried simultaneously, recreating the same traffic spike a moment later.”
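
Bounded retries, exponential backoff, and jitter are usually combined in one helper. The sketch below uses the "full jitter" variant, where each delay is drawn uniformly between zero and the current backoff ceiling; the function name retry_with_jitter and its defaults are assumptions for illustration.

```python
import random
import time


def retry_with_jitter(operation, max_attempts=5, base=0.5, cap=30.0,
                      sleep=time.sleep, rng=random.random):
    """Retry `operation` with capped exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # retry budget exhausted: surface the failure
            ceiling = min(cap, base * 2 ** attempt)
            sleep(rng() * ceiling)  # random delay in [0, ceiling)
```

Because each client draws its own random delay, retries from many clients spread out over the interval instead of arriving together. The sleep and rng parameters exist only so the behavior can be tested without real waiting; in practice the helper would also catch a narrower exception type than Exception.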

9. Retry storm

A pattern where many clients retry a failing dependency at the same time and overwhelm it further, potentially preventing recovery even after the original problem clears up.

Usage: “The retry storm outlasted the actual outage — by the time the database recovered, the backlog of retries alone was enough to keep it overloaded for another ten minutes.”

10. Fail fast

A design principle where a system detects and surfaces a failure quickly rather than waiting or retrying extensively, preserving resources and giving callers an immediate, actionable signal.

Usage: “We’re deliberately choosing to fail fast here rather than retry — this operation isn’t idempotent, so blind retries risk doing more harm than a quick, clear failure.”

Fallbacks and Degradation

11. Graceful degradation

The practice of a system continuing to provide reduced but still useful functionality when a dependency fails, rather than failing completely.

Usage: “When the recommendation service is down, we fall back to showing a generic popular-items list — that’s graceful degradation instead of a broken page.”

12. Fallback

A predefined alternative response or behavior used when a primary operation fails, allowing the overall system to continue functioning in a reduced capacity.

Usage: “We added a fallback that serves the last successfully cached response when the live pricing service times out, rather than showing an error to the user.”
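
A fallback like the cached-response example can be sketched as a thin wrapper around the primary call. The CachedFallback class and the "fresh"/"stale" labels are hypothetical names for this illustration.

```python
class CachedFallback:
    """Wraps a fetch function and serves the last good value when it fails."""

    def __init__(self, fetch):
        self._fetch = fetch
        self._last_good = None
        self._has_value = False

    def get(self):
        try:
            value = self._fetch()
        except Exception:
            if not self._has_value:
                raise  # nothing cached yet, so there is no fallback to serve
            return self._last_good, "stale"
        self._last_good, self._has_value = value, True
        return value, "fresh"
```

Returning a "stale" marker alongside the value lets the caller decide whether to show the data as-is or flag it as possibly out of date, which matters for something like pricing.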

13. Load shedding

The deliberate rejection of some incoming requests during overload, prioritizing the system’s ability to serve remaining requests reliably over trying to serve everything and failing broadly.

Usage: “We’re load shedding low-priority background requests during this spike so the system can keep serving the checkout flow reliably.”
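
One simple way to shed load by priority is to admit low-priority work only while the system is below a lower in-flight limit. The sketch below assumes that approach; LoadShedder, capacity, and low_priority_limit are hypothetical names, and the class is not thread-safe as written.

```python
class LoadShedder:
    """Admits requests by priority based on how many are already in flight."""

    def __init__(self, capacity, low_priority_limit):
        self.capacity = capacity                      # hard cap for all requests
        self.low_priority_limit = low_priority_limit  # low-priority work is shed above this
        self.in_flight = 0

    def try_admit(self, high_priority):
        limit = self.capacity if high_priority else self.low_priority_limit
        if self.in_flight >= limit:
            return False  # shed: the caller should return an overload error
        self.in_flight += 1
        return True

    def release(self):
        # Call once for every admitted request when it finishes.
        self.in_flight -= 1
```

With capacity=100 and low_priority_limit=60, background requests start being rejected at 60 in-flight requests while checkout traffic is still admitted up to 100.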

14. Health check (dependency health check)

A mechanism for actively verifying whether a dependency is currently able to serve requests, used to inform routing, circuit breaker state, or alerting decisions.

Usage: “Our health check only verifies the process is running, not that it can actually reach the database — that’s why it kept reporting healthy during the actual outage.”
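
The gap in the usage example, a check that only proves the process is alive, can be closed by probing each dependency. The sketch below assumes each probe is a zero-argument function that raises on failure; the function name health_check and the report shape are illustrative.

```python
def health_check(dependency_probes):
    """Run each probe and report unhealthy if any dependency cannot be reached."""
    failures = {}
    for name, probe in dependency_probes.items():
        try:
            probe()  # e.g. run a trivial query against the database
        except Exception as exc:
            failures[name] = str(exc)
    return {"healthy": not failures, "failures": failures}
```

Probes should be cheap and have their own timeouts, since a health check that hangs on a slow dependency is itself a failure mode.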

15. Blast radius

The scope of impact a single component’s failure can have across a system, a concept used to evaluate whether isolation mechanisms like bulkheads are sufficient.

Usage: “This dependency’s blast radius is currently the entire platform, since every service shares the same connection pool to it — that’s exactly what bulkhead isolation would fix.”

Key Takeaways

  • Distinguish circuit breakers from retries precisely — a circuit breaker stops sending requests to a failing dependency, while retries re-attempt individual failed operations.
  • Always pair retries with exponential backoff and jitter, since naive retries during an outage can cause a retry storm that outlasts the original failure.
  • Use bulkhead isolation vocabulary when discussing shared resource pools, since one dependency’s failure shouldn’t be able to starve unrelated requests.
  • Reach for graceful degradation and fallback patterns explicitly when designing for a dependency outage, rather than only planning for the happy path.
  • Frame blast radius as a concrete design question during architecture reviews — how far does a single component’s failure actually spread, and what isolation mechanism would contain it.