How to Explain a Cache Stampede in English
Learn the English vocabulary and phrases developers need to explain a cache stampede incident to their team, from what triggered it to how to prevent the next one.
A cache stampede is one of those incidents that looks like a sudden traffic spike but is actually self-inflicted — the moment a popular cache entry expires, every waiting request piles onto the database at once. Explaining this clearly matters because the instinctive reaction (“we got hit by more traffic”) leads teammates toward the wrong fix, like scaling the database instead of fixing the caching logic.
Key Vocabulary
Cache stampede (a.k.a. thundering herd) — a surge of near-simultaneous requests that all miss the cache at the same moment, usually right after a popular key expires, overwhelming the backend that regenerates the value. “This wasn’t organic traffic growth — it was a cache stampede triggered the instant that key’s TTL expired.”
TTL (Time to Live) — the duration a cache entry is considered valid before it expires and must be regenerated. “We set the TTL to one hour, which felt safe, but this key gets requested thousands of times a second, so even a one-second gap after expiry causes a pileup.”
Lock-based regeneration — a stampede-prevention pattern where only the first request after expiration is allowed to regenerate the value, while others wait briefly or receive a slightly stale copy. “We’re adding lock-based regeneration so only one request rebuilds the cache entry, instead of every concurrent request hitting the database at once.”
Stale-while-revalidate — a strategy that serves an expired cache value immediately while regenerating it in the background, trading a small amount of staleness for eliminating the stampede entirely. “With stale-while-revalidate, nobody has to wait on a cold cache — they get the slightly old value instantly while we refresh it behind the scenes.”
Jittered expiration — adding a small random offset to each cache entry’s TTL so that many keys set at the same time don’t all expire in the same instant. “We added jitter to the TTL so these ten thousand keys, all cached at deploy time, don’t all expire in the same second next week.”
Explaining the Root Cause
- “This wasn’t a traffic spike — the database load jumped because one popular key expired and every request that missed the cache hit the database at the same second.”
- “We didn’t have any protection against concurrent regeneration, so a thousand requests all tried to rebuild the same cache entry simultaneously.”
- “The timing lines up exactly with the TTL expiring, which is why I’m confident this is a stampede and not an actual increase in users.”
Communicating What Needs to Change
- “I want to add lock-based regeneration so only the first request rebuilds the value, and the rest wait a few milliseconds instead of also hitting the database.”
- “Let’s switch to stale-while-revalidate for this key specifically, since a few seconds of staleness is much cheaper than a database overload.”
- “I’m adding jitter to our TTLs so keys cached around the same time don’t all expire in the same instant.”
Verifying the Fix Together
- “Can we replay this traffic pattern in staging and confirm the database no longer spikes when the key expires?”
- “Let’s watch cache hit rate and database query volume together during the next expected expiration window.”
- “If this happens again, check whether it lines up with a TTL boundary before assuming it’s a genuine traffic increase.”
Professional Tips
- Name the mechanism, not just the symptom. Saying “this is a cache stampede” instead of “the database got overloaded” tells the team exactly what class of fix is needed, rather than inviting a scaling conversation that won’t solve the underlying problem.
- Point to the timing correlation as evidence. Showing that the spike aligns precisely with a TTL expiration is far more convincing than describing the symptom alone, and it heads off debate about whether this was “just growth.”
- Propose a specific prevention pattern, not a vague mitigation. “Lock-based regeneration” or “stale-while-revalidate” gives engineers a concrete design to implement, instead of a general instruction to “make caching better.”
Practice Exercise
- Write two sentences explaining to a teammate why a database load spike was actually a cache stampede, not real traffic growth.
- Describe, in one sentence, the difference between lock-based regeneration and stale-while-revalidate as stampede prevention strategies.
- Draft a short message proposing to add jitter to a set of cache TTLs that currently all expire at the same time.