Build fluency in the vocabulary of caching an LLM response by a prompt's meaning, not its exact text.
1 / 5
At standup, a dev mentions caching an LLM response keyed by the semantic meaning of a prompt, so a differently worded but equivalent question can still return the cached answer instead of triggering a new model call. What is this technique called?
Semantic caching keys a cached LLM response by the meaning of the prompt rather than its exact text, typically by comparing prompt embeddings for similarity, so a differently worded but equivalent question can return the cached answer instead of triggering an expensive new model call. A cache that hits only on a character-for-character string match misses the very common case where two users ask essentially the same question with different phrasing. Matching on meaning captures a much larger share of realistic cache hits than exact-match caching alone.
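A minimal sketch of that lookup, under loud assumptions: `toy_embed` is a hypothetical bag-of-words stand-in for a real embedding model, and `SemanticCache`, its `threshold` value, and the sample response text are illustrative names and data, not any library's API.

```python
import math
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace("?", "").split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    """Returns a cached response when a stored prompt is similar enough."""

    def __init__(self, threshold: float):
        self.threshold = threshold
        self.entries = []  # list of (prompt embedding, response) pairs

    def put(self, prompt: str, response: str) -> None:
        self.entries.append((toy_embed(prompt), response))

    def get(self, prompt: str):
        query = toy_embed(prompt)
        best_score, best_response = 0.0, None
        # Linear scan for the most similar stored prompt.
        for embedding, response in self.entries:
            score = cosine(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        # Serve the cached response only if the best match clears the threshold.
        return best_response if best_score >= self.threshold else None


cache = SemanticCache(threshold=0.7)
cache.put("how do I reset my password", "Go to Settings > Security.")
reworded = cache.get("how do I reset my password please")  # similar wording: hit
unrelated = cache.get("what is the weather today")  # no shared words: miss
```

A production system would swap the word-count vectors for model embeddings and the linear scan for a vector index; the hit-or-miss decision keeps the same shape.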
2 / 5
During a design review, the team wants to set a similarity-score threshold above which two prompts are considered close enough to share a cached response, avoiding a false match on a subtly different question. Which capability supports this?
A tuned similarity threshold sets how similar two prompts' embeddings must be before they share a cached response. It guards against a false match in which a subtly different question, such as one with an added negation, reuses a cached answer that does not apply to it. Treating any nonzero similarity as an automatic hit risks exactly this kind of subtly wrong reuse. Careful tuning is essential, because a semantic cache with too loose a threshold can silently return an incorrect answer.
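To make the threshold decision concrete, here is a small sketch. The three-dimensional vectors are made-up illustrative numbers, not output from any real embedding model, and `should_reuse` is a hypothetical helper name.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def should_reuse(query_vec, cached_vec, threshold):
    """True when the query is similar enough to share the cached response."""
    return cosine_similarity(query_vec, cached_vec) >= threshold


# Made-up vectors for illustration only.
cached = [1.0, 0.0, 0.0]       # the prompt already in the cache
paraphrase = [0.99, 0.1, 0.0]  # same question, reworded
negated = [0.9, 0.0, 0.44]     # same topic with an added negation

strict = 0.95  # assumed value; a real threshold comes from testing
loose = 0.01   # effectively "any nonzero similarity is a hit"

paraphrase_hit_strict = should_reuse(paraphrase, cached, strict)
negated_hit_strict = should_reuse(negated, cached, strict)
negated_hit_loose = should_reuse(negated, cached, loose)
```

With these numbers the strict threshold accepts the paraphrase and rejects the negated prompt, while the loose one accepts both, which is the silent wrong-answer case described above.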
3 / 5
In a code review, a dev notices the cache stores each entry's prompt embedding alongside metadata like which prompt template and model version produced the response, so a cache entry isn't served after that template or model changes. What does this represent?
Cache invalidation tied to prompt template and model version metadata ensures a cached entry isn't served once the prompt template or model that produced it has changed. Serving a cached response indefinitely with no such invalidation risks returning a stale answer that no longer reflects how the current system behaves. Metadata-based invalidation keeps a semantic cache's speed benefit from coming at the cost of outdated responses.
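One way this check might look, as a sketch rather than a definitive design: `CacheEntry`, its field names, and the version strings below are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CacheEntry:
    embedding: tuple        # vector used for similarity lookup
    response: str           # cached model output
    template_version: str   # prompt template that produced the response
    model_version: str      # model that produced the response


def is_servable(entry: CacheEntry, current_template: str, current_model: str) -> bool:
    """An entry is valid only if both its template and model are still current."""
    return (
        entry.template_version == current_template
        and entry.model_version == current_model
    )


def purge_stale(entries, current_template: str, current_model: str):
    """Drop every entry produced by an older template or model."""
    return [e for e in entries if is_servable(e, current_template, current_model)]


# Hypothetical version labels for illustration.
entries = [
    CacheEntry((0.1, 0.9), "old answer", "template-v1", "model-a"),
    CacheEntry((0.2, 0.8), "current answer", "template-v2", "model-a"),
]
live = purge_stale(entries, current_template="template-v2", current_model="model-a")
```

Checking the metadata at read time (`is_servable`) and purging in bulk (`purge_stale`) are complementary; the first keeps a stale entry from being served even if a purge has not run yet.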
4 / 5
An incident report shows a semantic cache's similarity threshold was set too loosely, and a user asking 'how do I disable X' received a cached answer for 'how do I enable X' due to a false-positive match. What practice would prevent this?
Tightening and rigorously testing the similarity threshold against a set of intentionally tricky near-duplicate prompts, like an enable-versus-disable pair, catches this kind of false-positive match before it reaches a real user. Setting the threshold as loosely as possible purely to maximize the hit rate directly causes this exact failure mode. This careful threshold tuning and testing is essential because a semantically similar prompt can still have a critically different, even opposite, intended meaning.
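A sketch of such a test harness follows. It assumes a word-overlap similarity (`toy_similarity`) as a stand-in for real embedding similarity, and a tiny hand-labeled pair list; both are hypothetical and far smaller than a real evaluation set.

```python
import math
from collections import Counter


def toy_similarity(a: str, b: str) -> float:
    """Word-overlap cosine similarity; a stand-in for embedding similarity."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm = math.sqrt(sum(v * v for v in va.values())) * math.sqrt(
        sum(v * v for v in vb.values())
    )
    return dot / norm if norm else 0.0


# (prompt_a, prompt_b, should_share_a_cached_answer)
LABELED_PAIRS = [
    ("how do i enable x", "how do i disable x", False),       # opposite meaning
    ("how do i enable x", "how do i enable x please", True),  # true paraphrase
]


def evaluate(threshold: float):
    """Return (false_positives, false_negatives) for a candidate threshold."""
    false_positives, false_negatives = [], []
    for a, b, should_match in LABELED_PAIRS:
        hit = toy_similarity(a, b) >= threshold
        if hit and not should_match:
            false_positives.append((a, b))
        elif not hit and should_match:
            false_negatives.append((a, b))
    return false_positives, false_negatives


too_loose = evaluate(0.75)   # lets the enable/disable pair through
balanced = evaluate(0.85)    # rejects it, still accepts the paraphrase
too_strict = evaluate(0.95)  # rejects the paraphrase as well
```

Running a check like this in CI against a growing list of tricky pairs turns the threshold from a guess into a tested setting, and the false-negative list shows what hit rate is being traded away.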
5 / 5
During a PR review, a teammate asks why the team implements semantic caching instead of a simpler exact-string-match cache for LLM responses. What is the reasoning?
An exact-string-match cache only ever hits when a prompt is repeated character for character, which misses the very common real-world case of two differently worded but practically equivalent questions. Semantic caching captures that similarity, returning a much higher proportion of realistic cache hits and reducing model-call cost. The tradeoff is the risk of a false-positive match if the similarity threshold isn't tuned and tested carefully enough.
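For contrast, an exact-match cache can be sketched in a few lines. `ExactMatchCache` and the sample prompts are hypothetical; hashing the prompt with SHA-256 is just one common way to build the key.

```python
import hashlib


class ExactMatchCache:
    """Hits only when the prompt text is identical, character for character."""

    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(prompt: str) -> str:
        # Any change in wording, case, or punctuation yields a different key.
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = response

    def get(self, prompt: str):
        return self._store.get(self._key(prompt))


exact = ExactMatchCache()
exact.put("How do I reset my password?", "cached answer")
same_text = exact.get("How do I reset my password?")   # identical text: hit
reworded = exact.get("How can I reset my password?")   # one word differs: miss
```

This version can never produce a false-positive match, which is why it is simpler and safer, but every rewording costs a fresh model call.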