Anthropic Prompt Caching: English for Context Window Optimization
Learn advanced English vocabulary for Anthropic prompt caching, context window optimisation, cache invalidation, and cost-performance trade-offs in production LLM systems.
Anthropic’s prompt caching feature allows you to cache large portions of a prompt — system instructions, reference documents, few-shot examples — so they are not reprocessed on every API call. This significantly reduces latency and cost for workloads that reuse the same context repeatedly. To discuss this intelligently in English — in design reviews, postmortems, or Slack threads — you need precise vocabulary. This guide covers the terms, patterns, and phrases you will encounter in real engineering work.
Foundational Caching Vocabulary
Cache
A cache (noun) is a storage layer that holds copies of frequently used data so it can be retrieved faster or at lower cost than recomputing it from scratch. As a verb, “to cache” means to store something in a cache.
“We cache the system prompt so the model does not re-tokenise it on every request.”
The adjective form is cached (“cached tokens”) and the opposite is uncached. Do not confuse cache (storage) with cash (money) — they are pronounced identically in English, which trips up many non-native speakers.
Cache Hit / Cache Miss
A cache hit occurs when the requested data is found in the cache. A cache miss occurs when it is not — the system must then compute or fetch the data fresh.
“Our cache hit rate dropped after we changed the system prompt — any edit to the cached prefix invalidates the cache.”
Cache Invalidation
Cache invalidation is the process of removing or marking cached data as stale when the underlying data changes. It is famously described as one of the two hard problems in computer science.
“We bumped the product description and forgot that invalidates the cache — we saw a spike in input token costs immediately.”
Prompt Caching-Specific Vocabulary
Prompt Prefix
In the context of Anthropic’s caching, the prompt prefix is the portion of the prompt you mark for caching. It must appear at the beginning of the prompt (or system section). Only this leading segment is cached; anything after it is always processed fresh.
“We structure the prompt so that the 10,000-token knowledge base is the prefix — only the user question varies per request.”
Cache Breakpoint
A cache breakpoint (also called a “cache control marker”) is the explicit marker in the API request telling Anthropic where the cached portion ends. In the Claude API, you add "cache_control": {"type": "ephemeral"} to a content block to mark it.
“Set the cache breakpoint after the static instructions and before the dynamic user input.”
Ephemeral Cache
In Anthropic’s implementation, the cache is ephemeral — it has a limited lifetime (currently 5 minutes, extended with each use). If a request does not reuse the cached prefix within that window, the cache expires.
“The cache is ephemeral, so for low-traffic endpoints you may not see the savings you expect.”
Input Tokens vs. Cache Write Tokens vs. Cache Read Tokens
Anthropic’s API usage breakdown includes three categories of tokens:
- Input tokens — standard tokens that are always processed
- Cache write tokens — tokens written to the cache on the first request (priced at 1.25× normal input)
- Cache read tokens — tokens read from cache on subsequent requests (priced at 0.1× normal input — a 90% discount)
“The first call is more expensive due to cache write tokens, but every subsequent call with the same prefix costs almost nothing for that portion.”
Context Window Optimisation Vocabulary
Context Window
The context window is the maximum number of tokens an LLM can process in a single request — both input and output combined. Claude models currently have context windows of 200,000 tokens.
“With a 200k context window, we can include entire codebases, but we need to be smart about what we cache.”
Token Budget
A token budget is the allocation of tokens across different parts of a prompt — system instructions, conversation history, retrieved context, and the user’s current message. Managing the token budget prevents hitting context limits and controls cost.
“We allocate 50,000 tokens to the document corpus and reserve 10,000 for the conversation history.”
Context Stuffing
Context stuffing (informal) refers to the practice of including as much relevant information as possible in the prompt context — documents, examples, instructions — to give the model more to work with. Prompt caching makes context stuffing economically viable.
“Context stuffing was too expensive without caching — now we can afford to include the full API reference in every call.”
Reranking
Reranking is a technique where candidate documents retrieved for RAG are scored and reordered by relevance before being included in the prompt. Combining reranking with prompt caching means you cache only the highest-value documents.
“We rerank the retrieved chunks and cache only the top 5 — this keeps the cached prefix stable and the hit rate high.”
System Design and Trade-off Vocabulary
Latency Reduction
Latency reduction is the decrease in response time achieved by caching. Because cached tokens are not re-processed, time-to-first-token (TTFT) improves.
“Prompt caching reduced our median TTFT from 3.2 seconds to 0.8 seconds for repeat-prefix requests.”
Time to First Token (TTFT)
TTFT is the time elapsed from sending an API request to receiving the first token of the response. It is a key UX metric for streaming LLM applications.
Cost-Performance Trade-off
The cost-performance trade-off describes the balance between spending more (e.g., on cache writes, larger prompts) and gaining performance benefits (lower latency, lower per-request cost).
“Prompt caching shifts the cost curve: you pay more on the first request but dramatically less on all subsequent ones.”
Amortisation
To amortise a cost means to spread it over many uses so the per-unit cost decreases. The cache write cost is amortised across all requests that reuse the cached prefix.
“At our request volume, the cache write cost is amortised over thousands of calls — the ROI is obvious.”
Key Collocations
These phrases appear frequently in engineering discussions about prompt caching:
- cache hit rate — “We track the cache hit rate in our observability dashboard.”
- prefix stability — “Keep the cached prefix stable — even whitespace changes invalidate it.”
- token cost savings — “Prompt caching delivers 80–90% token cost savings on the repeated portion.”
- context reuse — “High context reuse is the key signal that caching will help.”
- cache warm-up — “The first request after deployment warms up the cache.”
- invalidation risk — “Frequent prompt iterations increase invalidation risk.”
- amortise the write cost — “You need sufficient request volume to amortise the write cost.”
- system prompt engineering — “Good system prompt engineering keeps the cacheable prefix clean and stable.”
Phrases for Design Reviews and Postmortems
- “We should move the static knowledge base to the cached prefix — it’s 80k tokens and it doesn’t change between requests.”
- “The cache invalidated because someone added a trailing newline to the system prompt — we need a prompt versioning process.”
- “What’s our cache hit rate this week? If it’s below 70%, caching isn’t pulling its weight.”
- “We’re paying for cache write tokens on every deploy — we should pre-warm the cache in the deployment script.”
- “The ephemeral cache window is 5 minutes. For our traffic pattern, is that enough, or do we need to send keep-alive requests?”
Practice
Write a short architecture decision record (ADR) in English — 4 to 6 sentences — describing the decision to adopt Anthropic prompt caching for a feature in your system. State the context, the decision, and the trade-offs. Use at least 6 terms from this guide: cache hit rate, cache write tokens, cache read tokens, prefix stability, latency reduction, amortise, ephemeral, token budget. Focus on precision: in English technical writing, the goal is clarity and specificity, not complexity.