Anthropic Prompt Caching: English for Context Window Optimization

Anthropic’s prompt caching feature allows you to cache large portions of a prompt — system instructions, reference documents, few-shot examples — so they are not reprocessed on every API call. This significantly reduces latency and cost for workloads that reuse the same context repeatedly. To discuss this intelligently in English — in design reviews, postmortems, or Slack threads — you need precise vocabulary. This guide covers the terms, patterns, and phrases you will encounter in real engineering work.

Foundational Caching Vocabulary

Cache

A cache (noun) is a storage layer that holds copies of frequently used data so it can be retrieved faster or at lower cost than recomputing it from scratch. As a verb, “to cache” means to store something in a cache.

“We cache the system prompt so the model does not re-tokenise it on every request.”

The adjective form is cached (“cached tokens”) and the opposite is uncached. Do not confuse cache (storage) with cash (money) — they are pronounced identically in English, which trips up many non-native speakers.

Cache Hit / Cache Miss

A cache hit occurs when the requested data is found in the cache. A cache miss occurs when it is not — the system must then compute or fetch the data fresh.

“Our cache hit rate dropped after we changed the system prompt — any edit to the cached prefix invalidates the cache.”

Cache Invalidation

Cache invalidation is the process of removing or marking cached data as stale when the underlying data changes. It is famously described as one of the two hard problems in computer science.

“We bumped the product description and forgot that invalidates the cache — we saw a spike in input token costs immediately.”

Prompt Caching-Specific Vocabulary

Prompt Prefix

In the context of Anthropic’s caching, the prompt prefix is the portion of the prompt you mark for caching. It must appear at the beginning of the prompt (or system section). Only this leading segment is cached; anything after it is always processed fresh.

“We structure the prompt so that the 10,000-token knowledge base is the prefix — only the user question varies per request.”

Cache Breakpoint

A cache breakpoint (also called a “cache control marker”) is the explicit marker in the API request telling Anthropic where the cached portion ends. In the Claude API, you add "cache_control": {"type": "ephemeral"} to a content block to mark it.

“Set the cache breakpoint after the static instructions and before the dynamic user input.”

Ephemeral Cache

In Anthropic’s implementation, the cache is ephemeral — it has a limited lifetime (currently 5 minutes, extended with each use). If a request does not reuse the cached prefix within that window, the cache expires.

“The cache is ephemeral, so for low-traffic endpoints you may not see the savings you expect.”

Input Tokens vs. Cache Write Tokens vs. Cache Read Tokens

Anthropic’s API usage breakdown includes three categories of tokens:

Input tokens — standard tokens that are always processed
Cache write tokens — tokens written to the cache on the first request (priced at 1.25× normal input)
Cache read tokens — tokens read from cache on subsequent requests (priced at 0.1× normal input — a 90% discount)

“The first call is more expensive due to cache write tokens, but every subsequent call with the same prefix costs almost nothing for that portion.”

Context Window Optimisation Vocabulary

Context Window

The context window is the maximum number of tokens an LLM can process in a single request — both input and output combined. Claude models currently have context windows of 200,000 tokens.

“With a 200k context window, we can include entire codebases, but we need to be smart about what we cache.”

Token Budget

A token budget is the allocation of tokens across different parts of a prompt — system instructions, conversation history, retrieved context, and the user’s current message. Managing the token budget prevents hitting context limits and controls cost.

“We allocate 50,000 tokens to the document corpus and reserve 10,000 for the conversation history.”

Context Stuffing

Context stuffing (informal) refers to the practice of including as much relevant information as possible in the prompt context — documents, examples, instructions — to give the model more to work with. Prompt caching makes context stuffing economically viable.

“Context stuffing was too expensive without caching — now we can afford to include the full API reference in every call.”

Reranking

Reranking is a technique where candidate documents retrieved for RAG are scored and reordered by relevance before being included in the prompt. Combining reranking with prompt caching means you cache only the highest-value documents.

“We rerank the retrieved chunks and cache only the top 5 — this keeps the cached prefix stable and the hit rate high.”

System Design and Trade-off Vocabulary

Latency Reduction

Latency reduction is the decrease in response time achieved by caching. Because cached tokens are not re-processed, time-to-first-token (TTFT) improves.

“Prompt caching reduced our median TTFT from 3.2 seconds to 0.8 seconds for repeat-prefix requests.”

Time to First Token (TTFT)

TTFT is the time elapsed from sending an API request to receiving the first token of the response. It is a key UX metric for streaming LLM applications.

Cost-Performance Trade-off

The cost-performance trade-off describes the balance between spending more (e.g., on cache writes, larger prompts) and gaining performance benefits (lower latency, lower per-request cost).

“Prompt caching shifts the cost curve: you pay more on the first request but dramatically less on all subsequent ones.”

Amortisation

To amortise a cost means to spread it over many uses so the per-unit cost decreases. The cache write cost is amortised across all requests that reuse the cached prefix.

“At our request volume, the cache write cost is amortised over thousands of calls — the ROI is obvious.”

Key Collocations

These phrases appear frequently in engineering discussions about prompt caching:

cache hit rate — “We track the cache hit rate in our observability dashboard.”
prefix stability — “Keep the cached prefix stable — even whitespace changes invalidate it.”
token cost savings — “Prompt caching delivers 80–90% token cost savings on the repeated portion.”
context reuse — “High context reuse is the key signal that caching will help.”
cache warm-up — “The first request after deployment warms up the cache.”
invalidation risk — “Frequent prompt iterations increase invalidation risk.”
amortise the write cost — “You need sufficient request volume to amortise the write cost.”
system prompt engineering — “Good system prompt engineering keeps the cacheable prefix clean and stable.”

Phrases for Design Reviews and Postmortems

“We should move the static knowledge base to the cached prefix — it’s 80k tokens and it doesn’t change between requests.”
“The cache invalidated because someone added a trailing newline to the system prompt — we need a prompt versioning process.”
“What’s our cache hit rate this week? If it’s below 70%, caching isn’t pulling its weight.”
“We’re paying for cache write tokens on every deploy — we should pre-warm the cache in the deployment script.”
“The ephemeral cache window is 5 minutes. For our traffic pattern, is that enough, or do we need to send keep-alive requests?”

Practice

Write a short architecture decision record (ADR) in English — 4 to 6 sentences — describing the decision to adopt Anthropic prompt caching for a feature in your system. State the context, the decision, and the trade-offs. Use at least 6 terms from this guide: cache hit rate, cache write tokens, cache read tokens, prefix stability, latency reduction, amortise, ephemeral, token budget. Focus on precision: in English technical writing, the goal is clarity and specificity, not complexity.

Anthropic Prompt Caching: English for Context Window Optimization

Foundational Caching Vocabulary

Cache

Cache Hit / Cache Miss

Cache Invalidation

Prompt Caching-Specific Vocabulary

Prompt Prefix

Cache Breakpoint

Ephemeral Cache

Input Tokens vs. Cache Write Tokens vs. Cache Read Tokens

Context Window Optimisation Vocabulary

Context Window

Token Budget

Context Stuffing

Reranking

System Design and Trade-off Vocabulary

Latency Reduction

Time to First Token (TTFT)

Cost-Performance Trade-off

Amortisation

Key Collocations

Phrases for Design Reviews and Postmortems

Practice

What to Read Next

Practice This Vocabulary

IT Collocations Drills

Interview Preparation

IT Vocabulary Modules

Foundational Caching Vocabulary

Cache

Cache Hit / Cache Miss

Cache Invalidation

Prompt Caching-Specific Vocabulary

Prompt Prefix

Cache Breakpoint

Ephemeral Cache

Input Tokens vs. Cache Write Tokens vs. Cache Read Tokens

Context Window Optimisation Vocabulary

Context Window

Token Budget

Context Stuffing

Reranking

System Design and Trade-off Vocabulary

Latency Reduction

Time to First Token (TTFT)

Cost-Performance Trade-off

Amortisation

Key Collocations

Phrases for Design Reviews and Postmortems

Practice

Related Articles

What to Read Next

Practice This Vocabulary

IT Collocations Drills

Interview Preparation

IT Vocabulary Modules