English for vLLM Inference Developers
Learn the English vocabulary for vLLM: PagedAttention, continuous batching, KV cache, and throughput tuning for LLM serving.
vLLM discussions blend systems vocabulary (memory paging, batching) with LLM-specific terms (KV cache, tokens per second). Describing a throughput problem vaguely, as in “inference is slow,” gives a teammate far less to act on than naming the actual bottleneck.
Key Vocabulary
PagedAttention — the memory management technique vLLM uses to store the KV cache in non-contiguous blocks, similar to virtual memory paging, which reduces fragmentation and lets more requests share GPU memory. “PagedAttention is why vLLM can pack so many concurrent requests onto one GPU — it stops the KV cache from needing one large contiguous allocation per sequence.”
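To make the paging analogy concrete, here is a toy sketch of a block table. It is not vLLM's implementation: the class name `ToyBlockAllocator`, the 16-token block size, and the 8-block pool are all assumptions chosen for illustration.

```python
import math


class ToyBlockAllocator:
    """Toy model of paged KV cache allocation (illustrative, not vLLM's code)."""

    def __init__(self, num_blocks: int, block_size: int = 16):
        self.block_size = block_size            # tokens per block (assumed value)
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}                  # sequence id -> list of block ids

    def allocate(self, seq_id: str, num_tokens: int) -> list:
        """Reserve enough blocks for num_tokens; blocks need not be adjacent."""
        needed = math.ceil(num_tokens / self.block_size)
        if needed > len(self.free_blocks):
            raise MemoryError("not enough free KV cache blocks")
        blocks = [self.free_blocks.pop(0) for _ in range(needed)]
        self.block_tables[seq_id] = blocks
        return blocks

    def free(self, seq_id: str) -> None:
        """Return a finished sequence's blocks to the pool."""
        self.free_blocks.extend(self.block_tables.pop(seq_id))


allocator = ToyBlockAllocator(num_blocks=8)
allocator.allocate("a", 40)              # needs 3 blocks
allocator.allocate("b", 20)              # needs 2 blocks
allocator.free("a")                      # sequence "a" finishes
blocks_c = allocator.allocate("c", 70)   # needs 5 blocks, taken wherever they are free
print(blocks_c)
```

Sequence "c" ends up with blocks that are not adjacent in the pool. A scheme that required one contiguous region per sequence could not have placed it without first moving sequence "b".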
KV cache — the stored key and value tensors from previous tokens in a generation, reused at every decoding step so the model doesn’t recompute the keys and values of earlier tokens from scratch. “We’re hitting out-of-memory errors because the KV cache for long-context requests is eating most of the available GPU memory.”
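A rough size estimate helps when you want to say how much memory the KV cache is "eating." The sketch below uses the standard per-token formula (keys plus values, for every layer and every KV head). The model shape, the 4,096-token context, and the 16 concurrent requests are hypothetical numbers, not measurements from any real deployment.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV cache one token occupies: keys and values for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


# Hypothetical model shape, chosen only for illustration (16-bit values).
per_token = kv_cache_bytes_per_token(num_layers=32, num_kv_heads=32, head_dim=128)
per_request = per_token * 4096           # one request holding 4,096 tokens
concurrent = 16
total_gib = per_request * concurrent / 2**30

print(f"{per_token / 2**20:.2f} MiB per token")
print(f"{per_request / 2**30:.2f} GiB per 4,096-token request")
print(f"{total_gib:.2f} GiB for {concurrent} such requests")
```

Under these assumptions a single long request holds about 2 GiB of cache, which is why context length and concurrency are the first two numbers worth quoting in a memory discussion.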
Continuous batching — a scheduling strategy that adds new requests into a running batch as soon as GPU capacity frees up, rather than waiting for the whole batch to finish before starting the next one. “Continuous batching is what’s giving us this throughput gain — short requests aren’t stuck waiting behind one long-running generation anymore.”
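The difference from static batching can be shown with a small simulation. This is a simplified model, not vLLM's scheduler: it assumes each step produces one token for every running request, ignores prefill, and treats every step as costing the same. The function names and the request lengths are made up for the example.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A waiting request takes over a slot as soon as one frees up."""
    waiting = deque(lengths)
    running = []                 # remaining decode steps per active request
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        steps += 1
        # Requests with one step left finish now; the rest count down.
        running = [r - 1 for r in running if r > 1]
    return steps


# Hypothetical output lengths (in tokens): long and short requests interleaved.
lengths = [100, 5, 100, 5, 100, 5, 100, 5]
static_total = static_batching_steps(lengths, batch_size=2)
continuous_total = continuous_batching_steps(lengths, batch_size=2)
print(static_total, continuous_total)
```

In the static case every short request is held for the full duration of the long request it was paired with. In the continuous case its slot is handed to the next request in the queue, so the same work finishes in roughly half the steps in this toy setup.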
Throughput vs. latency trade-off — the balance between maximizing tokens generated per second across all requests and minimizing the time any single request waits for its response. “Increasing the max batch size helped throughput but pushed up p99 latency — we need to decide which side of that trade-off matters more for this endpoint.”
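When reporting either side of the trade-off, it helps to be clear about how the number was computed. The sketch below shows one common way to derive aggregate throughput and a nearest-rank percentile from raw measurements. The helper names and the sample data are hypothetical, and other percentile definitions (for example, interpolated ones) give slightly different values.

```python
import math


def tokens_per_second(total_tokens: int, wall_seconds: float) -> float:
    """Aggregate throughput across every request served in the window."""
    return total_tokens / wall_seconds


def percentile(values, pct: int) -> float:
    """Nearest-rank percentile: smallest value with pct% of samples at or below it."""
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Hypothetical measurements: 100 requests, two of them very slow.
latencies_s = [0.5] * 98 + [4.0, 6.0]
throughput = tokens_per_second(total_tokens=12_000, wall_seconds=10.0)
p50 = percentile(latencies_s, 50)
p99 = percentile(latencies_s, 99)
print(f"throughput={throughput:.0f} tok/s  p50={p50}s  p99={p99}s")
```

In this made-up data the median looks healthy while p99 is eight times higher, which is the kind of gap a sentence about "p99 latency" is meant to surface.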
Prefill vs. decode — the two phases of generation: prefill processes the full input prompt in parallel to produce the first token, while decode generates subsequent tokens one at a time. Prefill tends to be compute-bound, whereas decode tends to be limited by memory bandwidth, so the two respond to different tuning. “Long prompts are dominating GPU time in the prefill phase, which is starving the decode phase for other users’ requests.”
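The two phases usually surface as two separate latency numbers: time to first token (TTFT), which covers queueing plus prefill, and time per output token (TPOT), which reflects decode. The sketch below splits one request's timeline into those two figures. The function name and the timestamps are hypothetical; real values would come from your own client-side or server-side timing.

```python
def ttft_and_tpot(request_start, token_times):
    """Split one request's latency into a first-token part and a per-token part.

    token_times holds the wall-clock time at which each output token arrived.
    """
    ttft = token_times[0] - request_start
    if len(token_times) < 2:
        return ttft, None        # a single token gives no decode interval
    tpot = (token_times[-1] - token_times[0]) / (len(token_times) - 1)
    return ttft, tpot


# Hypothetical timeline in seconds: a slow first token, then steady decoding.
ttft, tpot = ttft_and_tpot(0.0, [2.0, 2.25, 2.5, 2.75, 3.0])
print(f"TTFT={ttft}s  TPOT={tpot}s")
```

A high TTFT with a normal TPOT points at prompt length or queueing, while a normal TTFT with a high TPOT points at the decode phase, which is the distinction the example sentence above is drawing.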
Common Phrases
- “Is the bottleneck in prefill or decode — are long prompts the issue, or is it long generations?”
- “How much GPU memory is the KV cache actually consuming at our current concurrency?”
- “Is continuous batching enabled here, or are we still batching statically?”
- “Are we optimizing for throughput or latency on this endpoint — which one matters more?”
- “Would PagedAttention’s memory savings let us raise the batch size without hitting OOM?”
Example Sentences
Diagnosing a memory issue: “We’re OOMing under load because the KV cache for our longest-context requests isn’t being evicted quickly enough — we need a stricter memory reservation policy per request.”
Explaining a latency regression: “P99 latency went up after we raised max batch size — we traded some latency for throughput, and now a few long-context requests are starving decode for everyone else.”
Reporting a tuning result: “Switching to continuous batching alone gave us a 2x throughput improvement, since short requests no longer wait behind whatever long generation started first.”
Professional Tips
- Distinguish prefill from decode explicitly when reporting a slowdown — “generation is slow” doesn’t tell a teammate whether the fix is about prompt length or output length.
- Name KV cache memory pressure specifically rather than saying “we’re running out of memory” — it points directly at request concurrency and context length as the levers to pull.
- State whether you’re optimizing for throughput or latency before proposing a batching change — the two goals often pull configuration in opposite directions.
- Reference PagedAttention by name when explaining why vLLM handles concurrency differently from a naive serving setup — it’s the mechanism, not just “better memory management.”
Practice Exercise
- Explain the difference between the prefill and decode phases in one sentence.
- Describe what PagedAttention solves that a fixed contiguous KV cache allocation doesn’t.
- Write a sentence explaining the throughput-versus-latency trade-off in your own words.