vLLM in Production: Essential English Vocabulary for LLM Serving Engineers

Master the English vocabulary for serving LLMs with vLLM: PagedAttention, continuous batching, tensor parallelism, KV cache, and throughput vs latency trade-offs.

Deploying large language models in production is no longer a research problem — it is a systems engineering discipline. vLLM has become the dominant open-source inference engine for serving LLMs at scale, and engineers working with it need precise English vocabulary to discuss performance, architecture, and trade-offs with their teams.

Whether you are joining a team that already runs vLLM, preparing for a technical interview, or writing documentation for your inference platform, this guide covers the core vocabulary you need.


Key Vocabulary

PagedAttention

PagedAttention is vLLM’s core memory management innovation. It stores the KV cache in non-contiguous memory blocks (called pages), similar to how operating systems manage virtual memory. This eliminates memory fragmentation and allows much higher GPU utilisation than naive allocation strategies.

“Our GPU utilisation jumped from 40% to 85% after we switched to vLLM — PagedAttention eliminated the memory waste we were seeing with our previous serving setup.”

KV Cache (Key-Value Cache)

The KV cache stores the intermediate attention keys and values computed during the prefill phase so they do not need to be recomputed during token generation. Without a KV cache, generating each new token would require reprocessing the entire prompt from scratch.

“The KV cache for a 128k-context request at batch size 32 was consuming 18 GB of VRAM — we had to limit max context length to keep within our memory budget.”

Continuous Batching

Continuous batching (also called iteration-level scheduling) is a technique where the inference server dynamically adds new requests to an in-progress batch as existing requests finish, rather than waiting for an entire batch to complete. This dramatically improves GPU utilisation when requests have variable lengths.

“Before continuous batching, a single very long request would hold up the entire batch. Now, shorter requests complete and exit while the long one keeps running.”

Prefill vs Decode

The LLM inference process has two distinct phases. The prefill phase processes the entire input prompt in a single forward pass, populating the KV cache. The decode phase generates output tokens one at a time, attending to the cached prefill values. These phases have very different computational characteristics — prefill is compute-bound, decode is memory-bandwidth-bound.

“We profiled our workload and found 60% of our GPU time was spent in prefill. For our use case, chunked prefill helped balance latency between short and long requests.”

Tensor Parallelism

Tensor parallelism is a distributed inference strategy that splits individual model layers (specifically the weight matrices) across multiple GPUs. Each GPU holds a fraction of each layer and the results are synchronised via all-reduce operations. It allows models that do not fit on a single GPU to be served efficiently.

“We run Llama 3 70B with tensor parallelism across 4 H100s — each GPU holds roughly 17 billion parameters and they communicate via NVLink.”

Throughput vs Latency

Throughput measures how many tokens (or requests) per second the system processes across all concurrent users. Latency measures how long a single user waits — commonly expressed as time-to-first-token (TTFT) and inter-token latency (ITL). Higher throughput and lower latency are often in tension: larger batches increase throughput but increase latency for individual requests.

“Our SLA requires p95 TTFT under 500ms, but maximising throughput pushes us toward larger batches that blow that budget. We tuned max batch size to find the right balance.”

OpenAI-Compatible Server

vLLM exposes an OpenAI-compatible server — an HTTP API that implements the same endpoints as the OpenAI API (/v1/completions, /v1/chat/completions). This means any client code written for the OpenAI API can point at a self-hosted vLLM instance with minimal changes.

“We migrated our application from the OpenAI API to vLLM by changing a single base URL — the OpenAI-compatible interface meant zero application code changes.”

Request Queuing

When all GPU capacity is in use, incoming requests are held in a request queue before being scheduled onto the GPU. vLLM’s scheduler manages this queue alongside the active batch. Understanding queue depth and wait time is essential for capacity planning.

“During peak traffic our request queue depth hit 200 — average queue wait time was 8 seconds. We added a second replica to bring that under 1 second.”


Useful Phrases

  • “We’re hitting our throughput target of 2,000 tokens per second, but TTFT spikes to 900ms under load — we need to tune the scheduler.”
  • “The model doesn’t fit on one GPU, so we’re using tensor parallelism across two nodes with pipeline parallelism between them.”
  • “KV cache pressure is evicting active sequences — we need to either reduce max context length or add memory.”
  • “Our vLLM deployment exposes an OpenAI-compatible endpoint so the rest of the stack didn’t need to change.”
  • “We profile prefill and decode separately because they have completely different bottlenecks — prefill is compute-bound, decode is bandwidth-bound.”

Common Mistakes

Confusing throughput and latency

Non-native speakers sometimes use throughput and latency interchangeably, or use “speed” to mean both. In production LLM serving, these are distinct metrics that often conflict. Throughput is a rate (tokens/second across all users); latency is a duration (milliseconds per request). Say “Our throughput is 1,500 tokens/s but our p99 latency is too high”, not “Our speed is good but slow”.

Using “cache” as a verb incorrectly

A common mistake is saying “the model caches the attention” when the correct phrase is “the attention keys and values are stored in the KV cache” or “the prefill phase populates the KV cache”. The cache is a noun (a data structure); the action is “populate”, “store into”, or “read from” the cache.

Mispronouncing technical acronyms

VRAM is pronounced as individual letters: “V-RAM”, not “vram”. GPU is “G-P-U”, not “gypu”. KV cache is “K-V cache”, not “kev cache”. Using incorrect pronunciation in meetings can undermine your credibility even when your technical understanding is correct.


Mastering vLLM vocabulary allows you to participate fully in architecture reviews, incident retrospectives, and capacity planning discussions. The terms above — particularly the distinction between prefill and decode, and the tension between throughput and latency — come up constantly in any team running LLMs in production. Once you can use them precisely, you will find that complex performance conversations become significantly clearer.