vLLM delivers state-of-the-art LLM serving throughput through PagedAttention memory management and continuous batching. These exercises cover KV cache paging, tensor parallelism, continuous vs. static batching, API endpoint compatibility, and greedy decoding configuration.
0 / 5 completed
1 / 5
What is PagedAttention in vLLM and what problem does it solve?
PagedAttention manages the KV cache like an OS virtual memory system with fixed-size pages (blocks). Instead of pre-allocating contiguous memory for each sequence's maximum context length, pages are allocated on demand. This eliminates KV cache fragmentation and allows sharing prompt KV cache across multiple requests (prefix caching).
2 / 5
A vLLM server is started with --tensor-parallel-size 4. What does this configure?
Tensor parallelism in vLLM shards the model's weight matrices across N GPUs, enabling models too large for a single GPU. Each GPU holds 1/N of the weights and they communicate via all-reduce operations during the forward pass. This is distinct from pipeline parallelism (--pipeline-parallel-size) which splits layers across GPUs.
3 / 5
vLLM exposes an OpenAI-compatible API. Which endpoint path handles chat completion requests?
vLLM's OpenAI-compatible server exposes /v1/chat/completions for chat models and /v1/completions for base models. This compatibility means existing code using the OpenAI Python SDK can switch to a local vLLM server by simply changing the base_url parameter.
4 / 5
What is continuous batching in vLLM and how does it differ from static batching?
Continuous batching (iteration-level scheduling) inserts new requests into the running batch between decoding steps when sequences finish, rather than waiting for an entire batch to complete. This dramatically improves GPU utilization compared to static batching where the GPU idles waiting for the slowest sequence in a batch.
5 / 5
A developer uses vLLM's LLM class offline with sampling_params = SamplingParams(temperature=0, max_tokens=100). What does temperature=0 mean?
Setting temperature=0 in vLLM's SamplingParams enables greedy decoding — the model always selects the token with the highest probability at each step. This produces deterministic, reproducible outputs. Non-zero temperatures introduce randomness by scaling the logits before sampling.