vLLM Production LLM Serving Exercises

vLLM is an LLM inference and serving engine that achieves high throughput through PagedAttention memory management and continuous batching. These exercises cover KV cache paging, tensor parallelism, continuous versus static batching, OpenAI-compatible API endpoints, and greedy decoding configuration.

1 / 5

What is PagedAttention in vLLM and what problem does it solve?
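As a way to reason about this question, the idea of paging a KV cache can be sketched with a toy allocator. This is a simplified illustration, not vLLM's implementation: the class `ToyPagedKVCache`, its methods, and the block sizes are all hypothetical. It assumes the cache is divided into fixed-size blocks and that each sequence keeps a table mapping its tokens to whichever physical blocks happen to be free.

```python
class ToyPagedKVCache:
    """Toy block-table allocator (hypothetical; not vLLM's implementation)."""

    def __init__(self, num_blocks, block_size):
        self.block_size = block_size
        self.free_blocks = list(range(num_blocks))
        self.block_tables = {}  # sequence id -> list of physical block ids
        self.seq_lens = {}      # sequence id -> number of tokens stored

    def append_token(self, seq_id):
        n = self.seq_lens.get(seq_id, 0)
        if n % self.block_size == 0:
            # The last block is full (or none exists yet): take any free block.
            # Blocks of one sequence need not be adjacent in memory.
            if not self.free_blocks:
                raise MemoryError("no free KV blocks")
            block = self.free_blocks.pop(0)
            self.block_tables.setdefault(seq_id, []).append(block)
        self.seq_lens[seq_id] = n + 1

    def free(self, seq_id):
        # Return every block of a finished sequence to the shared pool.
        self.free_blocks.extend(self.block_tables.pop(seq_id, []))
        self.seq_lens.pop(seq_id, None)


cache = ToyPagedKVCache(num_blocks=4, block_size=4)
for _ in range(5):
    cache.append_token("a")  # 5 tokens need two blocks
for _ in range(3):
    cache.append_token("b")  # 3 tokens need one block
print(cache.block_tables, cache.free_blocks)
```

In this sketch memory is claimed one block at a time as a sequence grows, so unused space per sequence is bounded by one partially filled block rather than by a reservation sized for the maximum possible length.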

2 / 5

A vLLM server is started with --tensor-parallel-size 4. What does this configure?

3 / 5

vLLM exposes an OpenAI-compatible API. Which endpoint path handles chat completion requests?

4 / 5

What is continuous batching in vLLM and how does it differ from static batching?
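The scheduling difference behind this question can be explored with a small simulation. It is a toy model under stated assumptions, not vLLM's scheduler: the function names and request lengths are hypothetical, each request is represented only by the number of decode steps it needs, and prefill cost is ignored.

```python
def static_batching_steps(lengths, batch_size):
    """Each batch occupies the device until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps


def continuous_batching_steps(lengths, batch_size):
    """A slot freed by a finished request is refilled before the next step."""
    waiting = list(lengths)
    running = []  # remaining decode steps for each in-flight request
    steps = 0
    while waiting or running:
        # Admit waiting requests into any free slots.
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1
        # Every running request produces one token; finished ones leave.
        running = [r - 1 for r in running if r > 1]
    return steps


lengths = [2, 8, 3, 8, 2, 2]  # hypothetical output lengths in tokens
static_steps = static_batching_steps(lengths, batch_size=2)
continuous_steps = continuous_batching_steps(lengths, batch_size=2)
print(static_steps, continuous_steps)
```

With mixed output lengths, the static variant leaves a slot idle whenever a short request is paired with a long one, while the continuous variant keeps both slots occupied as long as requests are waiting.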

5 / 5

A developer uses vLLM's LLM class offline with sampling_params = SamplingParams(temperature=0, max_tokens=100). What does temperature=0 mean?
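The effect of the temperature setting can be examined without vLLM installed. The following is a toy sampler, not vLLM's code: `pick_token` and the example logits are hypothetical, and the sketch assumes the common convention that a temperature of 0 selects the highest-scoring token directly instead of dividing the logits by zero.

```python
import math
import random


def pick_token(logits, temperature, rng=None):
    """Toy next-token selection (hypothetical helper, not part of vLLM)."""
    if temperature == 0:
        # Assumed convention: temperature 0 means take the argmax.
        return max(range(len(logits)), key=lambda i: logits[i])
    rng = rng or random.Random()
    # Scale logits by the temperature, then sample from the softmax.
    scaled = [x / temperature for x in logits]
    top = max(scaled)
    weights = [math.exp(s - top) for s in scaled]
    return rng.choices(range(len(logits)), weights=weights, k=1)[0]


logits = [1.0, 3.5, 0.2, 3.4]  # hypothetical scores for a 4-token vocabulary
greedy = [pick_token(logits, 0) for _ in range(5)]
sampled = [pick_token(logits, 1.0, random.Random(seed)) for seed in range(5)]
print(greedy)
print(sampled)
```

Repeated calls with temperature 0 always return the same index for the same logits, whereas the sampled calls can vary with the random seed.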