llama.cpp enables efficient local inference of large language models on consumer hardware. These exercises cover the GGUF file format, quantization levels and tradeoffs, GPU layer offloading with -ngl, context size configuration, and platform-specific backends including Metal for Apple Silicon.
0 / 5 completed
1 / 5
What is the GGUF file format used by llama.cpp?
GGUF (GPT-Generated Unified Format) is llama.cpp's model format that bundles model weights, tokenizer vocabulary, and all metadata in a single file. It replaced the older GGML format, adding extensible key-value metadata headers and making models self-contained without separate config files.
2 / 5
A developer loads a model with -ngl 35 flag in llama.cpp. What does this parameter control?
-ngl (--n-gpu-layers) specifies how many transformer layers to offload to the GPU. Setting it to the total number of layers runs the full model on GPU. Setting it to 0 runs entirely on CPU. Partial offloading (e.g., -ngl 20 on a 32-layer model) lets you use GPU for some layers when VRAM is limited.
3 / 5
Which quantization level in llama.cpp provides the best quality-to-size tradeoff for most use cases?
Q4_K_M is widely regarded as the sweet spot: it uses approximately 4 bits per weight with a medium-quality K-quant scheme that preserves accuracy on attention layers. Q2_K has noticeable quality degradation; Q8_0 and F16 are higher quality but 2x-4x larger with minimal perceptible benefit for most tasks.
4 / 5
A developer starts the llama.cpp server with ./llama-server -m model.gguf --ctx-size 8192. What does --ctx-size control?
--ctx-size sets the maximum context length in tokens that the KV cache is allocated for. Setting it higher than necessary wastes VRAM/RAM. The model's architecture defines its maximum supported context, and setting --ctx-size beyond that limit may produce degraded or undefined behavior.
5 / 5
When using llama.cpp on Apple Silicon, which backend enables GPU acceleration?
llama.cpp uses Apple's Metal GPU API for hardware acceleration on Apple Silicon (M1/M2/M3/M4). Metal is enabled by default when building on macOS. It allows offloading transformer layers to the unified memory GPU, achieving significantly faster inference than CPU-only execution.