
llama.cpp Inference Exercises

llama.cpp enables efficient local inference of large language models on consumer hardware. These exercises cover the GGUF file format, quantization levels and tradeoffs, GPU layer offloading with -ngl, context size configuration, and platform-specific backends including Metal for Apple Silicon.


1 / 5

What is the GGUF file format used by llama.cpp?
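
As background for this question, a GGUF file is a single binary file that opens with a small fixed header. The sketch below parses that header in Python. It assumes the version 2 and later layout (4-byte magic "GGUF", then a little-endian uint32 version, a uint64 tensor count, and a uint64 metadata key-value count). The function names parse_gguf_header and read_gguf_header are hypothetical helpers for illustration, not part of llama.cpp.

```python
import struct

GGUF_MAGIC = b"GGUF"   # first four bytes of every GGUF file
HEADER_SIZE = 24       # 4 (magic) + 4 (version) + 8 (tensors) + 8 (metadata pairs)


def parse_gguf_header(data):
    """Return (version, tensor_count, metadata_kv_count) from raw header bytes.

    Assumes the v2+ header layout; older v1 files used 32-bit counts.
    """
    if len(data) < HEADER_SIZE or data[:4] != GGUF_MAGIC:
        raise ValueError("not a GGUF file")
    # "<" = little-endian without padding, I = uint32, Q = uint64
    return struct.unpack("<IQQ", data[4:HEADER_SIZE])


def read_gguf_header(path):
    """Read only the first 24 bytes of a file and parse them."""
    with open(path, "rb") as f:
        return parse_gguf_header(f.read(HEADER_SIZE))
```

Calling read_gguf_header("model.gguf") on a real model file should return its format version and counts without loading any tensor data, which makes it a cheap sanity check before starting inference.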

2 / 5

A developer loads a model in llama.cpp with the -ngl 35 flag. What does this parameter control?
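
To make the tradeoff behind -ngl concrete, here is a rough Python sketch for picking a value. It is a heuristic under stated assumptions, not how llama.cpp decides anything: it treats all layers as equally sized, ignores the KV cache and compute buffers apart from a fixed reserve, and the helper names layers_that_fit and server_command are hypothetical.

```python
def layers_that_fit(n_layers, model_bytes, free_vram_bytes, reserve_bytes=1 << 30):
    """Estimate how many layers fit in free VRAM (assumes equal-size layers).

    reserve_bytes is an assumed safety margin for the KV cache and buffers.
    """
    per_layer = model_bytes / n_layers
    usable = max(0, free_vram_bytes - reserve_bytes)
    return min(n_layers, int(usable // per_layer))


def server_command(model_path, ngl):
    """Build an argument list that passes the chosen value via -ngl."""
    return ["./llama-server", "-m", model_path, "-ngl", str(ngl)]


GIB = 1 << 30
ngl = layers_that_fit(n_layers=32, model_bytes=4 * GIB, free_vram_bytes=3 * GIB)
print(server_command("model.gguf", ngl))
```

In practice the value is usually tuned by trial: raise it until the model no longer fits in GPU memory, then back off.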

3 / 5

Which quantization level in llama.cpp is commonly recommended as the best quality-to-size tradeoff for most use cases?
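
The size side of this tradeoff is simple arithmetic: parameters times bits per weight. The Python sketch below covers only formats whose block layout gives a fixed figure (Q8_0 and Q4_0 store 32 weights plus one 16-bit scale per block). K-quant variants are left out because they mix precisions across tensors, so their effective bits per weight vary by model. The function name estimated_size_gib is hypothetical, and real files differ somewhat because of metadata and tensors kept at higher precision.

```python
# Bits per weight derived from block layout (32 weights per block).
BITS_PER_WEIGHT = {
    "F16": 16.0,
    "Q8_0": 8.5,  # 32 x 8-bit weights + 16-bit scale = 34 bytes per block
    "Q4_0": 4.5,  # 32 x 4-bit weights + 16-bit scale = 18 bytes per block
}


def estimated_size_gib(n_params, quant):
    """Approximate weight storage in GiB for a model with n_params parameters."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 2**30


for name in BITS_PER_WEIGHT:
    print(f"{name}: {estimated_size_gib(7e9, name):.2f} GiB")
```

For a hypothetical 7-billion-parameter model this shows the 4-bit format needing a bit over a quarter of the F16 footprint, which is the kind of saving that makes consumer hardware viable.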

4 / 5

A developer starts the llama.cpp server with ./llama-server -m model.gguf --ctx-size 8192. What does --ctx-size control?
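
One practical consequence of the context size is memory: the key-value cache grows linearly with it. The Python sketch below estimates that cost with the standard transformer formula (one K and one V tensor per layer). The model shape used in the example (32 layers, 32 KV heads, head dimension 128, 16-bit cache) is an assumed, illustrative configuration, and kv_cache_bytes is a hypothetical helper, not a llama.cpp function.

```python
def kv_cache_bytes(n_ctx, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Estimate KV cache size in bytes.

    Each layer stores K and V, each of shape n_ctx x n_kv_heads x head_dim.
    bytes_per_elem=2 assumes a 16-bit cache.
    """
    return 2 * n_layers * n_ctx * n_kv_heads * head_dim * bytes_per_elem


GIB = 1 << 30
for n_ctx in (2048, 4096, 8192):
    size = kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=32, head_dim=128)
    print(f"ctx {n_ctx}: {size / GIB:.1f} GiB")
```

Under these assumptions, doubling the context size doubles the cache, so a larger --ctx-size competes with offloaded layers for the same GPU memory. Models with fewer KV heads (grouped-query attention) need proportionally less.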

5 / 5

When using llama.cpp on Apple Silicon, which backend enables GPU acceleration?