Ollama Local Model Serving Exercises
Ollama simplifies running large language models locally with automatic model management and a REST API. These exercises cover pulling and running models, creating custom Modelfiles, the streaming REST API, monitoring loaded models, and automatic GPU memory management.
1 / 5
A developer runs ollama run llama3.2 for the first time. What happens before the interactive session starts?
If the model isn't already in the local cache (~/.ollama/models by default), ollama run automatically pulls it from the Ollama model library (ollama.com/library) before starting the session. Models are stored in GGUF format in a content-addressed store, so each file is identified by a hash of its contents. Subsequent runs use the cached version without re-downloading.
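The idea of a content-addressed store can be sketched in a few lines. This is a minimal illustration, not Ollama's actual storage code: the function names are hypothetical, an in-memory dict stands in for the blob directory, and the "sha256-" naming scheme is an assumption made for the example.

```python
import hashlib


def blob_name(data: bytes) -> str:
    """Derive a file name from the content itself (assumed 'sha256-<hex>' scheme)."""
    return "sha256-" + hashlib.sha256(data).hexdigest()


def store(cache: dict, data: bytes) -> str:
    """Store data under its digest; identical content is kept only once."""
    name = blob_name(data)
    cache.setdefault(name, data)  # no-op if this content is already cached
    return name


cache = {}
first = store(cache, b"model weights")
second = store(cache, b"model weights")  # same bytes, same name, no new entry
```

Because the name depends only on the bytes, a second pull of the same layer can be recognized as already present, which is why repeated runs do not re-download.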
2 / 5
What is the purpose of a Modelfile in Ollama?
An Ollama Modelfile is analogous to a Dockerfile: it lets you customize a base model named in the FROM instruction by setting a SYSTEM prompt, adjusting PARAMETER values (such as temperature or the context size, num_ctx), or adding LoRA weights with ADAPTER. FROM can also point at a local GGUF file to import a model instead of building on one from the library. Running ollama create mymodel -f Modelfile builds the customized model.
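As a rough sketch of what such a file contains, the snippet below assembles Modelfile text from a base model, a system prompt, and a few parameters. The helper render_modelfile and the example values are hypothetical; only the FROM, PARAMETER, and SYSTEM instructions come from the Modelfile format itself.

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Build Modelfile text: one FROM line, PARAMETER lines, then SYSTEM."""
    lines = [f"FROM {base}"]
    for key, value in params.items():
        lines.append(f"PARAMETER {key} {value}")
    # Triple quotes allow the system prompt to span several lines.
    lines.append(f'SYSTEM """{system}"""')
    return "\n".join(lines) + "\n"


text = render_modelfile(
    "llama3.2",
    "You are a concise technical assistant.",
    {"temperature": 0.3, "num_ctx": 4096},
)
print(text)
```

Saving the printed text as a file named Modelfile and running ollama create mymodel -f Modelfile would then build the customized model.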
3 / 5
A developer calls POST /api/generate on Ollama's REST API with "stream": false. What changes about the response?
By default, Ollama streams tokens as newline-delimited JSON objects. Setting "stream": false disables streaming and returns a single JSON response containing the complete generated text once generation finishes. This simplifies client code that doesn't need streaming.
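The difference for client code can be sketched without a running server. The helper names below are hypothetical and the sample lines are hand-written stand-ins for a streamed reply; the sketch assumes each streamed object carries a "response" text fragment and a "done" flag, and omits the other fields a real reply contains.

```python
import json


def build_request(model: str, prompt: str, stream: bool = True) -> bytes:
    """Encode a JSON body for POST /api/generate."""
    body = {"model": model, "prompt": prompt, "stream": stream}
    return json.dumps(body).encode("utf-8")


def collect_stream(lines) -> str:
    """Join the text fragments of a newline-delimited JSON stream."""
    parts = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between objects
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final object marks the end of generation
    return "".join(parts)


# Hand-written sample of what a streamed reply might look like.
sample = [
    '{"response": "Hel", "done": false}',
    '{"response": "lo", "done": false}',
    '{"response": "", "done": true}',
]
full_text = collect_stream(sample)
```

With "stream": false the client skips the loop entirely: it parses one JSON object and reads the complete text from it.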
4 / 5
Which Ollama CLI command shows the currently running models and their memory usage?
ollama ps displays currently loaded models, their sizes, and processor usage (CPU/GPU). ollama list shows all locally available models regardless of whether they're loaded. Models stay loaded in memory for a default of 5 minutes after last use before being unloaded.
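The unload timing can be modeled with simple arithmetic. This is a toy model of the behavior described above, not Ollama's scheduler: the function names are hypothetical and timestamps are plain seconds.

```python
DEFAULT_KEEP_ALIVE_SECONDS = 5 * 60  # the 5-minute default described above


def should_unload(last_used: float, now: float,
                  keep_alive: float = DEFAULT_KEEP_ALIVE_SECONDS) -> bool:
    """True once the model has been idle for at least keep_alive seconds."""
    return now - last_used >= keep_alive


def loaded_models(last_used_by_model: dict, now: float,
                  keep_alive: float = DEFAULT_KEEP_ALIVE_SECONDS) -> list:
    """Names of models that would still be in memory at time `now`."""
    return sorted(
        name
        for name, last_used in last_used_by_model.items()
        if not should_unload(last_used, now, keep_alive)
    )


# One model used 250 s ago, another used 350 s ago.
still_loaded = loaded_models({"llama3.2": 100.0, "mistral": 0.0}, now=350.0)
```

In this model every request resets the timer for the model it used, so a model in steady use never reaches the idle limit.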
5 / 5
How does Ollama determine how many GPU layers to use when loading a model?
Ollama automatically detects available VRAM and offloads as many model layers to the GPU as will fit. If the model fits entirely in VRAM, all layers run on the GPU. If not, the remaining layers run on the CPU. This automatic behavior can be overridden with the num_gpu parameter, which sets the number of layers to offload and can be given in a Modelfile (PARAMETER num_gpu) or in the options of an API request.
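The split between GPU and CPU layers amounts to a division, shown here as a simplified sketch. It assumes every layer is the same size and uses a hypothetical reserved_bytes allowance for everything else that needs VRAM; a real loader accounts for more than this, so treat the numbers as illustrative only.

```python
def split_layers(total_layers: int, bytes_per_layer: int,
                 free_vram_bytes: int, reserved_bytes: int = 0) -> tuple:
    """Return (gpu_layers, cpu_layers) for a model of equally sized layers."""
    usable = max(free_vram_bytes - reserved_bytes, 0)
    # Offload as many whole layers as fit, but never more than the model has.
    gpu_layers = min(total_layers, usable // bytes_per_layer)
    return gpu_layers, total_layers - gpu_layers


GIB = 1024 ** 3
# Hypothetical 32-layer model at 0.25 GiB per layer, with 6 GiB free
# and 1 GiB held back: 5 GiB usable, so 20 layers fit on the GPU.
gpu, cpu = split_layers(32, GIB // 4, 6 * GIB, reserved_bytes=1 * GIB)
```

When usable VRAM covers all layers the CPU count drops to zero, which corresponds to the case where the whole model runs on the GPU.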