English for Ollama Local LLMs
Learn the English vocabulary for running local LLMs with Ollama: model pulls, quantization, Modelfiles, and GPU offloading, explained clearly.
Running LLMs locally with Ollama surfaces a specific vocabulary — quantization levels, Modelfiles, GPU offloading — that’s easy to gesture at vaguely but important to use precisely when someone else needs to reproduce your setup or debug why a model is slow. This guide covers the terms.
Key Vocabulary
Pull — the act of downloading a model’s weights from Ollama’s registry to local disk, analogous to docker pull for container images.
“Pull the model before the demo — downloading eight gigabytes over conference wifi live is not a good look.”
Modelfile — a configuration file, similar in spirit to a Dockerfile, that customizes a base model with a system prompt, parameters, or a different quantization before building a named local variant.
“We wrote a Modelfile that sets a fixed system prompt and lower temperature, so ollama run support-bot behaves consistently without callers repeating the same instructions.”
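As a minimal sketch of what such a Modelfile might contain, the snippet below renders one from Python so the values are easy to review. FROM, PARAMETER, and SYSTEM are standard Modelfile instructions; the base tag llama3:8b, the prompt text, the temperature value, and the helper name render_modelfile are assumptions chosen for illustration.

```python
def render_modelfile(base: str, system_prompt: str, temperature: float) -> str:
    """Return Modelfile text that fixes a base model, temperature, and system prompt."""
    lines = [
        f"FROM {base}",
        f"PARAMETER temperature {temperature}",
        f'SYSTEM """{system_prompt}"""',
    ]
    return "\n".join(lines) + "\n"


modelfile_text = render_modelfile(
    base="llama3:8b",  # assumed base model tag
    system_prompt="You are a concise support assistant.",  # illustrative prompt
    temperature=0.2,  # illustrative "lower temperature"
)
print(modelfile_text)
```

Saving the printed text as a file named Modelfile and running ollama create support-bot -f Modelfile would build the named variant that ollama run support-bot then starts.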
Quantization level — the precision (e.g., q4_0, q8_0, fp16) a model’s weights are stored at, trading off memory footprint and speed against output quality.
“We dropped to a 4-bit quantization level to fit the model in 8GB of VRAM — there’s a small quality cost, but it’s acceptable for our use case.”
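The memory side of this trade-off can be approximated with simple arithmetic: weight storage is roughly parameter count times bits per weight, divided by eight. The sketch below assumes an 8-billion-parameter model and exactly nominal bit widths. Real quantized formats store some extra scaling data, and the runtime needs additional memory for the context, so treat the results as lower bounds.

```python
def weight_memory_gb(num_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_weight / 8 / 1e9


params = 8e9  # assumed 8-billion-parameter model
for label, bits in [("fp16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_memory_gb(params, bits):.1f} GB")
```

Under these assumptions only the 4-bit variant leaves headroom on an 8GB card, which matches the situation in the example sentence.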
Context window — the maximum number of tokens (prompt plus generated response) a model can process in a single request, configurable in Ollama up to the base model’s supported limit.
“The summarization is getting cut off because our context window setting is smaller than the document length — bump it up in the Modelfile or the request parameters.”
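Raising the context window through request parameters could look like the sketch below. It assumes Ollama's REST API on its default local address and the num_ctx option; the model tag, the prompt, and the value 8192 are illustrative. The payload is built and printed, and the send helper is defined but not called, since it needs a running server.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local endpoint (assumed setup)


def build_generate_request(model: str, prompt: str, num_ctx: int) -> dict:
    """Build a generate payload that asks for a specific context window size."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # ask for one JSON object instead of a stream
        "options": {"num_ctx": num_ctx},
    }


def send(payload: dict) -> str:
    """POST the payload to a running Ollama server and return the generated text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read())["response"]


payload = build_generate_request("llama3:8b", "Summarize: ...", num_ctx=8192)
print(json.dumps(payload, indent=2))
# send(payload) would perform the request; it needs a local Ollama server.
```

The Modelfile equivalent is a single line, PARAMETER num_ctx 8192, which bakes the setting into the named variant instead of repeating it per request.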
GPU offloading (num_gpu layers) — the setting controlling how many of a model’s layers run on GPU versus CPU, used to balance speed against available VRAM.
“If the model doesn’t fully fit in VRAM, reduce the GPU offload layer count rather than letting it fail — partial offloading is slower but still faster than pure CPU inference.”
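A rough way to choose a layer count is to divide the VRAM you can spare by the size of one layer. The sketch below assumes all layers are the same size and ignores memory needed for the context, so it is an estimate to start from, not a guarantee. The figures (32 layers, 0.25 GB each, 4 GB free) and the helper name layers_that_fit are hypothetical.

```python
def layers_that_fit(total_layers: int, layer_size_gb: float, vram_budget_gb: float) -> int:
    """Return how many equally sized layers fit in the VRAM budget, capped at the total."""
    if layer_size_gb <= 0:
        raise ValueError("layer_size_gb must be positive")
    fit = int(vram_budget_gb // layer_size_gb)
    return max(0, min(total_layers, fit))


# Assumed figures: 32 layers of about 0.25 GB each, 4 GB of VRAM free for weights.
num_gpu = layers_that_fit(total_layers=32, layer_size_gb=0.25, vram_budget_gb=4.0)
options = {"num_gpu": num_gpu}  # usable as request options or as a Modelfile PARAMETER
print(options)
```

If the model still fails to load with the estimated value, lowering num_gpu by a few layers leaves more room for the context and runtime overhead.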
Model tag — the identifier after the colon in a model reference (the 8b in llama3:8b, the 70b in llama3:70b), specifying which parameter size, quantization, or fine-tune variant to pull and run.
“Make sure the deployment script pins an exact model tag — pulling latest risks silently swapping to a different parameter size after an upstream update.”
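A deployment script could enforce this with a small check before pulling. The sketch below splits a model reference into name and tag, treats a missing tag as latest (the default Ollama applies), and reports whether the reference is pinned. The helper names are hypothetical, and a named tag can still be republished upstream, so this checks the convention, not immutability.

```python
def split_model_ref(ref: str) -> tuple[str, str]:
    """Split 'name:tag' into (name, tag); a missing tag means 'latest'."""
    # Only look for the tag colon after the last slash, so a registry
    # host with a port (host:5000/model) is not mistaken for a tag.
    last_segment = ref.rsplit("/", 1)[-1]
    if ":" in last_segment:
        name, tag = ref.rsplit(":", 1)
        return name, tag
    return ref, "latest"


def is_pinned(ref: str) -> bool:
    """Treat a reference as pinned only if it names a tag other than 'latest'."""
    return split_model_ref(ref)[1] != "latest"


for ref in ["llama3:8b", "llama3", "llama3:latest"]:
    print(ref, split_model_ref(ref), is_pinned(ref))
```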
Common Phrases
- “Did you pull the model already, or is this the first run downloading it now?”
- “What quantization level are we running — is that why output quality dropped?”
- “Is this a context window limit, or is the model actually failing to generate?”
- “Are we offloading fully to GPU, or is some of this running on CPU and causing the slowdown?”
- “Is the model tag pinned, or could latest have changed underneath us?”
Example Sentences
Explaining a setup change in a PR: “I added a Modelfile that sets a lower temperature and a fixed system prompt, so local testing behaves consistently instead of everyone tweaking the same base model differently.”
Reporting a performance issue: “Inference is much slower on my machine than the team’s benchmark — it turns out my GPU doesn’t have enough VRAM for full offloading, so most layers are running on CPU.”
Discussing a quality regression: “The responses got noticeably worse after the quantization change — we should benchmark the 8-bit version against the 4-bit one before deciding the memory savings are worth it.”
Professional Tips
- Pin an exact model tag in any reproducible setup — relying on latest is a common source of “it worked yesterday” bugs when the upstream model updates.
- State the quantization level explicitly when reporting a quality issue, since it’s often the first thing a reviewer will ask about and changes the diagnosis significantly.
- Distinguish context window limits from generation failures in bug reports — a truncated response and a crashed request look similar to a user but have different causes.
- Mention GPU offloading status when reporting performance, since partial offloading due to VRAM limits is the most common cause of unexpectedly slow local inference.
Practice Exercise
- Explain in one sentence what a Modelfile is used for.
- Write a bug report describing slow inference caused by partial GPU offloading.
- Describe, in your own words, the trade-off quantization level controls.