English for LM Studio and Local LLM Developers
Vocabulary and phrases for developers running local large language models with LM Studio — quantization, context windows, GPU offloading, and model comparison talk for English-speaking teams.
Running large language models locally — on a laptop or an on-prem workstation — has its own vocabulary, distinct from cloud API talk. LM Studio, a popular desktop app for downloading and running open-weight models, puts terms like “quantization,” “context window,” and “GPU offload” front and center. If you work on a privacy-conscious product or an offline-first tool and need to explain your local-inference setup in English, this guide covers the terms you’ll reach for most often.
Getting Started Vocabulary
Local inference — running a model entirely on your own hardware, without sending data to an external API. “We switched the redaction step to local inference in LM Studio so no customer text ever leaves the laptop.”
Model weights — the trained parameters of a model, distributed as a downloadable file (often in GGUF format for local runners). “The 7B model weights are about 4.5GB once quantized — small enough to keep several versions on disk at once.”
GGUF — a file format designed for efficient local inference, widely supported by LM Studio and llama.cpp-based runners. “Make sure you download the GGUF build, not the raw safetensors — LM Studio won’t load those directly.”
Model card — the documentation accompanying a model release, describing its training data, intended use, and known limitations. “Before we shipped anything based on this model, we read the model card carefully — it explicitly says it wasn’t evaluated for medical use cases.”
Quantization Talk
Quantization — reducing the numerical precision of a model’s weights (for example from 16-bit to 4-bit) to shrink file size and speed up inference, at some cost to accuracy.
“We’re running the Q4_K_M quantization — it’s a good balance between quality and speed for a laptop with 16GB of RAM.”
Precision loss — the drop in output quality that can result from aggressive quantization.
“We noticed some precision loss on long reasoning tasks with the 4-bit version, so we bumped up to Q6 for anything involving multi-step logic.”
Perplexity — a metric used to compare how well a quantized model predicts text compared to the original, full-precision version.
“The perplexity numbers for Q4 versus Q8 were close enough that we didn’t think twice about using the smaller file.”
Hardware and Performance
GPU Offloading
GPU offloading — running some or all of a model’s layers on the GPU instead of the CPU, for faster inference. LM Studio lets you configure how many layers to offload.
“With 28 out of 32 layers offloaded to the GPU, we’re getting about four times the tokens per second compared to CPU-only.”
Tokens Per Second (TPS)
Tokens per second is the standard throughput metric for local inference — how quickly the model generates text.
“We benchmarked three models and picked the one with the best tokens-per-second on our target hardware, not just the best benchmark score.”
Context Window
The context window is the maximum number of tokens (input plus output) a model can process in one exchange. Larger context windows use more memory.
“We had to cap our context window at 8K tokens locally — the 32K version wouldn’t fit in memory alongside everything else running.”
VRAM Budget
Your VRAM budget is the amount of GPU memory available for loading model weights and the context window — a hard constraint when running locally.
“Once we account for the OS and other apps, our real VRAM budget is closer to 10GB, not the full 12GB on the card.”
Model Selection and Comparison
Instruction-tuned — a model fine-tuned to follow user instructions and hold conversations, as opposed to a raw “base model” that only continues text.
“Always check whether you’re loading the instruction-tuned variant — the base model will just ramble instead of answering your question.”
Fine-tune — a model further trained on a specific dataset to specialise its behaviour.
“This is a community fine-tune optimised for coding tasks — it performs noticeably better on our internal benchmark than the general-purpose base model.”
Benchmark leaderboard — a public ranking comparing models across standardised tasks, often used as a starting point (though not the final word) when picking a model.
“The leaderboard put it in the top five for our size class, but we still ran our own eval set before committing to it.”
Local-first — a product design philosophy where functionality works without a network connection or external API, often for privacy or reliability reasons.
“We’re building this feature local-first — if the local model handles 80% of queries well, we only need the cloud API as a fallback for the hard cases.”
Explaining Trade-Offs to Stakeholders
| Situation | Phrase |
|---|---|
| Justifying local inference over an API | ”Running this model locally means customer data never leaves the device — that’s a hard requirement for this feature.” |
| Explaining a quality trade-off | ”The smaller, quantized model is faster and fits on-device, but it’s noticeably worse at multi-step reasoning — we’re using it only for simple classification.” |
| Reporting a hardware limitation | ”We can’t run the 70B model locally on our target devices — we’d need to either use a smaller model or fall back to a cloud API for that tier.” |
| Describing a benchmarking process | ”We tested five quantization levels against our own eval set and picked the smallest one that stayed within two points of the full-precision baseline.” |
Common Mistakes
- Saying “the model is compressed” when the precise term is quantized — compression usually refers to file storage, quantization changes numerical precision.
- Confusing context window (token limit per exchange) with memory (persisted history across sessions) — they solve different problems.
- Describing GPU offloading as “using the GPU” without specifying how many layers — precision matters when reporting performance numbers to a team.
Practice Exercise
- Explain, in two or three sentences, why a company might choose local inference over a cloud API for a specific feature.
- Write a short comparison of two quantization levels (for example Q4 vs Q8) for a technical README.
- Draft a message to a teammate explaining that a feature was moved from cloud API to local-first inference, and why.