AI and Machine Learning Vocabulary Every Developer Should Know
A practical guide to AI and LLM vocabulary for developers — tokens, RAG, fine-tuning, hallucination, context window, and more with real usage examples.
You don’t need to be a machine learning researcher to work with AI systems. But you do need to understand the vocabulary — so you can read documentation, discuss integrations with your team, and evaluate tools intelligently. This guide covers the essential AI and LLM vocabulary that every developer working in or around AI systems should know.
Large Language Model (LLM) Vocabulary
| Term | Definition | Practical note |
|---|---|---|
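| Token | The basic unit of text an LLM processes; in English, one token averages roughly 0.75 words, though the exact ratio depends on the tokenizer | API costs are usually priced per token, often with separate rates for input and output |
| Context window | The maximum amount of text (in tokens) an LLM can process at once, usually counting both the prompt and the generated response | A larger window lets the model see more material at once, but longer prompts cost more and take longer to process |
| Temperature | A setting controlling how random the model’s outputs are (values near 0 give nearly deterministic output; values around 1 and above give more varied output) | Use low temperature for code generation, higher for creative tasks |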
| Hallucination | When an LLM generates confident-sounding but incorrect or fabricated information | Always verify LLM outputs for factual claims |
| Prompt | The input text you send to an LLM | Prompt engineering is the skill of crafting effective prompts |
| System prompt | Instructions given to the model before the user’s message | Used to set behaviour, tone, and constraints |
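
To make tokens, context windows, and temperature more concrete, here is a minimal sketch. It assumes the 0.75 words-per-token rule of thumb from the table (a real tokenizer will give different counts), and the function names `estimate_tokens`, `fits_context`, and `softmax_with_temperature` are illustrative, not part of any library. The temperature function shows the usual idea of dividing the model's raw scores by the temperature before converting them to probabilities.

```python
import math

WORDS_PER_TOKEN = 0.75  # rule of thumb only; real tokenizers vary by model and language


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_context(text: str, context_window: int, reserved_for_output: int = 0) -> bool:
    """Check whether the estimated prompt size leaves room for the reply."""
    return estimate_tokens(text) + reserved_for_output <= context_window


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Convert raw scores to probabilities; lower temperature sharpens the distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be > 0 in this sketch")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


if __name__ == "__main__":
    prompt = "Summarise the following release notes in three bullet points."
    print("estimated tokens:", estimate_tokens(prompt))
    print("fits in a 4096-token window:", fits_context(prompt, 4096, reserved_for_output=500))

    logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate next tokens
    print("T=0.2:", softmax_with_temperature(logits, 0.2))
    print("T=1.5:", softmax_with_temperature(logits, 1.5))
```

At the low temperature almost all of the probability lands on the top-scoring candidate, which is why low settings feel more predictable; at the higher temperature the probabilities are spread more evenly, so sampling produces more varied output.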
Retrieval-Augmented Generation (RAG)
RAG is a pattern where relevant documents are retrieved from a knowledge base and provided to the LLM as context before it generates a response. This grounds the model’s output in real data and reduces hallucination.
Key vocabulary:
- Embedding — A numerical vector representing a piece of text, used for similarity search
- Vector database — A database optimised for storing and querying embeddings
- Retrieval — The step of finding relevant documents based on the user’s query
- Chunking — Splitting documents into smaller pieces for embedding and retrieval
- Grounding — Connecting an LLM’s response to specific retrieved sources
“We implemented RAG to prevent the assistant from hallucinating product details — it now retrieves the relevant product specification before generating a response.”
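
The RAG vocabulary above can be sketched end to end in a few lines. This is a toy under stated assumptions: the "embedding" here is just a bag of word counts so the example runs without any external service, whereas a real pipeline would call an embedding model and store the vectors in a vector database. The product texts and all function names are hypothetical.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Chunking: split a document into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding: lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(query: str, index: list[tuple[str, Counter]], k: int = 2) -> list[str]:
    """Retrieval: return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_prompt(query: str, sources: list[str]) -> str:
    """Grounding: put the retrieved sources in front of the question."""
    numbered = "\n".join(f"[{i}] {s}" for i, s in enumerate(sources, 1))
    return (
        "Answer using only the sources below and cite them by number.\n\n"
        f"{numbered}\n\nQuestion: {query}"
    )


if __name__ == "__main__":
    docs = [  # hypothetical knowledge base
        "The X100 kettle has a 1.7 litre capacity and a 3000 watt heating element.",
        "The T20 toaster has two slots and six browning levels.",
        "Returns are accepted within 30 days of purchase with a receipt.",
    ]
    index = [(c, embed(c)) for d in docs for c in chunk(d)]
    question = "What is the capacity of the X100 kettle?"
    print(build_prompt(question, retrieve(question, index, k=1)))
```

The resulting prompt is what you would send to the LLM. Swapping `embed` for a real embedding model and `index` for a vector database changes the quality of retrieval, not the overall shape of the pipeline.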
Fine-Tuning Vocabulary
Fine-tuning means training an existing pre-trained model further on your own data. This adapts it to a specific domain or task.
| Term | Meaning |
|---|---|
| Base model | The original pre-trained model before any fine-tuning |
| Fine-tuning | Continuing training on a domain-specific dataset to adapt the model’s behaviour |
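| LoRA | Low-Rank Adaptation: an efficient fine-tuning technique that freezes the base model’s weights and trains small low-rank matrices alongside them, so far fewer parameters are updated |
| Instruction tuning | Training a model to follow instructions, typically using pairs of instructions and desired responses |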
| RLHF | Reinforcement Learning from Human Feedback — used to align models with human preferences |
Fine-tuning is powerful but expensive. Many teams find that prompt engineering and RAG can achieve their goals without the cost and complexity of fine-tuning.
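
A quick calculation shows why LoRA is cheaper than full fine-tuning. In the standard LoRA formulation, a weight matrix of shape d_out × d_in stays frozen, and two small matrices of shapes d_out × r and r × d_in are trained instead. The sketch below only counts parameters; the 4096 × 4096 layer size and rank 8 are illustrative assumptions, not the dimensions of any particular model.

```python
def full_params(d_out: int, d_in: int) -> int:
    """Parameters updated when fine-tuning one weight matrix directly."""
    return d_out * d_in


def lora_params(d_out: int, d_in: int, rank: int) -> int:
    """Parameters trained by LoRA for the same matrix: B (d_out x r) plus A (r x d_in)."""
    return rank * (d_out + d_in)


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of the matrix's parameter count that LoRA actually trains."""
    return lora_params(d_out, d_in, rank) / full_params(d_out, d_in)


if __name__ == "__main__":
    d_out, d_in, rank = 4096, 4096, 8  # illustrative sizes only
    print("full fine-tuning:", full_params(d_out, d_in))      # 16777216
    print("LoRA:", lora_params(d_out, d_in, rank))            # 65536
    print(f"fraction trained: {trainable_fraction(d_out, d_in, rank):.4%}")
```

For this one layer, LoRA trains 65,536 parameters instead of about 16.8 million, which is 1/256 of the original count. Real savings depend on which layers receive adapters and on the rank you choose.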
Model Evaluation Terms
| Term | Meaning |
|---|---|
| Benchmark | A standardised test used to compare model performance |
| Precision | Of all items the model labelled positive, how many were actually positive? |
| Recall | Of all actual positive items, how many did the model correctly identify? |
| F1 score | The harmonic mean of precision and recall — useful when both matter |
| Evals | Short for evaluations — the process of measuring an LLM’s output quality |
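
Precision, recall, and F1 are easy to compute by hand for a binary classifier, which makes the definitions above easier to remember. The following is a minimal sketch using plain Python; the labels are made up for illustration, and in practice you would more likely use an existing metrics library.

```python
def precision_recall_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for binary labels (1 = positive, 0 = negative)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)  # false negatives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # Harmonic mean of precision and recall; defined as 0 when both are 0.
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]  # made-up ground truth
    y_pred = [1, 1, 0, 0, 1, 0, 0, 0]  # made-up model predictions
    p, r, f1 = precision_recall_f1(y_true, y_pred)
    print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Here the model labelled three items positive and two of them were right (precision 2/3), but it found only two of the four real positives (recall 1/2), giving an F1 of 4/7.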
A Note on “Evals”
In LLM development, “running evals” has become standard vocabulary. It means systematically testing model outputs against expected answers or quality criteria. Unlike traditional software tests, which usually check for one exact result, evals often rely on human raters or a judge model (a second LLM that scores the first model’s outputs), because many different outputs can be acceptable.
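
In its simplest form, an eval run is a loop over test cases with a grader and a score at the end. The sketch below assumes an exact-match grader and uses a hypothetical `stub_model` with canned answers in place of a real LLM call; a human rater or a judge model would slot in where `exact_match` is passed.

```python
from typing import Callable

Case = tuple[str, str]  # (prompt, expected answer)


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible grader: case-insensitive match after trimming whitespace."""
    return output.strip().lower() == expected.strip().lower()


def run_evals(
    model: Callable[[str], str],
    cases: list[Case],
    grader: Callable[[str, str], bool] = exact_match,
) -> float:
    """Run every case through the model and return the fraction that pass."""
    if not cases:
        raise ValueError("need at least one eval case")
    passed = sum(1 for prompt, expected in cases if grader(model(prompt), expected))
    return passed / len(cases)


def stub_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call, returning canned answers."""
    canned = {
        "Capital of France?": "paris ",
        "2 + 2 =": "5",  # deliberately wrong so the eval catches it
    }
    return canned.get(prompt, "cold")


if __name__ == "__main__":
    cases = [
        ("Capital of France?", "Paris"),
        ("2 + 2 =", "4"),
        ("Opposite of hot?", "cold"),
    ]
    print(f"pass rate: {run_evals(stub_model, cases):.0%}")
```

Keeping the grader as a parameter is a deliberate choice in this sketch: you can start with exact match for questions that have one right answer and later swap in a rubric-based or judge-model grader for open-ended outputs without changing the loop.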
Example Sentences
- “The context window limitation means we can’t feed the entire document to the model at once — we’ll need to implement chunking and retrieval.”
- “We noticed the model was hallucinating API endpoint names, so we added the API documentation to the system prompt to ground its responses.”
- “Temperature is set to 0.2 for the code-generation feature because we want consistent, predictable outputs with little run-to-run variation.”
- “After running evals on 500 representative queries, we found that the RAG pipeline improved factual accuracy by 34% compared to the base model.”
- “Fine-tuning on our internal support tickets took three days and improved the model’s ability to classify issue severity, but RAG would have been faster to set up.”