AI and Machine Learning Vocabulary Every Developer Should Know

A practical guide to AI and LLM vocabulary for developers — tokens, RAG, fine-tuning, hallucination, context window, and more with real usage examples.

You don’t need to be a machine learning researcher to work with AI systems. But you do need to understand the vocabulary — so you can read documentation, discuss integrations with your team, and evaluate tools intelligently. This guide covers the essential AI and LLM vocabulary that every developer working in or around AI systems should know.

Large Language Model (LLM) Vocabulary

| Term | Definition | Practical note |
|---|---|---|
| Token | The basic unit of text an LLM processes; roughly 0.75 words in English | API costs are usually priced per token |
| Context window | The maximum amount of text (in tokens) an LLM can process at once | A larger context lets the model see more material, but each request costs more |
| Temperature | A setting controlling how random the model’s outputs are (near 0 = most predictable, 1 and above = more varied) | Use low temperature for code generation, higher for creative tasks |
| Hallucination | When an LLM generates confident-sounding but incorrect or fabricated information | Always verify LLM outputs for factual claims |
| Prompt | The input text you send to an LLM | Prompt engineering is the skill of crafting effective prompts |
| System prompt | Instructions given to the model before the user’s message | Used to set behaviour, tone, and constraints |
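
Two entries in the table above lend themselves to a quick sketch: the words-per-token rule of thumb and the effect of temperature. The snippet below is a minimal illustration, not any provider’s actual tokenizer or sampler. The function names, the whitespace-based word count, and the example scores are assumptions made for illustration; real tokenizers split text differently, so treat the estimate as a rough budget check only.

```python
import math

# Rule of thumb from the table above: one token is roughly 0.75 English words.
WORDS_PER_TOKEN = 0.75


def estimate_tokens(text: str) -> int:
    """Rough token estimate from a whitespace word count (not a real tokenizer)."""
    words = len(text.split())
    return math.ceil(words / WORDS_PER_TOKEN)


def fits_context(text: str, context_window: int, reserved_for_output: int = 0) -> bool:
    """Check whether the estimated prompt size plus an output budget fits the window."""
    return estimate_tokens(text) + reserved_for_output <= context_window


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Turn raw next-token scores into probabilities, scaled by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive in this sketch")
    scaled = [score / temperature for score in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(score - peak) for score in scaled]
    total = sum(exps)
    return [value / total for value in exps]


# Hypothetical scores for three candidate next tokens.
example_logits = [2.0, 1.0, 0.0]
for t in (0.2, 1.0, 1.5):
    probs = softmax_with_temperature(example_logits, t)
    print(t, [round(p, 3) for p in probs])
```

Dividing the scores by a temperature below 1 concentrates probability on the top-scoring token, while a value above 1 spreads it out, which is why low settings give more repeatable output. This sketch rejects a temperature of exactly 0 because it cannot be divided by; in practice that setting is commonly treated as always choosing the top-scoring token.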

Retrieval-Augmented Generation (RAG)

RAG is a pattern where relevant documents are retrieved from a knowledge base and provided to the LLM as context before it generates a response. This grounds the model’s output in real data and reduces hallucination.

Key vocabulary:

  • Embedding — A numerical vector representing a piece of text, used for similarity search
  • Vector database — A database optimised for storing and querying embeddings
  • Retrieval — The step of finding relevant documents based on the user’s query
  • Chunking — Splitting documents into smaller pieces for embedding and retrieval
  • Grounding — Connecting an LLM’s response to specific retrieved sources

“We implemented RAG to prevent the assistant from hallucinating product details — it now retrieves the relevant product specification before generating a response.”
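
The vocabulary above maps onto a small pipeline: chunk the documents, embed each chunk, retrieve the chunks most similar to the query, and ground the prompt in them. The sketch below walks through those steps using only the standard library. It assumes a toy bag-of-words "embedding" in place of a learned embedding model and a plain list in place of a vector database, and the product text is made-up sample data, so it shows the shape of the pattern rather than a production setup.

```python
import math
import re
from collections import Counter


def chunk(text: str, max_words: int = 40) -> list[str]:
    """Chunking: split a document into pieces of at most max_words words."""
    words = text.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def embed(text: str) -> Counter:
    """Toy embedding: word counts. Real systems use a learned embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Retrieval: return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


def build_prompt(query: str, sources: list[str]) -> str:
    """Grounding: place the retrieved sources in the prompt ahead of the question."""
    context = "\n".join(f"[{i + 1}] {s}" for i, s in enumerate(sources))
    return (
        "Answer using only the sources below.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )


# Made-up knowledge base for illustration.
docs = [
    "The X100 router supports Wi-Fi 6 and has four Ethernet ports.",
    "Returns are accepted within 30 days of purchase.",
    "The X100 router ships with a two-year warranty.",
]
question = "How long is the warranty on the X100 router?"
top = retrieve(question, docs, k=1)
print(build_prompt(question, top))
```

The resulting prompt is what would be sent to the LLM; the model call itself is left out because it depends on the provider.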

Fine-Tuning Vocabulary

Fine-tuning means training an existing pre-trained model further on your own data. This adapts it to a specific domain or task.

| Term | Meaning |
|---|---|
| Base model | The original pre-trained model before any fine-tuning |
| Fine-tuning | Continuing training on a domain-specific dataset to adapt the model’s behaviour |
| LoRA | Low-Rank Adaptation: an efficient fine-tuning technique that trains a small set of added low-rank weights instead of updating every parameter |
| Instruction tuning | Training a model to follow instructions, typically using instruction-response pairs |
| RLHF | Reinforcement Learning from Human Feedback, used to align models with human preferences |

Fine-tuning is powerful, but it costs time, compute, and curated training data. Many teams find that prompt engineering and RAG can achieve their goals without the cost and complexity of fine-tuning.
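
To make the LoRA entry above concrete, the sketch below counts trainable weights for a single weight matrix and shows how a low-rank update is merged back in. It assumes the usual description of the technique, where the update to a d_out × d_in matrix W is the product of two small matrices B (d_out × rank) and A (rank × d_in), scaled by alpha / rank. The function names and the dimensions are illustrative assumptions, and real fine-tuning libraries handle this on tensors rather than nested lists.

```python
def full_update_params(d_out: int, d_in: int) -> int:
    """Weights trained when the whole d_out x d_in matrix is updated."""
    return d_out * d_in


def lora_update_params(d_out: int, d_in: int, rank: int) -> int:
    """Weights trained with LoRA: B (d_out x rank) plus A (rank x d_in)."""
    return rank * (d_out + d_in)


def matmul(x: list[list[float]], y: list[list[float]]) -> list[list[float]]:
    """Plain matrix product for small nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def merge_lora(
    w: list[list[float]],
    b: list[list[float]],
    a: list[list[float]],
    alpha: float,
    rank: int,
) -> list[list[float]]:
    """Return W + (alpha / rank) * (B @ A) without modifying W."""
    scale = alpha / rank
    delta = matmul(b, a)
    return [
        [w_val + scale * d_val for w_val, d_val in zip(w_row, d_row)]
        for w_row, d_row in zip(w, delta)
    ]


# Illustrative sizes: one 4096 x 4096 matrix adapted with rank 8.
full = full_update_params(4096, 4096)     # 16,777,216 weights
lora = lora_update_params(4096, 4096, 8)  # 65,536 weights
print(f"LoRA trains {lora / full:.2%} of the weights in this matrix")
```

The base weights stay frozen during training; only A and B are learned, which is where the saving comes from.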

Model Evaluation Terms

| Term | Meaning |
|---|---|
| Benchmark | A standardised test used to compare model performance |
| Precision | Of all items the model labelled positive, how many were actually positive? |
| Recall | Of all actual positive items, how many did the model correctly identify? |
| F1 score | The harmonic mean of precision and recall; useful when both matter |
| Evals | Short for evaluations: the process of measuring an LLM’s output quality |
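
The three classification metrics above can be computed directly from counts of true positives, false positives, and false negatives. The sketch below assumes binary labels encoded as 1 (positive) and 0 (negative); the function name and the sample labels are made up for illustration.

```python
def precision_recall_f1(y_true: list[int], y_pred: list[int]) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 for binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # correctly flagged
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # flagged but negative
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    # Harmonic mean: low if either precision or recall is low.
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Sample data: 4 actual positives, 4 predicted positives, 3 of them correct.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(precision, recall, f1)
```

Here the model makes one false positive and misses one positive, so precision and recall are both 3 out of 4.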

A Note on “Evals”

In LLM development, “running evals” has become standard vocabulary. It means systematically testing model outputs against expected answers or quality criteria. Unlike traditional software testing, evals often involve human raters or judge models (another LLM used to evaluate the first).
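
A very small eval run might look like the sketch below, which scores a model against expected answers using exact match after normalising case and whitespace. Everything here is an assumption for illustration: `toy_model` is a stand-in with canned answers, not a real LLM call, and exact match is only the simplest scoring rule. Human raters or a judge model would replace the comparison step for open-ended outputs.

```python
from typing import Callable


def normalise(text: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not fail a case."""
    return " ".join(text.lower().split())


def run_evals(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Return the fraction of cases where the model output matches the expected answer."""
    if not cases:
        raise ValueError("at least one eval case is required")
    passed = sum(
        1 for prompt, expected in cases
        if normalise(model(prompt)) == normalise(expected)
    )
    return passed / len(cases)


def toy_model(prompt: str) -> str:
    """Hypothetical stand-in for an LLM call, returning canned answers."""
    canned = {
        "2 + 2 =": "4",
        "Capital of France?": "paris",
    }
    return canned.get(prompt, "I don't know")


# Each case pairs a prompt with the expected answer.
cases = [
    ("2 + 2 =", "4"),
    ("Capital of France?", "Paris"),
    ("Largest planet?", "Jupiter"),
]
score = run_evals(toy_model, cases)
print(f"pass rate: {score:.0%}")
```

Keeping the cases in a plain list makes it easy to rerun the same set after every prompt or model change and compare pass rates.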

Example Sentences

  1. “The context window limitation means we can’t feed the entire document to the model at once — we’ll need to implement chunking and retrieval.”
  2. “We noticed the model was hallucinating API endpoint names, so we added the API documentation to the system prompt to ground its responses.”
  3. “Temperature is set to 0.2 for the code-generation feature because we want consistent, low-variance outputs.”
  4. “After running evals on 500 representative queries, we found that the RAG pipeline improved factual accuracy by 34% compared to the base model.”
  5. “Fine-tuning on our internal support tickets took three days and improved the model’s ability to classify issue severity, but RAG would have been faster to set up.”