AI and Machine Learning Vocabulary: LLM, RAG, Embeddings Explained
Plain-English definitions of 35 AI and machine learning terms: LLM, RAG, embeddings, tokens, hallucination, fine-tuning, prompt engineering, vector database, and more.
AI and machine learning have their own dense vocabulary — and since 2022, much of it has moved from research papers into everyday engineering conversations. If you work with AI tools, LLMs, or ML pipelines, you need to know this language. This guide explains the most important terms clearly, without assuming a mathematics background.
Large Language Models (LLMs)
LLM (Large Language Model)
A Large Language Model is a type of AI model trained on massive amounts of text data, capable of generating, translating, summarising, and answering questions in natural language. Examples: GPT-4, Claude, Gemini, Llama.
“We use an LLM to power the support chatbot.”
Token
In the context of LLMs, a token is a unit of text the model processes. A token is roughly 3–4 characters or about ¾ of a word in English. “Hello, world!” ≈ 4 tokens.
Why it matters: LLMs have a context window — a limit on how many tokens they can process at once. Pricing for cloud LLM APIs is usually per token.
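The rule of thumb above can be turned into a quick cost-planning heuristic. This is only an approximation I am sketching here — real providers count tokens with trained tokenizers (e.g. BPE-based ones), and exact counts require the provider's own tokenizer library:

```python
def estimate_tokens(text: str) -> int:
    """Rough token estimate using the ~4-characters-per-token rule of thumb.

    This is a planning heuristic only; actual token counts depend on the
    model's tokenizer and can differ noticeably for code or non-English text.
    """
    return max(1, round(len(text) / 4))

print(estimate_tokens("The quick brown fox jumps over the lazy dog"))
```

A heuristic like this is useful for budgeting prompts against a context window or estimating API costs before making a call; switch to the real tokenizer when precision matters.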
Context Window
The context window is the total amount of text (in tokens) that an LLM can take in and consider in one request. Larger context windows allow more background information, longer conversations, or bigger documents.
“The model has a 128k context window — it can process the entire codebase at once.”
Prompt
A prompt is the input you give to an LLM. In a chat interface, it is your message. In an API call, it is the text you send to the model.
Prompt Engineering
Prompt engineering is the practice of designing and refining prompts to get better outputs from LLMs. Techniques include: providing examples (few-shot prompting), assigning a role (“You are a senior engineer…”), and structuring instructions clearly.
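The role-plus-examples techniques above can be combined mechanically. The layout below (a role line followed by Input/Output pairs) is one common convention, not a requirement of any particular API:

```python
def build_few_shot_prompt(role: str, examples: list[tuple[str, str]], query: str) -> str:
    """Assemble a prompt from a role, few-shot examples, and the user's query."""
    lines = [f"You are {role}.", ""]
    for example_input, example_output in examples:
        lines.append(f"Input: {example_input}")
        lines.append(f"Output: {example_output}")
        lines.append("")
    # End with the real query and an open "Output:" for the model to complete.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "a senior engineer",
    [("2 + 2", "4"), ("3 * 3", "9")],
    "5 - 1",
)
print(prompt)
```

The examples teach the model the expected format by demonstration, which is often more reliable than describing the format in words.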
Hallucination
Hallucination is when an LLM confidently generates information that is factually incorrect, made up, or inconsistent with the context. An invented function name, a citation to a paper that does not exist, or incorrect API documentation are all hallucinations.
“The model hallucinated a nonexistent library method — always verify LLM-generated code.”
System Prompt
A system prompt is a special instruction given to the LLM before the user’s input, setting its behaviour, role, or constraints. Usually not visible to end users.
Fine-Tuning
Fine-tuning is the process of continuing to train a pre-trained model on a smaller, specific dataset to specialise its behaviour. A general LLM can be fine-tuned on medical records, legal texts, or company-specific data.
RAG (Retrieval-Augmented Generation)
RAG is a technique that enhances LLM responses by retrieving relevant documents from a knowledge base and including them in the prompt. Instead of relying on the model’s trained memory, RAG retrieves fresh, relevant context before generating a response.
“We use RAG so the chatbot can answer questions about our internal documentation — it searches the latest docs before generating an answer.”
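The retrieve-then-generate flow can be sketched in a few lines. Note the retrieval step here is a toy keyword-overlap scorer purely for illustration — real RAG systems retrieve by embedding similarity from a vector database:

```python
def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Toy retrieval: rank documents by how many query words they share."""
    query_words = set(query.lower().split())
    return sorted(
        docs,
        key=lambda d: len(query_words & set(d.lower().split())),
        reverse=True,
    )[:k]

def build_rag_prompt(query: str, docs: list[str]) -> str:
    """Splice the retrieved context into the prompt before the question."""
    context = "\n".join(f"- {d}" for d in retrieve(query, docs))
    return f"Use only this context:\n{context}\n\nQuestion: {query}\nAnswer:"

docs = [
    "Deploys run via the deploy.sh script on the main branch.",
    "Lunch is served at noon in the cafeteria.",
    "Rollbacks use the previous release tag.",
]
print(build_rag_prompt("deploys main branch", docs))
```

The key idea survives the simplification: the model answers from context supplied at request time, not from what it memorised during training.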
Embeddings & Vector Search
Embedding
An embedding is a numerical representation of text (or images, or other data) as a vector of floating-point numbers. Semantically similar texts have vectors that are close together in mathematical space. LLMs and search systems use embeddings to understand meaning, not just keywords.
“We embed user queries and search for the most similar document embeddings.”
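"Close together in mathematical space" is usually measured with cosine similarity. A minimal sketch with hand-made toy vectors — real embedding models output hundreds or thousands of dimensions, produced by the model rather than by hand:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 = same direction (similar meaning), ~0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Hypothetical 3-dimensional "embeddings" for illustration only.
cat = [0.9, 0.1, 0.0]
kitten = [0.85, 0.15, 0.05]
invoice = [0.0, 0.2, 0.95]
print(cosine_similarity(cat, kitten) > cosine_similarity(cat, invoice))
```

Because similarity is computed on directions rather than exact words, "cat" and "kitten" score as close even though the strings share no characters.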
Vector Database
A vector database is a database optimised for storing and searching embeddings. It can find vectors that are most similar to a query vector — enabling semantic search. Examples: Pinecone, Weaviate, Qdrant, pgvector (PostgreSQL extension).
Semantic Search
Semantic search finds results based on meaning, not just keyword matching. It uses embeddings to find documents that are conceptually similar to the query, even if they do not share exact words.
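At small scale, semantic search is just a brute-force scan for the most similar vectors — which is what a vector database does at scale, using approximate-nearest-neighbour indexes instead of comparing every vector. A sketch with hypothetical pre-computed embeddings (real ones would come from an embedding model):

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def semantic_search(query_vec: list[float],
                    index: list[tuple[str, list[float]]],
                    k: int = 2) -> list[tuple[str, list[float]]]:
    """Return the k documents whose embeddings are most similar to the query."""
    return sorted(index, key=lambda item: cosine(query_vec, item[1]), reverse=True)[:k]

# Hypothetical (title, embedding) pairs for illustration.
index = [
    ("refund policy", [0.9, 0.1, 0.0]),
    ("office lunch menu", [0.0, 0.2, 0.95]),
    ("returns and exchanges", [0.8, 0.2, 0.1]),
]
results = semantic_search([0.85, 0.15, 0.05], index)
print([title for title, _ in results])
```

Note that "refund policy" and "returns and exchanges" both rank above the lunch menu despite sharing no keywords with each other — ranking by meaning is exactly what distinguishes semantic search from keyword matching.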
Model Training Concepts
Training Data
Training data is the dataset used to train a machine learning model. The quality and size of training data heavily influence model quality.
Overfitting
Overfitting occurs when a model learns the training data too specifically — including its noise — and performs poorly on new, unseen data.
Underfitting
Underfitting means the model has not learned enough from the training data and performs poorly even on training examples.
Inference
Inference is the process of using a trained model to make predictions on new data. Training is done once (or periodically); inference happens every time a user makes a request.
Ground Truth
Ground truth is the verified correct answer used to evaluate model predictions. In supervised learning, training data includes ground truth labels.
Label / Annotation
A label (or annotation) is the correct output associated with a training example. In image classification, a label might be “cat” or “dog.” Humans often annotate training data manually.
Supervised vs. Unsupervised Learning
- Supervised learning: the model learns from labelled examples (input → known output)
- Unsupervised learning: the model finds patterns in unlabelled data (e.g., clustering)
- Reinforcement learning: the model learns by receiving rewards or penalties for its actions
Evaluation & Performance
Accuracy
Accuracy = the proportion of correct predictions out of all predictions. But accuracy alone is misleading when classes are imbalanced: if 99% of emails are not spam, a filter that never flags anything is 99% accurate yet catches zero spam.
Precision and Recall
- Precision = of all predicted positives, how many were actually positive? (Avoids false positives)
- Recall = of all actual positives, how many did the model find? (Avoids false negatives)
F1 Score
The F1 score is the harmonic mean of precision and recall — a single number that balances both.
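These three metrics fall straight out of the confusion-matrix counts (true positives, false positives, false negatives). The spam-filter numbers below are invented for illustration:

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)          # of flagged items, how many were right
    recall = tp / (tp + fn)             # of actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Hypothetical spam filter: flagged 8 emails, 6 were spam (TP), 2 were not (FP),
# and it missed 4 spam emails (FN).
p, r, f1 = precision_recall_f1(tp=6, fp=2, fn=4)
print(round(p, 2), round(r, 2), round(f1, 2))
```

The harmonic mean punishes imbalance: a model with perfect precision but near-zero recall gets a near-zero F1, which the arithmetic mean would hide.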
Benchmark
A benchmark is a standardised test used to measure model performance. LLM benchmarks: MMLU, HumanEval (coding), HellaSwag, BIG-bench.
AI Infrastructure & Tools
GPU / TPU
- GPU (Graphics Processing Unit) — originally for rendering, now the standard hardware for training and running neural networks
- TPU (Tensor Processing Unit) — Google’s custom chip, optimised specifically for machine learning
Model Weights
Model weights (or parameters) are the numerical values learned during training. When someone says “a 7B model,” they mean a model with 7 billion parameters.
Checkpoint
A checkpoint is a saved snapshot of model weights during training. Used to resume training and to evaluate model quality at different training stages.
Pipeline
In ML, a pipeline is a sequence of data processing steps — preprocessing, transformation, model inference, and post-processing — usually automated.
MLOps
MLOps (Machine Learning Operations) applies DevOps practices to machine learning: versioning models, automating training and deployment, monitoring model performance in production.
Practical Terms for Engineers Using LLM APIs
| Term | Meaning |
|---|---|
| Temperature | Controls randomness (0 = deterministic, 1+ = creative) |
| Top-p (nucleus sampling) | Alternative to temperature for controlling output diversity |
| Max tokens | The maximum output length |
| Stop sequence | A string that tells the model to stop generating |
| Few-shot | Providing examples in the prompt |
| Zero-shot | No examples — just the instruction |
| Chain-of-thought | Prompting the model to reason step by step |
| Streaming | Receiving output token by token as it is generated |
| API rate limit | Max requests per minute/hour/day from the model provider |
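Most of the table's parameters travel together in a single request. A hypothetical request payload in the common OpenAI-style chat shape — the exact field names, defaults, and model identifiers vary by provider, and `"example-model"` is a placeholder, not a real model:

```python
# Hypothetical chat-completion request; field names follow the widespread
# OpenAI-style convention but are not guaranteed for any given provider.
request = {
    "model": "example-model",  # placeholder model name
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},  # system prompt
        {"role": "user", "content": "Summarise RAG in one sentence."},  # user prompt
    ],
    "temperature": 0.2,   # low randomness for factual output
    "top_p": 0.9,         # nucleus-sampling cutoff (alternative diversity control)
    "max_tokens": 200,    # cap on output length
    "stop": ["\n\n"],     # stop sequence: generation halts if this appears
    "stream": True,       # receive output token by token as it is generated
}
print(request["model"], request["temperature"])
```

A common practical note: providers generally recommend adjusting temperature or top_p, not both at once, since they control overlapping aspects of output diversity.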