5 exercises covering the vocabulary every developer working with AI needs in English: RAG architecture, hallucination, AI agents, model quantisation, and prompting techniques.
1 / 5
An ML engineer explains their system: "Instead of searching the entire document database for every query, we first convert documents to vectors using an embedding model and store them in a vector database. At query time, we embed the question and retrieve the k nearest neighbours — semantically similar documents — then pass them as context to the LLM." What architecture is described here?
RAG (Retrieval-Augmented Generation) is an architecture for giving LLMs access to external knowledge without retraining. The retrieval step finds relevant documents; the generation step uses them as context in the prompt.

Key RAG vocabulary:
- Embedding — a dense vector representation of text that captures semantic meaning; similar texts produce vectors that are close in high-dimensional space.
- Vector database — a database optimised for storing and querying vectors by similarity (Pinecone, Weaviate, Chroma, pgvector).
- Semantic search — search by meaning rather than keyword matching.
- k-NN (k-Nearest Neighbours) — retrieve the k most similar vectors to the query.
- Chunking — splitting documents into smaller pieces before embedding, to fit context window limits.
- Context window — the maximum number of tokens an LLM can process in a single call.
- Grounding — connecting LLM output to specific, verifiable source documents.

RAG vs fine-tuning: RAG is preferred when the knowledge changes frequently; fine-tuning is preferred when you need to change the model's behaviour or style.
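The retrieval step can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: the `embed()` function here is a hash-based stand-in for a real embedding model, and the "vector database" is a plain in-memory array.

```python
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy stand-in for a real embedding model: hashes words into a
    fixed-size vector. Real systems use a trained embedding model."""
    v = np.zeros(dim)
    for word in text.lower().split():
        v[hash(word) % dim] += 1.0
    return v

# Indexing: chunk documents, embed each chunk, store the vectors.
chunks = [
    "Refunds are processed within 14 days of the return request.",
    "Support is available Monday to Friday, 9:00 to 17:00.",
]
index = np.stack([embed(c) for c in chunks])            # shape: (n_chunks, dim)
index /= np.linalg.norm(index, axis=1, keepdims=True)   # normalise for cosine

def retrieve(query: str, k: int = 1) -> list[str]:
    """k-NN retrieval: return the k chunks most similar to the query."""
    q = embed(query)
    q /= np.linalg.norm(q)
    scores = index @ q                    # cosine similarity to every chunk
    top = np.argsort(scores)[::-1][:k]    # indices of the k nearest chunks
    return [chunks[i] for i in top]

# Generation step: pass the retrieved chunks to the LLM as grounded context.
query = "How long do refunds take?"
context = "\n".join(retrieve(query))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {query}"
```

Swapping the toy `embed()` for a real embedding model and the array for a vector database gives the architecture the engineer describes.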
2 / 5
A product manager reads a model evaluation report: "The model hallucinated in 3% of responses — confidently stating facts that were not present in the source documents and could not be verified." What does hallucination mean in the context of LLMs?
Hallucination in LLMs is when the model generates content that is factually incorrect, fabricated, or not grounded in any real source — often with high confidence and plausible language. LLMs don't "know" facts the way a database does; they predict statistically likely next tokens. When the training data has conflicting or sparse information about a topic, the model may confabulate rather than say "I don't know."

Why hallucinations happen: LLMs optimise for fluency and coherence, not factual accuracy; they generalise from patterns; they have no mechanism to verify real-world facts.

Categories:
- Factual hallucination — wrong facts stated confidently (e.g., citing a paper that doesn't exist).
- Instruction hallucination — failing to follow constraints while claiming to.
- Context hallucination — contradicting information provided in the prompt.

Mitigation strategies: RAG (grounding in source documents), self-consistency checks, output verification pipelines, Constitutional AI, RLHF.

In conversation: "Never use this model for medical or legal information without a human-in-the-loop review — hallucination rates are too high for high-stakes outputs."
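One of the mitigation ideas above, checking that an answer is grounded in the source documents, can be illustrated with a deliberately crude sketch. This uses word overlap as a proxy; real verification pipelines use entailment models or LLM judges, and the threshold here is an arbitrary assumption.

```python
def ungrounded_sentences(answer: str, sources: list[str],
                         threshold: float = 0.5) -> list[str]:
    """Flag answer sentences whose content words mostly do not appear
    in the sources. A crude proxy for grounding, for illustration only."""
    source_words = set(" ".join(sources).lower().split())
    flagged = []
    for sentence in answer.split("."):
        words = [w for w in sentence.lower().split() if len(w) > 3]
        if not words:
            continue
        grounded = sum(w in source_words for w in words) / len(words)
        if grounded < threshold:    # weak overlap: possible hallucination
            flagged.append(sentence.strip())
    return flagged

sources = ["The report was published in March 2024 by the audit team."]
answer = "The report was published in March 2024. It won a national award."
print(ungrounded_sentences(answer, sources))
# -> ['It won a national award']
```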
3 / 5
An AI engineer writes in their architecture doc: "The agent uses a ReAct loop — it reasons about the task, selects a tool, observes the result, and repeats until it reaches a final answer or hits the max iteration limit." What is an AI agent in this context?
An AI agent is an LLM-powered system that can autonomously plan, use tools, and iterate across multiple steps to complete a goal. Unlike a single prompt → response interaction, an agent operates in a loop. ReAct (Reason + Act) is a prompting strategy where the model alternates between reasoning about the next action and executing it, using the observation as input to the next reasoning step.

Agent vocabulary:
- Tool use — the model calls external functions (web search, code execution, database queries, API calls).
- Function calling / tool calling — a structured way to invoke external tools from within the model.
- Planning — decomposing a task into sub-steps.
- Memory — short-term (context window) vs long-term (vector DB, external store).
- Orchestration — managing multiple agents or steps (frameworks: LangChain, LlamaIndex, AutoGen, CrewAI).
- Max iterations — a safety limit to prevent infinite loops.
- System prompt — instructions that define the agent's persona, capabilities, and constraints.

In conversation: "We moved from a single-shot prompt to an agent because the task required searching the web, running code, and synthesising results from multiple sources."
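A minimal sketch of the ReAct loop from the architecture doc, assuming a hypothetical `call_llm()` (any chat-completion API) and two illustrative stub tools; the `ACTION:`/`FINAL:` format is an assumption made up for this example, not a standard.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical LLM call -- replace with any chat-completion API."""
    raise NotImplementedError

TOOLS = {
    "search": lambda q: f"(web results for {q!r})",  # illustrative stub
    "calculate": lambda expr: str(eval(expr)),       # toy calculator only
}

def run_agent(task: str, max_iterations: int = 5) -> str:
    history = f"Task: {task}\n"
    for _ in range(max_iterations):      # safety limit against infinite loops
        # Reason: ask the model for its next step in a fixed format.
        step = call_llm(
            history + "Reply with 'ACTION: <tool> <input>' or 'FINAL: <answer>'."
        )
        if step.startswith("FINAL:"):
            return step.removeprefix("FINAL:").strip()
        # Act: parse and execute the chosen tool.
        _, tool, tool_input = step.split(maxsplit=2)
        observation = TOOLS[tool](tool_input)
        # Observe: feed the result back into the next reasoning step.
        history += f"{step}\nObservation: {observation}\n"
    return "Stopped: hit the max iteration limit."
```

Production frameworks (LangChain, AutoGen, etc.) add structured function calling, error handling, and memory, but the reason–act–observe loop is the core.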
4 / 5
A team discusses model deployment cost: "The 70B model gives better results but inference is too expensive at scale. We're exploring quantisation — going from FP16 to INT4 — to run it on a single GPU." What does quantisation mean?
Quantisation is the process of representing model weights using fewer bits, reducing memory and computational requirements at some accuracy trade-off.

Precision levels:
- FP32 — 32-bit floating point (training standard).
- FP16 / BF16 — 16-bit (common for inference).
- INT8 — 8-bit integer (~4× smaller than FP32).
- INT4 — 4-bit (aggressive, but surprisingly good quality with methods like GPTQ, AWQ).

Common quantisation tools:
- GPTQ — post-training quantisation for transformers.
- AWQ (Activation-aware Weight Quantisation) — preserves accuracy better at INT4.
- llama.cpp / GGUF format — enables running quantised models on CPU/Mac.
- bitsandbytes — 8-bit/4-bit quantisation library for PyTorch.

Why it matters: a 70B model in FP16 requires ~140 GB of VRAM for the weights alone; in INT4 it requires ~35 GB, fitting on a single high-end GPU.

Related terms:
- VRAM — GPU memory.
- Batch size — number of inputs processed in parallel.
- Throughput — tokens generated per second.
- Latency — response delay, often measured as time to first token.

In conversation: "We quantised the model to 4-bit and got 15% accuracy degradation on our benchmark — acceptable for our use case but not for medical applications."
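The memory arithmetic and the core idea of quantisation both fit in a short sketch. The second half is a minimal symmetric INT8 example for illustration; real methods like GPTQ and AWQ add calibration data, per-group scales, and error correction.

```python
import numpy as np

# Memory arithmetic from above: bytes per parameter x parameter count.
params = 70e9
for name, bytes_per_param in [("FP32", 4), ("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    print(f"{name}: ~{params * bytes_per_param / 1e9:.0f} GB of weights")
# FP32: ~280 GB, FP16: ~140 GB, INT8: ~70 GB, INT4: ~35 GB

# Minimal symmetric INT8 quantisation of one weight tensor.
w = np.random.randn(4, 4).astype(np.float32)
scale = np.abs(w).max() / 127                  # map the largest weight to +/-127
w_int8 = np.round(w / scale).astype(np.int8)   # store 1 byte per weight
w_dequant = w_int8.astype(np.float32) * scale  # approximate reconstruction
print("max quantisation error:", np.abs(w - w_dequant).max())
```

The reconstruction error is the accuracy trade-off: fewer bits per weight means a coarser grid and a larger gap between `w` and `w_dequant`.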
5 / 5
A developer describes their prompting approach: "I use chain-of-thought prompting — I add 'think step by step' to complex reasoning tasks. The model's performance on multi-step problems improved significantly compared to direct-answer prompts." What is chain-of-thought (CoT) prompting?
Chain-of-Thought (CoT) prompting is a technique where you instruct the LLM to reason through a problem step by step before giving a final answer. This dramatically improves performance on arithmetic, logical reasoning, and multi-step tasks. The phrase "Let's think step by step" (Kojima et al., 2022) is the canonical zero-shot CoT trigger.

Prompting techniques vocabulary:
- Zero-shot prompting — asking the model without any examples.
- Few-shot prompting — providing 2–5 examples in the prompt.
- Zero-shot CoT — "think step by step" without examples.
- Few-shot CoT — examples that include reasoning steps.
- Self-consistency — generate multiple CoT solutions and take the majority answer.
- Tree of Thoughts (ToT) — explore multiple reasoning paths and evaluate them.
- System prompt — instructions at the start of the context that shape the model's behaviour.
- Temperature — controls randomness: 0 = deterministic (greedy), 1+ = creative/varied.
- Top-p / nucleus sampling — an alternative randomness control.

Why CoT works: it forces the model to "use" intermediate computations in its context window rather than jumping directly to an answer, reducing errors in complex reasoning.
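Zero-shot CoT and self-consistency can be sketched together, again assuming a hypothetical `call_llm()` for any chat-completion API; the "Answer: <value>" final line is a formatting convention invented for this example.

```python
from collections import Counter

def call_llm(prompt: str, temperature: float = 0.0) -> str:
    """Hypothetical LLM call -- replace with any chat-completion API."""
    raise NotImplementedError

question = "A train covers 60 km in 40 minutes. How far does it travel in 2 hours?"

direct_prompt = question                               # direct-answer prompt
cot_prompt = question + "\nLet's think step by step."  # zero-shot CoT trigger

def self_consistency(prompt: str, n: int = 5) -> str:
    """Self-consistency: sample n CoT solutions at temperature > 0 and
    return the majority answer. Assumes each response ends with a line
    like 'Answer: <value>'."""
    finals = [call_llm(prompt, temperature=0.8).strip().splitlines()[-1]
              for _ in range(n)]
    return Counter(finals).most_common(1)[0][0]
```

The only difference between the two prompts is the trigger phrase, which is what makes zero-shot CoT so cheap to try on any multi-step task.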