Fine-tuning: PEFT, LoRA, instruction tuning, base model vs. fine-tuned, RLHF, DPO
1 / 5
An ML engineer explains RAG architecture to the product team: "RAG — Retrieval-Augmented Generation — solves the knowledge cutoff problem. Instead of relying only on what the LLM learned during training, we retrieve relevant documents from a knowledge base at inference time and inject them into the prompt. The pipeline: user query → embed the query → vector search → retrieve top-K chunks → inject into prompt context → LLM generates a grounded answer. The quality of the answer depends on retrieval quality, not just the LLM." What problem does RAG solve, and what is the role of embeddings in the pipeline?
RAG (Retrieval-Augmented Generation): augments an LLM's response by retrieving relevant context from an external knowledge base at inference time.
Solves: knowledge cutoff (training data goes stale), proprietary knowledge (confidential data you can't or don't want to fine-tune on), and context window limits (the whole knowledge base can't fit in the prompt).
Embeddings: dense numerical representations (vectors) of text in which semantically similar texts are geometrically close. Generated by embedding models (e.g., OpenAI text-embedding-3, Cohere Embed, BGE). In the pipeline, the query and every chunk are embedded into the same vector space, so finding relevant chunks becomes a nearest-neighbor search.
RAG pipeline in detail:
1) Indexing (offline): load documents → chunk → embed each chunk → store in vector store.
2) Retrieval (online): embed user query → vector similarity search → return top-K most similar chunks.
3) Generation: inject retrieved chunks into prompt → LLM generates grounded answer.
Vector stores: Pinecone, Weaviate, Qdrant, pgvector, Chroma.
Vocabulary:
Chunk: a segment of a source document. Chunking strategy significantly affects retrieval quality.
Top-K: number of chunks returned. Too many: noise and context stuffing. Too few: missing context.
Semantic search: search by meaning (embedding similarity) rather than keyword matching.
Grounded: the LLM's answer is supported by retrieved evidence.
Hallucination: the LLM generates plausible-sounding but incorrect facts.
In conversation: 'RAG quality is 80% retrieval quality. If you retrieve the wrong chunks, even GPT-4 gives wrong answers.'
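The indexing, retrieval, and generation steps can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector store" is a plain list, and all function names (`chunk`, `build_index`, `retrieve`, `build_prompt`) are hypothetical. The final LLM call is left out.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def chunk(document, max_words=12):
    """Split a document into fixed-size word windows (one simple chunking strategy)."""
    words = document.split()
    return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]


def build_index(documents):
    """Indexing (offline): chunk every document and store (chunk, vector) pairs."""
    return [(c, embed(c)) for doc in documents for c in chunk(doc)]


def retrieve(index, query, top_k=2):
    """Retrieval (online): embed the query and return the top-K most similar chunks."""
    q = embed(query)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [text for text, _ in ranked[:top_k]]


def build_prompt(query, chunks):
    """Generation input: inject the retrieved chunks into the prompt context."""
    context = "\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )
```

A real system would swap `embed` for an embedding model and the list for a vector store; the shape of the pipeline stays the same.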
2 / 5
An LLMOps engineer presents evaluation metrics to the team: "We track three core RAG metrics. Faithfulness: does the generated answer stay within the retrieved context, or does the model add unsupported claims? Relevance: does the answer actually address the user's question? Groundedness: can every factual claim in the answer be traced to a specific chunk? We use LLM-as-judge for these — a second LLM rates each response on a 1-5 scale with reasoning. It's not perfect but it scales." What is hallucination in the context of LLMs, and why is faithfulness an important metric?
Hallucination: the LLM generates information that sounds plausible but is not grounded in the provided context or in real facts. In RAG specifically: the model goes beyond the retrieved chunks and adds its own 'knowledge', which may be wrong.
Two types:
Intrinsic hallucination: contradicts the provided context.
Extrinsic hallucination: adds information not supported by the context (may be true, but unverifiable).
Faithfulness: every claim in the output can be attributed to the input context. High faithfulness = the LLM only says what the documents say. It matters because the retrieved context is the only evidence a RAG answer can be checked against; an unfaithful answer cannot be verified.
RAG evaluation vocabulary:
Faithfulness: output ⊆ retrieved context. Catches hallucinations.
Answer relevance: output addresses the query. Catches off-topic answers.
Context relevance / recall: retrieved chunks contain the information needed to answer. Catches retrieval failures.
Groundedness: each sentence can be traced to a specific source.
LLM-as-judge: use a separate LLM (e.g., GPT-4) to evaluate outputs against a rubric. Scalable but has its own biases.
Ragas: open-source RAG evaluation framework.
ARES: another RAG evaluation framework.
Human eval: manual evaluation by annotators. Gold standard but expensive.
In conversation: 'Faithfulness and relevance are the two numbers we optimize for. If faithfulness is low, we have a hallucination problem. If relevance is low, we have a retrieval problem.'
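An LLM-as-judge faithfulness check on a 1-5 scale might look roughly like the sketch below. Everything here is an assumption for illustration: the prompt wording, the `SCORE:` reply format, and the function names are hypothetical, and `judge` is any callable that takes a prompt string and returns the judge model's reply (no specific API is implied).

```python
import re
from statistics import mean

JUDGE_TEMPLATE = """You are grading a RAG answer for faithfulness.
Context:
{context}

Answer:
{answer}

Rate from 1 (unsupported by the context) to 5 (fully supported by the context).
Reply as:
SCORE: <1-5>
REASON: <one sentence>"""


def parse_score(judge_reply):
    """Extract the 1-5 score from the judge's reply, or None if it is malformed."""
    match = re.search(r"SCORE:\s*([1-5])\b", judge_reply)
    return int(match.group(1)) if match else None


def faithfulness_score(context, answer, judge):
    """Ask the judge model to rate one (context, answer) pair."""
    prompt = JUDGE_TEMPLATE.format(context=context, answer=answer)
    return parse_score(judge(prompt))


def mean_faithfulness(samples, judge):
    """Average score over (context, answer) pairs, skipping unparseable replies."""
    scores = [faithfulness_score(c, a, judge) for c, a in samples]
    valid = [s for s in scores if s is not None]
    return mean(valid) if valid else None
```

Returning `None` for malformed replies, rather than guessing a score, keeps judge failures visible instead of silently skewing the average.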
3 / 5
An ML engineer discusses context window management: "GPT-4 Turbo has a 128K-token context window. That sounds enormous, but context stuffing (putting everything you have into the prompt) degrades quality. There's a well-documented 'lost in the middle' phenomenon: LLMs use information at the beginning and end of the context reliably but miss information in the middle. For RAG, this means you want the most relevant chunks at the top and bottom of the context, not buried in the middle." What is the lost-in-the-middle problem and how does it affect RAG design?
Lost-in-the-middle (Liu et al., 2023): LLMs show a U-shaped performance curve over the position of the relevant information. They use information at the beginning and end of the context most reliably and are less reliable about information in the middle.
Practical implications for RAG: put the most relevant retrieved chunk first (or last) in the context. Don't rely on the LLM to find a needle in the middle of 10 chunks.
Context window vocabulary:
Context window: the maximum number of tokens an LLM can process in a single inference. Includes: system prompt + user message + retrieved chunks + conversation history + output.
Token: the unit of text the model processes; one token is roughly 0.75 words in English. Tokenizers differ between model families (e.g., GPT vs. Llama), so token counts for the same text vary.
Prompt: the full input to the LLM.
System prompt: instructions at the beginning that persist through a conversation.
Few-shot: including examples in the prompt to guide the output.
Context stuffing: including excessive context in the hope that the LLM finds what's relevant; often degrades quality.
Context compression: summarising or filtering retrieved chunks before injecting them, to reduce token use.
Reranking: after initial retrieval, use a cross-encoder model to reorder chunks by relevance; more accurate than vector similarity alone.
In conversation: 'Lost-in-the-middle is why context window size doesn't linearly translate to answer quality. We now explicitly order retrieved chunks: most relevant first and last.'
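One simple way to get "most relevant first and last" is to alternate chunks between the front and the back of the context, so the weakest chunks end up in the middle. The function below is an illustrative sketch with a hypothetical name; it assumes the input list is already sorted from most to least relevant (for example, by a reranker).

```python
def order_for_long_context(chunks_by_relevance):
    """Place the most relevant chunks at the start and end, the least relevant in the middle.

    Input must be sorted from most to least relevant.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Even ranks fill the front of the context, odd ranks fill the back.
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    # Reverse the back half so the second-most-relevant chunk lands last.
    return front + back[::-1]


ranked = ["c1", "c2", "c3", "c4", "c5"]  # c1 is the most relevant
print(order_for_long_context(ranked))  # ['c1', 'c3', 'c5', 'c4', 'c2']
```

Here the top-ranked chunk opens the context, the second-ranked chunk closes it, and the lowest-ranked chunk sits in the middle.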
4 / 5
A platform engineer discusses inference costs: "At scale, inference costs dominate. The main levers: token reduction (shorter prompts, fewer retrieved chunks), caching (if the same query appears often, cache the response), quantization (run INT8 or INT4 instead of FP16 — 2-4x cost reduction with minor quality loss), batching (process multiple requests together to improve GPU utilization). For latency-sensitive paths, we trade cost for quality; for batch jobs, we reverse it." What is quantization in the context of LLM inference, and what trade-off does it introduce?
Quantization: reducing the bit-width of model weights and/or activations to decrease memory footprint and increase inference throughput. The trade-off: lower memory use and cost in exchange for some loss of accuracy, which grows as the bit-width shrinks.
Typical training precision: FP32 (32-bit) or FP16/BF16 (16-bit). Quantized formats: INT8 (8-bit), INT4 (4-bit). GPTQ and AWQ are common methods for producing 4-bit weights.
Effect: a 70B-parameter model in FP16 requires ~140GB of GPU memory for the weights alone (70B × 2 bytes). In INT4: ~35GB (70B × 0.5 bytes), which fits on a single 80GB A100 instead of at least two. The KV cache and activations need additional memory on top of the weights.
Speed: INT8 can be up to about 2x faster than FP16 on supported hardware.
Quality: minimal degradation at INT8; can be noticeable on complex math/reasoning at INT4.
Vocabulary:
FP16 / BF16: 16-bit floating point. Standard for LLM training and inference.
INT8: 8-bit integer. Good quality/cost trade-off.
INT4 / GPTQ / AWQ: 4-bit quantization and the methods used to produce it. Significant compression, some quality loss.
KV cache: cached key/value attention states from previous tokens; reusing it speeds up autoregressive generation.
Batching: processing multiple inference requests together to improve GPU utilisation.
Continuous batching: dynamically adds new requests to a running batch. Higher throughput than static batching.
vLLM: LLM serving framework using PagedAttention for efficient KV cache management.
Cold start: the latency cost of loading a model into GPU memory before the first inference. Mitigated by keeping models warm.
In conversation: 'INT8 quantization is usually safe: we've seen less than 1% quality drop on our eval benchmarks and 2x inference throughput improvement.'
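The memory figures follow from parameters × bytes per parameter, and the quality loss comes from rounding. The sketch below shows both: a weights-only memory estimate, and symmetric absmax INT8 quantization of a small list of weights. It is a simplified illustration with hypothetical function names; GPTQ and AWQ use more sophisticated schemes than this.

```python
BYTES_PER_PARAM = {"fp32": 4.0, "fp16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(num_params, dtype):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes); excludes KV cache and activations."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def quantize_int8(weights):
    """Symmetric absmax quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    return [round(w / scale) for w in weights], scale


def dequantize(q_weights, scale):
    """Map the integers back to floats; the rounding error is the quality cost."""
    return [q * scale for q in q_weights]


print(weight_memory_gb(70e9, "fp16"))  # 140.0
print(weight_memory_gb(70e9, "int4"))  # 35.0

q, scale = quantize_int8([0.8, -1.2, 0.05, 0.3])
restored = dequantize(q, scale)  # close to the originals, but not identical
```

Each restored weight differs from the original by at most half a quantization step (`scale / 2`), which is why INT8 tends to be nearly lossless while the much coarser INT4 grid can hurt harder tasks.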
5 / 5
An ML lead explains fine-tuning options: "Fine-tuning a 70B model full-precision requires 8+ A100 GPUs just for the optimizer states. PEFT methods — specifically LoRA — dramatically reduce this. LoRA adds small trainable rank-decomposition matrices to the attention layers; only these small matrices are trained, not the original weights. You can fine-tune a 7B model on a single A100 with LoRA. The result: domain adaptation with 10-100x less compute. For instruction following, we often use DPO instead of RLHF — similar results, simpler training loop." What is LoRA and why does it reduce the compute cost of fine-tuning?
LoRA (Low-Rank Adaptation) (Hu et al., 2021): instead of updating all model weights during fine-tuning, LoRA freezes the original weights and injects trainable rank-decomposition matrices into the attention layers. For a weight matrix W of shape (d, k), LoRA uses W + BA in place of W, where B is (d, r) and A is (r, k) with r ≪ d, k. Only A and B are trained.
Trainable parameters: instead of d×k (e.g., 4096×4096 ≈ 16.8M), only r×(d+k) (e.g., r=8: 2×8×4096 ≈ 65K). That is 256x fewer parameters for this matrix.
Why it is cheaper: gradients and optimizer states are needed only for A and B, not for the frozen weights, so training memory drops sharply.
Fine-tuning vocabulary:
PEFT (Parameter-Efficient Fine-Tuning): methods that fine-tune a small subset of parameters. Examples: LoRA, QLoRA, adapters, prefix tuning.
Full fine-tuning: update all model weights. Requires storing gradients and optimizer states for every parameter.
QLoRA: LoRA on a 4-bit quantized base model. Makes it possible to fine-tune a 65B model on a single A100.
Instruction tuning: fine-tune on (instruction, response) pairs to improve task-following.
RLHF: fine-tune using human preference rankings. Complex training pipeline (SFT → reward model → PPO).
DPO (Direct Preference Optimization): achieves similar results to RLHF without a separate reward model. Simpler, more stable training.
Base model: pre-trained only; requires careful prompting to produce useful outputs.
Chat/instruct model: fine-tuned for dialogue and instruction following.
In conversation: 'QLoRA made domain-specific fine-tuning accessible: you don't need a 32-GPU cluster anymore. A team with one A100 can now adapt a 7B model to their domain in hours.'
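The parameter arithmetic and the W + BA update can be checked with a short sketch. This is a pure-Python illustration using nested lists rather than a tensor library; the function names are hypothetical, and `scale` stands in for the alpha/r scaling factor used in the LoRA paper.

```python
def lora_param_counts(d, k, r):
    """Return (full fine-tuning params, LoRA trainable params) for one (d, k) weight matrix."""
    full = d * k        # every entry of W is trainable
    lora = r * (d + k)  # B is (d, r) and A is (r, k)
    return full, lora


def matmul(x, y):
    """Plain nested-list matrix product, enough for a small demo."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*y)] for row in x]


def lora_effective_weight(W, B, A, scale=1.0):
    """Return W + scale * (B @ A). W stays frozen; only B and A would be trained."""
    delta = matmul(B, A)
    return [[w + scale * dw for w, dw in zip(w_row, d_row)] for w_row, d_row in zip(W, delta)]


full, lora = lora_param_counts(d=4096, k=4096, r=8)
print(full, lora, full // lora)  # 16777216 65536 256

# B is initialised to zero in LoRA, so training starts from the unchanged base model.
W = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.0], [0.0]]   # shape (d=2, r=1)
A = [[0.5, -0.5]]    # shape (r=1, k=2)
assert lora_effective_weight(W, B, A) == W
```

Because the update is just an additive low-rank term, the trained BA product can be merged into W after fine-tuning, so inference pays no extra cost.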