Advanced Vocabulary

LLMOps & ML Platform Vocabulary

5 exercises — Practice LLMOps vocabulary in English: RAG, embeddings, vector stores, hallucination, evaluation pipelines, context windows, and inference cost.

Core LLMOps vocabulary clusters

RAG: retrieval-augmented generation, embedding, vector store, chunking, semantic search, retrieval quality, reranking
Evaluation: faithfulness, relevance, groundedness, LLM-as-judge, eval pipeline, hallucination detection
Context: context window, token, prompt, system prompt, few-shot, context stuffing, lost-in-the-middle
Deployment: inference, latency, throughput, batching, quantization, KV cache, model serving, cold start
Fine-tuning: PEFT, LoRA, instruction tuning, base model vs. fine-tuned, RLHF, DPO

1 / 5

An ML engineer explains RAG architecture to the product team:
"RAG — Retrieval-Augmented Generation — solves the knowledge cutoff problem. Instead of relying only on what the LLM learned during training, we retrieve relevant documents from a knowledge base at inference time and inject them into the prompt. The pipeline: user query → embed the query → vector search → retrieve top-K chunks → inject into prompt context → LLM generates a grounded answer. The quality of the answer depends on retrieval quality, not just the LLM."
What problem does RAG solve, and what is the role of embeddings in the pipeline?
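
The pipeline in the quote can be sketched in a few lines. This is a minimal illustration, not a production design: `embed` is a toy bag-of-words stand-in for a real embedding model, the "vector store" is a plain list, and all function names and sample chunks are hypothetical.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, chunks: list, k: int = 2) -> list:
    """Embed the query, score every chunk, and return the top-K chunks."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]

def build_prompt(query: str, context: list) -> str:
    """Inject the retrieved chunks into the prompt ahead of the question."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {query}"
    )

# Hypothetical knowledge base
chunks = [
    "Refunds are processed within 14 days of the return request.",
    "Our office is closed on public holidays.",
    "Shipping is free for orders over 50 euros.",
]

query = "How long do refunds take?"
top = retrieve(query, chunks, k=1)   # the refunds chunk ranks first
prompt = build_prompt(query, top)    # this string would be sent to the LLM
```

The sketch shows why retrieval quality matters: if `retrieve` returns the wrong chunk, the LLM never sees the fact it needs, however capable the model is.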

2 / 5

An LLMOps engineer presents evaluation metrics to the team:
"We track three core RAG metrics. Faithfulness: does the generated answer stay within the retrieved context, or does the model add unsupported claims? Relevance: does the answer actually address the user's question? Groundedness: can every factual claim in the answer be traced to a specific chunk? We use LLM-as-judge for these — a second LLM rates each response on a 1-5 scale with reasoning. It's not perfect but it scales."
What is hallucination in the context of LLMs, and why is faithfulness an important metric?
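
An LLM-as-judge loop like the one described can be sketched as follows. The prompt template, the `Score: <n>` reply format, and `fake_judge` are all assumptions made for illustration; in practice `call_judge` would wrap a call to a second LLM.

```python
import re
from statistics import mean
from typing import Callable, Optional

# Hypothetical judge prompt; real rubrics are usually longer and more specific.
JUDGE_TEMPLATE = (
    "Rate the answer for faithfulness to the context on a 1-5 scale.\n"
    "Reply as 'Score: <n>' followed by one sentence of reasoning.\n\n"
    "CONTEXT:\n{context}\n\nANSWER:\n{answer}"
)

def parse_score(reply: str) -> Optional[int]:
    """Extract the 1-5 score from the judge's reply, or None if it is missing."""
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None

def judge_faithfulness(samples: list, call_judge: Callable[[str], str]) -> float:
    """Average faithfulness score over all samples the judge rated parseably."""
    scores = []
    for sample in samples:
        reply = call_judge(JUDGE_TEMPLATE.format(**sample))
        score = parse_score(reply)
        if score is not None:
            scores.append(score)
    return mean(scores) if scores else float("nan")

def fake_judge(prompt: str) -> str:
    """Stub standing in for a real LLM call, so the sketch runs offline."""
    answer = prompt.split("ANSWER:\n")[-1]
    if "14 days" in answer:
        return "Score: 5 The answer restates the context."
    return "Score: 2 The answer adds claims that are not in the context."

samples = [
    {"context": "Refunds are processed within 14 days.",
     "answer": "Refunds take up to 14 days."},
    {"context": "Refunds are processed within 14 days.",
     "answer": "Refunds take 3 days and include a gift card."},
]
average = judge_faithfulness(samples, fake_judge)
```

Replies that cannot be parsed are skipped here rather than counted as zero; which choice is right depends on how the team wants judge failures to show up in the metric.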

3 / 5

An ML engineer discusses context window management:
"GPT-4 Turbo has a 128K-token context window. That sounds enormous, but context stuffing (putting everything you have into the prompt) degrades quality. There's a well-documented 'lost in the middle' phenomenon: LLMs use information at the beginning and end of the context more reliably and often miss information in the middle. For RAG, this means you want the most relevant chunks at the top and bottom of the context, not buried in the middle."
What is the lost-in-the-middle problem and how does it affect RAG design?
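
One simple way to act on this advice is to reorder the retrieved chunks before building the prompt. The sketch below assumes the chunks arrive sorted from most to least relevant; the function name is hypothetical.

```python
def reorder_for_long_context(chunks_by_relevance: list) -> list:
    """Put the most relevant chunks at the edges and the least relevant in the middle.

    Input must be sorted from most to least relevant.
    """
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        # Alternate: ranks 1, 3, 5... fill the top; ranks 2, 4, 6... fill the bottom.
        if i % 2 == 0:
            front.append(chunk)
        else:
            back.append(chunk)
    return front + back[::-1]

ranked = ["c1", "c2", "c3", "c4", "c5"]   # c1 is the most relevant
ordered = reorder_for_long_context(ranked)
# ordered == ["c1", "c3", "c5", "c4", "c2"]
```

The best chunk lands first, the second-best lands last, and the weakest chunk sits in the middle, where the model is most likely to overlook it.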

4 / 5

A platform engineer discusses inference costs:
"At scale, inference costs dominate. The main levers: token reduction (shorter prompts, fewer retrieved chunks), caching (if the same query appears often, cache the response), quantization (run INT8 or INT4 instead of FP16, for roughly 2-4x smaller weights and lower serving cost with minor quality loss), and batching (process multiple requests together to improve GPU utilization). For latency-sensitive paths, we accept higher cost to protect speed and quality; for batch jobs, we optimize for cost instead."
What is quantization in the context of LLM inference, and what trade-off does it introduce?
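
The arithmetic behind quantization can be made concrete. The sketch below estimates weight memory at different precisions and shows a simple symmetric INT8 round trip on a handful of values. It is an illustration only: real inference stacks use per-channel or per-group scales and more careful schemes, and the 7B parameter count is just an example.

```python
def weight_memory_gb(n_params: float, bits: int) -> float:
    """Memory needed to store the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits / 8 / 1e9

def quantize_int8(values: list) -> tuple:
    """Symmetric quantization: map floats to integers in [-127, 127] with one scale."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale

def dequantize(q: list, scale: float) -> list:
    """Recover approximate floats; the difference from the originals is the quality loss."""
    return [x * scale for x in q]

n_params = 7e9  # example: a 7B-parameter model
fp16_gb = weight_memory_gb(n_params, 16)   # 14.0
int8_gb = weight_memory_gb(n_params, 8)    # 7.0
int4_gb = weight_memory_gb(n_params, 4)    # 3.5

weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The trade-off is visible in `max_error`: each weight can move by up to half a quantization step, which is small for INT8 and grows as the bit width shrinks.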

5 / 5

An ML lead explains fine-tuning options:
"Full fine-tuning of a 70B model in full precision requires 8+ A100 GPUs just for the optimizer states. PEFT methods, specifically LoRA, dramatically reduce this. LoRA adds small trainable rank-decomposition matrices to the attention layers; only these small matrices are trained, while the original weights stay frozen. You can fine-tune a 7B model on a single A100 with LoRA. The result: domain adaptation with roughly 10-100x less compute. For preference alignment, we often use DPO instead of RLHF: similar results, simpler training loop."
What is LoRA and why does it reduce the compute cost of fine-tuning?
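
A small numeric sketch shows where the savings come from. For one weight matrix W of shape d_out x d_in, LoRA trains two matrices, A (r x d_in) and B (d_out x r), and computes y = Wx + (alpha / r) * B(Ax). The layer size, rank, and `alpha` below are illustrative assumptions, and the matrix code uses plain Python lists rather than a deep learning framework.

```python
import random

def full_params(d_out: int, d_in: int) -> int:
    """Trainable parameters when the whole weight matrix is fine-tuned."""
    return d_out * d_in

def lora_params(d_out: int, d_in: int, r: int) -> int:
    """Trainable parameters with LoRA: A is r x d_in, B is d_out x r."""
    return r * (d_out + d_in)

def matvec(m: list, v: list) -> list:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha: float, r: int) -> list:
    """Frozen base output plus the scaled low-rank update."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

# Parameter count for one hypothetical 4096 x 4096 attention projection, rank 8
full = full_params(4096, 4096)      # 16,777,216
lora = lora_params(4096, 4096, 8)   # 65,536
ratio = full // lora                # 256

# Tiny forward pass: B starts at zero, so the adapted layer initially matches the base layer
random.seed(0)
d, r = 4, 2
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(d)]   # frozen
A = [[random.gauss(0, 1) for _ in range(d)] for _ in range(r)]   # trainable
B = [[0.0] * r for _ in range(d)]                                # trainable, zero-initialized
x = [1.0, 2.0, 3.0, 4.0]
y = lora_forward(W, A, B, x, alpha=16, r=r)
```

Only A and B receive gradients and optimizer states, so the memory needed for training shrinks roughly in proportion to the trainable parameter count, while the frozen W is still used in every forward pass.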