5 exercises — practice structuring strong English answers for LLMOps engineering interviews: RAG, evaluation, inference cost, monitoring, and model serving.
The interviewer asks: "How would you improve retrieval quality in a RAG system?" Which answer is most systematic?
Option B is strongest: it frames retrieval quality as a diagnostic problem with four named failure modes, explains WHY each failure mode occurs (not just what to do), introduces parent-child chunking as a nuanced technique many candidates miss, names Reciprocal Rank Fusion (RRF) as the merge strategy for hybrid retrieval, explains why cross-encoders are more accurate (joint scoring of the query-document pair vs. cosine similarity between separately computed embeddings), and closes with a complete evaluation framework.
RAG vocabulary:
Semantic chunking: splitting documents at semantic boundaries rather than fixed token counts.
Parent-child chunking: indexing small chunks for precision but returning their larger parent for context.
Dense retrieval: embedding-based similarity search (vector database).
Sparse retrieval: keyword-based search (BM25/TF-IDF).
Reciprocal Rank Fusion (RRF): a rank fusion method that combines ranked lists from multiple retrievers.
Cross-encoder reranker: scores query-document pairs jointly; more accurate than bi-encoder similarity.
RAGAS: a RAG evaluation framework measuring faithfulness, answer relevancy, context precision, and context recall.
Options C and D are accurate but lack the failure-mode diagnostic framing and the cross-encoder rationale.
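To make the RRF definition concrete, here is a minimal sketch of the fusion step. It assumes each retriever returns an ordered list of document IDs; the function name, the sample IDs, and the constant k=60 (a commonly used default) are illustrative choices, not something the exercise specifies.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is an assumed, commonly used default.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by document ID for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical results from a dense (embedding) and a sparse (BM25) retriever.
dense_results = ["d3", "d1", "d2"]
sparse_results = ["d1", "d4", "d3"]
fused = reciprocal_rank_fusion([dense_results, sparse_results])
print(fused)
```

Because RRF uses only ranks, it does not need the dense and sparse scores to be on the same scale, which is one reason it is a popular merge step for hybrid retrieval.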
2 / 5
The interviewer asks: "How do you evaluate LLM-generated answers at scale?" Which answer is most rigorous?
Option B is strongest: it organises evaluation into three distinct phases (offline, LLM-as-judge, online) with specific protocols for each, explains how each RAGAS metric is actually computed (not just named), names three specific LLM-as-judge biases with a concrete mitigation for each, and gives the most nuanced online signal (clarifying follow-up questions as an implicit sign that the answer was insufficient).
LLM evaluation vocabulary:
Faithfulness: the answer contains only information grounded in the retrieved context.
Answer relevancy: the answer addresses the question asked.
Context recall: the retrieved context contains the necessary information.
LLM-as-judge: using an LLM to score or compare generated outputs.
Verbosity bias: judges prefer longer outputs regardless of quality.
Positional bias: judges favour an output because of where it appears in the prompt (often the one presented first).
Shadow evaluation: sending production traffic to both the old and new configurations and comparing the results, while users continue to see only the old configuration's responses.
Options C and D are accurate but lack the metric computation explanations and the clarification follow-up signal.
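One common mitigation for positional bias is to run each pairwise comparison twice with the order swapped and accept only consistent verdicts. The sketch below assumes a hypothetical judge callable that returns "first" or "second"; in practice that callable would wrap an LLM call, and the two stand-in judges here exist only to show the control flow.

```python
from typing import Callable

# Hypothetical judge signature: (question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def debiased_preference(question: str, answer_a: str, answer_b: str, judge: Judge) -> str:
    """Return "A", "B", or "tie" after judging in both presentation orders."""
    forward = judge(question, answer_a, answer_b)   # A shown first
    backward = judge(question, answer_b, answer_a)  # B shown first
    forward_winner = "A" if forward == "first" else "B"
    backward_winner = "B" if backward == "first" else "A"
    # A verdict that flips when the order flips is treated as a tie.
    return forward_winner if forward_winner == backward_winner else "tie"


def always_first_judge(question: str, first: str, second: str) -> str:
    """Stand-in for a maximally position-biased judge."""
    return "first"


def keyword_judge(question: str, first: str, second: str) -> str:
    """Stand-in for an unbiased judge: prefers the answer mentioning Paris."""
    return "first" if "Paris" in first else "second"


biased_verdict = debiased_preference("Capital of France?", "Paris", "Lyon", always_first_judge)
fair_verdict = debiased_preference("Capital of France?", "Paris", "Lyon", keyword_judge)
print(biased_verdict, fair_verdict)
```

The position-biased judge collapses to a tie, while the content-based judge gives the same winner in both orders, which is the behaviour the swap is meant to expose.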
3 / 5
The interviewer asks: "What strategies would you use to reduce LLM inference cost?" Which answer is most complete?
Option B is strongest: it introduces five levers with concrete cost-reduction estimates (60-80%, 30-50%, 1-2% quality loss), explains prefix caching at the attention layer (a nuance not all LLMOps candidates know), names LLMLingua specifically as a prompt compression tool, explains PagedAttention as the memory-management mechanism that supports continuous batching in vLLM, and closes with the cost accounting insight (measure cost per successful answer, not per token).
LLMOps cost vocabulary:
LLM router: a classifier that routes queries to the appropriate model tier.
Semantic cache: serves cached responses for semantically similar past queries.
KV-cache prefix caching: reuses cached attention keys/values for identical prompt prefixes across requests.
Prompt compression: removes redundant tokens from long contexts before generation.
Quantisation: reducing the numeric precision of model weights (for example from 16- or 32-bit floating point to int8 or int4) to reduce memory use.
Continuous batching: serving multiple requests simultaneously by admitting new requests and retiring finished ones at each decode step, rather than waiting for a whole batch to complete.
PagedAttention: vLLM's KV-cache memory management technique, which stores the cache in fixed-size blocks and makes continuous batching memory-efficient.
Options C and D are accurate but lack the cost estimates and the per-successful-answer accounting insight.
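The per-successful-answer accounting can be illustrated with a small calculation. Every figure below (spend, request count, success rate) is made up for illustration, and the function name is hypothetical; the point is only that a cheaper model can cost more per useful answer once failures are counted.

```python
def cost_per_successful_answer(total_cost_usd: float, total_requests: int, success_rate: float) -> float:
    """Divide total spend by the number of answers that actually succeeded."""
    successes = total_requests * success_rate
    if successes <= 0:
        raise ValueError("no successful answers; cost per success is undefined")
    return total_cost_usd / successes


# Illustrative numbers only: a small model that is cheap but fails often,
# and a large model that costs more per request but succeeds more often.
small_model = cost_per_successful_answer(total_cost_usd=20.0, total_requests=10_000, success_rate=0.50)
large_model = cost_per_successful_answer(total_cost_usd=30.0, total_requests=10_000, success_rate=0.90)

print(f"small model: ${small_model:.4f} per successful answer")
print(f"large model: ${large_model:.4f} per successful answer")
```

With these assumed inputs the large model is cheaper per successful answer even though its total bill is higher, which is the argument for tracking this metric alongside per-token cost.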
4 / 5
The interviewer asks: "How do you detect when LLM output quality degrades in production?" Which answer is most practical?
Option B is strongest: it introduces the leading vs. lagging indicator framework first (the key insight for early detection), explains WHY embedding drift signals quality degradation (the model generates qualitatively different outputs), introduces hedge word rate as a specific leading indicator many candidates miss, presents the per-response automated check layer as a separate concept from aggregate monitoring, and describes a structured root cause workflow.
LLM monitoring vocabulary:
Embedding drift: a shift in the distribution of output embeddings over time, indicating qualitative change.
Hedge word rate: the proportion of outputs containing uncertainty phrases; a proxy for out-of-domain query rate.
Faithfulness checker: verifies that each factual claim in a RAG output is grounded in retrieved context.
Regression testing pipeline: automated evaluation on a golden dataset run on a schedule.
Query distribution shift: a change in the type of queries users are sending, causing the model to encounter unfamiliar inputs.
Options C and D are accurate but lack the leading/lagging framework explanation and the root cause workflow.
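Hedge word rate is simple enough to sketch directly. The phrase list, the sample outputs, and the idea of comparing against a stored baseline are assumptions made for illustration; a real deployment would tune the phrases to its own domain.

```python
# Assumed list of uncertainty phrases; not taken from the exercise.
HEDGE_PHRASES = (
    "i'm not sure",
    "i am not sure",
    "i cannot confirm",
    "it is possible that",
    "i don't have enough information",
)


def hedge_word_rate(outputs):
    """Return the fraction of outputs containing at least one hedge phrase."""
    if not outputs:
        return 0.0
    hedged = sum(
        1 for text in outputs
        if any(phrase in text.lower() for phrase in HEDGE_PHRASES)
    )
    return hedged / len(outputs)


# Hypothetical sample of recent production outputs.
recent_outputs = [
    "The refund window is 30 days from delivery.",
    "I'm not sure, but the policy may have changed recently.",
    "You can reset your password from the account page.",
    "Shipping to Canada takes 5 to 7 business days.",
]
rate = hedge_word_rate(recent_outputs)
print(f"hedge word rate: {rate:.2f}")
```

Tracked per hour or per day, a rise in this rate relative to its usual level can flag a shift towards out-of-domain queries before lagging signals such as user complaints appear.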
5 / 5
The interviewer asks: "Walk me through building a RAG pipeline from scratch." Which answer is most architectural?
Option B is strongest: it names all six components with specific design decisions and reasoning for each, names concrete tools for the components (Docling, Unstructured.io, Qdrant, LangSmith), introduces HyDE as a specific query rewriting technique (a nuance beyond simple query expansion), explains WHY parent-child chunking solves the precision vs. context tension, and closes with both online and offline observability.
RAG pipeline vocabulary:
Docling / Unstructured.io: document parsing libraries that preserve document structure.
Parent-child chunking: indexes small chunks for retrieval, returns the larger parent for context.
HyDE (Hypothetical Document Embeddings): asks the LLM to generate a hypothetical answer, embeds it, and retrieves passages similar to that embedding rather than to the raw query.
Reciprocal Rank Fusion (RRF): merge strategy for combining dense and sparse retrieval results.
Cross-encoder reranker: scores query-passage pairs jointly for more accurate ranking.
Faithfulness check: verifies generated claims against retrieved context.
LangSmith / Phoenix / Langfuse: LLM observability platforms for tracing and evaluation.
Options C and D are accurate but lack the HyDE explanation and the specific tooling rationale.
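The precision vs. context trade-off behind parent-child chunking can be shown with a toy index. This is a sketch under loud assumptions: chunks are split by word count, and word overlap stands in for the embedding similarity a real pipeline would use; the function names and sizes are hypothetical.

```python
def build_parent_child_index(document, parent_size=40, child_size=10):
    """Split a document into large parent chunks and small child chunks.

    Returns (parents, children), where each child is (child_text, parent_id).
    Sizes are in words and are arbitrary illustrative defaults.
    """
    words = document.split()
    parents = [
        " ".join(words[i:i + parent_size])
        for i in range(0, len(words), parent_size)
    ]
    children = []
    for parent_id, parent in enumerate(parents):
        parent_words = parent.split()
        for j in range(0, len(parent_words), child_size):
            children.append((" ".join(parent_words[j:j + child_size]), parent_id))
    return parents, children


def retrieve_parent(query, parents, children):
    """Match the query against small children, then return the matching parent.

    Word overlap is a stand-in for vector similarity (an assumption of this sketch).
    """
    query_words = set(query.lower().split())

    def overlap(child):
        return len(query_words & set(child[0].lower().split()))

    _, best_parent_id = max(children, key=overlap)
    return parents[best_parent_id]


doc = "alpha beta gamma delta epsilon zeta eta theta"
parents, children = build_parent_child_index(doc, parent_size=4, child_size=2)
context = retrieve_parent("zeta", parents, children)
print(context)
```

The query matches a two-word child, but the four-word parent is what gets passed to the generator, so retrieval stays precise while the model still sees the surrounding context.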