LLM-as-judge questions: name the bias → explain its mechanism → give a concrete mitigation → describe calibration against human labels
Hallucination questions: intrinsic vs. extrinsic taxonomy → detection method per type → scale strategy (tiered) → metric choice
RAGAS questions: name metric → explain how it is computed (not just what it measures) → flag limitations
Contamination questions: why it matters → detection methods in order of reliability → mitigations
1 / 5
The interviewer asks: "How would you design a comprehensive benchmark suite to evaluate a large language model before production release?" Which answer is most rigorous?
Option B is strongest. It names all four evaluation axes with specific benchmarks for each and explains what each benchmark measures rather than only naming it (for example, MMLU covers breadth across 57 subjects, HumanEval reports the unit-test pass rate, and MT-Bench tests multi-turn instruction following). It introduces HELM as a multi-dimensional framework covering calibration, robustness, and fairness as well as accuracy, emphasises task-specific golden datasets as more valuable than public benchmarks for production use, and, critically, raises dataset contamination as a validity threat that inflates public scores.
LLM evaluation vocabulary:
HELM: Stanford's Holistic Evaluation of Language Models framework, covering 7 metric dimensions across 42 scenarios.
HumanEval: OpenAI's code benchmark measuring functional correctness via unit tests.
MT-Bench: multi-turn instruction-following benchmark using GPT-4 as judge.
TruthfulQA: benchmark for truthfulness on adversarially written questions that target common misconceptions.
Dataset contamination: when benchmark test data appears in training data, inflating scores.
N-gram overlap: a contamination detection method comparing n-grams between training and test data.
Options C and D are accurate but lack the reasoning behind each benchmark's purpose and the contamination detection mechanism.
2 / 5
The interviewer asks: "What are the failure modes of LLM-as-judge evaluation, and how do you mitigate them?" Which answer is most complete?
Option B is strongest. It names all five failure modes (not four) and gives a specific mitigation for each that goes beyond generic advice. It introduces calibration error as a failure mode distinct from bias (a nuance many candidates miss), explains the use of anchor examples in rubrics (a concrete technique that interviewers at evaluation-heavy companies like Anthropic or Google DeepMind will recognise), and specifies Krippendorff's alpha and Cohen's kappa as appropriate inter-rater agreement metrics. The phrase "meta-evaluate LLM judge against human holdout" frames the key principle: you must validate your judge before trusting it at scale.
LLM-as-judge vocabulary:
Verbosity bias: systematic preference for longer outputs.
Positional bias: preference for outputs based on their position in a prompt.
Sycophancy bias: shifting a judgement to agree with a stated opinion or with the perceived authority of the source.
Calibration: how consistently and accurately assigned scores track a reference such as human labels.
Krippendorff's alpha: inter-rater reliability coefficient that handles nominal, ordinal, interval, and ratio scales, any number of raters, and missing ratings.
Cohen's kappa: chance-corrected agreement between two raters on categorical labels.
Anchor examples: labelled examples that define each score level in a rubric.
Options C and D name the failure modes correctly but lack the anchor-rubric technique and the specific statistical metrics for calibration.
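The "validate your judge against human labels" step can be made concrete with Cohen's kappa. The following is a minimal sketch, assuming the judge and a human annotator have each assigned one categorical label to the same items; the function name and the example labels are illustrative and not taken from any particular library.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa for two raters labelling the same items with categories."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(judge_labels)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Expected chance agreement, from each rater's marginal label frequencies.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)

    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


# Hypothetical holdout: LLM judge verdicts vs. human verdicts on four answers.
judge = ["good", "good", "bad", "bad"]
human = ["good", "bad", "bad", "bad"]
kappa = cohens_kappa(judge, human)
```

Raw agreement here is 75%, but kappa is lower because half of that agreement would be expected by chance given how often each rater uses each label. For ordinal rubric scores or more than two raters, Krippendorff's alpha is the more general choice.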
3 / 5
The interviewer asks: "How do you detect and measure hallucination in LLM outputs at scale?" Which answer is most systematic?
Option B is strongest. It introduces the intrinsic vs. extrinsic hallucination taxonomy that underpins most hallucination research, explains the NLI-based mechanism in precise technical terms (Entailment / Neutral / Contradiction labels, not just "check if it matches"), and names specific tools (TRUE, MiniCheck). It presents three distinct strategies for extrinsic hallucination with honest cost trade-offs, introduces the tiered detection pattern for cost efficiency at scale, and names three benchmark metrics suited to different task types (RAGAS for RAG, FactScore for biography generation, SummEval for summarisation).
Hallucination detection vocabulary:
Intrinsic hallucination: contradicts information present in the input context.
Extrinsic hallucination: introduces information not present in the input (may be true or false).
NLI (Natural Language Inference): classifying whether one text entails, contradicts, or is neutral to another.
MiniCheck / TRUE: NLI-style models for checking whether a claim is supported by a source text.
FactScore: decomposes a generation into atomic claims and scores the fraction supported by a knowledge source.
Tiered detection: applying cheap filters before expensive checks to reduce cost at scale.
Options C and D are accurate but lack the tiered detection cost rationale and the task-specific metric selection.
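The tiered detection pattern can be sketched as a cheap scorer that settles the confident cases and escalates only the uncertain middle band to an expensive check. Everything below is an assumption made for illustration: the function names, the thresholds, and especially `token_overlap`, which is a toy stand-in for a real first tier such as a small NLI model and cannot detect contradictions.

```python
from typing import Callable, Tuple


def tiered_faithfulness_check(
    claim: str,
    context: str,
    cheap_score: Callable[[str, str], float],
    expensive_check: Callable[[str, str], bool],
    accept_above: float = 0.9,   # hypothetical threshold
    reject_below: float = 0.1,   # hypothetical threshold
) -> Tuple[bool, str]:
    """Return (is_supported, tier_used) for one claim against one context."""
    score = cheap_score(claim, context)
    if score >= accept_above:
        return True, "cheap"
    if score <= reject_below:
        return False, "cheap"
    # Only the uncertain middle band pays for the expensive check.
    return expensive_check(claim, context), "expensive"


def token_overlap(claim: str, context: str) -> float:
    """Toy cheap scorer: fraction of claim tokens that appear in the context."""
    claim_tokens = claim.lower().split()
    context_tokens = set(context.lower().split())
    if not claim_tokens:
        return 0.0
    return sum(t in context_tokens for t in claim_tokens) / len(claim_tokens)


# Placeholder for an expensive tier (for example an LLM or retrieval-backed check).
def always_unsupported(claim: str, context: str) -> bool:
    return False


context = "paris is the capital of france"
confident = tiered_faithfulness_check("paris is the capital", context, token_overlap, always_unsupported)
escalated = tiered_faithfulness_check("paris is large", context, token_overlap, always_unsupported)
```

The cost saving comes from how much traffic the first tier resolves, so the two thresholds are worth tuning on a labelled sample rather than fixing in advance.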
4 / 5
The interviewer asks: "Walk me through how RAGAS evaluates a RAG pipeline. What does each of its metrics measure and how is it computed?" Which answer is most precise?
Option B is strongest. It explains each RAGAS metric at the algorithmic level, describing how each metric is computed and not just what it measures, which is the depth expected of a senior LLM evaluation engineer at a company such as Cohere or Anthropic, or on a RAG-product team. The answer describes the reverse-question technique for Answer Relevancy (generate the questions the answer implies, then measure their similarity to the original question), explains why Context Precision and Context Recall are distinct axes (a retriever can be precise yet miss needed information, or achieve high recall while pulling in a lot of irrelevant content), and flags the limitation that Context Recall requires reference answers, which makes it only semi-automated.
RAGAS vocabulary:
Atomic claim decomposition: breaking an answer into minimal factual units for entailment checking.
Hypothetical question generation: generating the questions an answer implies, used to measure relevancy.
Context precision: proportion of retrieved chunks that are useful.
Context recall: proportion of reference answer information attributable to retrieved context.
Golden dataset: a curated set of queries with human-verified reference answers.
Options C and D are accurate but present the computation as bullet points without the explanatory reasoning that shows genuine understanding.
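The arithmetic behind these metrics is simple once the LLM and embedding steps are done. The sketch below assumes those steps have already produced their outputs (per-claim support verdicts, question embeddings, per-sentence attribution verdicts) and shows only the final scoring; the function names are illustrative and are not the API of the `ragas` library.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def faithfulness(claim_supported):
    """Fraction of atomic claims judged as supported by the retrieved context."""
    if not claim_supported:
        return 0.0
    return sum(claim_supported) / len(claim_supported)


def answer_relevancy(question_vec, generated_question_vecs):
    """Mean cosine similarity between the original question and the
    questions generated back from the answer (the reverse-question idea)."""
    if not generated_question_vecs:
        return 0.0
    sims = [cosine(question_vec, g) for g in generated_question_vecs]
    return sum(sims) / len(sims)


def context_recall(reference_sentence_attributed):
    """Fraction of reference-answer sentences attributable to retrieved context."""
    if not reference_sentence_attributed:
        return 0.0
    return sum(reference_sentence_attributed) / len(reference_sentence_attributed)


# Made-up verdicts and toy 2-d embeddings, purely for illustration.
f = faithfulness([True, True, False, True])
r = answer_relevancy([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Note that `context_recall` is the only function whose input depends on a reference answer, which is why that metric needs a golden dataset.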
5 / 5
The interviewer asks: "How do you detect dataset contamination, and why does it matter for benchmark validity?" Which answer is most complete?
Option B is strongest. It opens with the impact statement (contamination inflates scores without reflecting generalisation, which directly answers "why does it matter"), then presents four detection methods in order of reliability, with technical details that distinguish an expert from a generalist: the n-gram lengths reported in the PaLM (8-gram) and GPT-3 (13-gram) papers, the canary string mechanism used by benchmark maintainers, the white-box log-probability requirement for membership inference attacks, and behavioural analysis that compares performance on older examples against fresh ones, the idea behind continuously refreshed benchmarks such as LiveBench. Closing with three actionable mitigations completes the answer.
Dataset contamination vocabulary:
N-gram overlap: measuring shared n-grams between training corpus and test set.
Canary string: a unique planted string used to detect whether a model was trained on a specific document.
Membership inference attack: probing whether a model has memorised specific examples by comparing the probabilities it assigns to them, which requires access to token log-probabilities.
Decontamination filtering: removing benchmark test examples from training data during curation.
LiveBench: a continuously updated benchmark that refreshes examples to prevent contamination.
Options C and D name the methods correctly but lack the paper-specific n-gram lengths, the white-box caveat for membership inference, and the impact framing.
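N-gram overlap, the first detection method, can be sketched in a few lines. This is a minimal illustration that assumes whitespace tokenisation and an in-memory training corpus; real pipelines use the model's tokenizer and an index built for corpus scale. The function names are hypothetical, and the default window of 8 mirrors the 8-gram figure mentioned above.

```python
def ngrams(tokens, n):
    """Set of all contiguous n-token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def build_training_ngrams(documents, n=8):
    """Collect every n-gram seen anywhere in the training documents."""
    seen = set()
    for doc in documents:
        seen |= ngrams(doc.lower().split(), n)
    return seen


def contamination_ratio(test_example, training_ngrams, n=8):
    """Fraction of the test example's n-grams that also occur in training data."""
    example_ngrams = ngrams(test_example.lower().split(), n)
    if not example_ngrams:
        return 0.0  # example is shorter than n tokens
    return len(example_ngrams & training_ngrams) / len(example_ngrams)


# Tiny made-up corpus with n=3 so the overlap is visible by eye.
train = build_training_ngrams(["the quick brown fox jumps over the lazy dog"], n=3)
ratio = contamination_ratio("quick brown fox jumps high", train, n=3)
```

A benchmark item would then be flagged when its ratio exceeds a chosen threshold; both the window length and the threshold are judgment calls that trade false positives on common phrases against missed paraphrased leaks.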