A hallucination is output that is fluent and confident but ungrounded — fabricated facts, invented citations, nonexistent API methods. Because LLMs are trained to produce probable next tokens, not to verify truth, they can generate convincing falsehoods. Evaluating and reducing hallucination is central to production LLM work. Techniques include retrieval-augmented generation (grounding answers in retrieved documents), measuring faithfulness (does the answer stay true to the provided context?), and citation requirements. Hallucination is especially dangerous in high-stakes domains like medicine or law.
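One way to make the citation requirement checkable is sketched below. It assumes a hypothetical convention in which the model marks sources as bracketed ids such as `[doc1]` placed before the sentence's final punctuation; the function name and the id format are illustrative, not part of any particular library. The check only confirms that each sentence cites something and that every cited id was actually retrieved. It does not confirm that the cited document supports the claim.

```python
import re


def check_citations(answer: str, retrieved_ids: set[str]) -> dict:
    """Flag sentences with no citation and citations to documents never retrieved."""
    # Crude sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    uncited = []
    unknown = set()
    for sentence in sentences:
        # Assumed (hypothetical) citation format: [doc-id] inside the sentence.
        cited = set(re.findall(r"\[([^\[\]]+)\]", sentence))
        if not cited:
            uncited.append(sentence)
        unknown |= cited - retrieved_ids
    return {"uncited_sentences": uncited, "unknown_citations": sorted(unknown)}


answer = (
    "Aspirin reduces fever [doc1]. "
    "It was approved last week [doc9]. "
    "It is always safe."
)
report = check_citations(answer, retrieved_ids={"doc1", "doc2"})
print(report)
```

Here the second sentence cites an id that was never retrieved (a likely invented citation) and the third sentence cites nothing, so both would be flagged for review.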
What is an "eval set" (evaluation dataset) and why is it essential?
An eval set is a curated dataset of inputs — paired with reference answers or grading criteria — that you run your model/prompt against to measure quality objectively. It plays the role unit tests play in software: without it, prompt or model changes are guesswork ("it seems better"). A good eval set covers representative cases, edge cases, and known failure modes. Running it on every change catches regressions (a prompt tweak that fixes one case but breaks five others). Eval-driven development treats prompts and model choices as hypotheses validated against the eval set.
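The unit-test analogy can be sketched as a tiny harness. Everything here is an illustrative assumption: the case format, the exact-match grader, and the two stub functions standing in for an old and a new prompt version. A real setup would call a model and would often need a looser grader than exact match.

```python
from typing import Callable


def run_eval(model: Callable[[str], str], cases: list[dict]) -> dict[str, bool]:
    """Return pass/fail per case id, using exact match after light normalization."""
    return {
        case["id"]: model(case["input"]).strip().lower() == case["expected"].strip().lower()
        for case in cases
    }


def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Case ids that passed before the change and fail after it."""
    return sorted(cid for cid, ok in before.items() if ok and not after.get(cid, False))


# Hypothetical eval set: a representative case, an edge case, a known failure mode.
cases = [
    {"id": "basic", "input": "2+2", "expected": "4"},
    {"id": "edge-empty", "input": "", "expected": "no input"},
    {"id": "known-failure", "input": "10/0", "expected": "undefined"},
]


def old_prompt(text: str) -> str:
    # Stub standing in for the model under the old prompt.
    return {"2+2": "4", "": "No input", "10/0": "0"}.get(text, "")


def new_prompt(text: str) -> str:
    # Stub standing in for the model under the tweaked prompt.
    return {"2+2": "4", "": "", "10/0": "undefined"}.get(text, "")


before = run_eval(old_prompt, cases)
after = run_eval(new_prompt, cases)
print("pass rate before:", sum(before.values()) / len(cases))
print("pass rate after: ", sum(after.values()) / len(cases))
print("regressions:", find_regressions(before, after))
```

In this toy run the overall pass rate is unchanged, yet the per-case comparison shows that the tweak fixed the known failure while breaking the empty-input case. This is why comparing case by case, not just in aggregate, is useful.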
What is the "LLM-as-judge" evaluation technique?
LLM-as-judge uses a strong model (the judge) to score outputs against criteria such as relevance, correctness, tone, and faithfulness, replacing or augmenting expensive human evaluation. You give the judge the input, the output, and a rubric, and it returns a score and rationale. This scales evaluation to thousands of cases cheaply. Caveats: judges have biases (favoring longer answers, favoring their own style, and position bias, where the verdict depends on the order in which answers are shown), so practitioners calibrate the judge against human labels, often prefer pairwise comparison to absolute scoring (swapping or randomizing the answer order to counter position bias), and validate that the judge correlates with human judgment before trusting it.
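Two of these safeguards can be sketched in a few lines: asking a pairwise judge twice with the order swapped, and measuring agreement with human labels. The `Judge` signature, the label strings, and the two deliberately biased stub judges are assumptions made for illustration. In practice the judge would be a call to a model with a rubric in the prompt.

```python
from typing import Callable

# Assumed interface: judge(question, answer_a, answer_b) returns "A" or "B".
Judge = Callable[[str, str, str], str]


def pairwise_verdict(judge: Judge, question: str, first: str, second: str) -> str:
    """Ask twice with the order swapped; only a consistent preference counts."""
    forward = judge(question, first, second)   # `first` shown in position A
    backward = judge(question, second, first)  # `first` shown in position B
    if forward == "A" and backward == "B":
        return "first"
    if forward == "B" and backward == "A":
        return "second"
    return "tie"  # the verdict followed the position, not the content


def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Fraction of cases where the judge's label matches the human label."""
    if len(judge_labels) != len(human_labels) or not human_labels:
        raise ValueError("label lists must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge_labels, human_labels)) / len(human_labels)


def position_biased_judge(question: str, a: str, b: str) -> str:
    # Stub that always prefers whatever is shown first.
    return "A"


def length_biased_judge(question: str, a: str, b: str) -> str:
    # Stub that always prefers the longer answer.
    return "A" if len(a) >= len(b) else "B"


q = "What is the capital of France?"
print(pairwise_verdict(position_biased_judge, q, "Paris.", "It is Paris."))
print(pairwise_verdict(length_biased_judge, q, "Paris.", "It is Paris, of course."))
print(agreement_rate(["first", "second", "tie", "first"],
                     ["first", "second", "second", "first"]))
```

Swapping the order turns the position-biased judge's verdicts into ties, but it does nothing against the length-biased judge, which consistently picks the longer answer. Catching that kind of bias is what the comparison against human labels is for.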
In the RAGAS framework for evaluating RAG systems, what does "faithfulness" measure?
Faithfulness measures whether the answer is grounded in the retrieved documents — every claim in the answer should be supported by the context, with no fabrication. It is distinct from answer relevancy (does the answer address the question?) and context relevancy/precision (did retrieval fetch the right documents?). RAGAS decomposes RAG quality into these complementary metrics because a RAG system can fail in different ways: retrieving wrong docs, retrieving right docs but ignoring them, or answering off-topic. Faithfulness specifically targets hallucination relative to the provided context.
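The idea of "every claim supported by the context" can be written as a ratio: supported claims divided by total claims. The sketch below is not the RAGAS implementation or its API. It assumes the answer has already been split into claims, and it uses a toy token-overlap check (`naive_is_supported`, a hypothetical helper) where a real pipeline would use an LLM or an entailment model as the verifier.

```python
import re
from typing import Callable


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def naive_is_supported(claim: str, context: str, threshold: float = 0.8) -> bool:
    """Toy stand-in for a verifier: share of claim tokens that appear in the context."""
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return False
    overlap = len(claim_tokens & _tokens(context)) / len(claim_tokens)
    return overlap >= threshold


def faithfulness(
    claims: list[str],
    context: str,
    is_supported: Callable[[str, str], bool] = naive_is_supported,
) -> float:
    """Fraction of the answer's claims that the context supports (0.0 to 1.0)."""
    if not claims:
        raise ValueError("need at least one claim")
    return sum(is_supported(claim, context) for claim in claims) / len(claims)


context = "The Eiffel Tower is in Paris. It was completed in 1889."
claims = [
    "The Eiffel Tower is in Paris",
    "The Eiffel Tower was completed in 1889",
    "The Eiffel Tower is 500 meters tall",  # not stated anywhere in the context
]
score = faithfulness(claims, context)
print(round(score, 3))
```

Two of the three claims are backed by the context, so the score is 2/3. Note that the score says nothing about whether the context itself is correct or whether the answer addresses the question. Those are the jobs of the other metrics.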
What is a "benchmark" like MMLU or HumanEval used for?
A benchmark is a standardized, public dataset and scoring method for comparing models on a capability. MMLU (Massive Multitask Language Understanding) tests broad academic knowledge across 57 subjects via multiple-choice questions. HumanEval measures functional code generation: does the generated code pass the unit tests supplied with each problem? Benchmarks enable leaderboards and standardized comparison, but they have limits: contamination (benchmark data leaking into training sets), saturation (top models score so close to the ceiling that the benchmark no longer separates them), and gaming (tuning a model to the benchmark rather than to the underlying capability). They measure narrow proxies, so production systems still need task-specific eval sets reflecting real use.
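Functional-correctness scoring in the style of HumanEval can be sketched as "run the generated code, then run the tests". The helper names and the toy problem below are illustrative assumptions, not the official harness. `pass_at_k` follows the unbiased pass@k estimator commonly reported for HumanEval, 1 - C(n-c, k) / C(n, k), where n samples were drawn and c of them passed.

```python
from math import comb


def passes_tests(candidate_source: str, test_source: str) -> bool:
    """Run generated code, then its unit tests, in a fresh namespace.

    WARNING: exec() on model output is unsafe outside a sandbox. A real harness
    isolates this step (separate process, timeout, no network access).
    """
    namespace: dict = {}
    try:
        exec(candidate_source, namespace)  # define the candidate function
        exec(test_source, namespace)       # assertions raise on failure
    except Exception:
        return False
    return True


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimated probability that at least one of k samples passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical problem: three sampled completions for an `add` function.
samples = [
    "def add(a, b):\n    return a + b\n",   # correct
    "def add(a, b):\n    return a - b\n",   # wrong logic
    "def add(a, b) return a + b\n",         # syntax error
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

results = [passes_tests(sample, tests) for sample in samples]
n, c = len(results), sum(results)
print(results)
print("pass@1:", round(pass_at_k(n, c, 1), 3))
print("pass@2:", round(pass_at_k(n, c, 2), 3))
```

The sketch does not guard against infinite loops or malicious code, which is one reason real harnesses run each candidate in an isolated process with a time limit.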