Practice vocabulary for evaluating LLMs in production applications: eval suites, hallucination rate tracking, LLM-as-judge, golden datasets, and continuous evaluation.
1 / 5
The team says: 'Our ___ suite runs on every prompt change in CI.' What is an eval suite?
An eval suite is an automated set of test cases — each with an input prompt and an expected output or quality criterion — that runs in CI to catch regressions in LLM behaviour whenever the prompt, model, or retrieval logic changes.
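As a rough illustration, an eval suite can be sketched as a list of cases, each pairing a prompt with a check on the output. The names `EvalCase`, `run_suite`, and `fake_model` are hypothetical, and `fake_model` stands in for a real model call.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    prompt: str
    check: Callable[[str], bool]  # quality criterion applied to the model output


def run_suite(model: Callable[[str], str], cases: List[EvalCase]) -> float:
    """Return the fraction of cases whose output passes its check."""
    passed = sum(1 for case in cases if case.check(model(case.prompt)))
    return passed / len(cases)


# Hypothetical stand-in for a real LLM call (assumption, not a real API).
def fake_model(prompt: str) -> str:
    return "Paris" if "France" in prompt else "I don't know"


cases = [
    EvalCase("What is the capital of France?", lambda out: "Paris" in out),
    EvalCase("What is the capital of Peru?", lambda out: "Lima" in out),
]
pass_rate = run_suite(fake_model, cases)  # one of two cases passes
```

In CI, a script like this would typically run on every prompt change and report the pass rate.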
2 / 5
The observability dashboard tracks ___ rate in production to monitor how often the model invents facts.
Hallucination rate measures how often the LLM produces outputs that contain invented, incorrect, or unsupported facts. Tracking it in production typically involves sampling outputs and having humans or a judge model label them as factual or hallucinated.
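A minimal sketch of the sampling-and-labelling approach, assuming reviewers (human or judge model) return one boolean label per sampled output. The function names and the sample size are illustrative assumptions.

```python
import random
from typing import List, Sequence


def sample_for_review(outputs: Sequence[str], k: int, seed: int = 0) -> List[str]:
    """Pick up to k production outputs at random for labelling."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    return rng.sample(list(outputs), min(k, len(outputs)))


def hallucination_rate(labels: Sequence[bool]) -> float:
    """labels[i] is True when sampled output i was marked as hallucinated."""
    if not labels:
        raise ValueError("need at least one labelled output")
    return sum(labels) / len(labels)


# Example: four reviewed outputs, one labelled as hallucinated.
rate = hallucination_rate([False, True, False, False])
```

Because the rate is estimated from a sample, larger samples give a steadier number on the dashboard.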
3 / 5
The team uses an ___ judge to rate response quality at scale instead of relying solely on human raters.
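An LLM judge (LLM-as-judge) is a separate LLM prompted to evaluate another LLM's outputs according to a rubric, rating qualities such as relevance, accuracy, tone, and completeness. It scales human-like evaluation to millions of examples, though it introduces its own biases (for example, favouring longer answers or whichever option is shown first), so its ratings are usually spot-checked against human labels.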
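One way a judge call might look, sketched under assumptions: the rubric wording, the `Score: <n>` reply format, and `fake_judge` (a stand-in for a real judge model) are all hypothetical.

```python
import re
from typing import Callable, Optional

# Illustrative rubric; a real one would be tuned to the application.
RUBRIC = (
    "Rate the response from 1 (poor) to 5 (excellent) for relevance, "
    "accuracy, tone, and completeness. Reply with 'Score: <n>'."
)


def judge(question: str, response: str,
          judge_model: Callable[[str], str]) -> Optional[int]:
    """Ask the judge model for a 1-5 score; return None if the reply is unparseable."""
    prompt = f"{RUBRIC}\n\nQuestion: {question}\nResponse: {response}"
    reply = judge_model(prompt)
    match = re.search(r"Score:\s*([1-5])\b", reply)
    return int(match.group(1)) if match else None


# Hypothetical stand-in for a real judge LLM call.
def fake_judge(prompt: str) -> str:
    return "Score: 4"


score = judge("What is an eval suite?", "A set of automated test cases.", fake_judge)
```

Returning `None` for unparseable replies keeps malformed judge output from being silently counted as a score.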
4 / 5
Quality assurance is based on a ___ dataset of 500 hand-curated examples with verified correct answers.
A golden dataset is a manually curated set of examples where the correct answer or ideal output is known and verified. Evaluating against a golden dataset provides a reliable quality signal because the ground truth was established with care, not automatically generated.
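A small sketch of scoring against a golden dataset, assuming exact match after light normalisation is an acceptable metric. The three examples and `fake_model` are made up for illustration.

```python
from typing import Callable, Dict, List


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not count as errors."""
    return " ".join(text.lower().split())


def golden_accuracy(model: Callable[[str], str],
                    golden: List[Dict[str, str]]) -> float:
    """Fraction of golden examples where the model output matches the verified answer."""
    correct = sum(
        normalize(model(ex["input"])) == normalize(ex["expected"]) for ex in golden
    )
    return correct / len(golden)


# Hypothetical hand-verified examples.
golden = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "Capital of Japan", "expected": "Tokyo"},
    {"input": "Capital of Peru", "expected": "Lima"},
]

# Hypothetical stand-in for a real model; it gets the last example wrong.
def fake_model(prompt: str) -> str:
    return {"2 + 2": "4", "Capital of Japan": "  tokyo "}.get(prompt, "unknown")


accuracy = golden_accuracy(fake_model, golden)  # 2 of 3 correct
```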
5 / 5
The engineering team runs ___ evaluation to catch quality regressions before each release.
Continuous evaluation means the eval suite runs automatically in CI/CD — triggered by pull requests, prompt changes, or model updates. It catches quality regressions before they reach users, analogous to automated testing in traditional software development.
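A CI step for continuous evaluation could be sketched as a gate that compares the current score with a stored baseline. The `tolerance` value and function names are assumptions; the process exit code is what would make the pipeline pass or fail.

```python
def regression_gate(current: float, baseline: float, tolerance: float = 0.02) -> bool:
    """Pass when the current score has not dropped more than `tolerance` below baseline."""
    return current >= baseline - tolerance


def exit_code(current: float, baseline: float, tolerance: float = 0.02) -> int:
    """Map the gate result to a process exit code: 0 passes the CI job, 1 fails it."""
    return 0 if regression_gate(current, baseline, tolerance) else 1


# Example: baseline pass rate 0.92; a small dip passes, a large drop fails.
small_dip = exit_code(0.91, 0.92)
large_drop = exit_code(0.85, 0.92)
```

A CI script would typically end with `sys.exit(exit_code(...))` so that a regression blocks the release.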