Evaluation harnesses, golden datasets, LLM-as-judge, regression testing, and modern eval tooling vocabulary.
Key vocabulary
Evaluation harness — the framework and infrastructure that runs eval cases against a model and collects results.
Golden dataset — a curated set of inputs with known correct outputs used as ground truth for evaluation.
LLM-as-judge — using a language model (often GPT-4 or Claude) to evaluate another model’s output at scale.
Eval suite — a collection of evaluation cases grouped by task type or quality dimension.
Regression testing for models — running the eval suite after each model change to catch quality degradations.
1 / 5
Your team uses a “golden dataset” to evaluate a new model version. What is a golden dataset?
A golden dataset (also called a gold-standard dataset) contains carefully vetted input-output pairs that represent correct behaviour. Evaluation compares model outputs against these gold answers. Golden datasets must be kept separate from training data, and refreshed periodically, to prevent contamination.
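To make the comparison against gold answers concrete, here is a minimal sketch in Python. It assumes exact-match scoring after light normalisation, which is only one of several possible scoring rules. The names `GoldenCase`, `exact_match_accuracy`, and `contaminated_inputs` are hypothetical and not taken from any particular tool, and the sample data is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GoldenCase:
    """One vetted input with its known correct output (hypothetical structure)."""
    input: str
    expected: str


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(text.lower().split())


def exact_match_accuracy(cases: list[GoldenCase], outputs: list[str]) -> float:
    """Fraction of model outputs that match the gold answer after normalisation."""
    if len(cases) != len(outputs):
        raise ValueError("need exactly one output per golden case")
    if not cases:
        return 0.0
    hits = sum(normalize(c.expected) == normalize(o) for c, o in zip(cases, outputs))
    return hits / len(cases)


def contaminated_inputs(cases: list[GoldenCase], training_inputs: list[str]) -> list[str]:
    """Golden inputs that also appear verbatim (after normalisation) in training data."""
    seen = {normalize(t) for t in training_inputs}
    return [c.input for c in cases if normalize(c.input) in seen]


# Illustrative data only.
golden = [
    GoldenCase("What is 2 + 2?", "4"),
    GoldenCase("Capital of France?", "Paris"),
    GoldenCase("Opposite of hot?", "cold"),
]
model_outputs = ["4", "paris", "warm"]

accuracy = exact_match_accuracy(golden, model_outputs)
leaked = contaminated_inputs(golden, ["capital of  france?", "Largest planet?"])
print(f"accuracy={accuracy:.2f}, leaked={leaked}")
```

The verbatim overlap check is deliberately crude: it catches only identical inputs, so paraphrased leakage would need a fuzzier comparison.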
2 / 5
A team uses “LLM-as-judge” in their eval pipeline. What does this approach involve?
LLM-as-judge uses a frontier model as an automated evaluator, which is cheaper and more scalable than human annotation. It works well for open-ended tasks where there’s no single correct answer. Limitations include position bias (favouring an answer because of where it appears, for example the first option) and a tendency to reward outputs that merely sound confident. Calibrating the judge against human labels is important.
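One common way to reduce position bias is to ask the judge twice with the answer order swapped and accept a verdict only when both runs agree. The sketch below assumes a hypothetical judge interface, a callable that returns "first" or "second"; it uses stand-in judges instead of a real model API. `agreement_rate` is a simple calibration check against human labels, also hypothetical.

```python
from typing import Callable

# Hypothetical judge interface: (question, first_answer, second_answer) -> "first" | "second"
Judge = Callable[[str, str, str], str]


def debiased_preference(judge: Judge, question: str, answer_a: str, answer_b: str) -> str:
    """Query the judge in both orders; return "a", "b", or "tie" if the verdicts conflict."""
    forward = judge(question, answer_a, answer_b)   # answer_a is shown first
    backward = judge(question, answer_b, answer_a)  # answer_b is shown first
    a_wins_forward = forward == "first"
    a_wins_backward = backward == "second"
    if a_wins_forward and a_wins_backward:
        return "a"
    if not a_wins_forward and not a_wins_backward:
        return "b"
    return "tie"  # the verdict flipped with the order, so it is not trusted


def agreement_rate(judge_labels: list[str], human_labels: list[str]) -> float:
    """Share of items where the judge's label equals the human label."""
    if len(judge_labels) != len(human_labels):
        raise ValueError("label lists must have the same length")
    if not judge_labels:
        return 0.0
    matches = sum(j == h for j, h in zip(judge_labels, human_labels))
    return matches / len(judge_labels)


# Stand-in judges for illustration; a real judge would call a model.
def always_first_judge(question: str, first: str, second: str) -> str:
    return "first"  # maximally position-biased


def longer_answer_judge(question: str, first: str, second: str) -> str:
    return "first" if len(first) >= len(second) else "second"


biased_verdict = debiased_preference(always_first_judge, "q", "x", "y")
stable_verdict = debiased_preference(longer_answer_judge, "q", "a long detailed answer", "short")
calibration = agreement_rate(["a", "b", "tie", "a"], ["a", "b", "a", "a"])
print(biased_verdict, stable_verdict, calibration)
```

The fully position-biased judge yields a tie because its verdict flips with the order, while the order-independent judge gives a stable preference. A raw agreement rate is the simplest calibration measure; chance-corrected statistics such as Cohen's kappa are a stricter alternative.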
3 / 5
After a model update, your eval suite shows a 4% drop in accuracy on customer support queries. In eval-as-code terms, this is called a:
This is a regression. Regression testing for models mirrors software regression testing: run the eval suite before and after a change and flag any score drops. This is foundational to treating model evaluation as code: evals run in CI, block deploys on regression, and alert the team. Tools like Braintrust and PromptFoo are built around this workflow.
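A before-and-after check like the one described can be sketched in a few lines of Python. The function `find_regressions`, the suite names, the scores, and the 0.01 tolerance are all assumptions chosen for illustration; real tools expose their own configuration for this.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    tolerance: float = 0.01,
) -> dict[str, float]:
    """Return {suite: drop} for every suite whose score fell by more than `tolerance`.

    A suite missing from `candidate` is scored as 0.0 so that it cannot pass silently.
    """
    regressions: dict[str, float] = {}
    for suite, base_score in baseline.items():
        drop = base_score - candidate.get(suite, 0.0)
        if drop > tolerance:
            regressions[suite] = round(drop, 4)
    return regressions


# Illustrative scores: accuracy per suite before and after a model update.
baseline = {"customer_support": 0.91, "summarization": 0.84}
candidate = {"customer_support": 0.87, "summarization": 0.845}

regressions = find_regressions(baseline, candidate)
exit_code = 1 if regressions else 0  # in CI, pass this to sys.exit() to block the deploy
print(regressions, exit_code)
```

The tolerance exists because eval scores are noisy; without it, small random fluctuations would block deploys.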
4 / 5
What is the role of an “evaluation harness” in an eval-as-code setup?
An evaluation harness is the plumbing of an eval system: it reads test cases from a dataset, calls the model API, applies scoring logic (exact match, LLM-as-judge, custom metrics), and produces a structured report. Examples include EleutherAI’s lm-evaluation-harness (for open benchmarks) and Braintrust or PromptFoo (for production eval pipelines).
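The four steps above (read cases, call the model, score, report) can be sketched as a toy harness. This is an illustration under stated assumptions, not how lm-evaluation-harness, Braintrust, or PromptFoo are implemented: `run_harness`, the case fields `id`, `input`, and `expected`, and the `fake_model` stand-in are all hypothetical.

```python
import json
from typing import Callable


def run_harness(
    cases: list[dict],
    model: Callable[[str], str],
    scorer: Callable[[str, str], float],
) -> dict:
    """Run every case through the model, score each output, and build a structured report."""
    results = []
    for case in cases:
        output = model(case["input"])  # step 2: call the model
        score = scorer(output, case["expected"])  # step 3: apply scoring logic
        results.append({"id": case["id"], "output": output, "score": score})
    mean_score = sum(r["score"] for r in results) / len(results) if results else 0.0
    return {"num_cases": len(results), "mean_score": mean_score, "results": results}


def exact_match(output: str, expected: str) -> float:
    """1.0 if the output equals the expected answer, ignoring case and outer whitespace."""
    return 1.0 if output.strip().lower() == expected.strip().lower() else 0.0


def fake_model(prompt: str) -> str:
    """Stand-in for a real model API call, returning canned answers."""
    canned = {"What is 2 + 2?": "4", "Capital of France?": "Paris", "Opposite of hot?": "warm"}
    return canned.get(prompt, "")


# Step 1: test cases, inlined here; a real harness would load them from a dataset file.
cases = [
    {"id": "math-1", "input": "What is 2 + 2?", "expected": "4"},
    {"id": "geo-1", "input": "Capital of France?", "expected": "Paris"},
    {"id": "vocab-1", "input": "Opposite of hot?", "expected": "cold"},
]

report = run_harness(cases, fake_model, exact_match)  # step 4: structured report
print(json.dumps(report, indent=2))
```

Passing the model and scorer in as plain callables keeps the harness independent of any one provider or metric, so an LLM-as-judge scorer could be swapped in without changing the loop.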
5 / 5
PromptFoo, Braintrust, and Langfuse are all tools primarily used for:
PromptFoo is an open-source CLI for testing and comparing prompts. Braintrust provides an eval platform with experiment tracking and human annotation. Langfuse focuses on LLM observability and eval tracking in production. All three embody eval-as-code principles: define evals in code, run them in CI, and track results over time.