Eval-as-Code Vocabulary

Evaluation harnesses, golden datasets, LLM-as-judge, regression testing, and modern eval tooling vocabulary.

Key vocabulary

Evaluation harness — the framework and infrastructure that runs eval cases against a model and collects results.
Golden dataset — a curated set of inputs with known correct outputs used as ground truth for evaluation.
LLM-as-judge — using a language model (often GPT-4 or Claude) to evaluate another model’s output at scale.
Eval suite — a collection of evaluation cases grouped by task type or quality dimension.
Regression testing for models — running the eval suite after each model change to catch quality degradations.
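
To show how these terms fit together, here is a minimal Python sketch of a tiny evaluation harness. Everything in it is a hypothetical illustration and not the API of any particular tool: the EvalCase type, the GOLDEN_DATASET contents, the run_suite and is_regression helpers, the 0.02 tolerance, and the two stand-in "models" (plain functions that return canned answers) are all assumptions. The grader here is a simple exact-match check; in an LLM-as-judge setup, that grader would instead be a function that asks a judge model to score the output.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class EvalCase:
    """One eval case: an input plus its known correct output."""
    prompt: str
    expected: str


# Golden dataset: curated inputs with known correct outputs (made-up examples).
GOLDEN_DATASET: List[EvalCase] = [
    EvalCase("2 + 2", "4"),
    EvalCase("capital of France", "Paris"),
    EvalCase("opposite of hot", "cold"),
    EvalCase("3 * 3", "9"),
]


def exact_match(output: str, expected: str) -> bool:
    """Simplest possible grader; an LLM judge could be swapped in here."""
    return output.strip().lower() == expected.strip().lower()


def run_suite(
    model: Callable[[str], str],
    cases: List[EvalCase],
    grader: Callable[[str, str], bool] = exact_match,
) -> float:
    """Harness core: run every case against the model and return accuracy."""
    passed = sum(grader(model(case.prompt), case.expected) for case in cases)
    return passed / len(cases)


def is_regression(baseline: float, candidate: float, tolerance: float = 0.02) -> bool:
    """Flag a regression when accuracy drops by more than the tolerance."""
    return baseline - candidate > tolerance


# Stand-in "models": lookup tables instead of real model calls.
_V1_ANSWERS = {"2 + 2": "4", "capital of France": "Paris",
               "opposite of hot": "cold", "3 * 3": "9"}
_V2_ANSWERS = {**_V1_ANSWERS, "3 * 3": "6"}  # the new version breaks one case


def model_v1(prompt: str) -> str:
    return _V1_ANSWERS.get(prompt, "")


def model_v2(prompt: str) -> str:
    return _V2_ANSWERS.get(prompt, "")


if __name__ == "__main__":
    baseline = run_suite(model_v1, GOLDEN_DATASET)
    candidate = run_suite(model_v2, GOLDEN_DATASET)
    print(f"baseline={baseline:.2f} candidate={candidate:.2f}")
    print("regression detected:", is_regression(baseline, candidate))
```

In this sketch the eval suite is just the list of cases passed to run_suite. Regression testing amounts to rerunning the same suite after each model change and comparing the new score against the stored baseline.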

1 / 5

Your team uses a “golden dataset” to evaluate a new model version. What is a golden dataset?

2 / 5

A team uses “LLM-as-judge” in their eval pipeline. What does this approach involve?

3 / 5

After a model update, your eval suite shows a 4% drop in accuracy on customer support queries. In eval-as-code terms, this is called a:

4 / 5

What is the role of an “evaluation harness” in an eval-as-code setup?

5 / 5

Promptfoo, Braintrust, and Langfuse are all tools primarily used for: