Eval-as-Code Vocabulary

Evaluation harnesses, golden datasets, LLM-as-judge, regression testing, and modern eval tooling vocabulary.

Key vocabulary

  • Evaluation harness — the framework and infrastructure that runs eval cases against a model and collects results.
  • Golden dataset — a curated set of inputs with known correct outputs used as ground truth for evaluation.
  • LLM-as-judge — using a language model (often GPT-4 or Claude) to evaluate another model’s output at scale.
  • Eval suite — a collection of evaluation cases grouped by task type or quality dimension.
  • Regression testing for models — running the eval suite after each model change to catch quality degradations.
0 / 5 completed
1 / 5
Your team uses a “golden dataset” to evaluate a new model version. What is a golden dataset?