Vocabulary for prompt regression testing, prompt versioning, A/B prompt comparison, golden datasets, eval harnesses, PromptFoo, and Braintrust.
Key vocabulary
Prompt regression testing — running a fixed set of test cases against a prompt after every change, to verify that new edits do not break previously working behaviour.
Prompt versioning — tracking prompt changes with version identifiers (v1, v2…) so you can reproduce results and roll back if quality degrades.
Golden dataset — a curated set of inputs with known correct outputs used as the ground truth for evaluating prompts.
Eval harness — the infrastructure (code + datasets + metrics) that runs evaluations automatically and reports results.
A/B prompt comparison — running two prompt variants on the same inputs and comparing outputs to determine which performs better.
1 / 5
A team runs their full prompt test suite after every PR that modifies a system prompt. This practice is called:
Prompt regression testing treats prompts like code: every change is verified against a known-good test suite. This catches accidental quality regressions — for example, adding a new instruction that fixes one case but breaks another. Tools like PromptFoo and Braintrust make this easy to integrate into CI/CD pipelines.
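The idea of a fixed suite that is rerun after every prompt edit can be sketched in a few lines of Python. This is only an illustration: `fake_model`, `REGRESSION_CASES`, and `run_regression` are hypothetical names, and the model is a deterministic stub rather than a real LLM call.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a model call; a real suite would call an LLM API here.
def fake_model(system_prompt: str, user_input: str) -> str:
    reply = f"Refund policy: 30 days. Question was: {user_input}"
    # Toy behaviour: the stub shouts only when the prompt asks for upper case.
    return reply.upper() if "UPPERCASE" in system_prompt else reply

# Fixed regression cases: each pairs an input with a check on the output.
REGRESSION_CASES: List[Tuple[str, Callable[[str], bool]]] = [
    ("Can I return my order?", lambda out: "30 days" in out.lower()),
    ("What is the refund window?", lambda out: out.startswith("Refund policy")),
]

def run_regression(system_prompt: str) -> List[str]:
    """Return the inputs whose checks fail for this prompt."""
    failures = []
    for user_input, check in REGRESSION_CASES:
        if not check(fake_model(system_prompt, user_input)):
            failures.append(user_input)
    return failures

baseline = "You are a support assistant."
edited = "You are a support assistant. Answer in UPPERCASE."

print("baseline failures:", run_regression(baseline))
print("edited failures:", run_regression(edited))
```

In this toy setup the edited prompt still satisfies the first check but breaks the second, which is the kind of side effect a regression suite is meant to surface before a change is merged.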
2 / 5
What is a golden dataset in the context of prompt evaluation?
A golden dataset contains carefully curated (input, expected output) pairs. When you run your prompt against the golden set, the eval harness compares actual outputs to expected ones. Building and maintaining a golden dataset is one of the most important investments in prompt engineering — without it, you are making changes blind.
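As a rough sketch of how actual outputs are compared to expected ones, the Python below scores a stubbed model against a tiny golden set using normalized exact match. The dataset rows, `toy_model`, and `score_against_golden` are made up for illustration, and exact match is only one of several possible comparison rules.

```python
from typing import Dict, List

# Hypothetical golden set: curated (input, expected output) pairs.
GOLDEN_SET: List[Dict[str, str]] = [
    {"input": "2 + 2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
    {"input": "opposite of hot", "expected": "cold"},
]

# Hypothetical stub that returns canned answers instead of calling a model.
def toy_model(user_input: str) -> str:
    canned = {
        "2 + 2": "4",
        "capital of France": "paris ",
        "opposite of hot": "warm",
    }
    return canned[user_input]

def normalize(text: str) -> str:
    """Ignore case and surrounding whitespace when comparing answers."""
    return text.strip().lower()

def score_against_golden(dataset: List[Dict[str, str]]) -> float:
    """Fraction of rows where the model output matches the expected output."""
    hits = sum(
        normalize(toy_model(row["input"])) == normalize(row["expected"])
        for row in dataset
    )
    return hits / len(dataset)

accuracy = score_against_golden(GOLDEN_SET)
print(f"accuracy on golden set: {accuracy:.2f}")
```

Here two of the three rows match after normalization, so any prompt change can be judged by whether that fraction goes up or down.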
3 / 5
Your team stores prompt changes as v1.0, v1.1, v2.0 in a prompt registry. This practice is:
Prompt versioning applies software version control concepts to prompts. It lets you reproduce results from a specific prompt version, roll back to a previous version if a new one degrades performance, and run A/B comparisons between versions. Tools like LangSmith, PromptLayer, and Braintrust provide prompt registries with built-in versioning.
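A prompt registry can be pictured as a small in-memory store. The sketch below is a hypothetical `PromptRegistry` class written for illustration; it does not reflect the API of LangSmith, PromptLayer, or Braintrust.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class PromptRegistry:
    """Hypothetical in-memory registry mapping version ids to prompt text."""
    versions: Dict[str, str] = field(default_factory=dict)
    active: Optional[str] = None

    def register(self, version: str, text: str) -> None:
        # Versions are immutable once stored so results stay reproducible.
        if version in self.versions:
            raise ValueError(f"version already exists: {version}")
        self.versions[version] = text
        self.active = version

    def get(self, version: Optional[str] = None) -> str:
        # Fetch a specific version, or the active one when none is given.
        key = version if version is not None else self.active
        if key is None or key not in self.versions:
            raise KeyError(f"unknown prompt version: {key}")
        return self.versions[key]

    def rollback(self, version: str) -> None:
        # Point the active marker back at an earlier, known-good version.
        if version not in self.versions:
            raise KeyError(f"unknown prompt version: {version}")
        self.active = version

registry = PromptRegistry()
registry.register("v1.0", "Summarize the text.")
registry.register("v1.1", "Summarize the text in one sentence.")
registry.rollback("v1.0")
print(registry.active, "->", registry.get())
```

Refusing to overwrite an existing version is a design choice in this sketch: it keeps "v1.0" meaning the same text forever, which is what makes earlier results reproducible.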
4 / 5
PromptFoo is described as an eval harness for prompts. What does this mean?
An eval harness is the automation layer for prompt evaluation. PromptFoo is an open-source CLI tool that lets you define test cases in YAML, run them against multiple models or prompt variants, and score results with built-in or custom metrics (exact match, LLM-as-judge, regex, etc.). Braintrust is a hosted platform that serves a similar purpose and adds dataset management and tracing.
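To make the "code + datasets + metrics" idea concrete, here is a minimal harness sketch in Python with pluggable metrics. It is not PromptFoo's or Braintrust's configuration format or API; `METRICS`, `TEST_CASES`, `stub_model`, and `run_harness` are all hypothetical names, and only exact-match and regex scoring are shown.

```python
import re
from typing import Callable, Dict, List

def exact_match(output: str, expected: str) -> bool:
    return output.strip() == expected

def regex_match(output: str, pattern: str) -> bool:
    return re.search(pattern, output) is not None

# Metric registry: each test case names the metric it should be scored with.
METRICS: Dict[str, Callable[[str, str], bool]] = {
    "exact": exact_match,
    "regex": regex_match,
}

# Hypothetical test cases, declared as data so they could live in a file.
TEST_CASES: List[Dict[str, str]] = [
    {"input": "ping", "metric": "exact", "value": "pong"},
    {"input": "order 123", "metric": "regex", "value": r"\b\d{3}\b"},
    {"input": "hello", "metric": "exact", "value": "goodbye"},
]

# Hypothetical stub standing in for a real model call.
def stub_model(user_input: str) -> str:
    replies = {
        "ping": "pong",
        "order 123": "Your order 123 has shipped.",
        "hello": "hello there",
    }
    return replies[user_input]

def run_harness(cases: List[Dict[str, str]]) -> List[Dict[str, object]]:
    """Run every case, score it with its metric, and collect the results."""
    results = []
    for case in cases:
        output = stub_model(case["input"])
        passed = METRICS[case["metric"]](output, case["value"])
        results.append({"input": case["input"], "passed": passed})
    return results

report = run_harness(TEST_CASES)
passed_count = sum(1 for r in report if r["passed"])
print(f"{passed_count}/{len(report)} cases passed")
```

Keeping the cases as plain data and the metrics in a lookup table mirrors the general shape of declarative eval tools: new checks are added by editing data, not harness code.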
5 / 5
In an A/B prompt comparison, what are you trying to determine?
An A/B prompt comparison runs prompt variant A and variant B against the same input set, then compares results using a metric (win rate, human preference, task accuracy, etc.). This is how prompt engineers make evidence-based decisions: "Prompt B outperforms Prompt A on 73% of test cases, so we will deploy B." It prevents intuition-only prompt changes from going to production.
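A win-rate comparison of this kind could be computed roughly as follows. The per-case scores are invented purely for illustration (they are not measurements), and `compare_variants` is a hypothetical helper; in practice the scores would come from a metric or from human preference labels.

```python
from typing import Dict

# Made-up per-case scores for two prompt variants on the same four inputs.
SCORES_A: Dict[str, float] = {"q1": 0.6, "q2": 0.9, "q3": 0.4, "q4": 0.7}
SCORES_B: Dict[str, float] = {"q1": 0.8, "q2": 0.7, "q3": 0.9, "q4": 0.7}

def compare_variants(
    scores_a: Dict[str, float], scores_b: Dict[str, float]
) -> Dict[str, float]:
    """Count per-case wins and ties, and report B's win rate over all cases."""
    if scores_a.keys() != scores_b.keys():
        raise ValueError("both variants must be scored on the same inputs")
    wins_a = wins_b = ties = 0
    for case_id in scores_a:
        if scores_b[case_id] > scores_a[case_id]:
            wins_b += 1
        elif scores_b[case_id] < scores_a[case_id]:
            wins_a += 1
        else:
            ties += 1
    return {
        "wins_a": wins_a,
        "wins_b": wins_b,
        "ties": ties,
        "b_win_rate": wins_b / len(scores_a),
    }

summary = compare_variants(SCORES_A, SCORES_B)
print(summary)
```

The check that both variants share the same inputs matters: a win rate is only meaningful when A and B are scored on identical cases. With only a handful of cases, a result like this is weak evidence, so larger input sets are preferable before deploying a variant.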