Prompt Testing Vocabulary

Prompt regression testing, prompt versioning, A/B prompt comparison, golden datasets, eval harnesses, PromptFoo, and Braintrust vocabulary.

Key vocabulary

Prompt regression testing — running a fixed set of test cases against a prompt after every change, to verify that new edits do not break previously working behaviour.
Prompt versioning — tracking prompt changes with version identifiers (v1, v2…) so you can reproduce results and roll back if quality degrades.
Golden dataset — a curated set of inputs with known correct outputs used as the ground truth for evaluating prompts.
Eval harness — the infrastructure (code + datasets + metrics) that runs evaluations automatically and reports results.
A/B prompt comparison — running two prompt variants on the same inputs and comparing outputs to determine which performs better.

0 / 5 completed

1 / 5

A team runs their full prompt test suite after every PR that modifies a system prompt. This practice is called:

2 / 5

What is a golden dataset in the context of prompt evaluation?

3 / 5

Your team stores prompt changes as v1.0, v1.1, v2.0 in a prompt registry. This practice is:

4 / 5

PromptFoo is described as an eval harness for prompts. What does this mean?

5 / 5

In an A/B prompt comparison, what are you trying to determine?