Model Evaluation Failure Vocabulary

Practice vocabulary for model evaluation failures: data leakage in evaluation, memorising test examples, benchmark gaming, fine-tuning on evaluation sets, and reliability failures.

1 / 5

The post-mortem reveals ___ leakage in the evaluation: test examples appeared in the training set.

2 / 5

A researcher warns: 'The model ___ test examples.' What does this mean for the benchmark scores?

3 / 5

A critic accuses a lab of ___ gaming: optimising specifically for a benchmark without improving real capability.

4 / 5

An audit finds: 'This model was fine-tuned ___ the evaluation set.' Why is this a serious problem?

5 / 5

The team documents a ___ failure: the model produces correct answers 95% of the time but fails catastrophically on edge cases.