Practice vocabulary for model evaluation failures: data leakage in evaluation, memorising test examples, benchmark gaming, fine-tuning on evaluation sets, and reliability failures.
1 / 5
The post-mortem reveals ___ leakage in the evaluation: test examples appeared in the training set.
Data leakage in evaluation occurs when information from the test set contaminates the training process, either through direct overlap or through preprocessing steps that use statistics computed on the full dataset. The reported scores then overstate how well the model handles genuinely unseen data.
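Both routes to leakage described above can be illustrated with a minimal Python sketch. The helper names (`find_overlap`, `centre`) and the toy data are hypothetical, chosen only for illustration; real contamination checks usually need fuzzier matching than exact comparison after normalisation.

```python
from statistics import mean


def normalise(text):
    """Lower-case and collapse whitespace so trivial variants still match."""
    return " ".join(text.lower().split())


def find_overlap(train_texts, test_texts):
    """Return the test examples that also appear in the training set."""
    train_seen = {normalise(t) for t in train_texts}
    return [t for t in test_texts if normalise(t) in train_seen]


def centre(values, reference):
    """Subtract the mean of `reference` from every item in `values`."""
    mu = mean(reference)
    return [v - mu for v in values]


# Route 1: direct overlap between training and test examples.
train_texts = ["The cat sat.", "Dogs bark loudly."]
test_texts = ["the cat  sat.", "Birds sing."]
leaked = find_overlap(train_texts, test_texts)  # ["the cat  sat."]

# Route 2: preprocessing statistics computed on the full dataset.
train_x = [1.0, 2.0, 3.0]
test_x = [10.0]
leaky_train = centre(train_x, train_x + test_x)  # mean includes the test point
clean_train = centre(train_x, train_x)           # mean uses training data only
```

In the second case the test value shifts the mean from 2.0 to 4.0, so the training features already carry information about the test set before any model is fitted.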
2 / 5
A researcher warns: 'The model ___ test examples.' What does this mean for the benchmark scores?
When a model memorises test examples, it has seen them during training and simply recalls the answers rather than demonstrating genuine generalisation. Benchmark scores are inflated and do not reflect real-world performance.
3 / 5
A critic accuses a lab of ___ gaming: optimising specifically for a benchmark without improving real capability.
Benchmark gaming means optimising a model specifically to score well on a known benchmark (through targeted fine-tuning, prompt engineering tailored to that test, or cherry-picking favourable evaluation examples) without improving the model's actual capability.
4 / 5
An audit finds: 'This model was fine-tuned ___ the evaluation set.' Why is this a serious problem?
Fine-tuning on the evaluation set means the model has been trained using the very data meant to assess it. This completely invalidates the evaluation — the model has effectively 'seen the answers' and cannot be fairly assessed on that benchmark.
5 / 5
The team documents a ___ failure: the model produces correct answers 95% of the time but fails catastrophically on edge cases.
A reliability failure means the model is not dependable in production — high average accuracy masks dangerous failures on edge cases, adversarial inputs, or distribution shifts. Reliability evaluation requires stress testing beyond standard benchmarks.
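One way to see how a high average hides such failures is to break accuracy down by input slice. The sketch below is illustrative only: the function `accuracy_by_slice`, the slice names, and the counts are assumptions made up to mirror the 95% figure in the prompt, not results from a real evaluation.

```python
from collections import defaultdict


def accuracy_by_slice(records):
    """Compute accuracy per slice from (slice_name, is_correct) pairs."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for slice_name, is_correct in records:
        totals[slice_name] += 1
        correct[slice_name] += int(is_correct)
    return {name: correct[name] / totals[name] for name in totals}


# Hypothetical results: 95 typical inputs answered correctly,
# 5 edge-case inputs all answered incorrectly.
records = [("typical", True)] * 95 + [("edge_case", False)] * 5

overall = sum(ok for _, ok in records) / len(records)  # 0.95
per_slice = accuracy_by_slice(records)  # {"typical": 1.0, "edge_case": 0.0}
```

The overall figure looks healthy, while the per-slice view shows that every edge case fails, which is the pattern a reliability evaluation is meant to surface.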