MMLU, HumanEval, GPQA, MT-Bench, BIG-bench, HellaSwag — how benchmarks work and why saturation matters.
Key vocabulary
MMLU (Massive Multitask Language Understanding) — tests knowledge across 57 academic subjects.
HumanEval — measures code generation ability via functional correctness of Python solutions.
GPQA (Graduate-Level Google-Proof Q&A) — expert-level science questions designed to resist web search.
Benchmark saturation — when top models cluster near ceiling performance, making differentiation difficult.
Teaching to the test — concern that models are trained on or fine-tuned specifically for benchmark data.
1 / 5
A colleague says “MMLU is saturating.” What does this mean?
Benchmark saturation means top models cluster near the ceiling, making it hard to tell which is truly better. MMLU reached this point once GPT-4 and its competitors all scored in the high 80s to low 90s (percent accuracy). A saturated benchmark loses its value as a differentiator, because the remaining gaps between models are too small to interpret.
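One rough way to put numbers on "clustering near the ceiling" is to compare the headroom left above the best score with the spread among the top few models. The sketch below is illustrative only: the function name, the top_n cutoff, and every score in the example are made up for the illustration and are not real leaderboard results.

```python
def saturation_summary(scores, ceiling=100.0, top_n=3):
    """Summarise how tightly the top models cluster below the ceiling.

    scores: mapping of model name -> benchmark score (same scale as ceiling).
    Returns the headroom above the best model and the spread among the top_n.
    """
    if not scores:
        raise ValueError("scores must not be empty")
    top = sorted(scores.values(), reverse=True)[:top_n]
    return {
        "headroom": ceiling - top[0],  # room left above the best model
        "spread": top[0] - top[-1],    # gap between best and top_n-th model
    }


# Hypothetical scores, invented purely for this example.
example_scores = {"model_a": 90.0, "model_b": 89.0, "model_c": 88.5, "model_d": 70.0}
summary = saturation_summary(example_scores)
print(summary)
```

A small spread combined with small headroom is the pattern described above: the leaders are separated by less than the benchmark can reliably resolve.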
2 / 5
HumanEval primarily measures a model’s ability to:
HumanEval presents 164 hand-written Python problems, each given as a function signature with a docstring. A generated solution passes only if it satisfies all of the problem's unit tests, which the model does not see in the prompt. The key metric is pass@k: the probability that at least one of k generated solutions passes. It measures functional correctness, not style.
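In practice pass@k is usually estimated by drawing n samples per problem (with n at least k), counting the c samples that pass, and computing 1 - C(n-c, k) / C(n, k), then averaging over problems. The following is a minimal sketch of that per-problem estimator; the function name and argument checks are choices made for this example, not a reference implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem
    c: number of those samples that passed every unit test
    k: sample budget being evaluated (1 <= k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: any draw of k contains a pass.
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 10 samples, 3 of which pass.
print(pass_at_k(10, 3, 1))
print(pass_at_k(10, 3, 5))
```

With k = 1 the estimate reduces to the plain pass rate c / n; larger k rewards a model that gets the answer right at least occasionally.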
3 / 5
GPQA questions are described as “Google-proof.” This means:
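GPQA (Graduate-Level Google-Proof Q&A) contains multiple-choice questions in biology, chemistry, and physics written by domain experts. The questions are hard enough that experts in the relevant field still miss a substantial share of them, and skilled non-experts score far lower even with unrestricted web access. “Google-proof” means that searching the web does not readily reveal the answer, so the benchmark tests reasoning rather than retrieval.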
4 / 5
A researcher raises concern about a model “teaching to the test” on BIG-bench. What is the concern?
“Teaching to the test” covers two related worries: contamination, where benchmark questions leak into the training data, and benchmark overfitting, where fine-tuning specifically targets benchmark performance. Either way, a high score may reflect memorisation or narrow tuning, and the model may not generalise to real-world tasks.
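A common first check for contamination is to look for long word n-grams shared between a benchmark item and a training document. The sketch below assumes simple whitespace tokenisation and an arbitrary n-gram length; the function names, the default n, and the example strings are all hypothetical, and real pipelines typically add text normalisation and far larger corpora.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item, training_doc, n=8):
    """Fraction of the item's n-grams that also appear in the training doc."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item shorter than n tokens: nothing to compare.
        return 0.0
    shared = item_grams & ngrams(training_doc, n)
    return len(shared) / len(item_grams)


# Hypothetical strings, invented for this example.
item = "what is the capital of france and why"
leaked_doc = "quiz dump: what is the capital of france and why answer paris"
clean_doc = "paris is a city in europe known for art"

print(overlap_fraction(item, leaked_doc, n=5))
print(overlap_fraction(item, clean_doc, n=5))
```

A high overlap fraction flags an item for manual review. A low one does not prove the item is clean, since paraphrased or translated copies share few exact n-grams.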
5 / 5
MT-Bench is used to evaluate models on:
MT-Bench (Multi-Turn Benchmark) evaluates chat models on two-turn conversations across eight categories including reasoning, coding, math, and writing. It uses GPT-4 as an automated judge, scoring responses 1–10. It was introduced alongside LMSYS Chatbot Arena to benchmark instruction-following chat models.