AI Benchmark Vocabulary

MMLU, HumanEval, GPQA, MT-Bench, BIG-bench, HellaSwag — how benchmarks work and why saturation matters.

Key vocabulary

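MMLU (Massive Multitask Language Understanding) — multiple-choice questions testing knowledge across 57 academic and professional subjects.
HumanEval — measures code generation by running model-written Python solutions against unit tests (functional correctness).
GPQA (Graduate-Level Google-Proof Q&A) — expert-written science questions in biology, physics, and chemistry, designed so that non-experts cannot answer them even with web search.
Benchmark saturation — when top models cluster near the maximum possible score, so the benchmark can no longer tell them apart.
Teaching to the test — the concern that a model was trained or fine-tuned on benchmark data (or data closely resembling it), so its score reflects familiarity with the test rather than general ability.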
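
To make "functional correctness" concrete, here is a minimal sketch of how a HumanEval-style check can work: each candidate solution is executed against unit tests, and the pass counts are turned into a pass@k score, the metric commonly reported for HumanEval. The function names, the toy task, and the candidate solutions are illustrative assumptions, not part of any official harness, and a real harness would run untrusted code in a sandbox.

```python
from math import comb


def passes_tests(candidate_src: str, test_src: str) -> bool:
    """Return True if the candidate code runs and every assertion in test_src holds.

    Illustrative only: exec() on untrusted model output is unsafe outside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(candidate_src, namespace)  # define the candidate function
        exec(test_src, namespace)       # run the unit tests against it
    except Exception:
        return False
    return True


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate the probability that at least one of k samples passes,
    given that c of n generated samples passed: 1 - C(n-c, k) / C(n, k)."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any k samples include a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical task: implement add(a, b). Three made-up model samples.
candidates = [
    "def add(a, b):\n    return a + b\n",
    "def add(a, b):\n    return a - b\n",  # wrong: subtracts
    "def add(a, b):\n    return b + a\n",
]
tests = "assert add(2, 3) == 5\nassert add(-1, 1) == 0\n"

results = [passes_tests(src, tests) for src in candidates]
n, c = len(results), sum(results)
print(results, pass_at_k(n, c, 1))
```

The point of the design is that a solution is scored by what it does when run, not by how closely its text matches a reference answer.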

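
One way to see why saturation makes differentiation difficult is to compare the gap between two models with the sampling noise of the benchmark itself. The sketch below uses a simple binomial standard error and treats the two scores as independent, which is a simplifying assumption (both models answer the same questions, so a paired test would be more precise). All accuracies and the benchmark size are hypothetical numbers chosen for illustration.

```python
from math import sqrt


def standard_error(accuracy: float, n_items: int) -> float:
    """Binomial standard error of an accuracy measured on n_items questions."""
    return sqrt(accuracy * (1.0 - accuracy) / n_items)


def headroom(scores: dict[str, float]) -> float:
    """Distance between the best score and the ceiling of 1.0."""
    return 1.0 - max(scores.values())


def distinguishable(acc_a: float, acc_b: float, n_items: int, z: float = 1.96) -> bool:
    """True if the score gap exceeds a rough 95% margin of error.

    Assumes independent samples; this is an approximation, not a full significance test.
    """
    margin = z * sqrt(standard_error(acc_a, n_items) ** 2
                      + standard_error(acc_b, n_items) ** 2)
    return abs(acc_a - acc_b) > margin


# Hypothetical scores on a hypothetical 200-question benchmark.
saturated = {"model_a": 0.96, "model_b": 0.97}
unsaturated = {"model_a": 0.60, "model_b": 0.70}

print(headroom(saturated), distinguishable(0.96, 0.97, 200))
print(headroom(unsaturated), distinguishable(0.60, 0.70, 200))
```

Under these assumptions, the two models near the ceiling differ by less than the noise, while the mid-range pair is separated by more than it, which is why a saturated benchmark stops being useful for ranking.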

1 / 5

A colleague says “MMLU is saturating.” What does this mean?

2 / 5

HumanEval primarily measures a model’s ability to:

3 / 5

GPQA questions are described as “Google-proof.” This means:

4 / 5

A researcher raises a concern that a model has been “taught to the test” on BIG-bench. What is the concern?

5 / 5

MT-Bench is used to evaluate models on: