AI Benchmark Vocabulary

MMLU, HumanEval, GPQA, MT-Bench, BIG-bench, HellaSwag — how benchmarks work and why saturation matters.

Key vocabulary

  • MMLU (Massive Multitask Language Understanding) — tests knowledge across 57 academic subjects.
  • HumanEval — measures code generation ability via functional correctness of Python solutions.
  • GPQA (Graduate-Level Google-Proof Q&A) — expert-level science questions designed to resist web search.
  • Benchmark saturation — when top models cluster near ceiling performance, making differentiation difficult.
  • Teaching to the test — concern that models are trained on or fine-tuned specifically for benchmark data.
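
To make the saturation entry concrete, here is a minimal sketch of one way to check whether top scores on a benchmark are still distinguishable. It is a heuristic under stated assumptions, not a standard definition: the function names, the example scores, and the question count are hypothetical. Each question is treated as an independent pass/fail trial, so accuracy has a binomial standard error, and the benchmark is flagged as saturated when the gap between the best and worst of the top models is smaller than the sampling noise on a difference between two models.

```python
import math


def standard_error(accuracy: float, n_questions: int) -> float:
    """Binomial standard error of an accuracy measured on n_questions items."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


def saturation_report(top_scores, n_questions, ceiling=1.0, z=1.96):
    """Compare the spread of top scores with sampling noise (heuristic).

    top_scores  : accuracies in [0, 1] for the leading models (hypothetical data)
    n_questions : number of benchmark questions (assumed independent)
    ceiling     : best achievable score, assumed to be 1.0 here
    z           : normal quantile for an approximate 95% interval
    """
    best = max(top_scores)
    spread = best - min(top_scores)
    # Noise margin for the difference between two independent models that
    # both score near `best`: z * sqrt(se^2 + se^2).
    margin = z * math.sqrt(2.0) * standard_error(best, n_questions)
    return {
        "headroom": ceiling - best,     # room left below the ceiling
        "spread": spread,               # gap between best and worst top model
        "noise_margin": margin,         # gap size that noise alone could produce
        "saturated": spread < margin,   # models are not reliably separable
    }


# Illustrative numbers only, not real leaderboard results.
clustered = saturation_report([0.90, 0.91, 0.915], n_questions=1000)
separated = saturation_report([0.55, 0.70, 0.85], n_questions=1000)
print(clustered["saturated"], separated["saturated"])
```

With scores clustered near the ceiling, the gap between models falls inside the noise margin, which is the practical sense in which a saturated benchmark makes differentiation difficult. A real analysis would also account for label errors in the benchmark, which lower the effective ceiling below 1.0.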
A colleague says “MMLU is saturating.” What does this mean?