Reading Benchmark Fine Print — Comprehension Exercises

📄 PASSAGE — Read carefully before answering
Evaluation Protocol — Methodology Notes (Section 4.2)

All evaluations in this report follow a standardised protocol to enable fair comparison across model versions. HumanEval is run in zero-shot mode: no examples are provided before the task prompt. The model receives only the function signature and docstring and must generate a correct implementation. This tests the model's ability to understand and solve problems without demonstration.

MMLU is run in 5-shot mode: five correctly answered examples from the same subject area precede each question. Researchers have shown that varying the shot count from 0 to 5 can shift accuracy by up to 8 percentage points on some subjects, so scores are comparable only when they were obtained with the same shot count.
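To make the difference between zero-shot and 5-shot concrete, here is a minimal sketch of how a k-shot multiple-choice prompt might be assembled. The names `format_item` and `build_prompt`, the `Answer:` layout, and the sample questions are illustrative assumptions, not the format used in the report.

```python
from typing import List, Tuple

LETTERS = "ABCD"

# An example is (question, choices, correct_letter). Hypothetical structure.
Example = Tuple[str, List[str], str]


def format_item(question: str, choices: List[str], answer: str = "") -> str:
    """Render one question; leave the answer blank for the item being tested."""
    lines = [question]
    lines += [f"{LETTERS[i]}. {choice}" for i, choice in enumerate(choices)]
    lines.append(f"Answer: {answer}".rstrip())
    return "\n".join(lines)


def build_prompt(examples: List[Example], question: str,
                 choices: List[str], k: int) -> str:
    """Prefix the test question with k solved examples (k=0 is zero-shot)."""
    if k < 0 or k > len(examples):
        raise ValueError("k must be between 0 and the number of examples")
    shots = [format_item(q, c, a) for q, c, a in examples[:k]]
    return "\n\n".join(shots + [format_item(question, choices)])


# Made-up examples for illustration only.
examples: List[Example] = [
    ("What is 2 + 2?", ["3", "4", "5", "6"], "B"),
    ("What is 3 * 3?", ["6", "7", "8", "9"], "D"),
]
zero_shot = build_prompt(examples, "What is 5 - 1?", ["4", "3", "2", "1"], k=0)
two_shot = build_prompt(examples, "What is 5 - 1?", ["4", "3", "2", "1"], k=2)
```

The test question is identical in both prompts; only the number of solved examples in front of it changes, which is why two scores obtained with different values of `k` do not measure the same thing.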

Known limitation: This evaluation set was publicly released in 2021. Several high-performing models in this comparison were trained on data collected after that date, and some training corpora are known to include web crawls containing the benchmark questions and answers. This is an instance of benchmark contamination, where test items leak into training data and inflate scores. It also contributes to benchmark saturation, where a benchmark loses discriminative power because many models achieve near-ceiling performance, partly through exposure.
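One common way to look for this kind of exposure is to check whether word n-grams from a benchmark item appear verbatim in a training document. The sketch below assumes a simple lowercase word tokenizer and an arbitrary default of n = 8; the function names and sample strings are hypothetical, and this is not the procedure used in the report.

```python
import re
from typing import Set, Tuple


def ngrams(text: str, n: int = 8) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(benchmark_item: str, corpus_doc: str, n: int = 8) -> float:
    """Fraction of the item's n-grams that also occur in the corpus document."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:  # item shorter than n tokens: nothing to compare
        return 0.0
    return len(item_grams & ngrams(corpus_doc, n)) / len(item_grams)


# Made-up strings for illustration only.
item = "Which planet in the solar system has the most known moons"
leaked_page = "Quiz dump: which planet in the solar system has the most known moons? Saturn."
clean_page = "Saturn is a gas giant with a prominent ring system."

leaked_score = overlap_fraction(item, leaked_page, n=5)
clean_score = overlap_fraction(item, clean_page, n=5)
```

A high overlap fraction suggests the item may have been seen during training; a low one does not prove the opposite, since paraphrased or translated copies would not match verbatim.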

Researchers planning to use these scores to compare models trained at different times, or with different data transparency, should account for this limitation explicitly.