Reading a Benchmark Report — Comprehension Exercises
Read the benchmark report excerpt below, then answer comprehension questions about its content, terminology, and what the numbers mean.
📄 PASSAGE — Read carefully before answering
Apex-7B Evaluation Report — Summary of Results
This report presents evaluation results for three instruction-tuned language models: Apex-7B, Meridian-13B, and Solaris-34B. All models were evaluated on two benchmarks: MMLU (Massive Multitask Language Understanding) and HumanEval, a code generation benchmark measuring functional correctness.
MMLU measures general knowledge across 57 academic subjects including mathematics, law, and medicine, scored as percentage accuracy. Apex-7B achieved 71.4%, Meridian-13B achieved 74.1%, and Solaris-34B achieved 78.9%.
On HumanEval, which tests code generation by asking models to complete Python functions so that all unit tests pass, the pass@1 scores were 52.3% for Apex-7B, 61.8% for Meridian-13B, and 58.4% for Solaris-34B.
All MMLU results use 5-shot evaluation: each question is preceded by five worked examples drawn from a fixed pool, giving models context before they answer. HumanEval uses 0-shot evaluation with temperature 0.
The 95% confidence interval for every score is ±1.2 percentage points. Differences between scores that are smaller than this margin should not be treated as meaningful. Full per-subject breakdowns are available in Appendix B.
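To make the margin rule concrete, the sketch below applies it to every pair of models. It is a minimal illustration: the scores are copied from the passage, while the helper name `compare` and the dictionary layout are hypothetical and not part of the report. It applies the passage's own rule of thumb (gap versus ±1.2 points) and is not a formal significance test.

```python
from itertools import combinations

# Margin and scores are taken from the passage above.
MARGIN = 1.2  # reported 95% confidence interval, in percentage points

mmlu = {"Apex-7B": 71.4, "Meridian-13B": 74.1, "Solaris-34B": 78.9}
humaneval = {"Apex-7B": 52.3, "Meridian-13B": 61.8, "Solaris-34B": 58.4}


def compare(scores, margin=MARGIN):
    """Hypothetical helper: list each pair of models with its score gap.

    Returns tuples of (model_a, model_b, gap, gap_reaches_margin).
    """
    rows = []
    for a, b in combinations(scores, 2):
        # Round to one decimal place to match the precision of the report.
        gap = round(abs(scores[a] - scores[b]), 1)
        rows.append((a, b, gap, gap >= margin))
    return rows


if __name__ == "__main__":
    for name, scores in (("MMLU", mmlu), ("HumanEval", humaneval)):
        for a, b, gap, reaches in compare(scores):
            verdict = "at or above margin" if reaches else "below margin"
            print(f"{name}: {a} vs {b}: {gap} points ({verdict})")
```

With the figures in the passage, every pairwise gap is larger than 1.2 points, so under the report's rule none of the differences can be dismissed as being within the margin. Note that the HumanEval ordering differs from the MMLU ordering: Meridian-13B is ahead of the larger Solaris-34B by 3.4 points.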
Question 1 of 4