Reading a Benchmark Report — Comprehension Exercises

📄 PASSAGE — Read carefully before answering

Apex-7B Evaluation Report — Summary of Results

This report presents evaluation results for three instruction-tuned language models: Apex-7B, Meridian-13B, and Solaris-34B. All models were evaluated on two benchmarks: MMLU (Massive Multitask Language Understanding) and HumanEval, a code generation benchmark measuring functional correctness.

MMLU measures general knowledge across 57 academic subjects including mathematics, law, and medicine, scored as percentage accuracy. Apex-7B achieved 71.4%, Meridian-13B achieved 74.1%, and Solaris-34B achieved 78.9%.

On HumanEval, which tests code generation by asking models to complete Python functions so that all unit tests pass, the pass@1 scores were 52.3% for Apex-7B, 61.8% for Meridian-13B, and 58.4% for Solaris-34B.

All MMLU results use 5-shot evaluation: each question is preceded by five worked examples drawn from a fixed pool, giving models context before they answer. HumanEval uses 0-shot evaluation with temperature 0.
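To make the difference between 5-shot and 0-shot concrete, here is a minimal sketch of how such a prompt might be assembled. The function name `build_few_shot_prompt`, the "Question:/Answer:" layout, and the choice to take the first k pool entries are all assumptions for illustration; the report does not describe its actual prompt format or how examples are drawn from the pool.

```python
def build_few_shot_prompt(pool, question, k=5):
    """Prefix `question` with k worked examples taken from a fixed pool.

    `pool` is a list of (question, answer) pairs. Taking the first k
    entries is an assumption; the report only says the pool is fixed.
    With k=0 the result is a 0-shot prompt containing only the question.
    """
    if k < 0 or k > len(pool):
        raise ValueError("k must be between 0 and the size of the pool")
    # Each worked example shows a question together with its answer.
    parts = [f"Question: {q}\nAnswer: {a}" for q, a in pool[:k]]
    # The target question is left unanswered for the model to complete.
    parts.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(parts)


# Placeholder pool, not real benchmark items.
pool = [(f"example question {i}", f"example answer {i}") for i in range(1, 6)]

five_shot = build_few_shot_prompt(pool, "target question", k=5)
zero_shot = build_few_shot_prompt(pool, "target question", k=0)
```

In this sketch the 5-shot prompt contains six "Question:" blocks (five answered, one open), while the 0-shot prompt contains only the open one.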

Confidence intervals (95%) for all scores are ±1.2 percentage points. Differences smaller than this margin should not be treated as meaningful. Full per-subject breakdowns are available in Appendix B.
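The margin rule above can be checked mechanically. The sketch below applies the passage's stated rule (a difference smaller than 1.2 percentage points is not treated as meaningful) to every pair of models. The scores are copied from the passage; the names `compare_pairs`, `MMLU`, `HUMANEVAL`, and `MARGIN` are hypothetical, and this is only an illustration of the rule as worded, not the report's own analysis.

```python
from itertools import combinations

# Scores (percent) as reported in the passage.
MMLU = {"Apex-7B": 71.4, "Meridian-13B": 74.1, "Solaris-34B": 78.9}
HUMANEVAL = {"Apex-7B": 52.3, "Meridian-13B": 61.8, "Solaris-34B": 58.4}
MARGIN = 1.2  # stated 95% margin, in percentage points


def compare_pairs(scores, margin=MARGIN):
    """Return (model_a, model_b, difference, meaningful) for each pair.

    The difference is score_b - score_a, rounded to one decimal place.
    Following the passage, a difference smaller than the margin is
    flagged as not meaningful.
    """
    rows = []
    for a, b in combinations(scores, 2):
        diff = round(scores[b] - scores[a], 1)
        rows.append((a, b, diff, abs(diff) >= margin))
    return rows


for name, scores in (("MMLU", MMLU), ("HumanEval", HUMANEVAL)):
    for a, b, diff, meaningful in compare_pairs(scores):
        print(f"{name}: {b} vs {a}: {diff:+.1f} points, meaningful={meaningful}")
```

Every pairwise gap in the passage exceeds the margin, including the HumanEval gap where the larger Solaris-34B scores below Meridian-13B.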
