📊 AI Model Evaluation Language
6 exercise sets — 30 exercises. Vocabulary for benchmarks, metrics, leaderboards, and evaluation communication.
- Advanced
AI Benchmark Vocabulary
MMLU, HumanEval, GPQA, MT-Bench, BIG-bench, HellaSwag — what benchmarks measure and why benchmark saturation matters.
- Advanced
Model Card Writing Language
Hugging Face model card structure: intended use, limitations, bias reporting, ethical considerations, and version changelogs.
- Advanced
Evaluation Metrics Vocabulary
BLEU, ROUGE, perplexity, win rate, pass@k, F1, hallucination rate, faithfulness — automatic and human evaluation vocabulary.
- Advanced
AI Leaderboard & Ranking Vocabulary
LMSYS Chatbot Arena, Elo ratings, HELM, Open LLM Leaderboard, benchmark contamination, and gaming concerns.
- Advanced
Eval-as-Code Vocabulary
Evaluation harnesses, golden datasets, LLM-as-judge, eval pipelines, regression testing, Braintrust, Langfuse, promptfoo.
- Advanced
Communicating Model Performance
How to present evaluation results to stakeholders: confidence intervals, practical significance, and result framing vocabulary.
Key evaluation vocabulary
Benchmarks & metrics
- "The model scores 89.1% on MMLU, placing it in the top tier."
- "Benchmark saturation occurs when top models approach ceiling performance, so the benchmark no longer separates them."
- "We measure pass@k for code generation tasks."
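To make the pass@k term above concrete, here is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), for a single problem. It assumes n samples were generated and c of them passed the unit tests. The function name `pass_at_k` and its parameters are illustrative and do not come from any particular evaluation harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: number of samples generated for the problem (assumed input)
    c: number of those samples that passed the tests
    k: budget of attempts being scored, with 1 <= k <= n
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no passing sample,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # 10 samples, 3 correct: pass@1 equals the raw success rate.
    print(pass_at_k(n=10, c=3, k=1))
    print(pass_at_k(n=10, c=3, k=5))
```

A benchmark-level score would typically be the mean of this value over all problems.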
Model cards & evaluation
- "The model card documents intended use and out-of-scope use."
- "Hallucination rate was measured on a held-out factual QA set."
- "We used LLM-as-judge for scalable open-ended evaluation."
Leaderboards & communication
- "Suspicion of benchmark contamination led to an independent audit."
- "The Elo rating reflects pairwise win rates across 100k comparisons."
- "This latency improvement is practically significant in production."
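As a rough illustration of how an Elo rating relates to pairwise comparisons, the sketch below applies the classic online Elo update to a single head-to-head result. The K-factor of 32, the 400-point scale, and the function names are assumptions for the example. Real leaderboards may fit ratings differently (for instance with a Bradley-Terry model over all comparisons at once), so treat this as vocabulary support rather than a description of any specific leaderboard.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Predicted probability that A beats B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(r_a: float, r_b: float, score_a: float, k: float = 32.0):
    """Return updated (r_a, r_b) after one comparison.

    score_a: 1.0 if A won, 0.0 if A lost, 0.5 for a tie.
    k: step size (assumed value; larger k reacts faster to new results).
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    # Zero-sum update: whatever A gains, B loses.
    return r_a + delta, r_b - delta


if __name__ == "__main__":
    model_a, model_b = 1000.0, 1000.0
    # Hypothetical outcomes of three pairwise votes, from A's point of view.
    for outcome in (1.0, 1.0, 0.0):
        model_a, model_b = update_elo(model_a, model_b, outcome)
    print(round(model_a, 1), round(model_b, 1))
```

Because the expected score depends only on the rating gap, a 400-point lead corresponds to a predicted win rate of about 91%.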