Why this matters: Evaluating AI models requires precise language. Whether you write model cards, interpret leaderboard rankings, design eval pipelines, or present results to stakeholders, you need to discuss benchmarks, metrics, and limitations clearly and accurately.

Key evaluation vocabulary

Benchmarks & metrics

  • "The model scores 89.1% on MMLU, placing it in the top tier."
  • "Benchmark saturation occurs when top models approach ceiling performance, so scores no longer separate them."
  • "We measure pass@k for code generation tasks."
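
The pass@k metric in the last example can be made concrete with a short sketch. It assumes the commonly used unbiased estimator: draw n samples per problem, count the c that pass the tests, and estimate the chance that at least one of k randomly chosen samples passes as 1 - C(n-c, k) / C(n, k). The function name pass_at_k and the sample numbers are illustrative, not taken from any specific eval harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: total samples generated for the problem (illustrative parameter name)
    c: number of those samples that passed the tests
    k: number of samples the metric is allowed to draw
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # Fewer than k failing samples: any draw of k must include a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # Hypothetical counts: 10 samples, 3 of which passed.
    for k in (1, 5, 10):
        print(f"pass@{k} = {pass_at_k(10, 3, k):.3f}")
```

In a full evaluation this per-problem estimate would typically be averaged over all problems in the benchmark. With k = 1 it reduces to the plain pass fraction c / n.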

Model cards & evaluation

  • "The model card documents intended use and out-of-scope use."
  • "Hallucination rate was measured on a held-out factual QA set."
  • "We used an LLM-as-judge setup for scalable open-ended evaluation."

Leaderboards & communication

  • "Suspicion of benchmark contamination led to an independent audit."
  • "The Elo rating reflects pairwise win rates across 100k comparisons."
  • "This latency improvement is practically significant in production."
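
To show how an Elo rating relates to pairwise outcomes, here is a minimal sketch of the classic Elo update. It assumes the standard logistic expected-score formula with a 400-point scale and a K-factor of 32; the function names and the starting ratings are illustrative. Real leaderboards may fit ratings differently (for example, with a batch fit over all comparisons), so treat this as an illustration of the idea rather than any particular leaderboard's method.

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Expected score of A against B (1 = win, 0.5 = tie, 0 = loss)."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))


def update(r_a: float, r_b: float, score_a: float, k: float = 32.0) -> tuple[float, float]:
    """Return the new (A, B) ratings after one comparison.

    score_a is the observed result for A: 1.0 win, 0.5 tie, 0.0 loss.
    k is the K-factor (an assumed value; it controls how fast ratings move).
    """
    delta = k * (score_a - expected_score(r_a, r_b))
    # The update is zero-sum: whatever A gains, B loses.
    return r_a + delta, r_b - delta


if __name__ == "__main__":
    # Hypothetical models, both starting at 1000; model A wins one comparison.
    model_a, model_b = update(1000.0, 1000.0, score_a=1.0)
    print(model_a, model_b)
    print(f"A's expected win rate vs. B is now {expected_score(model_a, model_b):.3f}")
```

A rating gap maps directly to an expected win rate: under these assumptions, a 400-point lead corresponds to an expected score of about 0.91.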