English for LLM Evaluation: Vocabulary Every AI Engineer Needs
Learn the English vocabulary for LLM evaluation: MMLU, HumanEval, BLEU, BERTScore, hallucination, ground truth, and LLM-as-a-judge for AI model assessment.
Evaluating large language models rigorously is one of the hardest and most important challenges in applied AI engineering. From academic benchmarks to production quality metrics, a precise shared vocabulary has emerged that allows researchers and engineers to compare models, track regressions, and communicate results credibly. This guide covers the English terms you need to read evaluation papers, present results to stakeholders, and design your own evaluation systems.
Key Vocabulary
Benchmark — a standardised evaluation suite consisting of tasks and reference answers against which model performance can be measured and compared across different models or training runs. Definition sentence: A benchmark provides a reproducible, apples-to-apples comparison, though no single benchmark captures all the capabilities that matter for a given application. Example: “The model scored 72.4% on the MMLU benchmark, placing it in the top tier of publicly available models at that parameter count.”
MMLU (Massive Multitask Language Understanding) — a benchmark covering 57 subjects from elementary mathematics to professional law and medicine, used to assess a model’s breadth of knowledge and reasoning ability. Definition sentence: MMLU is a multiple-choice benchmark, so its scores reflect a model’s ability to select correct answers from four options across a very wide range of domains. Example: “We used MMLU to compare our fine-tuned model against the base model, looking specifically at the professional medicine and clinical knowledge sub-categories.”
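Because MMLU is multiple-choice, its headline number is plain accuracy, and per-subject accuracy is the same calculation grouped by subject. With four options, random guessing gives about 25%, which is the floor to keep in mind when reading scores. The sketch below is a minimal illustration; the function name `accuracy_by_subject`, the record layout, and the sample predictions are invented for this example and are not part of any official MMLU tooling.

```python
from collections import defaultdict

def accuracy_by_subject(records):
    """Return {subject: fraction correct} for multiple-choice predictions.

    Each record is a (subject, predicted_letter, correct_letter) tuple.
    This layout is an assumption made for the example.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subject, predicted, answer in records:
        total[subject] += 1
        if predicted == answer:
            correct[subject] += 1
    return {subject: correct[subject] / total[subject] for subject in total}

# Hypothetical predictions, invented for illustration only.
records = [
    ("professional_medicine", "B", "B"),
    ("professional_medicine", "C", "A"),
    ("clinical_knowledge", "D", "D"),
    ("clinical_knowledge", "A", "A"),
]
scores = accuracy_by_subject(records)
print(scores)
```

Reporting the per-subject breakdown alongside the overall average makes it easier to say precisely which sub-categories improved after fine-tuning.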
HumanEval — a benchmark of 164 Python programming problems used to assess a model’s ability to write functionally correct code, with correctness determined by running test cases. Definition sentence: Unlike benchmarks scored by comparing output text with a reference answer, HumanEval verifies correctness by executing the generated code against unit tests, making it a more objective measure of coding ability. Example: “Our model achieved a pass@1 of 68% on HumanEval, meaning it produced a working solution on the first attempt for more than two-thirds of the problems.”
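The pass@k figures quoted for HumanEval are usually estimated rather than counted directly: n solutions are sampled per problem, c of them pass the tests, and pass@k is the probability that at least one of k randomly chosen samples passes, that is, 1 − C(n−c, k) / C(n, k). The sketch below implements that estimator for a single problem. The function name and the sample counts are illustrative assumptions, and the benchmark score is the mean of this value over all problems.

```python
from math import comb

def pass_at_k(n, c, k):
    """Estimate pass@k for one problem.

    n: number of solutions sampled
    c: number of those solutions that passed every test
    k: number of attempts the metric allows
    """
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # comb(n - c, k) is 0 when fewer than k samples failed,
    # so the estimate is 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical counts: 10 samples for one problem, 3 of which passed.
print(round(pass_at_k(10, 3, 1), 3))  # pass@1 is simply c / n here
print(round(pass_at_k(10, 3, 5), 3))  # more attempts allowed, higher score
```

For k = 1 the estimator reduces to c / n, which is why pass@1 can be read as "the share of single attempts that work".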
BLEU (Bilingual Evaluation Understudy) — a metric originally designed for machine translation that measures how closely a generated text matches one or more reference texts by counting overlapping n-grams. Definition sentence: BLEU scores range from 0 to 1, with higher scores indicating greater surface-level similarity to the reference, though high BLEU does not guarantee semantic correctness. Example: “BLEU scores for our summarisation model were in line with the baseline, but human evaluators consistently preferred our outputs — a reminder that BLEU has significant limitations.”
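To see why BLEU rewards surface overlap, it helps to look at its two ingredients: clipped n-gram precision for n = 1 to 4, combined by a geometric mean, and a brevity penalty for candidates shorter than the reference. The sketch below is a deliberately simplified single-reference, sentence-level version with whitespace tokenisation and no smoothing. The name `simple_bleu` is an assumption for this example, and production tools such as sacreBLEU handle tokenisation, smoothing, and corpus-level aggregation differently.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Simplified sentence-level BLEU against one reference (no smoothing)."""
    cand = candidate.split()
    ref = reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, one empty level zeroes the score
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: candidates shorter than the reference are scaled down.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))        # exact copy
print(simple_bleu("a feline rested upon the rug", reference))  # paraphrase
```

The paraphrase shares no two-word sequence with the reference, so this unsmoothed version scores it at zero even though a human reader would accept it. That is the limitation the example sentence above alludes to.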
BERTScore — an evaluation metric that computes semantic similarity between generated and reference texts by comparing their contextual BERT embeddings token by token, capturing meaning beyond surface word overlap. Definition sentence: BERTScore correlates more strongly with human judgements than BLEU on many tasks because contextual embeddings place paraphrases and synonyms close together, so a match does not require exact n-gram overlap. Example: “We switched from BLEU to BERTScore for our abstractive summarisation evaluation after finding that BLEU was penalising valid paraphrases.”
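The matching step behind BERTScore can be sketched without a neural network: each candidate token is paired with its most similar reference token by cosine similarity (precision), each reference token with its most similar candidate token (recall), and the two are combined into F1. In the sketch below the two-dimensional vectors are hand-made stand-ins for real contextual embeddings, and the function names are assumptions for this example. The published metric also describes optional IDF weighting, which is omitted here.

```python
import math

def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def greedy_match_scores(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) using greedy best-match similarity."""
    precision = sum(max(cosine(c, r) for r in ref_vecs)
                    for c in cand_vecs) / len(cand_vecs)
    recall = sum(max(cosine(r, c) for c in cand_vecs)
                 for r in ref_vecs) / len(ref_vecs)
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# Toy vectors, invented for illustration: a near-synonym pair points in
# almost the same direction, an unrelated token points elsewhere.
reference_vecs = [[1.0, 0.0], [0.0, 1.0]]
paraphrase_vecs = [[0.9, 0.1], [0.0, 1.0]]

precision, recall, f1 = greedy_match_scores(paraphrase_vecs, reference_vecs)
print(round(f1, 3))
```

Because similarity is measured in embedding space, the near-synonym still earns almost full credit, whereas an n-gram metric would count it as a miss.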
Hallucination — the phenomenon in which a language model generates plausible-sounding but factually incorrect or entirely fabricated information with apparent confidence. Definition sentence: Hallucination is a systematic failure mode of LLMs, not a random error, and it is especially dangerous in high-stakes domains such as medicine, law, and finance. Example: “The model hallucinated three case citations that do not exist in any legal database — the outputs were fluent and convincing but completely fabricated.”
Ground truth — the correct, authoritative answer against which a model’s output is compared during evaluation; typically derived from human annotation, verified databases, or held-out labelled test sets. Definition sentence: Without reliable ground truth, it is impossible to measure accuracy or detect regressions with any rigour. Example: “We built our ground truth dataset by having three domain experts independently annotate 500 questions and resolve disagreements through discussion.”
LLM-as-a-judge — a technique in which a powerful language model — often GPT-4 or Claude — evaluates and scores the outputs of another model, serving as a scalable proxy for human annotation. Definition sentence: LLM-as-a-judge enables evaluation at scale without the cost and latency of human reviewers, though it introduces its own biases and failure modes that must be audited. Example: “We use an LLM-as-a-judge pipeline that scores responses on a 1–5 scale for helpfulness, accuracy, and safety, then writes those scores back to our tracing system.”
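One common way to audit a judge model is to score the same responses with human annotators and measure how often the two agree. The sketch below computes raw agreement and Cohen's kappa, which corrects agreement for chance. The scores are hypothetical, the function names are assumptions for this example, and kappa as written treats the 1–5 scores as unordered categories, so a weighted variant may suit ordinal scales better.

```python
from collections import Counter

def agreement_rate(human, judge):
    """Fraction of items where the judge gave exactly the human score."""
    if len(human) != len(judge) or not human:
        raise ValueError("need two non-empty score lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)

def cohen_kappa(human, judge):
    """Chance-corrected agreement between two raters."""
    n = len(human)
    observed = agreement_rate(human, judge)
    human_counts, judge_counts = Counter(human), Counter(judge)
    # Agreement expected if both raters scored independently at random
    # with their own observed label frequencies.
    expected = sum(human_counts[label] * judge_counts[label]
                   for label in human_counts) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical 1-5 helpfulness scores for six responses.
human_scores = [5, 4, 4, 2, 1, 3]
judge_scores = [5, 4, 3, 2, 1, 3]

print(round(agreement_rate(human_scores, judge_scores), 3))
print(round(cohen_kappa(human_scores, judge_scores), 3))
```

A judge that agrees with humans only slightly more often than chance would show a kappa near zero, which is a signal to revise the judging prompt or rubric before trusting its scores.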
Useful Phrases
- “Our internal eval harness runs MMLU and a set of domain-specific benchmarks on every model checkpoint, flagging any drop in score greater than one standard deviation.”
- “The pass@1 metric on HumanEval tells you how often a single generated solution passes all the tests, with no second attempts, so it is the strictest of the pass@k measures.”
- “We report BERTScore F1 rather than BLEU because our task involves abstractive summarisation where paraphrasing is expected and desirable.”
- “Before we trust the LLM-as-a-judge scores, we validate the judge’s outputs against a held-out human-annotated set to quantify its own error rate.”
- “The hallucination rate dropped from 14% to 6% after we added retrieval augmentation — we measure it by cross-referencing claims against a verified knowledge base.”
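Two of the phrases above describe simple calculations that can be sketched in a few lines. The first flags a regression when a new score falls more than one standard deviation below earlier scores. The phrase does not say which standard deviation is meant, so this sketch assumes the spread of scores from previous checkpoints. The second computes a hallucination rate as the share of claims not found in a knowledge base. Exact string lookup is a crude stand-in for real claim verification, and all names and data here are hypothetical.

```python
from statistics import mean, stdev

def flags_regression(history, new_score):
    """True if new_score is more than one sample standard deviation
    below the mean of earlier scores (an assumed interpretation)."""
    if len(history) < 2:
        raise ValueError("need at least two earlier scores")
    return new_score < mean(history) - stdev(history)

def hallucination_rate(claims, knowledge_base):
    """Fraction of claims that are not present in the knowledge base."""
    if not claims:
        raise ValueError("need at least one claim")
    unsupported = [claim for claim in claims if claim not in knowledge_base]
    return len(unsupported) / len(claims)

# Hypothetical benchmark scores from four earlier checkpoints.
history = [71.8, 72.4, 72.1, 72.6]
print(flags_regression(history, 71.0))  # well below the usual range
print(flags_regression(history, 72.0))  # within normal variation

# Hypothetical extracted claims and verified facts.
knowledge_base = {"claim A", "claim B", "claim C"}
claims = ["claim A", "claim B", "claim C", "claim D"]
print(hallucination_rate(claims, knowledge_base))
```

In practice the hard part is extracting and matching claims rather than the division, so the method used for that step is worth stating whenever a hallucination rate is reported.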
Common Mistakes
Saying “the model is hallucinating” as if it is an active choice
Hallucination is not intentional behaviour — the model does not “decide” to fabricate. The correct framing is a passive or systemic one: “the model produced a hallucination”, “the output contains hallucinated content”, or “hallucination is a known failure mode.” Anthropomorphising the model’s errors can mislead stakeholders about the nature of the problem.
Conflating “benchmark” and “metric”
A benchmark is a full evaluation suite: a collection of tasks and reference answers. A metric is the formula used to score outputs, such as BLEU or accuracy. You evaluate a model on a benchmark, using a metric. Saying “we ran the BLEU benchmark” is incorrect because BLEU is a metric. The correct form is “we evaluated the model on our summarisation dataset using BLEU.”
Misusing “ground truth” in the plural
In technical English, ground truth is typically treated as an uncountable noun, similar to data or information. You collect ground truth, not ground truths. When you need to refer to individual items, say “ground-truth labels”, “ground-truth answers”, or “ground-truth annotations.” Saying “we have 500 ground truths” is understandable but non-standard; prefer “we have 500 ground-truth examples.”
Evaluation is the discipline that separates genuine progress from impressive-sounding claims, and the vocabulary in this guide is the shared language of that discipline. As LLM evaluation continues to mature — with new benchmarks, better judge models, and more nuanced metrics emerging regularly — keeping your English terminology precise will ensure your results are understood, trusted, and reproducible by collaborators anywhere in the world.