LLM Evaluation Vocabulary: Benchmarks, Metrics, and Model Cards
MMLU, HumanEval, perplexity, hallucination rate, LLM-as-judge, model card, data contamination — the vocabulary you need to read, discuss, and run LLM evaluations in English.
Evaluating large language models is a field with its own dense vocabulary — and the stakes are high. Whether you are choosing a model for production, reviewing a research paper, or building an evaluation harness for your AI feature, you need to understand the terminology precisely. Misreading a benchmark or confusing precision with perplexity can lead to costly decisions.
Benchmarks
Benchmark — A standardised dataset and task used to measure model performance in a reproducible, comparable way. Phrase: “The model’s benchmark scores look impressive, but check whether those tasks overlap with its training data.”
MMLU (Massive Multitask Language Understanding) — A benchmark testing knowledge across 57 academic subjects: maths, medicine, law, history, and more. A common measure of general knowledge. Phrase: “The model achieves 89% on MMLU — strong overall, but the breakdown by subject matters more.”
HumanEval — A benchmark of 164 hand-written Python programming problems, testing whether a model can generate functionally correct code, as checked by unit tests. Pass@k (the probability that at least one of k generated samples passes the test suite) is the standard metric. Phrase: “HumanEval pass@1 is 72%: a single generated solution passes the test suite on just under three-quarters of the problems.”
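Pass@k is usually estimated rather than measured directly: generate n samples per problem, count the c that pass, and compute the chance that a random draw of k samples contains at least one passing solution. The sketch below assumes that setup; the function name and argument names are illustrative, not taken from any particular harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed.

    Computes 1 - C(n - c, k) / C(n, k): one minus the probability that
    all k samples drawn (without replacement) come from the failures.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures exist, so any draw of k includes a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

The benchmark-level score is the mean of this value over all problems. With k = 1 it reduces to c / n, the plain fraction of samples that pass.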
GPQA (Graduate-Level Google-Proof Q&A) — A benchmark of very difficult science questions that require graduate-level reasoning and cannot be answered by simple web search. A test of deep reasoning capability.
Eval harness — A software framework for running standardised evaluations against multiple models. Examples: lm-evaluation-harness (EleutherAI), promptfoo. Phrase: “We run the eval harness nightly, and any model update that drops a benchmark score by more than 2 percentage points triggers a review.”
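A nightly regression gate like the one in the phrase above can be sketched in a few lines. This is a hypothetical helper, not part of any harness: it assumes scores are stored as percentages in plain dictionaries and treats the threshold as absolute percentage points.

```python
def find_regressions(baseline, candidate, threshold_points=2.0):
    """Return {benchmark: drop} for scores that fell by more than the threshold.

    baseline and candidate map benchmark names to scores in percent.
    Benchmarks missing from candidate are skipped, not flagged.
    """
    regressions = {}
    for name, old_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None:
            continue
        drop = old_score - new_score
        if drop > threshold_points:
            regressions[name] = drop
    return regressions
```

A non-empty result would be the trigger for a review; an empty dictionary means no benchmark dropped past the threshold.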
Data contamination — When training data includes examples from the benchmark test set, inflating scores. A major concern when interpreting benchmark results. Phrase: “The high MMLU score may be due to data contamination — the training corpus wasn’t fully deduplicated against the benchmark.”
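One rough way to look for contamination is to check whether benchmark items share long word sequences with the training corpus. The sketch below is a simplified illustration under stated assumptions: whitespace tokenisation, exact n-gram matches, and a corpus small enough to hold in memory. The function names and the default n are hypothetical choices, and real checks use more careful normalisation.

```python
def ngrams(text, n):
    """Return the set of lowercase word n-grams in text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_fraction(benchmark_items, training_docs, n=8):
    """Fraction of benchmark items sharing at least one n-gram with training data."""
    if not benchmark_items:
        return 0.0
    train_ngrams = set()
    for doc in training_docs:
        train_ngrams |= ngrams(doc, n)
    flagged = sum(1 for item in benchmark_items if ngrams(item, n) & train_ngrams)
    return flagged / len(benchmark_items)
```

A high fraction does not prove the score is inflated, but it is a reason to treat that benchmark result with caution.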
Quality Metrics
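Perplexity — A measure of how well a language model predicts a text sample: the exponential of the average negative log-likelihood per token. Lower perplexity means the model assigns higher probability to the text, in other words it is less “surprised” by it. Useful for comparing models on the same text, provided they share a tokenizer, since per-token values are not comparable across tokenizers. Phrase: “Perplexity dropped from 18 to 12 after fine-tuning on domain data, so the model now predicts target-domain text much better.”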
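Perplexity follows directly from per-token log-probabilities. The sketch below assumes you already have natural-log probabilities for each token of the sample (how you obtain them depends on your model API); the function name is illustrative.

```python
import math


def perplexity(token_logprobs):
    """Perplexity = exp of the average negative log-probability per token."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_negative_logprob = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_negative_logprob)
```

A model that assigns every token a probability of 1/4 has a perplexity of 4: it is as uncertain as if it were choosing uniformly among four options at each step.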
Hallucination rate — The proportion of model outputs that contain factually incorrect or fabricated information presented as fact. Phrase: “The hallucination rate on our RAG benchmark is 4% — we need it below 1% for the medical use case.”
Groundedness — A measure of whether a model’s output is supported by the provided context (e.g. retrieved documents in a RAG system). A grounded answer doesn’t introduce facts beyond what the context contains.
Faithfulness — In RAG evaluation: how accurately the generated answer reflects the retrieved context. High faithfulness = the answer sticks to the sources.
Relevance — Whether the retrieved context and the generated answer are relevant to the user’s question.
Precision and recall (in an information retrieval context) — Precision: of the retrieved documents, what fraction are relevant. Recall: of all relevant documents, what fraction were retrieved. The two typically trade off in retrieval systems: retrieving more documents tends to raise recall and lower precision.
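Precision and recall for a single query reduce to set arithmetic over document IDs. A minimal sketch, assuming each document has a unique ID and that you have a labelled set of relevant documents; the function name is illustrative.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given document IDs."""
    retrieved_set = set(retrieved)
    relevant_set = set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Convention used here: an empty denominator yields 0.0.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall
```

For example, retrieving four documents of which two are relevant, out of three relevant documents overall, gives a precision of 2/4 and a recall of 2/3.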
Evaluation Methods
Zero-shot vs few-shot eval — Zero-shot: the model answers without any examples in the prompt. Few-shot: the prompt includes a small number of examples before the question. Few-shot performance is often significantly better. Phrase: “We evaluate both zero-shot and three-shot — the gap tells us how much the model benefits from examples.”
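The difference between the two settings is only in how the prompt is built. The sketch below uses a simple "Q:/A:" layout, which is an assumption for illustration; real benchmarks each define their own prompt format.

```python
def build_prompt(question, examples=()):
    """Build a zero-shot prompt (no examples) or a few-shot prompt.

    examples is a sequence of (question, answer) pairs shown before
    the real question.
    """
    parts = [f"Q: {q}\nA: {a}" for q, a in examples]
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)
```

Calling it with no examples gives the zero-shot prompt; passing three solved pairs gives a three-shot prompt for the same question, so both settings can be scored on identical inputs.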
LLM-as-judge — Using a strong LLM (e.g. GPT-4) to evaluate the outputs of another model, scoring them for quality, relevance, or accuracy. Scalable, but the judge has its own biases, such as favouring longer answers, whichever answer appears first, or outputs from its own model family. Phrase: “We use LLM-as-judge for open-ended generation quality because human annotation is too slow and expensive to run at scale.”
Human preference evaluation — Asking human raters to compare two model outputs and select the preferred one (pairwise comparison). The resulting preference data is used to train reward models for RLHF (reinforcement learning from human feedback) and to validate LLM-as-judge setups.
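One common way to reduce a judge's sensitivity to answer order is to ask twice with the answers swapped and accept a verdict only when both runs agree. The sketch below assumes a hypothetical `judge` callable that takes a prompt string and returns "1" or "2"; in practice it would wrap a call to whichever model API you use. The prompt wording is also an illustration.

```python
from typing import Callable

TEMPLATE = (
    "Question: {q}\n\n"
    "Answer 1:\n{first}\n\n"
    "Answer 2:\n{second}\n\n"
    "Which answer is better? Reply with 1 or 2."
)


def judge_pair(judge: Callable[[str], str], question: str,
               answer_a: str, answer_b: str) -> str:
    """Return 'A', 'B', or 'tie' after judging in both presentation orders."""
    # First pass shows A first; second pass shows B first.
    first_pass = judge(TEMPLATE.format(q=question, first=answer_a, second=answer_b)).strip()
    second_pass = judge(TEMPLATE.format(q=question, first=answer_b, second=answer_a)).strip()
    if first_pass == "1" and second_pass == "2":
        return "A"
    if first_pass == "2" and second_pass == "1":
        return "B"
    # The runs disagree (or the reply was unparseable): no reliable verdict.
    return "tie"
```

A judge that always picks the first answer produces only ties under this scheme, which makes the bias visible instead of letting it leak into the scores.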
A/B eval — Running two model versions on the same inputs and comparing quality metrics. Similar to A/B testing in product development.
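Because both versions see the same inputs, the comparison can be made per example and not only on averages. A minimal sketch, assuming each model already has one numeric quality score per input, in the same order; the function and key names are hypothetical.

```python
from statistics import mean


def ab_compare(scores_a, scores_b):
    """Compare per-example scores of model A and model B on the same inputs."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    diffs = [b - a for a, b in zip(scores_a, scores_b)]
    return {
        "mean_a": mean(scores_a),
        "mean_b": mean(scores_b),
        "mean_diff": mean(diffs),  # positive means B scored higher on average
        "b_wins": sum(d > 0 for d in diffs),
        "a_wins": sum(d < 0 for d in diffs),
        "ties": sum(d == 0 for d in diffs),
    }
```

The win and loss counts show whether an average gain comes from many small improvements or from a few large ones, which the means alone would hide.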
Model card — A documentation artifact published alongside a model describing its intended use cases, limitations, evaluation results, training data, and known biases. The format was proposed in 2019 by researchers at Google (Mitchell et al., “Model Cards for Model Reporting”) and is now widely used, including on Hugging Face. Phrase: “Read the model card before deploying: the limitations section often discloses failure modes that benchmarks don’t capture.”
Practice: Pick a model from Hugging Face and read its model card and benchmark results. Write a 150-word summary of the model’s strengths and limitations using the vocabulary from this post — as if you were recommending (or not recommending) it to your team.