LLM Evaluation Vocabulary: Benchmarks, Metrics, and Model Cards

Evaluating large language models is a field with its own dense vocabulary — and the stakes are high. Whether you are choosing a model for production, reviewing a research paper, or building an evaluation harness for your AI feature, you need to understand the terminology precisely. Misreading a benchmark or confusing precision with perplexity can lead to costly decisions.

Benchmarks

Benchmark — A standardised dataset and task used to measure model performance in a reproducible, comparable way. Phrase: “The model’s benchmark scores look impressive, but check whether those tasks overlap with its training data.”

MMLU (Massive Multitask Language Understanding) — A benchmark testing knowledge across 57 academic subjects: maths, medicine, law, history, and more. A common measure of general knowledge. Phrase: “The model achieves 89% on MMLU — strong overall, but the breakdown by subject matters more.”

HumanEval — A benchmark of Python coding problems, testing whether a model can generate functionally correct code. Pass@k (the probability that at least one of k samples passes the test suite) is the standard metric. Phrase: “HumanEval pass@1 is 72%: a single sample passes the test suite on roughly seven problems in ten.”
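
To make the pass@k definition concrete, here is a minimal sketch of the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass. The function names and the (n, c) input format are illustrative assumptions, not part of any particular harness.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem.

    n: samples generated, c: samples that passed the tests, k: sample budget.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) / comb(n, k) is the chance that a random set of k
    # samples contains no passing sample. math.comb returns 0 when
    # k > n - c, so the estimate becomes 1.0 in that case.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; results holds one (n, c) pair each."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

With k = 1 the estimate reduces to c / n, the plain fraction of samples that pass.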

GPQA (Graduate-Level Google-Proof Q&A) — A benchmark of very difficult science questions that require graduate-level reasoning and cannot be answered by simple web search. A test of deep reasoning capability.

Eval harness — A software framework for running standardised evaluations against multiple models. Examples: lm-evaluation-harness (EleutherAI), promptfoo. Phrase: “We run the eval harness nightly, and any model update that drops a benchmark score by more than 2 percentage points triggers a review.”
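
The nightly check in the phrase above could be sketched as follows. This is a hypothetical helper, not part of lm-evaluation-harness or promptfoo. It assumes scores are percentages keyed by benchmark name and that the threshold is measured in percentage points.

```python
def find_regressions(
    baseline: dict[str, float],
    candidate: dict[str, float],
    max_drop_points: float = 2.0,
) -> dict[str, float]:
    """Return benchmarks whose score fell by more than max_drop_points."""
    flagged = {}
    for name, old_score in baseline.items():
        new_score = candidate.get(name)
        if new_score is None:
            continue  # benchmark was not run for the candidate model
        drop = old_score - new_score
        if drop > max_drop_points:
            flagged[name] = drop
    return flagged


baseline = {"mmlu": 89.0, "humaneval": 72.0}   # made-up scores
candidate = {"mmlu": 88.5, "humaneval": 68.0}  # made-up scores
regressions = find_regressions(baseline, candidate)
```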

Data contamination — When training data includes examples from the benchmark test set, inflating scores. A major concern when interpreting benchmark results. Phrase: “The high MMLU score may be due to data contamination — the training corpus wasn’t fully deduplicated against the benchmark.”
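
One simple way to screen for contamination is to look for long word sequences that a benchmark item shares with the training corpus. The sketch below assumes whitespace tokenisation and an 8-word window. Both are illustrative choices, and real pipelines typically normalise text more carefully and work at a much larger scale.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Lower-cased word n-grams of a text (empty if the text is too short)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contaminated_items(
    benchmark_items: list[str], training_docs: list[str], n: int = 8
) -> list[int]:
    """Indices of benchmark items sharing at least one n-gram with training data."""
    train_grams: set[tuple[str, ...]] = set()
    for doc in training_docs:
        train_grams |= ngrams(doc, n)
    return [
        i for i, item in enumerate(benchmark_items)
        if ngrams(item, n) & train_grams
    ]
```

A flagged item is a candidate for manual review rather than proof of contamination, since common phrases can overlap by chance when the window is short.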

Quality Metrics

Perplexity — A measure of how well a language model predicts a text sample: the exponential of the average negative log-likelihood per token. Lower perplexity means the model assigns higher probability to the text and is more “confident.” Useful for comparing models on the same domain, provided they are scored on the same text with the same tokenizer. Phrase: “Perplexity dropped from 18 to 12 after fine-tuning on domain data, so the model now predicts target-domain text much better.”
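
As a worked illustration, perplexity can be computed from per-token log-probabilities as the exponential of the average negative log-likelihood. The sketch assumes natural-log probabilities are already available for each token. How you obtain them depends on your model API and is not shown here.

```python
import math


def perplexity(token_logprobs: list[float]) -> float:
    """Perplexity from natural-log probabilities of each token in a text."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)
```

A model that gives every token probability 0.25 has perplexity 4: it is as uncertain as a uniform choice among four options.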

Hallucination rate — The proportion of model outputs that contain factually incorrect or fabricated information presented as fact. Phrase: “The hallucination rate on our RAG benchmark is 4% — we need it below 1% for the medical use case.”

Groundedness — A measure of whether a model’s output is supported by the provided context (e.g. retrieved documents in a RAG system). A grounded answer doesn’t introduce facts beyond what the context contains.

Faithfulness — In RAG evaluation: how accurately the generated answer reflects the retrieved context. High faithfulness = the answer sticks to the sources.

Relevance — Whether the retrieved context and the generated answer are relevant to the user’s question.

Precision and recall (in an information-retrieval context) — Precision: of the retrieved documents, what fraction are relevant. Recall: of all relevant documents, what fraction were retrieved. These trade off against each other in retrieval systems: retrieving more documents tends to raise recall but lower precision.
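
The two definitions translate directly into code. This minimal sketch assumes documents are identified by string IDs and that a ground-truth set of relevant IDs is known. The zero-division conventions are an illustrative choice.

```python
def precision_recall(
    retrieved: list[str], relevant: list[str]
) -> tuple[float, float]:
    """Precision and recall of a retrieved set against the relevant set."""
    retrieved_set, relevant_set = set(retrieved), set(relevant)
    hits = len(retrieved_set & relevant_set)
    # Convention chosen here: report 0.0 when a denominator would be zero.
    precision = hits / len(retrieved_set) if retrieved_set else 0.0
    recall = hits / len(relevant_set) if relevant_set else 0.0
    return precision, recall


# Four documents retrieved, two of the three relevant ones among them.
precision, recall = precision_recall(["d1", "d2", "d3", "d4"], ["d1", "d3", "d7"])
```

Here precision is 2/4 and recall is 2/3.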

Evaluation Methods

Zero-shot vs few-shot eval — Zero-shot: the model answers without any examples in the prompt. Few-shot: the prompt includes a small number of examples before the question. Few-shot performance is often significantly better. Phrase: “We evaluate both zero-shot and three-shot — the gap tells us how much the model benefits from examples.”

LLM-as-judge — Using a strong LLM (e.g. GPT-4) to evaluate the outputs of another model, scoring them for quality, relevance, or accuracy. Scalable but introduces its own biases. Phrase: “We use LLM-as-judge for open-ended generation quality — human annotation is too slow and expensive to run at scale.”

Human preference evaluation — Asking human raters to compare two model outputs and select the preferred one (pairwise comparison). Used to collect the preference data that RLHF reward models are trained on, and to validate LLM-as-judge setups.
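
Pairwise votes are usually summarised as a win rate, and the same labels can be used to check how often an LLM judge agrees with human raters. The sketch below assumes each comparison is labelled "A", "B", or "tie" and that ties are left out of the win rate. Both conventions are illustrative.

```python
from collections import Counter


def win_rate(votes: list[str]) -> float:
    """Share of decided comparisons won by model A (ties are ignored)."""
    counts = Counter(votes)
    decided = counts["A"] + counts["B"]
    if decided == 0:
        raise ValueError("no decided comparisons")
    return counts["A"] / decided


def judge_agreement(human: list[str], judge: list[str]) -> float:
    """Fraction of comparisons where the LLM judge matches the human label."""
    if not human or len(human) != len(judge):
        raise ValueError("need two non-empty label lists of equal length")
    return sum(h == j for h, j in zip(human, judge)) / len(human)
```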

A/B eval — Running two model versions on the same inputs and comparing quality metrics. Similar to A/B testing in product development.

Model card — A documentation artifact published alongside a model describing its intended use cases, limitations, evaluation results, training data, and known biases. The format was proposed by researchers at Google in the 2019 paper “Model Cards for Model Reporting.” Phrase: “Read the model card before deploying: the limitations section often discloses failure modes that benchmarks don’t capture.”

Practice: Pick a model from Hugging Face and read its model card and benchmark results. Write a 150-word summary of the model’s strengths and limitations using the vocabulary from this post — as if you were recommending (or not recommending) it to your team.
