AI Model Evaluation
Vocabulary for evaluating, benchmarking, and communicating the quality and safety of AI and language models.
- Benchmark (AI) /ˈbentʃmɑːk/
A standardised test set used to measure and compare model capabilities on a specific task, such as reasoning, coding, mathematics, or language understanding. Benchmarks enable like-for-like comparison across models, but can mislead if models overfit to them or if the benchmark does not represent real-world use.
"We evaluated three models on the same benchmarks before selecting a provider: MMLU for general knowledge, HumanEval for code generation, and our own golden dataset for domain-specific accuracy. Public benchmark scores alone were insufficient; our internal evaluation revealed significant quality differences on domain terms that public benchmarks didn't capture."
- MMLU /em em el juː/
Massive Multitask Language Understanding — a benchmark testing a model's knowledge across 57 academic and professional subjects including law, medicine, mathematics, and computer science. A high MMLU score indicates broad factual knowledge but does not guarantee good reasoning or safe outputs.
"The model scored 89% on MMLU, which is in the top tier for general knowledge breadth. However, when we ran it on medical domain prompts for our use case, hallucination rate on specific drug interactions was unacceptable — MMLU breadth did not predict narrow-domain accuracy."
- HumanEval /ˈhjuːmən ɪˈvæl/
A benchmark of 164 hand-written Python coding problems used to evaluate code generation ability. Published by OpenAI in 2021, it measures whether a model can write code that passes the unit tests supplied with each problem. Widely cited for comparing coding capability, but its problems and solutions have circulated publicly since release, so scores for models trained on later web data may be inflated by contamination.
"The model passed 78% of HumanEval problems on first attempt. For our internal coding assistant evaluation we supplemented this with 200 company-specific problems — private evaluation revealed the model struggled with our codebase conventions, dropping to 54% on internal tests despite the strong public score."
- Model Card /ˈmɒdəl kɑːd/
A structured document accompanying an AI model that describes its intended uses, limitations, evaluation results, training data, ethical considerations, and known failure modes. Proposed by researchers at Google in the 2019 paper "Model Cards for Model Reporting" as a standard for model transparency. Analogous to a product data sheet for a machine learning model.
"Before deploying the third-party model, our AI governance team reviewed the model card — it documented that the model performed poorly on non-English inputs and had not been evaluated for medical advice contexts. Both were relevant to our use case, so we required the vendor to provide updated evaluation data before approval."
- Hallucination Rate /həˌluːsɪˈneɪʃən reɪt/
The frequency with which a language model generates plausible-sounding but factually incorrect or fabricated information — presenting confident falsehoods. Measured as a percentage of model outputs that contain factually unsupported claims. A critical metric for high-stakes applications like legal, medical, or financial AI tools.
"Our evaluation found a 12% hallucination rate on questions about specific legislation — the model invented plausible-sounding case citations that did not exist. For our legal research tool, we set a maximum acceptable hallucination rate of 2% with source grounding, which required adding a retrieval layer with verified sources."
- Faithfulness /ˈfeɪθfəlnəs/
In RAG (Retrieval-Augmented Generation) systems, faithfulness measures whether the model's generated response is factually grounded in the retrieved context — does the answer contain only claims supported by the source documents? Distinct from answer relevance (is the question answered?) and context relevance (was the right content retrieved?).
"Our RAG system had 95% relevance but only 71% faithfulness — the model was answering questions correctly but occasionally adding information from its training data that was not in the retrieved context. We added a faithfulness evaluation step using an LLM judge that flagged outputs with claims not traceable to retrieved passages."
- BLEU Score /bluː skɔː/
Bilingual Evaluation Understudy: an automated metric for evaluating machine translation and text generation quality by measuring how many of the generated text's n-grams also appear in one or more reference translations, with a penalty for output that is too short. Ranges from 0 to 1 (often reported as 0–100). A useful proxy at corpus level, but known to correlate poorly with human judgement at the sentence level.
"Our translation model achieves BLEU 38 on the WMT benchmark, which is competitive for general translation. However, for technical documentation translation, BLEU scores were misleading — correct use of domain terminology matters more than n-gram overlap with a generic reference, so we supplemented with human expert evaluation on 500 technical samples."
- ROUGE /ruːʒ/
Recall-Oriented Understudy for Gisting Evaluation — a family of metrics for evaluating text summarisation by measuring n-gram overlap between a generated summary and reference summaries. ROUGE-1 (unigram), ROUGE-2 (bigram), and ROUGE-L (longest common subsequence) are the most commonly reported variants.
"The summarisation model scored ROUGE-L 0.42 on the CNN/DailyMail benchmark. When we evaluated on internal customer support tickets, ROUGE scores were lower but human evaluators rated the summaries higher — ROUGE penalised paraphrasing, even when the semantic meaning was accurately captured."
- Perplexity /pəˈpleksɪti/
A measure of how well a language model predicts a sample of text, computed as the exponential of the average negative log-likelihood per token. Lower perplexity means the model is less "surprised" by the text, indicating better language modelling ability. Used primarily to evaluate base model quality and compare models on the same corpus with the same tokenisation. Not a direct measure of accuracy or helpfulness.
"The fine-tuned model had perplexity of 8.3 on our domain corpus, compared to 14.7 for the base model — confirming that fine-tuning improved language fluency for our specific domain. However, we found perplexity improvements did not directly translate to task accuracy improvements, requiring separate task-specific evaluation."
- LLM-as-Judge /el el em əz dʒʌdʒ/
A methodology that uses a large language model to evaluate the outputs of another model, scoring quality, accuracy, relevance, or safety rather than relying on human raters for every evaluation. Enables scalable, automated evaluation at lower cost than human review, but can introduce bias, for example when the judge favours outputs that resemble its own style or come from a closely related model.
"We use GPT-4 as judge to evaluate our model's customer support responses at scale, rating each response on accuracy, tone, and completeness. We validated the judge by comparing its ratings to 500 human-rated examples. Agreement was 87%, which is good enough for automated quality monitoring, but we still run weekly human audits on edge cases."
- Golden Dataset /ˈɡəʊldən ˈdeɪtəˌset/
A curated, high-quality evaluation dataset with verified correct answers, used as the authoritative test set for measuring model quality on a specific task. Building a good golden dataset requires domain expert annotation and careful quality control — the quality of evaluation is only as good as the quality of the golden set.
"We built a golden dataset of 2,000 annotated customer queries with verified correct answers, reviewed by domain experts. It is the single source of truth for our model evaluation pipeline — every model candidate must score above 85% on the golden dataset before consideration for production deployment."
- Benchmark Contamination /ˈbentʃmɑːk kənˌtæmɪˈneɪʃən/
The phenomenon where a model's training data includes the questions or answers from evaluation benchmarks, causing inflated benchmark scores that do not reflect real generalisation ability. A significant problem for models trained after popular benchmarks were published.
"We suspected benchmark contamination when the model scored 92% on MMLU but only 61% on our held-out private evaluation set. The two sets differ in content, so the 31-point gap is not proof, but it is large enough to suggest that benchmark questions appeared in training data. We now prioritise private evaluation on proprietary datasets over public benchmark comparison."