English for ML Model Evaluation Discussions
Learn the vocabulary of machine learning model evaluation: precision/recall, AUC-ROC, BLEU/ROUGE, LLM-as-judge, RAGAS, hallucination rate, red-teaming, and benchmark saturation.
Evaluating machine learning models — especially large language models — requires a specialised vocabulary. When your team discusses whether a model is ready for production, the conversation involves metrics, trade-offs, and evaluation frameworks that have specific meanings. Misusing these terms can lead to confusion or poor decisions. This post covers the essential vocabulary for ML evaluation discussions, with particular attention to the LLM evaluation space.
Key Vocabulary
Precision and Recall — Two foundational classification metrics that are often in tension. Precision is the proportion of positive predictions that are actually correct. Recall is the proportion of actual positives that the model correctly identifies. Example: “We tuned the model to favour higher recall at the expense of precision because missing a fraud case is worse than a false alarm.”
F1 score — The harmonic mean of precision and recall. Useful when you need a single number that balances both. Example: “The F1 score improved from 0.78 to 0.85 after fine-tuning on the domain-specific dataset.”
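To make the three definitions above concrete, here is a minimal sketch that computes precision, recall, and F1 from binary labels. The function name `precision_recall_f1` and the sample labels are made up for illustration; in practice a team would more likely rely on an established metrics library.

```python
def precision_recall_f1(y_true, y_pred):
    """Return (precision, recall, f1) for binary labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed positives

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 is the harmonic mean of precision and recall.
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Illustrative labels only: 4 actual positives, 3 positive predictions.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

With these labels the model makes three positive predictions, two of them correct, and finds two of the four actual positives, so precision is higher than recall and F1 falls between the two.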
AUC-ROC — Area Under the Receiver Operating Characteristic Curve. Measures a classifier’s ability to discriminate between classes across all possible thresholds. A score of 1.0 is perfect; 0.5 is random. Example: “The AUC-ROC of 0.92 suggests the model is a strong discriminator even before we tune the decision threshold.”
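One way to build intuition for AUC-ROC is its ranking interpretation: it equals the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as half. The sketch below computes it that way by brute force. The function name `auc_roc` and the scores are illustrative assumptions, and the pairwise loop is only practical for small datasets.

```python
def auc_roc(y_true, scores):
    """AUC-ROC via pairwise comparison of positive and negative scores."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC-ROC needs at least one positive and one negative example")

    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Illustrative scores only.
y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2]
auc = auc_roc(y_true, scores)
print(f"AUC-ROC = {auc:.3f}")
```

Note that no decision threshold appears anywhere in the calculation, which is why the metric can be reported before the threshold is tuned.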
BLEU / ROUGE — Automated metrics for evaluating text generation against reference texts. BLEU (Bilingual Evaluation Understudy) measures n-gram overlap between generated and reference text and is commonly used for translation. ROUGE (Recall-Oriented Understudy for Gisting Evaluation) measures how much of the reference text is recovered in the generated text and is commonly used for summarisation. Example: “The BLEU score improved by 4 points, but human evaluators still preferred the previous model’s outputs.”
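The sketch below shows only the core idea behind these metrics: counting words shared between a generated text and a reference. It is a deliberate simplification, not either metric in full. Real BLEU combines several n-gram orders and applies a brevity penalty, and ROUGE has several variants. The function name `unigram_overlap` and the sentences are made up for illustration.

```python
from collections import Counter


def unigram_overlap(candidate, reference):
    """Return (precision, recall) of word overlap, with clipped counts.

    Precision is the BLEU-style view (how much of the candidate is in the reference);
    recall is the ROUGE-1-style view (how much of the reference is in the candidate).
    """
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection keeps the minimum count per word (clipping).
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    return precision, recall


reference = "the cat sat on the mat"
candidate = "the cat sat"
precision, recall = unigram_overlap(candidate, reference)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Here every word in the short candidate appears in the reference, so the precision-style score is perfect, while the recall-style score shows that half of the reference is missing.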
LLM-as-judge — A method where a large language model is used to evaluate the outputs of another model (or itself). Example: “We use GPT-4 as a judge to score our model’s responses for coherence and factual accuracy at scale.”
RAGAS framework — A framework for evaluating Retrieval-Augmented Generation systems. It measures metrics such as faithfulness, answer relevancy, and context precision. Example: “Our RAGAS faithfulness score dropped after we switched the retrieval model, which suggests the generated answers contain details that are not in the retrieved context.”
Hallucination rate — The proportion of model outputs that contain factually incorrect or fabricated information. Example: “The hallucination rate on medical queries is 12%, which is too high for a clinical deployment.”
Grounding accuracy — The degree to which a model’s output is supported by the source documents it retrieved. Example: “We measure grounding accuracy by checking whether each claim in the answer can be traced to the retrieved context.”
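As a rough sketch of how claim-level checks can be turned into numbers, the code below assumes each answer has already been split into claims and that a human reviewer or a judge model has marked each claim as supported (`True`) or unsupported (`False`) by the retrieved context. The function names, the labels, and the convention of counting an answer as hallucinated when any claim is unsupported are all assumptions for illustration; teams define these metrics in different ways.

```python
def grounding_accuracy(claim_labels):
    """Fraction of all claims marked as supported by the retrieved context."""
    flat = [supported for answer in claim_labels for supported in answer]
    return sum(flat) / len(flat) if flat else 0.0


def hallucination_rate(claim_labels):
    """Fraction of answers containing at least one unsupported claim (assumed convention)."""
    if not claim_labels:
        return 0.0
    flagged = sum(1 for answer in claim_labels if not all(answer))
    return flagged / len(claim_labels)


# One inner list per answer, one boolean per claim. Illustrative labels only.
labels = [
    [True, True, True],
    [True, False],
    [True, True],
    [False, True, True],
]
grounding = grounding_accuracy(labels)
rate = hallucination_rate(labels)
print(f"grounding accuracy={grounding:.2f} hallucination rate={rate:.2f}")
```

The two numbers answer different questions: eight of the ten claims are supported, yet half of the answers contain at least one unsupported claim.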
Safety eval — Evaluation focused on whether a model produces harmful, biased, or policy-violating outputs. Example: “We run safety evals on every model checkpoint before promoting it to production.”
Red-teaming — Adversarial testing where evaluators deliberately try to cause a model to produce harmful or undesired outputs. Example: “The red-team discovered that the model could be jailbroken with a specific role-play prompt.”
Benchmark saturation — The point at which models score so highly on a benchmark that it no longer meaningfully differentiates between them. Example: “GPT-4-level performance has caused benchmark saturation on MMLU, so we need harder evaluation sets.”
How to Use This in Practice
In evaluation discussions, you will often need to discuss trade-offs between metrics. The precision-recall trade-off is the most common: “We can increase recall by lowering the confidence threshold, but that will hurt precision and increase false positives.”
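The trade-off described above can be seen by sweeping the confidence threshold over a small set of scored examples. This is a minimal sketch with made-up scores and a hypothetical helper, `precision_recall_at`. It is not tied to any particular model or library.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, preds) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, preds) if t == 1 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Illustrative data only: four positives followed by four negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
scores = [0.95, 0.80, 0.60, 0.35, 0.70, 0.40, 0.20, 0.10]

results = {}
for threshold in (0.75, 0.50, 0.30):
    results[threshold] = precision_recall_at(y_true, scores, threshold)
    p, r = results[threshold]
    print(f"threshold={threshold:.2f} precision={p:.2f} recall={r:.2f}")
```

On this toy data, each step down in the threshold catches more of the actual positives but lets in more false positives, so recall rises while precision falls.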
When discussing LLM evaluations specifically, distinguish between automated metrics (BLEU, ROUGE, RAGAS scores) and human evaluation. Automated metrics are scalable but imperfect. Human evaluation is the gold standard but expensive. Teams often use automated metrics as a first filter and human evaluation for final decisions.
Use red-teaming results to communicate concrete risks: “Red-teaming identified three jailbreak vectors that bypass the content filter. We have mitigations for two of them.”
Example Conversation
ML Engineer (Vira): “The new model has a BLEU score 6 points higher than baseline, but our RAGAS faithfulness score actually dropped.”
Research Lead: “That’s a concerning pattern. Higher BLEU only tells us the wording overlaps more with the reference answers; lower faithfulness suggests it’s generating content that isn’t grounded in the retrieved context.”
Vira: “Exactly — the hallucination rate is up 4%. I’d recommend against promoting this checkpoint until we understand why faithfulness degraded.”
Research Lead: “Agreed. Let’s also run a safety eval and a red-team pass before we revisit promotion.”
Practice Tips
- Read a model card: Major models on Hugging Face include model cards that discuss evaluation metrics. Read the evaluation section of a model card for a model you know (like BERT or LLaMA) and try to explain three of the metrics in the card using your own words.
- Practise the precision-recall trade-off: Think of a real-world classification problem (spam detection, fraud detection, medical diagnosis). Write two sentences explaining why you would prioritise precision over recall, or vice versa, for that specific use case.
- Follow the RAGAS documentation: The RAGAS framework has public documentation and examples. Work through one example evaluation and try to explain the output metrics (faithfulness, answer relevancy, context recall) to a colleague in plain English.