English for LLM Evaluation Teams: Benchmarks, RLHF and Model Evals
Learn the English vocabulary and phrases for discussing LLM evaluation, benchmarks, RLHF, and model quality in cross-functional AI teams.
Large language model (LLM) evaluation is now a discipline of its own. If you work on a team that trains, fine-tunes, or assesses AI models — and that team operates in English — you need fluent vocabulary for concepts like benchmark suites, RLHF pipelines, and failure mode analysis.
Talking About Benchmarks
A benchmark is a standardised test that measures model performance on a defined task. When discussing benchmarks in English, use precise language:
- “The model scores 78.4 on MMLU.” (not “gets 78”)
- “We run the benchmark after every training checkpoint.”
- “Performance degraded on the reasoning subset.”
- “The model outperforms the baseline on HellaSwag.”
Common benchmark names and how to use them
| Benchmark | What it tests | Example sentence |
|---|---|---|
| MMLU | Broad knowledge | “MMLU covers 57 academic subjects.” |
| HumanEval | Code generation | “The model achieves 65% pass@1 on HumanEval.” |
| TruthfulQA | Truthfulness | “TruthfulQA checks whether the model repeats common misconceptions.” |
| MT-Bench | Multi-turn chat | “MT-Bench uses GPT-4 as the judge.” |
Language note: Benchmark names are proper nouns, so capitalise them. Say that a model “scores 78.4 on” a benchmark rather than “gets a score of 78.4 on” it; the longer form is too wordy for technical speech.
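If the pass@1 figure in the HumanEval example is unfamiliar: pass@k is the probability that at least one of k sampled completions passes the unit tests. The sketch below shows the commonly used unbiased estimator, assuming n samples were generated per problem and c of them passed. The function name and the per-problem counts are illustrative and not taken from any particular evaluation harness.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which passed."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failures: every size-k subset contains a pass.
        return 1.0
    # 1 - P(all k drawn samples are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-problem results: (samples generated, samples passing)
results = [(10, 7), (10, 0), (10, 10), (10, 3)]
score = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)
print(f"pass@1 = {score:.1%}")
```

With k=1 the estimator reduces to the fraction of passing samples, which is why a team can say “65% pass@1” without further qualification.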
RLHF: Reinforcement Learning from Human Feedback
RLHF (Reinforcement Learning from Human Feedback) is a technique for aligning models with human preferences. Knowing the vocabulary lets you take part in pipeline discussions.
The three stages
- Supervised fine-tuning (SFT) — “We SFT’d the base model on curated demonstrations.”
- Reward model training — “Human raters annotate preference pairs; the reward model learns from these comparisons.”
- Reinforcement learning (RL) step — “We used PPO to optimise against the reward signal.”
Key phrases
- preference data — pairs of model outputs where raters choose the better one
- the reward model — a model trained to predict human preference scores
- KL divergence penalty — a regularisation term that stops the RL policy drifting too far from the SFT model
- over-optimisation — when the model exploits the reward model’s weaknesses
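To make the KL divergence penalty concrete, the sketch below shows one common way it enters the RL step: the reward-model score is reduced in proportion to how much more likely the policy finds its own response than the SFT model does. The function name, the coefficient `beta`, and all numbers are assumptions for illustration, not a specific library's API.

```python
def kl_penalised_reward(rm_score, policy_logprobs, sft_logprobs, beta=0.1):
    """Reward-model score minus beta times a per-sample KL estimate.

    policy_logprobs / sft_logprobs: log-probabilities each model assigns
    to the tokens of the same sampled response.
    """
    if len(policy_logprobs) != len(sft_logprobs):
        raise ValueError("log-prob sequences must be the same length")
    # Summed log-ratio over tokens: a single-sample estimate of KL(policy || SFT).
    kl_estimate = sum(p - s for p, s in zip(policy_logprobs, sft_logprobs))
    return rm_score - beta * kl_estimate

# Hypothetical numbers for a three-token response
policy_lp = [-0.2, -0.5, -0.1]
sft_lp = [-0.9, -1.1, -0.6]
shaped = kl_penalised_reward(rm_score=2.0, policy_logprobs=policy_lp, sft_logprobs=sft_lp)
print(round(shaped, 3))
```

A larger `beta` keeps the policy closer to the SFT model. A smaller one lets it chase the reward model harder, which raises the risk of over-optimisation.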
In conversation
- “The reward model was overfitting to surface features like response length.”
- “We set the KL penalty coefficient to 0.1 so the policy keeps the SFT model’s capabilities.”
- “Annotators showed low inter-rater agreement on open-ended questions.”
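Inter-rater agreement is usually reported as a statistic rather than a raw percentage. The sketch below computes Cohen's kappa for two annotators, which corrects observed agreement for the agreement expected by chance. The function name and the preference labels are made up for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Agreement expected if each rater labelled at random
    # with their own label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Hypothetical preference labels: which response ("A" or "B") each rater preferred
rater_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
rater_2 = ["A", "B", "B", "B", "A", "A", "A", "B"]
kappa = cohens_kappa(rater_1, rater_2)
print(f"kappa = {kappa:.2f}")
```

A kappa near 0 means the raters agree about as often as chance would predict, and a value near 1 means near-perfect agreement. In a meeting you might say, “Kappa on the open-ended subset was only 0.25.”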
Discussing Evaluation Dimensions
Model eval is multi-dimensional. Use these phrases to structure discussions:
Capability vs. safety
- “The model is strong on reasoning tasks but prone to sycophancy.”
- “We evaluate helpfulness and harmlessness separately.”
- “There is a trade-off between following instructions and refusing harmful requests.”
Failure modes
- Hallucination — “The model hallucinated a citation that does not exist.”
- Sycophancy — “The model agrees with users even when they are wrong.”
- Refusal — “The model over-refuses benign requests in domain X.”
- Drift — “We saw capability drift after the second fine-tuning round.”
Statistical rigour
- “The difference is statistically significant at p < 0.05.”
- “We report confidence intervals across five evaluation runs.”
- “The eval set may be contaminated with training data.”
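The phrase “confidence intervals across five evaluation runs” can be backed by a short calculation. The sketch below computes a 95% interval for the mean score using Student's t distribution, which suits small samples. The run scores are hypothetical, and the small lookup table of critical values is a convenience for this example rather than a full statistics library.

```python
from math import sqrt
from statistics import mean, stdev

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262}

def confidence_interval_95(scores):
    """Return (mean, lower, upper) for a small sample of run scores."""
    df = len(scores) - 1
    if df not in T_CRIT_95:
        raise ValueError(f"no t critical value tabulated for df={df}")
    centre = mean(scores)
    # Half-width = t critical value times the standard error of the mean.
    half_width = T_CRIT_95[df] * stdev(scores) / sqrt(len(scores))
    return centre, centre - half_width, centre + half_width

# Hypothetical accuracy (%) from five evaluation runs with different seeds
runs = [76.2, 75.8, 76.5, 75.9, 76.1]
m, lo, hi = confidence_interval_95(runs)
print(f"{m:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

If the intervals of two checkpoints overlap heavily, it is safer to say “the difference is within noise” than to claim an improvement.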
Running an Eval Review Meeting
Here are phrases for a team eval debrief:
Opening:
“Let’s walk through the latest eval results. I’ll start with the headline numbers, then we’ll dive into the failure categories.”
Presenting data:
“On MMLU, we’re at 76.2, up from 73.8 last week, a 2.4-point gain. However, the safety eval shows a slight regression in refusal accuracy.”
Raising concerns:
“I’m worried about benchmark leakage here: some of these examples may be in the pre-training corpus.”
Proposing next steps:
“I’d recommend running a held-out eval suite and adding a contamination check before we call this a win.”
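A contamination check of the kind proposed here is often a simple n-gram overlap test between eval examples and the training corpus. The sketch below is one minimal version, assuming both fit in memory as lists of strings. The function names, the tokenisation by whitespace, and the sample data are all illustrative assumptions.

```python
def ngrams(text, n):
    """Set of word n-grams in lower-cased, whitespace-tokenised text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def flag_contaminated(eval_examples, corpus_docs, n=8):
    """Return eval examples that share at least one n-gram with the corpus."""
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    return [ex for ex in eval_examples if ngrams(ex, n) & corpus_ngrams]

# Hypothetical data; n is kept small here only because the strings are short
corpus = ["the capital of france is paris and it lies on the seine"]
evals = [
    "What is the capital of France? The capital of France is Paris.",
    "Name the longest river in South America.",
]
flagged = flag_contaminated(evals, corpus, n=5)
print(flagged)
```

Flagged examples can then be dropped or reported separately, which supports a sentence such as “Results on the decontaminated subset were two points lower.”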
Writing Eval Reports
When documenting evaluation results in English:
- Use passive voice for methodology: “The model was evaluated on…”
- Use active voice for findings: “The reward model outperformed the baseline on…”
- State exact numbers with units: “Pass@1 improved from 61.2% to 67.8%.”
- Include caveats: “These results should be interpreted with caution given the small sample size.”
Key Takeaways
- Use score, evaluate, benchmark, and assess precisely — not interchangeably.
- RLHF vocabulary: SFT, preference data, reward model, KL penalty, over-optimisation.
- Name failure modes explicitly: hallucination, sycophancy, over-refusal, capability drift.
- In meetings, structure your contribution: headline numbers → failure breakdown → next steps.
- In written reports, prefer passive for methodology and active for findings.