English for LLM Evaluation Teams: Benchmarks, RLHF and Model Evals

Learn the English vocabulary and phrases for discussing LLM evaluation, benchmarks, RLHF, and model quality in cross-functional AI teams.

Large language model (LLM) evaluation is now a discipline of its own. If you work on a team that trains, fine-tunes, or assesses AI models — and that team operates in English — you need fluent vocabulary for concepts like benchmark suites, RLHF pipelines, and failure mode analysis.


Talking About Benchmarks

A benchmark is a standardised test that measures model performance on a defined task. When discussing benchmarks in English, use precise language:

  • “The model scores 78.4 on MMLU.” (not “gets 78”)
  • “We run the benchmark after every training checkpoint.”
  • “Performance degraded on the reasoning subset.”
  • “The model outperforms the baseline on HellaSwag.”

Common benchmark names and how to use them

Benchmark | What it tests | Example sentence
MMLU | Broad knowledge | “MMLU covers 57 academic subjects.”
HumanEval | Code generation | “The model achieves 65% pass@1 on HumanEval.”
TruthfulQA | Factual accuracy | “TruthfulQA probes for hallucinations.”
MT-Bench | Multi-turn chat | “MT-Bench uses GPT-4 as the judge.”

Language note: Benchmark names are proper nouns, so capitalise them. Say that a model “scores 78.4 on” a benchmark rather than “gets a score of 78.4 on” it; the longer form is too wordy for technical speech.
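
To make the HumanEval row concrete, here is a minimal Python sketch of how a pass@k figure can be computed. It assumes the commonly used unbiased estimator, 1 - C(n-c, k) / C(n, k), where n samples are generated per problem and c of them pass the unit tests. The function name and the per-problem counts are hypothetical, chosen only for illustration.

```python
from math import comb


def pass_at_k(n, c, k):
    """Estimate pass@k for one problem: n samples generated, c of them correct."""
    if n - c < k:
        # Fewer than k incorrect samples: any draw of k contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical results: (samples generated, samples that passed the tests)
results = [(10, 7), (10, 0), (10, 10), (10, 3)]

# The benchmark-level figure is the mean of the per-problem estimates.
score = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)
print(f"pass@1 = {score:.1%}")
```

With these invented counts you would say: “The model achieves 50% pass@1 on our sample set.”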


RLHF: Reinforcement Learning from Human Feedback

RLHF (Reinforcement Learning from Human Feedback) is a widely used technique for aligning models with human preferences. Knowing its vocabulary lets you take part in pipeline discussions.

The three stages

  1. Supervised fine-tuning (SFT) — “We SFT’d the base model on curated demonstrations.”
  2. Reward model training — “Human raters annotate preference pairs; the reward model learns from these comparisons.”
  3. Reinforcement learning (RL) step — “We used PPO to optimise against the reward signal.”

Key phrases

  • preference data — pairs of model outputs where raters choose the better one
  • the reward model — a model trained to predict human preference scores
  • KL divergence penalty — a regularisation term that stops the RL policy drifting too far from the SFT model
  • over-optimisation — when the model exploits the reward model’s weaknesses
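
Two of these phrases correspond to short formulas, and seeing them can make the vocabulary easier to use. The sketch below assumes one common formulation: a pairwise (Bradley-Terry style) loss for the reward model, and a per-sample reward with a KL penalty for the RL step. The function names, the beta value of 0.1, and the numbers are illustrative assumptions, not a description of any particular pipeline.

```python
import math


def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m))
    return math.log1p(math.exp(-margin))


def kl_penalised_reward(reward, logp_policy, logp_sft, beta=0.1):
    """Reward minus beta times the log-ratio between the policy and the SFT model."""
    return reward - beta * (logp_policy - logp_sft)


# The loss is small when the reward model ranks the preferred output higher.
print(preference_loss(2.0, 0.0))  # preferred output scored higher
print(preference_loss(0.0, 2.0))  # preferred output scored lower

# The penalty grows as the policy assigns more probability than the SFT model did.
print(kl_penalised_reward(1.0, logp_policy=-2.0, logp_sft=-3.0))
```

A larger beta keeps the policy closer to the SFT model; a smaller beta gives it more room to chase the reward signal, which raises the risk of over-optimisation.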

In conversation

  • “The reward model was overfitting to surface features like response length.”
  • “We set the KL penalty coefficient to 0.1 to preserve the SFT model’s capabilities.”
  • “Annotators showed low inter-rater agreement on open-ended questions.”
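
Inter-rater agreement is usually reported as a statistic rather than as a raw percentage. Here is a minimal sketch, assuming two raters and Cohen's kappa as the measure (other teams may prefer Fleiss' kappa or Krippendorff's alpha). The ratings are invented for illustration.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Agreement between two raters, corrected for agreement expected by chance."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    # Note: undefined (division by zero) if both raters always use one identical label.
    return (observed - expected) / (1 - expected)


# Hypothetical preference labels: which response ("A" or "B") each rater preferred
rater_1 = ["A", "A", "B", "B", "A", "B", "A", "A"]
rater_2 = ["A", "B", "B", "A", "A", "B", "B", "A"]

kappa = cohens_kappa(rater_1, rater_2)
print(f"Cohen's kappa = {kappa:.2f}")
```

The raters agree on 5 of 8 items, but half of that agreement is expected by chance, so kappa is only 0.25. In a meeting: “Kappa was 0.25 on the open-ended subset, so agreement is low.”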

Discussing Evaluation Dimensions

Model eval is multi-dimensional. Use these phrases to structure discussions:

Capability vs. safety

  • “The model is strong on reasoning tasks but prone to sycophancy.”
  • “We evaluate helpfulness and harmlessness separately.”
  • “There is a trade-off between following instructions and refusing harmful requests.”

Failure modes

  • Hallucination — “The model hallucinated a citation that does not exist.”
  • Sycophancy — “The model agrees with users even when they are wrong.”
  • Over-refusal — “The model over-refuses benign requests in domain X.”
  • Capability drift — “We saw capability drift after the second fine-tuning round.”

Statistical rigour

  • “The difference is statistically significant at p < 0.05.”
  • “We report confidence intervals across five evaluation runs.”
  • “The eval set may be contaminated with training data.”
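
The phrase “confidence intervals across five evaluation runs” can be illustrated with a short calculation. This is a minimal sketch, assuming a 95% interval based on Student's t distribution and a small hard-coded table of critical values; the run scores are hypothetical.

```python
import math
import statistics

# Two-sided 95% critical values of Student's t, keyed by degrees of freedom.
T_CRIT_95 = {2: 4.303, 3: 3.182, 4: 2.776, 5: 2.571, 9: 2.262}


def mean_ci95(scores):
    """Return (mean, lower, upper) for a 95% confidence interval on the mean."""
    n = len(scores)
    mean = statistics.mean(scores)
    # Standard error of the mean, using the sample standard deviation.
    sem = statistics.stdev(scores) / math.sqrt(n)
    half_width = T_CRIT_95[n - 1] * sem
    return mean, mean - half_width, mean + half_width


# Hypothetical benchmark scores from five evaluation runs
runs = [76.0, 76.4, 75.9, 76.3, 76.4]
mean, low, high = mean_ci95(runs)
print(f"{mean:.1f} (95% CI {low:.1f} to {high:.1f})")
```

When you report the result, give the interval together with the number of runs: “We score 76.2 on average over five runs, with a 95% confidence interval of roughly 75.9 to 76.5.”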

Running an Eval Review Meeting

Here are phrases for a team eval debrief:

Opening:

“Let’s walk through the latest eval results. I’ll start with the headline numbers, then we’ll dive into the failure categories.”

Presenting data:

“On MMLU, we’re at 76.2, up from 73.8 last week, a 2.4-point gain. However, the safety eval shows a slight regression in refusal accuracy.”

Raising concerns:

“I’m worried about the benchmark leakage risk here — some of these examples may be in the pre-training corpus.”

Proposing next steps:

“I’d recommend running a held-out eval suite and adding a contamination check before we call this a win.”
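
If a colleague asks what a “contamination check” involves, one simple approach is to look for word n-grams that an eval example shares with the training corpus. The sketch below assumes whitespace tokenisation and exact n-gram matching; the function names, the choice of n, and the sample texts are hypothetical, and real checks are usually more elaborate.

```python
def ngrams(text, n):
    """Set of lowercase word n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def contamination_rate(eval_examples, corpus_docs, n=8):
    """Fraction of eval examples sharing at least one n-gram with the corpus."""
    corpus_ngrams = set()
    for doc in corpus_docs:
        corpus_ngrams |= ngrams(doc, n)
    flagged = [ex for ex in eval_examples if ngrams(ex, n) & corpus_ngrams]
    return len(flagged) / len(eval_examples), flagged


# Tiny invented example; n=4 only because the texts are so short.
corpus = ["The quick brown fox jumps over the lazy dog"]
evals = ["Quick brown fox jumps high", "A completely different question here"]

rate, flagged = contamination_rate(evals, corpus, n=4)
print(f"{rate:.0%} of eval examples overlap with the corpus: {flagged}")
```

You could then report: “Half of the examples in this toy set share a 4-gram with the corpus, so we should treat the score with caution.”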


Writing Eval Reports

When documenting evaluation results in English:

  • Use passive voice for methodology: “The model was evaluated on…”
  • Use active voice for findings: “The reward model outperformed the baseline on…”
  • State exact numbers with units: “Pass@1 improved from 61.2% to 67.8%.”
  • Include caveats: “These results should be interpreted with caution given the small sample size.”
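
Stating numbers with units also means distinguishing percentage points from relative percentages. A minimal sketch of the arithmetic, using the pass@1 figures from the example sentence above; the function name is hypothetical.

```python
def describe_change(old_pct, new_pct):
    """Return (absolute change in percentage points, relative change in percent)."""
    points = new_pct - old_pct
    relative = points / old_pct * 100
    return points, relative


points, relative = describe_change(61.2, 67.8)
print(f"+{points:.1f} percentage points ({relative:.1f}% relative improvement)")
```

Here the absolute gain is 6.6 percentage points, while the relative improvement is about 10.8%. Say which one you mean: “Pass@1 improved by 6.6 points” and “Pass@1 improved by 10.8%” are different claims.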

Key Takeaways

  • Use score, evaluate, benchmark, and assess precisely — not interchangeably.
  • RLHF vocabulary: SFT, preference data, reward model, KL penalty, over-optimisation.
  • Name failure modes explicitly: hallucination, sycophancy, over-refusal, capability drift.
  • In meetings, structure your contribution: headline numbers → failure breakdown → next steps.
  • In written reports, prefer passive for methodology and active for findings.