Advanced · 6 topic areas · 58+ exercises

AI Evaluation Engineer

AI Evaluation Engineers design and run the processes that measure whether AI systems are safe, accurate, and fit for purpose. Their daily English covers writing evaluation protocols, communicating benchmark results to product and leadership teams, documenting failure modes, and facilitating red-team exercises. They must translate highly technical evaluation findings into clear risk assessments. This path builds the vocabulary for discussing evaluation rigor, communicating uncertainty, and advocating for quality standards.

Topics covered

  • Benchmark design
  • Human evaluation protocols
  • Red-teaming & adversarial testing
  • Evaluation metrics
  • Model quality reporting
  • Responsible AI evaluation

Vocabulary spotlight

4 terms every AI Evaluation Engineer should know in English:

benchmark n.

A standardised dataset and evaluation protocol used to measure model performance on a specific task — enabling comparison across model versions or providers

"Our internal benchmark covers 500 real user queries — it is more predictive of production quality than public benchmarks."

red-teaming n.

An adversarial testing practice where evaluators deliberately try to elicit harmful, incorrect, or policy-violating outputs from an AI model

"Red-teaming found three jailbreak patterns that bypassed the content guardrails before the model launched."

evaluation rubric n.

A structured scoring guide that defines what each quality level looks like for a given evaluation dimension. It is used to align human raters.

"The evaluation rubric for 'helpfulness' defines five levels, with example responses at each level to calibrate annotators."

inter-rater agreement n.

A metric measuring how consistently different human evaluators assign the same scores. Low agreement suggests an ambiguous rubric or a genuinely subjective task.

"Cohen's kappa of 0.72 indicates substantial inter-rater agreement — the rubric is well-calibrated."
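To make the inter-rater agreement entry concrete, here is a minimal sketch of how Cohen's kappa can be computed for two raters. The function name `cohens_kappa` and the sample ratings are hypothetical and exist only for illustration; a real project would more likely use an established statistics library.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who labelled the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items that received identical labels.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement, from each rater's own label frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if p_expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined,
        # so this sketch treats the case as perfect agreement (an assumption).
        return 1.0
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical ratings: 1 = "acceptable", 0 = "not acceptable".
ratings_a = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
ratings_b = [1, 1, 1, 1, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(ratings_a, ratings_b)
print(round(kappa, 2))  # 0.6: the raters agree on 8 of 10 items, chance agreement is 0.5
```

Kappa corrects raw agreement for the agreement expected by chance, which is why 80% raw agreement in this example yields only 0.6.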

📚 Vocabulary Reference

Key terms organised by category for AI Evaluation Engineers:

Evaluation Design

benchmark, evaluation rubric, test set, gold dataset, annotation guideline, task definition, evaluation harness, offline evaluation, online evaluation, A/B evaluation

Human Evaluation

human rater, inter-rater agreement, Cohen's kappa, Likert scale, pairwise comparison, preference rating, SBS (side-by-side), annotation, calibration session, labelling tool

Adversarial Testing

red-teaming, jailbreak, prompt injection, adversarial prompt, attack vector, failure mode, safety evaluation, harm category, policy violation, edge case

Metrics & Reporting

pass rate, regression, quality delta, confidence interval, statistical significance, LLM-as-judge, win rate, hallucination rate, refusal rate, toxicity score

Recommended exercises

Real-world scenarios you'll practise

  • Writing an evaluation protocol document: specifying the benchmark tasks, annotation guidelines, scoring rubric, and success criteria for a new model launch
  • Presenting benchmark results to a product team: explaining what the numbers mean, what confidence to place in them, and what risks remain
  • Facilitating a red-team session: briefing participants, coordinating attack categories, and writing up findings in a structured report
  • Explaining model regression to an engineering team: a new model version scored higher on public benchmarks but lower on internal human evaluation
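
When presenting benchmark results, "what confidence to place in them" can be shown with a confidence interval around a pass rate. The sketch below uses the Wilson score interval, which is one common choice and not necessarily the method any given team uses. The function name `wilson_interval` and the figure of 430 passes out of 500 queries are hypothetical.

```python
import math


def wilson_interval(passes, total, z=1.96):
    """Wilson score interval for a pass rate (z=1.96 gives roughly 95% coverage)."""
    if total <= 0 or not 0 <= passes <= total:
        raise ValueError("need total > 0 and 0 <= passes <= total")
    p_hat = passes / total
    denom = 1 + z**2 / total
    centre = (p_hat + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(
        p_hat * (1 - p_hat) / total + z**2 / (4 * total**2)
    ) / denom
    return centre - half_width, centre + half_width


# Hypothetical result: 430 of 500 benchmark queries passed.
low, high = wilson_interval(430, 500)
print(f"pass rate 86.0%, 95% CI roughly {low:.1%} to {high:.1%}")
```

Reporting the interval alongside the headline number helps a product team see whether a small difference between two model versions falls within the noise.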
