AI Alignment & Safety Language Exercises — English for AI Engineers

Vocabulary and language exercises for technical AI safety and alignment: RLHF, red-teaming, benchmarks, safety properties, and risk communication.

Advanced

RLHF & Preference Learning — Vocabulary

Reward models, PPO, DPO, Constitutional AI, reward hacking — the core RLHF pipeline vocabulary.
Advanced

AI Red-Teaming — Vocabulary

Jailbreaks, prompt injection, adversarial probing, overrefusal — vocabulary for testing model safety.
Advanced

Alignment Benchmarks & Evaluation — Vocabulary

Sycophancy, sandbagging, TruthfulQA, HHH — the evaluation vocabulary of the alignment field.
Advanced

AI Safety Properties — Vocabulary

Corrigibility, scalable oversight, interpretability, deceptive alignment — core AI safety concepts.
Advanced

Communicating AI Safety Findings — Language

Alignment tax, responsible disclosure, risk tiers, dual-use — language for AI safety reports.