AI Alignment & Safety Language Exercises — English for AI Engineers
Vocabulary and language exercises for technical AI safety and alignment: RLHF, red-teaming, benchmarks, safety properties, and risk communication.
- Advanced
RLHF & Preference Learning — Vocabulary
Reward models, PPO, DPO, Constitutional AI, reward hacking — the core RLHF pipeline vocabulary.
- Advanced
AI Red-Teaming — Vocabulary
Jailbreaks, prompt injection, adversarial probing, overrefusal — vocabulary for testing model safety.
- Advanced
Alignment Benchmarks & Evaluation — Vocabulary
Sycophancy, sandbagging, TruthfulQA, HHH (helpful, honest, harmless) — the evaluation vocabulary of the alignment field.
- Advanced
AI Safety Properties — Vocabulary
Corrigibility, scalable oversight, interpretability, deceptive alignment — core AI safety concepts.
- Advanced
Communicating AI Safety Findings — Language
Alignment tax, responsible disclosure, risk tiers, dual-use — language for AI safety reports.