🤖 AI Alignment & Safety Language
5 exercise sets. Master the technical vocabulary AI safety researchers and ML engineers use when discussing alignment techniques, red-teaming, and safety evaluations.
RLHF & Training Vocabulary
Reward model, preference labeling, PPO, KL divergence, Constitutional AI vocabulary.
AI Red-Teaming Language
Red-team, jailbreak, prompt injection, adversarial probing, capability evaluation vocabulary.
Alignment Benchmarks
Alignment benchmarks, capability elicitation, sandbagging, and sycophancy vocabulary for model evaluation.
Safety Properties
Corrigibility, scalable oversight, interpretability, faithfulness, deceptive alignment vocabulary.
AI Safety Communication
Vocabulary for communicating safety findings, escalating safety incidents, and responsible disclosure in AI.