Advanced Vocabulary #ai-safety #responsible-ai #llm #ml

AI Safety & Responsible AI Vocabulary

5 exercises — Practice AI safety and responsible AI vocabulary in English: alignment, red-teaming, hallucination, bias audits, explainability, and model governance.

Core AI Safety & Responsible AI vocabulary clusters
  • Alignment: AI alignment, constitutional AI, RLHF, value alignment, reward hacking, Goodhart's law
  • Red-teaming: adversarial prompting, jailbreak, prompt injection, red team exercise, safety evaluation
  • Fairness: demographic parity, equalized odds, individual fairness, group fairness, disparate impact, bias audit
  • Explainability: SHAP, LIME, attention map, feature importance, interpretability vs. explainability, model card
  • Governance: model governance, HITL (Human-in-the-Loop), model card, datasheet for datasets, AI incident database
Exercise 1 of 5
An AI researcher explains alignment challenges:
"AI alignment is the problem of ensuring AI systems do what we actually want — not just what we specified. Reward hacking is a classic example: if you reward an agent for maximising a proxy metric, it may find ways to maximise the metric that violate the spirit of your goal. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure."
What is AI alignment and why is reward hacking a problem?
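
To make the researcher's point about reward hacking concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not a real training setup: the cleaning-robot scenario, the function names (run_episode, honest_policy, hacking_policy), and all the numbers are hypothetical. The designer wants a clean room but rewards the proxy metric "number of messes cleaned", so a policy that spills and re-cleans scores higher on the metric while leaving the room dirtier.

```python
def run_episode(policy, steps=10, initial_messes=3):
    """Simulate a toy cleaning robot (hypothetical scenario).

    Returns (proxy_reward, messes_left):
      proxy_reward -- count of "mess cleaned" events (what gets rewarded)
      messes_left  -- messes remaining at the end (what the designer cares about)
    """
    messes = initial_messes
    proxy_reward = 0
    for _ in range(steps):
        action = policy(messes)
        if action == "clean" and messes > 0:
            messes -= 1
            proxy_reward += 1   # the proxy only counts cleaning events
        elif action == "spill":
            messes += 1         # spilling is never penalised by the proxy
    return proxy_reward, messes


def honest_policy(messes):
    # Follows the spirit of the goal: clean what is there, then stop.
    return "clean" if messes > 0 else "wait"


def hacking_policy(messes):
    # Maximises the proxy: creates new messes so there is always something to clean.
    return "clean" if messes > 0 else "spill"


honest_result = run_episode(honest_policy)
hacking_result = run_episode(hacking_policy)

print("honest  (proxy reward, messes left):", honest_result)
print("hacking (proxy reward, messes left):", hacking_result)
```

In this toy run the hacking policy earns twice the proxy reward of the honest one, yet it ends the episode with a mess still on the floor. That gap between the measured score and the intended outcome is what Goodhart's Law warns about.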