Advanced Vocabulary #ai-safety#responsible-ai#llm#ml

AI Safety & Responsible AI Vocabulary

5 exercises — Practice AI safety and responsible AI vocabulary in English: alignment, red-teaming, hallucination, bias audits, explainability, and model governance.

Core AI Safety & Responsible AI vocabulary clusters

Alignment: AI alignment, constitutional AI, RLHF, value alignment, reward hacking, Goodhart's law
Red-teaming: adversarial prompting, jailbreak, prompt injection, red team exercise, safety evaluation
Fairness: demographic parity, equalized odds, individual fairness, group fairness, disparate impact, bias audit
Explainability: SHAP, LIME, attention map, feature importance, interpretability vs. explainability, model card
Governance: model governance, HITL (Human-in-the-Loop), model card, datasheet for datasets, AI incident database

0 / 5 completed

1 / 5

An AI researcher explains alignment challenges:
"AI alignment is the problem of ensuring AI systems do what we actually want — not just what we specified. Reward hacking is a classic example: if you reward an agent for maximising a proxy metric, it may find ways to maximise the metric that violate the spirit of your goal. Goodhart's Law: when a measure becomes a target, it ceases to be a good measure."
What is AI alignment and why is reward hacking a problem?

2 / 5

A security team introduces AI red-teaming:
"Before we deploy this LLM-powered feature, we're running a red team exercise. Red-teamers try to make the model produce harmful outputs through adversarial prompting. They test for jailbreaks — inputs that bypass the model's safety guardrails. They also test for prompt injection: malicious content in retrieved documents that tries to override the system prompt and hijack the model's behaviour."
What is the difference between a jailbreak and prompt injection?

3 / 5

An ML engineer presents fairness metrics to leadership:
"We audited our hiring recommendation model for bias. Demographic parity asks: does the model recommend candidates at the same rate across demographic groups? Equalized odds asks: does the model make errors at the same rate across groups — both false positives and false negatives? These metrics sometimes conflict — improving one can worsen the other."
What is the difference between demographic parity and equalized odds?

4 / 5

A data scientist explains model explainability to a product team:
"Our fraud model uses SHAP values to explain individual predictions. SHAP tells us how much each feature contributed to this prediction — positively or negatively. For this transaction, the model flagged it as fraud because the location was unusual (+0.4 impact) and the amount was 10× the user's average (+0.3 impact), but the device was known (-0.1 impact, slightly reducing the fraud score)."
What is the difference between interpretability and explainability in ML?

5 / 5

An AI governance lead introduces model governance practices:
"Every model we deploy to production goes through our model governance process. The model card documents: what the model does, who it's for, what data it was trained on, its performance metrics, known limitations, and which populations it may underperform for. Human-in-the-loop reviews are required for high-stakes decisions. We also subscribe to the AI Incident Database to learn from others' failures."
What is a model card and why is it important for responsible AI?