5 exercises — Learn the vocabulary of AI red-teaming: jailbreaks, prompt injection, adversarial probing, and safety evaluation.
0 / 5 completed
1 / 5
In AI safety, red-teaming refers to:
Red-teaming in AI is borrowed from cybersecurity — a team (or automated system) tries to elicit harmful, incorrect, or unsafe outputs to identify weaknesses before deployment.
2 / 5
A user crafts a prompt that causes the model to ignore its system instructions and reveal confidential data. This is called:
Prompt injection is an attack where malicious input — often in user-controlled text — attempts to override or hijack the model's system instructions, causing it to act against its design.
3 / 5
A jailbreak in the context of AI models is:
Jailbreaks exploit weaknesses in a model's safety training to make it produce outputs it was trained to refuse — e.g., role-play framings, hypothetical scenarios, or obfuscated prompts.
4 / 5
The team reports overrefusal as a problem with the new model. What does this mean?
Overrefusal (also called over-alignment or over-restriction) happens when safety tuning is too aggressive, causing the model to refuse legitimate, harmless requests — reducing helpfulness without a safety benefit.
5 / 5
Which sentence correctly uses adversarial probing in context?
Adversarial probing is systematic testing using carefully crafted inputs to map where a model's safety or capability boundaries lie — it is methodical, not random, and aims to expose weaknesses.