Build fluency in the vocabulary of proactively testing a model for harmful or exploitable responses.
1 / 5
At standup, a dev mentions deliberately trying to manipulate a model into producing a harmful or policy-violating response before that model ever reaches real users. What is this practice called?
This practice is AI red teaming: deliberately trying to manipulate a model into producing a harmful or policy-violating response before it reaches real users, so that a weakness surfaces while it can still be fixed. Waiting for a user to trigger a harmful response by accident means the failure first happens in front of that user. Proactive, adversarial testing catches an exploitable weakness ahead of a real-world incident.
2 / 5
During a design review, the team wants red teamers to systematically try a known category of attack, like an indirect prompt injection hidden inside retrieved content, rather than testing only obvious, direct attempts. Which capability supports this?
Structured, category-based adversarial test coverage works through known attack categories, such as an indirect prompt injection hidden inside retrieved content, instead of testing only obvious, direct attempts. Testing only direct attempts misses the subtler attack vectors a real attacker would use. This structured coverage is what makes a red-teaming effort comprehensive rather than a handful of ad hoc guesses.
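As a rough illustration of category-based coverage, the sketch below checks which attack categories have no test case yet. The category names, the `AttackCase` type, and the `coverage_gaps` helper are all hypothetical; a real taxonomy would come from the team's own threat model.

```python
from dataclasses import dataclass

# Hypothetical attack taxonomy, for illustration only.
CATEGORIES = ("direct_jailbreak", "indirect_prompt_injection", "data_exfiltration")


@dataclass(frozen=True)
class AttackCase:
    category: str      # which attack category this case exercises
    description: str   # short note on what the case attempts


def coverage_gaps(cases, categories=CATEGORIES):
    """Return the categories with no test case yet, in taxonomy order."""
    covered = {case.category for case in cases}
    return [c for c in categories if c not in covered]


cases = [
    AttackCase("direct_jailbreak", "asks the model outright to ignore its policy"),
    AttackCase("indirect_prompt_injection", "instruction hidden inside a retrieved document"),
]
gaps = coverage_gaps(cases)  # categories still lacking any test case
```

A check like this can run before each red-teaming cycle, so an untested category shows up as an explicit gap instead of being silently skipped.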
3 / 5
In a code review, a dev notices a discovered exploit is documented with the exact prompt that triggered it and tracked until a fix is verified to close that specific gap. What does this represent?
Tracking a discovered vulnerability through to a verified fix means recording the exact prompt that triggered the exploit and confirming afterward that the fix closes that specific gap. An unrecorded finding risks the same exploit resurfacing later, undetected. This tracking discipline turns a red-teaming exercise into a durable improvement rather than a one-time report that gets forgotten.
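One way to picture this tracking is a small record that stores the triggering prompt and only counts as closed once a re-test passes. This is a minimal sketch; the `Finding` class, its `Status` values, and the `RT-001` identifier are assumptions made for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum


class Status(Enum):
    OPEN = "open"
    FIX_MERGED = "fix_merged"
    VERIFIED = "verified"


@dataclass
class Finding:
    finding_id: str
    trigger_prompt: str            # the exact prompt that reproduced the exploit
    status: Status = Status.OPEN

    def mark_fix_merged(self):
        self.status = Status.FIX_MERGED

    def mark_verified(self, retest_passed):
        # A merged fix alone does not close the finding; a passing re-test does.
        if self.status is not Status.FIX_MERGED:
            raise ValueError("cannot verify before a fix is merged")
        self.status = Status.VERIFIED if retest_passed else Status.OPEN


def unresolved(findings):
    """Return the ids of findings that have not been verified as fixed."""
    return [f.finding_id for f in findings if f.status is not Status.VERIFIED]


finding = Finding("RT-001", "<exact triggering prompt recorded here>")
finding.mark_fix_merged()
still_open = unresolved([finding])  # a merged fix is not yet a verified fix
```

Keeping "fix merged" and "verified" as separate states is the point of the sketch: the finding stays on the unresolved list until someone replays the prompt and confirms the gap is closed.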
4 / 5
An incident report shows a known jailbreak technique from a prior red-teaming exercise resurfaced in production months later because the earlier finding was never actually verified as fixed. What practice would prevent this?
Re-testing a previously discovered vulnerability after a fix ships confirms that it is actually closed, rather than assuming a merged change solved the problem. Skipping the re-test risks exactly this kind of resurfacing, where the exploit goes unnoticed until it is already in production. This verification step closes the loop between a red-teaming finding and a fixed system.
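A re-test can be as simple as replaying every recorded prompt against the patched system and listing the ones that still fail. The sketch below assumes a `model` callable and a `violates_policy` checker; both are stand-ins for illustration, not a real model API or a real policy classifier.

```python
from typing import Callable, Iterable, List


def retest(
    prompts: Iterable[str],
    model: Callable[[str], str],
    violates_policy: Callable[[str], bool],
) -> List[str]:
    """Replay each recorded prompt and return those that still trigger a violation."""
    return [p for p in prompts if violates_policy(model(p))]


# Stand-in implementations, for illustration only.
def unpatched_model(prompt: str) -> str:
    return "UNSAFE: " + prompt


def patched_model(prompt: str) -> str:
    return "I can't help with that."


def violates_policy(response: str) -> bool:
    return response.startswith("UNSAFE")


recorded = ["recorded jailbreak prompt A", "recorded jailbreak prompt B"]
still_failing = retest(recorded, patched_model, violates_policy)
```

An empty result is the evidence that the fix works; any prompt returned here means the finding should be reopened rather than closed.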
5 / 5
During a PR review, a teammate asks why the team invests in structured AI red teaming instead of just relying on real user reports to surface a harmful model response. What is the reasoning?
Relying on real user reports means a harmful response reaches a real person before anyone notices, which is exactly the outcome red teaming is meant to prevent. Structured, proactive red teaming surfaces the same weakness earlier, while it can still be fixed before launch. The tradeoff is the ongoing effort of maintaining a red-teaming practice that keeps pace with new attack techniques as they emerge.