Build fluency in the vocabulary of defending a language model against malicious embedded instructions.
0 / 5 completed
1 / 5
At standup, a dev mentions a security concern where malicious text embedded in retrieved content could override the intended instructions given to a language model. What is this attack called?
Prompt injection is an attack where malicious text embedded in input the model processes, like a retrieved document or user message, attempts to override or manipulate the model's originally intended instructions. This is a significant security concern for any system that feeds untrusted external content into a language model's context. It's conceptually similar to older injection attacks, like SQL injection, but targets a model's instruction-following behavior instead of a database query.
2 / 5
During a design review, the team wants to clearly separate a system's trusted instructions from untrusted user or retrieved content within the prompt. Which capability supports this?
Instruction and content separation uses distinct delimiters, structured roles, or formatting to clearly mark which part of a prompt is a trusted instruction versus untrusted external content, making it harder for injected text to be mistaken for a legitimate instruction. Concatenating everything into a single undifferentiated block makes it easier for malicious embedded text to blend in with real instructions. This separation is one of several layered mitigations against prompt injection, though none is fully foolproof on its own.
3 / 5
In a code review, a dev notices the system restricts what actions a model-driven agent is allowed to take, even if a prompt injection attempt tries to instruct it otherwise. What does this represent?
Least-privilege action restriction limits what actions an AI agent is technically capable of performing, so even a successful prompt injection can't cause it to take an action outside that restricted, pre-approved scope. Granting an agent unrestricted permission means a successful injection could potentially cause serious harm, like deleting data or sending unauthorized messages. This restriction acts as a safety backstop even when a prompt-level defense fails to catch an injection attempt.
4 / 5
An incident report shows an AI agent processing a retrieved web page followed embedded hidden text instructing it to leak sensitive conversation history. What practice would reduce this risk?
Treating externally retrieved content as untrusted input, never as a legitimate new instruction, prevents hidden embedded text from being followed as though a trusted operator had written it. Treating retrieved content as equally trustworthy as configured instructions is exactly the gap a prompt injection attack exploits. This distrust-by-default posture toward external content is a foundational defense against this category of attack.
5 / 5
During a PR review, a teammate asks why the team applies least-privilege action restrictions to an AI agent instead of relying solely on prompt-level instructions to prevent it from taking a harmful action. What is the reasoning?
A prompt-level instruction telling an agent not to take a certain action can potentially be overridden by a sufficiently crafted prompt injection, since it's just more text the model is interpreting. A technical restriction on what the agent is actually capable of doing holds regardless of what the model was tricked into believing. The tradeoff is that overly restrictive permissions can limit the agent's usefulness for legitimate tasks, so the scope needs to be deliberately balanced.