Advanced AI Agents #guardrails #safety #HITL #prompt-injection

Agent Guardrails & Safety

5 exercises — master the vocabulary of safe agentic systems: input and output guardrails, human-in-the-loop patterns, runaway tool call prevention, and the complete guardrail stack.

0 / 5 completed

Agent guardrails vocabulary quick reference

Input guardrails — safety checks on user input before it reaches the agent (prompt injection, PII, policy)
Output guardrails — safety checks on agent output before it reaches the user (harm, PII leakage, hallucinations)
Prompt injection — user input that attempts to override agent instructions
Human-in-the-loop (HITL) — agent pauses for human approval before high-risk/irreversible actions
Runaway tool calls — agent in loop making unbounded tool calls; causes cost/data risk
Max-calls limit — guardrail capping total tool invocations per run
Guardrail stack — layered combination of input, loop, output, and HITL guardrails

1 / 5

What are "input guardrails" in an AI agent system?

2 / 5

What are "output guardrails" in an agentic system?

Output guardrails are the last safety gate — they check what goes OUT before it reaches users or downstream systems.

What output guardrails check:

① Harmful content — violence, hate speech, dangerous instructions
② PII leakage — did the agent accidentally include a customer's personal data in the response?
③ Hallucination detection — does the response contain claims unsupported by the retrieved context?
④ Format compliance — did the agent return valid JSON when a structured output was required?
⑤ Confidential data — did the agent include internal system prompt content or secrets?

Input vs Output guardrail comparison:

	Input guardrail	Output guardrail
Applied to	User request / input	Agent response / output
Prevents	Prompt injection, PII input	Harmful output, PII leakage
Trigger	Before agent runs	Before user sees response

Key vocabulary:
• Output filtration — removing or redacting problematic content from the response
• Hallucination check — verifying claims in the output against source documents
• Format validator — ensuring agent output matches an expected schema (JSON, Markdown, etc.)

3 / 5

What is "human-in-the-loop" (HITL) in the context of AI agents?

HITL is the safety pattern that prevents agents from taking unrecoverable actions autonomously.

When to use HITL:

Action type	HITL appropriate?
Querying a database (read-only)	Usually no
Generating a draft email	Usually no
Sending an email to 10,000 users	Yes — irreversible
Executing a payment transaction	Yes — high-risk
Deleting production database records	Yes — irreversible

HITL implementation patterns:
① Interrupt-and-wait — agent pauses, sends a summary of proposed action, waits for approve/reject
② Approval queue — high-risk actions go into a queue; human reviews asynchronously
③ Confirmation required — agent presents the plan and only proceeds on explicit "yes"

Trade-off: HITL increases safety but reduces autonomy; must be calibrated by action risk level.

Key vocabulary:
• Irreversible action — an action that cannot be undone once taken (key HITL trigger)
• High-risk action — an action with potentially large negative consequences
• Approval gate — the point in the workflow where HITL review occurs

4 / 5

Why are "runaway tool calls" a significant safety and cost concern in production agent systems?

5 / 5

A team is deploying an AI agent that manages customer communications. The CTO asks during review: "What gates prevent this agent from spamming thousands of customers?"

The architects propose three controls: ① a max-sends-per-run limit of 5; ② HITL approval required for any send to more than 50 recipients; ③ an output guardrail that checks email content for urgency manipulation. Which umbrella term covers all three?

All three measures are agent guardrails — the safety layer that wraps around the agent's capabilities.

Breaking down the three guardrails by type:

Control	Guardrail type	What it prevents
Max 5 sends per run	Tool call limit	Runaway bulk sends in a loop
HITL at 50+ recipients	Human-in-the-loop	Unauthorised mass communications
Content urgency check	Output guardrail	Manipulative or harmful tone reaching customers

The guardrail stack for production agents:

User Input → [Input Guardrails] → Agent Loop → [Tool Call Limits]
                                                     ↓
         User ← [Output Guardrails] ← Response ← [HITL Gates for high-risk actions]

Key vocabulary:
• Guardrail stack — the layered combination of input, loop, output, and HITL guardrails
• Safety review — a structured evaluation of an agent's guardrail coverage before production deployment
• Blast radius — the potential scale of harm if the agent takes unintended action (used to calibrate guardrail strength)