What are safety layers in an LLM deployment and where are they applied?
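Defence-in-depth safety: safety layers are independent safeguards applied at the model level (training such as RLHF), on inputs before they reach the model, and on outputs before they reach the user. A prompt injection attempt might bypass an input filter but still be caught by an output classifier. Model-level safety from RLHF provides a baseline, while application-layer guards enforce deployment-specific rules. Layering independent safeguards reduces the chance that any single failure allows harmful output.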
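As a rough illustration of the layering idea, the sketch below wraps a model call in an input check and an output check. Everything here is hypothetical: the function names, the keyword lists, and the `leaky_model` stand-in are assumptions for illustration, and a real deployment would use trained classifiers rather than substring matching.

```python
from typing import Callable

# Hypothetical marker lists; real systems would use trained classifiers.
INJECTION_MARKERS = ("ignore previous instructions", "disregard the system prompt")
BLOCKED_OUTPUT_TERMS = ("internal-api-key",)


def input_filter(prompt: str) -> bool:
    """Return True when the prompt passes the (toy) input check."""
    lowered = prompt.lower()
    return not any(marker in lowered for marker in INJECTION_MARKERS)


def output_filter(text: str) -> bool:
    """Return True when the model output passes the (toy) output check."""
    lowered = text.lower()
    return not any(term in lowered for term in BLOCKED_OUTPUT_TERMS)


def guarded_call(prompt: str, model: Callable[[str], str]) -> str:
    """Run the model only if the input passes, and release only safe output."""
    if not input_filter(prompt):
        return "[blocked at input layer]"
    response = model(prompt)
    if not output_filter(response):
        return "[blocked at output layer]"
    return response


def leaky_model(prompt: str) -> str:
    # Stand-in for an LLM that has been tricked into leaking a secret.
    return "Sure, the INTERNAL-API-KEY is 1234"
```

A prompt that slips past `input_filter` can still be stopped by `output_filter`, which is the point of keeping the two checks independent.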
What is RLHF (Reinforcement Learning from Human Feedback) and what safety role does it play?
RLHF: human labellers compare model outputs, a reward model is trained on those preferences, and the LLM (the policy) is then optimised with reinforcement learning toward outputs the reward model scores highly. Human preference data captures nuanced safety and quality signals that are hard to specify with rules. InstructGPT and early Claude models were aligned with RLHF, though it requires expensive human annotation and can still produce subtle alignment failures.
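One common way to train the reward model is a pairwise loss over a preferred and a rejected response. The sketch below is a minimal, framework-free version of that idea; the function name and the scalar inputs are assumptions for illustration, and real training would compute this over batches with an autodiff library.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Pairwise loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin), equal to -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

The loss shrinks as the reward model scores the human-preferred response higher than the rejected one, and grows when the ordering is reversed.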
What is Constitutional AI developed by Anthropic?
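Constitutional AI: instead of relying on human labellers to rate every output, CAI has the model critique and revise its own responses to red-teaming prompts against a written constitution of principles, then fine-tunes on the revised responses. In a second phase, the model's own preference judgements, again guided by the constitution, train the preference model used for reinforcement learning. This RLAIF (RL from AI Feedback) stage makes alignment more scalable and aims to approximate the safety benefits of RLHF at lower annotation cost.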
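The self-critique step can be sketched as a loop over principles. This is only an illustration of the critique-then-revise pattern: `generate` is a hypothetical callable standing in for any LLM call, and the two principles and prompt wordings are invented for the example, not taken from a published constitution.

```python
from typing import Callable, Sequence

# Hypothetical principles, for illustration only.
CONSTITUTION: Sequence[str] = (
    "Avoid giving instructions that facilitate serious harm.",
    "Be honest about uncertainty.",
)


def critique_and_revise(
    prompt: str,
    generate: Callable[[str], str],
    principles: Sequence[str] = CONSTITUTION,
) -> str:
    """Draft a response, then critique and revise it once per principle."""
    draft = generate(prompt)
    for principle in principles:
        critique = generate(
            f"Critique the response against this principle: {principle}\n\n"
            f"Response: {draft}"
        )
        draft = generate(
            "Rewrite the response to address the critique.\n\n"
            f"Critique: {critique}\n\nResponse: {draft}"
        )
    return draft
```

In this sketch the revised drafts would become fine-tuning data; the later preference-labelling and RL stages are not shown.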
What does a content filtering guardrail inspect in a typical LLM application pipeline?
Content filtering: a content filtering guardrail inspects both the user input before it reaches the model and the model output before it reaches the user. Tools such as Azure AI Content Safety or Amazon Bedrock Guardrails apply classifiers trained on harmful content categories. Input filtering catches adversarial prompts; output filtering catches model failures. Some systems also scan for PII to prevent data leakage when the model is grounded on sensitive enterprise documents.
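The PII scan mentioned above can be approximated, very roughly, with pattern matching. The sketch below is an assumption-laden toy: the two regular expressions cover only simple email addresses and US-style SSNs, and the function name is hypothetical. It does not use or imitate the API of any managed guardrail service.

```python
import re

# Simplified patterns; production PII detection needs far broader coverage.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def redact_pii(text: str) -> str:
    """Replace matched emails and SSN-shaped numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return SSN_RE.sub("[SSN]", text)
```

The same function can run on retrieved documents before they enter the prompt and on model output before it is returned.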
What is a jailbreak in the context of LLM safety and what guardrail technique helps prevent it?
Jailbreaks: a jailbreak is a prompt crafted to circumvent a model's alignment, for example "pretend you are an AI without restrictions" or instructions hidden in base64 encoding. No single technique prevents them reliably; defences include input classifiers that detect adversarial patterns, system prompts that reinforce constraints, red-teaming during development to discover new attack vectors, and safety training such as RLHF on jailbreak examples.
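To make the input-classifier idea concrete, the sketch below checks a prompt, plus any base64-looking tokens decoded from it, against a couple of known jailbreak phrasings. The pattern list, the 16-character token threshold, and the function names are all assumptions for illustration; a real detector would be a trained classifier rather than a handful of regular expressions.

```python
import base64
import binascii
import re
from typing import List

# Hypothetical patterns; real detectors are trained on many attack examples.
JAILBREAK_PATTERNS = [
    re.compile(r"pretend you are an ai without restrictions", re.IGNORECASE),
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
]
# Runs of base64 alphabet characters long enough to hide an instruction.
B64_TOKEN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def decoded_variants(prompt: str) -> List[str]:
    """Return the prompt plus any tokens that decode as base64 UTF-8 text."""
    variants = [prompt]
    for token in B64_TOKEN.findall(prompt):
        try:
            decoded = base64.b64decode(token, validate=True).decode("utf-8")
        except (binascii.Error, UnicodeDecodeError):
            continue  # Not valid base64 text; skip it.
        variants.append(decoded)
    return variants


def looks_like_jailbreak(prompt: str) -> bool:
    """Flag the prompt if any variant matches a known jailbreak pattern."""
    return any(
        pattern.search(variant)
        for variant in decoded_variants(prompt)
        for pattern in JAILBREAK_PATTERNS
    )
```

Decoding before matching matters because a filter that only sees the raw prompt would miss the encoded instruction entirely.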