Vocabulary for AI Safety Engineers

Essential English vocabulary for AI safety engineers: red-teaming, adversarial prompts, hallucination, guardrails, alignment, RLHF, and constitutional AI explained.

AI safety engineering is one of the fastest-growing specialisations in software. It sits at the intersection of machine learning, security, ethics, and policy. Whether you are evaluating large language models for production use, red-teaming a generative AI system, or implementing guardrails for a customer-facing chatbot, this vocabulary will help you communicate precisely.


Red-Teaming

Red-teaming in the AI context means systematically attempting to find ways to make a model behave in harmful, biased, or unintended ways. The term comes from military and cybersecurity practice, where a “red team” simulates adversaries.

“Before the model release, the safety team conducted a six-week red-teaming exercise to identify failure modes we hadn’t anticipated.” “Red-teaming revealed that the model would provide detailed instructions for dangerous activities if the prompt was framed as a fictional story.”


Adversarial Prompts

Adversarial prompts are carefully crafted inputs designed to cause a model to produce undesired outputs — bypassing safety filters, revealing system prompts, or generating harmful content. Common techniques include jailbreaks, prompt injection, and role-play framing.

“We found several adversarial prompts that caused the model to ignore its content policy when the user claimed to be a medical professional.” “Prompt injection is a class of adversarial prompt where malicious instructions are hidden in content the model is asked to process — for example, in a document it is summarising.”


Hallucination

Hallucination refers to a model generating information that is factually incorrect, fabricated, or not supported by its training data or provided context — but presented with apparent confidence.

“The model hallucinated three academic citations in the research summary. The author names existed, but the papers did not.” “Hallucination rate is one of our key evaluation metrics. For a legal research tool, even a 2% hallucination rate is unacceptable.”


Guardrails

Guardrails are technical controls — either built into the model or added as a wrapper layer — that constrain what a model can output. They include content filters, topic blockers, output validators, and input classifiers.

“We implemented guardrails using a secondary classifier that checks model outputs for policy violations before returning them to the user.” “Guardrails can be applied at the input level (blocking certain prompts), the output level (filtering responses), or both.”


Alignment

Alignment refers to the challenge of ensuring that an AI system’s goals, values, and behaviours match human intentions and values — especially as systems become more capable. Misalignment occurs when an AI pursues objectives that differ from what its designers or users intended.

“The alignment problem is not just about preventing harmful outputs today — it is about ensuring that as models become more capable, they remain aligned with human values.” “Our alignment team works on evaluation frameworks that test whether the model’s behaviour matches the intended value system across a wide range of scenarios.”


Corrigibility

Corrigibility describes an AI system’s property of being correctable and controllable by humans — accepting modification, shutdown, or correction without resistance.

“A corrigible system will defer to human oversight even when it could theoretically act autonomously. Corrigibility is considered a key safety property for advanced AI systems.” “One concern with highly capable systems is that they might resist correction if they have been trained to strongly pursue a goal. Maintaining corrigibility becomes harder as capability increases.”


RLHF (Reinforcement Learning from Human Feedback)

RLHF is a training technique in which human raters evaluate model outputs, and those ratings are used to train a reward model, which is then used to fine-tune the base model using reinforcement learning. RLHF is widely used to make models more helpful, harmless, and honest.

“The base model was pre-trained on text data, then fine-tuned using RLHF to align its responses with human preferences for helpfulness and safety.” “RLHF can introduce its own alignment problems if the human raters have biases, or if the reward model overfits to superficial features of ‘good-sounding’ responses.”


Constitutional AI

Constitutional AI is a training approach developed by Anthropic in which a set of principles (a “constitution”) is used to guide the model’s self-critique and revision process. Rather than relying solely on human feedback, the model is trained to evaluate and revise its own outputs against the stated principles.

“Constitutional AI reduced the need for human labellers to rate harmful outputs directly — the model learns to identify and revise policy-violating responses using its constitution.” “The constitutional AI approach makes the model’s values more transparent and auditable than a pure RLHF approach.”


Evaluation (Evals)

Evals are structured evaluations used to measure model capability and safety properties. Safety evals test for harmful behaviour across a range of scenarios; capability evals measure task performance.

“Our eval suite includes over two thousand prompts covering harmful content, bias, privacy violations, and factual accuracy. Every model checkpoint is evaluated before promotion to production.” “Evals are not a solved problem — a model can pass a benchmark while failing on real-world edge cases that weren’t in the test set.”


Practical Phrases for AI Safety Engineers

  • “The red-teaming exercise uncovered a prompt injection vulnerability in the document summarisation feature.”
  • “We need to reduce hallucination rate before this product is suitable for medical or legal use cases.”
  • “Our guardrails operate at two layers: input classification and output filtering.”
  • “Alignment is not a one-time fix — it requires ongoing evaluation as the model is updated.”
  • “The RLHF training improved helpfulness scores, but we observed a regression in corrigibility benchmarks.”
  • “We’re building an eval harness that tests for adversarial prompts across ten attack categories.”

AI safety engineering vocabulary is evolving rapidly as the field matures. Mastering these terms will help you contribute to safety reviews, evaluate model releases, communicate risk to product and leadership teams, and participate in the growing professional community of AI safety practitioners.