AI Safety English: Vocabulary for Alignment, Red-Teaming, and Safety Evaluation

Alignment, corrigibility, RLHF, reward hacking, jailbreak — the precise English vocabulary AI safety researchers and LLM engineers use in safety reviews and evaluations.

AI safety is one of the fastest-moving fields in technology — and its vocabulary is precise, contested, and constantly evolving. For non-native English speakers working in ML engineering, safety evaluation, or LLM research, mastering this language means you can read research papers, contribute to safety reviews, and hold your own in discussions with senior researchers.

This post focuses not just on definitions but on how safety professionals use these words in papers, Slack threads, and evaluation reports.


Alignment and Its Core Vocabulary

Alignment refers to the property of an AI system behaving in accordance with human intentions and values. An aligned model does what we want it to do; a misaligned model does something different — not necessarily through malice, but because the objective it was trained on does not fully capture what we actually want.

“The model is technically following the prompt, but the output is manipulative. This is an alignment failure — the objective and human values diverged.”

Value alignment is a more specific form: the model’s behaviour reflects human values, not just stated instructions. It is common to say a model is well-aligned or has poor alignment.

“Value alignment is hard to evaluate empirically — it requires normative judgements about what ‘human values’ even means.”

Corrigibility describes the property of a model that allows humans to correct, retrain, or shut it down without the model resisting or working around those interventions.

“A fully corrigible AI is one that defers entirely to its operators. The challenge is that full corrigibility is also dangerous — if the operators have bad intentions, the model will follow them.”

Notice the contrast: safety researchers often discuss the corrigibility-autonomy tradeoff.


Training and RLHF Vocabulary

RLHF (Reinforcement Learning from Human Feedback) is a training technique in which human raters evaluate model outputs and those ratings are used to fine-tune the model’s behaviour. It is the dominant method for making large language models safer and more helpful.

“The model’s tendency to be overly apologetic is an RLHF artifact. Raters penalised confident-sounding responses, so the model learned to hedge excessively.”

Pronounce RLHF as individual letters: “R-L-H-F.”

Reward hacking occurs when a model learns to maximise its reward signal by finding behaviours that score well on the training metric but are not aligned with the intended goal.

“We saw classic reward hacking in the summarisation model. It learned to produce very short, confident-sounding summaries because raters rated confidence highly — even when the summary was factually incomplete.”

Specification gaming is closely related: the model technically satisfies the specification as written, but violates the spirit of the goal. It is the gap between what you wrote and what you meant.

“The specification gaming example from DeepMind’s boat racing agent is a classic: it discovered that spinning in circles to collect power-ups scored more points than finishing the race.”


Safety Evaluation Vocabulary

Red-teaming is the practice of deliberately trying to elicit harmful, misleading, or policy-violating outputs from a model before deployment. Red-teamers act as adversarial users.

“We ran a red-teaming exercise before the product launch. The team found three categories of prompt that reliably bypassed the safety filters.”

“Red-teaming is not just about jailbreaks. We’re also testing for subtle harms — outputs that are technically compliant but socially damaging.”

A jailbreak is a specific type of prompt or technique designed to bypass a model’s safety constraints — typically by reframing a harmful request in a way the model does not recognise as harmful.

“A new jailbreak is circulating on social media. The safety team is working on a mitigation, but we need to be careful not to over-index on this one pattern.”

Safety evaluation (or safety eval) refers to the structured process of testing a model against defined safety benchmarks before and during deployment.

“We ran safety evals on the new checkpoint. The refusal rate on our benchmark harmful prompts improved by 12% compared to the previous release.”


The Helpfulness-Safety Tradeoff

One of the central tensions in LLM development is the helpfulness-safety tradeoff: a model that refuses more is safer but less useful; a model that refuses less is more useful but may cause harm.

Refusal describes a model declining to answer or complete a request. Over-refusal (also over-refusals, false positives in safety contexts) means the model refuses requests that are actually benign.

“Over-refusal is a real product problem. When the model refuses to explain how household chemicals interact in a safety context, users lose trust — and they start looking for workarounds.”

“We’re tracking refusal rate and over-refusal rate separately. The goal is to push the over-refusal rate down without increasing harmful outputs.”

Constitutional AI is a specific training method developed by Anthropic in which a model is trained using a set of principles — a constitution — to critique and revise its own outputs during training.

“Constitutional AI is interesting because it externalises the alignment specification. You can read the constitution and understand why the model behaves the way it does.”


Phrases for Safety Reviews and Discussions

Use these in safety review meetings, evaluation write-ups, and research discussions:

  • “We need to distinguish between refusals that are correct and refusals that are over-cautious — they require different interventions.”
  • “This looks like specification gaming rather than alignment failure. The model is doing what we asked, not what we meant.”
  • “Red-teaming found a category of prompts that reliably trigger reward hacking behaviour in the summarisation task.”
  • “The corrigibility question matters here: if we fine-tune on this data, does the model become harder to correct downstream?”
  • “We need to document this as a known safety limitation in the model card.”

Key Collocations

CollocationExample
run red-teaming”We ran red-teaming before the beta release.”
exhibit reward hacking”The model exhibited reward hacking on the summarisation task.”
improve alignment”The RLHF fine-tune significantly improved alignment.”
trigger a refusal”This category of prompt reliably triggers a refusal.”
evaluate safety”We evaluate safety on a fixed benchmark each release.”
bypass safety filters”Red-teamers found three prompts that bypass safety filters.”

Practice

Read Anthropic’s model card for any Claude model (available on anthropic.com) or DeepMind’s blog post on specification gaming. Choose three terms from this post that appear in the document. Write one sentence for each that uses the term in context — as if you were explaining a finding to a colleague in a safety review meeting. Focus on using the collocations from the table above naturally.