RLHF Vocabulary Guide: Human Feedback, Reward Models, and Annotation Language

Master the English vocabulary used in RLHF pipelines — preference pairs, reward models, annotation guidelines, and inter-annotator agreement for AI engineers.

Working in RLHF Requires Precise English

Reinforcement Learning from Human Feedback (RLHF) has become a standard technique for aligning large language models. Engineers, researchers, and annotation quality specialists working on RLHF pipelines communicate in a specialised vocabulary that sits at the intersection of machine learning, data labelling, and experimental design.

If you work in this space and English is not your first language, this guide gives you the terminology and the context to use it confidently.


Core RLHF Pipeline Vocabulary

Term | Definition
Preference pair | A pair of model outputs shown to an annotator, who selects the preferred one
Comparison data | The dataset of preference pairs collected from annotators
Reward model | A neural network trained to predict human preferences, producing a scalar reward signal
Reward signal | The numerical value output by a reward model, used to guide policy training
Policy | The language model being fine-tuned via reinforcement learning
Reference model | A frozen copy of the model taken before reinforcement learning (typically the supervised fine-tuned model), used as a baseline to constrain policy updates
KL divergence | A measure of how far the policy’s output distribution has drifted from that of the reference model
Calibration | The process of adjusting a model’s confidence scores so that they match observed accuracy rates

The preference pair is the atomic unit of RLHF data. An annotator sees two completions for the same prompt and picks the better one. The quality of your reward model is directly constrained by the quality of the preference annotations — which is why annotation guidelines and quality control matter so much.
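To make these terms concrete, here is a minimal sketch of how a preference pair might be represented and how a reward model is commonly scored against it. It assumes a Bradley-Terry style pairwise loss, which is a common choice but not the only one; the names `PreferencePair` and `pairwise_loss` are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass
class PreferencePair:
    """One record of comparison data (hypothetical structure)."""
    prompt: str
    chosen: str    # completion the annotator preferred
    rejected: str  # completion the annotator did not prefer


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(reward_chosen - reward_rejected)).

    The loss is small when the reward model scores the chosen completion
    well above the rejected one, and large when it ranks them the wrong way.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch to avoid overflow.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))
```

In conversation, you might describe this as: “The reward model is trained to assign a higher scalar reward to the chosen completion than to the rejected one.”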


Annotation Pipeline Vocabulary

Term | Definition
Annotation guideline | A document instructing annotators on how to label data for a specific task
Task instruction | The specific prompt given to an annotator for a single annotation job
Label schema | The set of possible labels or ratings an annotator can assign
Rubric | A structured scoring framework with criteria and examples for each score level
Edge case | A scenario that is difficult to label because it falls outside the guideline’s main cases
Annotator bias | Systematic differences in how a particular annotator labels data versus others
Gold standard | A set of examples with known correct labels, used to calibrate annotators
Calibration set | A sample of annotations reviewed together to align annotator understanding

In annotation guidelines, the word “should” is ambiguous: does it mean “must” or “is preferred”? Use “must” for requirements and “prefer” or “favour” for best practices. Making this distinction explicit removes a common source of inconsistent labels.


Annotation Quality Vocabulary

Term | Definition
Inter-annotator agreement (IAA) | The degree to which independent annotators produce the same labels
Cohen’s kappa (κ) | A statistical measure of IAA between two annotators that accounts for chance agreement; ranges from -1 to 1
Fleiss’ kappa | A chance-corrected agreement measure for more than two annotators, often described as an extension of Cohen’s kappa
Intraclass correlation (ICC) | A measure of agreement for continuous ratings
Adjudication | The process of resolving disagreements between annotators, often by a senior reviewer
Consensus labelling | Assigning each item the label chosen by a majority vote among multiple annotators
Annotation throughput | The number of items labelled per annotator per unit of time
Label noise | Incorrect or inconsistent labels in a training dataset
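Consensus labelling and adjudication often work together. The sketch below is one possible way to combine them, under the assumption that an item needs a strict majority to receive a consensus label and that everything else is passed to a reviewer. The function name `consensus_label` and the strict-majority rule are illustrative choices, not a standard.

```python
from collections import Counter
from typing import Optional, Sequence


def consensus_label(labels: Sequence[str]) -> Optional[str]:
    """Return the strict-majority label, or None if the item needs adjudication."""
    if not labels:
        return None
    top_label, top_count = Counter(labels).most_common(1)[0]
    # Require more than half of the votes; ties and split votes go to a reviewer.
    if top_count > len(labels) / 2:
        return top_label
    return None
```

With three annotators voting “A”, “A”, “B”, the consensus label is “A”. With two annotators voting “A” and “B”, the function returns `None`, and you would say: “This item goes to adjudication.”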

On the widely used Landis and Koch scale, a Cohen’s kappa of 0.61–0.80 indicates substantial agreement and a value above 0.80 indicates almost perfect agreement. When discussing IAA scores with colleagues, contextualise the number: “Our kappa is 0.71, which is substantial, but we see a notable drop on adversarial examples, so those need revised guidance.”
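It helps to see how kappa differs from raw percent agreement. The following is a minimal sketch of Cohen’s kappa for two annotators, written with the standard library only; the function name `cohens_kappa` and the sample labels are made up for illustration.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list.")
    n = len(labels_a)
    # Observed agreement: share of items where the two labels match.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of a match if each annotator labelled
    # at random according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_chance = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_chance == 1.0:
        raise ValueError("Kappa is undefined when both annotators use one identical label.")
    return (p_observed - p_chance) / (1 - p_chance)


# Made-up labels for ten items (illustrative data only).
annotator_a = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
annotator_b = ["A", "A", "B", "A", "A", "B", "B", "A", "B", "A"]
kappa = cohens_kappa(annotator_a, annotator_b)
```

In this made-up sample the annotators agree on 8 of 10 items (80% raw agreement), but chance agreement is 0.52, so kappa is about 0.58. This is why it is more accurate to say “agreement beyond chance” than “percent agreement” when you report kappa.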


Reward Model Training Vocabulary

Term | Definition
Reward hacking | When a policy learns to exploit weaknesses in the reward model rather than align with true intent
Goodhart’s Law | The principle that “when a measure becomes a target, it ceases to be a good measure”; often cited to explain reward hacking
Overoptimisation | Excessive optimisation against the reward model, causing the policy’s real quality to degrade
Regularisation | Techniques (such as a KL penalty) that constrain the policy to prevent overoptimisation
Human preference distribution | The distribution of true human preferences the reward model is trying to approximate
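The relationship between the reward signal, KL divergence, and regularisation can be sketched in a few lines. This example assumes a common formulation in which the reward is reduced by a coefficient times a sample-based KL estimate; the function names and the default coefficient `beta=0.1` are hypothetical, and real training libraries differ in the details.

```python
from typing import Sequence


def kl_estimate(policy_logprobs: Sequence[float],
                reference_logprobs: Sequence[float]) -> float:
    """Single-sample KL estimate for one completion sampled from the policy.

    Sums log pi_policy(token) - log pi_reference(token) over the tokens.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("Log-probability lists must cover the same tokens.")
    return sum(p - r for p, r in zip(policy_logprobs, reference_logprobs))


def penalised_reward(reward: float,
                     policy_logprobs: Sequence[float],
                     reference_logprobs: Sequence[float],
                     beta: float = 0.1) -> float:
    """Reward model score minus a KL penalty (beta is an assumed coefficient)."""
    return reward - beta * kl_estimate(policy_logprobs, reference_logprobs)
```

A larger `beta` keeps the policy closer to the reference model. A typical way to say this in a meeting is: “We raised the KL coefficient to reduce overoptimisation.”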

Example Sentences

  1. “The reward model is overfitting to surface features of the preference pairs — verbose responses are getting high scores regardless of factual accuracy.”
  2. “Our inter-annotator agreement dropped from 0.74 to 0.61 after we introduced the new rubric; I suspect the helpfulness dimension is ambiguous and needs worked examples.”
  3. “Before we run the next calibration session, let’s review the edge cases where annotators most frequently disagree and update the guidelines accordingly.”
  4. “The KL divergence between the policy and the reference model has been increasing over training, so we may need to increase the KL penalty coefficient.”
  5. “Adjudication of the most contested preference pairs should go to the domain expert reviewer, not the general pool, to preserve label quality.”

Common Register Notes

When presenting RLHF work to a mixed audience of ML engineers and product stakeholders, avoid assuming familiarity with statistical terms. Kappa is not a raw percentage of agreement, so do not paraphrase it as one. Replace “our kappa is 0.7” with “after correcting for the agreement we would expect by chance, our annotators reach about 70% of the maximum possible agreement, which is considered substantial for this type of task.”

The word “alignment” is used both in the technical sense (aligning model outputs to human preferences) and in the broader AI safety sense (ensuring AI systems behave safely). Clarify which sense you mean when the context could be ambiguous.