RLHF Vocabulary Guide: Human Feedback, Reward Models, and Annotation Language
Master the English vocabulary used in RLHF pipelines — preference pairs, reward models, annotation guidelines, and inter-annotator agreement for AI engineers.
Working in RLHF Requires Precise English
Reinforcement Learning from Human Feedback (RLHF) has become a standard technique for aligning large language models. Engineers, researchers, and annotation quality specialists working on RLHF pipelines communicate in a specialised vocabulary that sits at the intersection of machine learning, data labelling, and experimental design.
If you work in this space and English is not your first language, this guide gives you the terminology and the context to use it confidently.
Core RLHF Pipeline Vocabulary
| Term | Definition |
|---|---|
| Preference pair | A pair of model outputs shown to an annotator, who selects the preferred one |
| Comparison data | The dataset of preference pairs collected from annotators |
| Reward model | A neural network trained to predict human preferences, producing a scalar reward signal |
| Reward signal | The numerical value output by a reward model, used to guide policy training |
| Policy | The language model being fine-tuned via reinforcement learning |
| Reference model | A frozen copy of the model from before RL fine-tuning (typically the supervised fine-tuned checkpoint), used as a baseline to constrain policy updates |
| KL divergence | A measure of how far the policy has drifted from the reference model |
| Calibration | The process of aligning a model’s confidence scores to actual accuracy rates |
The preference pair is the atomic unit of RLHF data. An annotator sees two completions for the same prompt and picks the better one. The quality of your reward model is directly constrained by the quality of the preference annotations — which is why annotation guidelines and quality control matter so much.
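As a minimal sketch of how a single preference pair can become a training signal, the snippet below assumes the commonly used Bradley-Terry pairwise loss; this guide does not prescribe a particular loss, and the function name `pairwise_loss` and the example scores are hypothetical.

```python
import math

def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry style loss: -log(sigmoid(reward_chosen - reward_rejected))."""
    margin = reward_chosen - reward_rejected
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss is small when the reward model scores the preferred output higher,
# and large when it ranks the pair the wrong way round.
agrees_with_annotator = pairwise_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_annotator = pairwise_loss(reward_chosen=0.0, reward_rejected=2.0)
```

When describing this to colleagues, a typical phrasing is: "the loss penalises the reward model whenever it scores the rejected completion above the chosen one."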
Annotation Pipeline Vocabulary
| Term | Definition |
|---|---|
| Annotation guideline | A document instructing annotators on how to label data for a specific task |
| Task instruction | The specific prompt given to an annotator for a single annotation job |
| Label schema | The set of possible labels or ratings an annotator can assign |
| Rubric | A structured scoring framework with criteria and examples for each score level |
| Edge case | A scenario that is difficult to label because it falls outside the guideline’s main cases |
| Annotator bias | Systematic differences in how a particular annotator labels data versus others |
| Gold standard | A set of examples with known correct labels, used to calibrate annotators |
| Calibration set | A sample of annotations reviewed together to align annotator understanding |
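
To illustrate how a gold standard is typically used, here is a small sketch that scores one annotator against gold labels. The function name `gold_accuracy`, the item IDs, and the labels are all hypothetical; real pipelines may weight items or track accuracy per label.

```python
def gold_accuracy(annotator_labels: dict[str, str], gold_labels: dict[str, str]) -> float:
    """Fraction of gold items, among those the annotator labelled, that match the gold label."""
    shared = [item for item in gold_labels if item in annotator_labels]
    if not shared:
        raise ValueError("annotator has not labelled any gold items")
    correct = sum(annotator_labels[item] == gold_labels[item] for item in shared)
    return correct / len(shared)

gold = {"p1": "A", "p2": "B", "p3": "A", "p4": "B"}
annotator = {"p1": "A", "p2": "B", "p3": "B", "p5": "A"}  # p5 is not a gold item
accuracy = gold_accuracy(annotator, gold)  # 2 of the 3 shared items match
```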
When writing annotation guidelines, the word “should” is ambiguous: does it mean “must” or “is preferred”? Use “must” for requirements and “prefer” or “favour” for best practices. Making the distinction explicit removes a common source of inconsistent labels.
Annotation Quality Vocabulary
| Term | Definition |
|---|---|
| Inter-annotator agreement (IAA) | The degree to which independent annotators produce the same labels |
| Cohen’s kappa (κ) | A statistical measure of IAA between two annotators that corrects for chance agreement; ranges from -1 to 1 |
| Fleiss’ kappa | An extension of Cohen’s kappa for more than two annotators |
| Intraclass correlation (ICC) | A measure of agreement for continuous ratings |
| Adjudication | The process of resolving disagreements between annotators, often by a senior reviewer |
| Consensus labelling | Determining the final label for an item by majority vote among multiple annotators |
| Annotation throughput | The number of items labelled per annotator per unit of time |
| Label noise | Incorrect or inconsistent labels in a training dataset |
On the widely used Landis and Koch scale, a Cohen’s kappa of 0.61–0.80 is considered substantial agreement and a value above 0.80 almost perfect. When discussing IAA scores with colleagues, contextualise the number: “Our kappa is 0.71, which is substantial, but we see a notable drop on adversarial examples; those need revised guidance.”
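The following is a minimal sketch of how Cohen’s kappa can be computed for two annotators, using the standard definition (observed agreement corrected by the agreement expected from each annotator’s label frequencies). The function name `cohens_kappa` and the example labels are hypothetical.

```python
from collections import Counter

def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability of a match if each annotator labelled
    # at random according to their own label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when both annotators use one identical label")
    return (observed - expected) / (1 - expected)

annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "B", "A", "A"]
annotator_2 = ["A", "A", "B", "B", "A", "B", "B", "B", "A", "B"]
kappa = cohens_kappa(annotator_1, annotator_2)  # raw agreement is 0.8, kappa is roughly 0.62
```

Note that the raw agreement (80%) and the kappa (about 0.62) differ; the gap is the chance correction.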
Reward Model Training Vocabulary
| Term | Definition |
|---|---|
| Reward hacking | When a policy learns to exploit weaknesses in the reward model rather than aligning with the intended behaviour |
| Goodhart’s Law | “When a measure becomes a target, it ceases to be a good measure”; the principle underlying reward hacking |
| Overoptimisation | Optimising so hard against the reward model that reward scores keep rising while true output quality degrades |
| Regularisation | Techniques (such as KL penalty) that constrain the policy to prevent overoptimisation |
| Human preference distribution | The distribution of true human preferences the reward model is trying to approximate |
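
To make "KL penalty" concrete, here is a hedged sketch. It assumes the common formulation in which the reward is reduced by a coefficient (called `beta` here) times a per-sample estimate of the KL divergence, taken as the difference in log-probability between the policy and the reference model. The function names, the default `beta`, and the example numbers are hypothetical.

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) for two discrete distributions; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def penalised_reward(reward: float, logprob_policy: float,
                     logprob_reference: float, beta: float = 0.1) -> float:
    """Reward model score minus beta times a per-sample KL estimate."""
    kl_estimate = logprob_policy - logprob_reference
    return reward - beta * kl_estimate

# A policy that has drifted from the reference distribution has positive KL.
drift = kl_divergence([0.9, 0.1], [0.5, 0.5])

# The policy assigns this output a higher log-probability than the reference does,
# so part of the reward is taken away: 1.0 - 0.1 * 1.0.
shaped = penalised_reward(reward=1.0, logprob_policy=-2.0, logprob_reference=-3.0)
```

A larger `beta` keeps the policy closer to the reference model, at the cost of slower gains in reward.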
Example Sentences
- “The reward model is overfitting to surface features of the preference pairs — verbose responses are getting high scores regardless of factual accuracy.”
- “Our inter-annotator agreement dropped from 0.74 to 0.61 after we introduced the new rubric; I suspect the helpfulness dimension is ambiguous and needs worked examples.”
- “Before we run the next calibration session, let’s review the edge cases where annotators most frequently disagree and update the guidelines accordingly.”
- “The KL divergence between the policy and the reference model has been increasing over training; we may need to increase the KL penalty coefficient.”
- “Adjudication of the most contested preference pairs should go to the domain expert reviewer, not the general pool, to preserve label quality.”
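
The consensus and adjudication vocabulary used in these sentences can be sketched in code. This example assumes a simple rule: take the majority label when it wins a strict majority of votes, and otherwise flag the item for adjudication. The function name `consensus_label` and the `min_share` threshold are hypothetical.

```python
from collections import Counter
from typing import Optional

def consensus_label(votes: list[str], min_share: float = 0.5) -> Optional[str]:
    """Return the majority label, or None if the item should go to adjudication."""
    if not votes:
        raise ValueError("at least one vote is required")
    label, count = Counter(votes).most_common(1)[0]
    # Require the top label's share of votes to strictly exceed min_share.
    return label if count / len(votes) > min_share else None

clear_item = consensus_label(["A", "A", "B"])      # "A" wins two of three votes
contested_item = consensus_label(["A", "B", "C"])  # None: no majority, send to adjudication
```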
Common Register Notes
When presenting RLHF work to a mixed audience of ML engineers and product stakeholders, avoid assuming familiarity with statistical terms. Kappa is not a raw percentage of agreement, so replace “our kappa is 0.7” with “after correcting for the agreement we would expect by chance, our annotators close about 70% of the gap between chance-level and perfect agreement, which is considered substantial for this type of task.”
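As a worked illustration of what a kappa of 0.7 can correspond to, assume the standard definition of Cohen’s kappa with observed agreement $p_o$ and chance agreement $p_e$. The numbers below are hypothetical.

$$
\kappa = \frac{p_o - p_e}{1 - p_e}, \qquad p_o = 0.85,\; p_e = 0.50 \;\Rightarrow\; \kappa = \frac{0.85 - 0.50}{1 - 0.50} = 0.70
$$

In this example the annotators give the same label on 85% of items, while kappa reports how much of the distance between chance-level and perfect agreement they cover.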
The word “alignment” is used both in the technical sense (aligning model outputs to human preferences) and in the broader AI safety sense (ensuring AI systems behave safely). Clarify which sense you mean when the context could be ambiguous.