RLHF and Annotation Quality: English for Human Feedback Pipelines
Learn the English vocabulary for RLHF pipelines — inter-annotator agreement, kappa scores, calibration sessions, preference pairs, and quality control for human feedback.
RLHF: Where Machine Learning Meets Human Judgement
Reinforcement Learning from Human Feedback (RLHF) is a training technique widely used to align large language models with human preferences. It underpins much of the conversational quality of modern AI assistants. Behind every model aligned this way is a large annotation operation: teams of human annotators providing the preference signals the model learns from. If you work in AI engineering, data science, or operations, you will encounter a specific vocabulary for discussing annotation quality. This guide covers the essential terms.
What Is RLHF?
In RLHF, human annotators compare pairs of model outputs and indicate which one is better according to defined criteria. These preference pairs feed into a reward model, which is then used to fine-tune the language model via reinforcement learning.
Preference pair — two model-generated responses to the same prompt, labelled by an annotator to indicate which is preferred. Some pipelines ask annotators to rank more than two responses at once, and each ranking is then broken down into pairs. “Each annotator reviews 30 preference pairs per hour at the target quality level.”
Reward model — a model trained on the annotated preference pairs to predict which outputs humans would prefer. “The reward model is the bridge between human labels and the reinforcement learning signal.”
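To make the link between preference pairs and the reward model concrete, here is a minimal sketch in Python. It assumes a Bradley-Terry style pairwise loss, which is a common choice but is not named in this guide. The class `PreferencePair`, its field names, and both function names are illustrative only.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One annotated comparison (field names are illustrative)."""
    prompt: str
    chosen: str    # response the annotator preferred
    rejected: str  # response the annotator did not prefer


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry probability that the chosen response is preferred,
    given the scalar reward the model assigns to each response."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the annotator's label under the model.
    The loss is small when the model scores the chosen response higher."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))
```

When the reward model scores both responses equally, the probability is 0.5 and the loss is log 2. The loss shrinks as the model learns to score the annotator's preferred response higher, which is why noisy or inconsistent labels translate directly into a noisy training signal.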
Annotation Quality Vocabulary
Inter-Annotator Agreement (IAA)
Inter-annotator agreement measures the degree to which different annotators give the same label to the same item. High IAA indicates that the annotation guidelines are clear and the task is well-defined. Low IAA suggests ambiguity in the task, poorly trained annotators, or genuinely subjective judgement areas.
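IAA is reported as a numeric metric. The most common for categorical labels from two annotators is Cohen’s kappa (κ), which corrects raw agreement for the agreement expected by chance. With more than two annotators, teams typically use Fleiss’ kappa or Krippendorff’s alpha instead. A common rule of thumb for interpreting κ: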
- κ > 0.80 — almost perfect agreement
- κ 0.61–0.80 — substantial agreement
- κ 0.41–0.60 — moderate agreement
- κ 0.40 or below — fair agreement at best, usually indicating a problem
“Our IAA on safety classifications dropped to κ = 0.52 after adding three new annotators, which triggered an immediate calibration session.”
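The calculation behind κ can be sketched in a few lines of Python. This is a minimal illustration for two annotators, not a production implementation. The function name `cohens_kappa` and the sample labels are made up for the example, and returning 1.0 in the degenerate case is a convention chosen here.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item set.")
    n = len(labels_a)
    # Observed agreement: share of items where the two labels match.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        # Kappa is undefined when both annotators used one identical label
        # throughout; returning 1.0 is a convention assumed for this sketch.
        return 1.0
    return (p_o - p_e) / (1.0 - p_e)


# Hypothetical labels for ten items: the annotators agree on eight of them.
annotator_1 = ["safe"] * 6 + ["unsafe"] * 4
annotator_2 = ["safe"] * 5 + ["unsafe"] * 4 + ["safe"]
kappa = cohens_kappa(annotator_1, annotator_2)  # about 0.58
```

Raw agreement here is 80%, but κ is only about 0.58 because two annotators with these label frequencies would agree 52% of the time by chance. In practice, scikit-learn's `cohen_kappa_score` computes the same statistic.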
Calibration
Calibration is the process of aligning annotators’ understanding of the task and guidelines. Calibration sessions involve annotators labelling the same set of examples, then discussing disagreements to reach a shared interpretation.
“We run a calibration session at the start of every new task and after any update to the annotation guidelines.”
Gold set — a set of examples with known, verified correct labels, used to measure annotator accuracy. “Each annotator’s daily work includes 10% gold set items to allow continuous quality monitoring.”
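Gold set monitoring can be sketched as two small steps: mixing gold items into an annotator's queue, then scoring the annotator on those items only. The Python below is a minimal sketch under assumptions: items are plain identifiers, labels are stored in dictionaries, and all function names are hypothetical. The default threshold of 0.85 mirrors the 85% figure used in the example phrases later in this guide.

```python
import random


def interleave_gold(work_items, gold_items, seed=None):
    """Insert gold items at random positions in the work queue,
    keeping the original order of the ordinary work items."""
    rng = random.Random(seed)
    queue = list(work_items)
    for gold in gold_items:
        # randint is inclusive, so a gold item may also land at the very end.
        queue.insert(rng.randint(0, len(queue)), gold)
    return queue


def gold_accuracy(annotations, gold_labels):
    """Share of gold items the annotator labelled correctly.

    annotations: dict mapping item id -> label given by the annotator
    gold_labels: dict mapping item id -> verified correct label
    """
    scored = [item for item in annotations if item in gold_labels]
    if not scored:
        raise ValueError("No gold items found in this annotator's work.")
    correct = sum(annotations[item] == gold_labels[item] for item in scored)
    return correct / len(scored)


def below_threshold(accuracy_by_annotator, threshold=0.85):
    """Names of annotators whose gold accuracy is under the threshold."""
    return sorted(
        name for name, acc in accuracy_by_annotator.items() if acc < threshold
    )
```

For example, `below_threshold({"ann1": 0.90, "ann2": 0.80})` returns `["ann2"]`, which is the kind of signal that would prompt a refresher or a calibration session.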
Annotation Fatigue and Bias
Annotator fatigue — the degradation in annotation quality that occurs when annotators work for extended periods without breaks. It manifests as increased error rates and decreased IAA.
Position bias — the tendency for annotators to prefer the first option in a preference pair regardless of quality. “We randomise the order of responses in each preference pair to reduce position bias.”
Instruction-following bias — the tendency to prefer responses that appear to follow instructions closely, even when they contain factual errors.
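Position bias is the easiest of these to address in code. The sketch below shows one way to randomise display order and to measure how often the first-displayed response wins. It is an illustration under assumptions: the function names are hypothetical, and choices are recorded as 0 (first shown) or 1 (second shown).

```python
import random


def randomise_order(response_a, response_b, rng=random):
    """Return the two responses in random display order, plus a flag
    recording whether they were swapped so labels can be mapped back."""
    swapped = rng.random() < 0.5
    shown = (response_b, response_a) if swapped else (response_a, response_b)
    return shown, swapped


def first_position_rate(chosen_positions):
    """Fraction of judgements in which the first-displayed response won.

    chosen_positions: iterable of 0 (first shown) or 1 (second shown).
    """
    positions = list(chosen_positions)
    if not positions:
        raise ValueError("Need at least one judgement.")
    return positions.count(0) / len(positions)
```

With randomised order and no bias, the first-position rate should sit close to 0.5 on a large sample. A rate well above that suggests annotators are reacting to position rather than quality. Storing the `swapped` flag matters because the preference label must be mapped back to the original responses before training.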
Discussing Quality Issues with Annotation Teams
Clear, respectful communication about quality problems is essential in annotation operations. Here are useful phrases:
Identifying a problem:
- “Our IAA data for this task shows a systematic disagreement on [edge case type].”
- “The gold set accuracy for this annotator cohort has dropped below our threshold of 85%.”
- “We’re seeing inconsistent application of the harmlessness guideline in the safety dimension.”
Proposing a fix:
- “I’d recommend a targeted calibration session focusing specifically on [ambiguous category].”
- “We should revise the guideline to include three additional worked examples for this edge case.”
- “Let’s review the hardest 20 items as a group before the next annotation batch begins.”
Tracking improvement:
- “Following the calibration update, IAA on this dimension improved from κ = 0.54 to κ = 0.71.”
- “Gold set accuracy is back above threshold for all annotators after the refresher training.”
Five Example Sentences
- “The inter-annotator agreement on helpfulness ratings is strong at κ = 0.78, but agreement on factual accuracy has been inconsistent, indicating the guideline needs clarification.”
- “We randomly interleave gold set items throughout each annotator’s queue so they cannot identify which items are being used for quality monitoring.”
- “After the calibration session, the team reached consensus on how to handle preference pairs where both responses contain minor factual errors.”
- “Position bias was confirmed in our data: annotators chose the first response 62% of the time across a balanced sample, well above the expected 50%.”
- “The reward model’s performance on out-of-distribution prompts correlated strongly with the annotation quality of the preference pair dataset used for training.”
Practical Note on Writing Guidelines
Annotation guidelines are technical documents written in English that annotators from diverse backgrounds must understand and apply consistently. Clear, concrete guidelines with many worked examples tend to produce higher IAA than abstract descriptions of quality dimensions. When writing guidelines in English, use active voice, short sentences, and concrete examples. Avoid qualifiers like “generally” or “usually” unless you also explain the exceptions.