What English level do I need to read "RLHF Vocabulary Guide: Human Feedback, Reward Models, and Annotation Language"?

This article is tagged Advanced. If you find the vocabulary difficult, start with a related vocabulary exercise first, then come back; technical reading gets much easier once the core terms feel familiar.

How long does this article take to read?

About 8 min. Most CoderSlingo articles are written to be read in one sitting, without needing a dictionary open in another tab.

RLHF Vocabulary Guide: Human Feedback, Reward Models, and Annotation Language

Working in RLHF Requires Precise English

Reinforcement Learning from Human Feedback (RLHF) has become a standard technique for aligning large language models. Engineers, researchers, and annotation quality specialists working on RLHF pipelines communicate in a specialised vocabulary that sits at the intersection of machine learning, data labelling, and experimental design.

If you work in this space and English is not your first language, this guide gives you the terminology and the context to use it confidently.

Core RLHF Pipeline Vocabulary

Term	Definition
Preference pair	A pair of model outputs shown to an annotator, who selects the preferred one
Comparison data	The dataset of preference pairs collected from annotators
Reward model	A neural network trained to predict human preferences, producing a scalar reward signal
Reward signal	The numerical value output by a reward model, used to guide policy training
Policy	The language model being fine-tuned via reinforcement learning
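Reference model	A frozen copy of the model taken before reinforcement learning begins (typically the supervised fine-tuned model), used as a baseline to constrain policy updates
KL divergence	A measure of how much one probability distribution differs from another; in RLHF, it quantifies how far the policy has drifted from the reference model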
Calibration	The process of aligning a model’s confidence scores to actual accuracy rates

The preference pair is the atomic unit of RLHF data. An annotator sees two completions for the same prompt and picks the better one. The quality of your reward model is directly constrained by the quality of the preference annotations — which is why annotation guidelines and quality control matter so much.
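To make the link between preference pairs and the reward signal concrete, here is a minimal sketch of a pairwise training loss. It assumes the Bradley-Terry formulation, which is a common choice for reward models but is not something this guide specifies; the function names and the example scores are hypothetical.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Probability that the chosen output is preferred, under the assumed
    Bradley-Terry model: sigmoid(reward_chosen - reward_rejected)."""
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the annotator's choice for one preference pair.

    Equivalent to -log(sigmoid(margin)), written in a numerically stable form.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scalar rewards for two completions of the same prompt.
loss_when_model_agrees = pairwise_loss(reward_chosen=2.0, reward_rejected=0.0)
loss_when_model_disagrees = pairwise_loss(reward_chosen=0.0, reward_rejected=2.0)
```

The loss is small when the reward model scores the annotator's preferred output higher and large when it scores the rejected output higher, so noisy preference labels push the reward model in the wrong direction.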

Annotation Pipeline Vocabulary

Term	Definition
Annotation guideline	A document instructing annotators on how to label data for a specific task
Task instruction	The specific prompt given to an annotator for a single annotation job
Label schema	The set of possible labels or ratings an annotator can assign
Rubric	A structured scoring framework with criteria and examples for each score level
Edge case	A scenario that is difficult to label because it falls outside the guideline’s main cases
Annotator bias	Systematic differences in how a particular annotator labels data versus others
Gold standard	A set of examples with known correct labels, used to calibrate annotators
Calibration set	A sample of annotations reviewed together to align annotator understanding

When writing annotation guidelines, the word “should” is ambiguous: does it mean “must” or “is preferred”? Use “must” for requirements and “prefer” or “favour” for best practices. Making this distinction explicit helps reduce annotation errors.
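The gold standard entry in the table above can be illustrated with a short sketch that scores each annotator against items with known correct labels. The data layout, the annotator names, and the function name are assumptions made for illustration only.

```python
def gold_accuracy(annotations: dict, gold: dict) -> dict:
    """Return each annotator's accuracy on the gold standard items.

    annotations maps annotator -> {item_id: label}; gold maps item_id -> label.
    Items that are not in the gold set are ignored.
    """
    scores = {}
    for annotator, labels in annotations.items():
        scored_items = [item for item in labels if item in gold]
        if not scored_items:
            continue  # this annotator saw no gold items, so no score is reported
        correct = sum(labels[item] == gold[item] for item in scored_items)
        scores[annotator] = correct / len(scored_items)
    return scores


# Hypothetical gold labels and two hypothetical annotators.
gold = {"q1": "A", "q2": "B", "q3": "A", "q4": "B"}
annotations = {
    "annotator_1": {"q1": "A", "q2": "B", "q3": "A", "q4": "B"},
    "annotator_2": {"q1": "A", "q2": "A", "q3": "A", "q4": "A"},
}
scores = gold_accuracy(annotations, gold)
```

A low score against the gold set is a prompt to revisit the guideline with that annotator, not proof of carelessness; the gold labels themselves may rest on an ambiguous instruction.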

Annotation Quality Vocabulary

Term	Definition
Inter-annotator agreement (IAA)	The degree to which independent annotators produce the same labels
Cohen’s kappa (κ)	A statistical measure of IAA between two annotators that corrects for chance agreement; ranges from -1 to 1
Fleiss’ kappa	An extension of Cohen’s kappa for more than two annotators
Intraclass correlation (ICC)	A measure of agreement for continuous ratings
Adjudication	The process of resolving disagreements between annotators, often by a senior reviewer
Consensus labelling	A label determined by majority vote among multiple annotators
Annotation throughput	The number of items labelled per annotator per unit of time
Label noise	Incorrect or inconsistent labels in a training dataset

On the widely used Landis and Koch scale, a Cohen’s kappa of 0.61–0.80 is considered substantial agreement, and 0.81 or above is almost perfect. When discussing IAA scores with colleagues, contextualise the number: “Our kappa is 0.71, which is substantial, but we see a notable drop on adversarial examples, and those need revised guidance.”
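For readers who want to see where a kappa value comes from, the following is a minimal sketch of Cohen's kappa for two annotators, computed from observed agreement and the agreement expected by chance. The label lists are invented for illustration, and the function name is an assumption.

```python
from collections import Counter


def cohens_kappa(labels_a: list, labels_b: list) -> float:
    """Cohen's kappa for two annotators who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where the two labels match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label independently,
    # given how often each annotator uses each label.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # degenerate case: both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical preference labels ("A" or "B") from two annotators on ten pairs.
annotator_1 = ["A", "A", "A", "A", "A", "B", "B", "B", "B", "B"]
annotator_2 = ["A", "A", "A", "A", "B", "B", "B", "B", "B", "A"]
kappa = cohens_kappa(annotator_1, annotator_2)
```

In this invented example the annotators match on 8 of 10 items, chance agreement is 0.5, and kappa comes out at roughly 0.6, noticeably lower than the raw 80% agreement.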

Reward Model Training Vocabulary

Term	Definition
Reward hacking	When a policy learns to exploit weaknesses in the reward model rather than align with true intent
Goodhart’s Law	“When a measure becomes a target, it ceases to be a good measure.” This principle describes reward hacking
Overoptimisation	Excessive optimisation against the reward model, causing the policy’s actual quality to degrade
Regularisation	Techniques (such as KL penalty) that constrain the policy to prevent overoptimisation
Human preference distribution	The distribution of true human preferences the reward model is trying to approximate
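The regularisation entry above can be sketched as a reward with a KL penalty subtracted from it. This is a toy illustration over a single next-token distribution, not a description of any particular training library; the coefficient name beta, its value, and the probabilities are all assumptions.

```python
import math


def kl_divergence(policy_probs: list, reference_probs: list) -> float:
    """KL(policy || reference) for two discrete distributions over the same tokens.

    Assumes every reference probability is non-zero wherever the policy
    probability is non-zero.
    """
    return sum(
        p * math.log(p / q)
        for p, q in zip(policy_probs, reference_probs)
        if p > 0
    )


def penalised_reward(reward: float, policy_probs: list,
                     reference_probs: list, beta: float = 0.1) -> float:
    """Reward model score minus a KL penalty scaled by the coefficient beta."""
    return reward - beta * kl_divergence(policy_probs, reference_probs)


# Hypothetical distributions over two candidate tokens.
reference = [0.5, 0.5]
drifted_policy = [0.9, 0.1]

no_drift = penalised_reward(1.0, reference, reference)
with_drift = penalised_reward(1.0, drifted_policy, reference)
```

With the same reward model score, the drifted policy receives a lower penalised reward than the policy that matches the reference model. Raising beta makes drift more costly, which is what “strengthening the regularisation” means in practice.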

Example Sentences

“The reward model is overfitting to surface features of the preference pairs — verbose responses are getting high scores regardless of factual accuracy.”
“Our inter-annotator agreement dropped from 0.74 to 0.61 after we introduced the new rubric; I suspect the helpfulness dimension is ambiguous and needs worked examples.”
“Before we run the next calibration session, let’s review the edge cases where annotators most frequently disagree and update the guidelines accordingly.”
“The KL divergence between the policy and the reference model has been increasing over training; we may need to increase the regularisation coefficient.”
“Adjudication of the most contested preference pairs should go to the domain expert reviewer, not the general pool, to preserve label quality.”

Common Register Notes

When presenting RLHF work to a mixed audience of ML engineers and product stakeholders, avoid assuming familiarity with statistical terms. Replace “our kappa is 0.7” with “after correcting for the agreement we would expect by chance, our annotators reach about 70% of the maximum possible agreement, which is considered substantial for this type of task.” Avoid saying that annotators “agree 70% of the time”: kappa is not a raw agreement rate.
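A short worked calculation shows why a kappa value should not be read as a raw agreement rate. The sketch below uses the standard kappa formula with invented agreement rates; the function name is hypothetical.

```python
def kappa_from_rates(observed_agreement: float, chance_agreement: float) -> float:
    """Kappa expressed as the share of achievable above-chance agreement."""
    return (observed_agreement - chance_agreement) / (1 - chance_agreement)


# Two hypothetical tasks with different raw agreement rates.
binary_task = kappa_from_rates(observed_agreement=0.85, chance_agreement=0.5)
five_label_task = kappa_from_rates(observed_agreement=0.76, chance_agreement=0.2)
```

Both invented tasks give a kappa of about 0.7, even though the annotators agree on 85% of items in one and 76% in the other. The difference is how much agreement chance alone would produce.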

The word “alignment” is used both in the technical sense (aligning model outputs to human preferences) and in the broader AI safety sense (ensuring AI systems behave safely). Clarify which sense you mean when the context could be ambiguous.
