5 exercises — Learn the key vocabulary of RLHF: reward models, PPO, DPO, Constitutional AI, and preference labeling.
1 / 5
What is a reward model in an RLHF pipeline?
The reward model is trained on human preference labels (which of two outputs is better) and outputs a scalar score used to train the policy via RL.
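As a rough illustration of how preference labels can turn into a training signal, the sketch below uses a Bradley-Terry style pairwise loss. That choice is an assumption for illustration, since the exercise does not name a loss. The function name and arguments are hypothetical; the two scores stand for the reward model's scalar outputs on the preferred and the rejected response.

```python
import math


def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Return -log(sigmoid(score_chosen - score_rejected)).

    The loss is small when the reward model scores the human-preferred
    output above the rejected one, and large when it ranks them the wrong way.
    """
    margin = score_chosen - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Correct ranking gives a lower loss than the reversed ranking.
    print(pairwise_preference_loss(2.0, 0.0))
    print(pairwise_preference_loss(0.0, 2.0))
```

Minimising this loss over many labelled pairs pushes the scalar score of preferred outputs above that of rejected ones, which is what lets the score act as a reward later.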
2 / 5
Which algorithm is most commonly used to update the language model's policy during RLHF?
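PPO (Proximal Policy Optimization) is the algorithm most commonly used for the RL step of RLHF. Its clipped objective limits how far each update can move the policy from the previous one, which avoids large destabilising shifts. RLHF setups typically also add a KL penalty that keeps the model close to the reference policy.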
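The two ideas in this answer, limiting the size of each policy update and penalising drift from the reference policy, can be sketched per sample as below. This is a simplified illustration, not a full PPO implementation: the function names, `beta`, and `clip_eps` are assumptions, and real pipelines work on batches of token-level log-probabilities with estimated advantages.

```python
import math


def kl_shaped_reward(rm_score: float, logp_policy: float, logp_ref: float,
                     beta: float = 0.1) -> float:
    """Reward model score minus a KL-style penalty.

    (logp_policy - logp_ref) is a single-sample estimate of the KL term;
    beta (assumed value) controls how strongly drift is penalised.
    """
    return rm_score - beta * (logp_policy - logp_ref)


def ppo_clipped_objective(logp_new: float, logp_old: float, advantage: float,
                          clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate objective for one sample (to be maximised)."""
    ratio = math.exp(logp_new - logp_old)
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    # Taking the minimum removes any incentive to push the ratio
    # outside the interval [1 - clip_eps, 1 + clip_eps].
    return min(ratio * advantage, clipped_ratio * advantage)


if __name__ == "__main__":
    print(kl_shaped_reward(rm_score=1.0, logp_policy=-1.0, logp_ref=-2.0, beta=0.5))
    print(ppo_clipped_objective(logp_new=0.0, logp_old=-1.0, advantage=1.0))
```

Note that the clip is measured against the previous policy iterate, while the KL term is measured against the frozen reference policy. They are separate mechanisms that are often used together.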
3 / 5
A colleague says: "We noticed reward hacking — the model scores high on the reward model but users hate it." What does reward hacking mean?
Reward hacking, an instance of Goodhart's Law, occurs when the model learns to score high on the reward model by exploiting superficial patterns rather than genuinely improving response quality.
4 / 5
What is DPO and how does it differ from standard RLHF?
Direct Preference Optimization (DPO) reparameterises the RLHF objective so you can train directly on preference pairs without needing a separate reward model or PPO loop, simplifying the pipeline.
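To make "train directly on preference pairs" concrete, here is a minimal per-pair sketch of the DPO loss. It assumes you already have summed log-probabilities of the chosen and rejected responses under both the policy being trained and a frozen reference model. The function name, argument names, and the `beta` default are illustrative assumptions.

```python
import math


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one preference pair.

    Computes -log(sigmoid(beta * (chosen log-ratio - rejected log-ratio))),
    where each log-ratio compares the policy with the frozen reference model.
    """
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    logits = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable form of log(1 + exp(-logits)).
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-5.0, -6.0, -5.0, -6.0))
    # Policy has raised the chosen response relative to the reference: lower loss.
    print(dpo_loss(-3.0, -6.0, -5.0, -6.0))
```

No reward model is queried and no sampling loop is needed: the loss depends only on log-probabilities of responses already present in the preference dataset.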
5 / 5
Anthropic's Constitutional AI approach uses:
Constitutional AI uses a set of written principles (the 'constitution') to have the model critique and revise its own outputs, reducing reliance on human labelers for harmlessness feedback.
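The critique-and-revise idea can be pictured as a simple loop. The sketch below is a toy illustration only: `model` is a hypothetical callable that maps a prompt string to a response string, and the prompt wording is invented for the example rather than taken from any published setup.

```python
from typing import Callable, List


def critique_and_revise(model: Callable[[str], str], prompt: str,
                        principles: List[str]) -> str:
    """Draft a response, then critique and revise it once per principle."""
    response = model(prompt)
    for principle in principles:
        # Ask the model to critique its own draft against one written principle.
        critique = model(
            "Critique the response against this principle.\n"
            f"Principle: {principle}\nResponse: {response}"
        )
        # Ask the model to rewrite the draft in light of that critique.
        response = model(
            "Rewrite the response to address the critique.\n"
            f"Critique: {critique}\nResponse: {response}"
        )
    return response


if __name__ == "__main__":
    # Stand-in model that just reports the length of its input.
    def echo_model(text: str) -> str:
        return f"[model output for {len(text)} chars]"

    print(critique_and_revise(echo_model, "Explain RLHF.", ["Be harmless."]))
```

In this picture the revised responses, not human harmlessness labels, become the material the model is later trained on.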