RLHF & Preference Learning — Vocabulary

5 exercises — Learn the key vocabulary of RLHF: reward models, PPO, DPO, Constitutional AI, and preference labeling.

1 / 5

What is a reward model in an RLHF pipeline?

2 / 5

Which algorithm is most commonly used to update the language model's policy during RLHF?

3 / 5

A colleague says: "We noticed reward hacking — the model scores high on the reward model but users hate it." What does reward hacking mean?

4 / 5

What is DPO and how does it differ from standard RLHF?

5 / 5

Anthropic's Constitutional AI approach uses: