Advanced Vocabulary #rlhf #syntheticdata #ai #mlops

Synthetic Data & RLHF Vocabulary

5 exercises — Practice synthetic data and RLHF vocabulary in English: GANs, differential privacy, reward models, preference data, PPO, annotation, and data flywheel.

Core synthetic data & RLHF vocabulary clusters
  • Synthetic data: GAN (Generative Adversarial Network), diffusion model, synthetic tabular data, data augmentation
  • Privacy: differential privacy (DP), epsilon (ε), k-anonymity, l-diversity, t-closeness, data anonymisation
  • RLHF pipeline: supervised fine-tuning (SFT), reward model (RM), PPO, preference data, comparison data, Constitutional AI
  • Annotation: labeller, annotator, annotation guidelines, inter-rater agreement, Cohen's kappa, gold standard
  • Data flywheel: production data, feedback loop, data pipeline, active learning, human-in-the-loop
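To make one of the annotation terms concrete, here is a minimal sketch of Cohen's kappa, the chance-corrected agreement score between two annotators. The function name `cohens_kappa` and the toy labels are assumptions made for illustration; they do not come from any particular annotation tool.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item list")
    n = len(labels_a)

    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Expected agreement: chance that both pick the same label if each annotator
    # labelled at random according to their own label frequencies.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)

    if expected == 1.0:
        # Degenerate case: both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical toy data: which of two responses (A or B) each annotator preferred.
annotator_1 = ["A", "A", "B", "B", "A", "B", "A", "A", "B", "A"]
annotator_2 = ["A", "B", "B", "B", "A", "B", "A", "A", "A", "A"]

kappa = cohens_kappa(annotator_1, annotator_2)
print(f"Cohen's kappa: {kappa:.2f}")
```

In this toy data the annotators agree on 8 of 10 items, but kappa comes out lower than 0.8 because some of that agreement would be expected by chance alone.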
1 / 5
An ML engineer explains RLHF to a product team:
"RLHF (Reinforcement Learning from Human Feedback) is a core technique used to align models such as GPT-4 and Claude to be helpful and harmless. The pipeline has three stages. First, supervised fine-tuning on a curated dataset of prompt-response pairs. Second, training a reward model: human raters see two responses to the same prompt and pick the better one, and the reward model learns to predict that preference. Third, using RL (specifically PPO) to optimise the LLM so that it produces responses the reward model scores highly. The reward model acts as a proxy for human judgment."
What is the role of the reward model in RLHF, and why is it trained on preference data rather than absolute ratings?
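The second stage described in the passage can be sketched in a few lines. This is a toy illustration only, assuming a linear reward model over two hand-made features and a Bradley-Terry style pairwise loss, which is one common way to learn from "A is better than B" comparisons. The feature names, the data, and the helper names (`reward`, `pairwise_loss`, `train_reward_model`) are all hypothetical; a real reward model would be a neural network scoring full prompt-response text.

```python
import math


def sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def reward(weights, features):
    # Toy linear reward model: a weighted sum of response features.
    return sum(w * f for w, f in zip(weights, features))


def pairwise_loss(weights, chosen, rejected):
    # Bradley-Terry style loss: -log P(chosen is preferred over rejected),
    # where P = sigmoid(reward(chosen) - reward(rejected)).
    margin = reward(weights, chosen) - reward(weights, rejected)
    return -math.log(sigmoid(margin))


def train_reward_model(pairs, n_features, lr=0.5, epochs=200):
    weights = [0.0] * n_features
    for _ in range(epochs):
        for chosen, rejected in pairs:
            margin = reward(weights, chosen) - reward(weights, rejected)
            # Gradient of the loss w.r.t. weights is
            # -(1 - sigmoid(margin)) * (chosen - rejected), so step the other way.
            scale = 1.0 - sigmoid(margin)
            for i in range(n_features):
                weights[i] += lr * scale * (chosen[i] - rejected[i])
    return weights


# Hypothetical preference data: (features of chosen response, features of rejected one).
# Assumed feature order: [helpfulness signal, rudeness signal].
PAIRS = [
    ([0.9, 0.1], [0.2, 0.8]),
    ([0.7, 0.0], [0.6, 0.9]),
    ([0.8, 0.2], [0.1, 0.3]),
]

weights = train_reward_model(PAIRS, n_features=2)
for chosen, rejected in PAIRS:
    print(round(reward(weights, chosen), 2), ">", round(reward(weights, rejected), 2))
```

Note that the loss depends only on the difference between the two rewards. Raters never have to agree on an absolute scale; they only have to say which response is better, and the model learns a scoring function consistent with those comparisons.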