
Synthetic Data & RLHF Vocabulary

5 exercises: practice synthetic data and RLHF vocabulary in English, covering GANs, differential privacy, reward models, preference data, PPO, annotation, and the data flywheel.

Core synthetic data & RLHF vocabulary clusters

Synthetic data: GAN (Generative Adversarial Network), diffusion model, synthetic tabular data, data augmentation
Privacy: differential privacy (DP), epsilon (ε), k-anonymity, l-diversity, t-closeness, data anonymisation
RLHF pipeline: supervised fine-tuning (SFT), reward model (RM), PPO, preference data, comparison data, Constitutional AI
Annotation: labeller, annotator, annotation guidelines, inter-rater agreement, Cohen's kappa, gold standard
Data flywheel: production data, feedback loop, data pipeline, active learning, human-in-the-loop


1 / 5

An ML engineer explains RLHF to a product team:
"RLHF (Reinforcement Learning from Human Feedback) is how GPT-4 and Claude were aligned to be helpful and harmless. The pipeline has three stages. First: supervised fine-tuning on a curated dataset of prompt-response pairs. Second: train a reward model. Human raters are shown two responses to the same prompt and pick the better one; the reward model learns to predict that preference. Third: use RL (specifically PPO) to optimise the LLM to produce responses the reward model scores highly. The reward model is a proxy for human judgment."
What is the role of the reward model in RLHF, and why is it trained on preference data rather than absolute ratings?
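The second stage in the passage can be sketched in a few lines. This is a minimal illustration, assuming the Bradley-Terry formulation that is commonly used for pairwise preference data; the passage does not name a specific loss, and the function names here are hypothetical.

```python
import math


def preference_probability(reward_chosen: float, reward_rejected: float) -> float:
    """Probability that the rater prefers the 'chosen' response.

    Assumes a Bradley-Terry model: the probability depends only on the
    difference between the two scalar rewards, passed through a sigmoid.
    """
    return 1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected)))


def pairwise_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Negative log-likelihood of the human choice for one comparison."""
    return -math.log(preference_probability(reward_chosen, reward_rejected))


# The loss is small when the reward model already ranks the pair the way
# the human did, and large when it ranks the pair the other way round.
agrees_with_rater = pairwise_loss(reward_chosen=2.0, reward_rejected=0.0)
disagrees_with_rater = pairwise_loss(reward_chosen=0.0, reward_rejected=2.0)
```

Only the difference between the two rewards enters the loss, so raters never have to agree on an absolute scale. That is one reason comparison data tends to be easier to collect consistently than numeric ratings.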

2 / 5

A data scientist explains differential privacy to an engineering team:
"Differential privacy (DP) gives a mathematical guarantee about privacy. An algorithm is epsilon-DP if adding or removing any single person's data changes the probability of any output by at most a factor of e^ε, which is close to 1 when ε is small. Lower epsilon means stronger privacy but less accuracy. We add calibrated noise (Laplace or Gaussian) to query outputs. Apple uses DP for keyboard and emoji usage analytics: they can learn aggregate patterns without seeing individual keystrokes. Epsilon less than 1 is considered strong privacy; epsilon around 10 is weak."
What is k-anonymity and how does it differ from differential privacy?
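The "calibrated noise" in the passage can be illustrated with the Laplace mechanism applied to a counting query. This is a sketch under stated assumptions: the function name and the example data are hypothetical, the query is a simple count (sensitivity 1), and plain floating-point sampling like this is not hardened for production use.

```python
import random


def private_count(values, predicate, epsilon, rng=None):
    """Return a noisy count of the items matching `predicate`.

    Adding or removing one person changes a count by at most 1, so the
    sensitivity is 1 and the Laplace noise scale is sensitivity / epsilon.
    """
    if epsilon <= 0:
        raise ValueError("epsilon must be positive")
    if rng is None:
        rng = random.Random()
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0
    scale = sensitivity / epsilon
    # The difference of two independent exponential samples with mean
    # `scale` follows a Laplace distribution centred on 0 with that scale.
    noise = rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)
    return true_count + noise


# Hypothetical data: how many users are 65 or older?
ages = [25, 41, 67, 70, 33, 68]
strong_privacy = private_count(ages, lambda a: a >= 65, epsilon=0.5)
weak_privacy = private_count(ages, lambda a: a >= 65, epsilon=10.0)
```

With epsilon at 0.5 the noise scale is 2, so a single answer can easily be off by a few units; with epsilon at 10 the scale is 0.1 and the answer stays close to the true count of 3. This is the privacy and accuracy trade-off the passage describes.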

3 / 5

An annotation team lead explains inter-rater agreement to a new ML team:
"For training data quality, we need consistency across raters. We measure inter-rater agreement: if two labellers disagree 50% of the time, the labels are noise, not signal. We use Cohen's kappa — it corrects for chance agreement. Kappa above 0.8 is strong; 0.6-0.8 is moderate. When kappa is low, we run adjudication sessions: bring raters together to discuss edge cases and update the annotation guidelines. Gold standard examples — pre-labelled items with known answers — are mixed in to detect rater drift over time."
What are annotation guidelines, and why does their quality directly impact model performance?
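Cohen's kappa, as mentioned in the passage, compares the observed agreement between two raters with the agreement expected by chance. A minimal sketch, assuming two raters who labelled the same items in the same order (the function name and example labels are illustrative):

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters: (observed - expected) / (1 - expected)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: for each label, the product of the two raters'
    # marginal frequencies, summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label for every item.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Two raters who agree on only half of a balanced binary task.
rater_1 = ["spam", "spam", "ok", "ok"]
rater_2 = ["spam", "ok", "spam", "ok"]
kappa = cohens_kappa(rater_1, rater_2)
```

In this example the raters agree on 50% of the items, which is exactly what chance would produce for two balanced labels, so kappa comes out at 0. This is the "noise, not signal" case from the passage.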

4 / 5

A data engineer explains synthetic tabular data to a privacy-conscious client:
"We can't train on production data that contains PII. Synthetic data is an alternative: generate a dataset with the same statistical properties as the real data but no real records. GANs (Generative Adversarial Networks) learn the data distribution and generate new samples. CTGAN is specifically designed for tabular data. There are two key validations. Utility: the synthetic data must be statistically similar to the real data (train the same model on both; similar performance means the synthetic data is useful). Privacy: the generator must not memorise specific records (test: does any synthetic row match a real row?)."
What is the data flywheel concept in AI product development?
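The privacy test from the passage ("does any synthetic row match a real row?") can be sketched as an exact-match check. This is a simplified illustration with hypothetical rows and a hypothetical function name; it only catches verbatim copies, so near-duplicates would need a distance-based check on top.

```python
def exact_match_rate(synthetic_rows, real_rows):
    """Fraction of synthetic rows that are verbatim copies of a real row."""
    if not synthetic_rows:
        return 0.0
    # Tuples are hashable, so membership tests against the set are fast.
    real_set = {tuple(row) for row in real_rows}
    matches = sum(1 for row in synthetic_rows if tuple(row) in real_set)
    return matches / len(synthetic_rows)


# Hypothetical rows: (age, city, salary)
real = [(34, "Berlin", 52000), (29, "Madrid", 41000)]
synthetic = [(34, "Berlin", 52000), (31, "Lisbon", 45000)]
leak_rate = exact_match_rate(synthetic, real)
```

Here one of the two synthetic rows reproduces a real record, so the rate is 0.5. A generator that had not memorised any record would score 0.0 on this check.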

5 / 5

An ML researcher presents Constitutional AI to a safety-focused team:
"Constitutional AI (CAI), developed by Anthropic, is an RLHF variant that uses AI feedback instead of human feedback for the harmlessness aspect. We define a 'constitution' — a set of principles ('do not assist with harmful activities', 'be honest'). The AI critiques its own responses against these principles and revises them. A separate AI model rates the revised responses. This scales feedback collection: one human writing the constitution replaces hundreds of human raters for the harmlessness reward model."
What is the difference between SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimisation) in the LLM alignment pipeline?
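The critique-and-revise step described in the passage can be outlined as a loop over principles. This is only a structural sketch: `critique_fn` and `revise_fn` are hypothetical stand-ins for calls to a language model, and applying every principle in order is a simplification chosen for illustration, not a description of Anthropic's implementation.

```python
from typing import Callable, Sequence


def critique_and_revise(
    prompt: str,
    draft: str,
    principles: Sequence[str],
    critique_fn: Callable[[str, str, str], str],
    revise_fn: Callable[[str, str, str], str],
) -> str:
    """Revise a draft once per principle and return the final response.

    critique_fn(prompt, response, principle) returns a critique (hypothetical).
    revise_fn(prompt, response, critique) returns a revised response (hypothetical).
    """
    response = draft
    for principle in principles:
        critique = critique_fn(prompt, response, principle)
        response = revise_fn(prompt, response, critique)
    return response


# Toy stand-ins so the loop can run without a model.
def toy_critique(prompt: str, response: str, principle: str) -> str:
    return "remove insult" if "idiot" in response else "no change needed"


def toy_revise(prompt: str, response: str, critique: str) -> str:
    if critique == "remove insult":
        return response.replace(", you idiot", "")
    return response


final = critique_and_revise(
    prompt="How do I reset my password?",
    draft="Click 'Forgot password', you idiot.",
    principles=["be respectful", "be honest"],
    critique_fn=toy_critique,
    revise_fn=toy_revise,
)
```

In the passage, a separate AI model then rates the revised responses; that rating step is outside this sketch.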