5 exercises — Practice synthetic data and RLHF vocabulary in English: GANs, differential privacy, reward models, preference data, PPO, annotation, and data flywheel.
Core synthetic data & RLHF vocabulary clusters
Synthetic data: GAN (Generative Adversarial Network), diffusion model, synthetic tabular data, data augmentation
Data flywheel: production data, feedback loop, data pipeline, active learning, human-in-the-loop
1 / 5
An ML engineer explains RLHF to a product team: "RLHF — Reinforcement Learning from Human Feedback — is how GPT-4 and Claude were aligned to be helpful and harmless. The pipeline has three stages. First: supervised fine-tuning on a curated dataset of prompt-response pairs. Second: train a reward model — show human raters two responses to the same prompt, they pick the better one; the reward model learns to predict human preference. Third: use RL (specifically PPO) to optimise the LLM to produce responses the reward model scores highly. The reward model is a proxy for human judgment." What is the role of the reward model in RLHF, and why is it trained on preference data rather than absolute ratings?
Reward model (RM): a model (usually a fine-tuned copy of the base LLM) trained to output a scalar score predicting how much a human would prefer a given response. It acts as a proxy for human judgment during the RL phase.
Why pairwise comparison? Absolute ratings (1-5 stars) have calibration problems: one rater's 4 is another's 3. Pairwise comparisons ("A is better than B") are more consistent and easier for raters to agree on.
RLHF vocabulary:
Supervised Fine-Tuning (SFT): Stage 1. Fine-tune the base LLM on a dataset of high-quality prompt-response pairs. Creates the initial helpful model.
Preference data: a dataset of (prompt, response_A, response_B, preference) tuples, collected from human raters comparing responses.
PPO (Proximal Policy Optimization): the RL algorithm used to update the LLM. Uses the reward model's scores as the reward signal. In RLHF it is combined with a KL penalty to prevent the model from diverging too far from the SFT model.
KL penalty: penalises the policy (LLM) for drifting away from the output distribution of the reference (SFT) model, which limits reward hacking. The per-step size of each update is limited separately, by PPO's clipped objective.
Reward hacking: the RL policy finds ways to score highly on the reward model without actually being more helpful.
Constitutional AI (CAI): Anthropic's variant, which uses AI feedback instead of (or in addition to) human feedback.
DPO (Direct Preference Optimisation): trains directly on preference data without a separate RL stage. Simpler than PPO.
In conversation: 'The reward model is the hardest part of RLHF. If it's miscalibrated or biased, the RL step amplifies those problems.'
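The pairwise training signal and the KL penalty described above can be sketched in a few lines. This is a minimal illustration, not the exact objective of any particular system: it assumes the common Bradley-Terry formulation for the reward model loss and a simple log-probability difference as the KL estimate. The function names and the `beta` value are hypothetical.

```python
import math

def pairwise_preference_loss(score_chosen: float, score_rejected: float) -> float:
    """Assumed Bradley-Terry loss: -log(sigmoid(score_chosen - score_rejected))."""
    margin = score_chosen - score_rejected
    # Numerically stable form of log(1 + exp(-margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

def kl_shaped_reward(rm_score: float, logp_policy: float,
                     logp_reference: float, beta: float = 0.1) -> float:
    """Reward model score minus a penalty for drifting from the reference model.

    logp_policy and logp_reference are the log-probabilities that the current
    policy and the frozen SFT model assign to the same response.
    """
    return rm_score - beta * (logp_policy - logp_reference)

# The loss is small when the reward model ranks the human-preferred response higher...
low = pairwise_preference_loss(score_chosen=2.0, score_rejected=0.0)
# ...and large when it ranks the pair the wrong way round.
high = pairwise_preference_loss(score_chosen=0.0, score_rejected=2.0)
print(f"correct ranking loss={low:.3f}, wrong ranking loss={high:.3f}")
```

Only the difference between the two scores matters in this loss, which is one way to see why rater calibration (one rater's 4 versus another's 3) stops being a problem.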
2 / 5
A data scientist explains differential privacy to an engineering team: "Differential privacy (DP) gives a mathematical guarantee about privacy. An algorithm is epsilon-DP if adding or removing any single person's data changes the probability of any output by at most a factor of e^ε, which is close to 1 when ε is small. Lower epsilon means stronger privacy but less accuracy. We add calibrated noise (Laplace or Gaussian) to query outputs. Apple uses DP for keyboard and emoji usage analytics: they can learn aggregate patterns without seeing individual keystrokes. Epsilon less than 1 is considered strong privacy; epsilon around 10 is weak." What is k-anonymity and how does it differ from differential privacy?
k-anonymity: a dataset satisfies k-anonymity if every record is identical to at least k-1 others on the quasi-identifying attributes (age, zip code, gender). Example: k=3 means every row shares its quasi-identifier values with at least 2 other rows in the dataset. Weakness: if everyone in a group has the same sensitive value, or the attacker has background knowledge, the sensitive attribute still leaks. l-diversity and t-closeness are extensions that address such attacks on sensitive attributes within an anonymous group.
Differential privacy: a stronger, mathematically rigorous guarantee. It is a property of an algorithm, not of a dataset. Guarantee: the probability of any output changes by at most a factor of e^ε when any single record is added or removed. This provably bounds what an adversary can infer about any individual.
Privacy vocabulary:
Epsilon (ε): the privacy budget. Small ε = strong privacy, high noise, less utility. Large ε = weak privacy, low noise, more utility.
Sensitivity: how much a query's output can change when one record is added or removed. Determines how much noise to add.
Laplace mechanism: adds Laplace-distributed noise with scale sensitivity/epsilon.
Gaussian mechanism: adds Gaussian noise. Used with (ε, δ)-DP.
Local DP: noise added on the user's device before sending data. Used by Apple and Google.
Central DP: noise added by a trusted aggregator. Less noise needed for the same privacy guarantee.
Federated learning: train models on decentralised devices without centralising raw data. Often combined with DP.
In conversation: 'k-anonymity is a good first step for releasing datasets, but it's not a guarantee: it can be broken with auxiliary information. Differential privacy gives you a mathematical receipt.'
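To make the contrast concrete, here is a small sketch of both ideas: a k-anonymity check on a dataset and a Laplace mechanism applied to a counting query. It is illustrative only. The helper names, the toy rows, and the choice of a counting query (sensitivity 1) are assumptions, and the code is not hardened for real privacy work.

```python
import random
from collections import Counter

def k_anonymity(rows, quasi_identifiers):
    """Return the largest k for which the dataset is k-anonymous.

    That is the size of the smallest group of rows sharing the same
    quasi-identifier values.
    """
    groups = Counter(tuple(row[q] for q in quasi_identifiers) for row in rows)
    return min(groups.values())

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) as the difference of two exponentials with mean `scale`."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

def private_count(rows, predicate, epsilon, rng):
    """Counting query with Laplace noise of scale sensitivity / epsilon."""
    true_count = sum(1 for row in rows if predicate(row))
    sensitivity = 1.0  # adding or removing one record changes a count by at most 1
    return true_count + laplace_noise(sensitivity / epsilon, rng)

# Hypothetical toy dataset with generalised quasi-identifiers.
rows = [
    {"age": "30-39", "zip": "021**", "diagnosis": "flu"},
    {"age": "30-39", "zip": "021**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "100**", "diagnosis": "flu"},
    {"age": "40-49", "zip": "100**", "diagnosis": "flu"},
]

rng = random.Random(0)
print("k =", k_anonymity(rows, ["age", "zip"]))
print("noisy flu count:", private_count(rows, lambda r: r["diagnosis"] == "flu", 1.0, rng))
```

Note the difference in what is being checked: `k_anonymity` inspects a released table, while `private_count` is a randomised algorithm whose guarantee holds whatever the table contains. The second group above also shows the homogeneity weakness, since both rows share the same diagnosis.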
3 / 5
An annotation team lead explains inter-rater agreement to a new ML team: "For training data quality, we need consistency across raters. We measure inter-rater agreement: if two labellers disagree 50% of the time, the labels are noise, not signal. We use Cohen's kappa — it corrects for chance agreement. Kappa above 0.8 is strong; 0.6-0.8 is moderate. When kappa is low, we run adjudication sessions: bring raters together to discuss edge cases and update the annotation guidelines. Gold standard examples — pre-labelled items with known answers — are mixed in to detect rater drift over time." What is the annotation guideline and why does its quality directly impact model performance?
Annotation guideline: the document raters follow. Contains: task definition, label definitions with examples, edge case rules, decision trees for ambiguous cases, examples of correct and incorrect labels, escalation procedures.
Quality impact: ambiguous guidelines produce inconsistent labels (low kappa). Low kappa = noisy labels = model learns noise = lower accuracy. A well-written guideline is as important as model architecture.
Annotation vocabulary:
Labeller / Annotator: the person assigning labels to data.
Task: the labelling job (classification, NER, ranking, transcription).
Inter-rater reliability (IRR): how consistently different annotators label the same item.
Cohen's kappa (κ): IRR metric for two raters. Corrects for chance. κ = (Pₒ - Pₑ) / (1 - Pₑ), where Pₒ is the observed agreement rate and Pₑ is the agreement expected by chance. κ > 0.8 = strong.
Fleiss' kappa: generalisation of Cohen's kappa for more than two raters.
Adjudication: process of resolving disagreements. Raters discuss and reach consensus, or a senior rater decides.
Gold standard: pre-labelled items with known-correct answers mixed into annotation batches to measure rater quality.
Rater drift: raters gradually shift their interpretation of guidelines over time. Gold standards detect this.
Label smoothing: a training technique that softens hard labels (0/1 → 0.1/0.9) to account for annotation noise.
In conversation: 'We spent three days writing the annotation guideline before starting labelling. That investment saved weeks of model retraining from noisy labels.'
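The Cohen's kappa definition can be worked through on a small example. The sketch below computes observed agreement, chance agreement, and kappa for two raters. The function name and the spam/ham labels are made up for illustration.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items where the raters chose the same label.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: probability both pick the same label if each rater
    # labelled at random according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (p_o - p_e) / (1 - p_e)

# Hypothetical labels from two raters on ten items.
rater_a = ["spam"] * 4 + ["ham"] * 6
rater_b = ["spam"] * 3 + ["ham"] * 6 + ["spam"]

print(f"kappa = {cohens_kappa(rater_a, rater_b):.2f}")
```

Here the raters agree on 8 of 10 items (observed agreement 0.80), but chance agreement is 0.52, so kappa is only about 0.58. Raw agreement overstates consistency when one label dominates, which is why kappa is reported instead.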
4 / 5
A data engineer explains synthetic tabular data to a privacy-conscious client: "We can't train on production data that contains PII. Synthetic data is an alternative: generate a dataset with the same statistical properties as the real data but no real records. GANs — Generative Adversarial Networks — learn the data distribution and generate new samples. CTGAN is specifically designed for tabular data. The key validation: synthetic data must be statistically similar to real (train the same model on both; similar performance means useful synthetic data) but not memorise specific records (privacy test: does any synthetic row match a real row?)." What is the data flywheel concept in AI product development?
Data flywheel: the compounding loop where user interactions generate training data that improves the model, which in turn attracts more users.
Key mechanisms:
Implicit feedback: clicks, dwell time, re-queries. Signals what was helpful without explicit labels.
Explicit feedback: thumbs up/down, ratings, corrections.
Active learning: the model identifies samples it is uncertain about, which are the most valuable to label. Reduces annotation cost by focusing human effort.
Human-in-the-loop: humans review model predictions before they are acted on, and their corrections become training data.
GAN vocabulary:
Generator: neural network that creates synthetic samples.
Discriminator: neural network that distinguishes real from synthetic. The two are trained adversarially, so each improves the other.
CTGAN: Conditional Tabular GAN. Designed for tabular data, handles mixed types (numeric + categorical).
Membership inference attack: can an adversary determine if a specific record was in the training data? Tests whether synthetic data memorises real records.
Utility: how useful synthetic data is, measured by training a model on synthetic data and evaluating on real data.
Fidelity: how statistically similar synthetic data is to real data (column distributions, correlations).
In conversation: 'The data flywheel is why incumbents with large user bases have a structural advantage in AI. Each query is both a product use and a training signal.'
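Two of the ideas above lend themselves to a short sketch: the simple privacy test from the prompt (does any synthetic row match a real row?) and uncertainty-based sample selection for active learning. Both functions are hypothetical helpers written for illustration. The exact-match check is a crude first pass, not a substitute for a membership inference test, and least-confidence sampling is only one of several selection strategies.

```python
def exact_match_rate(real_rows, synthetic_rows):
    """Fraction of synthetic rows that are exact copies of some real row."""
    if not synthetic_rows:
        return 0.0
    real_set = {tuple(row) for row in real_rows}
    copies = sum(tuple(row) in real_set for row in synthetic_rows)
    return copies / len(synthetic_rows)

def least_confident(probabilities, budget):
    """Indices of the `budget` samples whose top class probability is lowest.

    `probabilities` holds one list of class probabilities per unlabelled sample.
    """
    ranked = sorted(range(len(probabilities)), key=lambda i: max(probabilities[i]))
    return ranked[:budget]

# Hypothetical data: (age, city) rows and model predictions on unlabelled items.
real = [(34, "Lyon"), (51, "Porto"), (29, "Turin")]
synthetic = [(34, "Lyon"), (47, "Porto"), (30, "Turin"), (52, "Lyon")]
predictions = [[0.95, 0.05], [0.55, 0.45], [0.60, 0.40], [0.80, 0.20]]

print("copied rows:", exact_match_rate(real, synthetic))
print("send to annotators:", least_confident(predictions, budget=2))
```

In a flywheel setting, the items chosen by `least_confident` would be routed to human reviewers, and their corrections fed back into the next training run.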
5 / 5
An ML researcher presents Constitutional AI to a safety-focused team: "Constitutional AI (CAI), developed by Anthropic, is an RLHF variant that uses AI feedback instead of human feedback for the harmlessness aspect. We define a 'constitution' — a set of principles ('do not assist with harmful activities', 'be honest'). The AI critiques its own responses against these principles and revises them. A separate AI model rates the revised responses. This scales feedback collection: one human writing the constitution replaces hundreds of human raters for the harmlessness reward model." What is the difference between SFT (Supervised Fine-Tuning) and DPO (Direct Preference Optimisation) in the LLM alignment pipeline?
SFT (Supervised Fine-Tuning): the first stage of alignment. Fine-tunes the pre-trained base LLM on a curated dataset of (prompt, ideal_response) pairs using standard cross-entropy loss. Creates the initial helpful model. Data source: human-written demonstrations of desired behaviour.
DPO (Direct Preference Optimisation): an alternative to PPO-based RLHF. Takes preference pairs (prompt, chosen, rejected) and optimises the LLM directly using a closed-form objective derived from the reward maximisation problem. No separate reward model is needed, although a frozen reference model (usually the SFT model) is still required. Benefits: simpler training loop, no RL instability, less compute. Trade-offs: still requires preference data; may be less expressive than full RLHF for complex tasks.
In short: SFT learns from demonstrations of what to say, while DPO learns from comparisons of which response is better, and it usually runs after SFT.
LLM alignment vocabulary:
Alignment: making LLM behaviour match human values (helpful, harmless, honest).
Base model: the pre-trained LLM before any fine-tuning. Knows language but not how to follow instructions.
Instruction tuning: fine-tuning on (instruction, response) pairs to make the model follow instructions. Often the first SFT stage.
Chat model: a model fine-tuned with a specific conversation format and RLHF/DPO for helpful dialogue.
PEFT (Parameter-Efficient Fine-Tuning): fine-tune a small number of parameters while keeping the rest of the model frozen. Includes LoRA, prefix tuning.
LoRA (Low-Rank Adaptation): fine-tune low-rank update matrices instead of full weights. Widely used for SFT with limited compute.
In conversation: 'DPO is becoming the default for alignment. It's much simpler to implement than PPO and produces comparable results for most use cases.'
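The DPO objective for a single preference pair can be written directly from sequence log-probabilities, which shows why no reward model or RL loop is involved. This is a minimal sketch of the loss as commonly stated, using plain floats instead of a real model. The function name, the example numbers, and the `beta` value are assumptions.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (prompt, chosen, rejected) triple.

    logp_* are log-probabilities of each full response under the policy being
    trained; ref_logp_* are the same quantities under the frozen reference model.
    """
    chosen_logratio = logp_chosen - ref_logp_chosen
    rejected_logratio = logp_rejected - ref_logp_rejected
    margin = beta * (chosen_logratio - rejected_logratio)
    # Numerically stable -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Policy identical to the reference: the loss starts at log(2).
start = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability towards the chosen response: the loss drops.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(f"start={start:.3f}, improved={improved:.3f}")
```

The loss has the same pairwise form as a reward model loss, except that the implicit reward is the policy's own log-probability ratio against the reference model.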