TRL and RLHF: English Vocabulary for LLM Fine-Tuning Engineers
Learn English vocabulary for TRL and RLHF: PPO trainer, reward model, DPO, ORPO, SFT, and chat templates for fine-tuning large language models.
Fine-tuning large language models has evolved rapidly from a niche research activity into a core engineering skill at many organisations. The Hugging Face TRL library — Transformer Reinforcement Learning — has become the standard toolkit for this work, and alongside it has emerged a rich, precise vocabulary that researchers and engineers use daily. Whether you are reading a paper, attending an ML conference, or reviewing a colleague’s training script, fluency with these terms in English will make you a stronger collaborator.
Key Vocabulary
SFT (Supervised Fine-Tuning) — the process of further training a pre-trained language model on a labelled dataset of input–output pairs, teaching it to follow a specific format or behaviour before any preference-based alignment step. Definition sentence: SFT is almost always the first stage in a fine-tuning pipeline; it establishes a baseline model that subsequent alignment methods can then refine. Example: “We ran SFT on 50,000 instruction–response pairs before moving on to the preference alignment stage.”
PPO trainer — the component in TRL that implements Proximal Policy Optimisation, an algorithm that updates the model’s weights using reward signals while preventing the policy from drifting too far from the original model in a single update step.
Definition sentence: The PPOTrainer in TRL manages the rollout loop, reward scoring, and gradient updates in a single training object.
Example: “Tuning the KL penalty coefficient in the PPO trainer took several experimental runs before we found a stable configuration.”
Reward model — a separately trained model that assigns a scalar score to a piece of generated text, representing how well that output aligns with human preferences. Definition sentence: The reward model is trained on preference data — pairs of responses where a human has indicated which one is better — and its outputs drive the PPO update signal. Example: “Our reward model was trained on 12,000 pairwise comparisons and achieved 78% agreement with held-out human annotations.”
DPO (Direct Preference Optimisation) — an alignment algorithm that bypasses the need for a separate reward model by directly optimising the policy on human preference pairs using a classification-style loss. Definition sentence: DPO has become popular because it is simpler to implement and more stable to train than PPO while achieving comparable alignment quality on many tasks. Example: “We switched from PPO to DPO after the reward model kept over-optimising and producing reward-hacked outputs.”
ORPO (Odds Ratio Preference Optimisation) — a more recent alignment algorithm that combines SFT and preference optimisation into a single training objective, eliminating the need for a separate SFT phase. Definition sentence: ORPO applies a penalty to rejected responses within the same loss function that rewards chosen responses, simplifying the training pipeline considerably. Example: “ORPO reduced our total compute cost because we no longer needed two separate training runs for SFT and alignment.”
Chat template — a structured string format that wraps user messages, assistant turns, and system prompts with special tokens, telling the model which role produced each piece of text.
Definition sentence: Every base model has its own chat template, and applying the wrong one at inference time causes degraded or incoherent outputs.
Example: “The tokeniser’s apply_chat_template method takes a list of message dictionaries and returns the correctly formatted input string.”
Preference data — a dataset of pairs (or groups) of model responses to the same prompt, annotated with human or AI judgements about which response is preferable, used to train reward models or run DPO. Definition sentence: The quality of your preference data is the single most important factor in determining the quality of your aligned model. Example: “We collected preference data through an internal annotation tool where subject-matter experts ranked responses on helpfulness and accuracy.”
PEFT/LoRA in TRL context — Parameter-Efficient Fine-Tuning methods, most commonly Low-Rank Adaptation, which allow TRL trainers to update only a small fraction of the model’s parameters, making fine-tuning feasible on modest hardware.
Definition sentence: TRL’s trainers integrate natively with the peft library, so you can wrap any base model with a LoRA adapter and pass it directly to SFTTrainer or DPOTrainer.
Example: “We fine-tuned a 7B model on a single A100 by using LoRA with rank 16, which reduced the trainable parameter count by more than 99%.”
Useful Phrases
- “We’re running a three-stage pipeline: SFT on the instruction dataset, then a reward model trained on pairwise preferences, then PPO to align the policy.”
- “The DPO loss function treats the preference pairs directly, so there’s no need to maintain a separate reward model in memory during training.”
- “Make sure you apply the correct chat template before tokenising — mixing up Llama and Mistral templates is a very easy mistake to make.”
- “Our LoRA adapters are tiny compared to the base model weights, so we can store a dozen fine-tuned variants without significant storage overhead.”
- “We evaluated the aligned model against the SFT baseline using win-rate on a held-out preference set, and DPO came out ahead in 63% of comparisons.”
Common Mistakes
Saying “we trained the model with reinforcement learning from human feedback” when describing DPO
RLHF technically refers to the full loop of human annotation, reward model training, and PPO optimisation. DPO is not RLHF — it bypasses the reward model entirely. It is more precise to say “we used a preference-based alignment method” or “we applied DPO” rather than collapsing everything under the RLHF umbrella.
Mispronouncing or misusing “proximal”
The word proximal means “close to the point of origin.” Engineers sometimes confuse it with approximate or proximity. In the phrase Proximal Policy Optimisation, proximal refers to the constraint that keeps the updated policy close to the original policy. You do not need to explain this in conversation, but knowing the meaning helps you use the term correctly in context.
Treating “alignment” as synonymous with “fine-tuning”
Fine-tuning refers to any continued training of a pre-trained model. Alignment is a subset of fine-tuning that specifically aims to make the model’s outputs safer, more helpful, or more consistent with human values. A model can be fine-tuned for a domain-specific task — say, legal document summarisation — without undergoing any alignment procedure at all. Using the terms precisely will prevent confusion when discussing project goals.
The TRL and RLHF space moves quickly, with new algorithms appearing every few months. Building a solid foundation in the English vocabulary around SFT, PPO, DPO, and preference data will help you read new papers faster, evaluate new techniques more critically, and contribute more effectively to your team’s fine-tuning work.