Reinforcement Learning from Human Feedback (2504.12501v2)

Published 16 Apr 2025 in cs.LG

Abstract: Reinforcement learning from human feedback (RLHF) has become an important technical and storytelling tool to deploy the latest machine learning systems. In this book, we hope to give a gentle introduction to the core methods for people with some level of quantitative background. The book starts with the origins of RLHF -- both in recent literature and in a convergence of disparate fields of science in economics, philosophy, and optimal control. We then set the stage with definitions, problem formulation, data collection, and other common math used in the literature. The core of the book details every optimization stage in using RLHF, from starting with instruction tuning to training a reward model and finally all of rejection sampling, reinforcement learning, and direct alignment algorithms. The book concludes with advanced topics -- understudied research questions in synthetic data and evaluation -- and open questions for the field.

Summary

  • The paper demonstrates that RLHF significantly improves language model outputs by incorporating human preference data into a structured three-step process.
  • It outlines a pipeline involving supervised finetuning, reward model training on pairwise comparisons, and reinforcement finetuning using policy optimization algorithms.
  • The approach unlocks advanced capabilities like complex reasoning and nuanced behavior while addressing challenges such as over-optimization and computational complexity.

Reinforcement Learning from Human Feedback (RLHF) is presented as a crucial technique for incorporating human information into AI systems, particularly LLMs. Initially used to solve problems where direct reward specification is difficult, RLHF gained prominence with the release of models like ChatGPT. It is now considered a key component of "post-training," a broader set of techniques aimed at making LLMs more useful for downstream tasks.

The paper outlines the basic RLHF pipeline as a three-step process:

  1. Supervised Finetuning (SFT) or Instruction Finetuning (IFT): Training an LLM to follow instructions and adopt specific formats, primarily using labeled examples. This teaches basic instruction-following abilities and formatting.
  2. Reward Model (RM) Training: Collecting human preference data (comparisons between model outputs) and training a separate model to predict human preferences. This RM serves as the optimization target for the subsequent RL stage.
  3. Reinforcement Finetuning (RFT) or Policy Optimization: Optimizing the LLM policy using an RL algorithm to maximize the reward predicted by the RM. The model generates responses, the RM scores them, and the policy is updated based on these scores.

RLHF's primary function is described as integrating subtle stylistic and behavioral features into models. Unlike SFT, which optimizes token by token, RLHF tunes responses at the sequence level and provides feedback (positive or negative) on entire generations. This allows for better generalization across domains compared to instruction tuning alone. However, RLHF is more complex and computationally expensive, requiring careful control of the optimization process and being prone to issues like over-optimization and length bias.

A complementary intuition for post-training is the "elicitation interpretation," suggesting that post-training extracts and amplifies valuable behaviors already latent in the large base model from pretraining. This is contrasted with the "Superficial Alignment Hypothesis," which posits that alignment is mostly about learning style from a small number of examples. The paper argues that while small datasets can influence style, large-scale post-training, including RLHF and newer RFT methods, can unlock significant new capabilities beyond just style, such as complex reasoning.

The history of RLHF is traced from early work on RL from preferences in control problems (e.g., TAMER [28], COACH [29], Christiano et al. [1]) and reward modeling for alignment [32] to its application to LLMs (Ziegler et al. [33], Summarization [2], InstructGPT [3], WebGPT [4]) and finally the ChatGPT era, where it became a central technique for leading models like Claude [5], Llama 2 [43], Llama 3 [23], and Nemotron 4 [24]. The field is evolving into "preference finetuning" (PreFT), encompassing Direct Alignment Algorithms and Reinforcement Finetuning (RFT) for verifiable domains like reasoning [44, 45, 46].

Key definitions are provided for understanding RLHF, including ML terms like KL divergence, NLP terms like Prompt, Completion, Chosen Completion, Rejected Completion, Preference Relation, and Policy, and RL terms like Reward, Action, State, Trajectory, Trajectory Distribution, Policy, Value Function, Q-Function, Advantage Function, and On-policy/Off-policy data. RLHF-specific terms include Reference Model. Extended glossary terms cover Synthetic Data, Distillation (general and Knowledge Distillation [50]), In-context Learning (ICL), and Chain of Thought (CoT) [53].
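
Two of the recurring quantities can be written out explicitly; these are the standard textbook definitions rather than notation taken from the paper.

```latex
% KL divergence between the trained policy \pi and a reference policy \pi_{\mathrm{ref}},
% summed over completions y for a prompt x:
D_{\mathrm{KL}}\big(\pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)
  = \sum_{y} \pi(y \mid x)\, \log \frac{\pi(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}

% Advantage function: how much better taking action a in state s is than the
% state's expected value under the policy.
A(s, a) = Q(s, a) - V(s)
```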

The standard RL setup is manipulated for RLHF by:

  1. Using a learned reward model r_θ instead of a fixed reward function.
  2. Operating in a single-turn setting where the prompt is the initial state and the completion is a sequence of actions, with no state transitions influencing the next action (in the traditional sense).
  3. Attributing rewards to the entire generated response (a bandit problem), rather than individual actions/tokens. The optimization objective becomes maximizing the expected reward of trajectories sampled from the policy, often with a regularization term to prevent drifting too far from a reference policy.
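
Putting these modifications together, the regularized objective just described is commonly written in the following KL-constrained form (standard in the RLHF literature; the notation mirrors the definitions above).

```latex
% Maximize expected reward from the learned reward model r_\theta while penalizing
% divergence from the reference policy \pi_{\mathrm{ref}}; \beta sets the penalty strength.
\max_{\pi}\;
\mathbb{E}_{x \sim \mathcal{D},\; y \sim \pi(\cdot \mid x)}\!\big[ r_\theta(x, y) \big]
\;-\; \beta\, D_{\mathrm{KL}}\big( \pi(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \big)
```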

Various Optimization Tools are used:

  • Reward Modeling (Chapter 7): Training a model to predict a scalar reward based on human preference data.
  • Instruction Finetuning (Chapter 9): The foundational step to adapt models to conversational formats.
  • Rejection Sampling (Chapter 10): A simple baseline where multiple completions are generated, scored by a reward model, and only the top-scoring ones are used for SFT.
  • Policy Gradients (Chapter 11): RL algorithms like PPO, REINFORCE, GRPO used to directly update the model parameters based on the RM signal.
  • Direct Alignment Algorithms (Chapter 12): Methods like DPO that optimize the policy directly from preference data without an intermediate RM.

The "canonical RLHF recipe" involves SFT, training a Reward Model on pairwise prompts, and then training the SFT model with an RL optimizer against the RM, sampling new generations. Modern post-training can involve many more iterative stages.

Preference Data is the critical input for RLHF because specifying complex human values directly is infeasible. Collecting this data, often via pairwise comparisons, fuels the reward model training. Interfaces for collecting preferences vary from internal tools to public platforms like ChatBotArena [72] and thumbs-up/down systems. Data can be rankings (relative order) or ratings (scores), though rankings are more common for training RMs. Structured preference data can be automatically generated based on task constraints or verifiable outcomes (e.g., correct vs. incorrect answers in math). Sourcing human preference data is a complex and costly process involving vendors, detailed instructions, and iterative refinement. AI feedback (RLAIF) and synthetic data are increasingly used to augment or replace human data, offering cost benefits but introducing different biases.

Reward Modeling involves training a model to predict the probability of a completion being preferred. The loss function is typically derived from the Bradley-Terry model [80], which models pairwise comparison probabilities. Implementations often involve adding a small linear head to a pretrained LM. The loss minimizes the negative log-likelihood of predicting the preferred response given a pair, often using a sigmoid function applied to the difference in predicted rewards. Variants include using a preference margin [43], balancing multiple comparisons per prompt [3], or using a K-wise loss based on the Plackett-Luce model [83].
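
Concretely, the pairwise loss described here is the Bradley-Terry negative log-likelihood, where y_c and y_r are the chosen and rejected completions for a prompt x and σ is the logistic function.

```latex
% Pairwise reward-model loss: push the predicted reward of the chosen completion
% y_c above that of the rejected completion y_r.
\mathcal{L}_{\mathrm{RM}}(\theta)
  = -\,\mathbb{E}_{(x,\, y_c,\, y_r)}
    \Big[ \log \sigma\big( r_\theta(x, y_c) - r_\theta(x, y_r) \big) \Big]
```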

  • Outcome Reward Models (ORMs) [84] predict the probability of a completion being correct for verifiable tasks (e.g., math). They often use a per-token cross-entropy loss on a binary correct/incorrect label.
  • Process Reward Models (PRMs) [44] score intermediate steps in a reasoning process. They require step-by-step annotations and often predict a score (-1, 0, 1) at the end of each step.

A table compares Reward Models, ORMs, PRMs, and Value Functions based on what they predict, how they are trained, and their LM structure. Generative Reward Modeling uses LLMs (LLM-as-a-judge [86]) to generate critiques or preference ratings, a cost-effective alternative to human labeling, although recent work suggests dedicated RMs can still outperform them on certain evaluations.

Regularization is essential in RLHF to prevent the model from drifting too far from the initial policy and generating nonsensical outputs (over-optimization). The most common technique is adding a KL distance penalty between the current policy and a reference policy (often the initial SFT model) to the reward signal. The KL penalty is typically computed based on the log probabilities of generated tokens. Other forms of regularization include adding a pretraining negative log-likelihood term to the loss or using margin losses in reward model training or direct alignment algorithms.
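
A minimal sketch of the KL-penalized reward described above, assuming per-token log probabilities from the policy and the reference model are already available; the variable names and the β value are illustrative, not from the paper.

```python
import torch

def kl_penalized_reward(rm_score: torch.Tensor,
                        policy_logprobs: torch.Tensor,
                        ref_logprobs: torch.Tensor,
                        beta: float = 0.05) -> torch.Tensor:
    """Subtract a KL penalty from the scalar reward-model score.

    rm_score:        (batch,)      reward-model score per completion
    policy_logprobs: (batch, seq)  log pi(token | context) for generated tokens
    ref_logprobs:    (batch, seq)  log pi_ref(token | context) for the same tokens
    """
    # Per-token estimate of log(pi / pi_ref), summed over the completion.
    kl_estimate = (policy_logprobs - ref_logprobs).sum(dim=-1)
    # Penalized reward handed to the RL optimizer.
    return rm_score - beta * kl_estimate
```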

Instruction Finetuning (IFT) is the necessary first step to make LLMs follow instructions. It uses the standard autoregressive negative log-likelihood loss on datasets of instruction-response pairs. A core component is the chat template, which formats user prompts, system messages, and assistant responses using special tokens (e.g., ChatML <|im_start|>role\ncontent<|im_end|>\n). Best practices for IFT emphasize high-quality data, using synthetic data alongside human data, and aligning the training distribution with downstream tasks.
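
As an illustration of the chat template, the ChatML layout quoted above can be rendered with a small helper; in practice tokenizers ship their own chat-template utilities, so this is only a sketch.

```python
def to_chatml(messages: list[dict]) -> str:
    r"""Render {"role", "content"} messages in the ChatML layout
    <|im_start|>role\ncontent<|im_end|>\n described above."""
    rendered = ""
    for msg in messages:
        rendered += f"<|im_start|>{msg['role']}\n{msg['content']}<|im_end|>\n"
    return rendered

print(to_chatml([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Summarize RLHF in one sentence."},
]))
```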

Rejection Sampling (RS) is a simple preference fine-tuning technique. It involves generating multiple completions for a prompt using the current model, scoring them with a trained reward model, selecting the top-N based on reward (either top per prompt or top overall), and then fine-tuning the model on these selected high-reward completions using standard SFT. This process can be iterated. Best-of-N (BoN) sampling is a related technique used at inference time, which selects the best completion from N generations based on the RM score, without modifying the model itself.
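
A schematic of the rejection-sampling loop described above; `generate` and `reward_model` are hypothetical callables standing in for the current policy and the trained RM, and this variant keeps the single best completion per prompt (top-overall selection would pool scores across prompts instead).

```python
from typing import Callable, List

def rejection_sample(prompts: List[str],
                     generate: Callable[[str, int], List[str]],
                     reward_model: Callable[[str, str], float],
                     n_samples: int = 8) -> List[dict]:
    """For each prompt, sample n completions, score them with the reward model,
    and keep the highest-scoring one as new SFT data."""
    sft_data = []
    for prompt in prompts:
        completions = generate(prompt, n_samples)
        scores = [reward_model(prompt, c) for c in completions]
        best = completions[scores.index(max(scores))]
        sft_data.append({"prompt": prompt, "completion": best})
    return sft_data
```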

Policy Gradient Algorithms like REINFORCE [130], PPO [133], and GRPO [136] are used to optimize the LLM policy directly using the reward signal (from an RM or verifiable source). These algorithms maximize the expected cumulative reward (return). The core idea is to update the policy parameters in the direction of the gradient of the expected return. The gradient often involves an advantage function A(s,a), which measures how much better an action is than the expected value of the state.

  • REINFORCE is a basic Monte Carlo policy gradient algorithm that uses the total return (possibly baselined) to update parameters. Variants like REINFORCE Leave One Out (RLOO) [128] use the average reward of other samples in the batch as a baseline for variance reduction, particularly when multiple responses per prompt are generated.
  • Proximal Policy Optimization (PPO) [133] is a popular algorithm that uses a clipped surrogate objective to constrain policy updates within a trust region, improving stability. PPO typically requires learning a separate value network to estimate the advantage function. The loss involves a ratio of the new policy probability to the old policy probability.
  • Group Relative Policy Optimization (GRPO) [136] is a PPO-inspired algorithm that avoids learning a separate value function. It computes the advantage from the rewards of a group of responses generated for the same prompt (e.g., normalizing by the mean and standard deviation of rewards within the group) and integrates the KL penalty directly into the loss function.

Implementation details for these algorithms involve loss aggregation (per-token vs. per-sequence mean or sum), managing KL penalties, and handling value function training (for PPO). Auxiliary topics include Generalized Advantage Estimation (GAE) [129] for improved advantage estimation and the concept of "double regularization" (the algorithm's internal regularization plus the external KL penalty). A minimal sketch of the group-normalized advantage and a clipped per-token surrogate loss follows.
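
This sketch assumes per-token log probabilities under the current and sampling policies are already computed and that one group of completions shares a single prompt; variable names, shapes, and hyperparameters are illustrative rather than taken from the GRPO or PPO papers.

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """rewards: (group_size,) rewards for completions of the same prompt.
    The advantage is the reward normalized by the group mean and std."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def clipped_surrogate_loss(new_logprobs: torch.Tensor,
                           old_logprobs: torch.Tensor,
                           advantages: torch.Tensor,
                           clip_eps: float = 0.2) -> torch.Tensor:
    """PPO-style clipped objective applied per token.

    new_logprobs, old_logprobs: (group_size, seq) token log-probs under the
    current and sampling policies; advantages: (group_size,) per completion.
    """
    ratio = torch.exp(new_logprobs - old_logprobs)
    adv = advantages.unsqueeze(-1)                       # broadcast over tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    # Pessimistic (minimum) objective, negated so it can be minimized as a loss.
    return -torch.min(unclipped, clipped).mean()
```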

Direct Alignment Algorithms (DAAs) like Direct Preference Optimization (DPO) [19] offer an alternative to RL by optimizing the policy directly from pairwise preference data without training a separate RM. DPO derives an objective function based on the log-probability ratio of chosen and rejected completions under the policy and a reference model. The loss minimizes the negative log-likelihood of the model predicting the chosen completion as better than the rejected one, weighted by a β parameter that controls the implicit KL regularization. DPO effectively fits an implicit reward model whose optimal policy can be derived in closed form. While simpler than RL, DPO has numerical concerns and weaknesses, including potential overfitting and preference displacement (reducing the probability of both chosen and rejected responses, but more so for rejected). Variants like IPO [156], REBEL [118], and ODPO [157] attempt to address these. DAAs are widely used due to their simplicity, although some studies suggest RL-based methods might achieve slightly higher peak performance, potentially due to their use of online data generation.
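
Written out, the DPO objective described above takes the following standard form, with y_c and y_r the chosen and rejected completions, σ the logistic function, and β the implicit-KL weight.

```latex
% DPO loss: widen the gap between the policy/reference log-ratio of the chosen
% completion y_c and that of the rejected completion y_r, scaled by \beta.
\mathcal{L}_{\mathrm{DPO}}(\pi; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_c,\, y_r)}
    \Big[ \log \sigma\Big(
        \beta \log \frac{\pi(y_c \mid x)}{\pi_{\mathrm{ref}}(y_c \mid x)}
      - \beta \log \frac{\pi(y_r \mid x)}{\pi_{\mathrm{ref}}(y_r \mid x)}
    \Big) \Big]
```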

Constitutional AI (CAI) [18] and RL from AI Feedback (RLAIF) [172] are techniques that leverage AI models to generate or augment feedback data. CAI uses a set of human-written principles ("constitution") to guide an LLM to critique and revise its own outputs or generate pairwise preference data based on these principles. This synthetic data is then used for SFT and RLHF. RLAIF provides a cheaper alternative to human data, opening up RLHF experimentation. While some benchmarks show RLAIF performing comparably to human data, the long-term trade-offs regarding fine-grained control and new capabilities are still being explored. Dedicated "judge" LLMs [93] have been developed for generating critiques.

Reasoning Training (Reinforcement Finetuning, RFT) and Inference-Time Scaling represent a newer focus for RL in post-training. Inspired by models like OpenAI's o1 [47] and DeepSeek R1 [138], this involves training models with Reinforcement Learning with Verifiable Rewards (RLVR), where a scoring function (e.g., checking the final answer in math) provides a positive reward for correct outputs and zero otherwise. This RL training on verifiable domains helps models learn to perform complex reasoning steps, often resulting in longer, more deliberate "chain-of-thought" outputs at inference. Unlike early RLHF length bias, this increased length is correlated with actual performance improvements. RFT involves iterating on sampling answers, taking gradient steps towards correct ones, and repeating, often for hundreds or thousands of epochs on the same data to reinforce desired behaviors.
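
A minimal sketch of a verifiable reward of the kind RLVR uses, here for math problems under the hypothetical convention that the model marks its final answer in \boxed{...}; the extraction rule and reward values are illustrative only.

```python
import re

def verifiable_math_reward(completion: str, reference_answer: str) -> float:
    """Return 1.0 if the final boxed answer matches the reference, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", completion)
    if not matches:
        return 0.0  # no parseable final answer -> no reward
    predicted = matches[-1].strip()
    return 1.0 if predicted == reference_answer.strip() else 0.0

# A correct completion earns reward 1.0; anything else earns 0.0.
print(verifiable_math_reward(r"... so the answer is \boxed{42}.", "42"))  # 1.0
```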

Synthetic Data and Distillation have become indispensable in post-training due to the cost and difficulty of obtaining high-quality human data. Synthetic data, generated by stronger AI models, is used extensively for generating prompts, completions, AI feedback, and filtering data. While concerns about "model collapse" from training on synthetic data have been raised [197], leading models successfully use it by employing diverse data sources and reinforcement learning [198]. Distillation, colloquially meaning training smaller models on outputs from larger ones, takes the form of using larger models as data engines or for transferring specific skills (e.g., math reasoning) [203].

Evaluation of RLHF and post-trained models is a rapidly evolving area. Early "chat-phase" evaluations focused on general conversational quality (MT-Bench [86], AlpacaEval [87]). The "multi-skill era" expanded to include a wider range of benchmarks covering knowledge, reasoning, math, coding, and safety (MMLU [205], MATH [210], HumanEval [212], IFEval [214]). The current era focuses on challenging reasoning and tool-use tasks (GPQA [215], SWE-Bench+ [217]). Prompting formats for evaluation have evolved from few-shot to zero-shot and now increasingly incorporate chain-of-thought generation. Challenges in evaluation include the distinction between using public benchmarks for training vs. observing them as test sets, the potential for training data contamination [229], and the difficulty in guaranteeing reproducibility across different labs and evaluation tools [232].

Over-optimization refers to the phenomenon where optimizing a proxy objective (like RM score) eventually leads to a decrease in performance on the true objective (real-world usefulness). This is a known issue in RL [240] and is framed as an instance of Goodhart's law. Qualitative signs include models producing generic phrases, repetitive text, or exhibiting over-refusal or sycophancy [243]. Quantitative studies measure over-optimization by tracking performance against KL divergence from the reference policy or using train/test splits for the reward model. Over-optimization is considered fundamental to RLHF with learned rewards but can be mitigated through larger models, RM ensembles, or different optimization strategies. Misalignment, such as promoting sycophancy, can be a consequence of over-optimizing the proxy reward.

The notion that RLHF is merely Style and Information transfer is discussed. While RLHF clearly influences the style (e.g., "chattiness"), style is argued to be intertwined with information and valuable for user experience. The "chattiness paradox" describes how RLHF can boost scores on some benchmarks by increasing response length, which aligns with average user preferences observed in evaluations like ChatBotArena, but doesn't always translate to better performance on harder tasks. Chattiness emerges because preference datasets often favor responses from models with a certain stylistic tendency (like GPT-4), and RLHF methods increase the probability of sequences exhibiting these styles.

Finally, the paper touches on the role of RLHF in Product, UX, and Model Character. As LLMs become products, RLHF is used to fine-tune nuanced aspects of model behavior beyond basic instruction following or factual correctness. Character training, an important but understudied area, uses post-training techniques (like CAI) to instill traits like curiosity or thoughtfulness, focusing on the manner rather than just the content of responses. Model Specifications provide a public document detailing intended model behaviors, serving as a valuable reference point for designers, developers, and the public to understand the goals behind model alignment. RLHF and post-training techniques are seen as key interface points for rapidly incorporating new features and UX improvements into product cycles.
