This paper investigates why a complex, two-stage training process involving a reward model and reinforcement learning is more effective for fine-tuning foundation models than directly optimizing the policy using maximum likelihood estimation.
Here's a breakdown of the key aspects:
1. Background and Motivation
The paper starts by addressing a fundamental question: Why do today's most powerful models rely on a two-stage process for preference fine-tuning (PFT), also known as reinforcement learning from human feedback (RLHF)? In this process, a reward model (RM) is first trained on a dataset of preferences. Then, this reward model guides a reinforcement learning (RL) procedure to optimize the policy's parameters. This seems odd because directly maximizing the likelihood of preferred completions relative to dis-preferred ones using maximum likelihood estimation (MLE) should, in theory, be more efficient.
The authors highlight that despite the information loss that occurs when using a reward model (as suggested by the data processing inequality), the two-stage online techniques consistently outperform simpler offline approaches. This performance gap motivates the central question: What is the value of this two-stage interactive fine-tuning if the ultimate goal is simply to maximize data likelihood?
2. Theoretical Analysis: Information Geometry of Fine-Tuning
The paper begins with a theoretical analysis, using information geometry, to understand the relationship between online and offline PFT.
- Markov Decision Process (MDP): The authors model the problem using a finite-horizon, reward-free MDP. Key elements include (a code sketch of this setup follows the notation list):
- $\mathcal{S}_0$: Set of initial states (prompts)
- $\rho \in \Delta(\mathcal{S}_0)$: Distribution of initial states
- $\mathcal{A}$: Action space (set of tokens)
- $\mathcal{S}$: State space (set of partial generations)
- $\mathcal{T}$: Deterministic dynamics, where $s_{h+1} = \mathcal{T}(s_h, a_h)$ is the next state after taking action $a_h$ in state $s_h$ (i.e., appending token $a_h$ to the prefix $s_h$)
- $H$: Horizon, or maximum generation length
- $\pi: \mathcal{S} \to \Delta(\mathcal{A})$: Policy that maps a prefix to a distribution over next tokens
- $\xi = (s_0, a_0, \dots, s_{H-1}, a_{H-1})$: Trajectory generated by sampling an initial state $s_0 \sim \rho$ and then sampling from the policy $H$ times.
- $P_\pi(\xi) = \rho(s_0) \prod_{h=0}^{H-1} \pi(a_h \mid s_h)$: Probability of sampling a trajectory under policy $\pi$.
- $\Pi$: Set of policies.
- $\mathcal{D} = \{(\xi_i^+, \xi_i^-)\}_{i=1}^{N}$: Dataset of trajectory-level preferences, where $\xi^+ \succ \xi^-$ indicates that trajectory $\xi^+$ is preferred over $\xi^-$
- $\mathrm{Unif}(\mathcal{D})$: Uniform distribution over the dataset
- $\Xi$: Full space of trajectories
- $\Xi_s$: Set of trajectories with some prefix $s$
- $\pi_{\mathrm{ref}}$: Reference policy
- Reward Models: The paper distinguishes between global and local reward models. Global reward models assign a scalar value to an entire trajectory, while local reward models decompose the reward into a sum of per-token rewards (related to policy log probabilities).
- $\Pi$: Set of policies
- $\mathcal{R}$: Set of global reward models $r: \Xi \to \mathbb{R}$, with $r(\xi)$ bounded for all $\xi \in \Xi$.
The set of local RMs is defined as:
$$\mathcal{R}_{L} = \left\{ r(\xi) = \sum_{h=0}^{H-1} \log \pi(a_h \mid s_h) \;\middle|\; \pi \in \Pi \right\},$$
where $r(\xi)$ is the reward for trajectory $\xi$, $\pi(a_h \mid s_h)$ is the probability of taking action $a_h$ in state $s_h$ under policy $\pi$, and $H$ is the horizon. (Both kinds of RM appear in the sketch below.)
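To make this setup concrete, here is a minimal Python sketch (the toy vocabulary, uniform policy, and reward functions are illustrative placeholders, not the paper's models): states are token prefixes, the dynamics append the sampled token, a trajectory's probability factorizes over per-token policy probabilities, and the global/local RM distinction is simply whether a trajectory is scored by an arbitrary scalar function or by a policy's summed log-probabilities.

```python
import math
import random

VOCAB = ["the", "cat", "sat", "<eos>"]  # illustrative action space A (tokens)
HORIZON = 4                              # H: maximum generation length

def toy_policy(prefix):
    """Illustrative policy pi(. | s): uniform over next tokens (ignores the prefix)."""
    return {token: 1.0 / len(VOCAB) for token in VOCAB}

def sample_trajectory(prompt, policy, horizon=HORIZON, seed=0):
    """Sample xi = (s_0, a_0, ..., s_{H-1}, a_{H-1}) by rolling the policy forward."""
    rng = random.Random(seed)
    state, trajectory = tuple(prompt), []
    for _ in range(horizon):
        probs = policy(state)
        action = rng.choices(list(probs), weights=list(probs.values()))[0]
        trajectory.append((state, action))
        state = state + (action,)  # deterministic dynamics: append the chosen token
    return trajectory

def trajectory_log_prob(trajectory, policy):
    """log P_pi(xi) = sum_h log pi(a_h | s_h) (the prompt s_0 is given)."""
    return sum(math.log(policy(s)[a]) for s, a in trajectory)

# Global RM: any map from a full trajectory to a single scalar.
def toy_global_reward(trajectory):
    """Illustrative global RM: reward generations that end with <eos>."""
    return 1.0 if trajectory and trajectory[-1][1] == "<eos>" else 0.0

# Local RM: induced by a policy, decomposes into per-token log-probabilities.
def local_reward(trajectory, policy=toy_policy):
    """r_pi(xi) = sum_h log pi(a_h | s_h)."""
    return trajectory_log_prob(trajectory, policy)

xi = sample_trajectory(prompt=("summarize:",), policy=toy_policy)
print(toy_global_reward(xi), local_reward(xi))
```

Note that the set of local RMs is in one-to-one correspondence with the policy class, which is what the equivalence theorems below exploit.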
Fine-Tuning Objective: The paper formulates fine-tuning as a reverse KL-regularized policy optimization problem:
$$\min_{\pi \in \Pi} \; \mathrm{KL}\!\left(P_{\mathcal{D}} \,\|\, P_\pi\right) + \beta\, \mathrm{KL}\!\left(P_\pi \,\|\, P_{\pi_{\mathrm{ref}}}\right)$$
- $\mathrm{KL}(P_{\mathcal{D}} \,\|\, P_\pi)$: Forward KL divergence measuring how well the learned policy matches the data distribution $P_{\mathcal{D}}$.
- $\mathrm{KL}(P_\pi \,\|\, P_{\pi_{\mathrm{ref}}})$: Reverse KL divergence ensuring the learned policy stays close to the reference policy $\pi_{\mathrm{ref}}$.
- $\beta$: Regularization coefficient.
For simplicity, the authors take the reference policy to be uniform and replace the KL regularization with entropy regularization, leading to:
$$\min_{\pi \in \Pi} \; \mathrm{KL}\!\left(P_{\mathcal{D}} \,\|\, P_\pi\right) - \beta\, \mathcal{H}(\pi)$$
* $\mathcal{H}(\pi)$ is the (causal) entropy of the policy. (A loss-level sketch of this objective follows.)
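As a rough sketch of what this objective looks like as a training loss (random PyTorch tensors stand in for a real model's outputs; this is not the paper's training code): the forward KL term reduces to the usual negative log-likelihood on the data, and the entropy term is estimated from the model's own next-token distributions.

```python
import torch
import torch.nn.functional as F

def mle_entropy_loss(logits, target_tokens, beta=0.1):
    """Forward KL to the data (i.e., NLL) minus beta times causal entropy.

    logits:        (batch, seq_len, vocab) next-token logits from the policy
    target_tokens: (batch, seq_len) tokens of the data trajectories
    """
    log_probs = F.log_softmax(logits, dim=-1)

    # Forward KL to the data distribution = negative log-likelihood of the data
    # (up to a constant that does not depend on the policy).
    nll = -log_probs.gather(-1, target_tokens.unsqueeze(-1)).squeeze(-1).mean()

    # Causal entropy of the policy, here estimated at the observed prefixes.
    entropy = -(log_probs.exp() * log_probs).sum(-1).mean()

    return nll - beta * entropy

# Usage with random tensors standing in for a real model's outputs.
logits = torch.randn(2, 5, 11)
targets = torch.randint(0, 11, (2, 5))
print(mle_entropy_loss(logits, targets))
```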
Maximum Likelihood Estimation (MLE): The paper shows how MLE can be used to fit both global reward models and policies. Under the Bradley-Terry (BT) model of preferences, the probability of preferring trajectory $\xi^1$ over $\xi^2$ given the same initial state is:
$$P(\xi^1 \succ \xi^2) = \sigma\!\left(r(\xi^1) - r(\xi^2)\right)$$
- $\succ$: means "preferred to"
- $\sigma(x) = 1/(1 + e^{-x})$: Sigmoid function
The global RM is fit via MLE:
$$\hat{r} = \arg\max_{r \in \mathcal{R}} \; \mathbb{E}_{(\xi^+, \xi^-) \sim \mathcal{D}}\!\left[\log \sigma\!\left(r(\xi^+) - r(\xi^-)\right)\right]$$
* $\xi^+$: Preferred trajectory in a pair
* $\xi^-$: Dis-preferred trajectory in a pair
Similarly, a policy can be fit via MLE by substituting the sum of log probabilities $\sum_{h} \log \pi(a_h \mid s_h)$ for $r(\xi)$:
$$\hat{\pi} = \arg\max_{\pi \in \Pi} \; \mathbb{E}_{(\xi^+, \xi^-) \sim \mathcal{D}}\!\left[\log \sigma\!\left(\sum_{h} \log \pi(a_h^+ \mid s_h^+) - \sum_{h} \log \pi(a_h^- \mid s_h^-)\right)\right]$$
(Both losses are sketched in code below.)
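A minimal sketch of the two MLE objectives, assuming trajectory-level scores are already available (the score functions here are placeholders, not the paper's models): the only difference between them is whether the Bradley-Terry logit comes from a global RM or from a policy's summed log-probabilities.

```python
import math

def bt_log_likelihood(score_fn, preference_pairs):
    """Average Bradley-Terry log-likelihood: E[log sigma(score(xi+) - score(xi-))]."""
    def log_sigmoid(x):
        # Numerically stable log(1 / (1 + exp(-x))).
        return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))
    return sum(
        log_sigmoid(score_fn(preferred) - score_fn(dispreferred))
        for preferred, dispreferred in preference_pairs
    ) / len(preference_pairs)

# Fitting a global RM maximizes bt_log_likelihood with score_fn = r(xi).
# Fitting a policy directly maximizes the same objective with
# score_fn = sum_h log pi(a_h | s_h), i.e., a local reward model.

# Toy usage: "trajectories" are strings, scored by length.
pairs = [("a longer preferred summary", "short"), ("good answer", "bad")]
print(bt_log_likelihood(len, pairs))
```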
Maximum Entropy: Given a global reward model $\hat{r}$, the soft-optimal policy is computed as:
$$\pi^*_{\hat{r}} = \arg\max_{\pi \in \Pi} \; \mathbb{E}_{\xi \sim P_\pi}\!\left[\hat{r}(\xi)\right] + \beta\, \mathcal{H}(\pi)$$
The paper proves that solving this soft RL problem is equivalent to a reverse KL projection of $Q_{\hat{r}}$ (the trajectory distribution induced by the soft-optimal policy, $Q_{\hat{r}}(\xi) \propto \exp(\hat{r}(\xi)/\beta)$) onto the set of policy-induced trajectory distributions $\{P_\pi : \pi \in \Pi\}$. (A small numerical example follows.)
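To see the soft-optimal distribution concretely, here is a tiny numerical example (toy rewards, not from the paper) computing the Gibbs distribution $Q_r(\xi) \propto \exp(r(\xi)/\beta)$ over an enumerable trajectory set; the RL step can then be read as finding the policy whose induced trajectory distribution is closest to this target in reverse KL.

```python
import math

def gibbs_distribution(rewards, beta=0.5):
    """Q_r(xi) proportional to exp(r(xi) / beta) over a finite trajectory set."""
    weights = {xi: math.exp(r / beta) for xi, r in rewards.items()}
    total = sum(weights.values())
    return {xi: w / total for xi, w in weights.items()}

def reverse_kl(p, q):
    """KL(p || q) for distributions over the same finite support."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

# Toy trajectory-level rewards.
rewards = {"summary A": 1.0, "summary B": 0.2, "summary C": -0.5}
target = gibbs_distribution(rewards, beta=0.5)

# A candidate policy-induced distribution; the RL step picks the member of the
# policy class minimizing reverse_kl(P_pi, Q_r).
candidate = {"summary A": 0.6, "summary B": 0.3, "summary C": 0.1}
print("target Q_r:", target)
print("KL(P_pi || Q_r):", reverse_kl(candidate, target))
```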
Equivalence Theorems: The paper presents key equivalence theorems:
- Theorem 2.2: If the reward and policy classes are isomorphic (i.e., each reward model in $\mathcal{R}$ corresponds to exactly one policy in $\Pi$ and vice versa), then RLHF is equivalent to MLE.
- Theorem 2.3: RLHF is equivalent to Direct Preference Optimization (DPO) when the reward class is the set of local RMs induced by the policy class, i.e., $\mathcal{R} = \mathcal{R}_L$.
- Theorem 2.4: A variant of the online SPIN algorithm (which is online DPO with supervised fine-tuning data) is equivalent to vanilla offline SFT.
3. Empirical Investigation: The Value of RL in Fine-Tuning
To reconcile the theoretical equivalences with empirical observations, the authors conduct a series of controlled experiments. They focus on learning to summarize from preference feedback using the Pythia series of models. They use DPO as the optimizer for both offline and online PFT to control for confounding variables.
The experiments reveal a significant performance gap between online and offline DPO, even when both use the same SFT checkpoint and the same training data, seemingly contradicting the equivalence results of Section 2.
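The distinction being tested can be summarized in pseudocode (a schematic sketch, not the authors' implementation; `sample_completions`, `label_preferences`, and `dpo_update` are hypothetical stand-ins): offline DPO iterates over a fixed preference dataset, while online DPO repeatedly samples fresh completions from the current policy, labels them, and applies the same DPO update.

```python
import random

# Hypothetical stand-ins for the real components (model, judge, optimizer).
def sample_completions(policy, prompt, n=2):
    return [f"{prompt} :: completion {i} ({policy})" for i in range(n)]

def label_preferences(completions):
    preferred, dispreferred = sorted(completions, key=len, reverse=True)[:2]
    return preferred, dispreferred

def dpo_update(policy, preferred, dispreferred):
    # Placeholder: a real implementation would take a gradient step on the DPO
    # loss -log sigma(beta * (log-ratio(preferred) - log-ratio(dispreferred))).
    return policy

def offline_dpo(policy, preference_dataset, epochs=1):
    """Offline PFT: the preference pairs are fixed ahead of time."""
    for _ in range(epochs):
        for preferred, dispreferred in preference_dataset:
            policy = dpo_update(policy, preferred, dispreferred)
    return policy

def online_dpo(policy, prompts, rounds=3):
    """Online PFT: pairs are drawn from the *current* policy each round, then labeled."""
    for _ in range(rounds):
        prompt = random.choice(prompts)
        completions = sample_completions(policy, prompt)
        preferred, dispreferred = label_preferences(completions)
        policy = dpo_update(policy, preferred, dispreferred)
    return policy

online_dpo("pi_sft", prompts=["summarize: ..."], rounds=2)
```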
4. Hypotheses for the Online-Offline Gap
The authors explore several hypotheses to explain the discrepancy between theory and practice:
- H1: Intrinsic Value of Online Samples: The idea that feedback on samples more likely under the current policy is more useful. The authors argue against this, stating that on-policy data is redundant from an information-theoretic perspective.
- H2: Failure of Offline PFT Regularization to Reference Policy: Offline PFT algorithms may require stronger coverage conditions than online approaches because they struggle to effectively regularize to the reference policy. The authors provide evidence against this, noting that simply adding a reverse KL penalty to DPO (one such penalty is sketched after this list) doesn't close the gap.
- H3: Relative Ease of Online PFT Optimization: The possibility that offline PFT faces a harder optimization problem. The authors argue against this because they use the same optimizer (DPO) for both online and offline settings. They also explore a refined hypothesis related to computational-statistical gaps but find little support for it through prompt augmentation experiments.
- H4: Global Reward Models Can Be Trained on More Data: This suggests global RMs are more amenable to training on wider data distributions than local RMs/policies. The authors generate a more concentrated dataset using samples only from the SFT policy and find that online DPO still improves performance.
- H5: Global Reward Models Generalize Better Out-of-Distribution (OOD): This suggests that online PFT techniques are better at maximizing rewards when the learned reward model's peak falls outside the preference data's support. The authors train local RMs without regularization and compare them to global and DPO RMs, finding that better in-distribution margins correlate with better OOD generalization.
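For reference, here is one way a reverse KL penalty can be bolted onto the DPO loss, as discussed under H2 (a sketch under the assumption of a standard per-token implementation, not necessarily the exact variant the authors tested):

```python
import torch
import torch.nn.functional as F

def dpo_with_reverse_kl(policy_logratios_pref, policy_logratios_disp,
                        policy_logits, ref_logits, beta=0.1, kl_coeff=0.05):
    """DPO loss plus an explicit reverse KL(pi || pi_ref) penalty.

    policy_logratios_*: (batch,) summed log pi/pi_ref over the preferred /
                        dis-preferred completions.
    policy_logits, ref_logits: (batch, seq_len, vocab) next-token logits used
                        to estimate the per-token reverse KL penalty.
    """
    # Standard DPO term: -log sigma(beta * (preferred margin over dis-preferred)).
    dpo_loss = -F.logsigmoid(beta * (policy_logratios_pref - policy_logratios_disp)).mean()

    # Reverse KL penalty: KL(pi(.|s) || pi_ref(.|s)) averaged over positions.
    log_p = F.log_softmax(policy_logits, dim=-1)
    log_q = F.log_softmax(ref_logits, dim=-1)
    reverse_kl = (log_p.exp() * (log_p - log_q)).sum(-1).mean()

    return dpo_loss + kl_coeff * reverse_kl

# Usage with random tensors standing in for real model outputs.
b, t, v = 2, 5, 11
loss = dpo_with_reverse_kl(torch.randn(b), torch.randn(b),
                           torch.randn(b, t, v), torch.randn(b, t, v))
print(loss)
```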
5. Generation-Verification Gap and Proper Policy Learning
The paper proposes an alternative hypothesis based on the idea that, for many problems, verification is simpler than generation. They hypothesize that the first step of online fine-tuning finds a relatively simple reward model, and the second step finds a soft-optimal policy for that reward model. This reduces the search space compared to offline fine-tuning, which searches over all possible policies.
- H6: Online PFT is Proper Policy Learning: This states that offline FT solves a harder, improper learning problem, while online FT avoids it by performing proper learning over a restricted policy space.
The authors provide evidence supporting H6:
- Experiments show that using a significantly smaller global RM leads to nearly identical Best-of-N (BoN) performance as using an RM the same size as the policy (BoN sampling is sketched after this list).
- They design experiments to close the generation-verification gap, such as using two-word summaries or the ROUGE-L metric as the reward function. In these cases, the gap between online and offline PFT diminishes or disappears.
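Best-of-N sampling, the verification-only procedure referenced above, is simple enough to state directly (the generator and reward function here are toy stand-ins, not the authors' models): generate N candidates from the policy and return the one the reward model scores highest.

```python
import random

def best_of_n(generate, reward_model, prompt, n=8, seed=0):
    """Sample n completions from the policy and keep the highest-reward one."""
    rng = random.Random(seed)
    candidates = [generate(prompt, rng) for _ in range(n)]
    return max(candidates, key=reward_model)

# Toy stand-ins: a "policy" that emits random-length summaries and a
# "reward model" that prefers shorter ones.
def toy_generate(prompt, rng):
    return prompt[: rng.randint(5, len(prompt))]

def toy_reward(summary):
    return -len(summary)

print(best_of_n(toy_generate, toy_reward, "a long document to be summarized"))
```

The relevant observation in the paper's experiments is that `reward_model` can be a much smaller network than the generator with little loss in BoN performance, which is the evidence that verification is easier than generation for these tasks.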
6. Isomorphisms Are Not a Two-Way Street
The paper addresses the question of why it is harder to learn a policy than a reward model if the two are isomorphic under soft RL. They explain that while the isomorphisms exist, the mapping from a global reward model to its soft-optimal policy requires solving a hard reinforcement learning problem. In contrast, the mapping between a policy and its Q-function (a local reward model) is direct and closed-form. This suggests that optimizing over local reward models, as DPO does, doesn't escape the statistical difficulty of directly learning the generator.
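The "direct" mapping can be made concrete: under entropy regularization, a policy and a soft Q-function determine each other in closed form via a softmax, with no RL required in either direction. Below is a sketch with toy numbers, assuming the standard soft-RL relations $\pi(a \mid s) = \exp\big((Q(s,a) - V(s))/\beta\big)$ and $V(s) = \beta \log \sum_a \exp(Q(s,a)/\beta)$.

```python
import math

BETA = 0.5

def policy_from_q(q_values, beta=BETA):
    """pi(a|s) = exp((Q(s,a) - V(s)) / beta), with V(s) = beta * log sum_a exp(Q/beta)."""
    v = beta * math.log(sum(math.exp(q / beta) for q in q_values.values()))
    return {a: math.exp((q - v) / beta) for a, q in q_values.items()}

def q_from_policy(policy, v, beta=BETA):
    """Inverse map: Q(s,a) = beta * log pi(a|s) + V(s)."""
    return {a: beta * math.log(p) + v for a, p in policy.items()}

q = {"token_a": 1.0, "token_b": 0.0, "token_c": -1.0}
pi = policy_from_q(q)
v = BETA * math.log(sum(math.exp(qa / BETA) for qa in q.values()))
print(pi)
print(q_from_policy(pi, v))  # recovers the original Q-values
```

Going from a global reward model to its soft-optimal policy, by contrast, requires actually solving the entropy-regularized RL problem.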