This paper investigates why a complex, two-stage training process involving a reward model and reinforcement learning is more effective for fine-tuning foundation models than directly optimizing the policy using maximum likelihood estimation.
Here's a breakdown of the key aspects:
1. Background and Motivation
The paper starts by addressing a fundamental question: Why do today's most powerful models rely on a two-stage process for preference fine-tuning (PFT), also known as reinforcement learning from human feedback (RLHF)? In this process, a reward model (RM) is first trained on a dataset of preferences. Then, this reward model guides a reinforcement learning (RL) procedure to optimize the policy's parameters. This seems odd because directly maximizing the likelihood of preferred completions relative to dis-preferred ones using maximum likelihood estimation (MLE) should, in theory, be more efficient.
The authors highlight that despite the information loss that occurs when using a reward model (as suggested by the data processing inequality), the two-stage online techniques consistently outperform simpler offline approaches. This performance gap motivates the central question: What is the value of this two-stage interactive fine-tuning if the ultimate goal is simply to maximize data likelihood?
2. Theoretical Analysis: Information Geometry of Fine-Tuning
The paper begins with a theoretical analysis, using information geometry, to understand the relationship between online and offline PFT.
- Markov Decision Process (MDP): The authors model the problem using a finite-horizon, reward-free MDP. Key elements include (a short code sketch follows this list):
- X: Set of initial states (prompts)
- P0: Distribution of initial states
- A: Action space (set of tokens)
- S: State space (set of partial generations)
- T(s′∣s,a): Deterministic dynamics, where s′ is the next state after taking action a in state s.
- H: Horizon or maximum generation length
- π:S→Δ(A): Policy that maps a prefix to a distribution over next tokens
- ξ: Trajectory generated by sampling an initial state and then sampling from the policy H times.
- Pπ(ξ): Probability of sampling a trajectory ξ under policy π.
- Π: Set of policies.
- D = {(ξ_i, ≻_i)}_{i=1}^{N}: Dataset of trajectory-level preferences, where ξ_i ≻_i ξ_i′ indicates that trajectory ξ_i is preferred over ξ_i′
- PD: Uniform distribution over the dataset D
- Ξ: Full space of trajectories
- Ξ_{s_{0:h}}: Set of trajectories that share the prefix s_{0:h}
- πref∈Π: Reference policy
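To make this notation concrete, here is a minimal Python sketch of one way trajectories and the preference dataset D could be represented; the class and field names are illustrative, not taken from the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Trajectory:
    """A trajectory ξ: an initial state (prompt) s0 followed by H sampled tokens."""
    prompt: str          # s0 ~ P0
    tokens: List[str]    # a0, ..., a_{H-1}, sampled autoregressively from π

    def prefix(self, h: int) -> Tuple[str, ...]:
        """The partial generation s_h under deterministic dynamics (prefix concatenation)."""
        return (self.prompt, *self.tokens[:h])

# D: trajectory-level preferences, stored as (preferred, dis-preferred) pairs
# that share the same initial prompt.
PreferencePair = Tuple[Trajectory, Trajectory]
D: List[PreferencePair] = [
    (Trajectory("TL;DR:", ["a", "concise", "summary"]),
     Trajectory("TL;DR:", ["something", "rambling"])),
]
```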
- Reward Models: The paper distinguishes between global and local reward models. Global reward models assign a scalar value to an entire trajectory, while local reward models decompose the reward into a sum of per-token rewards (related to policy log probabilities).
- R: Set of reward models, where each r ∈ R maps a full trajectory to a scalar, r: Ξ → ℝ.
The set of local RMs is defined as:
R(Π) = { r_π(ξ) = Σ_{h=0}^{H} log π(a_h ∣ s_h) : π ∈ Π },
where r_π(ξ) is the reward that policy π implicitly assigns to trajectory ξ, π(a_h ∣ s_h) is the probability of taking action a_h in state s_h under policy π, and H is the horizon.
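As a sketch of this definition (using a toy tabular policy as a stand-in for an autoregressive language model), the "reward" a policy implicitly assigns to a trajectory is just its sequence log-probability:

```python
import math
from typing import Dict, List, Tuple

# Toy tabular policy: maps a prefix (state) to a distribution over next tokens.
TabularPolicy = Dict[Tuple[str, ...], Dict[str, float]]

def local_reward(pi: TabularPolicy, prompt: str, tokens: List[str]) -> float:
    """r_pi(ξ) = Σ_{h=0}^{H} log π(a_h | s_h)."""
    reward, state = 0.0, (prompt,)
    for tok in tokens:
        reward += math.log(pi[state][tok])
        state = (*state, tok)  # deterministic dynamics: append the chosen token
    return reward

pi: TabularPolicy = {
    ("TL;DR:",): {"good": 0.7, "bad": 0.3},
    ("TL;DR:", "good"): {"summary": 0.9, "noise": 0.1},
}
print(local_reward(pi, "TL;DR:", ["good", "summary"]))  # log 0.7 + log 0.9 ≈ -0.462
```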
Fine-Tuning Objective: The paper formulates fine-tuning as a reverse KL-regularized policy optimization problem:
π* = argmin_{π∈Π} D_KL(P_D ∣∣ P_π) + β·D_KL(P_π ∣∣ P_πref)
- DKL(PD∣∣Pπ): Forward KL divergence measuring how well the learned policy π matches the data distribution PD.
- DKL(Pπ∣∣Pπref): Reverse KL divergence ensuring the learned policy π stays close to the reference policy πref.
- β: Regularization coefficient.
For simplicity, the authors set β=1 and replace the KL regularization with entropy regularization, leading to:
π* = argmin_{π∈Π} D_KL(P_D ∣∣ P_π) − H(π)
- H(π) = E_{ξ∼π}[ −Σ_{h=0}^{H} log π(a_h ∣ s_h) ] is the (causal) entropy of the policy.
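Expanding the forward KL term, using only the definitions above and the deterministic-dynamics factorization log P_π(ξ) = log P_0(s_0) + Σ_{h=0}^{H} log π(a_h ∣ s_h), shows that this objective is exactly entropy-regularized maximum likelihood (terms that do not involve π are dropped):

$$
\arg\min_{\pi \in \Pi} \; D_{\mathrm{KL}}(P_D \,\|\, P_\pi) - H(\pi)
\;=\; \arg\min_{\pi \in \Pi} \; \mathbb{E}_{\xi \sim P_D}\Big[-\sum_{h=0}^{H} \log \pi(a_h \mid s_h)\Big] - H(\pi).
$$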
Maximum Likelihood Estimation (MLE): The paper shows how MLE can be used to fit both global reward models and policies. Under the Bradley-Terry (BT) model of preferences, the probability of preferring trajectory ξ1 over ξ2 given the same initial state s0 is:
P_BT(ξ_1 ≻ ξ_2 ∣ s_0) = σ(r(ξ_1) − r(ξ_2))
- ≻: means "preferred to"
- σ: Sigmoid function
The global RM is fit via MLE, i.e., by minimizing the KL divergence between the empirical preference distribution and the preference distribution P_BT^r that the BT model induces from r:
r̂_mle = argmin_{r∈R} D_KL(P_D ∣∣ P_BT^r)
      = argmax_{r∈R} Σ_{i=1}^{N} log σ(r(ξ_i+) − r(ξ_i−))
- ξ_i+: Preferred trajectory in pair i
- ξ_i−: Dis-preferred trajectory in pair i
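A minimal PyTorch-style sketch of this objective; the reward_model interface (a module mapping batched trajectory features to scalars) is an assumption for illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def bt_reward_loss(reward_model: torch.nn.Module,
                   preferred: torch.Tensor,
                   dispreferred: torch.Tensor) -> torch.Tensor:
    """Negative Bradley-Terry log-likelihood: -Σ_i log σ(r(ξ_i+) - r(ξ_i-)), averaged."""
    r_pos = reward_model(preferred).squeeze(-1)      # (N,)
    r_neg = reward_model(dispreferred).squeeze(-1)   # (N,)
    # -log σ(x) == softplus(-x), so maximizing the BT log-likelihood is
    # the same as minimizing softplus(-(r_pos - r_neg)).
    return F.softplus(-(r_pos - r_neg)).mean()

# Toy usage: score 8-dimensional trajectory features with a linear "RM".
rm = torch.nn.Linear(8, 1)
loss = bt_reward_loss(rm, torch.randn(4, 8), torch.randn(4, 8))
loss.backward()
```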
Similarly, a policy can be fit via MLE by substituting its sum of log probabilities, the local reward r_π, for a learned global reward:
π̂_mle = argmin_{r_π∈R(Π)} D_KL(P_D ∣∣ P_BT^{r_π})
      = argmax_{π∈Π} Σ_{i=1}^{N} log σ(r_π(ξ_i+) − r_π(ξ_i−))
      = argmax_{π∈Π} Σ_{i=1}^{N} log σ( Σ_{h=0}^{H} log π(a_{h,i}+ ∣ s_{h,i}+) − Σ_{h=0}^{H} log π(a_{h,i}− ∣ s_{h,i}−) )
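The policy-side loss is therefore the same expression with the learned scalar reward replaced by the sequence log-probability under π. A sketch over batched next-token logits (the tensor shapes are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def sequence_logprob(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Σ_{h=0}^{H} log π(a_h | s_h) per sequence.

    logits: (N, H, V) next-token logits from the policy; tokens: (N, H) sampled token ids.
    """
    logp = F.log_softmax(logits, dim=-1)
    return logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1).sum(dim=-1)

def policy_mle_loss(logits_pos, tokens_pos, logits_neg, tokens_neg) -> torch.Tensor:
    """-Σ_i log σ( log π(ξ_i+) - log π(ξ_i-) ), averaged over the batch."""
    margin = sequence_logprob(logits_pos, tokens_pos) - sequence_logprob(logits_neg, tokens_neg)
    return F.softplus(-margin).mean()

# Toy shapes: 4 preference pairs, length-5 generations, vocabulary of 11 tokens.
loss = policy_mle_loss(torch.randn(4, 5, 11), torch.randint(11, (4, 5)),
                       torch.randn(4, 5, 11), torch.randint(11, (4, 5)))
```

Replacing the raw log-probabilities with β-scaled log-ratios against a reference policy π_ref turns this margin into the familiar DPO objective.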
Maximum Entropy: Given a global reward model r, the soft-optimal policy π∗ is computed as:
π* = argmax_{π∈Π} E_{ξ∼π}[r(ξ)] + H(π)
The paper proves that solving this soft RL problem is equivalent to a reverse KL projection of the Gibbs distribution P*(ξ) ∝ exp(r(ξ)) onto the set of policy-induced trajectory distributions {P_π : π ∈ Π}; the solution of that projection is the trajectory distribution of the soft-optimal policy.
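Concretely, define the Gibbs distribution P*(ξ) ∝ exp(r(ξ)) over trajectories. Using the same deterministic-dynamics factorization of log P_π(ξ) as before, the soft-RL objective rewrites, up to terms independent of π, as

$$
\mathbb{E}_{\xi \sim P_\pi}\big[r(\xi)\big] + H(\pi) \;=\; -\,D_{\mathrm{KL}}\big(P_\pi \,\|\, P^*\big) + \mathrm{const},
$$

so maximizing the left-hand side over π ∈ Π is the same as minimizing D_KL(P_π ∣∣ P*) over the realizable set {P_π : π ∈ Π}.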
Equivalence Theorems: The paper presents key equivalence theorems:
- Theorem 2.2: If the reward and policy classes are isomorphic (i.e., R=R(Π)), then RLHF is equivalent to MLE.
- Theorem 2.3: RLHF is equivalent to Direct Preference Optimization (DPO) when R=R(Π).
- Theorem 2.4: A variant of the online SPIN algorithm (which is online DPO with supervised fine-tuning data) is equivalent to vanilla offline SFT.
3. Empirical Investigation: The Value of RL in Fine-Tuning
To reconcile the theoretical equivalences with empirical observations, the authors conduct a series of controlled experiments. They focus on learning to summarize from preference feedback using the Pythia series of models. They use DPO as the optimizer for both offline and online PFT to control for confounding variables.
The experiments reveal a significant performance gap between online and offline DPO, even when both start from the same SFT checkpoint and train on the same data. This is in apparent tension with the equivalence results of Section 2.
4. Hypotheses for the Online-Offline Gap
The authors explore several hypotheses to explain the discrepancy between theory and practice:
- H1: Intrinsic Value of Online Samples: The idea that feedback on samples more likely under the current policy is more useful. The authors argue against this, stating that on-policy data is redundant from an information-theoretic perspective.
- H2: Failure of Offline PFT Regularization to Reference Policy: Offline PFT algorithms may require stronger coverage conditions than online approaches because they struggle to effectively regularize to the reference policy. The authors provide evidence against this, noting that simply adding a reverse KL penalty to DPO doesn't close the gap.
- H3: Relative Ease of Online PFT Optimization: The possibility that offline PFT faces a harder optimization problem. The authors argue against this because they use the same optimizer (DPO) for both online and offline settings. They also explore a refined hypothesis related to computational-statistical gaps but find little support for it through prompt augmentation experiments.
- H4: Global Reward Models Can Be Trained on More Data: This suggests global RMs are more amenable to training on wider data distributions than local RMs/policies. The authors generate a more concentrated dataset using samples only from the SFT policy and find that online DPO still improves performance.
- H5: Global Reward Models Generalize Better Out-of-Distribution (OOD): This suggests that online PFT techniques are better at maximizing rewards when the learned reward model's peak falls outside the preference data's support. The authors train local RMs without regularization and compare them to global and DPO RMs, finding that better in-distribution margins correlate with better OOD generalization.
5. Generation-Verification Gap and Proper Policy Learning
The paper proposes an alternative hypothesis based on the idea that, for many problems, verification is simpler than generation. They hypothesize that the first step of online fine-tuning finds a relatively simple reward model, and the second step finds a soft-optimal policy for that reward model. This reduces the search space compared to offline fine-tuning, which searches over all possible policies.
- H6: Online PFT is Proper Policy Learning: This states that offline FT solves a harder, improper learning problem, while online FT avoids it by performing proper learning over a restricted policy space.
The authors provide evidence supporting H6:
- Experiments show that using a significantly smaller global RM leads to nearly identical Best-of-N (BoN) performance to using an RM the same size as the policy (BoN selection is sketched after this list).
- They design experiments to close the generation-verification gap, such as using two-word summaries or the ROUGE-L metric as the reward function. In these cases, the gap between online and offline PFT diminishes or disappears.
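For reference, the Best-of-N selection used in that comparison is straightforward to state in code; the sample_policy and reward_model callables below are placeholders for an actual generator and a (possibly much smaller) verifier, not the paper's implementation.

```python
import random
from typing import Callable, List, Tuple

def best_of_n(sample_policy: Callable[[str], str],
              reward_model: Callable[[str, str], float],
              prompt: str,
              n: int = 16) -> Tuple[str, float]:
    """Draw n candidate completions and keep the one the global RM scores highest."""
    candidates: List[str] = [sample_policy(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    best = max(range(n), key=scores.__getitem__)
    return candidates[best], scores[best]

# Toy usage with stand-in sampler and verifier.
pick, score = best_of_n(
    sample_policy=lambda p: f"{p} summary {random.randint(0, 99)}",
    reward_model=lambda p, c: -abs(len(c) - 20.0),  # toy verifier: prefer ~20-char outputs
    prompt="TL;DR:",
)
```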
6. Isomorphisms Are Not a Two-Way Street
The paper addresses the question of why it's harder to learn a policy than a reward model if the two are isomorphic in soft RL. They explain that while the isomorphisms exist, the mapping from reward to policy requires solving a hard reinforcement learning problem. In contrast, the mapping between a policy and its Q-function (a local reward model) is direct. This suggests that optimizing over local reward models, as DPO does, does not escape the statistical difficulty of directly learning the generator.
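For intuition, here are the standard soft-RL (entropy-regularized) relations, written for a per-step reward r(s, a) and the deterministic transition T; a trajectory-level reward can be treated as a terminal reward on the full generation:

$$
Q^*(s, a) = r(s, a) + V^*\big(T(s, a)\big), \qquad
V^*(s) = \log \sum_{a' \in A} \exp Q^*(s, a'), \qquad
\pi^*(a \mid s) = \exp\big(Q^*(s, a) - V^*(s)\big),
$$

with V* ≡ 0 at the end of the horizon. Recovering π* from a reward requires backing Q* up over the entire horizon (the hard RL step), whereas moving between a policy and its Q-function/local reward is a per-state normalization, log π*(a ∣ s) = Q*(s, a) − V*(s). This is the sense in which the reward-to-policy direction of the isomorphism is expensive while the policy-to-local-RM direction is essentially free.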