All Roads Lead to Likelihood: The Value of Reinforcement Learning in Fine-Tuning (2503.01067v1)

Published 3 Mar 2025 in cs.LG

Abstract: From a first-principles perspective, it may seem odd that the strongest results in foundation model fine-tuning (FT) are achieved via a relatively complex, two-stage training procedure. Specifically, one first trains a reward model (RM) on some dataset (e.g. human preferences) before using it to provide online feedback as part of a downstream reinforcement learning (RL) procedure, rather than directly optimizing the policy parameters on the dataset via offline maximum likelihood estimation. In fact, from an information-theoretic perspective, we can only lose information via passing through a reward model and cannot create any new information via on-policy sampling. To explain this discrepancy, we scrutinize several hypotheses on the value of RL in FT through both theoretical and empirical lenses. Of the hypotheses considered, we find the most support for the explanation that on problems with a generation-verification gap, the combination of the ease of learning the relatively simple RM (verifier) from the preference data, coupled with the ability of the downstream RL procedure to then filter its search space to the subset of policies (generators) that are optimal for relatively simple verifiers is what leads to the superior performance of online FT.

This paper investigates why a complex, two-stage training process involving a reward model and reinforcement learning is more effective for fine-tuning foundation models than directly optimizing the policy using maximum likelihood estimation.

Here's a breakdown of the key aspects:

1. Background and Motivation

The paper starts by addressing a fundamental question: Why do today's most powerful models rely on a two-stage process for preference fine-tuning (PFT), also known as reinforcement learning from human feedback (RLHF)? In this process, a reward model (RM) is first trained on a dataset of preferences. Then, this reward model guides a reinforcement learning (RL) procedure to optimize the policy's parameters. This seems odd because directly maximizing the likelihood of preferred completions relative to dis-preferred ones using maximum likelihood estimation (MLE) should, in theory, be more efficient.

The authors highlight that despite the information loss that occurs when using a reward model (as suggested by the data processing inequality), the two-stage online techniques consistently outperform simpler offline approaches. This performance gap motivates the central question: What is the value of this two-stage interactive fine-tuning if the ultimate goal is simply to maximize data likelihood?

2. Theoretical Analysis: Information Geometry of Fine-Tuning

The paper begins with a theoretical analysis, using information geometry, to understand the relationship between online and offline PFT.

  • Markov Decision Process (MDP): The authors model the problem using a finite-horizon, reward-free MDP. Key elements include:
    • $X$: Set of initial states (prompts)
    • $P_0$: Distribution over initial states
    • $A$: Action space (set of tokens)
    • $S$: State space (set of partial generations)
    • $T(s'|s, a)$: Deterministic dynamics, where $s'$ is the next state after taking action $a$ in state $s$
    • $H$: Horizon, or maximum generation length
    • $\pi : S \rightarrow \Delta(A)$: Policy that maps a prefix to a distribution over next tokens
    • $\xi$: Trajectory generated by sampling an initial state and then sampling from the policy $H$ times
    • $P_{\pi}(\xi)$: Probability of sampling trajectory $\xi$ under policy $\pi$
    • $\Pi$: Set of policies
    • $D = \{(\xi_i, \succ_i)\}_{i=1}^N$: Dataset of trajectory-level preferences, where $\xi_i \succ_i \xi'_i$ indicates that trajectory $\xi_i$ is preferred over $\xi'_i$
    • $P_D$: Uniform distribution over the dataset $D$
    • $\Xi$: Full space of trajectories
    • $\Xi_{s_{0:h}}$: Set of trajectories with prefix $s_{0:h}$
    • $\pi_{\mathrm{ref}} \in \Pi$: Reference policy
  • Reward Models: The paper distinguishes between global and local reward models. Global reward models assign a scalar value to an entire trajectory, while local reward models decompose the reward into a sum of per-token rewards (related to policy log probabilities).
    • $\Pi$: Set of policies
    • $R$: Set of reward models, with $r : \Xi \rightarrow \mathbb{R}$ for all $r \in R$

    The set of local RMs is defined as:

    $$R(\Pi) = \left\{ r(\xi) = \sum_{h=0}^{H} \log \pi(a_h|s_h) : \pi \in \Pi \right\}$$

    where $r(\xi)$ is the reward for trajectory $\xi$, $\pi(a_h|s_h)$ is the probability of taking action $a_h$ in state $s_h$ under policy $\pi$, and $H$ is the horizon.

  • Fine-Tuning Objective: The paper formulates fine-tuning as a reverse KL-regularized policy optimization problem:

    $$\pi^* = \underset{\pi \in \Pi}{\operatorname{argmin}} \; D_{KL}(P_D \,\|\, P_{\pi}) + \beta\, D_{KL}(P_{\pi} \,\|\, P_{\pi_{\mathrm{ref}}})$$

    • $D_{KL}(P_D \,\|\, P_{\pi})$: Forward KL divergence measuring how well the learned policy $\pi$ matches the data distribution $P_D$.
    • $D_{KL}(P_{\pi} \,\|\, P_{\pi_{\mathrm{ref}}})$: Reverse KL divergence ensuring the learned policy $\pi$ stays close to the reference policy $\pi_{\mathrm{ref}}$.
    • $\beta$: Regularization coefficient.

    For simplicity, the authors set $\beta = 1$ and replace the KL regularization with entropy regularization, leading to:

    $$\pi^* = \underset{\pi \in \Pi}{\operatorname{argmin}} \; D_{KL}(P_D \,\|\, P_{\pi}) - H(\pi)$$

    where $H(\pi) = \mathbb{E}_{\xi \sim \pi}\left[-\sum_{h} \log \pi(a_h|s_h)\right]$ is the (causal) entropy of the policy.

  • Maximum Likelihood Estimation (MLE): The paper shows how MLE can be used to fit both global reward models and policies. Under the Bradley-Terry (BT) model of preferences, the probability of preferring trajectory $\xi_1$ over $\xi_2$ given the same initial state $s_0$ is:

    $$P_{\mathrm{BT}}(\xi_1 \succ \xi_2 \mid s_0) = \sigma\left(r(\xi_1) - r(\xi_2)\right)$$

    • $\succ$: "preferred to"
    • $\sigma$: Sigmoid function

    The global RM is fit via MLE:

    $$\hat{r}_{\mathrm{mle}} = \underset{r \in R}{\operatorname{argmin}} \; D_{KL}(P_D \,\|\, P_{\mathrm{BT}}) = \underset{r \in R}{\operatorname{argmax}} \sum_{i=1}^N \log \sigma\left(r(\xi_i^+) - r(\xi_i^-)\right)$$

    • $\xi_i^+$: Preferred trajectory in pair $i$
    • $\xi_i^-$: Dis-preferred trajectory in pair $i$

    Similarly, a policy can be fit via MLE by substituting the sum of its log probabilities for $r_{\pi}$ (a minimal code sketch of both objectives appears after this list):

    $$\hat{\pi}_{\mathrm{mle}} = \underset{r_{\pi} \in R(\Pi)}{\operatorname{argmin}} \; D_{KL}(P_D \,\|\, P_{\mathrm{BT}})$$

    $$= \underset{\pi \in \Pi}{\operatorname{argmax}} \sum_{i=1}^N \log \sigma\left(r_{\pi}(\xi_i^+) - r_{\pi}(\xi_i^-)\right)$$

    $$= \underset{\pi \in \Pi}{\operatorname{argmax}} \sum_{i=1}^N \log \sigma\left(\sum_{h=0}^H \log \pi(a_{h,i}^+ \mid s_{h,i}^+) - \sum_{h=0}^H \log \pi(a_{h,i}^- \mid s_{h,i}^-)\right)$$

  • Maximum Entropy: Given a global reward model $r$, the soft-optimal policy $\pi^*$ is computed as:

    $$\pi^* = \underset{\pi \in \Pi}{\operatorname{argmax}} \; \mathbb{E}_{\xi \sim \pi}\left[r(\xi)\right] + H(\pi)$$

    The paper proves that solving this soft RL problem is equivalent to a reverse KL projection from $P^*$ (the trajectory distribution induced by the soft-optimal policy) onto the set of trajectory distributions induced by policies in $\Pi$.

  • Equivalence Theorems: The paper presents key equivalence theorems:

    • Theorem 2.2: If the reward and policy classes are isomorphic (i.e., $R = R(\Pi)$), then RLHF is equivalent to MLE.
    • Theorem 2.3: RLHF is equivalent to Direct Preference Optimization (DPO) when $R = R(\Pi)$.
    • Theorem 2.4: A variant of the online SPIN algorithm (which is online DPO with supervised fine-tuning data) is equivalent to vanilla offline SFT.
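
To make the parallel between the two MLE objectives above concrete, here is a minimal PyTorch sketch (not the authors' code; the `reward_model` interface and the log-probability tensors are assumed placeholders). Both losses are the same Bradley-Terry negative log-likelihood; they differ only in whether the trajectory score comes from a separate network or from the policy's own log-probabilities.

```python
import torch.nn.functional as F

def global_rm_loss(reward_model, chosen, rejected):
    """Bradley-Terry MLE for a global reward model r: trajectories -> scalars.

    `reward_model` is a placeholder for any network mapping a batch of full
    trajectories (prompt + completion) to scalar scores of shape (B,).
    """
    r_plus = reward_model(chosen)     # r(xi_i^+)
    r_minus = reward_model(rejected)  # r(xi_i^-)
    # maximize sum_i log sigma(r(xi^+) - r(xi^-))  ==  minimize its negation
    return -F.logsigmoid(r_plus - r_minus).mean()


def local_rm_loss(logprobs_chosen, logprobs_rejected):
    """Same MLE objective over local RMs, r_pi(xi) = sum_h log pi(a_h|s_h).

    Arguments are per-token log-probabilities of the chosen / rejected
    completions under the policy, shape (B, H), with padding already masked.
    """
    r_plus = logprobs_chosen.sum(dim=-1)     # sum_h log pi(a_h^+ | s_h^+)
    r_minus = logprobs_rejected.sum(dim=-1)  # sum_h log pi(a_h^- | s_h^-)
    return -F.logsigmoid(r_plus - r_minus).mean()
```

The shared structure of the two losses is precisely the isomorphism $R = R(\Pi)$ that the equivalence theorems rely on.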

3. Empirical Investigation: The Value of RL in Fine-Tuning

To reconcile the theoretical equivalences with empirical observations, the authors conduct a series of controlled experiments. They focus on learning to summarize from preference feedback using the Pythia series of models. They use DPO as the optimizer for both offline and online PFT to control for confounding variables.

The experiments reveal a significant performance gap between online and offline DPO, even when both start from the same SFT checkpoint and are trained on the same data. This stands in apparent tension with the equivalence results of Section 2.
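
As a rough illustration of the experimental contrast (a sketch with hypothetical `policy.sample`, `reward_model`, and `dpo_loss` helpers, not the authors' training code), the two settings differ only in where the preference pairs come from:

```python
def offline_dpo_step(policy, batch, optimizer, dpo_loss):
    """Offline PFT: DPO update on a fixed, pre-collected preference dataset."""
    loss = dpo_loss(policy, batch["prompt"], batch["chosen"], batch["rejected"])
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


def online_dpo_step(policy, reward_model, prompt, optimizer, dpo_loss):
    """Online PFT: label fresh on-policy samples with the learned RM,
    then apply the same DPO update."""
    a, b = policy.sample(prompt), policy.sample(prompt)
    if reward_model(prompt, a) >= reward_model(prompt, b):
        chosen, rejected = a, b
    else:
        chosen, rejected = b, a
    loss = dpo_loss(policy, prompt, chosen, rejected)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```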

4. Hypotheses for the Online-Offline Gap

The authors explore several hypotheses to explain the discrepancy between theory and practice:

  • H1: Intrinsic Value of Online Samples: The idea that feedback on samples more likely under the current policy is more useful. The authors argue against this, stating that on-policy data is redundant from an information-theoretic perspective.
  • H2: Failure of Offline PFT Regularization to Reference Policy: Offline PFT algorithms may require stronger coverage conditions than online approaches because they struggle to effectively regularize to the reference policy. The authors provide evidence against this, noting that simply adding a reverse KL penalty to DPO doesn't close the gap.
  • H3: Relative Ease of Online PFT Optimization: The possibility that offline PFT faces a harder optimization problem. The authors argue against this because they use the same optimizer (DPO) for both online and offline settings. They also explore a refined hypothesis related to computational-statistical gaps but find little support for it through prompt augmentation experiments.
  • H4: Global Reward Models Can Be Trained on More Data: This suggests global RMs are more amenable to training on wider data distributions than local RMs/policies. The authors generate a more concentrated dataset using samples only from the SFT policy and find that online DPO still improves performance.
  • H5: Global Reward Models Generalize Better Out-of-Distribution (OOD): This suggests that online PFT techniques are better at maximizing rewards when the learned reward model's peak falls outside the preference data's support. The authors train local RMs without regularization and compare them to global and DPO RMs, finding that better in-distribution margins correlate with better OOD generalization.

5. Generation-Verification Gap and Proper Policy Learning

The paper proposes an alternative hypothesis based on the idea that, for many problems, verification is simpler than generation. They hypothesize that the first step of online fine-tuning finds a relatively simple reward model, and the second step finds a soft-optimal policy for that reward model. This reduces the search space compared to offline fine-tuning, which searches over all possible policies.
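
A toy illustration of such a gap (integer factoring, chosen for familiarity rather than taken from the paper): the verifier is a one-line check, while a generator that only has the verifier to work with must search.

```python
def verify(n, p, q):
    """Verifier: one multiplication and two comparisons."""
    return p > 1 and q > 1 and p * q == n


def generate(n):
    """Generator: trial-division search until the verifier accepts."""
    for p in range(2, int(n ** 0.5) + 1):
        if verify(n, p, n // p):
            return p, n // p
    return None  # no nontrivial factorization found
```

In the PFT setting, the reward model plays the role of the cheap verifier and the policy the role of the more expensive generator.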

  • H6: Online PFT is Proper Policy Learning: This states that offline FT solves a harder, improper learning problem, while online FT avoids it by performing proper learning over a restricted policy space.

The authors provide evidence supporting H6:

  • Experiments show that using a significantly smaller global RM leads to nearly identical Best-Of-N (BoN) performance as using an RM the same size as the policy.
  • They design experiments to close the generation-verification gap, such as using two-word summaries or the ROUGE-L metric as the reward function. In these cases, the gap between online and offline PFT diminishes or disappears.
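
For reference, the Best-of-N selection used in these comparisons can be sketched as follows (hypothetical `policy.sample` and `reward_model` interfaces; the RM may be much smaller than the policy):

```python
def best_of_n(policy, reward_model, prompt, n=16):
    """Sample n completions from the policy; return the one the RM scores highest."""
    candidates = [policy.sample(prompt) for _ in range(n)]
    scores = [reward_model(prompt, c) for c in candidates]
    return candidates[max(range(n), key=lambda i: scores[i])]
```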

6. Isomorphisms Are Not a Two-Way Street

The paper addresses the question of why it is harder to learn a policy than a reward model if the two are isomorphic in soft RL. The authors explain that while the isomorphism exists, the mapping from a reward model to its soft-optimal policy requires solving a hard reinforcement learning problem. In contrast, the mapping between a policy and its Q-function (a local reward model) is direct. This suggests that optimizing over local reward models, as DPO does, doesn't escape the statistical difficulty of directly learning the generator.
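
Concretely, in the entropy-regularized (soft) RL setting used throughout, the standard identities relating the soft-optimal policy and its Q-function are (a general soft-RL fact, stated here for illustration rather than quoted from the paper):

$$\pi^*(a \mid s) = \exp\big(Q^*(s, a) - V^*(s)\big), \qquad V^*(s) = \log \sum_{a \in A} \exp Q^*(s, a)$$

Reading a policy off a Q-function (or a Q-function off a policy) is thus a single softmax/log transformation, whereas recovering the soft-optimal policy from a trajectory-level reward $r$ requires solving the full soft RL problem, which is the asymmetry this section highlights.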

Authors (5)
  1. Gokul Swamy (26 papers)
  2. Sanjiban Choudhury (62 papers)
  3. Wen Sun (124 papers)
  4. Zhiwei Steven Wu (143 papers)
  5. J. Andrew Bagnell (64 papers)