Asymmetric REINFORCE for off-Policy Reinforcement Learning: Balancing positive and negative rewards (2506.20520v1)

Published 25 Jun 2025 in cs.LG and cs.CL

Abstract: Reinforcement learning (RL) is increasingly used to align LLMs. Off-policy methods offer greater implementation simplicity and data efficiency than on-policy techniques, but often result in suboptimal performance. In this work, we study the intermediate range of algorithms between off-policy RL and supervised fine-tuning by analyzing a simple off-policy REINFORCE algorithm, where the advantage is defined as $A=r-V$, with $r$ a reward and $V$ some tunable baseline. Intuitively, lowering $V$ emphasizes high-reward samples, while raising it penalizes low-reward ones more heavily. We first provide a theoretical analysis of this off-policy REINFORCE algorithm, showing that when the baseline $V$ lower-bounds the expected reward, the algorithm enjoys a policy improvement guarantee. Our analysis reveals that while on-policy updates can safely leverage both positive and negative signals, off-policy updates benefit from focusing more on positive rewards than on negative ones. We validate our findings experimentally in a controlled stochastic bandit setting and through fine-tuning state-of-the-art LLMs on reasoning tasks.

Summary

  • The paper shows that selecting a baseline below the expected reward yields stable, monotonic policy improvement while preventing premature support collapse.
  • It rigorously analyzes the impact of baseline choice, validated by bandit experiments and large-scale LLM fine-tuning tests that reveal a critical phase transition.
  • The study provides actionable guidance for off-policy reinforcement learning, recommending conservative baseline settings to effectively balance positive and negative rewards.

Asymmetric REINFORCE for Off-Policy Reinforcement Learning: Balancing Positive and Negative Rewards

The paper introduces and analyzes Asymmetric REINFORCE (AsymRE), a simple yet theoretically grounded off-policy reinforcement learning (RL) algorithm, with a particular focus on its application to LLM fine-tuning. The central insight is that, in off-policy RL, the choice of baseline in the REINFORCE objective fundamentally alters both the training dynamics and the asymptotic behavior of the learned policy. The authors provide a rigorous theoretical analysis, empirical validation in bandit settings, and large-scale experiments with LLMs, demonstrating the practical implications of their findings.

Theoretical Contributions

The AsymRE algorithm is defined by the objective:

$$J(\pi) = \mathbb{E}_{y \sim \mu} \left[ \log \pi(y) \, (r(y) - V) \right]$$

where $\mu$ is the behavior policy, $\pi$ is the current policy, $r(y)$ is the reward, and $V$ is a tunable baseline. Unlike on-policy REINFORCE, where the baseline serves only to reduce variance, in the off-policy setting the baseline $V$ introduces a bias that can be leveraged to control the emphasis on positive versus negative rewards.
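
To make the estimator concrete, the following is a minimal PyTorch-style sketch of a Monte Carlo estimate of this objective from a batch of behavior-policy samples; the function name, tensor shapes, and fixed scalar baseline are illustrative assumptions rather than the authors' code. Consistent with the objective above, no importance weights are applied to the off-policy samples.

```python
import torch

def asymre_loss(logp: torch.Tensor, rewards: torch.Tensor, baseline: float) -> torch.Tensor:
    """Monte Carlo estimate of -J(pi) for the AsymRE objective.

    logp:     log pi(y_i) under the *current* policy pi, for samples y_i that
              were drawn from the behavior policy mu (shape: [batch]).
    rewards:  scalar rewards r(y_i) for those samples (shape: [batch]).
    baseline: the tunable scalar V; lowering it emphasizes high-reward samples,
              raising it penalizes low-reward samples more heavily.
    """
    advantage = rewards - baseline                 # A = r - V, treated as a constant
    return -(logp * advantage.detach()).mean()     # minimize -J(pi)
```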

The authors provide a detailed analysis in the tabular setting, showing that:

  • If $V < V^\mu$ (the expected reward under $\mu$), AsymRE converges to a policy that improves upon $\mu$ and maintains broad support.
  • If $V \geq V^\mu$, a phase transition occurs: the policy's support collapses, often to a singleton, leading to premature convergence and loss of diversity.
  • Iterative application of AsymRE with $V < V^\mu$ yields monotonic policy improvement, with the mass of the policy concentrating exponentially fast on the optimal set.

This analysis reveals a critical asymmetry: off-policy updates benefit from focusing on positive rewards (i.e., using a lower baseline), whereas negative rewards from off-policy data are less informative and can be detrimental if overemphasized.
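
One way to make this asymmetry concrete is to differentiate the objective under a tabular softmax parameterization $\pi_\theta(y) \propto e^{\theta_y}$ (a simplifying illustration, not the paper's notation). Using $\partial \log \pi_\theta(y') / \partial \theta_y = \mathbf{1}\{y' = y\} - \pi_\theta(y)$,

$$\frac{\partial J}{\partial \theta_y} = \sum_{y'} \mu(y')\,(r(y') - V)\,\frac{\partial \log \pi_\theta(y')}{\partial \theta_y} = \mu(y)\,(r(y) - V) - \pi_\theta(y)\,(V^\mu - V).$$

The first term pushes $\theta_y$ down whenever the behavior policy samples $y$ with reward below $V$, regardless of how likely $\pi_\theta$ itself is to produce $y$, so raising $V$ amplifies penalties that come purely from off-policy data. The second term is stabilizing when $V < V^\mu$ (it slightly shrinks high-probability actions, preserving diversity) but flips sign when $V > V^\mu$, turning into a rich-get-richer feedback consistent with the support collapse described above.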

Empirical Validation

Bandit Experiments

In a controlled multi-armed bandit setting, the authors demonstrate that:

  • As $V$ approaches $V^\mu$ from below, the expected reward of the learned policy increases, but the policy's support shrinks.
  • Crossing the threshold $V = V^\mu$ leads to a sudden collapse in support and diversity, confirming the theoretical phase transition.
  • Policy improvement schemes with $V < V^\mu$ yield consistent improvement, while $V \geq V^\mu$ results in suboptimal, deterministic policies.
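
The following is a toy simulation in the spirit of these experiments, not the paper's exact setup: a 10-armed bandit with deterministic rewards, a uniform behavior policy, and stochastic-gradient ascent on the AsymRE objective under a softmax policy. Sweeping the baseline toward and past $V^\mu$ lets one observe the support of the learned policy shrinking and then collapsing.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 10                              # number of arms
r = np.linspace(0.0, 1.0, K)        # deterministic per-arm rewards (a simplification)
mu = np.full(K, 1.0 / K)            # uniform behavior policy
v_mu = float(mu @ r)                # expected reward under mu

def train_asymre(V: float, steps: int = 20_000, lr: float = 0.1) -> np.ndarray:
    """Stochastic-gradient ascent on J(pi) using samples drawn from mu only."""
    theta = np.zeros(K)
    for _ in range(steps):
        y = rng.choice(K, p=mu)                 # off-policy sample from mu
        pi = np.exp(theta - theta.max())
        pi /= pi.sum()
        adv = r[y] - V                          # advantage A = r - V
        grad = -pi * adv                        # grad of log pi(y) w.r.t. theta ...
        grad[y] += adv                          # ... is e_y - pi, scaled by the advantage
        theta += lr * grad
    pi = np.exp(theta - theta.max())
    return pi / pi.sum()

for V in (v_mu - 0.3, v_mu - 0.05, v_mu + 0.05):
    pi = train_asymre(V)
    print(f"V={V:+.3f}  E_pi[r]={pi @ r:.3f}  support={int((pi > 1e-3).sum())}")
```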

LLM Fine-Tuning

The AsymRE objective is adapted for LLMs by using a context-corrected baseline:

$$\mathbb{E}_{x \sim \mathcal{D},\, y \sim \mu(\cdot|x)} \left[ \log \pi(y|x) \, \big(r(y, x) - V^{\mu(\cdot|x)} - \delta V\big) \right]$$

where $x$ is a prompt, $V^{\mu(\cdot|x)}$ is the expected reward of the behavior policy on that prompt, and $\delta V$ is a small conservative correction.
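
A minimal sketch of how this objective might be estimated in PyTorch for a batch of prompts, each with a group of completions previously sampled from the behavior policy; approximating $V^{\mu(\cdot|x)}$ by the mean reward of that prompt's completions, as well as the tensor names and shapes, are assumptions made for illustration rather than the authors' implementation.

```python
import torch

def asymre_llm_loss(logp_sum: torch.Tensor,    # [B, G]: sum of log pi(y|x) over tokens
                    rewards: torch.Tensor,     # [B, G]: reward r(y, x) per completion
                    delta_v: float = -0.1) -> torch.Tensor:
    """AsymRE loss with a per-prompt baseline estimate.

    Each of the B prompts has G completions sampled earlier from the behavior
    policy mu. V^mu(.|x) is approximated by the mean reward of those G
    completions, and delta_v < 0 applies the conservative correction
    recommended in the paper.
    """
    v_mu_x = rewards.mean(dim=1, keepdim=True)    # per-prompt baseline estimate
    advantage = rewards - (v_mu_x + delta_v)      # r(y, x) - V^mu(.|x) - delta V
    return -(logp_sum * advantage.detach()).mean()
```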

Key findings from experiments with Llama 8B and Qwen 3B on the MATH dataset:

  • Training is stable and performance improves as $\delta V$ approaches $0$ from below.
  • When $\delta V \geq 0$, both training and test accuracy collapse catastrophically, and the entropy of the policy drops, indicating loss of diversity.
  • A small negative $\delta V$ (e.g., $-0.1$) consistently prevents collapse and yields more robust training.

Implications and Discussion

The results have several important implications for both theory and practice:

  • Off-Policy RL for LLMs: The findings provide a principled approach to off-policy RL in LLM fine-tuning, where strict on-policy data collection is often infeasible due to computational and engineering constraints.
  • Baseline Selection: The baseline in off-policy REINFORCE is not merely a variance reduction tool but a critical hyperparameter that governs the trade-off between learning from positive and negative examples. Conservative (lower) baselines are preferable in off-policy settings.
  • Policy Diversity: Maintaining policy diversity is essential in high-dimensional, multi-task settings such as LLMs. Overemphasis on negative rewards (high baseline) can lead to overfitting and poor generalization.
  • Practical Guidance: For practitioners, the recommendation is to set the baseline slightly below the empirical average reward of the behavior policy, ensuring stable and effective off-policy learning.
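
As a minimal illustration of the last recommendation (the function and variable names are hypothetical, not from the paper), a scalar baseline can be derived from logged behavior-policy rewards with a small conservative margin:

```python
import numpy as np

def conservative_baseline(logged_rewards: np.ndarray, margin: float = 0.1) -> float:
    """Set V slightly below the empirical mean reward of the behavior policy."""
    return float(np.mean(logged_rewards)) - margin
```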

Future Directions

The paper suggests several avenues for further research:

  • Extending the analysis to more sophisticated objectives incorporating importance sampling or KL regularization.
  • Quantifying the computational and sample efficiency gains from reusing off-policy data in large-scale LLM training.
  • Investigating the interplay between baseline selection and other regularization techniques in RLHF and related alignment methods.

Conclusion

AsymRE offers a theoretically sound and practically effective method for off-policy RL, particularly in the context of LLM alignment and fine-tuning. The work clarifies the nuanced role of the baseline in off-policy policy gradient methods and provides actionable insights for stable and efficient RL-based training of large models. The phase transition phenomenon identified here is especially relevant for practitioners seeking to balance performance and diversity in real-world RL applications.