- The paper shows that selecting a baseline below the expected reward yields stable, monotonic policy improvement while preventing premature support collapse.
- It rigorously analyzes the impact of the baseline choice and validates the analysis with bandit experiments and large-scale LLM fine-tuning runs that reveal a critical phase transition.
- The study provides actionable guidance for off-policy reinforcement learning, recommending conservative baseline settings to effectively balance positive and negative rewards.
Asymmetric REINFORCE for Off-Policy Reinforcement Learning: Balancing Positive and Negative Rewards
The paper introduces and analyzes Asymmetric REINFORCE (AsymRE), a simple yet theoretically grounded off-policy reinforcement learning (RL) algorithm, with a particular focus on its application to LLM fine-tuning. The central insight is that, in off-policy RL, the choice of baseline in the REINFORCE objective fundamentally alters both the training dynamics and the asymptotic behavior of the learned policy. The authors provide a rigorous theoretical analysis, empirical validation in bandit settings, and large-scale experiments with LLMs, demonstrating the practical implications of their findings.
Theoretical Contributions
The AsymRE algorithm is defined by the objective:
$$J(\pi) = \mathbb{E}_{y \sim \mu}\big[\log \pi(y)\,\big(r(y) - V\big)\big]$$
where μ is the behavior policy, π is the current policy, r(y) is the reward, and V is a tunable baseline. Unlike on-policy REINFORCE, where the baseline serves only to reduce variance, in the off-policy setting the baseline V introduces a bias that can be leveraged to control the emphasis on positive versus negative rewards.
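As a concrete reference, here is a minimal sketch of this objective as a surrogate loss, assuming responses have already been sampled from a fixed behavior policy and scored; the function name, tensor layout, and use of PyTorch are illustrative choices, not the paper's implementation:

```python
import torch

def asymre_loss(logprobs: torch.Tensor, rewards: torch.Tensor, baseline: float) -> torch.Tensor:
    """Surrogate loss whose gradient is the AsymRE update on off-policy samples.

    logprobs -- log pi(y) of responses y drawn from the *behavior* policy mu,
                evaluated under the *current* policy pi (requires grad).
    rewards  -- scalar rewards r(y) for those responses.
    baseline -- the tunable constant V.
    """
    advantages = rewards.detach() - baseline   # (r(y) - V); constant w.r.t. pi
    # Maximizing E_{y~mu}[log pi(y) (r(y) - V)] == minimizing its negation.
    return -(logprobs * advantages).mean()
```

Note the absence of an importance-sampling ratio $\pi/\mu$: the expectation stays under $\mu$, which is exactly why the baseline acts as a bias knob rather than a pure variance reducer.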
The authors provide a detailed analysis in the tabular setting, showing that:
- If $V < V_\mu$ (the expected reward under $\mu$), AsymRE converges to a policy that improves upon $\mu$ and maintains broad support.
- If $V \ge V_\mu$, a phase transition occurs: the policy's support collapses, often to a singleton, leading to premature convergence and loss of diversity.
- Iterative application of AsymRE with $V < V_\mu$ yields monotonic policy improvement, with the mass of the policy concentrating exponentially fast on the optimal set.
This analysis reveals a critical asymmetry: off-policy updates benefit from focusing on positive rewards (i.e., using a lower baseline), whereas negative rewards from off-policy data are less informative and can be detrimental if overemphasized.
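To see where the asymmetry comes from, it helps to write out the gradient of the tabular objective under a softmax parameterization $\pi_\theta$ (a back-of-the-envelope calculation consistent with, though not copied from, the paper's analysis):

$$\nabla_{\theta_i} J(\pi_\theta) = \sum_{y} \mu(y)\,\big(r(y) - V\big)\,\nabla_{\theta_i} \log \pi_\theta(y) = \underbrace{\mu(i)\,\big(r(i) - V\big)}_{\text{per-action push}} \;-\; \underbrace{\pi_\theta(i)\,\big(V_\mu - V\big)}_{\text{shared shrinkage}}$$

The first term pushes up exactly those actions with $r(i) > V$, so lowering $V$ enlarges the set of actions receiving positive pressure; the second term, which redistributes mass in proportion to $\pi_\theta(i)$, flips from a stabilizing shrinkage into a self-reinforcing push once $V \ge V_\mu$, consistent with the collapse described above.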
Empirical Validation
Bandit Experiments
In a controlled multi-armed bandit setting, the authors demonstrate that:
- As $V$ approaches $V_\mu$ from below, the expected reward of the learned policy increases, but the policy's support shrinks.
- Crossing the threshold $V = V_\mu$ leads to a sudden collapse in support and diversity, confirming the theoretical phase transition.
- Policy improvement schemes with $V < V_\mu$ yield consistent improvement, while $V \ge V_\mu$ results in suboptimal, deterministic policies (a toy sandbox of this setup is sketched after this list).
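The following sandbox is not the paper's code: the arm rewards, uniform behavior policy, and step size are illustrative, and it uses exact expected gradients rather than sampled ones. It makes it easy to watch the learned policy's support narrow as $V$ rises toward $V_\mu$, though it does not reproduce the sampled-update experiments in which crossing $V_\mu$ yields suboptimal deterministic policies.

```python
import numpy as np

def train_asymre_bandit(arm_rewards, mu, V, steps=5000, lr=0.1):
    """Gradient ascent on the tabular AsymRE objective J = sum_y mu(y) (r(y) - V) log pi(y)."""
    theta = np.zeros_like(arm_rewards)
    for _ in range(steps):
        pi = np.exp(theta - theta.max())
        pi /= pi.sum()
        # dJ/dtheta_i = mu_i (r_i - V) - pi_i (V_mu - V)   (softmax parameterization)
        grad = mu * (arm_rewards - V) - pi * np.dot(mu, arm_rewards - V)
        theta += lr * grad
    pi = np.exp(theta - theta.max())
    return pi / pi.sum()

arm_rewards = np.array([1.0, 0.6, 0.2, 0.0])
mu = np.full(4, 0.25)                  # uniform behavior policy
V_mu = float(mu @ arm_rewards)         # expected reward under mu

for V in (V_mu - 0.3, V_mu - 0.05, V_mu + 0.05):
    pi = train_asymre_bandit(arm_rewards, mu, V)
    print(f"V - V_mu = {V - V_mu:+.2f}   pi = {np.round(pi, 3)}   "
          f"E_pi[r] = {pi @ arm_rewards:.3f}   support ~ {(pi > 1e-3).sum()} arms")
```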
LLM Fine-Tuning
The AsymRE objective is adapted to LLMs by making the baseline context-dependent and adding a conservative correction:
$$\mathbb{E}_{x \sim \mathcal{D},\, y \sim \mu(\cdot \mid x)}\Big[\log \pi(y \mid x)\,\big(r(y, x) - V_{\mu(\cdot \mid x)} - \delta V\big)\Big]$$
where $x$ is a prompt, $V_{\mu(\cdot \mid x)}$ is the expected reward of the behavior policy on that prompt, and $\delta V$ is a small conservative correction.
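A sketch of how this loss might be assembled in practice, under the assumption that $K$ responses per prompt are drawn from the behavior policy and $V_{\mu(\cdot \mid x)}$ is estimated by their mean reward; the grouping convention, tensor shapes, and function name are assumptions rather than the paper's implementation:

```python
import torch

def asymre_llm_loss(logprobs: torch.Tensor,  # (B, K): log pi(y|x) summed over tokens, K responses per prompt
                    rewards: torch.Tensor,   # (B, K): rewards r(y, x) for the behavior policy's samples
                    delta_v: float = -0.1    # conservative correction, kept strictly negative
                    ) -> torch.Tensor:
    # Estimate V_{mu(.|x)} per prompt as the mean reward of that prompt's K samples.
    v_mu_x = rewards.mean(dim=1, keepdim=True)
    # Baseline sits delta_v below the estimate; delta_v < 0 keeps it under the expected reward.
    advantages = rewards - (v_mu_x + delta_v)
    return -(logprobs * advantages.detach()).mean()
```

With $\delta V = 0$ this reduces to a plain group-mean baseline, which the findings below identify as the edge of the unstable regime.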
Key findings from experiments with Llama 8B and Qwen 3B on the MATH dataset:
- Training is stable and performance improves as $\delta V$ approaches $0$ from below.
- When $\delta V \ge 0$, both training and test accuracy collapse catastrophically, and the entropy of the policy drops, indicating loss of diversity.
- A small negative correction (e.g., $\delta V = -0.1$) consistently prevents collapse and yields more robust training.
Implications and Discussion
The results have several important implications for both theory and practice:
- Off-Policy RL for LLMs: The findings provide a principled approach to off-policy RL in LLM fine-tuning, where strict on-policy data collection is often infeasible due to computational and engineering constraints.
- Baseline Selection: The baseline in off-policy REINFORCE is not merely a variance reduction tool but a critical hyperparameter that governs the trade-off between learning from positive and negative examples. Conservative (lower) baselines are preferable in off-policy settings.
- Policy Diversity: Maintaining policy diversity is essential in high-dimensional, multi-task settings such as LLMs. Overemphasis on negative rewards (high baseline) can lead to overfitting and poor generalization.
- Practical Guidance: For practitioners, the recommendation is to set the baseline slightly below the empirical average reward of the behavior policy, ensuring stable and effective off-policy learning.
Future Directions
The paper suggests several avenues for further research:
- Extending the analysis to more sophisticated objectives incorporating importance sampling or KL regularization.
- Quantifying the computational and sample efficiency gains from reusing off-policy data in large-scale LLM training.
- Investigating the interplay between baseline selection and other regularization techniques in RLHF and related alignment methods.
Conclusion
AsymRE offers a theoretically sound and practically effective method for off-policy RL, particularly in the context of LLM alignment and fine-tuning. The work clarifies the nuanced role of the baseline in off-policy policy gradient methods and provides actionable insights for stable and efficient RL-based training of large models. The phase transition phenomenon identified here is especially relevant for practitioners seeking to balance performance and diversity in real-world RL applications.