- The paper shows that selecting a baseline below the expected reward yields stable, monotonic policy improvement while preventing premature support collapse.
- It rigorously analyzes the impact of the baseline choice and validates the analysis with bandit experiments and large-scale LLM fine-tuning runs that reveal a critical phase transition.
- The study provides actionable guidance for off-policy reinforcement learning, recommending conservative baseline settings to effectively balance positive and negative rewards.
Asymmetric REINFORCE for Off-Policy Reinforcement Learning: Balancing Positive and Negative Rewards
The paper introduces and analyzes Asymmetric REINFORCE (AsymRE), a simple yet theoretically grounded off-policy reinforcement learning (RL) algorithm, with a particular focus on its application to LLM fine-tuning. The central insight is that, in off-policy RL, the choice of baseline in the REINFORCE objective fundamentally alters both the training dynamics and the asymptotic behavior of the learned policy. The authors provide a rigorous theoretical analysis, empirical validation in bandit settings, and large-scale experiments with LLMs, demonstrating the practical implications of their findings.
Theoretical Contributions
The AsymRE algorithm is defined by the objective:
$$J(\pi) = \mathbb{E}_{y \sim \mu}\big[\log \pi(y)\,\big(r(y) - V\big)\big]$$
where μ is the behavior policy, π is the current policy, r(y) is the reward, and V is a tunable baseline. Unlike on-policy REINFORCE, where the baseline serves only to reduce variance, in the off-policy setting the baseline V introduces a bias that can be leveraged to control the emphasis on positive versus negative rewards.
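As a concrete reference, here is a minimal sketch of this objective as a surrogate loss, assuming responses have already been sampled from a fixed behavior policy and scored; the function name, tensor layout, and use of PyTorch are illustrative choices, not the paper's implementation:

```python
import torch

def asymre_loss(logprobs: torch.Tensor, rewards: torch.Tensor, baseline: float) -> torch.Tensor:
    """Surrogate loss whose gradient is the AsymRE update on off-policy samples.

    logprobs -- log pi(y) of responses y drawn from the *behavior* policy mu,
                evaluated under the *current* policy pi (requires grad).
    rewards  -- scalar rewards r(y) for those responses.
    baseline -- the tunable constant V.
    """
    advantages = rewards.detach() - baseline   # (r(y) - V); constant w.r.t. pi
    # Maximizing E_{y~mu}[log pi(y) (r(y) - V)] == minimizing its negation.
    return -(logprobs * advantages).mean()
```

Note the absence of an importance-sampling ratio $\pi/\mu$: the expectation stays under $\mu$, which is exactly why the baseline acts as a bias knob rather than a pure variance reducer.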
The authors provide a detailed analysis in the tabular setting, showing that:
- If $V < V_\mu$ (the expected reward under $\mu$), AsymRE converges to a policy that improves upon $\mu$ and maintains broad support.
- If $V \ge V_\mu$, a phase transition occurs: the policy's support collapses, often to a singleton, leading to premature convergence and loss of diversity.
- Iterative application of AsymRE with $V < V_\mu$ yields monotonic policy improvement, with the mass of the policy concentrating exponentially fast on the optimal set.
This analysis reveals a critical asymmetry: off-policy updates benefit from focusing on positive rewards (i.e., using a lower baseline), whereas negative rewards from off-policy data are less informative and can be detrimental if overemphasized.
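To see where the asymmetry comes from, it helps to write out the gradient of the tabular objective under a softmax parameterization $\pi_\theta$ (a back-of-the-envelope calculation consistent with, though not copied from, the paper's analysis):

$$\nabla_{\theta_i} J(\pi_\theta) = \sum_{y} \mu(y)\,\big(r(y) - V\big)\,\nabla_{\theta_i} \log \pi_\theta(y) = \underbrace{\mu(i)\,\big(r(i) - V\big)}_{\text{per-action push}} \;-\; \underbrace{\pi_\theta(i)\,\big(V_\mu - V\big)}_{\text{shared shrinkage}}$$

The first term pushes up exactly those actions with $r(i) > V$, so lowering $V$ enlarges the set of actions receiving positive pressure; the second term, which redistributes mass in proportion to $\pi_\theta(i)$, flips from a stabilizing shrinkage into a self-reinforcing push once $V \ge V_\mu$, consistent with the collapse described above.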
Empirical Validation
Bandit Experiments
In a controlled multi-armed bandit setting, the authors demonstrate that:
- As $V$ approaches $V_\mu$ from below, the expected reward of the learned policy increases, but the policy's support shrinks.
- Crossing the threshold $V = V_\mu$ leads to a sudden collapse in support and diversity, confirming the theoretical phase transition.
- Policy improvement schemes with $V < V_\mu$ yield consistent improvement, while $V \ge V_\mu$ results in suboptimal, deterministic policies (a toy sandbox of this setup is sketched after this list).
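The following sandbox is not the paper's code: the arm rewards, uniform behavior policy, and step size are illustrative, and it uses exact expected gradients rather than sampled ones. It makes it easy to watch the learned policy's support narrow as $V$ rises toward $V_\mu$, though it does not reproduce the sampled-update experiments in which crossing $V_\mu$ yields suboptimal deterministic policies.

```python
import numpy as np

def train_asymre_bandit(arm_rewards, mu, V, steps=5000, lr=0.1):
    """Gradient ascent on the tabular AsymRE objective J = sum_y mu(y) (r(y) - V) log pi(y)."""
    theta = np.zeros_like(arm_rewards)
    for _ in range(steps):
        pi = np.exp(theta - theta.max())
        pi /= pi.sum()
        # dJ/dtheta_i = mu_i (r_i - V) - pi_i (V_mu - V)   (softmax parameterization)
        grad = mu * (arm_rewards - V) - pi * np.dot(mu, arm_rewards - V)
        theta += lr * grad
    pi = np.exp(theta - theta.max())
    return pi / pi.sum()

arm_rewards = np.array([1.0, 0.6, 0.2, 0.0])
mu = np.full(4, 0.25)                  # uniform behavior policy
V_mu = float(mu @ arm_rewards)         # expected reward under mu

for V in (V_mu - 0.3, V_mu - 0.05, V_mu + 0.05):
    pi = train_asymre_bandit(arm_rewards, mu, V)
    print(f"V - V_mu = {V - V_mu:+.2f}   pi = {np.round(pi, 3)}   "
          f"E_pi[r] = {pi @ arm_rewards:.3f}   support ~ {(pi > 1e-3).sum()} arms")
```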
LLM Fine-Tuning
The AsymRE objective is adapted to LLMs by making the baseline context-dependent and adding a conservative correction:
$$\mathbb{E}_{x \sim \mathcal{D},\, y \sim \mu(\cdot \mid x)}\Big[\log \pi(y \mid x)\,\big(r(y, x) - V_{\mu(\cdot \mid x)} - \delta V\big)\Big]$$
where $x$ is a prompt, $V_{\mu(\cdot \mid x)}$ is the expected reward of the behavior policy on that prompt, and $\delta V$ is a small conservative correction.
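A sketch of how this loss might be assembled in practice, under the assumption that $K$ responses per prompt are drawn from the behavior policy and $V_{\mu(\cdot \mid x)}$ is estimated by their mean reward; the grouping convention, tensor shapes, and function name are assumptions rather than the paper's implementation:

```python
import torch

def asymre_llm_loss(logprobs: torch.Tensor,  # (B, K): log pi(y|x) summed over tokens, K responses per prompt
                    rewards: torch.Tensor,   # (B, K): rewards r(y, x) for the behavior policy's samples
                    delta_v: float = -0.1    # conservative correction, kept strictly negative
                    ) -> torch.Tensor:
    # Estimate V_{mu(.|x)} per prompt as the mean reward of that prompt's K samples.
    v_mu_x = rewards.mean(dim=1, keepdim=True)
    # Baseline sits delta_v below the estimate; delta_v < 0 keeps it under the expected reward.
    advantages = rewards - (v_mu_x + delta_v)
    return -(logprobs * advantages.detach()).mean()
```

With $\delta V = 0$ this reduces to a plain group-mean baseline, which the findings below identify as the edge of the unstable regime.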
Key findings from experiments with Llama 8B and Qwen 3B on the MATH dataset:
- Training is stable and performance improves as $\delta V$ approaches $0$ from below.
- When $\delta V \ge 0$, both training and test accuracy collapse catastrophically, and the entropy of the policy drops, indicating loss of diversity.
- A small negative correction (e.g., $\delta V = -0.1$) consistently prevents collapse and yields more robust training.
Implications and Discussion
The results have several important implications for both theory and practice:
- Off-Policy RL for LLMs: The findings provide a principled approach to off-policy RL in LLM fine-tuning, where strict on-policy data collection is often infeasible due to computational and engineering constraints.
- Baseline Selection: The baseline in off-policy REINFORCE is not merely a variance reduction tool but a critical hyperparameter that governs the trade-off between learning from positive and negative examples. Conservative (lower) baselines are preferable in off-policy settings.
- Policy Diversity: Maintaining policy diversity is essential in high-dimensional, multi-task settings such as LLMs. Overemphasis on negative rewards (high baseline) can lead to overfitting and poor generalization.
- Practical Guidance: For practitioners, the recommendation is to set the baseline slightly below the empirical average reward of the behavior policy, ensuring stable and effective off-policy learning.
Future Directions
The paper suggests several avenues for further research:
- Extending the analysis to more sophisticated objectives incorporating importance sampling or KL regularization.
- Quantifying the computational and sample efficiency gains from reusing off-policy data in large-scale LLM training.
- Investigating the interplay between baseline selection and other regularization techniques in RLHF and related alignment methods.
Conclusion
AsymRE offers a theoretically sound and practically effective method for off-policy RL, particularly in the context of LLM alignment and fine-tuning. The work clarifies the nuanced role of the baseline in off-policy policy gradient methods and provides actionable insights for stable and efficient RL-based training of large models. The phase transition phenomenon identified here is especially relevant for practitioners seeking to balance performance and diversity in real-world RL applications.