- The paper introduces Tapered Off-Policy REINFORCE (TOPR), an algorithm using asymmetric tapered importance sampling to stabilize off-policy reinforcement learning for LLMs, particularly with negative rewards.
- TOPR effectively integrates both positive and negative training examples, improving sample efficiency and achieving strong performance on challenging reasoning benchmarks like GSM8K and MATH.
- The method's unified framework for positive and negative data, along with an optimal baseline strategy, enables stable learning dynamics and results competitive with much larger models.
Overview
The paper "Tapered Off-Policy REINFORCE: Stable and efficient reinforcement learning for LLMs" (2503.14286) introduces TOPR, a fine-tuning algorithm for LLMs that achieves enhanced stability in off-policy learning. It addresses the classical instability of REINFORCE in scenarios involving negative rewards and heterogeneous datasets. The algorithm innovates by employing tapered importance sampling to manage positive and negative training examples in a unified framework, yielding significant improvements in sample efficiency and overall performance on challenging reasoning benchmarks such as GSM8K and MATH.
Stable Off-Policy Learning Through Tapered Importance Sampling
At the core of TOPR is an asymmetric importance sampling mechanism that differentiates between positive and negative trajectories. Traditional REINFORCE suffers when confronted with negative rewards off-policy, since large importance ratios can lead to destructive updates. TOPR mitigates this by applying a tapered form of clipping on the importance weights for trajectories with negative rewards. In practice, for a trajectory τ with reward R(τ), the gradient takes the form:
$$
\nabla J_{\mathrm{TOPR}}(\theta) \;=\; \sum_{\tau : R(\tau) \ge 0} \mu(\tau)\, R(\tau)\, \nabla \log \pi(\theta; \tau) \;+\; \sum_{\tau : R(\tau) < 0} \mu(\tau) \left[ \frac{\pi(\theta; \tau)}{\mu(\tau)} \right]_{0}^{1} R(\tau)\, \nabla \log \pi(\theta; \tau)
$$
Here, the importance ratio π(θ;τ)/μ(τ) is clamped to the range [0, 1] for negative examples. This asymmetric treatment allows the model to maintain sufficient positive update magnitudes while tempering the potentially destabilizing effect of negative trajectories, thereby producing stable learning dynamics without resorting to additional regularizers such as a KL-divergence penalty.
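To make the update concrete, the following sketch applies this asymmetric weighting to a batch of trajectories, assuming sequence-level log-probabilities under both μ and π are available; the PyTorch code, function name, and tensor shapes are illustrative assumptions rather than the paper's reference implementation.

```python
import torch

def topr_surrogate_loss(logp_pi, logp_mu, rewards):
    """Surrogate loss whose gradient matches the tapered update above
    (an illustrative sketch, not the paper's reference code).

    logp_pi : (B,) sequence log-probabilities under the current policy pi (requires grad)
    logp_mu : (B,) sequence log-probabilities under the behavior policy mu
    rewards : (B,) scalar trajectory rewards R(tau)
    """
    ratio = torch.exp(logp_pi - logp_mu)  # pi(theta; tau) / mu(tau)
    # Positive trajectories keep weight 1; negative ones use the importance
    # ratio tapered (clamped) to [0, 1].
    weight = torch.where(rewards >= 0,
                         torch.ones_like(ratio),
                         ratio.clamp(0.0, 1.0))
    # Stop-gradient on the weight so the gradient is w * R(tau) * grad log pi,
    # matching the REINFORCE-style update.
    return -(weight.detach() * rewards * logp_pi).mean()

# Dummy usage
logp_pi = torch.tensor([-12.3, -8.1, -15.0], requires_grad=True)
logp_mu = torch.tensor([-11.9, -9.0, -14.2])
rewards = torch.tensor([1.0, -1.0, -1.0])
topr_surrogate_loss(logp_pi, logp_mu, rewards).backward()
```

Because the clamp only ever shrinks the weights on negative examples, negative trajectories can lower a response's probability but cannot dominate the update, which is the source of the stability discussed above.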
Unified Framework for Positive and Negative Examples
A central contribution of TOPR is its ability to integrate both positive and negative examples into the learning process. Many conventional methods discard negative examples or treat them separately, leading to inefficiencies. TOPR, by contrast, leverages negative trajectories via the tapered importance sampling technique. Doing so avoids the “wasted inference” associated with discarding valuable negative feedback. Empirically, this approach has been shown to drive test-time accuracy and data efficiency significantly higher than traditional methods.
Furthermore, the paper highlights a novel role for the baseline, a parameter traditionally used for variance reduction in REINFORCE. In TOPR, the baseline influences dataset composition by modulating the effective positive rate, i.e., the fraction of trajectories whose baseline-shifted reward is non-negative. This role is particularly important in off-policy settings, where the distributional mismatch between the behavior policy μ and the target policy π can lead to suboptimal utilization of negative examples. The paper identifies an optimal effective positive rate of approximately 10-20%, at which fine-tuning performance peaks.
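As a small, self-contained illustration of this effect (the helper name and the 0/1 correctness rewards are assumptions for the example, not values from the paper), subtracting a baseline from raw rewards directly controls how many trajectories fall on the positive side of the update:

```python
import torch

def effective_positive_rate(rewards, baseline):
    """Fraction of trajectories treated as positive after baseline subtraction
    (illustrative helper, not taken from the paper)."""
    return ((rewards - baseline) >= 0).float().mean().item()

# Dummy 0/1 correctness rewards: 30 correct solutions out of 100 samples.
rewards = torch.tensor([1.0] * 30 + [0.0] * 70)

print(effective_positive_rate(rewards, 0.0))  # 1.00 -- every trajectory is non-negative
print(effective_positive_rate(rewards, 0.5))  # 0.30 -- only correct solutions stay positive
```

Raising the baseline pushes incorrect solutions into the negative (tapered) branch of the update, which is how it shapes the effective mix of positive and negative training signal.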
Empirical Evaluation on GSM8K and MATH
The empirical results presented in the paper are extensive and technical. Key observations include:
- Performance Gains: TOPR consistently outperforms naive REINFORCE as well as more sophisticated baselines such as PPO and DPO in off-policy configurations. On the GSM8K and MATH benchmarks, for instance, TOPR-tuned models reach accuracy comparable to models with roughly an order of magnitude more parameters (70B vs. 8B), underscoring its efficiency in leveraging available data.
- Impact of Negative Examples: Ablation studies emphasize that incorporating negative examples is crucial. Exclusion of negative trajectories results in a marked performance degradation, thereby validating the design choices around tapered importance sampling. It is also noted that these negative examples, when appropriately weighted, contribute to smoother convergence and enhanced model generalization over multiple fine-tuning iterations.
- Multi-Iteration Training: TOPR demonstrates robustness in iterated fine-tuning regimes. Over successive training cycles, the performance improvement continues, suggesting that the learned policy benefits cumulatively from both positive and negative signals. This iterative effectiveness makes TOPR a viable choice in practical settings where models must be continually updated with new feedback.
- Dataset Curation Techniques: Dataset balancing techniques, such as the paper's Anna Karenina sampling method, further amplify TOPR's advantages by maintaining an effective mixture of positive and negative examples throughout training, which is critical to realizing the algorithm's full potential; a generic balancing sketch follows this list.
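The details of the Anna Karenina sampler are not reproduced here; purely as a generic stand-in, the sketch below shows one way to keep the share of positive trajectories near a target value by subsampling negatives (the function name, the 15% default, and the tuple layout are assumptions for illustration).

```python
import random

def balance_batch(trajectories, target_positive_rate=0.15, seed=0):
    """Generic positive-rate balancing heuristic (a stand-in, not the paper's
    Anna Karenina sampler): keep all positives and subsample negatives so that
    positives make up roughly target_positive_rate of the batch.

    trajectories: list of (prompt, completion, reward) tuples.
    """
    rng = random.Random(seed)
    positives = [t for t in trajectories if t[2] >= 0]
    negatives = [t for t in trajectories if t[2] < 0]
    # Number of negatives that puts positives at the target fraction.
    n_neg = int(len(positives) * (1 - target_positive_rate) / target_positive_rate)
    batch = positives + rng.sample(negatives, min(n_neg, len(negatives)))
    rng.shuffle(batch)
    return batch
```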
Implementation Considerations and Practical Deployment
For practitioners seeking to implement TOPR in a real-world LLM fine-tuning pipeline, several practical considerations are noteworthy:
- Computational Overhead: Although TOPR relies on the same Monte Carlo sampling as standard REINFORCE, computing the importance weights requires sequence log-probabilities under both the behavior policy μ and the current policy π, which adds a modest overhead. This cost is counterbalanced by the improved sample efficiency, which effectively reduces the number of training iterations required.
- Parallelization and Scaling: Given the algorithm’s reliance on off-policy data, careful design is required to ensure that the data collection and the computation of importance ratios are efficiently parallelized. Utilizing modern GPU clusters and distributed training frameworks can help manage the increased computational demand.
- Hyperparameter Tuning: Key hyperparameters include the clipping range for negative examples and the baseline parameter for variance reduction. Sensitivity analyses indicate that the optimal configuration can vary with the dataset composition and the specific reasoning task. Rigorous hyperparameter sweeps, preferably leveraging Bayesian optimization or grid search techniques, are recommended during implementation.
- Integration with LLMs: TOPR is well suited to transformer-based architectures. Practitioners should replace the likelihood-only objective of supervised fine-tuning with a structured reward signal reflecting task-specific performance metrics; a sketch of one such reward follows this list. The modular nature of TOPR also facilitates its application in reinforcement learning settings where both generative and verification tasks are pursued (e.g., solution generation versus generative verification).
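As an example of such a structured reward, a simple verifier-style check for math tasks might compare the final number in a completion against a reference answer; the regular expression and the ±1 reward values below are assumptions for illustration, not specifics from the paper.

```python
import re

def math_answer_reward(completion: str, reference_answer: str) -> float:
    """Hypothetical structured reward for GSM8K/MATH-style tasks:
    +1 if the last number in the completion matches the reference, -1 otherwise."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion.replace(",", ""))
    return 1.0 if numbers and numbers[-1] == reference_answer else -1.0

# Such a reward plugs directly into the trajectory-level update sketched earlier.
print(math_answer_reward("... so the total cost is 42 dollars.", "42"))  # 1.0
print(math_answer_reward("... therefore the answer is 41.", "42"))       # -1.0
```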
Conclusion
TOPR represents a technically rigorous advancement in the domain of off-policy reinforcement learning for LLMs, addressing fundamental stability issues associated with negative rewards. By leveraging tapered importance sampling and integrating both positive and negative examples into the learning process, TOPR achieves significant performance improvements on benchmarks such as GSM8K and MATH. The method’s ability to match the performance of significantly larger models using relatively smaller architectures further underscores its practical applicability. Future implementations of TOPR should focus on optimizing hyperparameters and scaling the approach in distributed environments to fully harness its potential in high-stakes LLM fine-tuning tasks.