- The paper introduces a systematic framework for designing KL-regularized policy gradient algorithms aimed at improving LLM reasoning in off-policy reinforcement learning.
- It derives both fully differentiable and REINFORCE-style surrogate loss functions for multiple KL divergence formulations, offering clear guidelines for efficient policy optimization.
- Extensive experiments on mathematical reasoning benchmarks validate the framework's stability and competitive performance against strong baseline methods.
This paper introduces Regularized Policy Gradient (RPG), a systematic framework for designing and analyzing KL-regularized policy gradient algorithms specifically for enhancing the reasoning capabilities of LLMs in an online, off-policy reinforcement learning setting (2505.17508).
The core idea of RPG is to systematically explore the design space of KL regularization in policy gradient objectives, considering different formulations of the KL divergence and their corresponding gradient estimators and surrogate loss functions suited to off-policy optimization. The framework operates iteratively, using the policy from the previous iteration ($\pi_{\text{old}}$) both as the reference for KL regularization and as the behavior policy for sampling data.
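As a reference point for the derivations summarized below, the regularized objective and the KL variants can be written in generic form (a sketch using standard definitions; the paper's exact per-token decomposition, normalization, and UKL/URKL definitions may differ in detail):

$$
\mathcal{J}(\theta) \;=\; \mathbb{E}_{x \sim \pi_{\text{old}}}\!\left[\frac{\pi_\theta(x)}{\pi_{\text{old}}(x)}\, r(x)\right] \;-\; \beta\, D\big(\pi_{\text{old}}, \pi_\theta\big),
$$

where the regularizer $D$ can be the forward KL $\mathrm{KL}(\pi_{\text{old}} \,\|\, \pi_\theta) = \mathbb{E}_{x \sim \pi_{\text{old}}}\!\left[\log \frac{\pi_{\text{old}}(x)}{\pi_\theta(x)}\right]$, the reverse KL $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{old}}) = \mathbb{E}_{x \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(x)}{\pi_{\text{old}}(x)}\right]$, or an unnormalized (generalized) KL of the form

$$
\mathrm{UKL}(p \,\|\, q) \;=\; \sum_x \Big( p(x)\,\log \frac{p(x)}{q(x)} - p(x) + q(x) \Big)
\;=\; \mathbb{E}_{x \sim p}\big[\rho(x) - 1 - \log \rho(x)\big], \qquad \rho(x) = \frac{q(x)}{p(x)},
$$

whose pointwise form is the familiar k3 estimator.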
Key aspects and contributions of the RPG framework include:
- Systematic Derivation of Gradients and Losses: The paper derives off-policy policy gradients and corresponding surrogate loss functions for objectives that combine the expected reward with a KL regularization term. This is done for:
- Forward KL divergence ($\mathrm{KL}(\pi_{\text{old}} \,\|\, \pi_\theta)$).
- Reverse KL divergence ($\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{old}})$).
- Unnormalized versions of both the forward and reverse KL (UKL and URKL), which account for potential differences in total mass between the two distributions and connect to the k3 KL estimator.
- Two Styles of Surrogate Loss Functions: For each KL formulation, the paper provides derivations for two types of surrogate loss functions for gradient-based optimization:
- Fully Differentiable Losses: These are derived so that the gradient of the loss equals the negative of the target objective's gradient, meaning minimizing the loss maximizes the regularized objective (Table 1). This mirrors common practice in modern off-policy RL methods such as PPO, where the loss itself accounts for importance sampling and the KL term.
- REINFORCE-Style Losses: These losses are structured like the classic REINFORCE algorithm and use the stop-gradient operator $\operatorname{sg}[\cdot]$ (Table 2, Appendix C). They are designed so that automatic differentiation yields the correct policy gradient estimate by treating the "advantage" term (which combines reward and KL components) as a fixed coefficient multiplying $\nabla_\theta \log \pi_\theta$; a minimal sketch of both loss styles appears after this list.
- Off-Policy Estimation: All derivations are performed in an off-policy setting, using importance sampling with samples collected from the old policy ($\pi_{\text{old}}$). This allows more efficient training by reusing data.
- Analysis of Existing Methods: The systematic derivations lead to insights about existing methods:
- A theoretical inconsistency is identified in GRPO's KL estimation: its KL penalty gradient omits the importance weight required in the off-policy setting.
- REINFORCE++'s KL term is analyzed, showing that its form and placement act more like fixed reward shaping based on the old and SFT policies than a direct, dynamic regularization of the current policy $\pi_\theta$.
- Practical Implementation Details:
- The framework uses an iterative process: within each iteration the policy is updated over multiple gradient steps, and the resulting policy becomes the old policy for the next iteration (a schematic loop is sketched at the end of this summary).
- The implementation leverages modern RL techniques such as PPO-style clipping (specifically Dual-Clip), baseline subtraction for variance reduction, and efficient data sampling strategies inspired by DAPO.
- A practical advantage noted is that, by pre-computing $\pi_{\text{old}}$ probabilities, training the current policy $\pi_\theta$ requires only one model in GPU memory, improving efficiency over methods that need simultaneous access to multiple models.
- The paper investigates the impact of different optimizers, finding that Schedule-Free AdamW can offer improved stability in some cases (Appendix E).
- Empirical Validation: Extensive experiments are conducted on mathematical reasoning benchmarks (AMC23, AIME24, AIME25) using Qwen2.5 models. The results (Table 3, Figures 2, 3, Appendix E) demonstrate that the proposed RPG methods (both fully differentiable and REINFORCE-style variants across different KL forms) achieve stable training dynamics and exhibit performance that is competitive with or superior to strong baselines like GRPO, REINFORCE++, and DAPO. The experiments also include ablation studies on clipping parameters for the REINFORCE-style variants, revealing sensitivity, particularly for models that are already well pre-trained.
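To make the two loss styles concrete, here is a minimal PyTorch sketch of the general pattern (hypothetical tensor names; a k3-style estimate of the forward KL is used for illustration, and the stop-gradient coefficient shown is schematic rather than the paper's exact Table 2 forms):

```python
import torch

def fully_differentiable_loss(logp_theta, logp_old, reward, beta):
    """Fully differentiable surrogate: importance-weighted reward minus a
    differentiable KL penalty, both backpropagated through directly.

    logp_theta: log pi_theta(x) for sequences x sampled from pi_old (requires grad)
    logp_old:   precomputed log pi_old(x) (no grad)
    reward:     per-sequence reward
    beta:       KL coefficient
    """
    ratio = torch.exp(logp_theta - logp_old)          # pi_theta / pi_old
    # k3-style estimate of KL(pi_old || pi_theta) from pi_old samples:
    # rho - 1 - log(rho) with rho = pi_theta / pi_old
    kl_est = ratio - 1.0 - (logp_theta - logp_old)
    return -(ratio * reward - beta * kl_est).mean()

def reinforce_style_loss(logp_theta, logp_old, reward, beta):
    """REINFORCE-style surrogate: a stop-gradient "advantage" coefficient
    multiplies log pi_theta, so autodiff only differentiates the log-prob.
    The coefficient below is illustrative; the paper derives the exact
    coefficient for each KL formulation so that autodiff recovers the
    correct regularized policy gradient."""
    ratio = torch.exp(logp_theta - logp_old)
    kl_est = ratio - 1.0 - (logp_theta - logp_old)
    advantage = (ratio * reward - beta * kl_est).detach()   # sg(.)
    return -(advantage * logp_theta).mean()
```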
In summary, RPG provides a principled framework for understanding and implementing KL-regularized policy gradient methods for LLM reasoning, offering a range of theoretically grounded objective and loss function choices with empirical validation demonstrating their effectiveness and stability in practice. The code implementation is open-source, facilitating further research and application.
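Finally, the overall iterative procedure might be organized roughly as follows (a schematic sketch, not the released implementation; `sample_responses`, `compute_logprobs`, and `score_responses` are hypothetical helpers, and `fully_differentiable_loss` refers to the sketch above):

```python
import torch

def rpg_training_loop(policy, optimizer, prompts, num_iterations, inner_steps, beta):
    """Schematic RPG-style loop: each iteration freezes the current policy as
    pi_old (behavior policy and KL reference), then runs several off-policy
    gradient steps on the sampled batch before refreshing pi_old."""
    for _ in range(num_iterations):
        # Sample from the current policy, which now plays the role of pi_old,
        # and cache its log-probs so pi_old never needs to be held as a
        # second model in GPU memory.
        with torch.no_grad():
            batch = sample_responses(policy, prompts)      # x ~ pi_old
            logp_old = compute_logprobs(policy, batch)
            rewards = score_responses(batch)               # e.g. verified answers

        # Off-policy updates of pi_theta against the frozen pi_old data.
        for _ in range(inner_steps):
            logp_theta = compute_logprobs(policy, batch)   # with gradients
            loss = fully_differentiable_loss(logp_theta, logp_old, rewards, beta)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # The updated policy implicitly becomes pi_old at the next iteration.
```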