- The paper introduces a systematic framework for designing KL-regularized policy gradient algorithms aimed at improving LLM reasoning in off-policy reinforcement learning.
- It derives both fully differentiable and REINFORCE-style surrogate loss functions for multiple KL divergence formulations, offering clear guidelines for efficient policy optimization.
- Extensive experiments on mathematical reasoning benchmarks validate the framework's stability and competitive performance against strong baseline methods.
This paper introduces Regularized Policy Gradient (RPG), a systematic framework for designing and analyzing KL-regularized policy gradient algorithms specifically for enhancing the reasoning capabilities of LLMs in an online, off-policy reinforcement learning setting (2505.17508).
The core idea of RPG is to systematically explore the design space of KL regularization in policy gradient objectives, considering different formulations of the KL divergence and their corresponding gradient estimators and surrogate loss functions suited to off-policy optimization. The framework operates iteratively, using the policy from the previous iteration ($\pi_{\text{old}}$) both as the reference for KL regularization and as the behavior policy for sampling data.
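As a reference point for the derivations summarized below, the regularized objective and the KL variants can be written in generic form (a sketch using standard definitions; the paper's exact per-token decomposition, normalization, and UKL/URKL definitions may differ in detail):

$$
\mathcal{J}(\theta) \;=\; \mathbb{E}_{x \sim \pi_{\text{old}}}\!\left[\frac{\pi_\theta(x)}{\pi_{\text{old}}(x)}\, r(x)\right] \;-\; \beta\, D\big(\pi_{\text{old}}, \pi_\theta\big),
$$

where the regularizer $D$ can be the forward KL $\mathrm{KL}(\pi_{\text{old}} \,\|\, \pi_\theta) = \mathbb{E}_{x \sim \pi_{\text{old}}}\!\left[\log \frac{\pi_{\text{old}}(x)}{\pi_\theta(x)}\right]$, the reverse KL $\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{old}}) = \mathbb{E}_{x \sim \pi_\theta}\!\left[\log \frac{\pi_\theta(x)}{\pi_{\text{old}}(x)}\right]$, or an unnormalized (generalized) KL of the form

$$
\mathrm{UKL}(p \,\|\, q) \;=\; \sum_x \Big( p(x)\,\log \frac{p(x)}{q(x)} - p(x) + q(x) \Big)
\;=\; \mathbb{E}_{x \sim p}\big[\rho(x) - 1 - \log \rho(x)\big], \qquad \rho(x) = \frac{q(x)}{p(x)},
$$

whose pointwise form is the familiar k3 estimator.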
Key aspects and contributions of the RPG framework include:
- Systematic Derivation of Gradients and Losses: The paper derives off-policy policy gradients and corresponding surrogate loss functions for objectives that combine the expected reward with a KL regularization term. This is done for:
- Forward KL divergence ($\mathrm{KL}(\pi_{\text{old}} \,\|\, \pi_\theta)$).
- Reverse KL divergence ($\mathrm{KL}(\pi_\theta \,\|\, \pi_{\text{old}})$).
- Unnormalized versions of both the forward and reverse KL (UKL and URKL), which account for potential differences in total mass between the two distributions and connect to the k3 KL estimator.
- Two Styles of Surrogate Loss Functions: For each KL formulation, the paper provides derivations for two types of surrogate loss functions for gradient-based optimization:
- Fully Differentiable Losses: These are derived so that the gradient of the loss equals the negative of the target objective's gradient, meaning minimizing the loss maximizes the regularized objective (Table 1). This mirrors common practice in modern off-policy RL methods such as PPO, where the loss itself accounts for importance sampling and the KL term.
- REINFORCE-Style Losses: These losses are structured like the classic REINFORCE algorithm and use the stop-gradient operator $\operatorname{sg}[\cdot]$ (Table 2, Appendix C). They are designed so that automatic differentiation yields the correct policy gradient estimate by treating the "advantage" term (which combines reward and KL components) as a fixed coefficient multiplying $\nabla_\theta \log \pi_\theta$; a minimal sketch of both loss styles appears after this list.
- Off-Policy Estimation: All derivations are performed in an off-policy setting, using importance sampling with samples collected from the old policy ($\pi_{\text{old}}$). This allows more efficient training by reusing data.
- Analysis of Existing Methods: The systematic derivations lead to insights about existing methods:
- A theoretical inconsistency is identified in GRPO's KL estimation: its KL penalty gradient omits the importance weight required in the off-policy setting.
- REINFORCE++'s KL term is analyzed, showing that its form and placement act more like fixed reward shaping based on the old and SFT policies than a direct, dynamic regularization of the current policy $\pi_\theta$.
- Practical Implementation Details:
- The framework uses an iterative process: within each iteration the policy is updated over multiple gradient steps, and the resulting policy becomes the old policy for the next iteration (a schematic loop is sketched at the end of this summary).
- The implementation leverages modern RL techniques such as PPO-style clipping (specifically Dual-Clip), baseline subtraction for variance reduction, and efficient data sampling strategies inspired by DAPO.
- A practical advantage noted is that, by pre-computing $\pi_{\text{old}}$ probabilities, training the current policy $\pi_\theta$ requires only one model in GPU memory, improving efficiency over methods that need simultaneous access to multiple models.
- The paper investigates the impact of different optimizers, finding that Schedule-Free AdamW can offer improved stability in some cases (Appendix E).
- Empirical Validation: Extensive experiments are conducted on mathematical reasoning benchmarks (AMC23, AIME24, AIME25) using Qwen2.5 models. The results (Table 3, Figures 2, 3, Appendix E) demonstrate that the proposed RPG methods (both fully differentiable and REINFORCE-style variants across different KL forms) achieve stable training dynamics and exhibit performance that is competitive with or superior to strong baselines like GRPO, REINFORCE++, and DAPO. The experiments also include ablation studies on clipping parameters for the REINFORCE-style variants, revealing sensitivity, particularly for models that are already well pre-trained.
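To make the two loss styles concrete, here is a minimal PyTorch sketch of the general pattern (hypothetical tensor names; a k3-style estimate of the forward KL is used for illustration, and the stop-gradient coefficient shown is schematic rather than the paper's exact Table 2 forms):

```python
import torch

def fully_differentiable_loss(logp_theta, logp_old, reward, beta):
    """Fully differentiable surrogate: importance-weighted reward minus a
    differentiable KL penalty, both backpropagated through directly.

    logp_theta: log pi_theta(x) for sequences x sampled from pi_old (requires grad)
    logp_old:   precomputed log pi_old(x) (no grad)
    reward:     per-sequence reward
    beta:       KL coefficient
    """
    ratio = torch.exp(logp_theta - logp_old)          # pi_theta / pi_old
    # k3-style estimate of KL(pi_old || pi_theta) from pi_old samples:
    # rho - 1 - log(rho) with rho = pi_theta / pi_old
    kl_est = ratio - 1.0 - (logp_theta - logp_old)
    return -(ratio * reward - beta * kl_est).mean()

def reinforce_style_loss(logp_theta, logp_old, reward, beta):
    """REINFORCE-style surrogate: a stop-gradient "advantage" coefficient
    multiplies log pi_theta, so autodiff only differentiates the log-prob.
    The coefficient below is illustrative; the paper derives the exact
    coefficient for each KL formulation so that autodiff recovers the
    correct regularized policy gradient."""
    ratio = torch.exp(logp_theta - logp_old)
    kl_est = ratio - 1.0 - (logp_theta - logp_old)
    advantage = (ratio * reward - beta * kl_est).detach()   # sg(.)
    return -(advantage * logp_theta).mean()
```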
In summary, RPG provides a principled framework for understanding and implementing KL-regularized policy gradient methods for LLM reasoning, offering a range of theoretically grounded objective and loss function choices with empirical validation demonstrating their effectiveness and stability in practice. The code implementation is open-source, facilitating further research and application.
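Finally, the overall iterative procedure might be organized roughly as follows (a schematic sketch, not the released implementation; `sample_responses`, `compute_logprobs`, and `score_responses` are hypothetical helpers, and `fully_differentiable_loss` refers to the sketch above):

```python
import torch

def rpg_training_loop(policy, optimizer, prompts, num_iterations, inner_steps, beta):
    """Schematic RPG-style loop: each iteration freezes the current policy as
    pi_old (behavior policy and KL reference), then runs several off-policy
    gradient steps on the sampled batch before refreshing pi_old."""
    for _ in range(num_iterations):
        # Sample from the current policy, which now plays the role of pi_old,
        # and cache its log-probs so pi_old never needs to be held as a
        # second model in GPU memory.
        with torch.no_grad():
            batch = sample_responses(policy, prompts)      # x ~ pi_old
            logp_old = compute_logprobs(policy, batch)
            rewards = score_responses(batch)               # e.g. verified answers

        # Off-policy updates of pi_theta against the frozen pi_old data.
        for _ in range(inner_steps):
            logp_theta = compute_logprobs(policy, batch)   # with gradients
            loss = fully_differentiable_loss(logp_theta, logp_old, rewards, beta)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        # The updated policy implicitly becomes pi_old at the next iteration.
```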