Regularized Policy Gradient (RPG)

Updated 6 November 2025
  • Regularized Policy Gradient (RPG) is a reinforcement learning framework that incorporates entropy or KL divergence to balance exploration and policy stability.
  • It unifies policy-search with value-based methods, delivering improved convergence guarantees and performance in tasks from Atari games to continuous control.
  • Extensions like KL-regularization and trust-region approaches enhance RPG's applicability in multi-objective scenarios and large-scale model fine-tuning.

Regularized Policy Gradient (RPG) methods form a principled class of algorithms in reinforcement learning (RL) that introduce explicit regularization—most often entropy or Kullback-Leibler (KL) divergence terms—into the policy optimization objective. RPG methods address limitations of classical policy gradient algorithms, enhance stability and exploration, allow for algorithmic unification with value-based approaches, and enable convergence guarantees in complex and practical RL settings.

1. Mathematical Foundations and Core Principle

RPG methods augment the standard RL objective with a regularization term that penalizes certain aspects of the policy, typically aiming to encourage sufficient exploration, prevent policy collapse, or bias policy iterates toward desirable regions. The common entropy-regularized objective for Markov Decision Processes (MDPs) is

$$J_{\tau}(\pi) = \mathbb{E}_{\pi}\left[\sum_{t=0}^{\infty} \gamma^t \bigl( r(s_t, a_t) - \tau \log \pi(a_t \mid s_t) \bigr)\right],$$

where $\tau > 0$ is the entropy regularization coefficient. The regularized policy gradient is

$$\Delta\theta \propto \mathbb{E}_{s,a}\left[ Q^{\pi}(s,a)\, \nabla_\theta \log \pi(s,a) + \tau\, \nabla_\theta H^{\pi}(s) \right],$$

with $H^{\pi}(s) = -\sum_a \pi(s,a)\log \pi(s,a)$.
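
As a concrete illustration of this update, the following NumPy sketch performs one entropy-regularized policy-gradient step for a tabular softmax policy; the function name, sample format, and numbers are illustrative rather than taken from any cited implementation.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def entropy_reg_pg_step(theta, samples, tau, lr):
    """One ascent step on the entropy-regularized objective for a tabular
    softmax policy pi(.|s) = softmax(theta[s]).  `samples` is a list of
    (s, a, q_hat) tuples, where q_hat estimates Q^pi(s, a); all names and
    the sample format are illustrative."""
    grad = np.zeros_like(theta)
    for s, a, q_hat in samples:
        pi = softmax(theta[s])
        log_pi = np.log(pi + 1e-12)
        # d log pi(a|s) / d theta[s] = one_hot(a) - pi
        g_log_pi = -pi.copy()
        g_log_pi[a] += 1.0
        # d H^pi(s) / d theta[s] = -pi * (log pi + H^pi(s))
        H = -(pi * log_pi).sum()
        g_entropy = -pi * (log_pi + H)
        grad[s] += q_hat * g_log_pi + tau * g_entropy
    return theta + lr * grad / max(len(samples), 1)

# Example: 3 states, 2 actions, a few made-up transitions.
theta = np.zeros((3, 2))
samples = [(0, 1, 0.8), (1, 0, -0.2), (2, 1, 0.5)]
theta = entropy_reg_pg_step(theta, samples, tau=0.1, lr=0.5)
```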

A fundamental result (O'Donoghue et al., 2016) is that, at the fixed point of the entropy-regularized objective, the optimal policy is a softmax over the advantage:

$$\pi^*(s,a) \propto \exp\!\left( \frac{A^{\pi^*}(s,a)}{\tau} \right).$$

This relationship enables invertibility between policy and value, underpins many theoretical and algorithmic advances, and facilitates connections to value-based RL methods.
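
A short numerical check of this invertibility, with made-up advantage values:

```python
import numpy as np

A = np.array([1.3, 0.2, -0.5, 0.2])                   # A^{pi*}(s, .), made-up values
tau = 0.5
pi_star = np.exp(A / tau); pi_star /= pi_star.sum()   # softmax(A / tau)
# tau * log pi* recovers the advantage up to a per-state constant,
# so the difference below is a constant vector.
print(tau * np.log(pi_star) - A)
```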

2. Algorithmic Variants and Extensions

RPG encompasses a spectrum of methods, depending on the nature of regularization and the algorithmic context:

  • Entropy-Regularized Policy Gradient: Classic setting as above, often serving as the basis for actor-critic methods and maximum entropy RL (e.g., Soft Actor-Critic).
  • KL-Regularized Policy Gradient: Widely applied in RL fine-tuning and LLM RLHF, regularizing the current policy toward a reference (or prior) policy (a minimal loss sketch follows this list):

$$J^{\mathrm{KL}}(\pi) = \mathbb{E}_{\pi}\left[ r(s,a) \right] - \beta\, \mathrm{KL}\!\left(\pi(\cdot \mid s) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid s)\right)$$

(Zhang et al., 23 May 2025, Wang et al., 14 Mar 2025)

  • Composite Regularization (L2, custom): For example, L2-regularization in multi-armed bandits (Anita et al., 9 Feb 2024) or Riemannian metric tensor regularization controlling the Hessian trace of the policy gradient vector field (Chen et al., 2023).
  • Proximal Regularized PG (Trust-Region): e.g., mirror-descent or natural policy gradient with a KL proximity penalty or clipping, yielding variants such as PPO and anchor-changing NPG for stability (Zhou et al., 2022, Zhong et al., 8 Aug 2025).
  • Cubic Regularization and Higher-Order Regularizers: For escape from saddle points and convergence to second-order stationary points (SOSP) (Maniyar et al., 2023).
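
For the KL-regularized variant above, the following PyTorch-style sketch shows one way such a loss can be written; tensor names and shapes are illustrative, and the precise surrogate and stop-gradient handling (discussed later in the article) varies across the cited works.

```python
import torch
import torch.nn.functional as F

def kl_regularized_pg_loss(logits, ref_logits, actions, rewards, beta):
    """Sketch of a KL-regularized policy-gradient loss: maximize
    E[r] - beta * KL(pi || pi_ref).  `logits` and `ref_logits` have shape
    (batch, num_actions); `actions` (long) and `rewards` have shape (batch,).
    All names/shapes are illustrative."""
    log_pi = F.log_softmax(logits, dim=-1)
    log_ref = F.log_softmax(ref_logits, dim=-1).detach()   # frozen reference
    # REINFORCE-style term on the sampled actions (no baseline, for brevity).
    chosen = log_pi.gather(-1, actions.unsqueeze(-1)).squeeze(-1)
    pg_term = -(rewards * chosen).mean()
    # Analytic forward KL(pi || pi_ref), averaged over the batch.
    kl = (log_pi.exp() * (log_pi - log_ref)).sum(-1).mean()
    return pg_term + beta * kl
```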

A unifying theme is that the form and scale of regularization fundamentally govern the exploration-exploitation balance and the convergence landscape, and often enable off-policy extensions or algorithmic blends (e.g., with Q-learning (O'Donoghue et al., 2016)).
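
A quick way to see the effect of the regularization scale is to sweep the temperature and inspect the induced softmax policy; the advantage values below are made up.

```python
import numpy as np

# As tau grows, the soft-optimal policy softmax(A / tau) spreads its mass
# (more exploration); as tau -> 0 it collapses onto the greedy action.
A = np.array([1.0, 0.6, 0.0, -0.4])          # made-up advantages
for tau in [0.05, 0.2, 1.0, 5.0]:
    pi = np.exp(A / tau)
    pi /= pi.sum()
    entropy = -(pi * np.log(pi)).sum()
    print(f"tau={tau:<4}  pi={np.round(pi, 3)}  entropy={entropy:.3f}")
```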

3. Convergence Theory and Optimization Landscape

Entropy or KL regularization often induces strong convexity or gradient domination properties in the policy optimization landscape, with substantial theoretical consequences:

  • Global Linear Convergence: Entropy-regularized softmax policy gradient and natural policy gradient methods admit global linear convergence for a wide range of step sizes, improving practical robustness (Liu et al., 4 Apr 2024); a toy numerical illustration follows this list.
  • Exponential/Quadratic Convergence: Under appropriate regimes (including natural PG or soft policy iteration), quadratic local rates are achievable (Liu et al., 4 Apr 2024, Guo et al., 2023, Diaz et al., 3 Oct 2025).
  • Mean-Field Regime and Nonlinear Dynamics: With neural parameterization, gradient flows in measure space (via Fokker-Planck-Kolmogorov PDEs) can yield unique stationary solutions and exponential convergence, provided regularization in both policy and parameter spaces is sufficient (Kerimkulov et al., 2022).
  • Robustness and Uniqueness: Strong regularization guarantees uniqueness of optimal policies, stability under initialization, and sensitivity control (Kerimkulov et al., 2022, Anita et al., 9 Feb 2024, Guo et al., 2023).
  • Primal-Dual Extensions for Constraints: RPG in primal-dual frameworks ensures last-iterate convergence and constraint satisfaction in constrained MDPs (Ding et al., 2023), outperforming classical Lagrangian dual methods in both theory and application.
  • Sample Complexity and Stability: Hybrid and high-order regularizers can further enhance sample efficiency and stability, reducing the trajectory complexity for approximate optima (Pham et al., 2020, Maniyar et al., 2023).
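
The toy script below (a single-state instance with made-up rewards, intended only as a sanity check rather than a reproduction of the cited analyses) runs exact entropy-regularized softmax policy gradient and prints the regularized optimality gap, which shrinks toward zero.

```python
import numpy as np

# Toy single-state (bandit) instance with made-up rewards: exact
# entropy-regularized softmax policy gradient; the regularized optimality
# gap J* - J(theta_t) shrinks toward zero.
r = np.array([1.0, 0.5, 0.0, -0.5])
tau, lr = 0.5, 0.2
theta = np.zeros_like(r)

J_star = tau * np.log(np.exp(r / tau).sum())     # J* = tau * logsumexp(r / tau)

for t in range(3001):
    pi = np.exp(theta - theta.max()); pi /= pi.sum()
    log_pi = np.log(pi)
    H = -(pi * log_pi).sum()
    J = pi @ r + tau * H
    if t % 500 == 0:
        print(f"iter {t:4d}   optimality gap {J_star - J:.3e}")
    # Exact gradient of pi.r + tau*H(pi) w.r.t. the softmax logits.
    grad = pi * (r - pi @ r) - tau * pi * (log_pi + H)
    theta += lr * grad
```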

4. Algorithmic Integration and Applications

RPG serves as an organizing principle for a broad suite of practical algorithms:

  • PGQL: Combines regularized policy gradient with off-policy Q-learning by observing that advantage-based softmax policies induce implicit Q-values; integrating Bellman updates improves data efficiency and stability and unifies value- and policy-based RL (O'Donoghue et al., 2016); a sketch of the implicit-Q identity follows this list.
  • PG for Robust MDPs: Closed-form robust gradients enable efficient, scalable robust PG under rectangular uncertainty, matching non-robust PG in complexity and scaling to large state/action spaces (Kumar et al., 2023).
  • Application to Multi-Objective and Constrained RL: Anchor-changing KL-regularized NPG achieves fast convergence for multi-objective and constrained RL by leveraging regularization-induced structure (Zhou et al., 2022, Ding et al., 2023).
  • LLM Fine-Tuning and Large-Scale RL: RPG provides a mathematical framework for implementing and analyzing KL-regularized objectives in LLMs, specifying the precise connection to surrogate and stop-gradient losses, and clarifying the importance of correct importance weighting in off-policy settings (Zhang et al., 23 May 2025).
  • Sample-Based Model-Free RPG: Zero-order and finite-difference approximations allow RPG, and its convergence results, to extend seamlessly to settings with unknown dynamics, including LQC with multiplicative noise (Diaz et al., 3 Oct 2025).
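
The following NumPy sketch illustrates the implicit-Q identity underlying PGQL; array shapes and values are made up, and the Bellman-target coupling is only indicated in a comment.

```python
import numpy as np

def implicit_q_values(pi, v, tau):
    """PGQL-style identification (a sketch): near the entropy-regularized
    fixed point, advantages can be read off the policy as
    A_tilde(s,a) = tau * (log pi(a|s) + H^pi(s)), and Q_tilde = V + A_tilde.
    `pi` has shape (num_states, num_actions); `v` has shape (num_states,)."""
    log_pi = np.log(pi + 1e-12)
    H = -(pi * log_pi).sum(axis=1, keepdims=True)
    return v[:, None] + tau * (log_pi + H)

# Consistency check: if pi = softmax(A / tau) and A has zero pi-weighted
# mean per state, the implicit advantages recover A exactly.
A = np.array([[0.4, -0.1, -0.3]])                 # made-up advantages
tau, v = 0.2, np.array([1.0])
pi = np.exp(A / tau); pi /= pi.sum(axis=1, keepdims=True)
A_centered = A - (pi * A).sum(axis=1, keepdims=True)
print(implicit_q_values(pi, v, tau) - v[:, None] - A_centered)   # ~ zeros
# In PGQL, these Q_tilde estimates are additionally pushed toward one-step
# Bellman targets, coupling the actor-critic and Q-learning updates.
```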

5. Empirical Performance and Benchmarks

Experimental results across a wide range of RL settings consistently highlight the benefits of RPG:

  • Atari Games (PGQL): PGQL outperforms both A3C (vanilla actor-critic) and Q-learning in human-normalized scores, learning speed, and stability on the Atari benchmark. PGQL achieved the highest mean and median scores, with best scores in 34 games out of the full Atari suite (O'Donoghue et al., 2016).
  • Continuous Control/Locomotion: Reparameterization Proximal Policy Optimization (RPO), a stabilized RPG variant, attains the highest sample efficiency and final returns on DFlex and Rewarped benchmarks, outperforming SAPO, SHAC, and PPO (Zhong et al., 8 Aug 2025).
  • Multi-Armed Bandit: L2-regularized policy gradient converges in mean-square to the unique optimum under appropriate schedules, with rate $O(1/(t\log t))$, while regularization ensures stability in ill-conditioned regimes (Anita et al., 9 Feb 2024).
  • Robust RL: RPG for robust MDPs (with rectangular uncertainty) achieves running time 1–3 orders of magnitude faster than LP-based robust PG, scaling similarly to standard PG (Kumar et al., 2023).
  • Large-Scale LLM RL: RPG-REINFORCE with properly corrected KL and RPG-Style Clip improves mathematical reasoning accuracy by up to +6 percentage points over DAPO on AIME24/AIME25 (Zhang et al., 23 May 2025).

Comparative empirical summaries:

| Method | Mean Score (Atari, %) | Sample Efficiency (Locomotion) | Robust/Constrained RL | LLM RL Fine-tune |
|---|---|---|---|---|
| A3C | 636.8 | Moderate | N/A | N/A |
| Q-learning | 756.3 | Moderate | N/A | N/A |
| PGQL (RPG) | 877.2 | N/A | N/A | N/A |
| PPO (model-free) | Low | Lower | N/A | Lower than RPG |
| RPO (RPG) | N/A | Highest | N/A | N/A |
| SAPO/SHAC | N/A | High | N/A | N/A |
| RPG-RLHF (LLM) | N/A | N/A | N/A | +6% > DAPO |

6. Generalizations and Theoretical Connections

RPG frameworks admit further generalization and deep connections to broader RL and optimization paradigms:

  • Equivalence to Advantage Learning and Value Fitting: Actor-critic (regularized policy gradient) and value fitting (Q/SARSA) with Boltzmann policies are mathematically equivalent under certain parameterizations (O'Donoghue et al., 2016).
  • Interpreted as Advantage Function Learning: RPG optimizes advantage regression; at optimum, the policy-implied Q-values are an orthogonal projection of true Q-values in the score function space (O'Donoghue et al., 2016).
  • Mean-Field and Gradient-Flow Formulations: With neural parametrization and mean-field analysis, entropy regularization induces uniqueness and exponential convergence in 2-Wasserstein metric, enabling global convergence even in function-approximate and continuous control settings (Kerimkulov et al., 2022).
  • Residual Policy Gradient: Customization and fine-tuning objectives with RPG generalize KL-regularized RL, providing a reward-level view for balancing exploitation of prior policies and adaptation to new tasks (Wang et al., 14 Mar 2025).
  • Second-Order and Geometry-Aware Extensions: Deep metric tensor regularization and cubic regularization extend RPG to Riemannian and higher-order settings for improved stability, performance, and convergence to higher-order optima (Chen et al., 2023, Maniyar et al., 2023).

7. Limitations, Ongoing Directions, and Practical Considerations

While RPG methods provide structural and empirical advances, several considerations guide ongoing research and practice:

  • Choice and Scale of Regularization: Excessive regularization may bias solutions, while insufficient regularization risks instability; adaptive parameter tuning and sensitivity analysis are active areas (Kerimkulov et al., 2022).
  • Computational Overheads: Geometry-aware (e.g., metric tensor) and second-order variants introduce computational complexity, motivating the development of efficient approximations.
  • Off-Policy Corrections: Correct importance weighting remains critical in large-scale and off-policy RPG-like implementations, especially for LLM RL (Zhang et al., 23 May 2025).
  • Scalability for Multi-Agent and Robust RL Tasks: RPG's structure supports extensions to Nash equilibrium computation in multi-agent games (Yu et al., 21 Oct 2025) and to robust RL formulations (Kumar et al., 2023).

In sum, Regularized Policy Gradient methods offer a unified mathematical and algorithmic framework that bridges policy-search and value-based RL, introduces provable stabilizing structure to optimization, enables robust exploration, and supports large-scale, off-policy, and high-dimensional applications with strong theoretical guarantees and empirical results.
