Relative Entropy Regularized Policy Iteration
- Relative Entropy Regularization is a technique in RL that uses KL divergence penalties to control policy updates and balance exploration and exploitation.
- It integrates policy evaluation, non-parametric improvement, and parametric projection steps to ensure stable and efficient convergence in high-dimensional environments.
- Implementations like REPPO and MPO demonstrate enhanced sample efficiency and robust theoretical guarantees by adapting regularization strengths during learning.
Relative entropy regularized policy iteration (RERPI) refers to a class of algorithms and theoretical frameworks for reinforcement learning (RL) and stochastic control in which policy update steps are constrained or penalized by the Kullback-Leibler (KL) divergence between the new and old policies, or between the current policy and a fixed prior. This approach is central to many modern RL algorithms, including on-policy and off-policy actor-critic methods, entropy-regularized policy iteration for continuous-time ergodic control, and robust trust-region optimization. Relative entropy regularization provides control over the exploration-exploitation trade-off, improves stability, and enables strong theoretical guarantees on convergence and robustness in highly stochastic or high-dimensional environments.
1. Fundamental Formulation and Learning Objectives
Relative entropy regularization modifies the classic Bellman or policy iteration schemes by introducing a KL divergence penalty that constrains the divergence between consecutive policies. The general learning objective for a stochastic policy $\pi_\theta$ is to maximize a regularized expected return

$$J(\pi_\theta) = \mathbb{E}_{\pi_\theta}\Big[\sum_{t} \gamma^{t}\, r(s_t, a_t)\Big],$$

subject to a KL constraint limiting the deviation from a reference policy $\pi_{\mathrm{ref}}$,

$$\mathbb{E}_{s}\big[D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid s)\big)\big] \le \varepsilon_{\mathrm{KL}},$$

and enforcing a minimum entropy,

$$\mathbb{E}_{s}\big[\mathcal{H}\big(\pi_\theta(\cdot \mid s)\big)\big] \ge \mathcal{H}_{\min}.$$
This constrained primal problem can be equivalently transformed into an unconstrained penalized optimization with log-space Lagrange multipliers $\alpha = e^{\lambda_\alpha}$ (for entropy) and $\beta = e^{\lambda_\beta}$ (for KL):

$$\mathcal{L}(\theta) = \mathbb{E}_{(s,a)\sim\mathcal{B}}\Big[Q^{\pi}(s,a) + \alpha\,\mathcal{H}\big(\pi_\theta(\cdot \mid s)\big) - \beta\, D_{\mathrm{KL}}\big(\pi_\theta(\cdot \mid s)\,\|\,\pi_{\mathrm{ref}}(\cdot \mid s)\big)\Big],$$

where $Q^{\pi}$ is the critic estimate and $\mathcal{B}$ is a minibatch from the behavior policy (Voelcker et al., 15 Jul 2025).
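As a concrete illustration, the following minimal PyTorch-style sketch evaluates this penalized objective for a categorical policy; the function name, argument shapes, and the use of closed-form entropy and KL are expository assumptions, not the reference implementation of any cited method.

```python
from torch.distributions import Categorical, kl_divergence

def penalized_objective(logits, ref_logits, q_values, log_alpha, log_beta):
    """Evaluate the unconstrained (penalized) objective for a categorical policy.

    logits, ref_logits: (batch, num_actions) parameters of pi_theta and pi_ref
    q_values: (batch, num_actions) critic estimates Q(s, a)
    log_alpha, log_beta: log-space Lagrange multipliers (scalar tensors)
    """
    pi, pi_ref = Categorical(logits=logits), Categorical(logits=ref_logits)
    expected_q = (pi.probs * q_values).sum(-1)   # E_{a~pi}[Q(s, a)]
    entropy = pi.entropy()                       # H(pi(.|s))
    kl = kl_divergence(pi, pi_ref)               # KL(pi(.|s) || pi_ref(.|s))
    alpha, beta = log_alpha.exp(), log_beta.exp()
    # Penalized objective: Q + alpha * H - beta * KL, averaged over the batch
    return (expected_q + alpha * entropy - beta * kl).mean()
```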
2. Algorithmic Structures and Policy Update Mechanisms
The standard RERPI workflow is structured in three main steps, which admit multiple algorithmic instantiations:
- Policy Evaluation: Update the critic by minimizing a loss relative to temporally extended, entropy-augmented targets, often via multi-step TD(λ) or TD(0), either strictly on-policy or off-policy. Critic targets incorporate an entropy correction of the form

$$y_t = r_t + \gamma\big(Q_{\bar\phi}(s_{t+1}, a_{t+1}) - \alpha \log \pi(a_{t+1} \mid s_{t+1})\big), \qquad a_{t+1} \sim \pi(\cdot \mid s_{t+1}).$$
Multi-step returns and distributional approximations (e.g., histogram-based cross-entropy losses) enable scale-invariant, robust value function learning (Voelcker et al., 15 Jul 2025).
- Non-parametric Policy Improvement (E-step): For sampled states, reweight samples/actions via a Boltzmann/Gibbs distribution over Q-values under a KL-bound constraint:

$$q(a \mid s) \propto \pi_{\mathrm{old}}(a \mid s)\, \exp\big(Q(s,a)/\eta\big).$$

The temperature $\eta$ is solved via the convex dual optimization so that the KL constraint is met (Abdolmaleki et al., 2018); a minimal sketch of this step is shown after this list.
- Parametric Policy Projection (M-step): Fit or update the parametric policy via weighted maximum likelihood or natural gradient steps, subject to an additional relative entropy constraint. In Gaussian policy classes, mean and covariance adaptation is decoupled for stability, analogously to trust-region CMA-ES (Abdolmaleki et al., 2018).
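The following NumPy/SciPy sketch illustrates the E-step reweighting with the temperature obtained from the one-dimensional dual; the array layout (per-state action samples), the sample-based dual expression, and the use of scipy.optimize.minimize are assumptions for exposition.

```python
import numpy as np
from scipy.optimize import minimize

def e_step_weights(q_values, kl_bound):
    """Non-parametric E-step reweighting (illustrative MPO-style sketch).

    q_values: (num_states, num_action_samples) array of Q(s, a_j) for actions
              sampled from the current (old) policy.
    kl_bound: epsilon, the per-update KL budget.
    Returns Boltzmann weights proportional to exp(Q(s,a)/eta) and the
    temperature eta found by minimizing the convex dual of the constraint.
    """
    q_max = q_values.max(axis=1, keepdims=True)  # for numerical stability

    def dual(eta):
        eta = max(float(eta), 1e-6)
        # g(eta) = eta*eps + eta * mean_s log mean_a exp(Q(s,a)/eta)
        log_mean_exp = np.log(np.mean(np.exp((q_values - q_max) / eta), axis=1))
        return eta * kl_bound + eta * np.mean(log_mean_exp + q_max[:, 0] / eta)

    eta = minimize(lambda x: dual(x[0]), x0=[1.0], bounds=[(1e-6, None)]).x[0]

    logits = (q_values - q_max) / eta
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True), eta
```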
On-policy approaches such as REPPO (Voelcker et al., 15 Jul 2025) employ the pathwise policy gradient (a deterministic gradient estimator through the Q-function) for efficient, low-variance actor updates. Off-policy variants, exemplified by Relative Entropy MPO, alternate between sample-based improvement and parametric projection steps.
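A minimal sketch of such a pathwise actor update, assuming a reparameterizable Gaussian policy, a differentiable critic `q_fn`, and fixed multipliers `alpha` and `beta` (all names are illustrative, not REPPO's reference code):

```python
from torch.distributions import kl_divergence

def pathwise_actor_loss(policy, ref_policy, q_fn, states, alpha, beta):
    """Pathwise (reparameterized) actor loss with entropy and KL regularization.

    policy / ref_policy map states to torch.distributions.Normal instances;
    q_fn(states, actions) is a differentiable critic.
    """
    dist = policy(states)
    actions = dist.rsample()            # gradient flows through the action sample
    q_values = q_fn(states, actions)    # differentiate Q along the action path
    entropy = dist.entropy().sum(-1)
    kl = kl_divergence(dist, ref_policy(states)).sum(-1)
    # Maximize Q + alpha*H - beta*KL  <=>  minimize its negation
    return -(q_values + alpha * entropy - beta * kl).mean()
```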
3. Architectural and Stabilization Techniques
Empirical performance and learning stability in relative entropy regularized policy iteration are highly sensitive to architectural choices:
- Critic Loss Functions: HL-Gauss histogram-based cross-entropy targets for per-transition return distributions improve scale-invariance and robustness against value collapse (Voelcker et al., 15 Jul 2025).
- Layer Normalization: Layer norm after each linear layer in both actor and critic networks mitigates covariate shift, improving stability under rapidly changing data distributions (Voelcker et al., 15 Jul 2025).
- Latent Self-Prediction or Auxiliary Losses: Augmenting the critic with predictive auxiliary heads that enforce temporal consistency in internal representations reduces under-learning and collapse, particularly at small batch sizes (Voelcker et al., 15 Jul 2025); a minimal architectural sketch follows this list.
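A small PyTorch sketch combining these three choices (layer norm after each linear layer, a histogram value head, and a latent self-prediction head); layer widths, bin count, and head names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class StabilizedCritic(nn.Module):
    """Critic trunk with LayerNorm after every linear layer, a distributional
    (histogram) value head, and a latent self-prediction auxiliary head."""

    def __init__(self, obs_dim, act_dim, hidden=256, num_bins=51):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(obs_dim + act_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
        )
        # Logits over a fixed return histogram, trained with cross-entropy targets
        self.value_head = nn.Linear(hidden, num_bins)
        # Auxiliary head predicting the next latent state for temporal consistency
        self.latent_head = nn.Linear(hidden, hidden)

    def forward(self, obs, act):
        z = self.trunk(torch.cat([obs, act], dim=-1))
        return self.value_head(z), self.latent_head(z), z
```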
4. Adaptive Regularization and Dual Variable Schedules
The regularization strengths (entropy weight $\alpha$, KL penalty weight $\beta$) are not set statically but are adapted by gradient steps on their log-space parameters, targeting mean constraint satisfaction, schematically:

$$\lambda_\alpha \leftarrow \lambda_\alpha - \kappa\,\big(\mathbb{E}_{s}[\mathcal{H}(\pi_\theta(\cdot \mid s))] - \mathcal{H}_{\min}\big), \qquad \lambda_\beta \leftarrow \lambda_\beta + \kappa\,\big(\mathbb{E}_{s}[D_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\mathrm{ref}})] - \varepsilon_{\mathrm{KL}}\big),$$

with $\alpha = e^{\lambda_\alpha}$, $\beta = e^{\lambda_\beta}$, and dual step size $\kappa$.
This ensures policies remain sufficiently stochastic (for exploration) and prevents excessive divergence from prior or behavior policies, supporting monotonic improvement and robust convergence (Voelcker et al., 15 Jul 2025).
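A minimal sketch of one such dual update step (sign conventions follow the schematic updates above; tensor names and the learning rate are assumptions):

```python
import torch

def update_dual_variables(log_alpha, log_beta, mean_entropy, mean_kl,
                          target_entropy, kl_bound, lr=1e-3):
    """One dual gradient step on the log-space multipliers (illustrative sketch).

    mean_entropy / mean_kl: detached scalar tensors of batch statistics.
    alpha rises when policy entropy drops below its target; beta rises when
    the KL to the reference policy exceeds its budget.
    """
    with torch.no_grad():
        log_alpha -= lr * (mean_entropy - target_entropy)
        log_beta += lr * (mean_kl - kl_bound)
    return log_alpha.exp(), log_beta.exp()
```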
In continuous-time and infinite-horizon stochastic control (e.g., entropy-regularized HJB settings), policy iteration with entropy or KL penalties enjoys provable convergence and regularization-induced uniqueness of optimal policies under mild structural assumptions (Huang et al., 2022, Tran et al., 2 Jun 2024, Ma et al., 16 Jun 2024).
5. Theoretical Guarantees and Convergence Analysis
Relative entropy regularized policy iteration admits rigorous theoretical guarantees in both discrete and continuous settings:
- Linear or Quadratic Convergence: Under standard assumptions and smooth, strongly convex regularizers, relative entropy regularized policy iteration is strictly equivalent to Newton-Raphson iteration on the regularized Bellman equations, with a global linear rate given by the discount factor γ and local quadratic convergence in a neighborhood of the solution (Li et al., 2023).
- Soft Bellman Operators and Contractions: The "soft" (entropy-regularized) Bellman operator is a γ-contraction up to an additive error proportional to the regularization strength τ, yielding uniform convergence bounds of the form

$$\|V_k - V^{*}\|_\infty \le \gamma^{k}\,\|V_0 - V^{*}\|_\infty + \frac{\tau \log |\mathcal{A}|}{1-\gamma},$$

with decreasing temperature schedules recovering unregularized optimal policies (Smirnova et al., 2019); a small numerical illustration of the soft backup follows this list.
- Continuous Control and HJB Equations: Policy iteration for entropy-regularized stochastic control problems converges to the unique (smooth) solution of the so-called exploratory Hamilton–Jacobi–Bellman equation, with convergence rates that can be super-exponential in the large-discount regime. Gibbs-form policy updates and uniform a priori bounds for the value function are established under standard regularity and nondegeneracy conditions (Huang et al., 2022, Tran et al., 2 Jun 2024, Ma et al., 16 Jun 2024).
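For intuition on the contraction claim above, the following NumPy sketch iterates a soft (log-sum-exp) Bellman backup on a small randomly generated tabular MDP; the MDP, sizes, and temperature are arbitrary assumptions.

```python
import numpy as np
from scipy.special import logsumexp

def soft_bellman_backup(V, P, R, gamma, tau):
    """Entropy-regularized ("soft") Bellman backup on a tabular MDP.

    P: transitions of shape (S, A, S); R: rewards of shape (S, A); tau: temperature.
    The log-sum-exp replaces the hard max; tau -> 0 recovers standard value iteration.
    """
    Q = R + gamma * P @ V                       # action values, shape (S, A)
    return tau * logsumexp(Q / tau, axis=1)     # soft state values, shape (S,)

# Toy check: successive iterates contract at roughly the discount rate gamma.
rng = np.random.default_rng(0)
S, A, gamma, tau = 5, 3, 0.9, 0.1
P = rng.random((S, A, S)); P /= P.sum(axis=-1, keepdims=True)
R = rng.random((S, A))
V = np.zeros(S)
for k in range(100):
    V_new = soft_bellman_backup(V, P, R, gamma, tau)
    if k % 25 == 0:
        print(k, float(np.max(np.abs(V_new - V))))
    V = V_new
```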
6. Empirical Results and Benchmarks
Relative entropy regularized policy iteration algorithms have been evaluated across high-dimensional continuous control benchmarks. Key empirical findings include:
| Algorithm | Policy Update Type | Replay Buffer | KL/Entropy Scheduling | Sample Efficiency | Stability | Memory |
|---|---|---|---|---|---|---|
| REPPO (Voelcker et al., 15 Jul 2025) | On-policy, Pathwise | On-policy buffer (~1e5) | Adaptive | 2–3× PPO | ~80% | 100–1000× lower than SAC/FastTD3 |
| MPO/RERPI (Abdolmaleki et al., 2018) | Off-policy, E/M steps | Large replay buffer (2M) | Fixed | State of the art on 31 tasks | High | Larger |
REPPO (Relative Entropy Pathwise Policy Optimization) achieves strong sample efficiency, high stability (measured as the fraction of runs reaching at least 90% of final performance), and significant reductions in memory and hyperparameter tuning requirements. Ablation studies consistently identify the KL regularizer and the robust critic objective as critical to performance; without them, learning stability and asymptotic returns degrade significantly (Voelcker et al., 15 Jul 2025).
Regularized policy iteration frameworks are also foundational in Nash equilibrium computation for entropy-regularized general-sum linear-quadratic games, where uniqueness and contractivity can be established for all Nash equilibria within the class of linear-Gaussian policies, and linear convergence of policy optimization algorithms is guaranteed when the regularizer exceeds an explicit threshold (Zaman et al., 25 Mar 2024).
7. Connections to Broader RL and Stochastic Control Paradigms
Relative entropy regularization unifies a spectrum of RL algorithms:
- Trust region methods (TRPO, PPO): employ KL-constraints or penalties to ensure local policy improvement (Belousov et al., 2019, Roostaie et al., 2021).
- Maximum a Posteriori Policy Optimization (MPO): instantiates E-step/M-step updates derived from an EM/inference framework, explicitly parameterizing the policy as a softmax over Q-values with KL control (Abdolmaleki et al., 2018).
- Soft Actor-Critic (SAC), Soft Q-Learning: can be viewed as special cases of entropy-regularized policy iteration with fixed regularization strengths, with theoretical insight suggesting improved performance when decreasing the regularization parameter as training progresses (Smirnova et al., 2019).
In the continuous-time stochastic control literature, entropy-regularized policy iteration guarantees regularization-induced well-posedness of the associated HJB equations and enables robust policies even in highly degenerate or nonconvex reward landscapes (Huang et al., 2022, Tran et al., 2 Jun 2024, Ma et al., 16 Jun 2024).
Relative entropy regularized policy iteration thus fundamentally connects probabilistic inference, black-box optimization, actor-critic learning, and convex-analytic viewpoints, forming a technical backbone for state-of-the-art exploration, stability, and robustness guarantees in modern RL and control.