
Relative Entropy Regularized Policy Iteration

Updated 15 December 2025
  • Relative Entropy Regularization is a technique in RL that uses KL divergence penalties to control policy updates and balance exploration and exploitation.
  • It integrates policy evaluation, non-parametric improvement, and parametric projection steps to ensure stable and efficient convergence in high-dimensional environments.
  • Implementations like REPPO and MPO demonstrate enhanced sample efficiency and robust theoretical guarantees by adapting regularization strengths during learning.

Relative entropy regularized policy iteration (RERPI) refers to a class of algorithms and theoretical frameworks for reinforcement learning (RL) and stochastic control in which policy update steps are constrained or penalized by the Kullback-Leibler (KL) divergence between the new and old policy, or the current policy and a fixed prior. This approach is central to many modern RL algorithms, including on-policy and off-policy actor-critic methods, entropy-regularized policy iteration for continuous-time ergodic control, and robust trust-region optimization. Relative entropy regularization provides control over the exploration-exploitation trade-off, improves stability, and enables strong theoretical guarantees on convergence and robustness in highly stochastic or high-dimensional environments.

1. Fundamental Formulation and Learning Objectives

Relative entropy regularization modifies the classic Bellman or policy iteration schemes by introducing a KL divergence penalty that constrains the divergence between consecutive policies. The general learning objective for a stochastic policy $\pi_\theta$ is to maximize a regularized expected return:

$$J_\mathrm{ME}(\pi_\theta) = \mathbb{E}_{\tau\sim\pi_\theta} \Bigl[ \sum_{t=0}^\infty \gamma^t \bigl( r(x_t,a_t) + \alpha H[\pi_\theta(\cdot|x_t)] \bigr) \Bigr]$$

subject to a KL constraint limiting the deviation from a reference policy $\pi_{\theta'}$,

$$\mathbb{E}_{x\sim\rho_{\pi_{\theta'}}}\Bigl[ D_\mathrm{KL}\bigl(\pi_\theta(\cdot|x) \,\|\, \pi_{\theta'}(\cdot|x)\bigr) \Bigr] \leq \mathrm{KL}_{\rm target},$$

and a minimum-entropy constraint,

$$\mathbb{E}_{x\sim\rho_{\pi_{\theta'}}} \Bigl[ H\bigl(\pi_\theta(\cdot|x)\bigr) \Bigr] \geq H_{\rm target}.$$
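As a point of reference, the maximum-entropy return in the objective above can be estimated from a single sampled trajectory, as in the short NumPy sketch below; the array names are illustrative, and the per-state entropies are assumed to have been computed from the policy beforehand.

```python
import numpy as np

def entropy_augmented_return(rewards, entropies, gamma, alpha):
    # Monte Carlo estimate of sum_t gamma^t * ( r(x_t, a_t) + alpha * H[pi_theta(.|x_t)] )
    # along one sampled trajectory.
    discounts = gamma ** np.arange(len(rewards))
    return np.sum(discounts * (np.asarray(rewards) + alpha * np.asarray(entropies)))

# Hypothetical 3-step trajectory
print(entropy_augmented_return(rewards=[1.0, 0.5, 2.0], entropies=[1.2, 1.1, 0.9],
                               gamma=0.99, alpha=0.1))
```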

The constrained primal problem can be equivalently transformed into an unconstrained penalized optimization with log-space Lagrange multipliers $\alpha$ (for entropy) and $\beta$ (for KL):

$$L_\pi(\theta;B) = \frac{1}{|B|} \sum_{(x_i,a_i)\in B} \Bigl[ -Q_\phi(x_i,a_i') + e^\alpha \log \pi_\theta(a_i'|x_i) + e^\beta D_\mathrm{KL}\bigl(\pi_\theta(\cdot|x_i) \,\|\, \pi_{\theta'}(\cdot|x_i)\bigr) \Bigr],$$

where $a_i' \sim \pi_\theta(\cdot|x_i)$ and $B$ is a minibatch collected under the behavior policy $\pi_{\theta'}$ (Voelcker et al., 15 Jul 2025).
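A minimal NumPy sketch of this penalized minibatch loss for a diagonal Gaussian policy is given below; the array names, the stand-in Q-values, and the closed-form Gaussian KL are illustrative assumptions, and a real implementation would differentiate this loss through the policy parameters with an autodiff framework.

```python
import numpy as np

def diag_gauss_log_prob(a, mean, std):
    # log N(a; mean, diag(std^2)), summed over action dimensions
    return -0.5 * np.sum(((a - mean) / std) ** 2 + 2.0 * np.log(std) + np.log(2.0 * np.pi), axis=-1)

def diag_gauss_kl(mean_new, std_new, mean_old, std_old):
    # KL(pi_theta || pi_theta') between diagonal Gaussians, summed over action dimensions
    return np.sum(np.log(std_old / std_new)
                  + (std_new ** 2 + (mean_new - mean_old) ** 2) / (2.0 * std_old ** 2) - 0.5, axis=-1)

def penalized_policy_loss(q_vals, log_probs, kls, log_alpha, log_beta):
    # L_pi(theta; B): minibatch mean of  -Q_phi(x, a') + e^alpha * log pi_theta(a'|x)
    #                                    + e^beta * KL(pi_theta(.|x) || pi_theta'(.|x))
    return np.mean(-q_vals + np.exp(log_alpha) * log_probs + np.exp(log_beta) * kls)

# Hypothetical minibatch of 4 states with 2-D actions sampled from the current policy
rng = np.random.default_rng(0)
mean_new, std_new = rng.normal(size=(4, 2)), np.full((4, 2), 0.5)
mean_old, std_old = mean_new + 0.1, np.full((4, 2), 0.6)
actions = mean_new + std_new * rng.normal(size=(4, 2))
q_vals = rng.normal(size=4)  # stand-in for the critic values Q_phi(x_i, a_i')
loss = penalized_policy_loss(q_vals,
                             diag_gauss_log_prob(actions, mean_new, std_new),
                             diag_gauss_kl(mean_new, std_new, mean_old, std_old),
                             log_alpha=-2.0, log_beta=0.0)
```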

2. Algorithmic Structures and Policy Update Mechanisms

The standard RERPI workflow is structured in three main steps, which admit multiple algorithmic instantiations:

  1. Policy Evaluation: Update the critic $Q_\phi$ by minimizing a loss relative to temporally extended, entropy-augmented targets, often via multi-step TD(λ) or TD(0), strictly on-policy or off-policy. Critic targets incorporate an entropy correction:

$$\tilde{r}_t = r_t - \alpha \log \pi_{\theta'}(a_t|x_t)$$

Multi-step returns and distributional approximations (e.g., histogram-based cross-entropy losses) enable scale-invariant, robust value function learning (Voelcker et al., 15 Jul 2025).

  2. Non-parametric Policy Improvement (E-step): For sampled states, reweight sampled actions via a Boltzmann/Gibbs distribution over $Q$-values under a KL-bound constraint:

$$q^*(a|x) \propto \pi_\mathrm{old}(a|x)\, \exp\bigl(Q(x,a)/\eta\bigr)$$

The temperature $\eta$ is obtained by solving the dual optimization so that the KL constraint is met (Abdolmaleki et al., 2018).

  3. Parametric Policy Projection (M-step): Fit or update the parametric policy $\pi_\theta$ via weighted maximum likelihood or natural gradient steps, subject to an additional relative entropy constraint. In Gaussian policy classes, mean and covariance adaptation is decoupled for stability, analogously to trust-region CMA-ES (Abdolmaleki et al., 2018). A NumPy sketch of the E- and M-steps follows this list.
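The sketch below illustrates the E-step reweighting and the weighted maximum-likelihood M-step loss in NumPy; the grid search over the temperature $\eta$ is a crude stand-in for the proper dual optimization, and all function and variable names are illustrative rather than taken from a specific codebase.

```python
import numpy as np

def e_step_weights(q_values, eta):
    # Non-parametric improvement: q*(a|x) ∝ pi_old(a|x) * exp(Q(x,a)/eta).
    # q_values has shape (n_states, n_sampled_actions); weights are normalized per state.
    shifted = (q_values - q_values.max(axis=-1, keepdims=True)) / eta  # numerical stability
    w = np.exp(shifted)
    return w / w.sum(axis=-1, keepdims=True)

def dual_temperature(q_values, kl_bound, candidates=np.logspace(-3, 2, 200)):
    # Crude grid search on the dual g(eta) = eta*eps + eta * E_x[ log E_a exp(Q(x,a)/eta) ],
    # whose minimizer keeps the reweighted distribution within the KL bound eps.
    m = q_values.max(axis=-1, keepdims=True)
    best_eta, best_val = None, np.inf
    for eta in candidates:
        log_mean_exp = (m / eta)[:, 0] + np.log(np.mean(np.exp((q_values - m) / eta), axis=-1))
        val = eta * kl_bound + eta * np.mean(log_mean_exp)
        if val < best_val:
            best_eta, best_val = eta, val
    return best_eta

def m_step_weighted_mle_loss(weights, log_probs_new):
    # Parametric projection: fit pi_theta by weighted maximum likelihood, i.e. maximize
    # sum_i w_i * log pi_theta(a_i|x_i); the additional KL trust region on pi_theta
    # relative to pi_old is omitted in this sketch.
    return -np.sum(weights * log_probs_new)

# Hypothetical Q-values for 8 states with 16 sampled actions each
q = np.random.default_rng(0).normal(size=(8, 16))
eta = dual_temperature(q, kl_bound=0.1)
weights = e_step_weights(q, eta)
```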

On-policy approaches such as REPPO (Voelcker et al., 15 Jul 2025) employ the pathwise policy gradient (deterministic gradient estimator via the Q-function) for efficient, low-variance actor updates. Off-policy variants, exemplified by Relative Entropy MPO, alternate sample-based improvement and parametric projection steps.
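As an illustration of the pathwise estimator, the PyTorch sketch below differentiates a critic's value with respect to the actor parameters through a reparameterized action sample; the network architectures are placeholders, and the entropy and KL terms of the full objective from Section 1 are omitted for brevity.

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2
# Placeholder actor (outputs mean and log-std) and critic networks
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, 2 * act_dim))
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.Tanh(), nn.Linear(64, 1))

obs = torch.randn(32, obs_dim)                        # minibatch of states
mean, log_std = actor(obs).chunk(2, dim=-1)
dist = torch.distributions.Normal(mean, log_std.exp())
action = dist.rsample()                               # reparameterized sample: gradient flows through it
q = critic(torch.cat([obs, action], dim=-1))

# Pathwise actor loss: push sampled actions toward higher Q. In practice only the actor
# optimizer consumes these gradients; the critic is trained separately from TD targets.
actor_loss = -q.mean()
actor_loss.backward()
```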

3. Architectural and Stabilization Techniques

Empirical performance and learning stability in relative entropy regularized policy iteration are highly sensitive to architectural choices:

  • Critic Loss Functions: HL-Gauss histogram-based cross-entropy targets for per-transition return distributions improve scale invariance and robustness against value collapse (Voelcker et al., 15 Jul 2025); a sketch of this target construction follows the list.
  • Layer Normalization: Layer norm after each linear layer in both actor and critic networks mitigates covariate shift, improving stability under rapidly changing data distributions (Voelcker et al., 15 Jul 2025).
  • Latent Self-Prediction or Auxiliary Losses: Augmenting the critic with predictive auxiliary heads that enforce temporal consistency in internal representations reduces under-learning and collapse, particularly at small batch sizes (Voelcker et al., 15 Jul 2025).
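The sketch below shows one common way to build such a histogram target: a scalar bootstrap target, here formed with the entropy-corrected reward from Section 2, is smoothed by a Gaussian, and the mass between consecutive bin edges becomes the categorical label for a cross-entropy critic loss. The bin range, smoothing width, and variable names are arbitrary choices for the example.

```python
import numpy as np
from math import erf, sqrt

def hl_gauss_target(scalar_targets, bin_edges, sigma):
    # Smooth each scalar target with N(target, sigma^2) and assign each bin the probability
    # mass between its edges; the critic is trained with cross-entropy against this distribution.
    y = np.asarray(scalar_targets)[:, None]               # (batch, 1)
    edges = np.asarray(bin_edges)[None, :]                # (1, n_bins + 1)
    cdf = 0.5 * (1.0 + np.vectorize(erf)((edges - y) / (sigma * sqrt(2.0))))
    probs = cdf[:, 1:] - cdf[:, :-1]                      # (batch, n_bins)
    return probs / probs.sum(axis=-1, keepdims=True)      # renormalize mass clipped at the range ends

# Entropy-corrected one-step bootstrap target: r_t - alpha * log pi(a_t|x_t) + gamma * V(x_{t+1})
rewards, log_probs, next_values = np.array([1.0, -0.3]), np.array([-1.4, -0.9]), np.array([5.0, 4.2])
targets = rewards - 0.1 * log_probs + 0.99 * next_values

edges = np.linspace(-10.0, 10.0, 52)                      # 51 bins over an assumed return range
target_dist = hl_gauss_target(targets, edges, sigma=0.75)
```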

4. Adaptive Regularization and Dual Variable Schedules

The regularization strengths (entropy weight $\alpha$, KL penalty weight $\beta$) are not statically set but are adapted by gradient steps targeting mean constraint satisfaction:

$$\alpha \leftarrow \alpha - \eta_\alpha \,\nabla_\alpha\, e^\alpha \bigl[\mathbb{E}_x H(\pi_\theta(\cdot|x)) - H_{\rm target}\bigr]$$

$$\beta \leftarrow \beta - \eta_\beta \,\nabla_\beta\, e^\beta \bigl[\mathbb{E}_x D_{\rm KL}(\pi_\theta\,\|\,\pi_{\theta'}) - {\rm KL}_{\rm target}\bigr]$$

This ensures policies remain sufficiently stochastic (for exploration) and prevents excessive divergence from prior or behavior policies, supporting monotonic improvement and robust convergence (Voelcker et al., 15 Jul 2025).
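A direct transcription of these two update rules is sketched below, assuming the mean entropy and mean KL of the current minibatch have already been estimated and treating the bracketed constraint gaps as constants (stop-gradient); the learning rates are illustrative.

```python
import numpy as np

def update_dual_variables(log_alpha, log_beta, mean_entropy, mean_kl,
                          entropy_target, kl_target, lr_alpha=1e-3, lr_beta=1e-3):
    # With the constraint gaps held fixed, d/d(alpha) [ e^alpha * gap ] = e^alpha * gap,
    # and likewise for beta, so each log-space multiplier takes a simple scalar gradient step.
    grad_alpha = np.exp(log_alpha) * (mean_entropy - entropy_target)
    grad_beta = np.exp(log_beta) * (mean_kl - kl_target)
    return log_alpha - lr_alpha * grad_alpha, log_beta - lr_beta * grad_beta

# Example step with hypothetical minibatch statistics
log_alpha, log_beta = update_dual_variables(log_alpha=-2.0, log_beta=0.0,
                                            mean_entropy=1.4, mean_kl=0.02,
                                            entropy_target=1.0, kl_target=0.05)
```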

In continuous-time and infinite-horizon stochastic control (e.g., entropy-regularized HJB settings), policy iteration with entropy or KL penalties enjoys provable convergence and regularization-induced uniqueness of optimal policies under mild structural assumptions (Huang et al., 2022, Tran et al., 2 Jun 2024, Ma et al., 16 Jun 2024).

5. Theoretical Guarantees and Convergence Analysis

Relative entropy regularized policy iteration admits rigorous theoretical guarantees in both discrete and continuous settings:

  • Linear or Quadratic Convergence: Under standard assumptions and smooth strongly convex regularizers, relative entropy regularized policy iteration is strictly equivalent to Newton-Raphson iteration on the regularized Bellman equations, with a global linear rate $\gamma$ (the discount factor) and local quadratic convergence in a neighborhood of the solution (Li et al., 2023).
  • Soft Bellman Operators and Contractions: The "soft" (entropy-regularized) Bellman operator is a $\gamma$-contraction up to an additive error proportional to the regularization strength, yielding uniform convergence bounds:

$$\|V_N - V^*\|_\infty \leq \frac{2}{1-\gamma}\left( \sum_{t=1}^{N-1}\gamma^{N-t}\,\tau_t + \gamma^N\|V_0 - V^*\|_\infty\right)$$

with temperature schedules $\tau_t \to 0$ recovering unregularized optimal policies (Smirnova et al., 2019); a small tabular illustration follows this list.

  • Continuous Control and HJB Equations: Policy iteration for entropy-regularized stochastic control problems converges to the unique (smooth) solution of the so-called exploratory Hamilton–Jacobi–Bellman equation, with convergence rates that can be super-exponential in the large-discount regime. Gibbs-form policy updates and uniform $C^2$-bounds for the value function are established under standard regularity and nondegeneracy conditions (Huang et al., 2022, Tran et al., 2 Jun 2024, Ma et al., 16 Jun 2024).
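The contraction behavior can be illustrated on a small random tabular MDP: the NumPy sketch below runs value iteration with the standard log-sum-exp (soft) Bellman operator and a decaying temperature schedule $\tau_t = 1/t$; the MDP and the schedule are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random tabular MDP: transition tensor P[s, a, s'] and reward matrix R[s, a]
P = rng.random((n_states, n_actions, n_states))
P /= P.sum(axis=-1, keepdims=True)
R = rng.random((n_states, n_actions))

def soft_bellman(V, tau):
    # (T_tau V)(s) = tau * log sum_a exp( (R(s,a) + gamma * E_{s'|s,a}[V(s')]) / tau ),
    # which approaches the standard max-Bellman operator as tau -> 0.
    Q = R + gamma * (P @ V)                    # shape (n_states, n_actions)
    m = Q.max(axis=-1, keepdims=True)          # stabilize the log-sum-exp
    return (m + tau * np.log(np.exp((Q - m) / tau).sum(axis=-1, keepdims=True)))[:, 0]

V = np.zeros(n_states)
for t in range(1, 200):
    V = soft_bellman(V, tau=1.0 / t)           # decaying temperature schedule tau_t -> 0
```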

6. Empirical Results and Benchmarks

Relative entropy regularized policy iteration algorithms have been evaluated across high-dimensional continuous control benchmarks. Key empirical findings include:

| Algorithm | Policy Update Type | Replay Buffer | KL/Entropy Scheduling | Sample Efficiency | Stability | Memory |
|---|---|---|---|---|---|---|
| REPPO (Voelcker et al., 15 Jul 2025) | On-policy, pathwise | On-policy buffer (~1e5) | Adaptive | 2–3× PPO | ~80% | 100–1000× lower than SAC/FastTD3 |
| MPO/RERPI (Abdolmaleki et al., 2018) | Off-policy, E/M steps | Large replay buffer (2M) | Fixed | State of the art on 31 tasks | High | Larger |

REPPO (Relative Entropy Pathwise Policy Optimization) achieves strong sample efficiency, high stability (measured as the fraction of runs reaching at least 90% of final performance), and significant reductions in memory and hyperparameter-tuning requirements. Ablation studies consistently identify the KL regularizer and the robust critic objective as critical to performance; without them, learning stability and asymptotic returns degrade significantly (Voelcker et al., 15 Jul 2025).

Regularized policy iteration frameworks are also foundational in Nash equilibrium computation for entropy-regularized general-sum linear-quadratic games, where uniqueness and contractivity can be established for all Nash equilibria within the class of linear-Gaussian policies, and linear convergence of policy optimization algorithms is guaranteed when the regularizer exceeds an explicit threshold (Zaman et al., 25 Mar 2024).

7. Connections to Broader RL and Stochastic Control Paradigms

Relative entropy regularization unifies a spectrum of RL algorithms:

  • Trust region methods (TRPO, PPO): employ KL-constraints or penalties to ensure local policy improvement (Belousov et al., 2019, Roostaie et al., 2021).
  • Maximum a Posteriori Policy Optimization (MPO): instantiates E-step/M-step updates derived from an EM/inference framework, explicitly parameterizing the policy as a softmax over Q-values with KL control (Abdolmaleki et al., 2018).
  • Soft Actor-Critic (SAC), Soft Q-Learning: can be viewed as special cases of entropy-regularized policy iteration with fixed regularization strengths, with theoretical insight suggesting improved performance when decreasing the regularization parameter as training progresses (Smirnova et al., 2019).

In the continuous-time stochastic control literature, entropy-regularized policy iteration guarantees regularization-induced well-posedness of the associated HJB equations and enables robust policies even in highly degenerate or nonconvex reward landscapes (Huang et al., 2022, Tran et al., 2 Jun 2024, Ma et al., 16 Jun 2024).

Relative entropy regularized policy iteration thus fundamentally connects probabilistic inference, black-box optimization, actor-critic learning, and convex-analytic viewpoints, forming a technical backbone for state-of-the-art exploration, stability, and robustness guarantees in modern RL and control.
