Relative Entropy Regularized Policy Iteration

Updated 28 February 2026

Relative Entropy Regularized Policy Iteration is a family of algorithms that solve Markov decision processes by adding a KL divergence penalty to promote smooth, exploratory policies.
It unifies concepts from classical control, stochastic calculus, and modern reinforcement learning, offering robust convergence and empirical stability.
The method alternates between entropy-regularized Bellman evaluation and softmax-based policy improvement, applicable in both continuous and discrete settings.

Relative Entropy Regularized Policy Iteration (RERPI) is a family of algorithms for solving Markov decision processes (MDPs) and entropy-regularized stochastic control problems in both finite and continuous spaces by extending classical policy iteration with a relative-entropy (Kullback–Leibler divergence) penalty. The framework provides theoretical regularity, improved empirical stability, and contraction properties, as well as explicit connections with variational (softmax) exploration and robust convergence guarantees. RERPI unifies concepts emerging from classical control, stochastic calculus, and contemporary reinforcement learning.

1. Mathematical Formulation and Entropy-Regularization

Relative entropy regularization introduces an entropic penalty, typically the KL divergence of the candidate policy relative to a reference distribution or to Lebesgue measure, to enforce smoothness and promote exploration. In the infinite-horizon discounted setting on $\mathbb{R}^d$ with compact action space $U$, the system evolves as

$dX_{t} = \int_U b(X_t,u)\pi_t(du)dt + \int_U \sigma(X_t,u)\pi_t(du)dW_t,$

where $\pi_t$ is a relaxed control (a probability density on $U$). The entropy-regularized value functional is

$V^{\pi}(x) = \mathbb{E}_x\Bigg[ \int_0^{\infty} e^{-pt} \Big\{ \int_U r(X_t,u)\pi_t(u)du - \int_U \ln(\pi_t(u))\pi_t(u)du \Big\} dt \Bigg],$

for discount rate $p>0$ and reward $r$. The value function of the optimal policy satisfies the exploratory Hamilton–Jacobi–Bellman (HJB) equation (Tran et al., 2024, Huang et al., 2022):

$p\,v(x) - \sup_{\pi \in \mathcal{P}(U)} \int_U \left[ r(x,u) + b(x,u)\cdot Dv(x) + \frac{1}{2}\mathrm{tr}(\sigma(x,u)\sigma(x,u)^T D^2v(x)) - \ln \pi(u) \right] \pi(u) du = 0.$

The optimal relaxed control attains the Gibbs (softmax) form:

$\pi^*(x)(u) = \frac{ \exp \left( r(x,u) + b(x,u)\cdot Dv(x) + \tfrac{1}{2} \mathrm{tr}(\sigma\sigma^T(x,u) D^2v(x)) \right) } {\int_U \exp \left( r(x,u') + b(x,u')\cdot Dv(x) + \tfrac{1}{2} \mathrm{tr}(\sigma\sigma^T(x,u') D^2v(x)) \right) du' }$

This is the continuous-space analogue of the “softmax” policy in discrete-space RERPI (Neu et al., 2017, Abdolmaleki et al., 2018).
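
The Gibbs form can be checked with a short variational argument. The following is a standard sketch rather than the derivation used in the cited works; $q(u)$ and $Z$ are shorthand introduced here, with $q(u) := r(x,u) + b(x,u)\cdot Dv(x) + \tfrac{1}{2}\mathrm{tr}(\sigma\sigma^T(x,u) D^2v(x))$ and $Z := \int_U e^{q(u')}du'$. For any probability density $\pi$ on $U$ and $\pi^* = e^{q}/Z$,

$\int_U \left[ q(u) - \ln \pi(u) \right] \pi(u) du = \ln Z - \int_U \pi(u) \ln \frac{\pi(u)}{\pi^*(u)} du \leq \ln Z,$

because the subtracted term is the KL divergence of $\pi$ from $\pi^*$, which is nonnegative and vanishes only when $\pi = \pi^*$. Under this sketch the supremum in the exploratory HJB equation equals $\ln Z$, so the equation can be written equivalently as

$p\,v(x) - \ln \int_U \exp \left( r(x,u) + b(x,u)\cdot Dv(x) + \tfrac{1}{2} \mathrm{tr}(\sigma\sigma^T(x,u) D^2v(x)) \right) du = 0.$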

2. Policy Iteration Algorithmic Structures

RERPI alternates between two principal steps: policy evaluation and policy improvement.

Policy Evaluation: For policy $\pi^n$, solve the linear (entropy-regularized) Bellman equation for $v^n$:

$p\,v^n(x) - \int_U \Big[ r(x,u) + b(x,u)\cdot Dv^n(x) + \frac{1}{2} \mathrm{tr}(\sigma\sigma^T(x,u) D^2 v^n(x)) - \ln \pi^n(x,u) \Big ] \pi^n(x,u) du = 0.$

Policy Improvement: The new policy is computed via a softmax update:

$\pi^{n+1}(x, u) \propto \exp \left( r(x,u) + b(x,u)\cdot Dv^n(x) + \frac{1}{2}\mathrm{tr}(\sigma\sigma^T(x,u) D^2 v^n(x)) \right).$
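
The two steps above can be sketched numerically in one dimension. The example below is a minimal illustration under assumptions that are not in the source: a hypothetical model with $b(x,u)=u$, constant $\sigma=1$, reward $r(x,u) = -x^2 - u^2/2$, action set $U=[-1,1]$, discount $p=2$, a truncated state domain $[-2,2]$ with a reflecting (Neumann) boundary, central finite differences, and trapezoid quadrature over $U$.

```python
import numpy as np

# Hypothetical 1D model (all choices are assumptions of this sketch):
# b(x,u) = u, sigma = 1 (independent of u), r(x,u) = -x^2 - u^2/2.
p, sigma = 2.0, 1.0                                    # discount p and diffusion
x = np.linspace(-2.0, 2.0, 81)                         # truncated state grid
h = x[1] - x[0]
u = np.linspace(-1.0, 1.0, 41)                         # action grid on U = [-1, 1]
w = np.full_like(u, u[1] - u[0])
w[[0, -1]] *= 0.5                                      # trapezoid weights on U
r = -x[:, None] ** 2 - 0.5 * u[None, :] ** 2           # r(x_i, u_j)

def grad(v):
    """Central difference for Dv; zero at both ends (Neumann boundary)."""
    g = np.zeros_like(v)
    g[1:-1] = (v[2:] - v[:-2]) / (2 * h)
    return g

def evaluate(pi):
    """Policy evaluation: solve p*v - bbar*v' - 0.5*sigma^2*v'' = rbar + entropy."""
    bbar = (pi * u) @ w                                # drift averaged over pi
    rhs = (pi * (r - np.log(pi))) @ w                  # averaged reward plus entropy
    n = x.size
    d = 0.5 * sigma**2 / h**2
    A = np.zeros((n, n))
    for i in range(n):
        A[i, i] = p + 2 * d
        if 0 < i < n - 1:
            A[i, i - 1] = -d + bbar[i] / (2 * h)
            A[i, i + 1] = -d - bbar[i] / (2 * h)
        elif i == 0:
            A[i, 1] = -2 * d                           # ghost node reflected at left end
        else:
            A[i, n - 2] = -2 * d                       # ghost node reflected at right end
    return np.linalg.solve(A, rhs)

def improve(v):
    """Policy improvement: pi(x,u) proportional to exp(r(x,u) + b(x,u) * Dv(x))."""
    q = r + u[None, :] * grad(v)[:, None]
    e = np.exp(q - q.max(axis=1, keepdims=True))       # shift for numerical stability
    return e / (e @ w)[:, None]                        # normalize with quadrature weights

pi = np.full((x.size, u.size), 0.5)                    # uniform density on [-1, 1]
values = []
for _ in range(8):
    v = evaluate(pi)
    values.append(v)
    pi = improve(v)
```

Because $\sigma$ does not depend on $u$ in this toy model, the second-order term cancels in the normalization and only the drift term couples the policy to $Dv$. The mesh is chosen fine enough that the off-diagonal entries of the evaluation matrix are nonpositive, which keeps the discrete iteration monotone.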

This structure generalizes immediately to the tabular (finite-state-action) setting and to off-policy actor–critic implementations in RL. In finite MDPs, an analogous iteration alternates between “soft” Bellman evaluation (possibly approximate) and relative-entropy-based policy improvement, with explicit connection to mirror descent and dual averaging (Neu et al., 2017, Smirnova et al., 2019).
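
For the tabular case, the alternation can be sketched as follows. This is a minimal illustration rather than the algorithm of any cited paper: the discount factor `gamma`, the entropy weight `tau`, the array layout of `P` and `r`, and the random test MDP are all assumptions introduced here.

```python
import numpy as np

def soft_policy_iteration(P, r, gamma=0.9, tau=0.5, n_iters=200):
    """Entropy-regularized policy iteration on a finite MDP (illustrative sketch).

    P: (S, A, S) transition probabilities, r: (S, A) rewards,
    gamma: discount factor, tau: entropy weight (names are assumptions).
    """
    S, A = r.shape
    pi = np.full((S, A), 1.0 / A)                      # start from the uniform policy
    history = []
    for _ in range(n_iters):
        # Policy evaluation: solve (I - gamma * P_pi) v = sum_a pi * (r - tau * log pi).
        P_pi = np.einsum("sa,sat->st", pi, P)
        r_pi = np.sum(pi * (r - tau * np.log(pi)), axis=1)
        v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
        history.append(v)
        # Policy improvement: softmax of the regularized Q-values.
        q = r + gamma * P @ v                          # shape (S, A)
        z = (q - q.max(axis=1, keepdims=True)) / tau   # shift for numerical stability
        pi = np.exp(z)
        pi /= pi.sum(axis=1, keepdims=True)
    return pi, v, history

# Small random MDP used only for illustration.
rng = np.random.default_rng(0)
S, A = 6, 3
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
pi, v, history = soft_policy_iteration(P, r)
```

The returned `v` is the evaluation of the policy from the last completed evaluation step, and `history` holds the successive value iterates, which can be inspected to see the monotone improvement.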

3. Theoretical Properties: Regularity and Convergence

RERPI enjoys strong convergence and stability properties across various regimes:

Bounded Coefficients and Controlled Diffusion: Under $\mathcal{C}^{0,\alpha}$-bounded coefficients $b$, $r$, and $\sigma$ and small control influence on $\sigma$, one establishes a uniform $\mathcal{C}^{2,\alpha}$ bound on the value iterates $v^n$, implying local convergence to the unique solution $v^*$ of the exploratory HJB (Tran et al., 2024):

$\|v^n\|_{C^{2,\alpha}(\mathbb{R}^d)} \leq A_1,\quad \text{for all } n.$

Quantitative Convergence: When $\sigma$ is independent of $u$, the error $e_n := v^n - v^*$ decays exponentially in $n$ up to a term of order $p^{-1}$, with rate estimates such as

$\int_{B_1} |D e_n|^2 = O(2^{-n} + p^{-1}).$

Unbounded Coefficients: If $b$ and $r$ have polynomial or linear growth and $\sigma$ is independent of $u$, the value functions maintain locally uniform $C^{1,\alpha}$ regularity, ensuring convergence $v^n \to v^*$ (Tran et al., 2024).
Numerical and Functional Analytical Foundations: Key technical difficulties arise from entropy terms growing only logarithmically in $|Dv|$. Advanced Sobolev and Hölder space estimates, including new interior and entropy-control lemmas, are employed to establish uniform bounds and the compactness needed to pass to limits (Huang et al., 2022).
Discrete MDPs and RL: In finite MDPs, RERPI operates as a contraction mapping in the sup-norm, with explicit rates depending on how fast the regularization decays. Exact RERPI converges globally for a fixed entropy parameter. As the regularization weight tends to zero (equivalently, as the inverse temperature increases), the solution approaches that of the unregularized problem (Neu et al., 2017, Smirnova et al., 2019).
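
The sup-norm contraction and the unregularized limit in the finite case can be checked numerically. The sketch below assumes a discount factor `gamma` and an entropy weight `tau` (names introduced here, not taken from the source) and uses the standard log-sum-exp form of the regularized Bellman optimality operator on a random MDP.

```python
import numpy as np

def soft_bellman(v, P, r, gamma, tau):
    """(T v)(s) = tau * log sum_a exp(q(s,a) / tau), with q = r + gamma * P v."""
    q = r + gamma * P @ v
    m = q.max(axis=1)                                  # shift for numerical stability
    return m + tau * np.log(np.exp((q - m[:, None]) / tau).sum(axis=1))

# Random MDP and two arbitrary value vectors, for illustration only.
rng = np.random.default_rng(1)
S, A, gamma = 5, 4, 0.9
P = rng.random((S, A, S))
P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
v1, v2 = rng.normal(size=S), rng.normal(size=S)

# Contraction: ||T v1 - T v2||_inf <= gamma * ||v1 - v2||_inf.
lhs = np.max(np.abs(soft_bellman(v1, P, r, gamma, 0.5) - soft_bellman(v2, P, r, gamma, 0.5)))
rhs = gamma * np.max(np.abs(v1 - v2))

# Small tau: the soft operator exceeds the hard max by at most tau * log(A).
hard = (r + gamma * P @ v1).max(axis=1)
gap = soft_bellman(v1, P, r, gamma, 1e-3) - hard
```

Here `lhs <= rhs` illustrates the contraction, and `gap` lies between `0` and `tau * log(A)`, which is one way to see why shrinking the regularization weight recovers the unregularized Bellman operator.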

4. Practical Algorithms and Computational Aspects

Implementation of RERPI spans continuous PDE-based stochastic control, tabular/finite MDP solvers, and deep RL with function approximation.

Setting	Policy evaluation step	Policy improvement step
PDE control (continuous)	Solve a linear elliptic PDE on a mesh with $N_{\text{dof}}$ degrees of freedom	Integrate the softmax over $U$
Tabular MDP / discrete RL	Matrix-vector Bellman update	Elementwise softmax
Off-policy RL (actor–critic)	TD-target regression (critic update)	Weighted M-step (KL projection)

PDE Solution: Each iteration requires solving a linear elliptic PDE; complexity is $O(N_{\text{dof}}^3)$ with dense direct solvers, or nearly linear with multigrid or preconditioned iterative solvers (Tran et al., 2024).
Softmax Integration: In continuous actions, efficient quadrature or Monte Carlo sampling is essential for policy improvement.
Empirical Stability: The entropy regularization prevents premature collapse to “bang–bang” (deterministic/extremal) controls and improves safety and trajectory robustness, as demonstrated in high-dimensional control and gridworld experiments (Abdolmaleki et al., 2018, Neu et al., 2017, Smirnova et al., 2019).
Hyperparameter Schedules: In RL, annealing the regularization (e.g., $\tau_k \to 0$ polynomially or exponentially) transitions the algorithm from exploration to pure exploitation, with explicit rate-control implications (Smirnova et al., 2019).
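
The softmax integration step for continuous actions can be sketched with either quadrature or sampling. In the example below, the action set $U=[-1,1]$, the exponent `q` (standing in for $r + b\cdot Dv + \tfrac{1}{2}\mathrm{tr}(\sigma\sigma^T D^2 v)$ at a fixed state), the grid size, and the sample count are hypothetical choices made only for illustration.

```python
import numpy as np

def gibbs_density(q_vals, weights):
    """Normalize exp(q) on a quadrature grid; returns (density, log-partition)."""
    m = q_vals.max()                                   # shift for numerical stability
    unnorm = np.exp(q_vals - m)
    Z = np.sum(weights * unnorm)
    return unnorm / Z, m + np.log(Z)

def q_fn(u):
    """Hypothetical exponent of the Gibbs policy at one fixed state."""
    return -2.0 * (u - 0.3) ** 2

# Trapezoid quadrature on U = [-1, 1].
u = np.linspace(-1.0, 1.0, 201)
w = np.full_like(u, u[1] - u[0])
w[[0, -1]] *= 0.5
q = q_fn(u)
pi, log_Z = gibbs_density(q, w)

# Monte Carlo estimate of the same log-partition from uniform samples on U;
# the factor 2.0 is the length of U.
rng = np.random.default_rng(0)
samples = rng.uniform(-1.0, 1.0, size=200_000)
log_Z_mc = np.log(2.0 * np.mean(np.exp(q_fn(samples))))

# Regularized objective attained by the Gibbs density; equals log_Z on the grid.
objective = np.sum(w * (q - np.log(pi)) * pi)
```

Quadrature is usually adequate in low-dimensional action spaces, while sampling scales better with the action dimension at the cost of variance; in practice an importance-sampling proposal closer to the Gibbs density than the uniform one would reduce that variance.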

5. Relation to Other Regularized and Mirror-Descent Algorithms

The RERPI framework encapsulates and formalizes several lines of previous research:

Mirror Descent: RERPI can be interpreted as mirror descent in the policy space with Bregman divergence given by relative entropy, producing a multiplicative (exponentiated) update of the current policy, i.e., a softmax improvement step (Neu et al., 2017).
Trust-Region and REPS: RERPI bridges maximum a posteriori policy optimization (MPO), trust-region methods, and relative entropy policy search (REPS), generalizing these approaches by adding an explicit entropy regularization and, in many cases, decoupled updates of the mean and covariance in policy parameterizations (Abdolmaleki et al., 2018, Pacchiano et al., 2021).
Entropy-Regularized RL: The method generalizes “maximum-entropy” RL (e.g., Soft Q-Learning, Soft Actor-Critic) by formulating the entropy penalty as a principled KL constraint and advocating an explicit split between an E-step (non-parametric softmax) and an M-step (parametric projection) (Abdolmaleki et al., 2018, Smirnova et al., 2019).
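
The mirror-descent reading can be made concrete for a single state. The sketch below assumes a step size `eta` and a vector of action values `q` (both hypothetical) and uses the standard closed form for the KL-regularized step, $\pi_{\text{new}} \propto \pi_{\text{old}} \exp(\eta q)$; it is meant as an illustration of the connection, not as the update rule of any specific cited method.

```python
import numpy as np

def kl_mirror_step(pi_old, q, eta):
    """Closed-form argmax over the simplex of <pi, q> - (1/eta) * KL(pi || pi_old)."""
    logits = np.log(pi_old) + eta * q
    logits -= logits.max()                             # shift for numerical stability
    pi_new = np.exp(logits)
    return pi_new / pi_new.sum()

def objective(pi, pi_old, q, eta):
    """Linearized return minus the relative-entropy penalty."""
    return pi @ q - np.sum(pi * np.log(pi / pi_old)) / eta

# Hypothetical single-state example with four actions.
pi_old = np.array([0.4, 0.3, 0.2, 0.1])
q = np.array([1.0, 0.5, 2.0, -1.0])
eta = 0.7
pi_new = kl_mirror_step(pi_old, q, eta)

# With a uniform reference policy the step reduces to a plain softmax of eta * q.
uniform = np.full(4, 0.25)
softmax = np.exp(eta * q) / np.exp(eta * q).sum()
pi_from_uniform = kl_mirror_step(uniform, q, eta)
```

A larger `eta` weakens the KL penalty and moves `pi_new` further toward the greedy action, which matches the role of the inverse temperature in the softmax improvement step.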

6. Empirical Performance and Application Domains

RERPI and its algorithmic instances have demonstrated robust empirical performance in a variety of domains:

High-Dimensional Continuous Control: RERPI achieves leading results across benchmarks such as DeepMind Control Suite, Parkour, and OpenAI Gym while ensuring stability across a wide hyperparameter range. Robustness is enhanced via dual KL constraints and decoupled mean/covariance updates (Abdolmaleki et al., 2018).
Gridworlds and Discrete Domains: In structured environments, RERPI variants outperform both unregularized methods and those with weak or excessive regularization, provided the parameter schedule is chosen to match exploration needs (Neu et al., 2017).
Safety and Robustness: Regularized policy iterates exhibit greater tolerance to environmental and model stochasticity; intermediate iterates show “safe” behaviors (e.g., avoidance of dangerous states in cliff-walking) due to persistent entropy-bonus-driven exploration (Smirnova et al., 2019).
Model-Free RL: As shown in actor–critic frameworks and conflict-averse RL (CASA), enforcing compatible gradient updates and path consistency via joint regularization recovers RERPI behavior and improves both final performance and training stability (Xiao et al., 2021).

7. Summary and Outlook

Relative Entropy Regularized Policy Iteration provides a theoretically principled, algorithmically versatile class of iterative schemes for entropy-regularized control and reinforcement learning. By integrating softmax-based exploration at the policy-improvement level and KL divergence regularization in the optimization objective, RERPI methods achieve provable convergence, improved exploration-exploitation trade-offs, and empirical robustness across a range of RL and control settings. The convergence theory for both finite and continuous systems now rests on uniform regularity estimates, contraction properties, and compactness in suitable function spaces, providing a rigorous basis for further algorithmic innovations and practical implementations (Tran et al., 2024, Huang et al., 2022, Neu et al., 2017, Abdolmaleki et al., 2018, Smirnova et al., 2019).