Entropy Regularised Policy Iteration
- Entropy Regularised Policy Iteration is a reinforcement learning framework that augments classical policy iteration with an entropy bonus to encourage exploration and stabilize updates.
- It employs KL divergence and entropy regularization to control per-iteration policy changes, ensuring convergence and robust performance in both discrete and continuous control settings.
- The method connects to soft actor-critic, trust-region, and mirror descent approaches, offering practical benefits in exploration, stability, and efficiency for complex RL tasks.
Entropy Regularised Policy Iteration is a class of algorithms for reinforcement learning and stochastic control that augments classical policy iteration with an explicit regularization term favoring high-entropy (i.e., more randomized) policies, or penalizing deviation from a reference or previous policy, usually via relative entropy terms. This regularization—typically implemented as a Kullback-Leibler (KL) divergence or Shannon entropy bonus—aims to stabilize policy updates, promote effective exploration, restrict per-iteration changes, and admit rigorous convergence guarantees and contractivity properties. The entropy regularization paradigm underpins a range of practical, theoretically justified algorithms for both discrete and continuous control, with close connections to “soft policy iteration” in deep RL, mirror descent, trust-region methods, and natural gradient flows.
1. Formulation of the Entropy-Regularised Objective
The entropy-regularised policy iteration framework modifies the canonical expected-return objective of reinforcement learning by introducing a penalty term that measures the “distance” between updated and previous policies, or a bonus proportional to the entropy of the policy. For a policy $\pi$, the standard objective is

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right],$$

where $\tau = (s_0, a_0, s_1, a_1, \dots)$ is a trajectory generated under $\pi$. In entropy-regularised policy iteration, the objective becomes

$$J_{\alpha}(\pi) = J(\pi) \;-\; \alpha\, \mathbb{E}_{s \sim d^{\pi_{\mathrm{ref}}}}\!\left[\mathrm{KL}\!\left(\pi(\cdot \mid s)\,\big\|\,\pi_{\mathrm{ref}}(\cdot \mid s)\right)\right],$$

with $\alpha > 0$ controlling the strength of regularization and $d^{\pi_{\mathrm{ref}}}$ the discounted state visitation distribution under the reference (e.g., previous) policy $\pi_{\mathrm{ref}}$. The KL term restricts per-iteration policy divergence, providing a trust region on policy space and stabilizing optimization trajectories (Abdolmaleki et al., 2018).
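As a purely illustrative example, the minimal sketch below evaluates this KL-regularized objective for a small discrete problem; the function names, toy distributions, and the scalar return estimate are assumptions made for the sake of the illustration, not part of the cited formulation.

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions given as 1-D arrays."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def regularized_objective(expected_return, pi, pi_ref, state_dist, alpha):
    """J_alpha(pi) = E[return] - alpha * E_{s ~ d}[ KL(pi(.|s) || pi_ref(.|s)) ]."""
    penalty = sum(d * kl_divergence(pi[s], pi_ref[s])
                  for s, d in enumerate(state_dist))
    return expected_return - alpha * penalty

# Toy example: two states, two actions (all numbers illustrative).
pi = np.array([[0.7, 0.3],
               [0.5, 0.5]])        # candidate policy pi(a|s)
pi_ref = np.full((2, 2), 0.5)      # reference (previous) policy
d_ref = np.array([0.6, 0.4])       # discounted state visitation under pi_ref
print(regularized_objective(expected_return=1.0, pi=pi, pi_ref=pi_ref,
                            state_dist=d_ref, alpha=0.1))
```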
2. Algorithmic Structure: Three-Stage Policy Iteration
The canonical entropy-regularised policy iteration algorithm can be decomposed into three stages, each of which admits flexible instantiation:
- Policy Evaluation: Fit a parametric Q-function $Q_\phi$ by minimizing the temporal-difference (TD) error over replayed samples, typically via stochastic gradient descent:

$$\min_{\phi}\; \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\!\left[\big(Q_\phi(s,a) - y\big)^{2}\right], \qquad y = r + \gamma\, \mathbb{E}_{a' \sim \pi_{\mathrm{old}}(\cdot \mid s')}\!\left[Q_{\phi'}(s', a')\right],$$

with $\pi_{\mathrm{old}}$ the current reference policy and $Q_{\phi'}$ a potentially lagged target network.
- Policy Improvement (Non-Parametric): For each sampled state $s_i$, draw actions $a_{ij} \sim \pi_{\mathrm{old}}(\cdot \mid s_i)$, evaluate $Q_\phi(s_i, a_{ij})$, and compute normalized weights

$$w_{ij} \;\propto\; \exp\!\left(\frac{Q_\phi(s_i, a_{ij})}{\eta}\right),$$

where the temperature $\eta$ is obtained by minimizing a convex dual function under the KL constraint $\mathbb{E}_{s}\!\left[\mathrm{KL}\!\left(q(\cdot \mid s)\,\|\,\pi_{\mathrm{old}}(\cdot \mid s)\right)\right] \le \epsilon$.
- Policy Generalization (Parametric Fitting): Fit a parametric policy $\pi_\theta$ by maximum likelihood estimation weighted by $w_{ij}$ and constrained by the average KL to the reference policy:

$$\max_{\theta}\; \sum_{i,j} w_{ij}\, \log \pi_\theta(a_{ij} \mid s_i) \quad \text{s.t.} \quad \mathbb{E}_{s}\!\left[\mathrm{KL}\!\left(\pi_{\mathrm{old}}(\cdot \mid s)\,\|\,\pi_\theta(\cdot \mid s)\right)\right] \le \epsilon_\pi.$$

For Gaussian policies, mean and covariance updates can be decoupled with separate KL thresholds (Abdolmaleki et al., 2018); a schematic tabular implementation of this three-stage loop is sketched below.
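The following minimal tabular sketch runs the three-stage loop on a toy two-state MDP. The helper names (`policy_evaluation`, `nonparametric_improvement`, `parametric_fit`), the KL budget, and the use of scipy for the one-dimensional dual minimization are illustrative choices; Abdolmaleki et al. (2018) instead employ neural Q-functions, replay buffers, and a KL-constrained weighted-MLE fit of a parametric (e.g., Gaussian) policy.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def policy_evaluation(P, R, pi, gamma, num_sweeps=100):
    """Approximate Q^pi with a fixed number of expected Bellman sweeps
    (a tabular stand-in for TD fitting of a parametric Q-network)."""
    Q = np.zeros_like(R)
    for _ in range(num_sweeps):
        V = np.sum(pi * Q, axis=1)          # V(s) = E_{a ~ pi}[Q(s, a)]
        Q = R + gamma * (P @ V)             # one-step expected backup
    return Q

def nonparametric_improvement(Q, pi_old, epsilon=0.1):
    """Exponentiated-Q weights, with temperature eta from the KL-constrained dual."""
    q_max = Q.max(axis=1)

    def dual(eta):
        # g(eta) = eta*epsilon + eta * mean_s log E_{a~pi_old}[ exp(Q(s,a)/eta) ],
        # evaluated with a max-subtraction for numerical stability.
        stable = np.log(np.sum(pi_old * np.exp((Q - q_max[:, None]) / eta), axis=1))
        return eta * epsilon + np.mean(q_max + eta * stable)

    eta = minimize_scalar(dual, bounds=(1e-3, 1e3), method="bounded").x
    w = pi_old * np.exp((Q - q_max[:, None]) / eta)
    return w / w.sum(axis=1, keepdims=True)  # normalized weights per state

def parametric_fit(weights):
    """Tabular stand-in for the KL-constrained weighted-MLE fit of a
    parametric policy: here the projection is simply the identity."""
    return weights

# Toy 2-state, 2-action MDP: transitions P[s, a, s'] and rewards R[s, a].
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.3, 0.7], [0.6, 0.4]]])
R = np.array([[1.0, 0.0], [0.0, 2.0]])
pi = np.full((2, 2), 0.5)                    # start from the uniform policy

for _ in range(20):
    Q = policy_evaluation(P, R, pi, gamma=0.9)
    weights = nonparametric_improvement(Q, pi)
    pi = parametric_fit(weights)
print(np.round(pi, 3))                       # mass shifts gradually to better actions
```

In the tabular case the parametric-fitting stage reduces to the identity, which keeps the sketch short; with function approximation this stage is where the second KL constraint (and the mean/covariance decoupling for Gaussian policies) enters.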
3. Theoretical Properties and Convergence Analysis
The introduction of entropy regularization fundamentally alters the contraction properties and algorithmic structure of policy iteration:
- Newton-Raphson Equivalence: Entropy-regularised policy iteration coincides with one step of Newton–Raphson applied to the smoothed Bellman equation, yielding explicit global linear convergence at rate $\gamma$ and local quadratic convergence in a neighborhood of the optimum (Li et al., 2023).
- Contractivity and Monotonicity: The regularized Bellman operator is a $\gamma$-contraction in sup-norm, ensuring a unique fixed point and monotone convergence (Smirnova et al., 2019); a small numerical check of this property appears after this list.
- Finite-Step and Inexact Evaluation: Policy iteration with finitely many evaluation steps is equivalent to an inexact Newton method, with a contraction rate governed by the discount factor and the number of inner evaluation steps, providing a precise trade-off between computation and convergence speed (Li et al., 2023).
- Suboptimality Bounds: For a fixed regularization strength $\tau$, the optimal $\tau$-regularized value function is $O(\tau)$-suboptimal relative to the hard (unregularized) optimal value, with sharp upper and lower bounds that decay exponentially in $1/\tau$ at a rate governed by the gap between optimal and suboptimal actions (Müller et al., 2024).
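As a quick sanity check of the contraction property, the brief sketch below applies a log-sum-exp (soft) Bellman operator to two value functions on a random MDP and verifies that their sup-norm distance shrinks by at least the factor $\gamma$; the sizes, temperature, and random seed are illustrative assumptions.

```python
import numpy as np

# Numerical check of the gamma-contraction of the entropy-regularized Bellman operator.
rng = np.random.default_rng(0)
S, A, gamma, tau = 5, 3, 0.9, 0.5

P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)   # transition kernel
R = rng.random((S, A))                                          # rewards

def soft_bellman(V):
    """(T_tau V)(s) = tau * log sum_a exp( (R(s,a) + gamma * E[V(s')]) / tau )."""
    Q = R + gamma * (P @ V)
    return tau * np.log(np.sum(np.exp(Q / tau), axis=1))

V1, V2 = rng.random(S), rng.random(S)
lhs = np.max(np.abs(soft_bellman(V1) - soft_bellman(V2)))
rhs = gamma * np.max(np.abs(V1 - V2))
print(f"||T V1 - T V2||_inf = {lhs:.4f} <= gamma * ||V1 - V2||_inf = {rhs:.4f}")
```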
4. Exploration, Stability, and Practical Implications
Relative entropy regularization confers profound practical benefits:
- Exploration Guarantee: The KL-constraint prevents the policy from immediately collapsing to greedy or deterministic behavior, ensuring continuous stochastic exploration (Abdolmaleki et al., 2018, Roostaie et al., 2021).
- Update Stability: By bounding the per-iteration KL change, the algorithm prevents catastrophic policy changes (destructive updates), a critical property for high-dimensional or deep RL contexts (Abdolmaleki et al., 2018).
- Trust-Region Interpretation: The regularization formalizes trust-region methods, connecting entropy-regularized policy iteration to TRPO, natural gradient, and mirror-descent paradigms (Roostaie et al., 2021, Li et al., 2023).
- Generalization and Robustness: Empirical studies demonstrate effective handling of both low- and high-dimensional action spaces, robustness to Q-function approximation error, and competitive or superior performance across a broad sweep of standard benchmarks (DeepMind Control Suite, Parkour suite, OpenAI Gym/MuJoCo) (Abdolmaleki et al., 2018).
5. Algorithmic Variants and Connections
Entropy-regularised policy iteration admits a diverse array of concrete realizations:
| Variant/Framework | Regularization | Key Distinction / Link |
|---|---|---|
| Maximum a Posteriori Policy Optimisation (MPO) | KL w.r.t. old policy | Weight-based local improvement, hard constraint |
| Trust-Region Policy Optimization (TRPO), EnTRPO | KL trust region | On-policy trust region; EnTRPO adds a replay buffer and an additive entropy bonus |
| Soft Q-Learning / Soft Actor-Critic (SAC) | Shannon entropy | Softmax policy improvement, policy gradient |
| f-Divergence/Mirror Ascent RL | General $f$-divergence | Unifies entropy-regularized and chi-squared updates |
The general framework subsumes and provides theoretical foundations for commonly used reinforcement learning schemes, including soft value/value-iteration, regularized approximate dynamic programming, and advanced actor-critic approaches (Abdolmaleki et al., 2018, Roostaie et al., 2021, Smirnova et al., 2019, Belousov et al., 2019).
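These connections can be made concrete through the closed-form improvement step the variants share; in the (assumed, standard) notation below, the KL-regularized update reweights the previous policy, while the Shannon-entropy bonus corresponds to the special case of a uniform reference policy:

$$\pi_{\mathrm{new}}(a \mid s) \;\propto\; \pi_{\mathrm{old}}(a \mid s)\, \exp\!\left(\frac{Q(s,a)}{\eta}\right) \qquad \text{(KL-regularized, MPO-style)},$$

$$\pi_{\mathrm{new}}(a \mid s) \;\propto\; \exp\!\left(\frac{Q^{\mathrm{soft}}(s,a)}{\alpha}\right) \qquad \text{(Shannon-entropy bonus, soft Q-learning / SAC)}.$$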
6. Extensions: Continuous Control and Distributional Limits
Extensions of entropy regularized policy iteration include:
- Continuous Control: Entropy-regularized policy iteration for continuous state and action spaces employs Gibbs (Boltzmann) densities as soft-greedy policies, leading to analytical forms for the relaxed Hamilton–Jacobi–Bellman equations and convergence of the policy iteration algorithm even in nonlinear diffusions (Huang et al., 2022, Feng et al., 9 Jun 2025, Ma et al., 2024, Tran et al., 2024).
- Distributional and Maximum Entropy RL: In the vanishing entropy limit (temperature $\tau \to 0$), policy iteration with temperature decoupling converges to interpretable and diversity-preserving policies, such as uniform distributions over equally-optimal actions (“reference-optimality”), and supports algorithms estimating return distributions with precise convergence guarantees (Jhaveri et al., 9 Oct 2025); a small numerical illustration of this limit follows this list.
- State Distribution Entropy: Maximizing the entropy of the discounted future state distribution, rather than policy entropy, further encourages thorough exploration of the state space, improving coverage and learning speed in sparse-reward and high-difficulty tasks (Islam et al., 2019).
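A tiny numerical illustration of the vanishing-temperature limit (the toy Q-values and function name are assumptions) shows the softmax (Boltzmann) policy converging to the uniform distribution over the set of optimal actions:

```python
import numpy as np

def softmax_policy(q, tau):
    """Boltzmann policy over Q-values at temperature tau."""
    z = (q - q.max()) / tau          # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()

q = np.array([1.0, 2.0, 2.0, 0.5])   # two equally optimal actions
for tau in (1.0, 0.1, 0.01, 0.001):
    print(tau, np.round(softmax_policy(q, tau), 3))
# -> approaches [0, 0.5, 0.5, 0]: uniform over the argmax set as tau -> 0
```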
7. Empirical Performance and Implementation Details
Empirical studies confirm the practical advantages of entropy-regularised policy iteration:
- Sample Efficiency and Robustness: The method is effective in both tabular and high-dimensional function-approximation settings, with rapid, robust convergence observed even with approximate Q-functions and a single set of hyperparameters (Abdolmaleki et al., 2018). Decoupled update schemes for mean and covariance in Gaussian policies control premature variance collapse in high dimensions.
- Safety and “Conservatism”: Early iterates of entropy-regularized algorithms exhibit safer, more globally-exploratory behavior (e.g., avoiding hazardous “cliff” regions in gridworlds) compared to pure greedy approaches (Smirnova et al., 2019).
- Algorithmic Structure: The three-stage loop of policy evaluation, non-parametric improvement, and parametric generalization is readily implemented with neural function approximators, off-policy (replay-buffer) sampling, and weighted MLE, with regularization strengths tuned by dual-function optimization or grid search (Abdolmaleki et al., 2018).
In summary, entropy-regularised policy iteration constitutes a unified, principled framework with significant theoretical and practical implications for stable, exploratory, and efficient learning and control in modern reinforcement learning systems (Abdolmaleki et al., 2018, Li et al., 2023, Müller et al., 2024).