
Soft Policy Iteration

Updated 25 February 2026
  • Soft policy iteration is a reinforcement learning approach that replaces standard policy improvement with entropy regularization to yield smooth Bellman operators and softmax policies.
  • The method alternates between soft policy improvement and policy evaluation, enabling enhanced exploration and robust convergence guarantees.
  • Its equivalence to a Newton–Raphson update provides both global linear and local quadratic convergence rates under strong convexity conditions.

Soft policy iteration refers to a class of dynamic programming algorithms for Markov decision processes (MDPs) in which the standard policy improvement step is replaced by optimization with respect to a strongly convex regularizer—most commonly the Shannon entropy. This “softening” or regularization leads to smoothed Bellman operators, differentiable softmax policies, and improved exploration properties. Contemporary research reveals a deep connection between soft policy iteration and Newton–Raphson root-finding methods applied to a smoothed Bellman equation, thereby enabling a rigorous convergence analysis with explicit linear and local quadratic rates (Li et al., 2023). Soft policy iteration forms the theoretical backbone of widely used modern algorithms such as Soft Q-Learning and Soft Actor-Critic.

1. Formalization: Regularized Bellman Operators and Smoothed Bellman Equation

Let $M = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$ be a finite MDP. For Shannon entropy regularization, define the per-state regularizer as

$$\Omega(\pi(\cdot|s)) \triangleq \sum_{a} \pi(a|s)\, \log \pi(a|s).$$

The regularized Bellman operator and its associated self-consistency form are:

  • Regularized self-consistency operator (for policy $\pi$ and action-value function $q$):

$$[T^{\pi}_\Omega q](s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a)\left(\langle \pi(\cdot|s'), q(s',\cdot)\rangle - \tfrac{1}{N}\Omega(\pi(\cdot|s'))\right).$$

  • Regularized Bellman operator:

$$[B_\Omega q](s,a) = \max_{\pi(\cdot|s)} [T^{\pi}_\Omega q](s,a).$$

The smoothed max operator $\max_\Omega$ yields, for $x\in\mathbb{R}^m$ and temperature $N$,

$$\max_\Omega(x) = \frac{1}{N}\log \sum_{i=1}^m \exp(N x_i),$$

with gradient given by the softmax: $\nabla \max_\Omega(x)_i = \frac{\exp(N x_i)}{\sum_j \exp(N x_j)}$.
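For Shannon entropy, both quantities reduce to log-sum-exp and softmax, which should be computed with a max shift for numerical stability. A minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def smoothed_max(x, N):
    """Entropy-smoothed max: (1/N) log sum_i exp(N x_i).

    Subtracting max(x) before exponentiating avoids overflow;
    the shift cancels exactly in the final result.
    """
    m = x.max()
    return m + np.log(np.exp(N * (x - m)).sum()) / N

def smoothed_max_grad(x, N):
    """Gradient of the smoothed max: the softmax distribution over x."""
    e = np.exp(N * (x - x.max()))
    return e / e.sum()
```

As $N \to \infty$, `smoothed_max` approaches the hard max and its gradient concentrates on the argmax; for finite $N$ it upper-bounds the max by at most $(\log m)/N$.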

The smoothed Bellman equation in vector notation is

$$F_\Omega(q) := B_\Omega(q) - q = \gamma P f_\Omega(q) + r - q = 0,$$

where $[f_\Omega(q)]_s = \max_\Omega(q(s,\cdot))$.

2. Soft Policy Iteration Algorithm

Soft Policy Iteration alternates between entropy-regularized policy improvement and policy evaluation. The canonical algorithmic structure is:

  • Policy Improvement:

$$\pi_{k+1}(\cdot|s) = \nabla \max_\Omega(q_k(s,\cdot)),$$

which, for Shannon entropy, is the standard softmax.

  • Policy Evaluation:

Solve for $q_{k+1}$ as the unique solution of

$$q = T^{\pi_{k+1}}_\Omega(q).$$

This corresponds (for full evaluation) to solving the linear system

$$[I - \gamma P\,\mathrm{diag}\{\pi_{k+1}\}]\, q_{k+1} = r + \gamma P\, e_\Omega(q_k),$$

where $[e_\Omega(q_k)]_s = -\frac{1}{N}\Omega(\pi_{k+1}(\cdot|s))$ (Li et al., 2023).

| Step | Formula/Mechanism | Resulting Object |
|---|---|---|
| Policy Improvement | $\pi_{k+1}(\cdot \mid s) = \nabla \max_\Omega(q_k(s,\cdot))$ (softmax) | New stochastic policy |
| Policy Evaluation | Solve $q = T^{\pi_{k+1}}_\Omega(q)$, a linear system for $q_{k+1}$ (or $M$ steps for inexact evaluation) | New value or Q-function |

The same principle applies to the value-based soft policy iteration as in “Soft Policy Iteration (SPI),” where the optimal regularized Bellman operator is

$$(\mathcal{T}^*_\tau V)(s) = \tau \log \sum_{a} \exp(Q_V(s,a)/\tau),$$

with the improved policy given by the Boltzmann distribution (Smirnova et al., 2019).
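The alternation above can be sketched for a tabular MDP. A minimal NumPy implementation under the Shannon-entropy setting, with the $1/N$-weighted entropy bonus from the evaluation step (the random-MDP setup and names are illustrative):

```python
import numpy as np

def softmax(q, N):
    """Row-wise softmax at temperature N (gradient of the smoothed max)."""
    z = N * (q - q.max(axis=1, keepdims=True))
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def soft_policy_iteration(P, r, gamma, N, iters):
    """Tabular soft policy iteration with exact evaluation.

    P: (S, A, S) transition tensor, r: (S, A) rewards.
    """
    S, A = r.shape
    q = np.zeros((S, A))
    for _ in range(iters):
        pi = softmax(q, N)                            # improvement: softmax policy
        # per-state entropy bonus, weighted by 1/N
        H = -(pi * np.log(pi + 1e-12)).sum(axis=1) / N
        # P_pi[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s')
        P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
        # evaluation: solve (I - gamma P_pi) q = r + gamma P H exactly
        b = (r + gamma * P @ H).reshape(S * A)
        q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, b).reshape(S, A)
    return q, softmax(q, N)
```

At the fixed point, $q$ satisfies the smoothed Bellman equation $q = r + \gamma P f_\Omega(q)$ and the returned policy is the soft-optimal softmax policy.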

3. Equivalence to Newton–Raphson Method

A central advance is the observation that regularized (soft) policy iteration with strongly convex regularizer is strictly equivalent to a one-step Newton iteration applied to the smoothed Bellman equation FΩ(q)=0F_\Omega(q)=0. Specifically, Newton’s update reads

$$q_{k+1}^{\mathrm{NR}} = q_k - [F'_\Omega(q_k)]^{-1} F_\Omega(q_k),$$

and can be algebraically verified (Theorem III.1 of Li et al., 2023) to coincide with the solution of the regularized policy evaluation step when $\pi_{k+1} = \nabla \max_\Omega(q_k)$. The Jacobian $F'_\Omega(q) = \gamma P \nabla f_\Omega(q) - I$ drives the direction and scaling of the update, and the regularization ensures invertibility and strong monotonicity.

This equivalence enables a unified framework for the analysis of both global and local convergence properties of entropy-regularized dynamic programming algorithms.
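The equivalence is easy to check numerically on a small random MDP: one Newton step on $F_\Omega$ coincides with one improvement-plus-exact-evaluation step. A sketch under the same Shannon-entropy setting (setup is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, N = 3, 2, 0.9, 2.0
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
q = rng.normal(size=(S, A))                      # arbitrary starting point

# softmax policy pi = grad max_Omega(q) and smoothed max f = f_Omega(q)
z = N * (q - q.max(axis=1, keepdims=True))
pi = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
f = q.max(axis=1) + np.log(np.exp(z).sum(axis=1)) / N

# P_pi[(s,a),(s',a')] = P(s'|s,a) pi(a'|s'), i.e. P * grad f_Omega(q)
P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)

# one Newton step: q - [F'(q)]^{-1} F(q), with F(q) = r + gamma P f - q
F = (r + gamma * P @ f - q).reshape(-1)
J = gamma * P_pi - np.eye(S * A)                 # Jacobian F'_Omega(q)
q_newton = q.reshape(-1) - np.linalg.solve(J, F)

# one soft PI step: exact evaluation of pi via the linear system
H = -(pi * np.log(pi)).sum(axis=1) / N           # entropy bonus e_Omega
b = (r + gamma * P @ H).reshape(-1)
q_spi = np.linalg.solve(np.eye(S * A) - gamma * P_pi, b)
```

The two iterates agree to machine precision, which is exactly the algebraic content of the equivalence.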

4. Global Linear and Local Quadratic Convergence Rate

The convergence of soft policy iteration, under Shannon entropy regularization, is formally characterized as follows:

  • Global Linear Rate: For any initial $q_0$, the sequence $\{q_k\}$ is monotone ($q_1 \leq q_2 \leq \cdots \leq q_*$), bounded above by $q_*$, and converges $\gamma$-linearly:

$$\|q_* - q_{k+1}\|_{\infty} \leq \gamma\, \|q_* - q_k\|_{\infty}.$$

An explicit bound is

$$\|q_* - q_k\|_{\infty} \leq \frac{2\gamma^k}{1-\gamma} \min\left\{\|q_* - q_0\|_\infty,\; \|F_\Omega(q_0)\|_\infty\right\}.$$

  • Local Quadratic Rate: When $q_k$ enters a sufficiently small neighborhood of radius $r$ around $q_*$,

$$\|q_{k+1} - q_*\|_\infty \leq C\, \|q_k - q_*\|_\infty^2,$$

with constants $C, r$ depending on $\gamma$, $N$, and the strong convexity of $\Omega$. This quadratic convergence regime is enabled by the strong convexity and Lipschitz properties of the regularizer (Li et al., 2023).

  • In the more general regularized Modified Policy Iteration (soft MPI) framework, choosing the evaluation depth $m$ and the regularization temperature $\tau$ yields explicit rates and optimality bounds. For a decaying schedule $\tau_t \to 0$, convergence to the unregularized $V^*$ is established, accompanied by error bounds depending on the regularization schedule (Smirnova et al., 2019).
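Both guarantees can be probed empirically. A sketch that runs exact soft policy iteration on a small random MDP and records sup-norm errors against a late iterate used as a numerical proxy for $q_*$ (setup and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, N = 4, 3, 0.8, 3.0
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

def spi_step(q):
    """One exact soft PI step: softmax improvement + full evaluation."""
    z = N * (q - q.max(axis=1, keepdims=True))
    pi = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    H = -(pi * np.log(pi)).sum(axis=1) / N
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    b = (r + gamma * P @ H).reshape(-1)
    return np.linalg.solve(np.eye(S * A) - gamma * P_pi, b).reshape(S, A)

qs = [np.zeros((S, A))]
for _ in range(60):
    qs.append(spi_step(qs[-1]))
q_star = qs[-1]                                   # numerical proxy for q_*
errs = [np.abs(q - q_star).max() for q in qs[:40]]
```

On such runs the error sequence contracts at least by the factor $\gamma$ per step and the iterates increase monotonically toward $q_*$, matching the stated guarantees.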

5. Inexact Newton Method: Finite-Step Policy Evaluation and $\gamma^M$ Rate

When policy evaluation is performed only approximately, i.e., via a finite number $M$ of Jacobi or value-iteration steps, the resulting update implements an inexact Newton method. This modification yields the system

$$q_{k+1} = \big(T^{\pi_{k+1}}_\Omega\big)^{M}(q_k) = q_k + \sum_{i=0}^{M-1}\big(\gamma P\,\nabla f_\Omega(q_k)\big)^i F_\Omega(q_k),$$

accompanied by a residual term capturing the error due to incomplete evaluation:

$$F'_\Omega(q_k)\, s_k = -F_\Omega(q_k) + r_k, \qquad q_{k+1} = q_k + s_k,$$

where the residual $r_k$ decays as $M$ increases.

The asymptotic local linear convergence rate is then $\gamma^M$ in an appropriate norm, yielding the error bound

$$\|q_{k}-q_*\|_\infty = O\big((\gamma^M)^k\big).$$

This formalizes the intuition that more evaluation steps per improvement accelerate the local rate, interpolating between soft value iteration ($M = 1$, rate $\gamma$) and exact soft policy iteration, i.e., a full Newton step ($M \to \infty$) (Li et al., 2023).
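The interpolation can be illustrated by replacing the exact solve with $M$ applications of $T^{\pi_{k+1}}_\Omega$. A sketch of this soft modified policy iteration (illustrative; the policy is held fixed during the $M$ inner sweeps):

```python
import numpy as np

def soft_mpi(P, r, gamma, N, M, iters):
    """Soft policy iteration with M evaluation sweeps per improvement.

    M = 1 recovers soft value iteration; letting M grow approaches
    exact soft policy iteration, i.e. a full Newton step.
    """
    S, A = r.shape
    q = np.zeros((S, A))
    for _ in range(iters):
        z = N * (q - q.max(axis=1, keepdims=True))
        pi = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # improvement
        for _ in range(M):                                     # inexact evaluation
            v = (pi * q).sum(axis=1) - (pi * np.log(pi)).sum(axis=1) / N
            q = r + gamma * P @ v                              # one T^pi_Omega sweep
    return q
```

Different choices of $M$ converge to the same soft-optimal fixed point; larger $M$ needs fewer outer improvement steps.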

6. Extensions, Function Approximation, and Practical Consequences

In the function approximation regime, as addressed by stationary-reweighted Soft FQI (Laan et al., 30 Dec 2025), the contraction property of the soft Bellman operator holds only in the stationary norm of the soft-optimal policy $\mu_*$, rather than in the behavior norm typical of standard FQI. To restore contraction, a stationary reweighting is employed, aligning the regression with the contraction geometry. Formal results guarantee local linear convergence up to an error floor determined by misspecification, density ratio estimation, and finite-sample effects.

Continuation approaches, in which the temperature parameter $\tau$ is annealed from high to low, permit a global iterative solution by traversing a sequence of decreasingly regularized problems. Under appropriate margin conditions, contraction persists even as $\tau \to 0$ (Laan et al., 30 Dec 2025).

Empirically, soft policy iteration and entropy-regularized algorithms demonstrate enhanced exploration, robustness to environmental stochasticity, and improved safety (e.g., avoidance of risky trajectories), at the cost of a controllable sub-optimality gap. Softened improvement steps are also central to frameworks for deep RL, including Soft Modified Policy Iteration (MoSoPI) and MoPPO, which combine partial off-policy evaluation with clipping or trust-region constraints, producing substantial gains in sample efficiency versus on-policy methods (Merdivan et al., 2019, Smirnova et al., 2019).

7. Connections to Broader Algorithmic Families

Soft policy iteration subsumes or unifies a broad range of entropy-regularized RL algorithms, including Soft Q-Learning and Soft Actor-Critic, and provides the theoretical foundation for their convergence rates and stability. The equivalence to Newton–Raphson iterations facilitates use of powerful tools from nonlinear optimization and fixed-point theory. Notably, soft policy iteration guarantees global linear and local quadratic convergence in the presence of strongly convex regularizers, and its generalizations under function approximation and partial evaluation extend its practical impact across the landscape of contemporary reinforcement learning algorithms (Li et al., 2023, Laan et al., 30 Dec 2025, Smirnova et al., 2019, Merdivan et al., 2019).
