
Soft Policy Iteration

Updated 25 February 2026
  • Soft policy iteration is a reinforcement learning approach that replaces standard policy improvement with entropy regularization to yield smooth Bellman operators and softmax policies.
  • The method alternates between soft policy improvement and policy evaluation, enabling enhanced exploration and robust convergence guarantees.
  • Its equivalence to a Newton–Raphson update provides both global linear and local quadratic convergence rates under strong convexity conditions.

Soft policy iteration refers to a class of dynamic programming algorithms for Markov decision processes (MDPs) in which the standard policy improvement step is replaced by optimization with respect to a strongly convex regularizer—most commonly the Shannon entropy. This “softening” or regularization leads to smoothed Bellman operators, differentiable softmax policies, and improved exploration properties. Contemporary research reveals a deep connection between soft policy iteration and Newton–Raphson root-finding methods applied to a smoothed Bellman equation, thereby enabling a rigorous convergence analysis with explicit linear and local quadratic rates (Li et al., 2023). Soft policy iteration forms the theoretical backbone of widely used modern algorithms such as Soft Q-Learning and Soft Actor-Critic.

1. Formalization: Regularized Bellman Operators and Smoothed Bellman Equation

Let $M = (\mathcal{S}, \mathcal{A}, P, r, \gamma)$ be a finite MDP. For Shannon entropy regularization, define the per-state regularizer as

$$\Omega(\pi(\cdot|s)) \triangleq \sum_{a} \pi(a|s)\, \log \pi(a|s).$$

The regularized Bellman operator and its associated self-consistency form are:

  • Regularized self-consistency operator (for policy $\pi$ and action-value function $q$):

$$[T^{\pi}_\Omega q](s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a)\left(\langle \pi(\cdot|s'), q(s',\cdot)\rangle - \tfrac{1}{N}\Omega(\pi(\cdot|s'))\right).$$

  • Regularized Bellman operator:

$$[B_\Omega q](s,a) = \max_{\pi(\cdot|s)} [T^{\pi}_\Omega q](s,a).$$

The smoothed max operator $\max_\Omega$ yields, for $x\in\mathbb{R}^m$ and temperature $N$,

$$\max_\Omega(x) = \frac{1}{N}\log \sum_{i=1}^m \exp(N x_i),$$

with gradient given by the softmax: $\nabla \max_\Omega(x)_i = \frac{\exp(N x_i)}{\sum_j \exp(N x_j)}$.
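For Shannon entropy, both quantities reduce to log-sum-exp and softmax, which should be computed with a max shift for numerical stability. A minimal NumPy sketch (function names are illustrative):

```python
import numpy as np

def smoothed_max(x, N):
    """Entropy-smoothed max: (1/N) log sum_i exp(N x_i).

    Subtracting max(x) before exponentiating avoids overflow;
    the shift cancels exactly in the final result.
    """
    m = x.max()
    return m + np.log(np.exp(N * (x - m)).sum()) / N

def smoothed_max_grad(x, N):
    """Gradient of the smoothed max: the softmax distribution over x."""
    e = np.exp(N * (x - x.max()))
    return e / e.sum()
```

As $N \to \infty$, `smoothed_max` approaches the hard max and its gradient concentrates on the argmax; for finite $N$ it upper-bounds the max by at most $(\log m)/N$.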

The smoothed Bellman equation in vector notation is

$$F_\Omega(q) := B_\Omega(q) - q = \gamma P f_\Omega(q) + r - q = 0,$$

where $[f_\Omega(q)]_s = \max_\Omega(q(s,\cdot))$.

2. Soft Policy Iteration Algorithm

Soft Policy Iteration alternates between entropy-regularized policy improvement and policy evaluation. The canonical algorithmic structure is:

  • Policy Improvement:

$$\pi_{k+1}(\cdot|s) = \nabla \max_\Omega(q_k(s,\cdot)),$$

which, for Shannon entropy, is the standard softmax.

  • Policy Evaluation:

Solve for $q_{k+1}$ as the unique solution of

$$q = T^{\pi_{k+1}}_\Omega(q).$$

This corresponds (for full evaluation) to solving the linear system

$$[I - \gamma P\,\mathrm{diag}\{\pi_{k+1}\}]\, q_{k+1} = r + \gamma P\, e_\Omega(q_k),$$

where $[e_\Omega(q_k)]_s = -\frac{1}{N}\Omega(\pi_{k+1}(\cdot|s))$ (Li et al., 2023).

| Step | Formula/Mechanism | Resulting Object |
|---|---|---|
| Policy Improvement | $\pi_{k+1}(\cdot \mid s) = \nabla \max_\Omega(q_k(s,\cdot))$ (softmax) | New stochastic policy |
| Policy Evaluation | Solve $q = T^{\pi_{k+1}}_\Omega(q)$, a linear system for $q_{k+1}$ (or $M$ steps for inexact evaluation) | New value or Q-function |

The same principle applies to the value-based soft policy iteration as in “Soft Policy Iteration (SPI),” where the optimal regularized Bellman operator is

$$(\mathcal{T}^*_\tau V)(s) = \tau \log \sum_{a} \exp(Q_V(s,a)/\tau),$$

with the improved policy given by the Boltzmann distribution (Smirnova et al., 2019).
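The alternation above can be sketched for a tabular MDP. A minimal NumPy implementation under the Shannon-entropy setting, with the $1/N$-weighted entropy bonus from the evaluation step (the random-MDP setup and names are illustrative):

```python
import numpy as np

def softmax(q, N):
    """Row-wise softmax at temperature N (gradient of the smoothed max)."""
    z = N * (q - q.max(axis=1, keepdims=True))
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def soft_policy_iteration(P, r, gamma, N, iters):
    """Tabular soft policy iteration with exact evaluation.

    P: (S, A, S) transition tensor, r: (S, A) rewards.
    """
    S, A = r.shape
    q = np.zeros((S, A))
    for _ in range(iters):
        pi = softmax(q, N)                            # improvement: softmax policy
        # per-state entropy bonus, weighted by 1/N
        H = -(pi * np.log(pi + 1e-12)).sum(axis=1) / N
        # P_pi[(s,a),(s',a')] = P(s'|s,a) * pi(a'|s')
        P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
        # evaluation: solve (I - gamma P_pi) q = r + gamma P H exactly
        b = (r + gamma * P @ H).reshape(S * A)
        q = np.linalg.solve(np.eye(S * A) - gamma * P_pi, b).reshape(S, A)
    return q, softmax(q, N)
```

At the fixed point, $q$ satisfies the smoothed Bellman equation $q = r + \gamma P f_\Omega(q)$ and the returned policy is the soft-optimal softmax policy.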

3. Equivalence to Newton–Raphson Method

A central advance is the observation that regularized (soft) policy iteration with strongly convex regularizer is strictly equivalent to a one-step Newton iteration applied to the smoothed Bellman equation FΩ(q)=0F_\Omega(q)=0. Specifically, Newton’s update reads

$$q_{k+1}^{\mathrm{NR}} = q_k - [F'_\Omega(q_k)]^{-1} F_\Omega(q_k),$$

and can be algebraically verified (Theorem III.1 of Li et al., 2023) to coincide with the solution of the regularized policy evaluation step when $\pi_{k+1} = \nabla \max_\Omega(q_k)$. The Jacobian $F'_\Omega(q) = \gamma P \nabla f_\Omega(q) - I$ drives the direction and scaling of the update, and the regularization ensures invertibility and strong monotonicity.

This equivalence enables a unified framework for the analysis of both global and local convergence properties of entropy-regularized dynamic programming algorithms.
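The equivalence is easy to check numerically on a small random MDP: one Newton step on $F_\Omega$ coincides with one improvement-plus-exact-evaluation step. A sketch under the same Shannon-entropy setting (setup is illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma, N = 3, 2, 0.9, 2.0
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))
q = rng.normal(size=(S, A))                      # arbitrary starting point

# softmax policy pi = grad max_Omega(q) and smoothed max f = f_Omega(q)
z = N * (q - q.max(axis=1, keepdims=True))
pi = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
f = q.max(axis=1) + np.log(np.exp(z).sum(axis=1)) / N

# P_pi[(s,a),(s',a')] = P(s'|s,a) pi(a'|s'), i.e. P * grad f_Omega(q)
P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)

# one Newton step: q - [F'(q)]^{-1} F(q), with F(q) = r + gamma P f - q
F = (r + gamma * P @ f - q).reshape(-1)
J = gamma * P_pi - np.eye(S * A)                 # Jacobian F'_Omega(q)
q_newton = q.reshape(-1) - np.linalg.solve(J, F)

# one soft PI step: exact evaluation of pi via the linear system
H = -(pi * np.log(pi)).sum(axis=1) / N           # entropy bonus e_Omega
b = (r + gamma * P @ H).reshape(-1)
q_spi = np.linalg.solve(np.eye(S * A) - gamma * P_pi, b)
```

The two iterates agree to machine precision, which is exactly the algebraic content of the equivalence.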

4. Global Linear and Local Quadratic Convergence Rate

The convergence of soft policy iteration, under Shannon entropy regularization, is formally characterized as follows:

  • Global Linear Rate: For any initial $q_0$, the sequence $\{q_k\}$ is monotone ($q_1 \leq q_2 \leq \cdots \leq q_*$), bounded above by $q_*$, and converges $\gamma$-linearly:

$$\|q_* - q_{k+1}\|_{\infty} \leq \gamma\, \|q_* - q_k\|_{\infty}.$$

An explicit bound is

$$\|q_* - q_k\|_{\infty} \leq \frac{2\gamma^k}{1-\gamma} \min\left\{\|q_* - q_0\|_\infty,\; \|F_\Omega(q_0)\|_\infty\right\}.$$

  • Local Quadratic Rate: When $q_k$ enters a sufficiently small neighborhood of radius $r$ around $q_*$,

$$\|q_{k+1} - q_*\|_\infty \leq C\, \|q_k - q_*\|_\infty^2,$$

with constants $C, r$ depending on $\gamma$, $N$, and the strong convexity of $\Omega$. This quadratic convergence regime is enabled by the strong convexity and Lipschitz properties of the regularizer (Li et al., 2023).

  • In the more general regularized Modified Policy Iteration (soft MPI) framework, choosing the evaluation depth $m$ and the regularization temperature $\tau$ yields explicit rates and optimality bounds. For a decaying schedule $\tau_t \to 0$, convergence to the unregularized $V^*$ is established, accompanied by error bounds depending on the regularization schedule (Smirnova et al., 2019).
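Both guarantees can be probed empirically. A sketch that runs exact soft policy iteration on a small random MDP and records sup-norm errors against a late iterate used as a numerical proxy for $q_*$ (setup and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, gamma, N = 4, 3, 0.8, 3.0
P = rng.random((S, A, S)); P /= P.sum(axis=2, keepdims=True)
r = rng.random((S, A))

def spi_step(q):
    """One exact soft PI step: softmax improvement + full evaluation."""
    z = N * (q - q.max(axis=1, keepdims=True))
    pi = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)
    H = -(pi * np.log(pi)).sum(axis=1) / N
    P_pi = (P[:, :, :, None] * pi[None, None, :, :]).reshape(S * A, S * A)
    b = (r + gamma * P @ H).reshape(-1)
    return np.linalg.solve(np.eye(S * A) - gamma * P_pi, b).reshape(S, A)

qs = [np.zeros((S, A))]
for _ in range(60):
    qs.append(spi_step(qs[-1]))
q_star = qs[-1]                                   # numerical proxy for q_*
errs = [np.abs(q - q_star).max() for q in qs[:40]]
```

On such runs the error sequence contracts at least by the factor $\gamma$ per step and the iterates increase monotonically toward $q_*$, matching the stated guarantees.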

5. Inexact Newton Method: Finite-Step Policy Evaluation and $\gamma^M$ Rate

When policy evaluation is performed only approximately, i.e., via a finite number $M$ of Jacobi or value-iteration steps, the resulting update implements an inexact Newton method. This modification yields the system

$$q_{k+1} = \big(T^{\pi_{k+1}}_\Omega\big)^{M}(q_k) = q_k + \sum_{i=0}^{M-1}\big(\gamma P\,\nabla f_\Omega(q_k)\big)^i F_\Omega(q_k),$$

accompanied by a residual term capturing the error due to incomplete evaluation:

$$F'_\Omega(q_k)\, s_k = -F_\Omega(q_k) + r_k, \qquad q_{k+1} = q_k + s_k,$$

where the residual $r_k$ decays as $M$ increases.

The asymptotic local linear convergence rate is then $\gamma^M$ in an appropriate norm, yielding the error bound

$$\|q_{k}-q_*\|_\infty = O\big((\gamma^M)^k\big).$$

This formalizes the intuition that more evaluation steps per improvement accelerate the local rate, interpolating between soft value iteration ($M = 1$, rate $\gamma$) and exact soft policy iteration, i.e., a full Newton step ($M \to \infty$) (Li et al., 2023).
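The interpolation can be illustrated by replacing the exact solve with $M$ applications of $T^{\pi_{k+1}}_\Omega$. A sketch of this soft modified policy iteration (illustrative; the policy is held fixed during the $M$ inner sweeps):

```python
import numpy as np

def soft_mpi(P, r, gamma, N, M, iters):
    """Soft policy iteration with M evaluation sweeps per improvement.

    M = 1 recovers soft value iteration; letting M grow approaches
    exact soft policy iteration, i.e. a full Newton step.
    """
    S, A = r.shape
    q = np.zeros((S, A))
    for _ in range(iters):
        z = N * (q - q.max(axis=1, keepdims=True))
        pi = np.exp(z) / np.exp(z).sum(axis=1, keepdims=True)  # improvement
        for _ in range(M):                                     # inexact evaluation
            v = (pi * q).sum(axis=1) - (pi * np.log(pi)).sum(axis=1) / N
            q = r + gamma * P @ v                              # one T^pi_Omega sweep
    return q
```

Different choices of $M$ converge to the same soft-optimal fixed point; larger $M$ needs fewer outer improvement steps.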

6. Extensions, Function Approximation, and Practical Consequences

In the function approximation regime, as addressed by stationary-reweighted Soft FQI (Laan et al., 30 Dec 2025), the contraction property of the soft Bellman operator holds only in the stationary norm of the soft-optimal policy $\mu_*$, rather than in the behavior norm typical of standard FQI. To restore contraction, a stationary reweighting is employed, aligning the regression with the contraction geometry. Formal results guarantee local linear convergence up to an error floor determined by misspecification, density ratio estimation, and finite-sample effects.

Continuation approaches, in which the temperature parameter $\tau$ is annealed from high to low, permit a global iterative solution by traversing a sequence of decreasingly regularized problems. Under appropriate margin conditions, contraction persists even as $\tau \to 0$ (Laan et al., 30 Dec 2025).

Empirically, soft policy iteration and entropy-regularized algorithms demonstrate enhanced exploration, robustness to environmental stochasticity, and improved safety (e.g., avoidance of risky trajectories), at the cost of a controllable sub-optimality gap. Softened improvement steps are also central to frameworks for deep RL, including Soft Modified Policy Iteration (MoSoPI) and MoPPO, which combine partial off-policy evaluation with clipping or trust-region constraints, producing substantial gains in sample efficiency versus on-policy methods (Merdivan et al., 2019, Smirnova et al., 2019).

7. Connections to Broader Algorithmic Families

Soft policy iteration subsumes or unifies a broad range of entropy-regularized RL algorithms, including Soft Q-Learning and Soft Actor-Critic, and provides the theoretical foundation for their convergence rates and stability. The equivalence to Newton–Raphson iterations facilitates use of powerful tools from nonlinear optimization and fixed-point theory. Notably, soft policy iteration guarantees global linear and local quadratic convergence in the presence of strongly convex regularizers, and its generalizations under function approximation and partial evaluation extend its practical impact across the landscape of contemporary reinforcement learning algorithms (Li et al., 2023, Laan et al., 30 Dec 2025, Smirnova et al., 2019, Merdivan et al., 2019).
