
Policy Iteration Algorithm

Updated 1 January 2026
  • Policy Iteration is a dynamic programming method that alternates between evaluating a current policy and improving it based on Bellman equations.
  • It underpins various reinforcement learning systems, offering strong monotonicity and finite-time convergence guarantees in Markov decision processes.
  • Enhanced variants, including online and multiagent versions, provide practical efficiency trade-offs for complex control and game-theoretic applications.

Policy iteration (PI) is a foundational algorithm in dynamic programming and reinforcement learning, designed to produce optimal or near-optimal stationary feedback policies for sequential decision problems. It is structured as an iterative process, alternating between evaluating the performance of the current policy (policy evaluation) and improving it via local optimality conditions (policy improvement). PI achieves strong monotonicity and finite-time convergence guarantees in standard Markov decision processes (MDPs), forms a basis for algorithms in multiagent, game-theoretic, mean-field, and nonlinear control regimes, and admits efficient extensions to function approximation and distributed systems. The algorithm's iteration complexity is tightly connected to the combinatorial structure of policy space, and recent advances have sharpened both the upper and lower bounds on PI's convergence rate. Variants of PI (online, optimistic, multiagent, mean-payoff, stochastic-game, and modified forms) offer trade-offs between sample efficiency, computational tractability, and theoretical guarantees.

1. Formal Definition and Basic Algorithm

Given a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a transition kernel $P(s'|s,a)$, and a reward function $r(s,a)$, PI seeks a policy $\pi^*$ that maximizes the expected infinite-horizon discounted return

V^*(s) = \max_\pi \mathbb{E}^\pi\left[\sum_{t=0}^\infty \gamma^t r(s_t, \pi(s_t)) \mid s_0 = s\right]

Algorithmically, PI alternates two steps (Mansour et al., 2013):

  • Policy Evaluation: Solve the Bellman equations for the current policy $\pi$:

V^\pi(s) = r(s,\pi(s)) + \gamma \sum_{s'} P(s'|s, \pi(s)) V^\pi(s')

  • Policy Improvement: For each state $s$, update:

πnew(s)argmaxaA[r(s,a)+γsP(ss,a)Vπ(s)]\pi_{\text{new}}(s) \in \arg\max_{a \in \mathcal{A}} \left[ r(s, a) + \gamma \sum_{s'} P(s'|s, a) V^\pi(s') \right]

The policy improvement theorem guarantees that the value sequence is non-decreasing, and since no policy can recur, the iteration terminates in at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations (the number of distinct deterministic policies).
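The two steps can be combined into a short tabular implementation. The following is a minimal sketch assuming dense arrays P[s, a, s'] for transitions and r[s, a] for rewards; the array layout and function name are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def policy_iteration(P, r, gamma=0.95):
    """Greedy policy iteration on a tabular MDP (illustrative sketch)."""
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)              # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P_pi = P[np.arange(n_states), pi]           # shape (n_states, n_states)
        r_pi = r[np.arange(n_states), pi]           # shape (n_states,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: greedy one-step lookahead in every state.
        Q = r + gamma * np.einsum('sat,t->sa', P, V)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):              # policy stable: optimal
            return pi, V
        pi = new_pi
```

The exact linear solve in the evaluation step is what distinguishes this from value iteration; iterative evaluation variants are discussed under Modified Policy Iteration below.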

2. Complexity, Upper and Lower Bounds

The iteration complexity of PI is determined by the structure of improvement sets and the combinatorics of policy space. Mansour and Singh (Mansour et al., 2013) prove that for the greedy variant (simultaneous improvement in all states), the number of iterations is:

  • $O(2^n/n)$ for $k=2$ actions,
  • $O(k^n/n)$ for general $k$-action MDPs.

Randomized PI (where improvements are applied probabilistically) enjoys improved, though still exponential, bounds:

  • $O(2^{0.78n})$ for two actions,
  • $O([(1+\epsilon_k)k/2]^n)$ for large $k$, with $\epsilon_k \to 0$ as $k \to \infty$.

By contrast, worst-case lower bounds scale as $\Omega(k^{n/2})$ for single-switch deterministic variants and $\Omega(k^{n/2}/\log k)$ in randomized settings, demonstrating exponential iteration complexity for large action spaces (Ashutosh et al., 2020). These results highlight a substantial gap between typical practical performance and worst-case theoretical limits.
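As a rough illustration (not drawn from the cited papers) of how these asymptotic expressions compare, evaluating them at $n = 100$ states and $k = 2$ actions, and ignoring the constants hidden in the $O(\cdot)$ and $\Omega(\cdot)$ notation, gives:

```latex
\begin{align*}
\text{trivial bound } k^n          &= 2^{100} \approx 1.3 \times 10^{30},\\
\text{greedy PI } 2^n/n            &= 2^{100}/100 \approx 1.3 \times 10^{28},\\
\text{randomized PI } 2^{0.78n}    &= 2^{78} \approx 3.0 \times 10^{23},\\
\text{lower bound } k^{n/2}        &= 2^{50} \approx 1.1 \times 10^{15}.
\end{align*}
```

Even at this scale, the gap between the randomized upper bounds and the worst-case lower bounds spans many orders of magnitude.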

3. Extensions: Online, Multiagent, and Nonlinear Systems

PI is extended in multiple directions to accommodate practical constraints:

  • Online PI performs policy improvement only at the currently visited state, yielding locally optimal policies on the recurrent set of states. Monotonic improvement and finite-time convergence are preserved, and with sufficient exploration, global optimality is achieved (Bertsekas, 2021); a minimal sketch follows this list.
  • Multiagent PI (agent-by-agent improvement) cycles through agents, updating actions sequentially. This reduces policy improvement complexity from exponential to linear in the number of agents but produces only agent-by-agent optimal policies unless additional conditions hold. Such schemes support distributed computation and are especially suited for large-scale MARL problems (Bertsekas, 2020).
  • Nonlinear discrete-time systems: Recursive feasibility is critical for the applicability of PI in nonlinear settings. The PI+ algorithm enforces outer semicontinuous regularization of the improvement map, guaranteeing recursive feasibility and establishing near-optimality and robust $\mathcal{KL}$-stability properties (Granzotto et al., 2022).
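The online variant referenced in the first bullet can be sketched as follows. This is only an illustration of state-local improvement combined with a one-step temporal-difference evaluation, assuming a simulator with env.reset()/env.step(a) and tabular Q-values; it is not the exact algorithm of (Bertsekas, 2021).

```python
import numpy as np

def online_policy_iteration(env, Q, pi, gamma=0.95, alpha=0.1, n_steps=10_000):
    """Improve the policy only at the currently visited state (illustrative sketch)."""
    s = env.reset()
    for _ in range(n_steps):
        a = pi[s]
        s_next, reward, done = env.step(a)
        # Evaluation: one-step temporal-difference update for the current policy.
        target = reward if done else reward + gamma * Q[s_next, pi[s_next]]
        Q[s, a] += alpha * (target - Q[s, a])
        # Improvement: re-optimize the policy only at the visited state s.
        pi[s] = int(np.argmax(Q[s]))
        s = env.reset() if done else s_next
    return pi, Q
```

States that are never visited keep their initial actions, which is why the resulting policy is guaranteed optimal only on the recurrent set unless extra exploration is injected.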

4. Policy Iteration for Games and Mean-Payoff Problems

Zero-sum Markov and stochastic games generalize standard PI to two-player and multichain mean-payoff settings. In such cases, the Bellman operator is replaced by the Shapley operator, involving max–min (or min–max) optimization. Proposed efficient PI variants for zero-sum Markov games employ lookahead policies and rollout-based evaluation, ensuring exponential convergence under certain contraction conditions, with substantial computational advantages over classical schemes (Winnicki et al., 2023).

For multichain mean-payoff stochastic games with perfect information, PI is combined with nonlinear spectral projection to handle degenerate iterations and prevent cycling. The algorithm ensures lexicographical monotonicity and finite termination through a reduction of the relative-value vector on critical nodes, even in large-scale or sparse instances (Akian et al., 2012).
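In the perfect-information (turn-based) case, the Shapley operator reduces to a per-state max or min over one-step lookahead values. Below is a minimal discounted-case sketch; the array layout, the ownership mask, and the use of a discount factor (rather than the mean-payoff relative values of Akian et al., 2012) are simplifying assumptions for illustration.

```python
import numpy as np

def shapley_operator(V, P, r, is_max_state, gamma=0.95):
    """One application of the Shapley operator for a turn-based zero-sum game.

    is_max_state[s] is True where the maximizer chooses the action, False where
    the minimizer does; P[s, a, s'] and r[s, a] are tabular game data (assumed).
    """
    Q = r + gamma * np.einsum('sat,t->sa', P, V)     # one-step lookahead values
    return np.where(is_max_state, Q.max(axis=1), Q.min(axis=1))
```

PI schemes for such games typically fix one player's policy during evaluation, solve the resulting single-controller problem, and then improve that policy greedily against the computed values.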

5. Policy Iteration in Optimal and Stochastic Control

In continuous-time, nonlinear, and high-dimensional control, each PI step approximately solves a Hamilton-Jacobi-Bellman (HJB)-type equation for the current policy and then updates the policy by minimizing the associated Hamiltonian. Strong theoretical results establish convergence rates provided that the so-called Generalized HJB (GHJB) equations admit sufficiently regular, forward-invariant solutions throughout the PI sequence (Ehring et al., 2025). For non-Markovian stochastic control, probabilistic BSDE-based PI variants approximate the value function and optimal controls via penalized expectations, recovering exponential convergence rates even in volatility-control and multi-dimensional state cases, though monotonicity may not be strictly maintained (Possamaï et al., 2024).
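As a concrete special case (chosen here for illustration, not the setting of the cited works), when the dynamics are linear and the cost quadratic, the GHJB equation for a fixed linear policy reduces to a Lyapunov equation, and continuous-time PI specializes to the classical Kleinman iteration:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lqr_policy_iteration(A, B, Q, R, K0, n_iters=20):
    """Kleinman's iteration: continuous-time PI for the linear-quadratic case.

    K0 must stabilize A - B @ K0; A, B, Q, R are standard LQR data (illustrative).
    """
    K = K0
    for _ in range(n_iters):
        A_cl = A - B @ K
        # Policy evaluation: solve A_cl^T P + P A_cl + Q + K^T R K = 0 for P.
        P = solve_continuous_lyapunov(A_cl.T, -(Q + K.T @ R @ K))
        # Policy improvement: minimize the Hamiltonian, giving K = R^{-1} B^T P.
        K = np.linalg.solve(R, B.T @ P)
    return K, P
```

Given a stabilizing initial gain, the iterates converge to the stabilizing solution of the algebraic Riccati equation, mirroring the superlinear local convergence observed for general PI schemes.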

6. Modified and Approximate Policy Iteration

Modified Policy Iteration (MPI) generalizes PI and value iteration by introducing a parameter $m$ controlling the number of evaluation steps per iteration (Scherrer et al., 2012). This yields a spectrum of algorithms interpolating between pure VI ($m=1$) and PI ($m=\infty$). Approximate and classification-based implementations retain unified error propagation guarantees, with an explicit bias–variance tradeoff determined by $m$ and the sample allocation.
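A minimal tabular sketch of the $m$-step interpolation follows; the dense arrays P[s, a, s'] and r[s, a] and the fixed iteration budget are illustrative assumptions rather than details from (Scherrer et al., 2012).

```python
import numpy as np

def modified_policy_iteration(P, r, gamma=0.95, m=5, n_iters=200):
    """Modified PI: m = 1 recovers value iteration, m -> infinity recovers PI."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        # Greedy (improvement) step with respect to the current value estimate.
        Q = r + gamma * np.einsum('sat,t->sa', P, V)
        pi = Q.argmax(axis=1)
        # Partial evaluation: apply the Bellman operator of pi exactly m times.
        P_pi = P[np.arange(n_states), pi]
        r_pi = r[np.arange(n_states), pi]
        for _ in range(m):
            V = r_pi + gamma * P_pi @ V
    return pi, V
```

Larger $m$ buys a more accurate evaluation of each greedy policy at the cost of more sweeps per iteration, which is the bias–variance tradeoff referenced above.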

7. Practical Implementations and Applications

PI underlies a substantial portion of state-of-the-art dynamic programming and RL systems, from grid-based dynamic programming (Alla et al., 2013) to high-dimensional mean-field games (Cacace et al., 2020). Numerical experiments establish PI’s superlinear local convergence, robustness to problem dimension, and practical efficiency relative to Newton-type and value iteration algorithms. In multiagent, deep RL, and MFG domains, PI-based algorithms offer scalable solvers with global convergence and efficiency.

Table: Key PI Variants and Their Guarantees

Variant | Convergence Rate | Policy Optimality
Classical PI | Monotonic, finite-time | Global
Online PI | Finite-time (local) | Local (recurrent set)
Multiagent PI | Finite-time (agent-by-agent) | Agent-by-agent optimal
Mean-payoff PI | Finite, lexicographic order | Mean-payoff equilibrium
Modified PI | Interpolated, unified error bounds | Global (in limit)
BSDE-based PI | Exponential (probabilistic) | Global

The spectrum of policy iteration algorithms offers tradeoffs between computational complexity, sample efficiency, and policy optimality, dictated by the underlying problem structure and chosen improvement/evaluation schemes. Rigorous complexity theory and convergence analysis have cemented PI as a central method in both classical and modern dynamic optimization.
