
Policy Iteration Algorithm

Updated 1 January 2026
  • Policy Iteration is a dynamic programming method that alternates between evaluating a current policy and improving it based on Bellman equations.
  • It underpins various reinforcement learning systems, offering strong monotonicity and finite-time convergence guarantees in Markov decision processes.
  • Enhanced variants, including online and multiagent versions, provide practical efficiency trade-offs for complex control and game-theoretic applications.

Policy iteration (PI) is a foundational algorithm in dynamic programming and reinforcement learning, designed to produce optimal or near-optimal stationary feedback policies for sequential decision problems. It is structured as an iterative process, alternating between evaluating the performance of the current policy (policy evaluation) and improving it via local optimality conditions (policy improvement). PI achieves strong monotonicity and finite-time convergence guarantees in standard Markov decision processes (MDPs), forms a basis for algorithms in multiagent, game-theoretic, mean-field, and nonlinear control regimes, and admits efficient extensions to function approximation and distributed systems. The algorithm's iteration complexity is tightly connected to the combinatorial structure of policy space, and recent advances have sharpened both the upper and lower bounds on PI's convergence rate. Variants of PI (online, optimistic, multiagent, mean-payoff, stochastic-game, and modified forms) offer trade-offs between sample efficiency, computational tractability, and theoretical guarantees.

1. Formal Definition and Basic Algorithm

Given a finite state space $\mathcal{S}$, a finite action space $\mathcal{A}$, a transition kernel $P(s'|s,a)$, and a reward function $r(s,a)$, PI seeks a policy $\pi^*$ that maximizes the expected infinite-horizon discounted return

V^*(s) = \max_\pi \mathbb{E}^\pi\left[\sum_{t=0}^\infty \gamma^t r(s_t, \pi(s_t)) \mid s_0 = s\right]

Algorithmically, PI alternates two steps (Mansour et al., 2013):

  • Policy Evaluation: Solve the Bellman equations for the current policy $\pi$:

V^\pi(s) = r(s,\pi(s)) + \gamma \sum_{s'} P(s'|s, \pi(s)) V^\pi(s')

  • Policy Improvement: For each state $s$, update:

πnew(s)argmaxaA[r(s,a)+γsP(ss,a)Vπ(s)]\pi_{\text{new}}(s) \in \arg\max_{a \in \mathcal{A}} \left[ r(s, a) + \gamma \sum_{s'} P(s'|s, a) V^\pi(s') \right]

The policy improvement theorem guarantees that the value sequence is non-decreasing, and since no policy can recur, the iteration terminates in at most $|\mathcal{A}|^{|\mathcal{S}|}$ iterations (the number of distinct deterministic policies).
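The two steps can be combined into a short tabular implementation. The following is a minimal sketch assuming dense arrays P[s, a, s'] for transitions and r[s, a] for rewards; the array layout and function name are illustrative choices, not taken from the cited papers.

```python
import numpy as np

def policy_iteration(P, r, gamma=0.95):
    """Greedy policy iteration on a tabular MDP (illustrative sketch)."""
    n_states, n_actions, _ = P.shape
    pi = np.zeros(n_states, dtype=int)              # arbitrary initial policy
    while True:
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P_pi = P[np.arange(n_states), pi]           # shape (n_states, n_states)
        r_pi = r[np.arange(n_states), pi]           # shape (n_states,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: greedy one-step lookahead in every state.
        Q = r + gamma * np.einsum('sat,t->sa', P, V)
        new_pi = Q.argmax(axis=1)
        if np.array_equal(new_pi, pi):              # policy stable: optimal
            return pi, V
        pi = new_pi
```

The exact linear solve in the evaluation step is what distinguishes this from value iteration; iterative evaluation variants are discussed under Modified Policy Iteration below.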

2. Complexity, Upper and Lower Bounds

The iteration complexity of PI is determined by the structure of improvement sets and the combinatorics of policy space. Mansour and Singh (Mansour et al., 2013) prove that for the greedy variant (simultaneous improvement in all states), the number of iterations is:

  • $O(2^n/n)$ for $k=2$ actions,
  • $O(k^n/n)$ for general $k$-action MDPs.

Randomized PI (where improvements are applied probabilistically) enjoys improved, though still exponential, bounds:

  • $O(2^{0.78n})$ for two actions,
  • $O([(1+\epsilon_k)k/2]^n)$ for large $k$, with $\epsilon_k \to 0$ as $k \to \infty$.

By contrast, worst-case lower bounds scale as $\Omega(k^{n/2})$ for single-switch deterministic variants and $\Omega(k^{n/2}/\log k)$ in randomized settings, demonstrating exponential iteration complexity for large action spaces (Ashutosh et al., 2020). These results highlight a substantial gap between typical practical performance and worst-case theoretical limits.
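As a rough illustration (not drawn from the cited papers) of how these asymptotic expressions compare, evaluating them at $n = 100$ states and $k = 2$ actions, and ignoring the constants hidden in the $O(\cdot)$ and $\Omega(\cdot)$ notation, gives:

```latex
\begin{align*}
\text{trivial bound } k^n          &= 2^{100} \approx 1.3 \times 10^{30},\\
\text{greedy PI } 2^n/n            &= 2^{100}/100 \approx 1.3 \times 10^{28},\\
\text{randomized PI } 2^{0.78n}    &= 2^{78} \approx 3.0 \times 10^{23},\\
\text{lower bound } k^{n/2}        &= 2^{50} \approx 1.1 \times 10^{15}.
\end{align*}
```

Even at this scale, the gap between the randomized upper bounds and the worst-case lower bounds spans many orders of magnitude.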

3. Extensions: Online, Multiagent, and Nonlinear Systems

PI is extended in multiple directions to accommodate practical constraints:

  • Online PI performs policy improvement only at the currently visited state, yielding locally optimal policies on the recurrent set of states. Monotonic improvement and finite-time convergence are preserved, and with sufficient exploration, global optimality is achieved (Bertsekas, 2021); a minimal sketch follows this list.
  • Multiagent PI (agent-by-agent improvement) cycles through agents, updating actions sequentially. This reduces policy improvement complexity from exponential to linear in the number of agents but produces only agent-by-agent optimal policies unless additional conditions hold. Such schemes support distributed computation and are especially suited for large-scale MARL problems (Bertsekas, 2020).
  • Nonlinear discrete-time systems: Recursive feasibility is critical for the applicability of PI in nonlinear settings. The PI+ algorithm enforces outer semicontinuous regularization of the improvement map, guaranteeing recursive feasibility and establishing near-optimality and robust $\mathcal{KL}$-stability properties (Granzotto et al., 2022).
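The online variant referenced in the first bullet can be sketched as follows. This is only an illustration of state-local improvement combined with a one-step temporal-difference evaluation, assuming a simulator with env.reset()/env.step(a) and tabular Q-values; it is not the exact algorithm of (Bertsekas, 2021).

```python
import numpy as np

def online_policy_iteration(env, Q, pi, gamma=0.95, alpha=0.1, n_steps=10_000):
    """Improve the policy only at the currently visited state (illustrative sketch)."""
    s = env.reset()
    for _ in range(n_steps):
        a = pi[s]
        s_next, reward, done = env.step(a)
        # Evaluation: one-step temporal-difference update for the current policy.
        target = reward if done else reward + gamma * Q[s_next, pi[s_next]]
        Q[s, a] += alpha * (target - Q[s, a])
        # Improvement: re-optimize the policy only at the visited state s.
        pi[s] = int(np.argmax(Q[s]))
        s = env.reset() if done else s_next
    return pi, Q
```

States that are never visited keep their initial actions, which is why the resulting policy is guaranteed optimal only on the recurrent set unless extra exploration is injected.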

4. Policy Iteration for Games and Mean-Payoff Problems

Zero-sum Markov and stochastic games generalize standard PI to two-player and multichain mean-payoff settings. In such cases, the Bellman operator is replaced by the Shapley operator, involving max–min (or min–max) optimization. Proposed efficient PI variants for zero-sum Markov games employ lookahead policies and rollout-based evaluation, ensuring exponential convergence under certain contraction conditions, with substantial computational advantages over classical schemes (Winnicki et al., 2023).

For multichain mean-payoff stochastic games with perfect information, PI is combined with nonlinear spectral projection to handle degenerate iterations and prevent cycling. The algorithm ensures lexicographical monotonicity and finite termination through a reduction of the relative-value vector on critical nodes, even in large-scale or sparse instances (Akian et al., 2012).
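In the perfect-information (turn-based) case, the Shapley operator reduces to a per-state max or min over one-step lookahead values. Below is a minimal discounted-case sketch; the array layout, the ownership mask, and the use of a discount factor (rather than the mean-payoff relative values of Akian et al., 2012) are simplifying assumptions for illustration.

```python
import numpy as np

def shapley_operator(V, P, r, is_max_state, gamma=0.95):
    """One application of the Shapley operator for a turn-based zero-sum game.

    is_max_state[s] is True where the maximizer chooses the action, False where
    the minimizer does; P[s, a, s'] and r[s, a] are tabular game data (assumed).
    """
    Q = r + gamma * np.einsum('sat,t->sa', P, V)     # one-step lookahead values
    return np.where(is_max_state, Q.max(axis=1), Q.min(axis=1))
```

PI schemes for such games typically fix one player's policy during evaluation, solve the resulting single-controller problem, and then improve that policy greedily against the computed values.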

5. Policy Iteration in Optimal and Stochastic Control

In continuous-time, nonlinear, and high-dimensional control, each PI step approximately solves a Hamilton-Jacobi-Bellman (HJB)-type equation for the current policy and then updates the policy by minimizing the associated Hamiltonian. Strong theoretical results establish convergence rates provided that the so-called Generalized HJB (GHJB) equations admit sufficiently regular, forward-invariant solutions throughout the PI sequence (Ehring et al., 2025). For non-Markovian stochastic control, probabilistic BSDE-based PI variants approximate the value function and optimal controls via penalized expectations, recovering exponential convergence rates even in volatility-control and multi-dimensional state cases, though monotonicity may not be strictly maintained (Possamaï et al., 2024).
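As a concrete special case (chosen here for illustration, not the setting of the cited works), when the dynamics are linear and the cost quadratic, the GHJB equation for a fixed linear policy reduces to a Lyapunov equation, and continuous-time PI specializes to the classical Kleinman iteration:

```python
import numpy as np
from scipy.linalg import solve_continuous_lyapunov

def lqr_policy_iteration(A, B, Q, R, K0, n_iters=20):
    """Kleinman's iteration: continuous-time PI for the linear-quadratic case.

    K0 must stabilize A - B @ K0; A, B, Q, R are standard LQR data (illustrative).
    """
    K = K0
    for _ in range(n_iters):
        A_cl = A - B @ K
        # Policy evaluation: solve A_cl^T P + P A_cl + Q + K^T R K = 0 for P.
        P = solve_continuous_lyapunov(A_cl.T, -(Q + K.T @ R @ K))
        # Policy improvement: minimize the Hamiltonian, giving K = R^{-1} B^T P.
        K = np.linalg.solve(R, B.T @ P)
    return K, P
```

Given a stabilizing initial gain, the iterates converge to the stabilizing solution of the algebraic Riccati equation, mirroring the superlinear local convergence observed for general PI schemes.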

6. Modified and Approximate Policy Iteration

Modified Policy Iteration (MPI) generalizes PI and value iteration by introducing a parameter $m$ controlling the number of evaluation steps per iteration (Scherrer et al., 2012). This yields a spectrum of algorithms interpolating between pure VI ($m=1$) and PI ($m=\infty$). Approximate and classification-based implementations retain unified error propagation guarantees, with an explicit bias–variance tradeoff determined by $m$ and the sample allocation.
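A minimal tabular sketch of the $m$-step interpolation follows; the dense arrays P[s, a, s'] and r[s, a] and the fixed iteration budget are illustrative assumptions rather than details from (Scherrer et al., 2012).

```python
import numpy as np

def modified_policy_iteration(P, r, gamma=0.95, m=5, n_iters=200):
    """Modified PI: m = 1 recovers value iteration, m -> infinity recovers PI."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        # Greedy (improvement) step with respect to the current value estimate.
        Q = r + gamma * np.einsum('sat,t->sa', P, V)
        pi = Q.argmax(axis=1)
        # Partial evaluation: apply the Bellman operator of pi exactly m times.
        P_pi = P[np.arange(n_states), pi]
        r_pi = r[np.arange(n_states), pi]
        for _ in range(m):
            V = r_pi + gamma * P_pi @ V
    return pi, V
```

Larger $m$ buys a more accurate evaluation of each greedy policy at the cost of more sweeps per iteration, which is the bias–variance tradeoff referenced above.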

7. Practical Implementations and Applications

PI underlies a substantial portion of state-of-the-art dynamic programming and RL systems, from grid-based dynamic programming (Alla et al., 2013) to high-dimensional mean-field games (Cacace et al., 2020). Numerical experiments establish PI’s superlinear local convergence, robustness to problem dimension, and practical efficiency relative to Newton-type and value iteration algorithms. In multiagent, deep RL, and MFG domains, PI-based algorithms offer scalable solvers with global convergence and efficiency.

Table: Key PI Variants and Their Guarantees

Variant | Convergence Rate | Policy Optimality
Classical PI | Monotonic, finite-time | Global
Online PI | Finite-time (local) | Local (recurrent set)
Multiagent PI | Finite-time (agent-by-agent) | Agent-by-agent optimal
Mean-payoff PI | Finite, lexicographic order | Mean-payoff equilibrium
Modified PI | Interpolated, unified error bounds | Global (in limit)
BSDE-based PI | Exponential (probabilistic) | Global

The spectrum of policy iteration algorithms offers tradeoffs between computational complexity, sample efficiency, and policy optimality, dictated by the underlying problem structure and chosen improvement/evaluation schemes. Rigorous complexity theory and convergence analysis have cemented PI as a central method in both classical and modern dynamic optimization.
