Feasible Policy Iteration Methods

Updated 20 May 2026

Feasible Policy Iteration is a dynamic programming framework that generalizes classical policy and value iterations to ensure recursive feasibility and safe control.
It employs strategies such as safe region identification, λ-policy iteration, and flexible evaluation to overcome limitations in constrained and continuous-state settings.
FPI methods offer strong guarantees, including monotonic policy improvement, geometric convergence, and stability in both deterministic and stochastic environments.

Feasible Policy Iteration (FPI) is a class of dynamic programming algorithms that generalize classical policy iteration (PI), value iteration (VI), and λ-policy iteration, with a principal focus on ensuring recursive feasibility, safe control, monotonic improvement, and practical sample efficiency. FPI methodologies have been developed across reinforcement learning (RL), approximate/adaptive dynamic programming, and constrained control, providing rigorous mechanisms for policy evaluation and improvement in both unconstrained and constrained Markov Decision Processes (MDPs), deterministic and stochastic systems, and model-free or model-based frameworks (Yang et al., 2023, Bertsekas, 2015, Granzotto et al., 2022, Gao et al., 2020).

1. Foundations: Classical Policy Iteration and Limitations

Policy Iteration (PI) is a fundamental algorithm for solving discounted or undiscounted infinite-horizon MDPs by alternating policy evaluation and policy improvement. Given a policy $\pi_k$, policy evaluation computes its value function (cost-to-go) $V^{\pi_k}$ or $Q^{\pi_k}$, and policy improvement produces a new policy $\pi_{k+1}$ that is greedy with respect to this value. In unconstrained, finite-state, finite-action MDPs with exact dynamic programming, PI is guaranteed to converge to an optimal policy in a finite number of iterations. Value iteration (VI) can be seen as the limiting case of PI where each "evaluation" consists of a single Bellman backup.
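As a minimal sketch of the evaluation/improvement loop just described, the following implements exact PI for a small tabular MDP in the reward-maximization convention. The two-state MDP, the array layout, and all names (`policy_iteration`, `P`, `r`) are illustrative assumptions, not taken from the cited works.

```python
import numpy as np

def policy_iteration(P, r, gamma, max_iters=100):
    """Exact policy iteration for a finite MDP (reward maximization).

    P: array (A, S, S) with P[a, s, s2] the transition probability.
    r: array (S, A) with the expected one-step reward.
    """
    n_actions, n_states, _ = P.shape
    idx = np.arange(n_states)
    policy = np.zeros(n_states, dtype=int)
    for _ in range(max_iters):
        # Policy evaluation: solve (I - gamma * P_pi) V = r_pi exactly.
        P_pi = P[policy, idx]          # (S, S) rows selected by the policy
        r_pi = r[idx, policy]          # (S,)
        V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
        # Policy improvement: act greedily with respect to Q^pi.
        Q = r + gamma * np.einsum("asn,n->sa", P, V)
        new_policy = Q.argmax(axis=1)
        if np.array_equal(new_policy, policy):
            break                      # greedy policy is unchanged: stop
        policy = new_policy
    return policy, V

# Hypothetical 2-state, 2-action deterministic MDP.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = 1.0   # action 0 in state 0: stay
P[1, 0, 1] = 1.0   # action 1 in state 0: move to state 1
P[0, 1, 1] = 1.0   # action 0 in state 1: stay
P[1, 1, 0] = 1.0   # action 1 in state 1: move to state 0
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])

policy, V = policy_iteration(P, r, gamma=0.9)
```

Replacing the exact linear solve with a single Bellman backup per iteration would turn this loop into value iteration.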

Despite PI’s power, it is not guaranteed to be recursively feasible in nonlinear, continuous-state, or constrained settings. Recursive feasibility refers to the property that, at each iteration and every state, the policy improvement step admits a solution; this can fail in practice due to discontinuities or unattainability of infima in the control space (Granzotto et al., 2022). Safety constraints, which require the agent to avoid certain "unsafe" regions of the state space across all time, are also not directly handled by classical PI.

2. Formulations of Feasible Policy Iteration

FPI encompasses a family of algorithms developed to address recursive feasibility, constraint satisfaction, and flexible evaluation–improvement trade-offs.

Safe Feasible Policy Iteration

In RL with constraints, a “safe MDP” is defined as $(\mathcal{S}, \mathcal{A}, P, r, \gamma, h)$, where $h: \mathcal{S} \to \mathbb{R}$ is a constraint function and $h(s) > 0$ indicates a safety violation at state $s$. The feasible region of a policy $\pi$ is $X^\pi = \{ s \in \mathcal{S} \mid s_0 = s,\ h(s_t) \leq 0 \text{ for all } t \geq 0 \text{ under } a_t \sim \pi(\cdot|s_t) \}$, and the maximal feasible region $X^*$ consists of the states from which some policy can keep the agent safe indefinitely.

FPI for safe RL alternates:

Policy Evaluation: Compute the value function $V^{\pi_k}$ as in standard PI.
Feasible Region Identification: Via the Constraint Decay Function (CDF), $F^{\pi_k}(s) = \mathbb{E}\left[\gamma^{N(s)}\right]$, where $N(s)$ is the number of steps before violation when starting from $s$ and following $\pi_k$; the CDF is computed by iterating a "risky Bellman" operator, and it vanishes exactly on the states from which no violation ever occurs.
Region-wise Policy Improvement: For $s$ in the feasible region ($s \in X^{\pi_k}$), maximize the action value $Q^{\pi_k}(s, a)$ over admissible actions. Outside the region, choose actions that minimize $F^{\pi_k}$, actively steering the policy away from constraint violation (Yang et al., 2023).
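The three alternating steps can be sketched on a toy deterministic chain. Everything concrete here is an assumption made for illustration: the five-state chain, the reward, the tolerance `TOL`, the risky Bellman equation written as $F = c + (1 - c)\,\gamma P_\pi F$ with $c(s) = 1$ when $h(s) > 0$, and the choice to call an action admissible when its successor has zero CDF. The exact operators in Yang et al. (2023) may differ.

```python
import numpy as np

GAMMA, TOL = 0.9, 1e-8
N_STATES, LEFT, RIGHT = 5, 0, 1
IDX = np.arange(N_STATES)
# Hypothetical chain: next_state[s, a]; state 4 is the only unsafe state.
next_state = np.array([[max(s - 1, 0), min(s + 1, N_STATES - 1)]
                       for s in range(N_STATES)])
reward = np.tile([0.0, 1.0], (N_STATES, 1))      # moving right pays 1
violation = (IDX == 4).astype(float)             # c(s) = 1 if h(s) > 0

def evaluate(policy):
    """Return (V, F): value function and constraint decay function."""
    P = np.zeros((N_STATES, N_STATES))
    P[IDX, next_state[IDX, policy]] = 1.0
    I = np.eye(N_STATES)
    V = np.linalg.solve(I - GAMMA * P, reward[IDX, policy])
    # Assumed risky Bellman equation: F = c + (1 - c) * gamma * P F.
    F = np.linalg.solve(I - GAMMA * (1.0 - violation)[:, None] * P, violation)
    return V, F

def improve(V, F):
    """Region-wise improvement based on the current V and F."""
    Q = reward + GAMMA * V[next_state]           # (S, A)
    F_next = F[next_state]                       # CDF of each successor
    feasible = F <= TOL
    # Inside the region: maximize Q over actions that stay in the region.
    inside = np.where(F_next <= TOL, Q, -np.inf).argmax(axis=1)
    # Outside the region: minimize the CDF of the successor state.
    outside = F_next.argmin(axis=1)
    return np.where(feasible, inside, outside)

policy = np.full(N_STATES, RIGHT)                # initial policy is unsafe
regions = []                                     # feasible region per iterate
for _ in range(10):
    V, F = evaluate(policy)
    regions.append(set(np.flatnonzero(F <= TOL).tolist()))
    new_policy = improve(V, F)
    if np.array_equal(new_policy, policy):
        break
    policy = new_policy
```

In this toy run the initial policy has an empty feasible region; the first improvement step only reduces the CDF, after which the region becomes states 0 to 3 and the remaining steps improve the return inside it.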

λ-Policy Iteration (λ-PI)

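λ-PI interpolates between VI and PI using a weighting parameter $\lambda \in [0, 1)$: $\lambda = 0$ recovers VI and $\lambda \to 1$ recovers PI. Defining the $\lambda$-Bellman operator,

$T_\pi^{(\lambda)} V = (1 - \lambda) \sum_{\ell=0}^{\infty} \lambda^{\ell}\, T_\pi^{\ell+1} V$

where $T_\pi$ is the one-step Bellman operator of policy $\pi$, the iteration consists of

Policy Improvement: Update $\pi_{k+1}$ greedily wrt $V_k$,
Partial Policy Evaluation: Compute $V_{k+1} \approx T_{\pi_{k+1}}^{(\lambda)} V_k$ via finitely many Bellman sweeps (Bertsekas, 2015).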

FPI arises when policy evaluation is performed approximately or with a fixed computation budget (number of sweeps or samples).
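To make the partial-evaluation step concrete, the sketch below evaluates a geometrically weighted multi-step Bellman operator, $(1-\lambda)\sum_{\ell \ge 0} \lambda^{\ell} T_\pi^{\ell+1} V$, for a fixed policy in a tabular MDP. It assumes $T_\pi V = r_\pi + \gamma P_\pi V$; the function names and the two-state example are hypothetical. Under that assumption the series sums to the closed form $(I - \lambda\gamma P_\pi)^{-1}\big(r_\pi + (1-\lambda)\gamma P_\pi V\big)$, which the code uses as a cross-check.

```python
import numpy as np

def bellman(P_pi, r_pi, gamma, V):
    """One-step Bellman operator of a fixed policy."""
    return r_pi + gamma * P_pi @ V

def lambda_bellman_series(P_pi, r_pi, gamma, lam, V, n_terms=200):
    """Truncated series (1 - lam) * sum_l lam**l * T_pi^(l+1) V."""
    TV = np.asarray(V, dtype=float)
    out = np.zeros_like(TV)
    for l in range(n_terms):
        TV = bellman(P_pi, r_pi, gamma, TV)    # TV now holds T_pi^(l+1) V
        out += (1.0 - lam) * lam**l * TV
    return out

def lambda_bellman_closed_form(P_pi, r_pi, gamma, lam, V):
    """Closed form of the same series for linear T_pi."""
    V = np.asarray(V, dtype=float)
    A = np.eye(len(V)) - lam * gamma * P_pi
    return np.linalg.solve(A, r_pi + (1.0 - lam) * gamma * P_pi @ V)

# Hypothetical policy-induced chain.
P_pi = np.array([[0.5, 0.5],
                 [0.2, 0.8]])
r_pi = np.array([1.0, 0.0])
gamma = 0.9
V0 = np.zeros(2)

V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)   # full evaluation
one_step = bellman(P_pi, r_pi, gamma, V0)                # VI-style backup
half = lambda_bellman_closed_form(P_pi, r_pi, gamma, 0.5, V0)
```

With `lam=0.0` the closed form reduces to a single backup, and with `lam=1.0` it returns the exact policy value, which is the interpolation between VI and PI that the text refers to.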

Recursive Feasibility (PI+)

PI+ regularizes standard PI’s improvement step by enforcing outer-semicontinuity of the argmin set and selecting minimizers that yield the smallest total cost, ensuring nonempty, compact improvement sets and thus recursive feasibility. Key attributes are:

Outer-semicontinuous and lower-semicontinuous regularization.
Explicit construction of improvement sets and tie-breaking for uniqueness.
Inductive Lyapunov arguments for robust stability (Granzotto et al., 2022).

Flexible Policy Iteration (Flexible PI)

Flexible PI incorporates additional facilities such as:

Experience Replay: Off-policy sampling from a prioritized buffer to improve data efficiency under function approximation.
Supplemental Value: Inclusion of prior value function estimates in stage cost to further guide learning, with decreasing weights.
Hybrid Batch/Incremental Updates: Supports both online and batch policy evaluation using least-squares fits to sampled data (Gao et al., 2020).
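A rough sketch of how prioritized samples can enter a least-squares evaluation step is given below. It is not the scheme of Gao et al. (2020): the linear features, the priority rule (weights proportional to absolute TD error plus a small constant), the cost-based target, and all function names are assumptions chosen for illustration, and the supplemental value term is omitted.

```python
import numpy as np

def td_errors(w, phi, cost, phi_next, gamma):
    """TD errors of a linear estimate Q(s, a) = phi(s, a) @ w."""
    return cost + gamma * phi_next @ w - phi @ w

def prioritized_weights(errors, eps=1e-3):
    """Sample weights that grow with the absolute TD error."""
    p = np.abs(errors) + eps
    return p / p.sum()

def weighted_lstsq_update(w, phi, cost, phi_next, gamma):
    """One batch evaluation step: weighted least-squares fit to TD targets."""
    targets = cost + gamma * phi_next @ w
    weights = prioritized_weights(td_errors(w, phi, cost, phi_next, gamma))
    sqrt_w = np.sqrt(weights)
    # Scaling rows by sqrt(weight) turns ordinary lstsq into weighted lstsq.
    new_w, *_ = np.linalg.lstsq(sqrt_w[:, None] * phi, sqrt_w * targets,
                                rcond=None)
    return new_w

# Synthetic batch that is Bellman-consistent with a known w_true.
rng = np.random.default_rng(0)
phi = rng.normal(size=(50, 3))         # features of sampled (s, a)
phi_next = rng.normal(size=(50, 3))    # features of (s', pi(s'))
w_true = np.array([1.0, -2.0, 0.5])
gamma = 0.9
cost = phi @ w_true - gamma * phi_next @ w_true

w_fit = weighted_lstsq_update(w_true, phi, cost, phi_next, gamma)
example_weights = prioritized_weights(np.array([0.0, 1.0, 3.0]))
```

Because the synthetic batch has zero TD error at `w_true`, the update leaves `w_true` unchanged, which is the fixed-point behavior one would want from an evaluation step.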

3. Algorithmic Workflows

The core steps of FPI algorithms, as realized in various formulations described above, can be organized as follows:

Step	FPI (safe RL; Yang et al., 2023)	λ-PI/FPI (classical; Bertsekas, 2015)	Flexible PI (Gao et al., 2020)
Policy Evaluation	$V^{\pi_k}$ via Bellman operator	Partial evaluation via $\lambda$-Bellman operator	Weighted least squares, with replay
Feasible Region ID	CDF/risky Bellman for $X^{\pi_k}$	Not explicit	Not explicit
Policy Improvement	Maximize value, region-constrained	Greedy improvement wrt policy value	Actor update via approximate value minimization
Regularization	N/A (safety constraint via $h$)	λ to trade off bias/variance	Supplemental value term

In model-free settings, function approximation (e.g., neural networks for the value function or the policy) and cross-entropy losses or prioritized replay are employed for data efficiency and constraint adherence (Yang et al., 2023, Gao et al., 2020).

4. Theoretical Guarantees

FPI formulations present strong convergence and safety properties under appropriate assumptions:

Monotonic Expansion: In safe FPI, the feasible region $X^{\pi_k}$ expands monotonically, $X^{\pi_k} \subseteq X^{\pi_{k+1}}$ (Yang et al., 2023).
Monotonic Value Improvement: $V^{\pi_{k+1}}(s) \geq V^{\pi_k}(s)$ for all $s$ in the feasible region.
Geometric Convergence: Under deterministic dynamics, $X^{\pi_k} \to X^*$ and $V^{\pi_k} \to V^*$ at a geometric rate (Yang et al., 2023).
Recursive Feasibility and Stability: For PI+ and regularized FPI, improvement sets are guaranteed nonempty, and closed-loop systems exhibit $\mathcal{KL}$-stability with explicit Lyapunov bounds (Granzotto et al., 2022).
Explicit Error and Mismatch Bounds: Uniform explicit bounds (in terms of Lyapunov functions and contractive arguments) govern the error between iterates and the optimal value function.

In approximate settings (e.g., with LSPE(λ), LSTD(λ), or TD(λ)), additional bias–variance–exploration trade-offs are controlled by λ and the weighting matrix in parameterized value spaces (Bertsekas, 2015). Convergence is preserved if contractiveness can be maintained under projection.

5. Implementation Considerations

Neural Approximation: Safe FPI for continuous state spaces approximates the CDF with a sigmoid-output neural network, trained to minimize cross-entropy against the CDF update.
Interior-Point Optimization: In the region-constrained improvement step, constrained maximization is solved with interior-point methods when feasible sets are nontrivial (Yang et al., 2023).
Experience Replay and Prioritized Sampling: Flexible PI harnesses prioritized memory buffers and weighted least-squares fits to samples with high TD error, improving sample efficiency and convergence speed (Gao et al., 2020).
Geometric Sampling: λ-PI implementations use geometric stopping rules (terminate with probability $1 - \lambda$ at each step) to simulate partial sweeps and induce exploration (Bertsekas, 2015).
Empirical Performance: In low-dimensional classic control benchmarks, safe FPI achieves true maximal feasibility (zero constraint violations). In high-dimensional environments (e.g., Safety Gym), FPI yields smooth constraint-satisfying learning and reward curves superior to Lagrangian or CPO approaches (Yang et al., 2023). Flexible PI demonstrates rapid adaptation and convergence in human-robot interaction benchmarks (Gao et al., 2020).
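The geometric stopping rule mentioned above can be illustrated by sampling rollout lengths. In this sketch, which assumes $0 \le \lambda < 1$ and uses hypothetical names, a rollout stops after each step with probability $1 - \lambda$, so a length of $\ell + 1$ steps occurs with probability $(1 - \lambda)\lambda^{\ell}$ and the mean length is $1/(1 - \lambda)$.

```python
import numpy as np

def sample_rollout_lengths(lam, size, rng):
    """Rollout lengths when each step terminates with probability 1 - lam."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1)")
    # Generator.geometric counts trials up to and including the first
    # success, so the returned lengths are always at least 1.
    return rng.geometric(1.0 - lam, size=size)

rng = np.random.default_rng(0)
lam = 0.8
lengths = sample_rollout_lengths(lam, 100_000, rng)
mean_length = lengths.mean()            # close to 1 / (1 - lam) = 5
freq_one = np.mean(lengths == 1)        # close to 1 - lam = 0.2
```

Larger values of `lam` give longer rollouts on average, moving the evaluation step closer to full policy evaluation.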

6. Variants, Relation to Other Methods, and Extensions

λ-Policy Iteration as FPI: λ-PI is mathematically equivalent to certain feasible PI forms, interpolating between PI and VI. Implementation variants differ in the number of sweeps or rollout lengths (Bertsekas, 2015).
Recursive Feasibility Regularization (PI+): Essential for nonlinear or constrained settings where infima may not be attainable, outer-semicontinuity regularization of the improvement map ensures algorithmic progress (Granzotto et al., 2022).
Flexible PI: Extends FPI to model-free regimes, supporting off-policy sampling, supplemental prior knowledge, and prioritized data to bridge sample efficiency and system-level stability (Gao et al., 2020).
Applications: FPI schemes have been validated in control of prosthetic devices, RL with strict safety requirements, and large-scale games (e.g., Tetris, Ms. Pac-Man) (Bertsekas, 2015, Gao et al., 2020).

7. Summary and Perspectives

Feasible Policy Iteration provides a rigorous, extensible meta-algorithmic paradigm to overcome the limitations of classical PI in the presence of constraints, approximation, and safety requirements. By embedding constraint-aware feasibility logic, partial and flexible evaluation, and data/experience management, FPI algorithms achieve monotonic improvement, geometric convergence to safe or optimal policies, and robust performance in both low and high-dimensional control tasks (Yang et al., 2023, Bertsekas, 2015, Granzotto et al., 2022, Gao et al., 2020). This approach unifies and extends PI, VI, and λ-PI, serving as a foundation for principled safe RL and scalable sample-efficient control.