
Feasible Policy Iteration Methods

Updated 20 May 2026
  • Feasible Policy Iteration is a dynamic programming framework that generalizes classical policy and value iterations to ensure recursive feasibility and safe control.
  • It employs strategies such as safe region identification, λ-policy iteration, and flexible evaluation to overcome limitations in constrained and continuous-state settings.
  • FPI methods offer strong guarantees, including monotonic policy improvement, geometric convergence, and stability in both deterministic and stochastic environments.

Feasible Policy Iteration (FPI) is a class of dynamic programming algorithms that generalize classical policy iteration (PI), value iteration (VI), and λ-policy iteration, with a principal focus on ensuring recursive feasibility, safe control, monotonic improvement, and practical sample efficiency. FPI methodologies have been developed across reinforcement learning (RL), approximate/adaptive dynamic programming, and constrained control, providing rigorous mechanisms for policy evaluation and improvement in both unconstrained and constrained Markov Decision Processes (MDPs), deterministic and stochastic systems, and model-free or model-based frameworks (Yang et al., 2023, Bertsekas, 2015, Granzotto et al., 2022, Gao et al., 2020).

1. Foundations: Classical Policy Iteration and Limitations

Policy Iteration (PI) is a fundamental algorithm for solving discounted or undiscounted infinite-horizon MDPs via alternating steps of policy evaluation and policy improvement. Given a policy $\pi_k$, policy evaluation computes its value function (cost-to-go) $V^{\pi_k}$ or $Q^{\pi_k}$, and policy improvement produces a new policy $\pi_{k+1}$ that greedily optimizes this value. In unconstrained, finite-state MDPs with exact dynamic programming, PI is guaranteed to converge to the optimal policy in a finite number of iterations. Value iteration (VI) can be seen as the limiting case of PI where each "evaluation" consists of a single Bellman backup.
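The evaluation/improvement alternation can be illustrated with a minimal tabular sketch. The three-state MDP, the names `P`, `GAMMA`, `evaluate`, and `improve`, and the reward-maximizing convention are all assumptions made for illustration; they do not come from any of the cited papers.

```python
# Hypothetical 3-state, 2-action MDP (illustration only).
# P[s][a] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 0.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 2, 1.0)]},
    2: {0: [(1.0, 2, 0.0)], 1: [(1.0, 2, 0.0)]},
}
GAMMA = 0.9  # assumed discount factor


def q_value(V, s, a):
    """Expected one-step return of taking action a in state s, then following V."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def evaluate(policy, tol=1e-10):
    """Policy evaluation: iterate the Bellman backup of `policy` to convergence."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            v_new = q_value(V, s, policy[s])
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < tol:
            return V


def improve(V):
    """Policy improvement: act greedily with respect to V (ties go to the first action)."""
    return {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}


def policy_iteration():
    policy = {s: 0 for s in P}
    while True:
        V = evaluate(policy)
        new_policy = improve(V)
        if new_policy == policy:  # greedy policy is stable, so stop
            return policy, V
        policy = new_policy


if __name__ == "__main__":
    print(policy_iteration())
```

Replacing the inner `evaluate` loop with a single sweep would turn this sketch into value iteration, which is the limiting case described above.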

Despite PI’s power, it is not guaranteed to be recursively feasible in nonlinear, continuous-state, or constrained settings. Recursive feasibility refers to the property that, at each iteration and every state, the policy improvement step admits a solution; this can fail in practice due to discontinuities or unattainability of infima in the control space (Granzotto et al., 2022). Safety constraints, which require the agent to avoid certain "unsafe" regions of the state space across all time, are also not directly handled by classical PI.

2. Formulations of Feasible Policy Iteration

FPI encompasses a family of algorithms developed to address recursive feasibility, constraint satisfaction, and flexible evaluation–improvement trade-offs.

Safe Feasible Policy Iteration

In RL with constraints, a “safe MDP” is defined as $(\mathcal{S}, \mathcal{A}, P, r, \gamma, h)$, with $h: \mathcal{S} \to \mathbb{R}$ specifying pointwise safety violations. The feasible region for a policy $\pi$ is $X^\pi = \{ s \in \mathcal{S} \mid h(s_t) \leq 0 \text{ for all } t \geq 0 \text{ under } a_t \sim \pi(\cdot|s_t) \}$, and the maximal feasible region $X^*$ consists of states from which some policy can keep the agent safe perpetually.

FPI for safe RL alternates:

  • Policy Evaluation: Compute the value function $V^{\pi_k}$ as in standard PI.
  • Feasible Region Identification: Via the Constraint Decay Function (CDF), $F^{\pi}(s) = \mathbb{E}\left[\gamma^{N^{\pi}(s)}\right]$, where $N^{\pi}(s)$ is the number of steps before violation, iterated using a "risky Bellman" operator.
  • Region-wise Policy Improvement: For $s$ in the feasible region ($s \in X^{\pi_k}$), maximize $V^{\pi_k}$ over admissible actions. Outside the region, choose actions that minimize $F^{\pi_k}$, actively steering the policy away from constraint violation (Yang et al., 2023).
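The three steps can be sketched on a toy deterministic chain. Everything in the block is an assumption made for illustration rather than the implementation of Yang et al. (2023): the five-state chain, the reward, the deterministic form of the constraint decay function (`F[s] = GAMMA ** N`, with `F[s] = 0` when no violation ever occurs), and the rule that in-region actions must keep the successor state inside the region.

```python
GAMMA = 0.9
STATES = range(5)
ACTIONS = (-1, +1)   # move left / move right
UNSAFE = {4}         # assumed: h(s) > 0 exactly on these states


def step(s, a):
    """Deterministic dynamics: unsafe states are absorbing, positions are clipped."""
    if s in UNSAFE:
        return s
    return min(max(s + a, 0), len(STATES) - 1)


def reward(s, a):
    """Hypothetical reward that tempts the agent toward the unsafe end of the chain."""
    return float(step(s, a))


def evaluate_cdf(policy):
    """Risky-Bellman-style sweeps: F = 1 on violation, otherwise GAMMA * F(next)."""
    F = [0.0] * len(STATES)
    for _ in range(len(STATES) + 1):   # enough sweeps for any violating path
        F = [1.0 if s in UNSAFE else GAMMA * F[step(s, policy[s])] for s in STATES]
    return F


def evaluate_value(policy, tol=1e-10):
    """Standard policy evaluation by repeated Bellman backups."""
    V = [0.0] * len(STATES)
    while True:
        new_V = [reward(s, policy[s]) + GAMMA * V[step(s, policy[s])] for s in STATES]
        if max(abs(a - b) for a, b in zip(new_V, V)) < tol:
            return new_V
        V = new_V


def improve(V, F):
    """Region-wise improvement: maximize value inside the region, minimize F outside."""
    new_policy = {}
    for s in STATES:
        if F[s] == 0.0:   # s is in the feasible region of the current policy
            safe = [a for a in ACTIONS if F[step(s, a)] == 0.0]
            new_policy[s] = max(safe, key=lambda a: reward(s, a) + GAMMA * V[step(s, a)])
        else:
            new_policy[s] = min(ACTIONS, key=lambda a: F[step(s, a)])
    return new_policy


def feasible_policy_iteration(max_iters=20):
    policy = {s: +1 for s in STATES}   # initially unsafe: always move right
    regions = []
    for _ in range(max_iters):
        F = evaluate_cdf(policy)
        V = evaluate_value(policy)
        regions.append({s for s in STATES if F[s] == 0.0})
        new_policy = improve(V, F)
        if new_policy == policy:
            break
        policy = new_policy
    return policy, regions


if __name__ == "__main__":
    print(feasible_policy_iteration())
```

In this toy run the initial policy has an empty feasible region, one improvement step outside the region makes states 0 to 3 feasible, and later steps raise the value while never leaving that region.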

λ-Policy Iteration (λ-PI)

λ-PI generalizes between VI and PI using a weighting parameter $\lambda \in [0, 1]$. Defining the $\lambda$-Bellman operator,

$T_{\pi}^{(\lambda)} Q = (1 - \lambda) \sum_{\ell=0}^{\infty} \lambda^{\ell}\, T_{\pi}^{\ell+1} Q$

where $T_{\pi}$ is the Bellman operator of policy $\pi$, the iteration consists of

  • Policy Improvement: Update $\pi_{k+1}$ greedily wrt $Q_k$,
  • Partial Policy Evaluation: Compute $Q_{k+1}$ via finitely many Bellman sweeps (Bertsekas, 2015).

FPI arises when policy evaluation is performed approximately or with a fixed computation budget (number of sweeps or samples).
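A minimal sketch of the λ-weighted evaluation follows, assuming the standard geometric-weighted form $T_\pi^{(\lambda)} = (1-\lambda)\sum_{\ell \ge 0} \lambda^{\ell} T_\pi^{\ell+1}$ and using state values `J` instead of Q-values for brevity. It relies on the fact that, for an affine $T_\pi$, $T_\pi^{(\lambda)} J$ is the fixed point of $W \mapsto T_\pi((1-\lambda) J + \lambda W)$. The toy MDP and all function names are hypothetical.

```python
GAMMA = 0.9
# Hypothetical deterministic MDP: NEXT[s][a] is the next state, REWARD[s][a] the reward.
NEXT = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 2}, 2: {0: 2, 1: 2}}
REWARD = {0: {0: 0.0, 1: 0.0}, 1: {0: 0.0, 1: 1.0}, 2: {0: 0.0, 1: 0.0}}


def backup(J, policy):
    """One application of the policy Bellman operator T_pi to J."""
    return {s: REWARD[s][policy[s]] + GAMMA * J[NEXT[s][policy[s]]] for s in NEXT}


def lambda_evaluate(J, policy, lam, sweeps=500):
    """Approximate T_pi^(lambda) J by iterating W <- T_pi((1 - lam) * J + lam * W)."""
    W = dict(J)
    for _ in range(sweeps):
        mix = {s: (1.0 - lam) * J[s] + lam * W[s] for s in NEXT}
        W = backup(mix, policy)
    return W


def greedy(J):
    """Greedy (reward-maximizing) policy with respect to J."""
    return {
        s: max(NEXT[s], key=lambda a: REWARD[s][a] + GAMMA * J[NEXT[s][a]])
        for s in NEXT
    }


def lambda_policy_iteration(lam, iters=50):
    J = {s: 0.0 for s in NEXT}
    for _ in range(iters):
        policy = greedy(J)                  # policy improvement
        J = lambda_evaluate(J, policy, lam)  # lambda-weighted evaluation
    return greedy(J), J


if __name__ == "__main__":
    print(lambda_policy_iteration(0.5))
```

With `lam=0` the evaluation reduces to a single backup (value iteration), and with `lam=1` it returns the full policy value (policy iteration); a small `sweeps` budget corresponds to the fixed-computation evaluation mentioned above.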

Recursive Feasibility (PI+)

PI+ regularizes standard PI’s improvement step by enforcing outer-semicontinuity of the argmin set and selecting minimizers that yield the smallest total cost, ensuring nonempty, compact improvement sets and thus recursive feasibility. Key attributes are:

  • Outer-semicontinuous and lower-semicontinuous regularization.
  • Explicit construction of improvement sets and tie-breaking for uniqueness.
  • Inductive Lyapunov arguments for robust stability (Granzotto et al., 2022).

Flexible Policy Iteration (Flexible PI)

Flexible PI incorporates additional facilities such as:

  • Experience Replay: Off-policy sampling with a prioritized buffer to accelerate data efficiency in function approximation.
  • Supplemental Value: Inclusion of prior value function estimates in stage cost to further guide learning, with decreasing weights.
  • Hybrid Batch/Incremental Updates: Supports both online and batch policy evaluation using least-squares fits to sampled data (Gao et al., 2020).
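The replay and least-squares ingredients can be sketched in a few lines. This is not the Flexible PI implementation of Gao et al. (2020): the one-dimensional feature, the use of the priority as the regression weight, the exponent `alpha`, and both function names are assumptions chosen to keep the example self-contained.

```python
import random
from typing import List, Optional, Tuple

# (feature x, regression target y, priority), e.g. priority = |TD error| (assumed)
Sample = Tuple[float, float, float]


def sample_prioritized(
    buffer: List[Sample],
    batch_size: int,
    alpha: float = 1.0,
    rng: Optional[random.Random] = None,
) -> List[Sample]:
    """Draw a batch with probability proportional to priority ** alpha."""
    rng = rng or random.Random(0)
    weights = [max(p, 1e-8) ** alpha for _, _, p in buffer]
    return rng.choices(buffer, weights=weights, k=batch_size)


def weighted_least_squares(batch: List[Sample]) -> Tuple[float, float]:
    """Fit y ~ theta0 + theta1 * x by minimizing sum_i w_i * (theta0 + theta1 * x_i - y_i) ** 2."""
    sw = sum(w for _, _, w in batch)
    sx = sum(w * x for x, _, w in batch)
    sy = sum(w * y for _, y, w in batch)
    sxx = sum(w * x * x for x, _, w in batch)
    sxy = sum(w * x * y for x, y, w in batch)
    det = sw * sxx - sx * sx   # determinant of the 2x2 normal equations
    if abs(det) < 1e-12:
        raise ValueError("degenerate batch: need at least two distinct feature values")
    theta1 = (sw * sxy - sx * sy) / det
    theta0 = (sy - theta1 * sx) / sw
    return theta0, theta1


if __name__ == "__main__":
    # Synthetic buffer: targets follow y = 2 + 3x, priorities grow with x.
    buffer = [(float(x), 2.0 + 3.0 * x, 1.0 + x) for x in range(10)]
    batch = sample_prioritized(buffer, batch_size=32)
    print(weighted_least_squares(batch))
```

In a batch setting the fit would be rerun on each freshly sampled batch; an incremental variant would update the same weighted sums one sample at a time.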

3. Algorithmic Workflows

The core steps of FPI algorithms, as realized in various formulations described above, can be organized as follows:

Step | FPI (safe RL) | λ-PI/FPI (classical; Bertsekas, 2015) | Flexible PI (Gao et al., 2020)
--- | --- | --- | ---
Policy Evaluation | $V^{\pi_k}$ via Bellman operator | Partial evaluation via $\lambda$-Bellman operator | Weighted least squares, with replay
Feasible Region ID | CDF/risky Bellman operator for the feasible region | Not explicit | Not explicit
Policy Improvement | Maximize value, region-constrained | Greedy improvement wrt policy value | Actor update via cost minimization
Regularization | N/A (safety handled by the feasible-region constraint) | λ to trade off bias/variance | Supplemental value term

In model-free settings, function approximation (e.g., neural networks for the value function or the policy) and cross-entropy or prioritized replay are employed for data efficiency and constraint adherence (Yang et al., 2023, Gao et al., 2020).

4. Theoretical Guarantees

FPI formulations present strong convergence and safety properties under appropriate assumptions:

  • Monotonic Expansion: In safe FPI, the feasible region $X^{\pi_k}$ expands monotonically, $X^{\pi_k} \subseteq X^{\pi_{k+1}}$ (Yang et al., 2023).
  • Monotonic Value Improvement: $V^{\pi_{k+1}}(s) \geq V^{\pi_k}(s)$ for all $s$ in the feasible region.
  • Geometric Convergence: Under deterministic dynamics, $X^{\pi_k} \to X^*$ and $V^{\pi_k} \to V^*$ at a geometric rate $\gamma$ (Yang et al., 2023).
  • Recursive Feasibility and Stability: For PI+ and regularized FPI, improvement sets are guaranteed nonempty, and closed-loop systems exhibit $\mathcal{KL}$-stability with explicit Lyapunov bounds (Granzotto et al., 2022).
  • Explicit Error and Mismatch Bounds: Uniform explicit bounds (in terms of Lyapunov functions and contractive arguments) govern the error between iterates and the optimal value function.

In approximate settings (e.g., with LSPE(λ), LSTD(λ), or TD(λ)), additional bias–variance–exploration trade-offs are controlled by λ and the weighting matrix in parameterized value spaces (Bertsekas, 2015). Convergence is preserved if contractiveness can be maintained under projection.

5. Implementation Considerations

  • Neural Approximation: Safe FPI for continuous state spaces uses a sigmoid-output neural net for the CDF, trained to minimize cross-entropy against the CDF update.
  • Interior-Point Optimization: In the region-constrained improvement step, constrained maximization is solved with interior-point methods when feasible sets are nontrivial (Yang et al., 2023).
  • Experience Replay and Prioritized Sampling: Flexible PI harnesses prioritized memory buffers and weighted least-squares fits to samples with high TD error, improving sample efficiency and convergence speed (Gao et al., 2020).
  • Geometric Sampling: λ-PI implementations use geometric stopping rules (terminate with probability $1 - \lambda$ at each step) to simulate partial sweeps and induce exploration (Bertsekas, 2015).
  • Empirical Performance: In low-dimensional classic control benchmarks, safe FPI achieves true maximal feasibility (zero constraint violations). In high-dimensional environments (e.g., Safety Gym), FPI yields smooth constraint-satisfying learning and reward curves superior to Lagrangian or CPO approaches (Yang et al., 2023). Flexible PI demonstrates rapid adaptation and convergence in human-robot interaction benchmarks (Gao et al., 2020).
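The geometric stopping rule mentioned in the list above can be sketched as follows, assuming a rollout always takes at least one step and then terminates with probability 1 − λ after each step. The function name and the use of Python's `random.Random` are illustrative choices, not part of any cited implementation.

```python
import random


def geometric_horizon(lam: float, rng: random.Random) -> int:
    """Sample a rollout length: continue with probability lam, stop with 1 - lam."""
    if not 0.0 <= lam < 1.0:
        raise ValueError("lam must lie in [0, 1) for the rollout to terminate")
    steps = 1
    while rng.random() < lam:
        steps += 1
    return steps


if __name__ == "__main__":
    rng = random.Random(0)
    lengths = [geometric_horizon(0.5, rng) for _ in range(20000)]
    # The sample mean should be close to 1 / (1 - lam) = 2 for lam = 0.5.
    print(sum(lengths) / len(lengths))
```

Under this rule a rollout has length ℓ + 1 with probability (1 − λ)λ^ℓ, which matches the weights of the geometric series in the λ-weighted Bellman operator, so averaging over sampled horizons mimics the λ-weighted evaluation in expectation.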

6. Variants, Relation to Other Methods, and Extensions

  • λ-Policy Iteration as FPI: λ-PI is mathematically equivalent to certain feasible PI forms, interpolating between PI and VI. Implementation variants differ in the number of sweeps or rollout lengths (Bertsekas, 2015).
  • Recursive Feasibility Regularization (PI+): Essential for nonlinear or constrained settings where infima may not be attainable, outer-semicontinuity regularization of the improvement map ensures algorithmic progress (Granzotto et al., 2022).
  • Flexible PI: Extends FPI to model-free regimes, supporting off-policy sampling, supplemental prior knowledge, and prioritized data to bridge sample efficiency and system-level stability (Gao et al., 2020).
  • Applications: FPI schemes have been validated in control of prosthetic devices, RL with strict safety requirements, and large-scale games (e.g., Tetris, Ms. Pac-Man) (Bertsekas, 2015, Gao et al., 2020).

7. Summary and Perspectives

Feasible Policy Iteration provides a rigorous, extensible meta-algorithmic paradigm to overcome the limitations of classical PI in the presence of constraints, approximation, and safety requirements. By embedding constraint-aware feasibility logic, partial and flexible evaluation, and data/experience management, FPI algorithms achieve monotonic improvement, geometric convergence to safe or optimal policies, and robust performance in both low and high-dimensional control tasks (Yang et al., 2023, Bertsekas, 2015, Granzotto et al., 2022, Gao et al., 2020). This approach unifies and extends PI, VI, and λ-PI, serving as a foundation for principled safe RL and scalable sample-efficient control.
