Strongly Polynomial Time Complexity of Policy Iteration for $L_\infty$ Robust MDPs

Published 30 Jan 2026 in cs.AI and cs.CC | (2601.23229v1)

Abstract: Markov decision processes (MDPs) are a fundamental model in sequential decision making. Robust MDPs (RMDPs) extend this framework by allowing uncertainty in transition probabilities and optimizing against the worst-case realization of that uncertainty. In particular, $(s, a)$-rectangular RMDPs with $L_\infty$ uncertainty sets form a fundamental and expressive model: they subsume classical MDPs and turn-based stochastic games. We consider this model with discounted payoffs. The existence of polynomial and strongly-polynomial time algorithms is a fundamental problem for these optimization models. For MDPs, linear programming yields polynomial-time algorithms for any arbitrary discount factor, and the seminal work of Ye established strongly--polynomial time for a fixed discount factor. The generalization of such results to RMDPs has remained an important open problem. In this work, we show that a robust policy iteration algorithm runs in strongly-polynomial time for $(s, a)$-rectangular $L_\infty$ RMDPs with a constant (fixed) discount factor, resolving an important algorithmic question.

Summary

  • The paper proves that robust policy iteration for (s,a)-rectangular L∞ robust MDPs (RMDPs) with a fixed discount factor runs in strongly polynomial time.
  • A novel potential function and combinatorial analysis of MSB transitions provide rigorous bounds on policy improvement steps.
  • The analysis covers both robust Markov chains and general robust MDPs, yielding iteration bounds that depend only on the numbers of states and actions once the discount factor is fixed.

Strongly Polynomial Policy Iteration Algorithms for $L_\infty$ Robust MDPs

Introduction and Motivation

This paper addresses a significant problem in robust sequential decision making: the algorithmic complexity of solving Markov decision processes (MDPs) under $L_\infty$-rectangular uncertainty, i.e., robust MDPs (RMDPs). In many applications, MDP transition kernels are uncertain due to estimation error or incomplete knowledge. The robust MDP framework mitigates this by evaluating policies against the worst-case realization within a structured uncertainty set. Among several uncertainty models, $(s,a)$-rectangular $L_\infty$ sets are especially relevant: they allow uncertainty to be modeled independently for each state-action pair, reflecting practical data-driven use cases and preserving computational tractability via robust dynamic programming.

For classical (non-robust) MDPs, polynomial-time algorithms exist for the discounted setting, e.g., via linear programming, and seminal work has established strongly polynomial methods for fixed discount factors. For robust counterparts, although polynomial-time results are available for value approximation, the existence of strongly polynomial policy iteration algorithms for exact value computation in $(s,a)$-rectangular $L_\infty$ RMDPs remained unresolved. This work settles the open problem, proving that robust policy iteration with a fixed discount factor has strongly polynomial complexity for this important class.

Problem Formalism and Algorithmic Setting

The focus is on finite-state discounted robust MDPs with $(s,a)$-rectangular $L_\infty$ uncertainty sets around a nominal kernel $P_{s,a}$, with coordinate-wise radii $\delta_{s,a}$. The setting generalizes classical MDPs and, crucially, encapsulates turn-based stochastic games via polynomial reductions.
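
For concreteness, one standard way to write this ambiguity set (the notation below is illustrative and follows this summary rather than quoting the paper) is

$$\mathcal{U}_{s,a} = \{\, p \in \Delta(S) : \|p - P_{s,a}\|_\infty \le \delta_{s,a} \,\}, \qquad (s,a) \in S \times A,$$

with the robust value obtained by letting an adversary pick the kernel independently for each state-action pair, e.g. under a reward-maximizing convention

$$(Tv)(s) = \max_{a \in A} \; \min_{p \in \mathcal{U}_{s,a}} \Big[ r(s,a) + \gamma \sum_{s'} p(s')\, v(s') \Big].$$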

Two key policy iteration variants are considered:

  • RMC-PI: For robust Markov chains (a single action per state).
  • RMDP-PI: For general robust MDPs.

Both variants alternate between policy evaluation (exact value computation under fixed policies) and greedy policy improvement (maximization over policies within uncertainty sets). For the $L_\infty$ model, optimal transitions can be efficiently computed using a homotopy-based mass-transfer algorithm, which, for each state, reallocates local probability to maximize/minimize Bellman values under the given constraints in $O(n \log n)$ time per state.
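
The sketch below illustrates one way such a per-state mass-transfer step can be implemented; it is a reconstruction based on the description above (sort successor states by value, shift mass toward low-value states within the per-coordinate radius and simplex constraints), not the paper's exact homotopy procedure, and the function name and sample numbers are hypothetical.

```python
import numpy as np

def worst_case_transition(p, delta, v):
    """Minimize sum_i q[i] * v[i] over { q in simplex : |q[i] - p[i]| <= delta }.

    Greedy, fractional-knapsack style sketch of the O(n log n) per-state
    subproblem described in the text (adversary picks the worst kernel in the
    L_infinity ball); NOT the paper's exact homotopy algorithm.
    """
    p = np.asarray(p, dtype=float)
    v = np.asarray(v, dtype=float)
    lo = np.maximum(p - delta, 0.0)   # per-coordinate lower bounds
    hi = np.minimum(p + delta, 1.0)   # per-coordinate upper bounds

    q = lo.copy()
    remaining = 1.0 - lo.sum()        # probability mass still to place
    # Place the remaining mass on the lowest-value successor states first.
    for i in np.argsort(v):           # the sort dominates: O(n log n)
        add = min(hi[i] - q[i], remaining)
        q[i] += add
        remaining -= add
        if remaining <= 1e-12:
            break
    return q

# Tiny usage example with made-up numbers:
p = [0.5, 0.3, 0.2]          # nominal transition probabilities
v = [1.0, 4.0, 2.0]          # current value estimates of successor states
q = worst_case_transition(p, delta=0.1, v=v)
print(q, q @ np.asarray(v))  # adversarial kernel and its worst-case value
```

For a cost-based (minimizing) formulation, the same routine applies with the sort order reversed, so that the adversary pushes mass toward high-cost states instead.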

Technical Contributions

Novel Potential Function and Combinatorial Analysis

A central technical innovation is the introduction of a potential function, defined over probability mass transfers between transition kernels associated with the policy and its optimal counterpart. This function measures the improvement potential and enables a fine-grained analysis of policy iteration progression. The authors provide upper and lower bounds relating this potential to the value-optimality gap.

To bound overall complexity, the paper studies the evolution of the most significant bits (MSBs) in the binary representation of transition probability differences. Leveraging combinatorial arguments (including new structural lemmas resembling applications of Siegel's lemma), the analysis shows there are only $O(n \log n)$ possible MSBs for discrepancies, and that each critical mass-transfer quantity diminishes by at least half every $O(\log_\gamma(1/(2n^2)))$ steps. Together, these facts preclude long stretches of iterations with insignificant improvement.
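
As a toy illustration of the counting logic only (not the paper's lemma), the snippet below shows how a bounded range of admissible MSB levels combined with a guaranteed halving every $K$ steps caps the total run length; the constants `K` and `B` are placeholders for the quantities named above.

```python
import math

def msb(x):
    """Index of the most significant bit of x > 0, i.e. floor(log2(x))."""
    return math.floor(math.log2(x))

# Toy bookkeeping: if a discrepancy starts at most 1 and the algorithm halves
# it at least once every K iterations, then after it has exhausted all
# admissible MSB levels down to -B it drops below the resolution 2**(-B),
# so the run length is at most about K * B iterations.
K = 25          # placeholder "halving period", e.g. O(log_gamma(1/(2 n^2)))
B = 40          # placeholder number of admissible MSB levels, e.g. O(n log n)
x, iters = 1.0, 0
while msb(x) >= -B:
    iters += K  # at most K iterations per halving
    x /= 2.0    # guaranteed progress: the discrepancy at least halves
print(iters <= K * (B + 1))   # True: the phase count bounds the run length
```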

Furthermore, the paper acknowledges that a strengthening of the combinatorial lemma was contributed independently by the agent Aletheia, a noteworthy instance of autonomous mathematical research.

Strongly Polynomial Policy Iteration for Robust Markov Chains

For RMC-PI, the number of improvement steps is shown to be

$$O\!\left(n^4 \log n \cdot \frac{\log((1-\gamma)/n^2)}{\log(1/\gamma)}\right)$$

where $n$ is the number of states. Each iteration involves only strongly polynomial primal-dual computations, and both the bit-lengths and the number of steps are bounded polynomially in the size of the state space and the uncertainty-set description, independently of the cost function coefficients.
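
To get a feel for how this bound scales, here is a quick numerical evaluation of the displayed expression; the helper name, sample values of $n$ and $\gamma$, and the dropped $O(\cdot)$ constant are all illustrative choices, not quantities from the paper.

```python
import math

def rmc_pi_bound(n, gamma):
    """Evaluate the displayed RMC-PI step bound, dropping the hidden O(.)
    constant and taking the magnitude of the log ratio (only its size
    matters for the asymptotics)."""
    return n**4 * math.log(n) * abs(math.log((1 - gamma) / n**2) / math.log(1 / gamma))

# For a fixed discount factor the bound grows polynomially in n and is
# independent of the cost coefficients and input bit-lengths.
for n in (10, 100, 1000):
    print(n, f"{rmc_pi_bound(n, gamma=0.9):.3e}")
```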

Extension to Robust Markov Decision Processes

For the general RMDP-PI, the analysis is adapted with a similar pipeline. The potential function here corresponds to action-level optimality gaps. After ruling out repeated non-improving actions, the total number of improvement steps is

$$O\!\left(n \cdot m \cdot \frac{\log(1-\gamma)}{\log \gamma}\right)$$

with $n$ states and $m$ actions. The complexity is thus strongly polynomial in the problem size for any fixed discount factor $\gamma$.
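
The short calculation below (function name, sample sizes, and the dropped $O(\cdot)$ constant are illustrative) shows why the qualifier "fixed discount factor" matters: for fixed $\gamma$ the bound is linear in $n \cdot m$, but the $\log(1-\gamma)/\log\gamma$ factor grows as $\gamma \to 1$.

```python
import math

def rmdp_pi_bound(n, m, gamma):
    """Evaluate n * m * log(1 - gamma) / log(gamma), the displayed RMDP-PI
    step bound with the hidden O(.) constant dropped."""
    return n * m * math.log(1 - gamma) / math.log(gamma)

# Linear in n * m for a fixed gamma ...
print(rmdp_pi_bound(100, 10, 0.9))
# ... but the discount-dependent factor blows up as gamma approaches 1,
# which is why the result is stated for a constant (fixed) discount factor.
for gamma in (0.9, 0.99, 0.999):
    print(gamma, f"{rmdp_pi_bound(100, 10, gamma):.3e}")
```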

Crucially, the paper provides the first proofs that policy iteration is a strongly polynomial algorithm for both robust chains and full RMDPs under $(s,a)$-rectangular $L_\infty$ uncertainty.

Implications and Future Directions

The central result, strongly polynomial-time policy iteration for this class of robust MDPs, yields an explicit, efficient algorithm whose step count and arithmetic cost depend solely on the problem's combinatorial structure and not on the numerical precision of the input data. This solidifies the theoretical tractability of robust planning under realistic uncertainty models and rules out the pathological slowdowns that naive reductions to stochastic games could incur.

Practically, this justifies the use of policy iteration for large RMDPs, provided the uncertainty is $L_\infty$-rectangular, and supplies explicit complexity guarantees. The mathematical apparatus provided (especially the combinatorial bounds and potential-based progress measures) may inspire similar analyses for broader classes (e.g., $L_1$ or non-rectangular uncertainty), or for other solution paradigms such as value iteration.

Theoretically, the work resolves a notable open complexity-theoretic question, confirming that the extension to robust models introduces no fundamental barrier to strongly polynomial planning, at least in the $(s,a)$-rectangular $L_\infty$ discounted case. It also directly informs the complexity of turn-based stochastic games, given the reduction results.

Conclusion

This paper proves that robust policy iteration for $(s,a)$-rectangular $L_\infty$-uncertain discounted MDPs is strongly polynomial. The main advances are a novel potential function to measure algorithmic progress, a sharp combinatorial analysis of possible MSBs in transition-difference trajectories, and complexity bounds that neither depend on input cost/transition bit-sizes nor deteriorate due to the uncertainty set's geometry. The results close a long-standing open problem regarding the complexity of robust MDP planning and provide the foundation for further research on robust reinforcement learning and sequential decision making under uncertainty (2601.23229).
