- The paper proves that robust policy iteration for (s,a)-rectangular L∞ MDPs achieves strongly polynomial time complexity.
- A novel potential function and combinatorial analysis of MSB transitions provide rigorous bounds on policy improvement steps.
- The methodology covers both robust Markov chains and general robust MDPs, yielding iteration bounds that are independent of the numerical input data.
Strongly Polynomial Policy Iteration Algorithms for L∞ Robust MDPs
Introduction and Motivation
This paper addresses a significant problem in robust sequential decision making: the algorithmic complexity of solving Markov Decision Processes (MDPs) under L∞-rectangular uncertainty (RMDPs). In many applications, MDP transition kernels are uncertain due to estimation error or incomplete knowledge. The robust MDP framework mitigates this by evaluating policies against the worst-case realization within a structured uncertainty set. Among several uncertainty models, (s,a)-rectangular L∞ sets are especially relevant—they allow uncertainty to be modeled independently for each state-action pair, echoing practical data-driven use cases and preserving computational tractability via robust dynamic programming.
For classical (non-robust) MDPs, polynomial-time algorithms exist for the discounted setting, e.g., via linear programming, and seminal work has established strongly polynomial methods for fixed discount factors. For robust counterparts, although polynomial-time results are available for value approximation, the existence of strongly polynomial policy iteration algorithms for exact value computation in (s,a)-rectangular L∞ RMDPs remained unresolved. This work settles the open problem, proving that robust policy iteration with a fixed discount factor has strongly polynomial complexity for this important class.
The focus is on finite-state discounted robust MDPs with (s,a)-rectangular L∞ uncertainty sets around a nominal kernel P_{s,a}, with coordinate-wise radii δ_{s,a}. The setting generalizes classical MDPs and, crucially, captures turn-based stochastic games via polynomial reductions.
Two key policy iteration variants are considered:
- RMC-PI: For robust Markov chains (a singleton action set per state).
- RMDP-PI: For general robust MDPs.
Both variants alternate between policy evaluation (exact value computation for a fixed policy, against the worst-case kernel) and greedy policy improvement (maximizing robust Bellman values over actions). For the L∞ model, the worst-case transitions can be computed efficiently by a homotopy-based mass-transfer algorithm, which, for each state, reallocates local probability mass to maximize or minimize the Bellman value under the given constraints in O(n log n) time per state.
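The inner optimization over an (s,a)-rectangular L∞ ball admits a simple greedy solution. The sketch below is a minimal Python illustration of the O(n log n) mass-transfer idea (the function name, NumPy interface, and greedy formulation are expository assumptions; the paper's homotopy-based routine attains the same per-state bound but may proceed differently): clamp the nominal row to its box bounds, then assign the remaining probability mass to the lowest-value successors first.

```python
import numpy as np

def worst_case_linf(p, v, delta):
    """Greedy sketch: minimize q @ v over the probability simplex subject to
    |q_i - p_i| <= delta_i (the adversarial inner problem for the L-infinity model).
    p, v are length-n arrays; delta is a scalar or length-n array of radii."""
    p, v = np.asarray(p, float), np.asarray(v, float)
    delta = np.broadcast_to(np.asarray(delta, float), p.shape)

    lo = np.maximum(0.0, p - delta)      # per-coordinate lower box bounds
    hi = np.minimum(1.0, p + delta)      # per-coordinate upper box bounds
    q = lo.copy()                        # start from the lower bounds...
    budget = 1.0 - q.sum()               # ...and place the remaining mass
    for i in np.argsort(v):              # cheapest successor values first
        if budget <= 0.0:
            break
        take = min(hi[i] - q[i], budget)
        q[i] += take
        budget -= take
    return q
```

The maximizing (optimistic) direction is symmetric: allocate the remaining mass to the highest-value successors first, i.e. sort v in descending order. The sort dominates the cost, which is where the O(n log n) per-state bound comes from.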
Technical Contributions
Novel Potential Function and Combinatorial Analysis
A central technical innovation is the introduction of a potential function, defined over probability mass transfers between transition kernels associated with the policy and its optimal counterpart. This function measures the improvement potential and enables a fine-grained analysis of policy iteration progression. The authors provide upper and lower bounds relating this potential to the value-optimality gap.
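Schematically, and with c₁, c₂, Φ used here only as illustrative stand-ins rather than the paper's notation, such bounds take a sandwich form, c₁ ⋅ Φ(π) ≤ ‖v* − v^π‖∞ ≤ c₂ ⋅ Φ(π): progress in the potential translates into quantifiable progress toward optimality, and a vanishing potential certifies optimality.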
To bound the overall complexity, the paper studies how the most significant bits (MSBs) in the binary representations of transition-probability differences evolve. Leveraging combinatorial arguments (including new structural lemmas resembling applications of Siegel's lemma), the analysis shows that only O(n log n) distinct MSBs can occur for these discrepancies, and that each critical mass transfer shrinks by at least half every O(log_γ(1/(2n²))) steps. Critically, this rules out long stretches of iterations with insignificant improvement.
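As a back-of-the-envelope reading of this rate (a heuristic, not the paper's actual argument): a nonnegative quantity that contracts by a factor γ per iteration satisfies γ^k ≤ 1/(2n²) as soon as k ≥ log(2n²)/log(1/γ) = log_γ(1/(2n²)), which is exactly the window length appearing in the bound above.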
The paper also acknowledges a mathematical strengthening of this combinatorial lemma, contributed independently by the agent Aletheia, a relevant contribution to autonomous mathematical research.
Strongly Polynomial Policy Iteration for Robust Markov Chains
For RMC-PI, the number of improvement steps is shown to be
O( n⁴ log n ⋅ log_γ((1−γ)/n²) )
where n is the number of states. Each iteration requires only strongly polynomial primal-dual computations, and both the bit-lengths of intermediate quantities and the number of steps are bounded polynomially in the sizes of the state space and the uncertainty-set description, independently of the cost function coefficients.
Extension to Robust Markov Decision Processes
For the general RMDP-PI, the analysis is adapted with a similar pipeline. The potential function here corresponds to action-level optimality gaps. After ruling out repeated non-improving actions, the total number of improvement steps is
O( n ⋅ m ⋅ log_γ(1−γ) )
with n states and m actions. The complexity is thus strongly polynomial in problem size for any fixed discount factor γ.
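To make the alternation concrete, here is a minimal robust policy iteration sketch in the reward-maximization convention, reusing the worst_case_linf helper from the earlier sketch. It is an illustration of the RMDP-PI structure under assumed interfaces (array shapes, tolerances, and the iterative evaluation loop are all expository choices); in particular, it evaluates policies by iterating the robust Bellman operator to a tolerance, whereas the paper performs exact, strongly polynomial evaluation.

```python
import numpy as np
# Assumes worst_case_linf(p, v, delta) from the earlier sketch is in scope.

def robust_policy_iteration(P, r, delta, gamma, eval_tol=1e-10, improve_tol=1e-9):
    """Sketch of (s,a)-rectangular L-infinity robust policy iteration.
    P: nominal kernel, shape (n, m, n); r, delta: shape (n, m); 0 < gamma < 1."""
    n, m, _ = P.shape
    policy = np.zeros(n, dtype=int)

    def robust_backup(v, s, a):
        # Worst-case expected value of (s, a): the adversary minimizes q @ v.
        return r[s, a] + gamma * worst_case_linf(P[s, a], v, delta[s, a]) @ v

    def evaluate(pi, v):
        # Approximate robust policy evaluation by fixed-point iteration.
        while True:
            v_new = np.array([robust_backup(v, s, pi[s]) for s in range(n)])
            if np.max(np.abs(v_new - v)) < eval_tol:
                return v_new
            v = v_new

    v = np.zeros(n)
    while True:
        v = evaluate(policy, v)                                   # robust evaluation
        q = np.array([[robust_backup(v, s, a) for a in range(m)]  # robust Q-values
                      for s in range(n)])
        gain = q.max(axis=1) - q[np.arange(n), policy]            # per-state improvement
        if gain.max() < improve_tol:                              # no improving action left
            return policy, v
        policy = np.where(gain > improve_tol, q.argmax(axis=1), policy)
```

The two nested loops mirror the evaluation/improvement alternation described above; the strongly polynomial guarantees proved in the paper concern the exact variants it analyzes, not this tolerance-based sketch.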
Crucially, the paper provides the first proofs that policy iteration is a strongly polynomial algorithm for both robust chains and full RMDPs under (s,a)-rectangular L∞ uncertainty.
Implications and Future Directions
The central result, a strongly polynomial-time policy iteration for this class of robust MDPs, gives an explicit, efficient algorithm whose step count and arithmetic cost depend only on the problem's combinatorial structure and not on the numerical precision of the input data. This solidifies the theoretical tractability of robust planning under realistic uncertainty models and precludes the pathological slowdowns previously possible with naive reductions to stochastic games.
Practically, this justifies the use of policy iteration for large RMDPs with uncertainty, provided the uncertainty is L∞-rectangular, and supplies explicit complexity guarantees. The mathematical apparatus provided (especially the combinatorial bounds and potential-based progress measures) may inspire similar analyses for broader classes (e.g., L1 or non-rectangular uncertainty), or for other solution paradigms such as value iteration.
Theoretically, the work resolves a notable open complexity-theoretic question, confirming that the extension to robust models introduces no fundamental barrier to strongly polynomial planning, at least in the (s,a)-rectangular L∞-discounted case. It also directly informs the complexity of turn-based stochastic games, given the reduction results.
Conclusion
This paper proves that robust policy iteration for (s,a)-rectangular L∞-uncertain discounted MDPs is strongly polynomial. The main advances are a novel potential function to measure algorithmic progress, a sharp combinatorial analysis of possible MSBs in transition difference trajectories, and complexity bounds that neither depend on input cost/transition bit-sizes nor deteriorate due to the uncertainty set’s geometry. The results close a long-standing open problem regarding the complexity of robust MDP planning and provide the foundation for further research on robust reinforcement learning and sequential decision making under uncertainty (2601.23229).