- The paper proves that robust policy iteration for (s,a)-rectangular L∞ MDPs achieves strongly polynomial time complexity.
- A novel potential function and combinatorial analysis of MSB transitions provide rigorous bounds on policy improvement steps.
- The methodology covers both robust Markov chains and general robust MDPs, yielding iteration bounds that are independent of the numerical input data.
Strongly Polynomial Policy Iteration Algorithms for L∞ Robust MDPs
Introduction and Motivation
This paper addresses a significant problem in robust sequential decision making: the algorithmic complexity of solving Markov Decision Processes (MDPs) under L∞-rectangular uncertainty (RMDPs). In many applications, MDP transition kernels are uncertain due to estimation error or incomplete knowledge. The robust MDP framework mitigates this by evaluating policies against the worst-case realization within a structured uncertainty set. Among several uncertainty models, (s,a)-rectangular L∞ sets are especially relevant—they allow uncertainty to be modeled independently for each state-action pair, echoing practical data-driven use cases and preserving computational tractability via robust dynamic programming.
For classical (non-robust) MDPs, polynomial-time algorithms exist for the discounted setting, e.g., via linear programming, and seminal work has established strongly polynomial methods for fixed discount factors. For robust counterparts, although polynomial-time results are available for value approximation, the existence of strongly polynomial policy iteration algorithms for exact value computation in (s,a)-rectangular L∞ RMDPs remained unresolved. This work settles the open problem, proving that robust policy iteration with a fixed discount factor has strongly polynomial complexity for this important class.
The focus is on finite-state discounted robust MDPs with (s,a)-rectangular L∞ uncertainty sets around a nominal kernel P_{s,a}, with coordinate-wise radii δ_{s,a}. The setting generalizes classical MDPs and, crucially, captures turn-based stochastic games via polynomial reductions.
Two key policy iteration variants are considered:
- RMC-PI: For robust Markov chains (a singleton action set per state).
- RMDP-PI: For general robust MDPs.
Both variants alternate between policy evaluation (exact value computation for a fixed policy, against the worst-case kernel) and greedy policy improvement (maximizing robust Bellman values over actions). For the L∞ model, the worst-case transitions can be computed efficiently by a homotopy-based mass-transfer algorithm, which, for each state, reallocates local probability mass to maximize or minimize the Bellman value under the given constraints in O(n log n) time per state.
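The inner optimization over an (s,a)-rectangular L∞ ball admits a simple greedy solution. The sketch below is a minimal Python illustration of the O(n log n) mass-transfer idea (the function name, NumPy interface, and greedy formulation are expository assumptions; the paper's homotopy-based routine attains the same per-state bound but may proceed differently): clamp the nominal row to its box bounds, then assign the remaining probability mass to the lowest-value successors first.

```python
import numpy as np

def worst_case_linf(p, v, delta):
    """Greedy sketch: minimize q @ v over the probability simplex subject to
    |q_i - p_i| <= delta_i (the adversarial inner problem for the L-infinity model).
    p, v are length-n arrays; delta is a scalar or length-n array of radii."""
    p, v = np.asarray(p, float), np.asarray(v, float)
    delta = np.broadcast_to(np.asarray(delta, float), p.shape)

    lo = np.maximum(0.0, p - delta)      # per-coordinate lower box bounds
    hi = np.minimum(1.0, p + delta)      # per-coordinate upper box bounds
    q = lo.copy()                        # start from the lower bounds...
    budget = 1.0 - q.sum()               # ...and place the remaining mass
    for i in np.argsort(v):              # cheapest successor values first
        if budget <= 0.0:
            break
        take = min(hi[i] - q[i], budget)
        q[i] += take
        budget -= take
    return q
```

The maximizing (optimistic) direction is symmetric: allocate the remaining mass to the highest-value successors first, i.e. sort v in descending order. The sort dominates the cost, which is where the O(n log n) per-state bound comes from.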
Technical Contributions
Novel Potential Function and Combinatorial Analysis
A central technical innovation is the introduction of a potential function, defined over probability mass transfers between transition kernels associated with the policy and its optimal counterpart. This function measures the improvement potential and enables a fine-grained analysis of policy iteration progression. The authors provide upper and lower bounds relating this potential to the value-optimality gap.
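Schematically, and with c₁, c₂, Φ used here only as illustrative stand-ins rather than the paper's notation, such bounds take a sandwich form, c₁ ⋅ Φ(π) ≤ ‖v* − v^π‖∞ ≤ c₂ ⋅ Φ(π): progress in the potential translates into quantifiable progress toward optimality, and a vanishing potential certifies optimality.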
To bound the overall complexity, the paper studies how the most significant bits (MSBs) in the binary representations of transition-probability differences evolve. Leveraging combinatorial arguments (including new structural lemmas resembling applications of Siegel's lemma), the analysis shows that only O(n log n) distinct MSBs can occur for these discrepancies, and that each critical mass transfer shrinks by at least half every O(log_γ(1/(2n²))) steps. Critically, this rules out long stretches of iterations with insignificant improvement.
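As a back-of-the-envelope reading of this rate (a heuristic, not the paper's actual argument): a nonnegative quantity that contracts by a factor γ per iteration satisfies γ^k ≤ 1/(2n²) as soon as k ≥ log(2n²)/log(1/γ) = log_γ(1/(2n²)), which is exactly the window length appearing in the bound above.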
The paper also acknowledges a mathematical strengthening of this combinatorial lemma, contributed independently by the agent Aletheia, a relevant contribution to autonomous mathematical research.
Strongly Polynomial Policy Iteration for Robust Markov Chains
For RMC-PI, the number of improvement steps is shown to be
O( n⁴ log n ⋅ log_γ((1−γ)/n²) )
where n is the number of states. Each iteration requires only strongly polynomial primal-dual computations, and both the bit-lengths of intermediate quantities and the number of steps are bounded polynomially in the sizes of the state space and the uncertainty-set description, independently of the cost function coefficients.
Extension to Robust Markov Decision Processes
For the general RMDP-PI, the analysis is adapted with a similar pipeline. The potential function here corresponds to action-level optimality gaps. After ruling out repeated non-improving actions, the total number of improvement steps is
O( n ⋅ m ⋅ log_γ(1−γ) )
with n states and m actions. The complexity is thus strongly polynomial in problem size for any fixed discount factor γ.
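To make the alternation concrete, here is a minimal robust policy iteration sketch in the reward-maximization convention, reusing the worst_case_linf helper from the earlier sketch. It is an illustration of the RMDP-PI structure under assumed interfaces (array shapes, tolerances, and the iterative evaluation loop are all expository choices); in particular, it evaluates policies by iterating the robust Bellman operator to a tolerance, whereas the paper performs exact, strongly polynomial evaluation.

```python
import numpy as np
# Assumes worst_case_linf(p, v, delta) from the earlier sketch is in scope.

def robust_policy_iteration(P, r, delta, gamma, eval_tol=1e-10, improve_tol=1e-9):
    """Sketch of (s,a)-rectangular L-infinity robust policy iteration.
    P: nominal kernel, shape (n, m, n); r, delta: shape (n, m); 0 < gamma < 1."""
    n, m, _ = P.shape
    policy = np.zeros(n, dtype=int)

    def robust_backup(v, s, a):
        # Worst-case expected value of (s, a): the adversary minimizes q @ v.
        return r[s, a] + gamma * worst_case_linf(P[s, a], v, delta[s, a]) @ v

    def evaluate(pi, v):
        # Approximate robust policy evaluation by fixed-point iteration.
        while True:
            v_new = np.array([robust_backup(v, s, pi[s]) for s in range(n)])
            if np.max(np.abs(v_new - v)) < eval_tol:
                return v_new
            v = v_new

    v = np.zeros(n)
    while True:
        v = evaluate(policy, v)                                   # robust evaluation
        q = np.array([[robust_backup(v, s, a) for a in range(m)]  # robust Q-values
                      for s in range(n)])
        gain = q.max(axis=1) - q[np.arange(n), policy]            # per-state improvement
        if gain.max() < improve_tol:                              # no improving action left
            return policy, v
        policy = np.where(gain > improve_tol, q.argmax(axis=1), policy)
```

The two nested loops mirror the evaluation/improvement alternation described above; the strongly polynomial guarantees proved in the paper concern the exact variants it analyzes, not this tolerance-based sketch.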
Crucially, the paper provides the first proofs that policy iteration is a strongly polynomial algorithm for both robust chains and full RMDPs under (s,a)-rectangular L∞ uncertainty.
Implications and Future Directions
The central result, a strongly polynomial-time policy iteration for this class of robust MDPs, gives an explicit, efficient algorithm whose step count and arithmetic cost depend only on the problem's combinatorial structure and not on the numerical precision of the input data. This solidifies the theoretical tractability of robust planning under realistic uncertainty models and precludes the pathological slowdowns previously possible with naive reductions to stochastic games.
Practically, this justifies the use of policy iteration for large RMDPs with uncertainty, provided the uncertainty is L∞-rectangular, and supplies explicit complexity guarantees. The mathematical apparatus provided (especially the combinatorial bounds and potential-based progress measures) may inspire similar analyses for broader classes (e.g., L1 or non-rectangular uncertainty), or for other solution paradigms such as value iteration.
Theoretically, the work resolves a notable open complexity-theoretic question, confirming that the extension to robust models introduces no fundamental barrier to strongly polynomial planning, at least in the (s,a)-rectangular L∞-discounted case. It also directly informs the complexity of turn-based stochastic games, given the reduction results.
Conclusion
This paper proves that robust policy iteration for (s,a)-rectangular L∞-uncertain discounted MDPs is strongly polynomial. The main advances are a novel potential function to measure algorithmic progress, a sharp combinatorial analysis of possible MSBs in transition difference trajectories, and complexity bounds that neither depend on input cost/transition bit-sizes nor deteriorate due to the uncertainty set’s geometry. The results close a long-standing open problem regarding the complexity of robust MDP planning and provide the foundation for further research on robust reinforcement learning and sequential decision making under uncertainty (2601.23229).