
Robust Constrained MDPs in Uncertain Environments

Updated 17 November 2025
  • RCMDPs extend the classical CMDP framework to enforce long-run safety constraints while providing robustness against model uncertainty.
  • They integrate parametric and distributional uncertainty models to handle risk in cost, reward, and transition estimates, ensuring reliable decision-making.
  • Solution techniques include robust dynamic programming, convex approximations, and policy gradient methods that offer provable guarantees and practical performance.

Robust Constrained Markov Decision Processes (RCMDPs) generalize the classical Constrained Markov Decision Process (CMDP) framework by jointly enforcing long-run safety or performance constraints and providing formal robustness guarantees against model uncertainty. In RCMDPs, constraints on cumulative costs or safety measures must be satisfied even in the presence of adversarial or random perturbations to the cost, reward, and/or transition kernel. This framework underpins safe decision-making in domains where both explicit safety budgets and epistemic/model uncertainty are central, including safety-critical control, risk-constrained reinforcement learning, and robust operations management.

1. Mathematical Foundations and Problem Formulation

The RCMDP formalism extends the standard finite-horizon or infinite-horizon discounted CMDP. In its core form, given a tuple $(\mathcal{S}, \mathcal{A}, c, \{d_k\}_{k=1}^K, \gamma, \mathcal{P})$ with

  • state space $\mathcal{S}$ and action space $\mathcal{A}$,
  • objective cost $c: \mathcal{S}\times\mathcal{A} \rightarrow \mathbb{R}$ and constraint cost(s) $d_k: \mathcal{S}\times\mathcal{A}\rightarrow\mathbb{R}$,
  • discount factor $\gamma\in(0,1)$,
  • uncertainty set of admissible transition kernels $\mathcal{P}$ (typically $s$- or $(s,a)$-rectangular),

the robust constrained optimization problem is

$$
\begin{aligned}
&\min_{\pi}\ \max_{P\in\mathcal{P}}\ \mathbb{E}_{\pi,P}\!\left[ \sum_{t=0}^\infty \gamma^t c(s_t,a_t) \right] \\
&\text{s.t. } \max_{P\in\mathcal{P}}\ \mathbb{E}_{\pi,P}\!\left[ \sum_{t=0}^\infty \gamma^t d_k(s_t,a_t) \right] \leq b_k, \quad k=1,\ldots,K.
\end{aligned}
$$

Depending on context, maximization objectives and different uncertainty models (e.g. moment-based, divergence-based, or support-based) may appear (Ganguly et al., 25 May 2025, Bossens et al., 29 Jun 2025, Ganguly et al., 10 Nov 2025, Russel et al., 2021, Kitamura et al., 29 Aug 2024, Bossens, 2023).

No general convexity or strong duality can be assumed because the set of occupancy measures robustified over $\mathcal{P}$ is often non-convex, and the worst-case transition law for the objective may differ from that for the constraint, violating the usual Lagrangian strong duality underpinning CMDP algorithms (Kitamura et al., 29 Aug 2024, Ganguly et al., 10 Nov 2025, Ganguly et al., 25 May 2025).

2. Uncertainty Models and Robustness Criteria

2.1 Parametric and Distributional Models

  • Parametric uncertainty via $s$- or $(s,a)$-rectangular sets, e.g., confidence intervals or $L_1$-balls around estimators (a minimal adversary sketch follows this list):

$$\mathcal{P} = \bigotimes_{(s,a)} \left\{ P(\cdot \mid s,a) : D\big(P(\cdot \mid s,a)\,\|\,\widehat{P}(\cdot \mid s,a)\big) \le \rho \right\}$$

for a chosen divergence $D$ (e.g., KL or total variation) and radius $\rho$ (Ganguly et al., 25 May 2025, Bossens et al., 29 Jun 2025, Varagapriya, 15 Mar 2025, Ganguly et al., 10 Nov 2025).

  • Distributional robustness via ambiguity sets for costs, rewards, or transition law modeled by KL-divergence, φ-divergence, or Wasserstein balls (Nguyen et al., 2022, Xia et al., 2023, Zhang et al., 2023).
  • Chance constraints and probabilistic safety: specifying that constraints hold with high probability even under uncertain distributions. Joint chance constraints may be modeled using copulas (e.g. Gumbel–Hougaard) to encode dependency among constraints (Varagapriya et al., 30 May 2025, Xia et al., 2023).
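
To make the rectangular divergence-ball model concrete, the sketch below computes the worst-case expectation of a value vector over a total-variation ball around a nominal kernel row, which is exactly the inner problem an adversary solves per $(s,a)$ pair. This is a minimal illustration rather than any cited implementation; the state count, nominal distribution, value vector, and radius are placeholders, and only the TV case (where the adversary admits a simple greedy solution) is shown.

```python
import numpy as np

def worst_case_expectation_tv(p_hat, v, rho):
    """Minimise sum_j p[j] * v[j] over the total-variation ball
    { p in simplex : 0.5 * ||p - p_hat||_1 <= rho }.

    The optimal adversary moves up to `rho` probability mass from the
    highest-value next states onto the single lowest-value next state.
    """
    p = p_hat.astype(float).copy()
    j_min = int(np.argmin(v))
    budget = min(rho, 1.0 - p[j_min])     # cannot add more mass than others hold
    p[j_min] += budget
    for j in np.argsort(v)[::-1]:         # strip mass from high-value states first
        if j == j_min or budget <= 0:
            continue
        take = min(p[j], budget)
        p[j] -= take
        budget -= take
    return float(p @ v)

# Placeholder data: 4 next states, a nominal row P_hat(.|s,a), and a value vector.
p_hat = np.array([0.4, 0.3, 0.2, 0.1])
v = np.array([1.0, 3.0, 0.5, 2.0])
print(p_hat @ v)                                  # nominal expectation
print(worst_case_expectation_tv(p_hat, v, 0.15))  # pessimistic expectation, radius 0.15
```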

2.2 Randomness in Model Data

Beyond ambiguity in the transition kernel, the cost and reward data themselves may be random with only partially known distributions; constraint satisfaction is then enforced in probability through the individual or joint chance constraints above, or through distributionally robust formulations over the corresponding ambiguity sets (Varagapriya et al., 30 May 2025, Xia et al., 2023).

3. Solution Techniques: Convex and Nonconvex Approaches

3.1 Inner and Outer Approximation

For RCMDPs with random costs and/or transitions under chance constraints, tractable convex inner (upper bound) and outer (lower bound) approximations are constructed (Varagapriya et al., 30 May 2025, Xia et al., 2023):

  • Convex inner approximations (for robust policy synthesis):

    • Use tail inequalities (Chebyshev, Hoeffding, Bernstein, sub-Gaussian) to replace chance constraints $\mathbb{P}[r^T Z \leq a] \geq p$ with convex surrogates:

    $$\mathbb{E}[r^T Z] + f(p)\,V(r) \leq a$$

    where $f(p)$ and $V(r)$ are specified by the chosen inequality and moment assumptions.
    • For joint chance constraints, copula theory (e.g., Gumbel–Hougaard) is used to separate global risk into tractable marginal components and individual chance levels (Varagapriya et al., 30 May 2025).
    • The resulting programs are typically second-order cone programs (SOCPs); a minimal solver sketch follows this list.

  • Linear outer approximations provide lower bounds via relaxations with slack variables and tangent approximations, producing LPs that bracket the robust value (Varagapriya et al., 30 May 2025).
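
As a rough illustration of the SOCP shape produced by such inner approximations, the sketch below encodes a Cantelli/Chebyshev-style surrogate of a single chance constraint with cvxpy, taking $V(r)$ as a weighted Euclidean norm. The CMDP flow constraints are replaced by a single simplex constraint, and the mean vector, covariance factor, budget, and confidence level are all made-up placeholders, not values from the cited papers.

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
n = 6                                    # stand-in for |S||A| occupancy variables

c = rng.uniform(0.0, 1.0, n)             # objective cost vector
mu = rng.uniform(0.0, 1.0, n)            # assumed mean of the random constraint cost
Sigma_sqrt = np.diag(rng.uniform(0.1, 0.3, n))  # assumed covariance factor
b, p = 2.0, 0.95
f_p = np.sqrt(p / (1.0 - p))             # Cantelli/Chebyshev-style multiplier f(p)

x = cp.Variable(n, nonneg=True)          # occupancy-measure-like decision variable
constraints = [
    cp.sum(x) == 1.0,                                    # placeholder for flow constraints
    mu @ x + f_p * cp.norm(Sigma_sqrt @ x, 2) <= b,      # convex surrogate of the chance constraint
]
prob = cp.Problem(cp.Minimize(c @ x), constraints)
prob.solve()
print(prob.status, round(float(prob.value), 4))
```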

3.2 Robust Dynamic Programming and Value Iteration

  • Augmented state space: When standard Markovian policies are suboptimal (due to constraint–robustness coupling), augment the state with the residual budget; Markov policies are then sufficient in the enlarged state (e.g., $x_h=(s_h, c_h)$ with $c_h$ the remaining utility budget) (Ganguly et al., 10 Nov 2025).
  • Robust Bellman operators: Value iteration or policy iteration using

$$(T^{\pi, \mathcal{P}} V)(s) = \sum_{a} \pi(a\mid s) \min_{P \in \mathcal{P}_{s,a}} \sum_{s'} P(s'\mid s,a)\,\big[r(s,a,s') + \gamma V(s')\big]$$

with robustification applied to both the reward and the constraint value functions (Bossens et al., 29 Jun 2025, Russel et al., 2021); a small numerical sketch of this backup follows the list below.

  • Sample complexity: For rectangular TV/KL/χ² divergence sets, the robust constrained value iteration (RCVI) algorithm achieves the first provable $\widetilde{O}(|S||A|H^5/\varepsilon^2)$ sample complexity bound for $\varepsilon$-robust feasibility and performance (Ganguly et al., 10 Nov 2025).
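
The following sketch instantiates the robust Bellman backup above on a toy problem, with the divergence ball simplified to a finite set of candidate kernels and the cost convention of Section 1 (the adversary maximises both objective and constraint costs, rectangularly per $(s,a)$ pair). All model data, sizes, and the budget are placeholders; the point is only the structure of pessimistic policy evaluation for the objective and the constraint.

```python
import numpy as np

rng = np.random.default_rng(1)
S, A, K, gamma = 4, 2, 3, 0.9

# Placeholder model: objective cost c, constraint cost d, and a finite
# uncertainty set of K candidate kernels P[k][s, a] (distributions over s').
c = rng.uniform(0.0, 1.0, (S, A))
d = rng.uniform(0.0, 1.0, (S, A))
P = rng.uniform(0.1, 1.0, (K, S, A, S))
P /= P.sum(axis=-1, keepdims=True)

pi = np.full((S, A), 1.0 / A)            # fixed (uniform) policy to evaluate

def robust_eval(cost, n_iter=300):
    """Pessimistic policy evaluation: per (s, a), the adversary picks the
    candidate kernel that maximises the discounted cumulative cost."""
    V = np.zeros(S)
    for _ in range(n_iter):
        # worst-case next-state expectation per (s, a), taken rectangularly
        worst = np.max(np.einsum('ksan,n->ksa', P, V), axis=0)
        V = np.sum(pi * (cost + gamma * worst), axis=1)
    return V

V_obj = robust_eval(c)       # robust objective value
V_con = robust_eval(d)       # robust constraint value
b = 6.0                      # constraint budget (placeholder)
print("robust objective:", np.round(V_obj, 3))
print("robust constraint value:", np.round(V_con, 3),
      "feasible:", bool(np.all(V_con <= b)))
```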

3.3 Policy Gradient and Primal–Dual Methods

  • Mirror descent and saddle-point schemes: Policy mirror ascent (for $\pi$) interleaved with adversarial mirror descent (for $P$) on the robust Lagrangian, typically with a Bregman divergence regularizer. Provable $O(1/T)$ convergence (deterministic) or $\widetilde{O}(T^{-1/3})$ (stochastic) is established for the saddle-point gap (Bossens et al., 29 Jun 2025).
  • Epigraph reformulation: Binary search for the minimal robust value $t$ for which a feasible policy exists, combined with projected policy gradient on a block-max constraint, avoiding gradient conflict and guaranteeing $\widetilde{O}(\varepsilon^{-4})$ robust policy evaluations for $\varepsilon$-optimality (Kitamura et al., 29 Aug 2024).
  • Robust Lagrangian/adversarial policy-gradient: Joint update of policy, Lagrange multipliers, and an explicit parameterized adversary for the worst-case transition, with distributed ascent-descent and two-time-scale convergence to first-order saddle points (Bossens, 2023).
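
A minimal sketch of the robust Lagrangian idea is given below: a tabular softmax policy updated with its exact policy gradient, a single Lagrange multiplier updated by projected ascent, and an adversary that simply selects the worst candidate kernel from a finite set at each iteration. The two-time-scale step sizes, the finite adversary, and the choice to evaluate the constraint under the Lagrangian's worst kernel are illustrative simplifications of the parameterized adversaries and convergence conditions in the cited work.

```python
import numpy as np

rng = np.random.default_rng(2)
S, A, K, gamma = 4, 2, 3, 0.9
c = rng.uniform(0.0, 1.0, (S, A))                 # objective cost (placeholder)
d = rng.uniform(0.0, 1.0, (S, A))                 # constraint cost (placeholder)
P = rng.uniform(0.1, 1.0, (K, S, A, S))           # finite set of candidate kernels
P /= P.sum(axis=-1, keepdims=True)
rho0 = np.full(S, 1.0 / S)                        # initial state distribution
b = 5.0                                           # constraint budget (placeholder)

theta, lam = np.zeros((S, A)), 0.0
eta_pi, eta_lam = 0.5, 0.05                       # two-time-scale step sizes

def softmax(x):
    z = np.exp(x - x.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def evaluate(pi, Pk, cost):
    """Exact discounted evaluation under kernel Pk: value, Q-values, occupancy."""
    P_pi = np.einsum('sa,sat->st', pi, Pk)
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, np.sum(pi * cost, axis=1))
    Q = cost + gamma * np.einsum('sat,t->sa', Pk, V)
    occ = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho0)
    return V, Q, occ

for it in range(500):
    pi = softmax(theta)
    L = c + lam * d                                          # Lagrangian cost
    # adversary: candidate kernel with the largest robust Lagrangian value
    k_star = max(range(K), key=lambda k: rho0 @ evaluate(pi, P[k], L)[0])
    V, Q, occ = evaluate(pi, P[k_star], L)
    grad = occ[:, None] * pi * (Q - V[:, None])              # exact softmax policy gradient
    theta -= eta_pi * grad                                   # descent: we minimise cost
    V_d, _, _ = evaluate(pi, P[k_star], d)
    lam = max(0.0, lam + eta_lam * (rho0 @ V_d - b))         # projected multiplier ascent

print("final multiplier:", round(lam, 3))
print("robust constraint value:", round(float(rho0 @ V_d), 3), "budget:", b)
```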

3.4 Scenario Reduction and Semi-infinite Programming

  • Semi-infinitely constrained MDPs: The robust constraint “$\forall\,P \in \mathcal{P}:\ \mathbb{E}_{P,\pi}[c(s,a)] \leq 0$” is encoded as an infinite family of constraints and solved via dual exchange (cutting-plane) or random search within the uncertainty parameter space (Zhang et al., 2023); a toy exchange loop is sketched after this list.
  • For finite but high-dimensional uncertainty sets, scenario reduction methods or sample-based approaches are required to retain computational tractability.
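
Below is a toy exchange (cutting-plane) loop for a semi-infinite constraint, using scipy.optimize.linprog. The decision variable is a probability vector standing in for an occupancy measure, the uncertain constraint cost $d(\theta)$ is a hypothetical one-parameter family over $\theta \in [0,1]$, and the active set grows with the most violated scenario found by random search, in the spirit of the dual exchange method. All names, tolerances, and stand-in constraints are assumptions for illustration.

```python
import numpy as np
from scipy.optimize import linprog

n = 5
d0 = np.linspace(0.1, 0.9, n)        # hypothetical constraint-cost endpoints
d1 = d0[::-1].copy()
dm = np.full(n, 0.8)                 # "middle" cost so interior scenarios matter
c = 1.0 - d0                         # objective cost, chosen so the constraint binds
b = 0.7

def d(theta):
    """Hypothetical one-parameter family of uncertain constraint costs."""
    return (1 - theta) ** 2 * d0 + 2 * theta * (1 - theta) * dm + theta ** 2 * d1

rng = np.random.default_rng(3)
active = [0.0, 1.0]                  # initial active scenarios
for _ in range(30):
    # finitely constrained relaxation: min c^T x over the simplex subject to
    # d(theta)^T x <= b for every theta currently in the active set
    A_ub = np.array([d(t) for t in active])
    res = linprog(c, A_ub=A_ub, b_ub=np.full(len(active), b),
                  A_eq=np.ones((1, n)), b_eq=[1.0], bounds=[(0, None)] * n)
    x = res.x
    # random search for the most violated scenario over the full parameter space
    thetas = rng.uniform(0.0, 1.0, 500)
    violations = np.array([d(t) @ x - b for t in thetas])
    j = int(np.argmax(violations))
    if violations[j] <= 1e-6:
        break                        # the semi-infinite constraint (approximately) holds
    active.append(float(thetas[j]))

print("objective:", round(float(c @ x), 4), "active scenarios:", len(active))
```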

4. Theoretical Guarantees and Practical Performance

4.1 Gap and Sample Complexity Guarantees

  • RCMDP convex–LP bracketing yields $\epsilon$-optimal guarantees at specified statistical levels for both objective and constraints (Varagapriya et al., 30 May 2025), with practical gaps much tighter than the theoretical worst case.
  • For the RCVI approach, the sample complexity to achieve $\varepsilon$-violation is $\widetilde{O}(|S||A|H^5/\varepsilon^2)$, matching the best unconstrained robust MDP rates (Ganguly et al., 10 Nov 2025).
  • Iteration complexity for direct policy-gradient methods is $O(\varepsilon^{-2})$ when avoiding binary search, and $\widetilde{O}(\varepsilon^{-4})$ when employing an epigraph double loop (Kitamura et al., 29 Aug 2024, Ganguly et al., 25 May 2025).

4.2 Empirical Benchmarking

  • Bernstein-based convex SOCPs typically outperform other distribution-free methods in closing the inner/outer approximation gap (Varagapriya et al., 30 May 2025).
  • RCVI robust value iteration achieves constraint satisfaction from the first iteration and converges significantly faster than policy-gradient baselines such as RNPG or robust CRPO (Ganguly et al., 10 Nov 2025).
  • Adversarial policy-gradient and robust Lagrangian methods yield better penalized returns and tighter constraint satisfaction in domains with stochastic perturbations compared to non-robust and non-constrained ablations (Bossens, 2023).

5. Modeling Choices, Limitations, and Interpretations

5.1 Structural and Policy Constraints

  • Arbitrary structural constraints (e.g., decision-tree complexity, memory bounds) may be incorporated into the policy space using first-order logic encoding, with feasibility verification delegated to an integrated SMT + probabilistic model-checking toolchain (Heck et al., 11 Nov 2025).
  • Robust constrained policy synthesis is computationally nontrivial (NP-/coNP-hard), but modern approaches combining SAT/SMT and probabilistic model checking (e.g., PAYNT, Storm, SMPMC) solve hundreds of RCMDP benchmarks that were previously intractable (Heck et al., 11 Nov 2025).

5.2 Open Problems and Cautions

  • Absence of strong duality and non-convexity preclude general optimality guarantees for primal–dual algorithms. The potentially different worst-case transition kernels for reward and constraint necessitate careful algorithmic design (Ganguly et al., 25 May 2025, Kitamura et al., 29 Aug 2024).
  • Binary-search-based and double-loop approaches can become computationally expensive as problem size grows or discount factor approaches unity (Kitamura et al., 29 Aug 2024, Ganguly et al., 25 May 2025).
  • Empirical performance depends critically on the tightness of convex or scenario-based approximations; attention must be paid to the choice and calibration of uncertainty sets (Xia et al., 2023, Nguyen et al., 2022).
  • Extensions to non-rectangular uncertainty, correlated uncertainties, or partially observed settings are active topics for future research.

6. Applications and Extensions

RCMDPs have seen application in queuing control, robust resource allocation, safe navigation, inventory management with uncertain demand, and decision-tree policy synthesis for systems with structural limitations (Varagapriya et al., 30 May 2025, Heck et al., 11 Nov 2025, Bossens, 2023). In highly safety-critical or uncertain environments, RCMDP methods bridge the classical robust optimization and risk-constrained reinforcement learning paradigms, offering computationally tractable, theoretically certified recipes for robust safe policy synthesis. RCMDPs also interface directly with distributionally robust optimization, chance-constrained programming, and advanced statistical model-checking and verification, representing a cornerstone in the design of verifiably reliable AI systems under uncertainty.
