Lifted Robust Markov Decision Processes

Updated 11 November 2025
  • Lifted robust MDPs are a class of decision processes that model uncertainty with higher-order structures such as factor matrices and generative models.
  • They employ efficient algorithms like robust value iteration to leverage structured ambiguity sets, ensuring convergence and computational tractability.
  • These approaches guarantee strong duality, deterministic optimal policies, and sample efficiency, making them viable for safety-critical and risk-sensitive applications.

A Lifted Robust Markov Decision Problem (LRMDP) refers to a class of robust Markov Decision Processes in which uncertainty is modeled in terms of higher-order or “lifted” structures—such as generative models, factor matrices, or probability distributions—enabling the robust policy optimization problem to be both computationally tractable and less conservative than classical formulations. The “lifting” typically introduces additional structure or flexibility into the ambiguity/uncertainty sets, often admitting efficient algorithms and preserving desirable structural properties such as the existence of deterministic optimal policies and strong duality.

1. Mathematical Formulations of Lifted RMDPs

Let $S$ denote a finite state space ($|S| = S$), $A$ an action space ($|A| = A$), and $r(s,a)$ bounded rewards. In classical MDPs, the transition kernel $P_{sas'} = \Pr(s_{t+1}=s' \mid s_t=s, a_t=a)$ is assumed known. Robust MDPs address transition uncertainty by introducing an ambiguity set or uncertainty set for the kernel, often constructed in a way that the worst-case scenario is tractable.

In lifted robust MDPs, the uncertainty is modeled in a more expressive manner:

$$P_{sas'} = \sum_{i=1}^r u^i_{sa}\, w_{i,s'}, \qquad w_i \in W^i \subset \Delta(S), \quad \sum_{i=1}^r u^i_{sa} = 1$$

where $u^i_{sa} \geq 0$ are fixed mixing coefficients, and the factors $w_i$ are drawn from uncertainty sets $W^i$ (“factor-matrix uncertainty”).
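
To make the factor-matrix construction concrete, the following minimal NumPy sketch assembles a lifted kernel from hypothetical mixing coefficients `u` and factors `w`; the dimensions and names are illustrative assumptions, not taken from the cited papers.

```python
import numpy as np

# Illustrative dimensions (assumptions): S states, A actions, r factors.
S, A, r = 4, 2, 3
rng = np.random.default_rng(0)

# Mixing coefficients u[i, s, a] >= 0 with sum_i u[i, s, a] = 1 (fixed, known).
u = rng.random((r, S, A))
u /= u.sum(axis=0, keepdims=True)

# Factors w[i] are distributions over next states, each drawn from its own
# uncertainty set W^i; here we simply sample one admissible member per factor.
w = rng.random((r, S))
w /= w.sum(axis=1, keepdims=True)

# Lifted kernel: P[s, a, s'] = sum_i u[i, s, a] * w[i, s'].
P = np.einsum("isa,ij->saj", u, w)

# Each P[s, a, :] is a valid probability distribution by construction.
assert np.allclose(P.sum(axis=2), 1.0)
```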

  • Generative Model Mismatch (Li et al., 2022): The true kernel is assumed to lie in a known constraint set $C(P^s)$ around a nominal simulation kernel $P^s$, i.e.,

$$P^* = P^s + \Delta^*, \qquad \Delta^* \in U(P^s)$$

Robust value is defined as

$$\tilde V^\pi(s) = \min_{\Delta \in U(P^s)} V^\pi_{(P^s + \Delta)}(s)$$

and the robust optimal policy is

$$\pi^* \in \arg\max_\pi \tilde V^\pi(s_0)$$

  • Distributionally Robust Stackelberg Setting (Bäuerle et al., 2020): For Borel state/action spaces, ambiguity sets are defined in terms of (weak*-compact) sets of measures, often parameterizing uncertainty about exogenous noise. The robust value is given by

$$V_0(s_0) = \inf_\pi \sup_\gamma \mathbb{E}^{\pi\gamma}_{s_0} \left[ \sum_{n=0}^{N-1} c_n + c_N \right]$$

where $\gamma$ is an adversarial sequence of conditional noise distributions.

These “lifted” uncertainty sets allow for coupling or structure across state-actions or stages, departing from the (s,a)-rectangular (fully decoupled) ambiguity sets common in earlier robust MDP literature.
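
Concretely, this departure can be written as follows (standard notation in the robust MDP literature, stated here for orientation rather than quoted from the cited papers):

$$\mathcal{U}_{(s,a)\text{-rect}} = \prod_{(s,a) \in S \times A} \mathcal{U}_{sa}, \quad \mathcal{U}_{sa} \subset \Delta(S), \qquad \text{versus} \qquad \mathcal{W}_{r\text{-rect}} = W^1 \times \cdots \times W^r, \quad W^i \subset \Delta(S),$$

so that in the r-rectangular model the adversary picks each factor $w_i$ once and for all, and the fixed coefficients $u^i_{sa}$ propagate that single choice to every state-action pair, coupling the resulting kernel perturbations.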

2. Two-Player Game and Minimax Structure

Lifted robust MDPs are naturally interpreted as two-player zero-sum games, often Stackelberg (controller-leader, nature-follower), with the following elements:

  • Agent: Selects a (deterministic or randomized) policy $\pi$ to maximize (cumulative) expected reward.
  • Adversary/Nature: Selects kernel perturbations, uncertainty factors, or noise distributions within the allowed set to minimize the agent's value, subject to problem-dependent structural restrictions.

Formally, the robust dynamic programming operator can be written (finite-state, discounted case) as:

$$(T^*V)(s) = \max_{a \in A} \min_{\Delta \in U(P^s(s,a))} \left[ r(s,a) + \gamma \sum_{s'} \big(P^s(s'|s,a) + \Delta(s'|s,a)\big) V(s') \right]$$

Under r-rectangularity (factor-matrix model), minimization over $W$ decouples, enabling efficient computation.

The policy/perturbation pair $(\pi^*, \Delta^*)$ or $(\pi^*, W^*)$ forms a Nash equilibrium. For finite-horizon problems, minimax theorems ensure existence, with the robust policy corresponding to the controller's solution.

In the distributional (Borel) setting (Bäuerle et al., 2020), the robust problem takes the form of a Stackelberg game, not a simultaneous (classical zero-sum) game, but minimax interchange holds under convexity, e.g., by Sion’s theorem.
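
As a toy illustration of this minimax structure (not drawn from the cited papers), the following sketch solves a small zero-sum matrix game with SciPy's `linprog` and checks that the agent's max-min value equals nature's min-max value, the finite-dimensional analogue of the interchange guaranteed by Sion's theorem in the convex case.

```python
import numpy as np
from scipy.optimize import linprog

# Payoff matrix of a small zero-sum game: row player (agent) maximizes,
# column player (nature) minimizes the expected payoff x^T A y.
A = np.array([[3.0, -1.0], [-2.0, 4.0]])
m, n = A.shape

# Row player's LP: max v  s.t.  A^T x >= v 1,  1^T x = 1,  x >= 0.
c = np.r_[np.zeros(m), -1.0]                      # minimize -v
A_ub = np.hstack([-A.T, np.ones((n, 1))])         # v - (A^T x)_j <= 0 for all j
res_max = linprog(c, A_ub=A_ub, b_ub=np.zeros(n),
                  A_eq=np.r_[np.ones(m), 0.0].reshape(1, -1), b_eq=[1.0],
                  bounds=[(0, None)] * m + [(None, None)])

# Column player's LP: min w  s.t.  A y <= w 1,  1^T y = 1,  y >= 0.
c = np.r_[np.zeros(n), 1.0]
A_ub = np.hstack([A, -np.ones((m, 1))])           # (A y)_i - w <= 0 for all i
res_min = linprog(c, A_ub=A_ub, b_ub=np.zeros(m),
                  A_eq=np.r_[np.ones(n), 0.0].reshape(1, -1), b_eq=[1.0],
                  bounds=[(0, None)] * n + [(None, None)])

# Max-min and min-max values coincide: the game has a (mixed) Nash equilibrium.
print(-res_max.fun, res_min.fun)                  # both approx 1.0
```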

3. Algorithmic Solutions and Complexity

Lifted robust MDPs enable tractable solution algorithms, most notably robust value iteration and policy iteration with modified Bellman operators. The structure and precision of the uncertainty set determine the computational complexity. For the factor-matrix model (Goyal et al., 2018), the robust value-iteration update is

$$v^{k+1}_s = \max_{a \in A} \left\{ r_{sa} + \lambda \sum_{i=1}^r u^i_{sa} \min_{w_i \in W^i} w_i^\top v^k \right\}$$

This is a $\lambda$-contraction in the $\ell_\infty$-norm; error at most $\epsilon$ is achieved in

$$O\!\left( r S A \log(1/\epsilon) + \sum_i \mathrm{comp}(W^i) \log^2(1/\epsilon) \right)$$

where $\mathrm{comp}(W^i)$ denotes the cost of the inner minimization over $W^i$.
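
A minimal sketch of this recursion, assuming each $W^i$ is given by a finite set of extreme points (so the inner minimization is a discrete minimum); names and shapes are illustrative, not taken from the cited papers.

```python
import numpy as np

def robust_value_iteration(r, u, W_vertices, lam, tol=1e-8, max_iter=10_000):
    """Factor-matrix robust value iteration (illustrative sketch).

    r          : (S, A) reward array r_{sa}
    u          : (k, S, A) fixed mixing coefficients u^i_{sa}, summing to 1 over i
    W_vertices : list of k arrays, each (m_i, S): rows are extreme points of W^i
    lam        : discount factor in [0, 1)
    """
    S, A = r.shape
    v = np.zeros(S)
    for _ in range(max_iter):
        # Inner adversarial step: min_{w_i in W^i} w_i^T v, attained at a vertex
        # when W^i is a polytope (the assumption made in this sketch).
        worst = np.array([(verts @ v).min() for verts in W_vertices])  # shape (k,)

        # Outer step: v_s <- max_a { r_sa + lam * sum_i u^i_{sa} * worst_i }.
        q = r + lam * np.einsum("isa,i->sa", u, worst)                 # shape (S, A)
        v_new = q.max(axis=1)

        if np.max(np.abs(v_new - v)) < tol:                            # lam-contraction
            return v_new
        v = v_new
    return v
```

When each $W^i$ is a singleton, the recursion reduces to standard value iteration on the nominal lifted kernel.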

  • Sample-Based Plug-in Nash-Equilibrium Solver (Li et al., 2022):
    1. For each $(s,a)$, sample $N$ draws from the simulator, yielding $\hat P(s'|s,a)$.
    2. Build the empirical MDP $\hat M$.
    3. Solve the empirical two-player game for $(\hat\pi, \hat\Delta)$.
    4. Return $\hat\pi$ (a simplified sketch of this procedure follows below).

The sample complexity for achieving $\tilde V^{\pi^*}(s_0) - \tilde V^{\hat\pi}(s_0) \leq \epsilon$ with probability $1-\delta$ is

$$N = \tilde O\!\left( (1+\lambda)^2 S A H^4 / \epsilon^2 \right)$$

(possibly improved to $O(\sqrt{S}\, A H^4 / \epsilon^2)$ under structural assumptions on $U$.)
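
The following minimal sketch illustrates the plug-in procedure under strong simplifying assumptions: the constraint set is replaced by a finite, hypothetical list of candidate perturbations `deltas`, and the worst case is taken rectangularly; `sampler` and all shapes are illustrative, and this is not the exact estimator or game solver of Li et al. (2022).

```python
import numpy as np

def plug_in_robust_policy(sampler, S, A, N, deltas, r, gamma, iters=500):
    """Plug-in solver sketch (illustrative; not the exact construction of Li et al., 2022).

    sampler(s, a, n) -> array of n next states drawn from the simulator kernel P^s(.|s,a)
    deltas           : list of candidate perturbation kernels Delta, each (S, A, S),
                       standing in for a (here finite, assumed) constraint set U(P^s)
    r                : (S, A) rewards; gamma: discount factor in [0, 1)
    """
    # Steps 1-2: estimate the empirical nominal kernel P_hat from N samples per (s, a).
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            nxt = sampler(s, a, N)
            P_hat[s, a] = np.bincount(nxt, minlength=S) / N

    # Step 3: robust value iteration on the empirical MDP against the finite set of
    # candidate perturbed kernels, projected back to valid distributions by clipping
    # and renormalizing (assumes the perturbed rows remain nonzero).
    kernels = [np.clip(P_hat + D, 0, None) for D in deltas]
    kernels = [K / K.sum(axis=2, keepdims=True) for K in kernels]
    v = np.zeros(S)
    for _ in range(iters):
        # NOTE: the elementwise min lets nature pick a different candidate per (s, a),
        # i.e. a rectangular relaxation of the coupled constraint set in Li et al. (2022).
        q_worst = np.min([r + gamma * K @ v for K in kernels], axis=0)  # (S, A)
        v = q_worst.max(axis=1)

    # Step 4: return the greedy (deterministic) policy of the empirical robust game.
    q_worst = np.min([r + gamma * K @ v for K in kernels], axis=0)
    return q_worst.argmax(axis=1)
```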

  • Robust Backward Induction (Bäuerle et al., 2020): In the Borel-space distributionally robust setting, value functions are computed by backward induction with the operator

$$(T_n v)(s) = \inf_{a \in D_n(s)} \sup_{Q \in \mathcal{Q}_{n+1}} \int \big[ c_n(s,a,T_n(s,a,z)) + v(T_n(s,a,z)) \big]\, Q(dz)$$

Existence of deterministic optimal policies is established under integrability, compactness, and measurability assumptions.

4. Structural Results and Policy Properties

Key theoretical properties include:

  • Strong Minimax Duality: Under r-rectangularity (factor-matrix), $\max_\pi \min_W V^\pi(W) = \min_W \max_\pi V^\pi(W)$. Optimal policies can be chosen deterministic (Goyal et al., 2018).
  • Existence of Nash Equilibrium: For finite-horizon, finite-space problems, robust and adversarial policies forming a Nash equilibrium can be constructed. In Borel-state settings, existence of deterministic Markov policies for both players is established under suitable compactness and continuity conditions (Bäuerle et al., 2020).
  • Contraction and Convergence: Both the robust Bellman operators (lifted and distributional) are $\gamma$-contractions in appropriate norms, guaranteeing unique fixed points and convergence of value-iteration schemes.
  • Blackwell Optimality: With finiteness of extreme points in the uncertainty sets, robust Blackwell optimality holds—there exists a single robust policy that is optimal for all discounts $\lambda$ greater than some $\lambda_0 < 1$ (Goyal et al., 2018).
  • Maximum Principle: For any robust-optimized policy and its corresponding worst-case kernel, no other policy can improve upon its robust value across all states, generalizing the classical principle (Goyal et al., 2018).

5. Practical Implications and Empirical Behavior

Empirical findings demonstrate that lifted robust MDPs often achieve lower conservativeness than classical (s,a)-rectangular robust MDPs. In the factor-matrix model, with factors constructed via Nonnegative Matrix Factorization (NMF) of the nominal kernel, r-rectangular robust policies $\pi^{\mathrm{rob},r}$:

  • Frequently coincide with the nominal optimal policy $\pi^{\mathrm{nom}}$ when no additional conservativeness is needed.
  • Retain the nominal reward while achieving worst-case value improvements nearly matching those of classical robust policies designed for full state coupling.
  • Lead to higher expected reward under random (rather than adversarial) model errors, and to less nominal loss under adversarial uncertainty, compared to more conservative s-rectangular robust policies (Goyal et al., 2018).

Healthcare and machine-replacement examples confirm that r-rectangular formulations are notably less pessimistic and avoid unnecessary reward sacrifices.
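
A minimal sketch of building the factor matrices by NMF from a nominal kernel, using scikit-learn; the rank `r`, the normalization step, and the suggested ball-shaped uncertainty sets around each factor are illustrative assumptions rather than the exact construction in Goyal et al. (2018).

```python
import numpy as np
from sklearn.decomposition import NMF

# Nominal kernel flattened to a nonnegative (S*A, S) matrix of transition rows.
S, A, r = 6, 3, 2
rng = np.random.default_rng(1)
P_nom = rng.random((S * A, S))
P_nom /= P_nom.sum(axis=1, keepdims=True)

# Rank-r NMF: P_nom ~= U @ W with U >= 0 (mixing weights) and W >= 0 (factors).
nmf = NMF(n_components=r, init="nndsvda", max_iter=1000, random_state=0)
U = nmf.fit_transform(P_nom)          # shape (S*A, r)
W = nmf.components_                   # shape (r, S)

# Rescale so each factor row of W is a distribution; the compensating scaling of U
# keeps the product U @ W exactly unchanged.
scale = W.sum(axis=1)
W = W / scale[:, None]
U = U * scale[None, :]
U = U / U.sum(axis=1, keepdims=True)  # small renormalization of the approximation

# Uncertainty sets W^i could then be taken as, e.g., norm balls around each W[i]
# intersected with the simplex (an assumed choice for illustration).
```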

In settings with generative models and constrained perturbations (Li et al., 2022), the ability to find near-optimal robust policies with polynomial sample complexity establishes practical feasibility even when direct environment access is restricted (e.g., medical or safety-critical applications).

In distributionally robust formulations (Bäuerle et al., 2020), robust DP equations explicitly recover dynamic programming with general (coherent) risk measures for suitable ambiguity sets, connecting robust control to risk-sensitive reinforcement learning and contributing to interpretability in domains with distributional ambiguity.

6. Connections to Broader Frameworks and Interpretations

Lifted robust MDPs unify several threads in stochastic control, robust dynamic programming, and reinforcement learning:

  • Beyond Rectangularity: The factor-matrix model generalizes (s,a)-rectangular and s-rectangular uncertainty, providing a tractable interpolation and enabling structured couplings across state-actions while preserving computational efficiency (Goyal et al., 2018).
  • Distributional Robustness and Risk Measures: The distributional robust dynamic programming framework (Bäuerle et al., 2020) formally coincides, for certain ambiguity sets, with minimizing dynamic, time-consistent, coherent risk measures. For example, the spectral risk measure corresponds to a specific robust Bellman recursion.
  • Minimax Theorems and Stackelberg Games: Sion’s minimax theorem underpins the switching of the order of decision-maker and adversary in certain convex cases, aligning robust MDPs with zero-sum game and Stackelberg control perspectives.
  • Algorithmic Flexibility: The lifted approach supports a wide variety of solution algorithms (robust VI/PI, plug-in NE based on generative models, backward induction for Borel spaces), depending on the modeling framework, with strong theoretical guarantees on convergence and performance.
  • Applicability: Lifted robust MDPs are relevant for high-stakes, safety-critical scenarios where mismatches between models (e.g., simulators) and reality necessitate robust yet not over-conservative policies, including medical treatment policies and autonomous decision systems.

7. Theoretical Guarantees and Sample Complexity

Lifted RMDPs exhibit strong statistical and algorithmic guarantees:

  • Sample Complexity (Generative Model Setting): For the plug-in NE approach, to achieve $\epsilon$-suboptimality with probability $1-\delta$, $N = \tilde O\!\left( (1+\lambda)^2 S A H^4 / \epsilon^2 \right)$ samples per state-action pair suffice (Li et al., 2022). Under further independence assumptions, the dependence on $S$ improves to approximately $\sqrt{S}$.
  • Proof Techniques: Suboptimality is decomposed into model estimation and perturbation mismatch terms, controlled by concentration bounds and Lipschitz properties, yielding polynomial dependence on all problem parameters—including state/action cardinality, horizon, and robustness parameters.
  • Optimality and Duality: Deterministic optimal policies and worst-case kernels exist under lifted formulations, and robust optimality principles (maximum principle, Blackwell optimality) carry over from classical dynamic programming.

These guarantees ensure that the expressive power and generality of the lifted robust MDP framework do not come at the expense of computational intractability or statistical inefficiency. Robust yet efficient policy synthesis is thereby enabled for complex, uncertain sequential decision-making problems.
