Causal MDPs: A Causal RL Framework

Updated 20 January 2026
  • C-MDPs are an extension of standard MDPs that incorporate explicit causal structures to model state transitions and rewards using structural causal models.
  • They enable counterfactual reasoning, robust policy optimization, and efficient exploration by leveraging causal inference techniques.
  • Applications include safety verification, interpretable decision-making, and enhanced reinforcement learning in complex, uncertain environments.

A Causal Markov Decision Process (C-MDP) is a formalism that enriches the classical Markov Decision Process with explicit causal structure governing the evolution of states and rewards. By leveraging structural causal models (SCMs) and causal graphical models, the C-MDP framework provides a foundation for counterfactual reasoning, robust policy optimization, interpretable causal influence assessment, and efficient exploration in reinforcement learning. The following presents a comprehensive overview of C-MDPs, covering formal definitions, structural properties, canonical learning/regret results, robust and uncertainty-aware extensions, and relationships to verification and explanation.

1. Fundamental Formalisms of Causal MDPs

Causal MDPs generalize standard MDPs by embedding the stepwise transition and reward mechanisms into a structured causal generative model. The common mathematical foundation is as follows:

Let $\mathcal{M} = (\mathcal{X}, \mathcal{A}, \mathcal{P}, \mathcal{R}, H)$, where

  • $\mathcal{X}$: set of (possibly structured) states;
  • $\mathcal{A}$: set of possible interventions or actions;
  • $\mathcal{P}$: (possibly factorized) transition mechanism;
  • $\mathcal{R}$: reward function (often with causal dependencies);
  • $H$: time horizon (finite or infinite).

The key augmentation is that, for every stage, state, and action, the transitions and rewards are determined by structural equations within an acyclic causal graphical model $G$ (Bayesian network or SCM) (Lu et al., 2021, Gonzalez-Soto et al., 2019). Formally, for a state $s$ and action $a$, let $Z^{S}$ and $Z^{R}$ be minimal sets of parent (causally relevant) variables. The system dynamics take the form:

$$\mathbb{P}(s' \mid s, a) = \sum_{z \in Z} \mathbb{P}(s' \mid s, z)\, \mathbb{P}(z \mid s, a)$$

$$R(s, a) = \sum_{z \in Z} R(s, z)\, \mathbb{P}(z \mid s, a)$$

where $Z = Z^{S} \cup Z^{R}$, and $R(s, z)$, $\mathbb{P}(s' \mid s, z)$ are defined by the SCM's local conditional probability tables. Interventions $do(X = x)$ correspond to forcibly assigning variable(s) $X$ within $G$.
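The factorized dynamics above can be sketched numerically. This is a minimal toy example, not drawn from the cited papers: the state, action, and parent-variable counts and all probability tables are hypothetical, generated at random, purely to show that $\mathbb{P}(s' \mid s, a)$ and $R(s, a)$ marginalize over the parent set $Z$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy C-MDP: 3 states, 4 actions, but transitions and rewards
# depend on a 2-valued parent variable Z rather than on the action directly.
S, A, Zn = 3, 4, 2

P_z_given_sa = rng.dirichlet(np.ones(Zn), size=(S, A))  # P(z | s, a)
P_s_given_sz = rng.dirichlet(np.ones(S), size=(S, Zn))  # P(s' | s, z)
R_sz = rng.uniform(size=(S, Zn))                        # R(s, z) in [0, 1]

def transition(s, a):
    """P(s' | s, a) = sum_z P(s' | s, z) P(z | s, a)."""
    return P_z_given_sa[s, a] @ P_s_given_sz[s]

def reward(s, a):
    """R(s, a) = sum_z R(s, z) P(z | s, a)."""
    return P_z_given_sa[s, a] @ R_sz[s]

p_next = transition(0, 1)  # a valid distribution over next states
```

The action influences the next state only through $\mathbb{P}(z \mid s, a)$, which is exactly the structure later sections exploit for sample-efficient exploration.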

Actions in a C-MDP are interpreted as interventions, and the system's state can include both observable variables and latent context variables $Z_t$ (for confounding, uncertainty, or latent information) (Venkatesh et al., 8 Dec 2025, Laan et al., 12 Jan 2025).

2. Structural Graphical Properties and Counterfactual Semantics

A canonical C-MDP is concretely specified as a (possibly dynamic) Bayesian network or SCM unrolled across time, with nodes at each $t$ representing state ($X_t$), action ($A_t$), additional endogenous variables ($Z_t$), and reward ($R_t$), and edges encoding all direct causal relationships (Caron et al., 12 Mar 2025, Kazemi et al., 2022, Kazemi et al., 2024).

  • Causal DAG Construction: Edges are directed, e.g. $X_t \to X_{t+1}$, $A_t \to X_{t+1}$, $X_t \to Z_t$, $X_t \to R_t$, etc. (see Figure in (Caron et al., 12 Mar 2025)).
  • Interventions: $do(A_t = a)$ is operationalized by replacing the stochastic law $\mathbb{P}(A_t \mid X_t, Z_t)$ with a delta function at $a$, and then proceeding as per the truncated factorization.
  • Counterfactuals: Rollouts under alternative action sequences or policies are simulated by re-sampling exogenous variables (noise) consistent with previously observed transitions, allowing for counterfactual path distributions (Kazemi et al., 2022, Kazemi et al., 2024).

This explicit separation supports targeted interventions, enables on-policy or off-policy counterfactual estimation, and allows direct application of do-calculus for inferring the effects of actions (Gonzalez-Soto et al., 2019).
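The counterfactual semantics sketched above hinge on reusing the exogenous noise. A minimal illustration, under assumed additive-noise linear dynamics (the structural equation $f$, its coefficients, and the action sequences are all hypothetical): record the noise realizations from a factual rollout, then replay them under an alternative action sequence.

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x, a, u):
    """Hypothetical structural equation for a scalar state."""
    return 0.9 * x + 0.5 * a + u

# Factual rollout: record the exogenous noise u_t at each step.
x, noise, factual = 0.0, [], [0.0]
for a in [1, 1, 0]:
    u = rng.normal(scale=0.1)
    noise.append(u)
    x = f(x, a, u)
    factual.append(x)

# Counterfactual rollout: same noise, alternative actions.
x_cf, counterfactual = 0.0, [0.0]
for a, u in zip([0, 0, 0], noise):
    x_cf = f(x_cf, a, u)
    counterfactual.append(x_cf)
```

Because the same noise drives both trajectories, their difference isolates the causal effect of the action change; here the noise cancels exactly, which is the point of counterfactual (rather than merely interventional) rollouts.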

3. Policy Learning and Exploration in Causal MDPs

The injection of causal structure supports more sample-efficient and robust learning algorithms:

  • Exploration Complexity Reduction: If interventions modulate only a low-dimensional set of parent variables $Z$ (even if there are many possible actions/interventions $A$), exploration complexity can depend on $|Z|$ rather than $|A|$ (Lu et al., 2021). The C-UCBVI and CF-UCBVI algorithms achieve regret bounds of order $\tilde{O}(HS\sqrt{ZT})$, where $Z$ is the size of the parent-variable support and $T$ is the number of samples.
  • Model-Based Causal Policy Optimization (C-MBPO): Learn an SCM of the transition and reward mechanism from trajectories, infer a C-MDP, and use this SCM to simulate counterfactual transitions and rewards for robust policy optimization (Caron et al., 12 Mar 2025).
  • Bandit and Two-Stage Causal MDPs: Structural bandit and two-stage models exploit causal independence among variables, decomposing the learning process and enabling convex optimization-based exploration with tight, instance-dependent simple-regret bounds (Madhavan et al., 2021).

Algorithmically, exploration policies and regret bounds are demonstrably improved when exploiting causal structure, especially when actions are high-dimensional interventions but only a small subset of causal mechanisms matter for transitions or rewards (Lu et al., 2021, Madhavan et al., 2021).
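The information-sharing effect behind these bounds can be seen in a toy causal bandit, in the spirit of the structural-bandit setting above. Everything numeric here is an illustrative assumption: 100 actions, a binary parent variable $Z$, known mechanism $P(Z \mid a)$, and unknown per-$z$ mean rewards. Every pull informs all actions at once through the shared estimate of $\mathbb{E}[R \mid Z]$.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical causal bandit: 100 actions, each shifting only the
# distribution of a binary parent Z; reward depends on Z alone.
n_actions = 100
p_z = rng.uniform(size=n_actions)   # P(Z=1 | a), assumed known structure
mu_z = np.array([0.2, 0.8])         # E[R | Z=z], unknown to the learner

# Pull random actions, observe (z, r), and estimate mu_z: 200 samples
# suffice for all 100 actions because only 2 mechanisms matter.
counts, sums = np.zeros(2), np.zeros(2)
for _ in range(200):
    a = rng.integers(n_actions)
    z = int(rng.random() < p_z[a])
    r = mu_z[z] + rng.normal(scale=0.05)
    counts[z] += 1
    sums[z] += r

mu_hat = sums / counts
# Estimated value of every action from the shared mechanism estimate.
values = (1 - p_z) * mu_hat[0] + p_z * mu_hat[1]
best = int(np.argmax(values))
```

A naive bandit would need samples per action; here the sample requirement scales with the support of $Z$, mirroring the $\sqrt{ZT}$ (rather than $\sqrt{AT}$) dependence in the regret bounds.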

4. Robustness and Counterfactual Analysis under Uncertainty

Causal MDPs support rigorous analysis under model uncertainty, confounding, distributional shift, or ambiguous causal structure:

  • Interval Counterfactual MDPs: Compute nonparametric tight bounds on counterfactual probabilities across all SCMs consistent with data; yield a robust interval MDP (IMDP) for worst-case policy analysis; analytical bounds can be computed efficiently and integrated into robust value iteration (Lally et al., 19 Feb 2025).
  • Confounding and Proximal Methods: When observed data are confounded by latent contexts, causal identification of interventional rewards is achieved through proximal reweighting using observed proxies, yielding a surrogate deconfounded MDP for policy evaluation or planning (Venkatesh et al., 8 Dec 2025).
  • Parametric and PAC Guarantees: For MDPs with parametric transition uncertainty, causality is defined in a probability-raising sense over a parameter space, and PAC lower bounds are established for the probability that a state set is a true cause of a specified event (Oura et al., 9 Jul 2025).

This robust causal reasoning enables policies that are less sensitive to spurious correlations and more resilient to out-of-distribution or adversarial changes in environment dynamics (Caron et al., 12 Mar 2025, Lally et al., 19 Feb 2025).
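The interval-MDP idea above can be made concrete with a worst-case value-iteration sketch. This is a generic robust value iteration over interval transition bounds, not the specific algorithm of (Lally et al., 19 Feb 2025); the two-state chain, its bounds, rewards, and discount are hypothetical. The adversary shifts probability mass toward low-value successors, within the intervals.

```python
import numpy as np

# Hypothetical 2-state interval MDP under a fixed policy: transition rows
# are only known to lie between lo and hi (elementwise, summing to 1).
lo = np.array([[0.6, 0.2], [0.1, 0.7]])
hi = np.array([[0.8, 0.4], [0.3, 0.9]])
r = np.array([1.0, 0.0])
gamma = 0.9

def worst_case_dist(lo_row, hi_row, v):
    """Distribution in the interval minimizing expected value:
    fill mass greedily onto the lowest-value successors first."""
    p = lo_row.copy()
    budget = 1.0 - p.sum()
    for j in np.argsort(v):
        add = min(hi_row[j] - p[j], budget)
        p[j] += add
        budget -= add
    return p

v = np.zeros(2)
for _ in range(500):
    v = np.array([r[s] + gamma * worst_case_dist(lo[s], hi[s], v) @ v
                  for s in range(2)])
```

The resulting $v$ is a guaranteed lower bound on the value under every SCM-consistent model in the interval family, which is what makes such bounds usable for worst-case policy analysis.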

5. Causal Verification, Explanation, and Temporal Reasoning

Causal MDPs provide a formal foundation for verification, explainability, and formal logic:

  • Causal Temporal Logic: PCFTL extends PCTL* with interventional and counterfactual operators, interpreted over SCMs derived from MDPs, enabling specification and model-checking of complex “what-if” safety requirements (Kazemi et al., 2022).
  • Causal Explanation Frameworks: SCM-based approaches decompose an agent’s decision into minimal, semantically distinct actual causes (e.g., state variables, transitions, rewards) of actions or failures, supporting precise responsibility assignment in autonomous and safety-critical systems (Nashed et al., 2022).
  • Probability-Raising Causality and Cause Metrics: Quantitative cause-effect relations are defined in terms of the probability-raising principle, allowing efficient algorithms to check and quantify the extent to which states (or other sets) are necessary or sufficient causes of events, effects, or path properties (Baier et al., 2022).

This suite of tools supports both formal verification in safety-critical systems and the production of interpretable, mathematically grounded explanations for sequential decision behavior.
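The probability-raising principle can be checked exactly on a small chain. The 4-state Markov chain below is a hypothetical example (start, candidate cause, absorbing effect, absorbing safe state), constructed so the cause is visited at most once; the test is whether $\mathbb{P}(\text{effect} \mid \text{cause}) > \mathbb{P}(\text{effect})$, computed via standard reachability linear systems.

```python
import numpy as np

# Hypothetical chain: 0 = start, 1 = candidate cause,
# 2 = effect (absorbing), 3 = safe (absorbing).
P = np.array([
    [0.0, 0.3, 0.1, 0.6],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])

def reach_prob(P, target, transient):
    """Solve x = P x on transient states with x[target] = 1,
    other absorbing states 0 (standard absorption probabilities)."""
    x = np.zeros(len(P))
    x[target] = 1.0
    A = np.eye(len(transient)) - P[np.ix_(transient, transient)]
    b = P[np.ix_(transient, [target])].flatten()
    x[transient] = np.linalg.solve(A, b)
    return x

x = reach_prob(P, target=2, transient=[0, 1])
p_effect = x[0]                        # P(reach effect from start)
p_cause = P[0, 1]                      # cause reachable only in one step here
p_effect_given_cause = (P[0, 1] * x[1]) / p_cause
raising = p_effect_given_cause > p_effect  # probability-raising holds
```

In this instance visiting the cause raises the effect probability from 0.31 to 0.7, so the candidate qualifies as a probability-raising cause; the cited works extend this to quantitative necessity/sufficiency metrics and path properties.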

6. Connections and Extensions in Causally Structured RL

The C-MDP paradigm unifies and extends numerous lines of research in reinforcement learning, causal inference, and sequential decision making:

  • Sequential Decision Problems as Causal Models: The mapping of Causal Decision Problems under Uncertainty (CDPU) to Markovian, history-dependent causal models shows that all standard MDP logic can be captured within the SCM formalism (Gonzalez-Soto et al., 2019).
  • Semiparametric Inference in Infinite-Horizon C-MDPs: Recent advances in double reinforcement learning apply semiparametric debiasing and efficient influence function theory to estimate policy values in C-MDPs robustly under misspecification and with relaxed overlap conditions (Laan et al., 12 Jan 2025).
  • Disentangled and Structured Causal MDPs: Separation of deterministic and stochastic state features (SD-MDP) allows compressed, structure-exploiting planning methods (e.g., improved MCTS with value clipping), achieving exponential reductions in planning complexity in constrained control applications (Liu et al., 2024).

Causal MDPs thus intersect with verification, robust RL, counterfactual reasoning, and explainable AI, providing a comprehensive mathematical and algorithmic foundation for next-generation sequential decision systems.

