Robust Counterfactual Optimization in MDPs

Updated 8 June 2026

The paper presents a framework that integrates causal inference with counterfactual reasoning to develop MDP policies robust to model uncertainties and distributional shifts.
It leverages adversarial worst-case analysis and regularization techniques to mitigate spurious correlations and optimize policy performance.
Empirical results demonstrate improved returns and safety guarantees across applications such as sepsis treatment, inventory control, and continuous control tasks.

Robust counterfactual optimization in Markov Decision Processes (MDPs) addresses the design of policies that perform reliably under uncertainty in the underlying environment dynamics, model misspecification, or distributional shifts, especially when the optimization relies on counterfactual reasoning. The field leverages formal causal models, adversarial worst-case analysis, optimization under bounded model uncertainty, and principled regularization to quantify and mitigate the impact of nonidentifiability and spurious correlations. This entry surveys foundational formalism, core algorithmic strategies, regularization and optimality principles, influence constraints, and benchmarks for robust counterfactual policy synthesis.

1. Formal Foundations of Counterfactual Reasoning in MDPs

Robust counterfactual optimization depends on a precise distinction between observational, interventional, and counterfactual distributions within the MDP framework. An MDP is defined by a tuple $(\mathcal S, \mathcal A, P, P_I, R, \gamma)$, comprising the state space $\mathcal S$, action space $\mathcal A$, transition kernel $P(s'|s,a)$, initial-state distribution $P_I$, reward $R$, and discount factor $\gamma$ (Lally et al., 19 Feb 2025, Caron et al., 12 Mar 2025).

Structural causal models (SCMs) extend this framework by specifying per-step causal mechanisms. In a Causal Markov Decision Process (C-MDP), the endogenous variable set comprises states, actions, and rewards (e.g., $\{X_t, Z_t, A_t, R_t, X_{t+1}, Z_{t+1}, R_{t+1}, \dots\}$), and exogenous noise variables model inherent stochasticity. Structural functions define how each variable is generated from its causal parents and the exogenous noise. The associated causal Bayesian network provides conditional independence structure and enables counterfactual queries via “do” interventions (Caron et al., 12 Mar 2025).

Counterfactual transitions under an SCM are constructed in three steps: (1) abduction, which infers noise realizations consistent with the observed trajectory; (2) intervention, which alters the value of an action or variable of interest (the “do” operator); and (3) prediction, which generates the next state and reward from the structural equations with the abduced noise held fixed. This sequence produces a counterfactual $(s', r)$ under any hypothetical action $a'$, starting from an observed $(s, a, s', r)$.
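The three steps can be sketched on a toy example. The code below is a minimal illustration, assuming a scalar state and an invertible additive-noise mechanism (`s' = s + a + u_s`, `r = -|s'| + u_r`); the class name `AdditiveNoiseSCM` and both mechanisms are hypothetical and are not taken from the cited papers. The additive form is chosen only because it reduces abduction to a subtraction; for general SCMs the noise is not uniquely recoverable.

```python
class AdditiveNoiseSCM:
    """Toy SCM (illustrative assumption): s' = s + a + u_s, r = -|s'| + u_r."""

    def next_state(self, s: float, a: float, u_s: float) -> float:
        return s + a + u_s

    def reward(self, s_next: float, u_r: float) -> float:
        return -abs(s_next) + u_r

    def abduct(self, s: float, a: float, s_next: float, r: float):
        # Abduction: recover the noise values that reproduce the observed transition.
        u_s = s_next - (s + a)
        u_r = r + abs(s_next)
        return u_s, u_r

    def counterfactual(self, s: float, a: float, s_next: float, r: float, a_cf: float):
        u_s, u_r = self.abduct(s, a, s_next, r)
        # Intervention: replace the factual action a with a_cf.
        # Prediction: rerun the mechanisms with the abduced noise held fixed.
        s_cf = self.next_state(s, a_cf, u_s)
        r_cf = self.reward(s_cf, u_r)
        return s_cf, r_cf


scm = AdditiveNoiseSCM()
# Observed transition (s=1.0, a=1.0, s'=2.5, r=-2.0); hypothetical action a'=-1.0.
s_cf, r_cf = scm.counterfactual(1.0, 1.0, 2.5, -2.0, a_cf=-1.0)
```

A useful sanity check is consistency: querying the counterfactual with the factual action must return the observed transition unchanged.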

Critical to robust inference is the observation that, in general, counterfactual transition distributions are not uniquely determined by observed and interventional data: many SCMs may explain the same observed transitions, inducing epistemic uncertainty in counterfactual prediction (Lally et al., 19 Feb 2025).

2. Problem Formulations and Objective Functions

Robust counterfactual optimization is typically posed as a policy optimization problem under ambiguous or adversarial models of the environment transitions. Four formulations are prominent:

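Worst-case robust control: Given a set of MDPs $\mathcal{P}$ consistent with observed/interventional statistics, the robust policy $\pi^*$ maximizes the minimum expected return over all $P \in \mathcal{P}$:

$\pi^* \in \arg\max_{\pi} \min_{P \in \mathcal{P}} \mathbb{E}_{\pi, P}\left[\sum_{t=0}^{\infty} \gamma^t R(s_t, a_t)\right]$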

(Lally et al., 19 Feb 2025, Guha, 2023).

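Counterfactual regularized objectives: In C-MBPO, the policy optimization includes not only the expected return under the estimated model but also a regularization term that penalizes differences in predicted rewards under factual and counterfactual interventions. Schematically, with $\hat{P}$ and $\hat{R}$ denoting the estimated transition and reward models and $\lambda \ge 0$ a regularization weight, the objective takes the form

$J(\pi) = \mathbb{E}_{\pi, \hat{P}}\left[\sum_{t=0}^{\infty} \gamma^t \hat{R}(s_t, a_t)\right] - \lambda\, \Omega_{\mathrm{cf}}(\pi)$

with

$\Omega_{\mathrm{cf}}(\pi) = \mathbb{E}_{\pi, \hat{P}}\left[\,\left|\hat{R}(s_t, a_t) - \hat{R}(s_t, \mathrm{do}(a'_t))\right|\,\right]$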

(Caron et al., 12 Mar 2025).

Constrained optimization for safety or explanation: Counterfactual policy optimization may impose a constraint that the post-intervention failure probability (or other risk metric) falls below a threshold while minimizing deviation from a baseline policy (Kobialka et al., 14 May 2025).

Bilevel minimax regret under constrained policy classes: For mixed discrete-continuous domains, solutions may minimize the maximum regret over all initial conditions and exogenous noise realizations, restricting policies to interpretable classes such as piecewise linear functions (Gimelfarb et al., 2024).

3. Algorithmic Methodologies and Solution Strategies

Multiple algorithmic strategies enable robust counterfactual optimization, exploiting both causal structure and minimax adversarial reasoning.

Interval MDP and Tight Counterfactual Bounds:

Given the set of all SCMs compatible with observational and interventional data, interval bounds $[\underline{P}(s'|s,a), \overline{P}(s'|s,a)]$ on the counterfactual transition probabilities can be computed in closed form for all $(s, a, s')$, enabling the formulation of an interval MDP (Lally et al., 19 Feb 2025). Robust value iteration then optimizes the worst-case value function over all admissible transitions within these bounds.
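Robust value iteration over interval bounds can be sketched as follows. This is a generic textbook-style illustration, not the cited authors' implementation: the function names, the array layout (`lower` and `upper` of shape `[S, A, S]`, `reward` of shape `[S, A]`), and the toy numbers are all assumptions. The inner minimization uses the standard greedy rule for interval constraints: start from the lower bounds, then push the remaining probability mass onto the lowest-value successor states first.

```python
import numpy as np


def worst_case_distribution(lower, upper, values):
    """Return p with lower <= p <= upper and sum(p) == 1 that minimizes p @ values."""
    p = lower.copy()
    budget = 1.0 - p.sum()            # probability mass still to be assigned
    for j in np.argsort(values):      # lowest-value successors are filled first
        add = min(upper[j] - p[j], budget)
        p[j] += add
        budget -= add
    return p


def robust_value_iteration(lower, upper, reward, gamma=0.9, tol=1e-8, max_iter=10_000):
    """Max-min value iteration on an interval MDP. Returns (values, greedy policy)."""
    n_states, n_actions, _ = lower.shape
    v = np.zeros(n_states)
    q = np.zeros((n_states, n_actions))
    for _ in range(max_iter):
        for s in range(n_states):
            for a in range(n_actions):
                p = worst_case_distribution(lower[s, a], upper[s, a], v)
                q[s, a] = reward[s, a] + gamma * (p @ v)
        v_new = q.max(axis=1)
        converged = np.max(np.abs(v_new - v)) < tol
        v = v_new
        if converged:
            break
    return v, q.argmax(axis=1)


# Toy instance (hypothetical numbers): state 1 is absorbing with zero reward.
# In state 0, action 0 is safe (reward 1, stays in state 0 with certainty);
# action 1 pays reward 2 but may fall into state 1 with probability in [0, 0.5].
lower = np.array([[[1.0, 0.0], [0.5, 0.0]], [[0.0, 1.0], [0.0, 1.0]]])
upper = np.array([[[1.0, 0.0], [1.0, 0.5]], [[0.0, 1.0], [0.0, 1.0]]])
reward = np.array([[1.0, 2.0], [0.0, 0.0]])
values, policy = robust_value_iteration(lower, upper, reward)
```

In this instance the robust policy prefers the safe action in state 0 (worst-case value about 10), whereas planning with the optimistic end of the interval would favor the risky action.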

No-Regret Game Dynamics:

Robust policy search can be formulated as a zero-sum game between a policy player and a model adversary. Each player updates its strategy via no-regret online learning (e.g., Optimistic Follow-The-Regularized-Leader, FTRL), ensuring convergence to a minimax-optimal policy up to a bounded suboptimality gap (Guha, 2023).
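The game-theoretic view can be illustrated on a finite matrix game. The sketch below is an assumption-laden simplification: it enumerates a finite set of candidate policies (rows) and candidate models (columns), and it uses plain Hedge (multiplicative weights, i.e. FTRL with an entropic regularizer) rather than the optimistic variant named above. The function name `hedge_minimax` and the payoff matrix are hypothetical.

```python
import numpy as np


def softmax(z):
    z = z - z.max()                   # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()


def hedge_minimax(payoff, n_rounds=5000, eta=0.01):
    """payoff[i, j] = return of policy i under model j.

    The policy player maximizes and the adversary minimizes; both run Hedge.
    Returns the averaged strategies and lower/upper bounds on the game value.
    """
    n_pol, n_mod = payoff.shape
    cum_pol = np.zeros(n_pol)         # cumulative payoff of each pure policy
    cum_mod = np.zeros(n_mod)         # cumulative payoff conceded under each model
    avg_pol = np.zeros(n_pol)
    avg_mod = np.zeros(n_mod)
    for _ in range(n_rounds):
        x = softmax(eta * cum_pol)    # favor policies that earned more so far
        y = softmax(-eta * cum_mod)   # favor models that conceded less so far
        avg_pol += x
        avg_mod += y
        cum_pol += payoff @ y
        cum_mod += payoff.T @ x
    avg_pol /= n_rounds
    avg_mod /= n_rounds
    value_lower = (avg_pol @ payoff).min()   # worst-case return of the averaged policy mix
    value_upper = (payoff @ avg_mod).max()   # best response against the averaged adversary
    return avg_pol, avg_mod, value_lower, value_upper


payoff = np.array([[3.0, 0.0], [1.0, 2.0]])  # hypothetical returns; game value is 1.5
avg_pol, avg_mod, value_lower, value_upper = hedge_minimax(payoff)
```

The gap `value_upper - value_lower` is bounded by the sum of the two players' average regrets, so it serves as a certificate of how close the averaged policy mix is to minimax optimality.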

Counterfactual Influence Constraints:

To prevent the “counterfactual world” from drifting too far from the observed data (and thus becoming merely interventional), influence constraints require that each counterfactual trajectory remains “influenced” by the factual path within a $k$-step horizon, where influence is assessed through support overlap of transitions. Algorithms prune the counterfactual MDP to include only transitions with such support overlap, then compute optimal policies under both budgeted action deviations and these influence constraints (Kazemi et al., 2024).

Constraint-Generation Optimization Loops:

In mixed discrete-continuous spaces, bilevel optimization is structured as a constraint-generation loop: an outer optimization adjusts policy parameters to minimize regret over a set of already-identified adversarial (counterfactual) trajectories, while an inner maximization discovers new worst-case trajectories for further constraint addition. Upon convergence or zero optimality gap, the resulting policy is certified robust in the chosen class (Gimelfarb et al., 2024).
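The structure of such a loop can be sketched on a deliberately small problem. Everything below is an illustrative assumption: a one-parameter order-up-to policy `theta`, a finite grid of demand scenarios, linear holding and backlog costs, and brute-force enumeration standing in for the mixed-integer solvers that a real implementation would use at both levels.

```python
def regret(theta, demand, holding=1.0, backlog=3.0):
    """Regret of order-up-to level theta under a demand scenario.

    The best level in hindsight (theta == demand) has zero cost,
    so regret equals the incurred cost.
    """
    return holding * max(theta - demand, 0) + backlog * max(demand - theta, 0)


def constraint_generation(thetas, demands, tol=1e-9):
    """Minimize worst-case regret over `demands` by growing an active scenario set."""
    active = [demands[len(demands) // 2]]    # arbitrary initial scenario
    while True:
        # Outer problem: best policy against the scenarios found so far (lower bound).
        theta = min(thetas, key=lambda th: max(regret(th, d) for d in active))
        lower = max(regret(theta, d) for d in active)
        # Inner problem: worst-case scenario for the current policy (upper bound).
        worst = max(demands, key=lambda d: regret(theta, d))
        upper = regret(theta, worst)
        if upper - lower <= tol:             # zero gap: theta is certified optimal
            return theta, upper, active
        active.append(worst)                 # add the violated scenario and repeat


grid = list(range(11))
best_theta, worst_regret, scenarios = constraint_generation(grid, grid)
```

On this instance the loop stops after three outer solves with `best_theta == 8`; the returned `scenarios` list contains the adversarial demands that were generated along the way, which is the same information a practitioner would inspect to understand a policy's failure modes.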

Mixed-Integer Quadratic Programming for Minimal Deviations:

To minimally perturb a baseline policy (in, e.g., total-variation distance) while enforcing risk or reachability constraints, the problem is encoded as a mixed-integer quadratically constrained quadratic program (MIQCQP). This setup accommodates diversity-promoting penalties to extract multiple, novel counterfactual strategies (Kobialka et al., 14 May 2025).

4. Theoretical Guarantees and Robustness Properties

These methodologies achieve explicit robustness properties:

Epistemic robustness under causal ambiguity:

Interval MDP approaches guarantee that the bounds on all counterfactual transition probabilities are tight over the full set of SCMs compatible with the observational and interventional distributions, yielding policies provably robust to identification ambiguity (Lally et al., 19 Feb 2025).

Robustness to noncausal shift:

C-MBPO’s counterfactual regularizer discourages the learned policy from exploiting spurious, noncausal correlations, so that performance degradation under shifts that alter noncausal links is bounded (Caron et al., 12 Mar 2025).

Optimality within policy classes:

Constraint-generation and MIQCQP methods provide, upon convergence, policies that are globally optimal within the specified policy class under the worst-case regret or risk definition (Gimelfarb et al., 2024, Kobialka et al., 14 May 2025).

Faithfulness and reward-influence trade-offs:

By adjusting the influence horizon $k$ and the allowed number of action deviations, the reward–influence curve is controlled: a small $k$ retains faithfulness to the original trajectory (counterfactual), whereas a large $k$ allows potentially higher-reward but less tailored, interventional solutions (Kazemi et al., 2024).

Computational guarantees:

Interval-MDP value iteration has polynomial time complexity per sweep, and constraint-generation methods can deliver regret certificates (upper bounds) at every iteration (Lally et al., 19 Feb 2025, Gimelfarb et al., 2024). MIQCQP approaches are nonconvex and may time out for very large state spaces, but are often tractable for models with up to tens of thousands of states (Kobialka et al., 14 May 2025).

5. Empirical Results and Benchmark Analyses

Empirical studies demonstrate the practical viability and effects of robust counterfactual optimization:

C-MBPO achieves returns 15–30% higher than standard MBRL and 20–40% higher than model-free PPO under strong distributional drift. For example, in a pendulum domain under near-OOD conditions: MBPO 120±5, PPO 110±10, C-MBPO 140±8; under far-OOD conditions: MBPO 80±7, PPO 90±12, C-MBPO 130±9 (Caron et al., 12 Mar 2025).
Interval MDP policies are more conservative and achieve improved minimum cumulative reward in high-stochasticity and medical settings versus SCM sampling methods, with computation taking milliseconds rather than seconds (Lally et al., 19 Feb 2025).
Influence-constrained MDPs yield smooth trade-offs between reward and faithfulness, and enable finding near-optimal counterfactual plans that maintain a causal tether to the original data. In sepsis treatment, optimal counterfactuals depend strongly on whether critical catastrophic endpoints are avoidable within tight influence constraints (Kazemi et al., 2024).
Constraint-generation (CGPO) recovers classical policies (e.g., $(s, S)$ inventory control rules, threshold-reservoir policies), and produces adversarial worst-case trajectories that directly expose the policy's failure modes (Gimelfarb et al., 2024).
MIQCQP synthesis solves all nontrivial instances optimally in real-world log-derived MDPs, and can generate sets of diverse counterfactual policies, typically with 20–40% of changed actions occurring in novel states (Kobialka et al., 14 May 2025).

6. Influence Constraints, Model Ambiguity, and Practical Considerations

Counterfactual optimization must address the risk that counterfactual trajectories become untethered from the observed data, transitioning from truly counterfactual to merely interventional evaluation. To ensure that policy recommendations remain tailored to individual trajectories, “$k$-step influence” is enforced: the path remains empirically relevant for $k$-step deviations. Efficient pruning algorithms restrict the accessible states and transitions in counterfactual planning to those sufficiently anchored in the factual path (Kazemi et al., 2024).

On the causal inference side, robust optimization circumvents the nonidentifiability of counterfactuals by optimizing under all compatible SCMs (interval MDP), as opposed to selecting a single, potentially misspecified SCM (Lally et al., 19 Feb 2025). This produces policies optimized for the worst-case plausible counterfactual, rather than for inferences tied to a possibly arbitrary structural choice.

7. Extensions and Application Domains

Robust counterfactual optimization methods are applied to domains including program evaluation, medical treatment (notably sepsis), inventory control under demand uncertainty, water reservoir management, and continuous-control tasks with strongly nonlinear dynamics. In these contexts, counterfactual optimization frameworks improve explainability (via generated counterfactual trajectories), yield performance guarantees even under extreme dynamics shifts, and construct diverse, minimally modified policy recommendations with explicit safety and interpretability properties (Gimelfarb et al., 2024, Caron et al., 12 Mar 2025, Kobialka et al., 14 May 2025).

Robustness to distributional shift, nonparametric model ambiguity, and explicit influence constraints collectively characterize the methodological landscape, with ongoing work focusing on scalable solution techniques and principled balancing of reward gain versus causal faithfulness.