Adversarial MDPs: Challenges & Algorithms

Updated 8 March 2026

Adversarial MDPs are reinforcement learning frameworks where an adversary selects rewards and transitions, challenging conventional stationarity assumptions.
They focus on regret minimization and trade-offs between robustness and adaptability in worst-case, nonstationary environments.
Key methodologies include online convex optimization, policy optimization, and robust control techniques to ensure secure and adaptive learning.

Adversarial Markov Decision Processes (MDPs) generalize classical reinforcement learning by relaxing the standard stationarity assumption, allowing transition kernels or reward/cost functions to be chosen by an adversary. This adversarial regime models fundamental challenges in online learning, robust control, multi-agent systems, and security, where the environment may change arbitrarily, or even be chosen specifically to degrade the agent's performance or its satisfaction of constraints. Central questions include regret minimization under worst-case nonstationarity, trade-offs between robustness and adaptability, and the fundamental limits imposed by feedback, structure, and adversary power.

1. Formal Models of Adversarial MDPs

The adversarial MDP paradigm encompasses a spectrum of nonstationarity, structured by which components are adversarial:

Adversarial rewards/losses: At each episode or step, the cost/reward function is chosen adversarially; transitions may be fixed or unknown but stationary (Wang et al., 2020, Rosenberg et al., 2019, Tiapkin et al., 2024).
Adversarial transitions: The transition kernel can be adversarial, either at all steps or for a subset (partially adversarial transitions) (Abbasi-Yadkori et al., 2013, Schlisselberg et al., 10 Feb 2026).
Combined adversary: Both transitions and rewards are adversarial, possibly simultaneously.
Preference-based or bandit feedback: The learner observes not the full loss function, but only relative preferences or only the loss incurred along its own trajectory (Tsuchiya et al., 15 Jul 2025, Arora et al., 2012).

A canonical mathematical model for episodic adversarial MDPs specifies:

State space $S$, action space $A$, horizon $H$, number of episodes $T$.
At episode $t$ , an adversary selects a loss/reward function $\ell_t: S \times A \to [0,1]$ and/or a transition kernel $P_t(\cdot|s,a)$ (Abbasi-Yadkori et al., 2013, Schlisselberg et al., 10 Feb 2026).
The agent selects a policy $\pi_t$, plays it for one episode, and incurs the loss along the realized trajectory.
Performance is measured by (pseudo-)regret relative to the best fixed policy in hindsight.
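The protocol above can be made concrete on a toy instance. The following is a minimal sketch, assuming a known and fixed transition kernel, an oblivious adversary, and a state space small enough to enumerate every deterministic policy by brute force. The sizes, the kernel `P`, and all function names are hypothetical and chosen only for illustration.

```python
import itertools
import random

H, S, A, T = 2, 2, 2, 50  # horizon, states, actions, episodes (toy sizes)

# Fixed kernel: P[s][a] is the next-state distribution (hypothetical values).
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]

def expected_loss(policy, loss, start=0):
    """Expected H-step loss of a deterministic policy[h][s] under loss[h][s][a]."""
    dist = [0.0] * S
    dist[start] = 1.0
    total = 0.0
    for h in range(H):
        nxt = [0.0] * S
        for s in range(S):
            a = policy[h][s]
            total += dist[s] * loss[h][s][a]
            for s2 in range(S):
                nxt[s2] += dist[s] * P[s][a][s2]
        dist = nxt
    return total

def all_policies():
    """Enumerate every deterministic policy as a tuple of per-step action tuples."""
    per_step = list(itertools.product(range(A), repeat=S))
    return list(itertools.product(per_step, repeat=H))

def regret(played, losses):
    """Learner's cumulative expected loss minus that of the best fixed policy."""
    learner = sum(expected_loss(pi, l) for pi, l in zip(played, losses))
    best = min(sum(expected_loss(pi, l) for l in losses) for pi in all_policies())
    return learner - best

rng = random.Random(0)
# Oblivious adversary: the whole loss sequence is fixed before play begins.
losses = [[[[rng.random() for _ in range(A)] for _ in range(S)] for _ in range(H)]
          for _ in range(T)]
# Placeholder learner that always plays action 0; a real algorithm would adapt.
played = [all_policies()[0]] * T
print(f"regret of the constant policy: {regret(played, losses):.3f}")
```

Brute-force enumeration costs $A^{SH}$ policy evaluations, so it is only usable as a reference implementation for checking a learner on very small instances.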

Degree of adversarial power (oblivious vs adaptive), type of feedback (full, bandit, delayed), and structural constraints (e.g., loop-free, weakly communicating, mixing time bounds) are critical modeling axes.

2. Fundamental Regret Rates and Lower Bounds

Regret—the accumulated difference between the learner’s total loss and that of the best fixed policy in hindsight—is the primary minimax performance metric.
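In symbols, the regret over $T$ episodes can be written as follows. The notation is introduced here for illustration and is not taken from the cited papers: $V_t^{\pi}(s_1)$ denotes the expected loss of policy $\pi$ in episode $t$ from an assumed fixed initial state $s_1$.

```latex
V_t^{\pi}(s_1) = \mathbb{E}\left[ \sum_{h=1}^{H} \ell_t(s_h, a_h) \,\middle|\, \pi, P_t \right],
\qquad
R_T = \sum_{t=1}^{T} V_t^{\pi_t}(s_1) \;-\; \min_{\pi} \sum_{t=1}^{T} V_t^{\pi}(s_1).
```

With stationary transitions, $P_t = P$ for all $t$. A sublinear bound $R_T = o(T)$ means that the learner's average per-episode loss approaches that of the best fixed policy in hindsight.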

Adversarial losses, stationary transitions: The classic result is that regret scales as $O(\sqrt{HSA\,T})$ for tabular settings, where $H$ = horizon, $S$ = states, $A$ = actions, $T$ = episodes (Tiapkin et al., 2024, Tsuchiya et al., 15 Jul 2025). Lower bounds match (Tsuchiya et al., 15 Jul 2025).
Partially adversarial transitions: For $\Lambda$ adversarial transition steps per episode, regret scales as $\tilde O(H S^\Lambda \sqrt{KSA^{\Lambda+1}})$, where $K$ is the number of episodes (the role played by $T$ above) (Schlisselberg et al., 10 Feb 2026).
Fully adversarial transitions: In the worst case (adversarial transitions at every step, bandit feedback), regret is exponential in the horizon: $R_K = \Theta(\sqrt{A^H S K})$ (Schlisselberg et al., 10 Feb 2026).
Bandit feedback: When only the incurred loss is observed, the minimax rate worsens (e.g., $O(T^{3/4})$ in deterministic MDPs with adversarial rewards and bandit feedback (Arora et al., 2012)).
Unknown transitions: Optimism under uncertainty and adaptive confidence sets enable regret of $\tilde O(H S \sqrt{A T})$ under either bandit or full-information feedback (Jin et al., 2019, Rosenberg et al., 2019).
Preference-based losses (dueling feedback): For preference-based MDPs under Borda-score feedback, lower bounds scale as $\Omega((H^2 S K)^{1/3}T^{2/3})$ (Tsuchiya et al., 15 Jul 2025).
Linear/structured MDPs: In linear mixture MDPs with adversarial rewards and known features, regret is $\tilde O(d H \sqrt{T})$ where $d$ is the feature dimension (He et al., 2021).

A summary table for leading regret rates (tabular, episodic):

Setting	Minimax Regret	Reference
Adversarial loss, known $P$	$\Theta(\sqrt{HSA\,T})$	(Tsuchiya et al., 15 Jul 2025)
PbMDP w/ Borda scores	$\Omega((H^2SK)^{1/3}T^{2/3})$	(Tsuchiya et al., 15 Jul 2025)
$\Lambda$ adversarial steps	$\tilde O(H S^\Lambda \sqrt{KSA^{\Lambda+1}})$	(Schlisselberg et al., 10 Feb 2026)
Fully adv. trans., bandit loss	$\Theta(\sqrt{A^H S K})$	(Schlisselberg et al., 10 Feb 2026)
Bandit feedback, unknown $P$	$\tilde O(H S \sqrt{A T})$	(Jin et al., 2019)

Sharp lower bounds illustrate that adversarial nonstationarity is fundamentally harder than classical stochastic RL.
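To get a feel for how differently these rates scale, the expressions in the table can be evaluated for concrete sizes. The sketch below drops all constants and logarithmic factors, so the numbers indicate scaling only and are not actual regret values. The function names and the chosen problem sizes are illustrative assumptions.

```python
import math

def rate_adversarial_loss(H, S, A, T):
    """sqrt(H S A T): adversarial losses, known transitions."""
    return math.sqrt(H * S * A * T)

def rate_partially_adversarial(H, S, A, K, lam):
    """H * S^lam * sqrt(K S A^(lam+1)): lam adversarial transition steps."""
    return H * S**lam * math.sqrt(K * S * A ** (lam + 1))

def rate_fully_adversarial(H, S, A, K):
    """sqrt(A^H S K): adversarial transitions at every step, bandit loss."""
    return math.sqrt(A**H * S * K)

H, S, A, K = 10, 5, 4, 10_000
# With per-step losses in [0, 1], regret can never exceed H per episode.
trivial = H * K

rows = [
    ("adversarial loss, known P", rate_adversarial_loss(H, S, A, K)),
    ("1 adversarial step", rate_partially_adversarial(H, S, A, K, 1)),
    ("2 adversarial steps", rate_partially_adversarial(H, S, A, K, 2)),
    ("fully adversarial transitions", rate_fully_adversarial(H, S, A, K)),
]
for name, value in rows:
    flag = "vacuous" if value >= trivial else "informative"
    print(f"{name:32s} {value:12.0f}  ({flag} vs. trivial bound {trivial})")
```

For these sizes the fully adversarial rate already exceeds the trivial bound $HK$, which is one way to read the statement that regret becomes exponential in the horizon.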

3. Core Algorithms and Methodologies

Several algorithmic frameworks dominate the adversarial MDP landscape:

Online Convex Optimization in Occupancy Space: Reduces adversarial MDP learning to online mirror descent (OMD) over occupancy measures, using entropic (KL) or other regularization (Rosenberg et al., 2019, Jin et al., 2019, Stradi et al., 2024). Mirror descent is performed within confidence sets induced by transition and cost estimation, supporting adversarial losses and unknown dynamics.
Follow-the-Perturbed-Leader (FPL) and Bandit Linear Optimization: FPL adds random perturbations to the cumulative losses before selecting the best policy in hindsight, which stabilizes policy selection under adversarial reward sequences, with matching minimax regret for both known and unknown transitions (Wang et al., 2020, Rosenberg et al., 2019). Reductions to bandit linear optimization are essential under bandit feedback (Arora et al., 2012).
Policy Optimization (PO): Recent advances leverage black-box online linear optimization (OLO) on estimated advantage functions, with per-$(h,s)$ OLO updates and dynamic programming value propagation, avoiding the need for explicit occupancy-measure optimization (Tiapkin et al., 2024, Lancewicki et al., 2020). This leads to improved $S$-dependence and high implementation efficiency.
Optimistic Algorithms: Optimism in the face of uncertainty is implemented via dynamic maintenance of confidence sets for unknown transitions (e.g., through empirical Bernstein bounds), ensuring the learner is always optimistic with respect to the best possible model consistent with the observed data (Jin et al., 2019, He et al., 2021).
Conditioned Occupancy Measures (COMs): To handle partially adversarial transitions, conditioned occupancy measures allow effective convexification by conditioning on realized tuples at adversarial steps, enabling OMD and sharper regret rates (Schlisselberg et al., 10 Feb 2026).
Explicit Exploration and Delay Compensation: Algorithms for delayed feedback (e.g., Delayed-OPPO) inject explicit uniform exploration at states where feedback is delayed, mitigating the impact of delayed loss/cost observations (Lancewicki et al., 2020).
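Of the frameworks listed above, FPL is perhaps the simplest to sketch in code. The following is a minimal illustration, assuming known transitions, full-information feedback, and exponentially distributed perturbations with rate `eta`. It is not the exact procedure of any cited paper, and the names `best_response` and `fpl` are hypothetical. Because expected loss is linear in the loss table, the leader for the perturbed cumulative losses can be found by a single backward dynamic-programming pass.

```python
import random

def best_response(loss_table, P, H, S, A):
    """Backward DP: deterministic policy minimising expected total loss
    under loss_table[h][s][a] and kernel P[s][a][s2]."""
    V = [0.0] * S
    policy = [[0] * S for _ in range(H)]
    for h in reversed(range(H)):
        new_V = [0.0] * S
        for s in range(S):
            q = [loss_table[h][s][a] + sum(P[s][a][s2] * V[s2] for s2 in range(S))
                 for a in range(A)]
            policy[h][s] = min(range(A), key=q.__getitem__)
            new_V[s] = q[policy[h][s]]
        V = new_V
    return policy

def fpl(losses, P, H, S, A, eta, seed=0):
    """Play the leader for cumulative losses minus fresh exponential noise."""
    rng = random.Random(seed)
    cum = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    played = []
    for loss in losses:
        # Fresh perturbation each episode; expovariate(eta) has mean 1/eta.
        perturbed = [[[cum[h][s][a] - rng.expovariate(eta) for a in range(A)]
                      for s in range(S)] for h in range(H)]
        played.append(best_response(perturbed, P, H, S, A))
        # Full information: the whole loss table is revealed after the episode.
        for h in range(H):
            for s in range(S):
                for a in range(A):
                    cum[h][s][a] += loss[h][s][a]
    return played

# Toy instance (hypothetical numbers): action 0 is always free, action 1 costs 1.
H, S, A, T = 3, 2, 2, 20
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
losses = [[[[0.0, 1.0] for _ in range(S)] for _ in range(H)] for _ in range(T)]
played = fpl(losses, P, H, S, A, eta=100.0)
print(played[-1])
```

A smaller `eta` means larger noise and more stable but slower-adapting play. The value used here is chosen only so that the toy run is easy to read, not according to any regret-optimal tuning.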

For preference-based MDPs, the use of unbiased and importance-weighted estimators for Borda scores is essential to navigate the exploration-exploitation tradeoff (Tsuchiya et al., 15 Jul 2025).

4. Robustness, Constraints, and Security Considerations

Adversarial MDP analysis naturally connects to robustness, safety, and adversarial attack/defense in reinforcement learning:

Distributionally Robust MDPs (DRMDPs): Seek policies maximizing the worst-case expected reward under a set of plausible models (e.g., Wasserstein or $\phi$-divergence ambiguity sets), yielding tractable robust Bellman recursions (Chen et al., 2018). Solutions for robust MDPs can be formulated as a sequence of convex programs, with guarantees under mild regularity.
Worst-Case Nonstationary MDPs: When the evolution of dynamics/rewards is bounded (e.g., Lipschitz in time), robust zero-shot planning can be framed as an adversarial game: the Risk-Averse Tree Search (RATS) algorithm performs minimax backups at each step, realizing worst-case guarantees under model evolution bounds (Lecarpentier et al., 2019).
Constrained MDPs with Adversarial Losses: Algorithms optimize regret subject to hard (stochastic) constraints (e.g., cost, safety), ensuring either sublinear cumulative positive constraint violation (Stradi et al., 2024) or, under Slater-type conditions, per-episode feasibility with known strictly feasible policies. Regret bounds incur unavoidable dependence on the Slater margin parameter.
Control-Channel and Actuator Attacks: Adversarial agents may manipulate the actions executed by a controller (either openly or covertly). Information-theoretic stealth constraints (e.g., KL-divergence or empirical state transition match) define trade-offs between the attacker’s impact and detectability (Russo et al., 2021, Santi et al., 28 Jan 2025). Adversarial policies can maximize the degradation in return while remaining undetected or within desired error-exponent thresholds.
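The robust Bellman recursion mentioned for DRMDPs can be illustrated in a heavily simplified form. The sketch below assumes a finite list of candidate next-state distributions for each state-action pair (an $(s,a)$-rectangular ambiguity set) instead of the Wasserstein or $\phi$-divergence balls used in the literature, so the inner minimisation is a plain `min` and not a convex program. The function name and the toy numbers are hypothetical.

```python
def robust_value_iteration(reward, kernels, H):
    """Finite-horizon robust backup.

    reward[s][a]  : immediate reward
    kernels[s][a] : list of candidate next-state distributions (ambiguity set)
    Returns the robust values at the first step and the policy[h][s].
    """
    S, A = len(reward), len(reward[0])
    V = [0.0] * S
    policy = []
    for _ in range(H):
        # Nature picks the worst candidate kernel; the agent then maximises.
        Q = [[reward[s][a]
              + min(sum(p[s2] * V[s2] for s2 in range(S)) for p in kernels[s][a])
              for a in range(A)] for s in range(S)]
        policy.append([max(range(A), key=Q[s].__getitem__) for s in range(S)])
        V = [max(Q[s]) for s in range(S)]
    policy.reverse()  # backups ran from the last step to the first
    return V, policy

# State 0: "safe" action 0 (reward 0.5, stays put) vs. "risky" action 1
# (reward 1.0, but one candidate kernel sends the agent to dead-end state 1).
reward = [[0.5, 1.0],
          [0.0, 0.0]]
kernels = [[[[1.0, 0.0]], [[1.0, 0.0], [0.0, 1.0]]],
           [[[0.0, 1.0]], [[0.0, 1.0]]]]
V, policy = robust_value_iteration(reward, kernels, H=3)
print(V[0], [policy[h][0] for h in range(3)])
```

In this toy instance the robust policy plays safe until the final step, where the risky action no longer has a downside. A planner that trusted only the nominal kernel would take the risky action throughout.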

5. Extensions: Delayed, Partial, and Bandit Feedback

Realistic reinforcement learning scenarios often involve feedback limitations, further complicating adversarial MDPs:

Delayed Feedback: When adversarial costs/losses are revealed with delay (possibly adversarial and unbounded), explicit algorithms achieve regret scaling as $\sqrt{K+D}$, where $D$ is the cumulative delay (Lancewicki et al., 2020).
Bandit Feedback: Observing only the realized losses necessitates importance-weighted and optimistic estimators, leading to more challenging regret scaling ($\tilde O(T^{3/4})$ or $O(K^{2/3})$ in some settings) (Arora et al., 2012, Schlisselberg et al., 10 Feb 2026). The structure of the transition dynamics critically determines whether exponential regret can be avoided.
Preference/Borda Feedback: When only preferences are observed (as in dueling bandit RL), the learning problem is strictly harder than adversarial numeric loss feedback, resulting in $T^{2/3}$ minimax regret (Tsuchiya et al., 15 Jul 2025).
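The importance-weighted estimators mentioned for bandit feedback can be sketched as follows. This is a minimal illustration, assuming known transitions so that the occupancy measure $q(h,s,a)$ of the played policy can be computed exactly. The estimator divides each observed loss by the probability of having visited that state-action pair, which makes it unbiased wherever $q > 0$. All names and numbers are hypothetical, and the cited algorithms use refinements (such as optimistic occupancy bounds) that are not shown here.

```python
import random

def occupancy(policy, P, H, S, A, start=0):
    """q[h][s][a]: probability of visiting (s, a) at step h under policy[h][s][a]."""
    q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    dist = [0.0] * S
    dist[start] = 1.0
    for h in range(H):
        nxt = [0.0] * S
        for s in range(S):
            for a in range(A):
                q[h][s][a] = dist[s] * policy[h][s][a]
                for s2 in range(S):
                    nxt[s2] += q[h][s][a] * P[s][a][s2]
        dist = nxt
    return q

def sample_trajectory(policy, P, H, S, A, rng, start=0):
    """Roll out one episode and return the visited (state, action) pairs."""
    s, traj = start, []
    for h in range(H):
        a = rng.choices(range(A), weights=policy[h][s])[0]
        traj.append((s, a))
        s = rng.choices(range(S), weights=P[s][a])[0]
    return traj

def iw_estimate(traj, loss, q, H, S, A):
    """Loss / visit probability on visited pairs, zero everywhere else."""
    est = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    for h, (s, a) in enumerate(traj):
        est[h][s][a] = loss[h][s][a] / q[h][s][a]
    return est

H, S, A, N = 2, 2, 2, 20_000
P = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
policy = [[[0.5, 0.5] for _ in range(S)] for _ in range(H)]  # uniform policy
loss = [[[0.2 + 0.3 * a + 0.1 * s + 0.2 * h for a in range(A)]
         for s in range(S)] for h in range(H)]
q = occupancy(policy, P, H, S, A)

# Monte Carlo check: the average estimate approaches the true loss where q > 0.
rng = random.Random(0)
avg = [[[0.0] * A for _ in range(S)] for _ in range(H)]
for _ in range(N):
    est = iw_estimate(sample_trajectory(policy, P, H, S, A, rng), loss, q, H, S, A)
    for h in range(H):
        for s in range(S):
            for a in range(A):
                avg[h][s][a] += est[h][s][a] / N
max_err = max(abs(avg[h][s][a] - loss[h][s][a])
              for h in range(H) for s in range(S) for a in range(A)
              if q[h][s][a] > 0)
print(f"largest deviation on reachable pairs: {max_err:.3f}")
```

The variance of this estimator grows like $1/q(h,s,a)$, which is one reason bandit feedback leads to worse regret rates than full information.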

6. Open Problems and Research Directions

Recent advances in adversarial MDPs highlight several prominent open issues:

Computational Efficiency vs. Approximation: While OMD and FPL yield minimax regret, efficient realization of some structured approaches (e.g., conditioned occupancy for large $\Lambda$) remains open, particularly when identifying adversarial transition steps adaptively (Schlisselberg et al., 10 Feb 2026).
Closing the Regret Gap: The gap between stochastic and adversarial minimax rates in $S$, $A$, $H$ has been partially closed using PO/OLO techniques, but optimal $H$-dependence, especially under partial feedback, is not fully resolved (Tiapkin et al., 2024).
Adaptive/Non-oblivious Adversaries: Most results assume oblivious adversaries (loss/transition sequences fixed in advance, without knowledge of the agent's actions), but extensions to adaptive adversaries (which react to the learner's strategy) are less mature (Lancewicki et al., 2020).
Function Approximation: Extending adversarial MDP minimax analysis to general function approximation (beyond tabular or linear settings) and rich observation spaces is an active area (Lancewicki et al., 2020, He et al., 2021).
Hard Constraints and Online Feasibility: Achieving adversarial regret guarantees under strict safety or cost constraints, especially without knowledge of strictly feasible policies, imposes additional complexity and is not fully understood (Stradi et al., 2024).
Optimal Security-Performance Tradeoff: Characterizing fundamental limits and efficient computation for covert attacks and robust defense in adversarial MDPs—particularly when the environment is partially observable or feedback is delayed—is evolving rapidly (Russo et al., 2021, Santi et al., 28 Jan 2025).

7. Significance and Broader Impact

Adversarial MDPs unify key threads in online learning, robust control, convex optimization, and security, driving theory and practice in designing agents resilient to worst-case scenarios across RL applications. This framework contributes to the theoretical foundations for robust RL algorithms with quantifiable guarantees in adversarial, nonstationary, and safety-critical systems. Techniques developed in this context—conditioned occupancy measures, refined optimistic planning, black-box online policy optimization—are broadly influential for both investigation of RL limits and the design of practical algorithms that are robust, adaptive, and reliable (Herremans et al., 2024, Schlisselberg et al., 10 Feb 2026).