Markov Decision Processes (MDPs)
- Markov Decision Processes (MDPs) are formal models defined by states, actions, probabilistic transitions, rewards, and a discount factor, with the objective of optimizing cumulative performance.
- They are solved with dynamic programming methods such as value iteration and policy iteration, or via linear programming, all of which converge to optimal policies.
- Extensions including constraints, risk sensitivity, and robust formulations enable MDPs to address complex, real-world sequential decision-making challenges.
A Markov Decision Process (MDP) is a foundational formalism for modeling sequential decision-making under stochastic dynamics. MDPs are widely used across operations research, artificial intelligence, reinforcement learning, control theory, and related domains. An MDP encodes the evolution of a system whose transitions are governed by both a controller's actions and probabilistic dynamics, with the objective of synthesizing a policy that optimizes a cumulative performance criterion.
1. Mathematical Definition and Structure
An MDP is defined by the tuple $(S, A, P, r, \gamma)$:
- $S$: finite or measurable state space.
- $A$: finite or measurable action space; $A(s)$ denotes the set of admissible actions in state $s$.
- $P$: transition kernel, with $P(s' \mid s, a)$ giving the probability of transitioning from $s$ to $s'$ under action $a$.
- $r: S \times A \to \mathbb{R}$: immediate reward function.
- $\gamma \in [0, 1)$: discount factor (for infinite-horizon discounted reward).
The agent's behavior is specified by a policy $\pi$; for stationary Markov policies, $\pi: S \to A$, or $\pi: S \to \Delta(A)$ in the randomized case. The induced trajectory evolves as $a_t \sim \pi(\cdot \mid s_t)$, $s_{t+1} \sim P(\cdot \mid s_t, a_t)$. The value function under policy $\pi$ is
$$V^{\pi}(s) = \mathbb{E}^{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t) \,\middle|\, s_0 = s\right],$$
and the optimal value is $V^{*}(s) = \sup_{\pi} V^{\pi}(s)$. The Bellman optimality equations read
$$V^{*}(s) = \max_{a \in A(s)} \Big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{*}(s') \Big],$$
with the $Q$-function recursively defined as
$$Q^{*}(s, a) = r(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \max_{a' \in A(s')} Q^{*}(s', a').$$
MDPs also admit average-reward, finite-horizon, and risk-sensitive extensions (Suilen et al., 18 Nov 2024, Bäuerle et al., 2020, Chamie et al., 2015).
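To make the tuple concrete, here is a minimal sketch of a hypothetical two-state machine-maintenance MDP written as plain Python data; the state names, actions, and numbers are purely illustrative and not drawn from the cited works.

```python
# Hypothetical two-state MDP (S, A, P, r, gamma) as plain Python data.
S = ["ok", "broken"]
A = {"ok": ["run", "service"], "broken": ["repair"]}

# P[s][a][s'] = probability of moving from state s to s' under action a
P = {
    "ok": {
        "run":     {"ok": 0.9, "broken": 0.1},
        "service": {"ok": 1.0, "broken": 0.0},
    },
    "broken": {
        "repair": {"ok": 0.8, "broken": 0.2},
    },
}

# r[s][a] = immediate reward for taking action a in state s
r = {
    "ok":     {"run": 10.0, "service": 6.0},
    "broken": {"repair": -5.0},
}

gamma = 0.95  # discount factor for the infinite-horizon discounted criterion
```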
2. Classical Solution Methods and Computational Properties
Canonical methods for solving MDPs exploit the contraction properties of the Bellman operator. The main dynamic programming algorithms are:
- Value Iteration: Repeatedly applies the Bellman backup until the value function converges to within a specified tolerance.
- Policy Iteration: Alternates policy evaluation (solving for $V^{\pi}$ under the current policy) and policy improvement (greedily choosing actions that maximize the evaluated $Q^{\pi}$).
- Linear Programming Formulation: Particularly for finite-horizon and some constrained MDPs, the value function can also be recovered as the solution to a linear program over state(-action) occupation measures (Chamie et al., 2015).
Each iteration of value iteration costs $O(|S|^2 |A|)$ arithmetic operations for discounted MDPs with finite $S$ and $A$, and the number of iterations needed for an $\varepsilon$-accurate solution scales as $O\!\big(\tfrac{1}{1-\gamma}\log\tfrac{1}{\varepsilon}\big)$ (Suilen et al., 18 Nov 2024). Policy iteration converges in a finite number of steps but may require exponentially many iterations in the worst case; in practice, it is often much faster. For large-scale problems, function approximation (linear or deep) is deployed, and sample-based reinforcement learning algorithms (e.g., $Q$-learning, Blackwell $Q$-learning) can be employed (Li et al., 2020, Wang et al., 2017).
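As a minimal illustration of the Bellman backup and the value-iteration loop described above, the following sketch operates on tabular NumPy arrays; the function name, array layout, and stopping rule are illustrative choices rather than any specific paper's implementation.

```python
import numpy as np

def value_iteration(P, r, gamma, tol=1e-8, max_iter=10_000):
    """Tabular value iteration via repeated Bellman backups.

    P: array of shape (S, A, S) with P[s, a, s'] the transition probability.
    r: array of shape (S, A) with immediate rewards.
    Returns the value vector V and a greedy deterministic policy.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(max_iter):
        # Bellman backup: Q[s, a] = r(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
        Q = r + gamma * (P @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    greedy_policy = Q.argmax(axis=1)
    return V, greedy_policy
```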
3. Extensions: Constraints, Uncertainty, and Risk
State and Action Constraints
Constrained MDPs (CMDPs) impose constraints on the probability distribution over states or actions, e.g., upper bounds $d_t(s) \le c(s)$ on the marginal state distribution at each time step (Chamie et al., 2015). The optimal policy typically requires randomization and may be non-stationary, since the constraints destroy the decoupled Bellman recursion. Finite-horizon CMDPs can be solved via a sequence of LPs over decision-rule polytopes, with projection heuristics available for feasible policy selection.
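The following sketch illustrates the occupation-measure LP in its simpler discounted form, using SciPy's `linprog`; the hypothetical `state_cap` vector stands in for a marginal-distribution constraint, and this is not the finite-horizon sequence-of-LPs construction referenced above.

```python
import numpy as np
from scipy.optimize import linprog

def solve_cmdp_lp(P, r, gamma, mu0, state_cap):
    """Discounted CMDP sketch: an LP over state-action occupation measures x[s, a].

    Flow constraints: sum_a x[s, a] - gamma * sum_{s', a'} P[s', a', s] x[s', a'] = mu0[s]
    Objective:        maximize sum_{s, a} r[s, a] x[s, a]
    Constraint:       sum_a x[s, a] <= state_cap[s]  (cap on discounted state occupancy)
    """
    n_s, n_a, _ = P.shape
    n_var = n_s * n_a  # variable index: s * n_a + a

    # Flow-conservation equalities
    A_eq = np.zeros((n_s, n_var))
    for s in range(n_s):
        for sp in range(n_s):
            for ap in range(n_a):
                A_eq[s, sp * n_a + ap] -= gamma * P[sp, ap, s]
        for a in range(n_a):
            A_eq[s, s * n_a + a] += 1.0

    # Per-state occupancy caps
    A_ub = np.zeros((n_s, n_var))
    for s in range(n_s):
        A_ub[s, s * n_a:(s + 1) * n_a] = 1.0

    # linprog minimizes, so negate the reward objective
    res = linprog(-r.reshape(n_var), A_ub=A_ub, b_ub=state_cap,
                  A_eq=A_eq, b_eq=mu0, bounds=[(0, None)] * n_var, method="highs")
    x = res.x.reshape(n_s, n_a)

    # Randomized stationary policy recovered from the occupation measure
    policy = x / np.maximum(x.sum(axis=1, keepdims=True), 1e-12)
    return policy, x
```

Consistent with the discussion above, the recovered policy is generally randomized whenever an occupancy cap is active.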
Model Uncertainty: Robust and Multi-Objective MDPs
Robust MDPs (RMDPs) allow the transition kernel to lie in an uncertainty set $\mathcal{P}$, often $(s,a)$-rectangular (i.e., $\mathcal{P} = \bigtimes_{s,a} \mathcal{P}_{s,a}$). The robust Bellman operator becomes
$$(TV)(s) = \max_{a \in A(s)} \min_{p \in \mathcal{P}_{s,a}} \Big[ r(s, a) + \gamma \sum_{s'} p(s')\, V(s') \Big].$$
Robust value/policy iteration adapts directly. Special cases include interval MDPs (IMDPs), multi-environment MDPs, and related parametric-uncertainty classes. These are foundational for quantitative verification, model-based RL under uncertainty, and distributionally robust control (Suilen et al., 18 Nov 2024, Scheftelowitsch et al., 2017).
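A minimal sketch of robust value iteration follows, assuming the rectangular uncertainty set at each state-action pair is a finite collection of candidate kernels, so the inner minimization is a plain `min`; for general convex sets, the inner step would instead be a small optimization problem per state-action pair.

```python
import numpy as np

def robust_value_iteration(P_models, r, gamma, tol=1e-8, max_iter=10_000):
    """Robust value iteration under an (s, a)-rectangular, finite uncertainty set.

    P_models: array of shape (K, S, A, S) holding K candidate transition kernels;
              the adversary independently picks the worst kernel for each (s, a).
    r: array of shape (S, A) with immediate rewards.
    """
    K, n_s, n_a, _ = P_models.shape
    V = np.zeros(n_s)
    for _ in range(max_iter):
        next_vals = P_models @ V            # expected next value per model: (K, S, A)
        worst_case = next_vals.min(axis=0)  # inner min over the uncertainty set
        Q = r + gamma * worst_case          # robust Bellman backup
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)
```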
MDPs with uncertain transitions and rewards can be treated in a multi-objective fashion, optimizing, e.g., worst-case, best-case, and average-case performance simultaneously. Pareto-optimal policy sets can be computed using multi-objective policy iteration and heuristic methods (Scheftelowitsch et al., 2017).
Risk Sensitivity and Recursion
Risk-sensitive MDPs replace the expectation operator in the Bellman recursion with a coherent (or convex) risk measure $\rho$, yielding recursive value equations of the form
$$V(s) = \max_{a \in A(s)} \Big[ r(s, a) + \gamma\, \rho\big( V(s') \big) \Big], \qquad s' \sim P(\cdot \mid s, a).$$
This encompasses entropic risk, distortion risk, and links to distributionally robust MDPs via their dual representation (Bäuerle et al., 2020).
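As one concrete instance of this recursion, the sketch below runs value iteration with the entropic risk measure in place of the expectation; the risk-aversion parameter `theta` and the function name are illustrative, and other coherent or distortion risk measures would simply replace the line computing `rho`.

```python
import numpy as np

def entropic_risk_value_iteration(P, r, gamma, theta=1.0, tol=1e-8, max_iter=10_000):
    """Risk-sensitive value iteration with the entropic risk measure.

    For theta > 0, the entropic risk of the next-state value is
        rho[s, a] = -(1/theta) * log( sum_{s'} P(s'|s, a) * exp(-theta * V(s')) ),
    which recovers the ordinary expectation as theta -> 0.
    """
    n_s, n_a, _ = P.shape
    V = np.zeros(n_s)
    for _ in range(max_iter):
        rho = -np.log(P @ np.exp(-theta * V)) / theta  # shape (S, A)
        Q = r + gamma * rho
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, Q.argmax(axis=1)
```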
4. Representational Issues and High-Dimensional State Spaces
For real-world problems, constructing a suitable state representation—ensuring the Markov property with minimal dimension—is critical. Feature MDPs formalize the problem as finding a mapping from histories to finite feature-states, optimized by a minimum description-length criterion that trades off model complexity against predictive Markovianity (0812.4580).
Advances in sufficient representations via deep learning consider measurable mappings $\phi$ such that conditional independence and performance-preservation conditions are satisfied, with alternating deep neural networks and statistical dependence minimization used to select $\phi$ (Wang et al., 2017). Verification of sufficiency and sparsity is established through empirical independence testing and penalized estimation.
5. Specialized MDP Classes, Algorithmic Innovations, and Approximations
Nonstandard and Generalized MDPs
- Sequentially Observed Transitions: MDPs where the agent gains sequential, partial lookahead of transition outcomes, enabling policies that outperform classical approaches via dynamic programming and LP decomposition over extended decision variables (Chamie et al., 2015).
- Synchronizing Objectives: An MDP is synchronizing if, under some strategy, the probability mass eventually (or infinitely often) concentrates almost entirely on a single state. The existence of strategies for strong and weak synchronizing objectives is decidable by subset construction, but memoryless strategies do not suffice in general (Doyen et al., 2011).
- Self-Triggered MDPs: Extensions where the agent intermittently updates its control action, co-optimizing a holding time and action to trade off resource usage against performance—solved via DP with optimized lookahead and explicit constraints (Huang et al., 2021).
Online, Nonstationary, and Blackwell-Approachability MDPs
Online MDPs generalize classical models to allow changing reward functions across episodes, measuring regret with respect to the best stationary policy in hindsight. Policy iteration algorithms with sublinear regret bounds, and scalable versions with function approximation, are available (Ma et al., 2015).
Blackwell approachability provides an online optimization perspective where the policy is the decision variable and the Q-vector is interpreted as payoff feedback. Planning can be cast as a repeated vector pay-off game, with regret-matching driving convergence to optimality (Li et al., 2020).
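A minimal single-state sketch of regret matching with Q-vectors as payoff feedback, in the spirit of this approachability view; the class name and interface are hypothetical and do not reproduce the algorithm of Li et al. (2020) verbatim.

```python
import numpy as np

class RegretMatchingPolicy:
    """Regret matching over the actions of a single state.

    Each round the mixed action is proportional to the positive part of the
    cumulative regrets; the average of the played mixed actions is a no-regret
    policy against the observed sequence of Q-vectors.
    """

    def __init__(self, n_actions):
        self.cum_regret = np.zeros(n_actions)
        self.cum_policy = np.zeros(n_actions)

    def act(self):
        positive = np.maximum(self.cum_regret, 0.0)
        if positive.sum() > 0.0:
            policy = positive / positive.sum()
        else:
            # No positive regret yet: play uniformly at random
            policy = np.full(len(self.cum_regret), 1.0 / len(self.cum_regret))
        self.cum_policy += policy
        return policy

    def update(self, policy, q_vector):
        # Regret of each pure action relative to the mixed action actually played
        self.cum_regret += q_vector - policy @ q_vector

    def average_policy(self):
        return self.cum_policy / self.cum_policy.sum()
```

In a planning loop, one such object per state would receive the current Q-vector as payoff feedback after each backup.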
Temporal Decomposition and Economic MPC
Temporal concatenation splits a long-horizon MDP into blocks, solves each subproblem independently, and concatenates the resulting policies, trading a quantifiable performance loss (regret) for a substantial computational speed-up (Song et al., 2020).
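A minimal sketch of temporal concatenation for a tabular, finite-horizon problem, assuming each block is solved by backward induction with a zero terminal value; the block length and the zero terminal value are illustrative choices that determine the regret incurred.

```python
import numpy as np

def finite_horizon_dp(P, r, horizon, terminal_value):
    """Backward induction; returns per-step greedy decision rules (first step first)."""
    V = terminal_value.copy()
    decision_rules = []
    for _ in range(horizon):
        Q = r + P @ V                    # undiscounted finite-horizon backup
        decision_rules.append(Q.argmax(axis=1))
        V = Q.max(axis=1)
    decision_rules.reverse()             # index t now corresponds to time step t
    return decision_rules, V

def temporal_concatenation(P, r, total_horizon, block_length):
    """Split the horizon into blocks, solve each block separately, concatenate policies."""
    n_s = P.shape[0]
    policies = []
    remaining = total_horizon
    while remaining > 0:
        h = min(block_length, remaining)
        block_rules, _ = finite_horizon_dp(P, r, h, np.zeros(n_s))
        policies.extend(block_rules)
        remaining -= h
    return policies                      # one decision rule per time step
```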
Economic model predictive control (EMPC) provides a tractable, receding-horizon heuristic. Under specific conditions (the terminal cost equals the optimal value function, and the true Bellman operator commutes up to a constant with its deterministic-model counterpart), EMPC may achieve optimal or nearly optimal closed-loop performance, though this equivalence is restricted to particular problem classes (deterministic, LQG, locally dissipative) (Reinhardt et al., 23 Jul 2024).
6. Generalizations: External Processes, Partially Observable, and Generalized Linear MDPs
- MDPs under External Temporal Processes: The incorporation of exogenous, non-Markovian event processes alters transition kernels and rewards. Under summable “forgetting” in the influence of past events, one can truncate history, apply standard policy iteration on the truncated system, and bound the approximation error as a function of memory length (Ayyagari et al., 2023).
- Generalized Linear MDPs (GLMDPs): The linear MDP framework is extended to allow nonlinear (e.g., logistic, Poisson) reward models via generalized linear models, preserving tractable Bellman completeness with respect to a lifted function class. Bellman operators, sample-efficient pessimistic value iteration, and semi-supervised extensions for reward-scarce offline RL are realized with non-asymptotic performance guarantees (Zhang et al., 1 Jun 2025).
7. Applications, Impact, and Research Directions
MDPs constitute the theoretical underpinning of much of modern sequential decision making in AI and operations research. They serve as the modeling basis for reinforcement learning algorithms, stochastic control, verification in formal methods, and robust and risk-sensitive planning. Extensions to robust, constrained, risk-sensitive, and partially observable scenarios continue to stimulate algorithmic and methodological innovation.
Open problems include tight memory and complexity bounds for strategies with advanced objectives (Doyen et al., 2011), scalable robust and multi-objective policy synthesis (Suilen et al., 18 Nov 2024), integration of MDP planning and deep function approximation, and characterization of optimality gaps and sample complexity in new specializations such as those with generalized rewards (Zhang et al., 1 Jun 2025), sequential side information (Chamie et al., 2015), and nonstationary environments (Ayyagari et al., 2023).
MDPs remain a central formalism, unifying diverse perspectives from dynamic programming, online learning, convex optimization, and AI planning. Recent developments demonstrate both the continuing evolution of foundational theory and its adaptation to high-dimensional, data-driven, and safety-critical domains.