Markov Decision Processes (MDPs) Overview
- Markov Decision Processes (MDPs) are discrete-time stochastic models for sequential decision-making, defined by states, actions, transition probabilities, rewards, and a discount factor.
- They underpin reinforcement learning and operations research, providing frameworks for robust, risk-sensitive, and high-dimensional control problems.
- Recent developments include measurized MDPs, probabilistic constraints, and representation learning techniques that enhance policy synthesis and optimization.
A Markov Decision Process (MDP) is a discrete-time stochastic control process that models sequential decision-making under uncertainty. An MDP is defined by a state space, an action space, a transition mechanism, a reward function, and a discount factor. The aim is to synthesize policies that optimize an expected cumulative objective. MDPs form the core mathematical object in stochastic control, reinforcement learning, and operations research. The contemporary literature extends the MDP framework to incorporate general state/action spaces, robust and risk-sensitive objectives, learning from incomplete information, and domain-specific constraints.
1. Formal Structure and Canonical Properties
Let $\mathcal{X}$ be a Borel state space and $\mathcal{A}$ a Borel action space. The canonical, discrete-time MDP is a tuple
$$(\mathcal{X}, \mathcal{A}, Q, r, \beta),$$
where
- $Q$ is a Markov transition kernel: for each $(x,a) \in \mathcal{X} \times \mathcal{A}$, $Q(\cdot \mid x, a)$ is a probability measure on $\mathcal{X}$;
- $r: \mathcal{X} \times \mathcal{A} \to \mathbb{R}$ is a one-stage reward;
- $\beta \in (0,1)$ is a discount factor.
A policy $\pi$ is a measurable mapping assigning actions (possibly randomized) to each state. The value function under policy $\pi$ is
$$v^{\pi}(x) = \mathbb{E}^{\pi}_{x}\!\left[\sum_{t=0}^{\infty} \beta^{t}\, r(x_t, a_t)\right].$$
The optimal value function $v^{*}$ satisfies the Bellman optimality equations:
$$v^{*}(x) = \sup_{a \in \mathcal{A}} \left\{ r(x,a) + \beta \int_{\mathcal{X}} v^{*}(y)\, Q(dy \mid x, a) \right\}.$$
For finite or countable $\mathcal{X}$ and $\mathcal{A}$, these reduce to the classical Bellman recurrences. Similar expressions underlie the average-reward and constrained MDPs.
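As a concrete illustration of the recursion, the following minimal sketch performs value iteration on a toy finite MDP; the two-state, two-action kernel and rewards are illustrative numbers chosen here, not drawn from any cited source.

```python
import numpy as np

# Toy finite MDP (illustrative numbers).
# P[a, x, y] = Q(y | x, a): transition probability; R[x, a] = one-stage reward.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.0, 1.0]],   # action 1
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
beta = 0.95                      # discount factor

v = np.zeros(2)
for _ in range(10_000):
    # Bellman optimality backup: q(x,a) = r(x,a) + beta * sum_y Q(y|x,a) v(y)
    q = R + beta * np.einsum("axy,y->xa", P, v)
    v_new = q.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new

policy = q.argmax(axis=1)        # greedy stationary policy
print("optimal values:", v, "greedy policy:", policy)
```

Because the Bellman operator is a $\beta$-contraction in the sup norm, the iteration converges geometrically and the greedy policy extracted at convergence is stationary and optimal.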
2. Existence, Measurability, and Semicontinuous–Semicompact Framework
When $\mathcal{X}$ and $\mathcal{A}$ are general Borel spaces, issues of measurability and continuity become paramount. The semicontinuous–semicompact framework, as developed by Hernández-Lerma and Lasserre, casts the MDP in a setting where the reward is upper semicontinuous and bounded above, and the transition kernel depends continuously (in the weak topology) on $(x,a)$. Under these and mild integrability assumptions, the Bellman operator admits fixed points in the space of bounded Borel-measurable functions; optimal measurable selectors exist, yielding stationary optimal policies. This facilitates extension to constrained and infinite-horizon models without the technicalities of universally measurable selectors (Adelman et al., 6 May 2024).
3. Lifting and the Measurized MDP Formalism
A crucial generalization is the measurized MDP, whereby the state space is lifted from points $x \in \mathcal{X}$ to probability measures on $\mathcal{X}$, formulated within the weak topology. The measurized MDP is specified by the tuple
$$\left(\mathcal{P}(\mathcal{X}), \Pi, \bar{Q}, \bar{r}, \beta\right),$$
where $\mathcal{P}(\mathcal{X})$ is the set of probability measures on $\mathcal{X}$, $\Pi$ is the set of Markov decision rules (stochastic kernels $\pi(da \mid x)$), $\bar{Q}$ encodes deterministic transitions on $\mathcal{P}(\mathcal{X})$ via
$$\bar{Q}(\mu, \pi)(B) = \int_{\mathcal{X}} \int_{\mathcal{A}} Q(B \mid x, a)\, \pi(da \mid x)\, \mu(dx), \qquad B \in \mathcal{B}(\mathcal{X}),$$
and $\bar{r}(\mu, \pi) = \int_{\mathcal{X}} \int_{\mathcal{A}} r(x,a)\, \pi(da \mid x)\, \mu(dx)$ is the lifted one-stage reward. The Bellman equations in this setting admit the same structure, with value functions $\bar{v}$ (bounded Borel-measurable functions on $\mathcal{P}(\mathcal{X})$), and Borel-measurable selectors guaranteeing optimal stationary product policies. For Dirac measures $\delta_x$, this framework recovers the original MDP, without loss of fidelity:
$$\bar{v}^{*}(\delta_x) = v^{*}(x) \quad \text{for all } x \in \mathcal{X}.$$
The framework naturally incorporates external random shocks, risk or chance constraints, and supports nuanced approximation architectures (Adelman et al., 6 May 2024).
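In the finite-state case the lifted transition $\bar{Q}$ is simply a push-forward of the current state distribution through the decision rule and the kernel. The sketch below (reusing the illustrative toy kernel and rewards from the previous section) computes one measurized step and the lifted reward; starting from a Dirac measure reproduces the original dynamics.

```python
import numpy as np

# Sketch of the measurized transition on a finite state space: the lifted state
# is a distribution mu over states, and a Markov decision rule pi(a|x) maps it
# deterministically to mu'(y) = sum_x sum_a mu(x) * pi(a|x) * Q(y|x,a).
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # Q(y|x, a=0)
    [[0.5, 0.5], [0.0, 1.0]],   # Q(y|x, a=1)
])
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])      # r(x, a)

pi = np.array([[0.7, 0.3],      # pi(a | x=0)
               [0.1, 0.9]])     # pi(a | x=1)

def measurized_step(mu, pi, P):
    """Deterministic transition on probability measures (finite-state case)."""
    return np.einsum("x,xa,axy->y", mu, pi, P)

def lifted_reward(mu, pi, R):
    """Lifted one-stage reward: r_bar(mu, pi) = E_{x~mu, a~pi(.|x)}[r(x, a)]."""
    return np.einsum("x,xa,xa->", mu, pi, R)

mu = np.array([1.0, 0.0])       # Dirac measure at state 0 recovers the original MDP
print(measurized_step(mu, pi, P), lifted_reward(mu, pi, R))
```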
4. Constraints and Value Function Approximations
The lifting approach enables constraints and approximations not expressible in the classical state space $\mathcal{X}$:
- Risk constraints (e.g., CVaR): By restricting to policies satisfying
$$\mathrm{CVaR}_{\alpha}\!\left[c\right] \le \theta,$$
with $c$ the cost and $\mathrm{CVaR}_{\alpha}$ defined in terms of the conditional tail expectation. The Bellman equation then optimizes over this risk-constrained policy set.
- Probabilistic state constraints: For instance, bounding the variance of the lifted state distribution $\mu$ further restricts the feasible decision rules, affecting the supremum in the Bellman update.
- Value function approximations: By expanding $\bar{v}(\mu)$ in basis functions (moments, Laplace transforms, divergences) and solving for weights $w_k$ in
$$\bar{v}(\mu) \approx \sum_{k} w_k\, \phi_k(\mu),$$
the MDP solution reduces to a convex optimization over the function class (Adelman et al., 6 May 2024); a numerical sketch follows this list.
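The sketch below illustrates the basis-function approach on the toy example above: moment features of the lifted state are fixed, and the weights are fitted by repeated projected (least-squares) Bellman backups under a fixed decision rule. The feature choice and the fitted-value-iteration loop are illustrative simplifications, not the specific architecture of the cited work.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative toy kernel, reward, and fixed decision rule (same conventions as above).
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])          # Q(y|x,a), indexed [a, x, y]
R = np.array([[1.0, 0.0], [0.0, 2.0]])            # r(x, a)
pi = np.array([[0.7, 0.3], [0.1, 0.9]])           # pi(a | x)
beta = 0.95

def features(mu):
    """Illustrative basis on the lifted state: a constant plus the first two moments of mu."""
    s = np.arange(len(mu))
    return np.array([1.0, mu @ s, mu @ s**2])

def step(mu):
    """Deterministic measurized transition under the fixed decision rule."""
    return np.einsum("x,xa,axy->y", mu, pi, P)

def r_bar(mu):
    """Lifted one-stage reward r_bar(mu, pi)."""
    return np.einsum("x,xa,xa->", mu, pi, R)

# Fit weights w in v_bar(mu) ~ sum_k w_k phi_k(mu) by repeated projected
# (least-squares) Bellman backups on sampled lifted states.
mus = rng.dirichlet(np.ones(2), size=200)          # sampled lifted states
Phi = np.stack([features(mu) for mu in mus])       # design matrix
w = np.zeros(Phi.shape[1])
for _ in range(200):
    targets = np.array([r_bar(mu) + beta * features(step(mu)) @ w for mu in mus])
    w, *_ = np.linalg.lstsq(Phi, targets, rcond=None)  # convex least-squares subproblem
print("fitted weights:", w)
```

Because the value of a fixed decision rule is linear in $\mu$ on a finite state space, the chosen moment features represent it exactly here; in general the quality of the approximation depends on the basis.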
5. Representation Learning and State Compression
A central technical challenge in high-dimensional domains is constructing low-dimensional, Markovian feature representations. The formalism of feature Markov Decision Processes (ΦMDPs) encodes any history compression $\phi$ (from history space to state space) and defines penalized code-length cost functions to score candidate $\phi$. The optimal $\phi$ minimizes this cost, yielding an induced process that is approximately Markov (0812.4580). Alternating deep neural networks (ADNN) further automate this discovery, learning encoders such that the reduced process is itself Markov and sufficient for optimal control; conditional independence criteria (residual tests, Brownian distance covariance) ensure fidelity with the original process, and group-lasso regularization yields sparsity for interpretability (Wang et al., 2017). A schematic scoring example is sketched below.
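The following sketch evaluates a candidate compression $\phi$ by the negative log-likelihood of the induced state sequence under its empirical Markov model plus a complexity penalty. The function `phi_score`, the BIC-style penalty, and the parameter `alpha` are simplified stand-ins for the exact code-length cost of the cited formalism, not its definition.

```python
import numpy as np
from collections import Counter

def phi_score(history, phi, alpha=0.5):
    """Score a candidate history compression phi (lower is better).
    `history` is a sequence of raw observations; `phi(history[:t+1])` returns
    the induced state at time t. This is a simplified, BIC-style stand-in for a
    penalized code-length criterion."""
    states = [phi(history[: t + 1]) for t in range(len(history))]
    transitions = Counter(zip(states[:-1], states[1:]))
    from_counts = Counter(states[:-1])
    # Negative log-likelihood of the induced state sequence under the
    # empirical Markov transition model ...
    nll = -sum(n * np.log(n / from_counts[s]) for (s, _), n in transitions.items())
    # ... plus a complexity penalty growing with the number of distinct
    # induced transitions (a coarse proxy for the model's code length).
    penalty = alpha * len(transitions) * np.log(len(history))
    return nll + penalty

# Toy usage: compare keeping the last observation vs. the last two observations.
rng = np.random.default_rng(1)
obs = list(rng.integers(0, 3, size=500))
last1 = lambda h: h[-1]
last2 = lambda h: tuple(h[-2:])
print(phi_score(obs, last1), phi_score(obs, last2))
```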
6. Robust, Risk-Sensitive, and Distributionally Robust Extensions
MDPs are generalized to address model uncertainty (robust MDPs, RMDPs) and risk via recursive risk measures:
- Robust MDPs: Transition probabilities are uncertain within nonempty convex sets $\mathcal{Q}(x,a)$, yielding robust Bellman equations of the form
$$v^{*}(x) = \sup_{a \in \mathcal{A}} \inf_{Q \in \mathcal{Q}(x,a)} \left\{ r(x,a) + \beta \int_{\mathcal{X}} v^{*}(y)\, Q(dy) \right\},$$
which is equivalent to a zero-sum stochastic game against nature (Suilen et al., 18 Nov 2024); a value-iteration sketch for interval uncertainty sets appears after this list.
- Risk-sensitive MDPs: The objective is the recursive application of coherent, law-invariant risk measures $\rho$; the value function solves
$$v(x) = \sup_{a \in \mathcal{A}} \left\{ r(x,a) + \beta\, \rho_{Q(\cdot \mid x,a)}\!\left( v(X') \right) \right\},$$
where the risk measure is evaluated under the law of the successor state $X'$. Under mild contractivity, unique fixed points and optimal stationary policies exist; the dual representation of $\rho$ provides a direct link to distributionally robust control (Bäuerle et al., 2020).
- Distributionally robust chance-constrained MDPs: Ambiguity in the reward distribution is handled via moment, $\phi$-divergence, or Wasserstein ambiguity sets. The robust optimization reduces to tractable SOCP/MISOCP or copositive programs, yielding policies with formal chance constraint guarantees (Nguyen et al., 2022).
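As an illustration of the robust Bellman recursion, the sketch below runs robust value iteration with simple interval ((s,a)-rectangular) uncertainty sets, for which the inner minimization over nature has a greedy closed form; the kernel, rewards, and interval radius are illustrative assumptions, not taken from the cited survey.

```python
import numpy as np

def worst_case_expectation(v, p_lo, p_hi):
    """Inner minimisation of the robust Bellman backup for an interval
    uncertainty set: choose q with p_lo <= q <= p_hi and sum(q) = 1 minimising
    q . v, by shifting the remaining mass greedily onto low-value successors."""
    q = p_lo.copy()
    budget = 1.0 - q.sum()
    for y in np.argsort(v):                    # cheapest successors first
        add = min(p_hi[y] - q[y], budget)
        q[y] += add
        budget -= add
    return q @ v

# Toy robust MDP (illustrative numbers): nominal kernel +/- 0.1, clipped to [0, 1].
P_nom = np.array([[[0.9, 0.1], [0.2, 0.8]],
                  [[0.5, 0.5], [0.0, 1.0]]])   # indexed [a, x, y]
R = np.array([[1.0, 0.0], [0.0, 2.0]])
beta, eps = 0.95, 0.1
P_lo, P_hi = np.clip(P_nom - eps, 0, 1), np.clip(P_nom + eps, 0, 1)

v = np.zeros(2)
for _ in range(10_000):
    # Robust backup: sup over actions of the worst-case discounted continuation.
    q_vals = np.array([[R[x, a] + beta * worst_case_expectation(v, P_lo[a, x], P_hi[a, x])
                        for a in range(2)] for x in range(2)])
    v_new = q_vals.max(axis=1)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
print("robust values:", v, "robust greedy policy:", q_vals.argmax(axis=1))
```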
7. Learning Under Partial Knowledge, Unawareness, and Nonstandard Extensions
MDPs have been extended to address partial knowledge, unawareness, non-cumulative objectives, and non-stationarity:
- Learning with Unawareness: In MDPs with unawareness (MDPUs), the agent is initially unaware of part of the action set. Discovery is modelled via an “explore” action and a stochastic discovery probability $D(t)$ of finding a new action on the $t$-th exploration. Near-optimal play is possible iff $\sum_{t \ge 1} D(t) = \infty$, with polynomial-time learning characterized by the growth rate of this sum (Halpern et al., 2014).
- Non-cumulative Objectives: For decision processes with non-cumulative return functions (e.g., maximizing the maximum reward over time), construction of a “lifted” MDP with an augmented state capturing a sufficient statistic of the past reward sequence enables reduction to standard RL and dynamic programming algorithms, retaining optimality (Nägele et al., 22 May 2024); see the wrapper sketch after this list.
- Non-stationary or Externally Modulated MDPs: When transitions depend on external temporal processes or histories, the formalism augments the state with finite-memory histories; under suitable decay of exogenous effects, optimality is achieved via truncated history, with error bounded in terms of total variation (Ayyagari et al., 2023).
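For the max-reward objective mentioned above, the lifted construction can be sketched as an environment wrapper that augments the state with the running maximum and emits the telescoping reward increment, so that the cumulative (undiscounted, episodic) return of the wrapped process equals the maximum per-step reward of the original one. The `reset()`/`step()` interface and the wrapper name are assumptions for illustration, not the construction of the cited paper verbatim.

```python
import numpy as np

class MaxRewardWrapper:
    """Lifted-MDP sketch for a non-cumulative (max-reward) objective.

    Wraps a base environment exposing reset() -> x and step(a) -> (x, r, done)
    (an assumed interface). The augmented state is (x, m), where m is the
    running maximum reward; the shaped reward max(m, r) - m telescopes, so the
    wrapped cumulative return equals max_t r_t of the base process."""

    def __init__(self, env, m0=-np.inf):
        self.env, self.m0 = env, m0

    def reset(self):
        self.m = self.m0
        return (self.env.reset(), self.m)

    def step(self, action):
        x, r, done = self.env.step(action)
        new_m = max(self.m, r)
        # First step pays new_m outright (m was -inf); later steps pay the increment.
        shaped = new_m - self.m if np.isfinite(self.m) else new_m
        self.m = new_m
        return (x, self.m), shaped, done
```

Because the augmented state carries the running maximum, the wrapped process is again Markov, and any standard dynamic programming or RL algorithm applied to it optimizes the original non-cumulative objective.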
In summary, the theory of MDPs encompasses discrete- and continuous-parameter models, broad classes of uncertainties and constraints, and advanced frameworks for representation, approximation, and robust/learning-based control. Lifting to measure spaces, compositional risk constraints, and function-approximation architectures exemplify ongoing innovations (Adelman et al., 6 May 2024). The contemporary MDP is not only a canonical model for sequential stochastic decision making but also a substrate on which generalizations for robustness, risk sensitivity, high-dimensionality, unawareness, and statistical learning are rigorously built.