Markov Decision Process Formulation
- Markov decision process formulation is a rigorous mathematical framework that models sequential decision-making using defined state and action spaces, probabilistic transitions, and reward structures.
- It integrates approaches such as linear programming and dynamic programming, leveraging Bellman recursions and occupation measures to derive optimal policies.
- This formulation extends to robust, risk-averse, and multi-objective applications across domains like healthcare, engineering, and resource allocation.
A Markov decision process (MDP) formulation is a mathematically rigorous approach to modeling sequential decision-making under uncertainty, in which an agent interacts dynamically with an environment governed by probabilistic transitions and optimizes a cumulative objective, often subject to a broad range of constraints and generalizations. The core of an MDP formulation is the explicit representation of state and action spaces, transition dynamics, reward/cost structures, and information flow, formalized via algebraic and measure-theoretic constructs to enable tractable optimization, policy synthesis, and structural analysis across a wide class of domains and problem variants.
1. Formal Structure and Elements
An MDP is defined as a tuple $(S, A, P, r, \gamma)$, where:
- $S$: Finite or measurable state space.
- $A$: Finite or Borel action space; possibly state-dependent, $A(s)$.
- $P$: Transition kernel $P(s' \mid s, a)$, specifying the probability of moving to $s'$ from $s$ when $a$ is chosen.
- $r$: Reward (or cost) function $r : S \times A \to \mathbb{R}$.
- $\gamma$: Discount factor in $[0, 1)$ controlling the time preference of rewards.
The agent selects actions (possibly randomized) according to a policy $\pi$, generating a controlled (Markovian) stochastic process $(s_t, a_t)_{t \ge 0}$. The objective is typically to maximize (or minimize) an expected cumulative functional of the form
$$\mathbb{E}^\pi_\alpha\!\left[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, a_t)\right]$$
subject to various constraints or generalizations, such as risk aversion, constraints on accumulated costs, or robustness to model uncertainty.
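As a concrete illustration of these elements, the tuple can be instantiated numerically and a fixed stationary policy evaluated by solving the linear Bellman equation $V_\pi = r_\pi + \gamma P_\pi V_\pi$. The sketch below uses a hypothetical two-state, two-action MDP, chosen only to keep the example self-contained:

```python
import numpy as np

# A toy two-state, two-action MDP (hypothetical numbers, for illustration).
n_states, n_actions = 2, 2
gamma = 0.9

# P[s, a, s'] : transition kernel; each P[s, a, :] sums to 1.
P = np.array([
    [[0.8, 0.2], [0.1, 0.9]],   # transitions from state 0
    [[0.5, 0.5], [0.3, 0.7]],   # transitions from state 1
])
# r[s, a] : immediate reward.
r = np.array([[1.0, 0.0],
              [0.0, 2.0]])

# Evaluate a fixed deterministic policy pi(s) by solving
# (I - gamma * P_pi) V = r_pi, the linear Bellman evaluation equation.
pi = np.array([0, 1])                      # action chosen in each state
P_pi = P[np.arange(n_states), pi]          # induced Markov chain
r_pi = r[np.arange(n_states), pi]
V = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print(V)   # expected discounted return from each state under pi
```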
2. Occupation Measure and Linear Programming Formulations
The occupation measure approach recasts the MDP as a static optimization problem over nonnegative measures satisfying balance constraints encoding the Markov and policy dynamics. For finite or Borel state/action spaces, the linear programming (LP) formulations are:
- Discounted-cost, finite state-action: Minimize $\sum_{s,a} \mu(s,a)\, c(s,a)$ over $\mu \ge 0$, subject to the flow-balance constraints
$$\sum_{a} \mu(s', a) \;-\; \gamma \sum_{s, a} P(s' \mid s, a)\, \mu(s, a) \;=\; \alpha(s') \quad \text{for all } s',$$
where $\alpha$ is the initial state distribution, together with nonnegativity constraints (Ying et al., 2020, Costa et al., 2014).
- Infinite-horizon or expected-total-reward with constraints: Optimize over occupation measures (or measure–kernel pairs), enforcing flow balance and constraint satisfaction. The generalized convex program takes the form
$$\min_{\mu \ge 0} \int_{S \times A} c \, d\mu \quad \text{subject to} \quad \mu(B \times A) = \alpha(B) + \int_{S \times A} P(B \mid s, a)\, \mu(ds, da) \quad \text{for all Borel } B \subseteq S,$$
for general Borel spaces (Dufour et al., 2019).
- Vector-valued LPs for multi-objective MDPs: The vector LP formulation characterizes the Pareto-efficient policy set for finite-horizon, vector-reward MDPs via occupation measures and flow-balance constraints. Efficient deterministic policies correspond to efficient (Pareto) vertices of the feasible region (Mifrani et al., 19 Feb 2025).
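A minimal numerical sketch of the discounted occupation-measure LP, using `scipy.optimize.linprog` on a hypothetical two-state, two-action MDP; the policy is recovered from the optimal occupancy by conditioning, $\pi(a \mid s) = \mu(s,a) / \sum_{a'} \mu(s,a')$:

```python
import numpy as np
from scipy.optimize import linprog

# Toy MDP (hypothetical numbers).
nS, nA, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s']
r = np.array([[1.0, 0.0], [0.0, 2.0]])
alpha = np.array([0.5, 0.5])               # initial state distribution

# Variables: mu[s, a], flattened. Flow-balance constraints:
#   sum_a mu(s', a) - gamma * sum_{s,a} P(s'|s,a) mu(s,a) = alpha(s')
A_eq = np.zeros((nS, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] = (s == sp) - gamma * P[s, a, sp]

# linprog minimizes, so negate rewards to maximize sum_{s,a} mu * r.
res = linprog(c=-r.flatten(), A_eq=A_eq, b_eq=alpha, bounds=(0, None))
mu = res.x.reshape(nS, nA)
policy = mu / mu.sum(axis=1, keepdims=True)   # pi(a|s) from occupancy
print(policy)
```

Note that the total occupancy mass equals $1/(1-\gamma)$, as summing the flow-balance constraints over states shows.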
3. Dynamic Programming and Bellman Equations
The fundamental principle underlying MDP formulations is the Bellman recursion: a fixed-point relation characterizing the optimal value function $V^*$,
$$V^*(s) = \max_{a \in A(s)} \Big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big].$$
This recursion generalizes to:
- Average-reward (undiscounted) and entropy-regularized cases via appropriate modifications to the reward structure and the addition of entropy in the objective (Ying et al., 2020).
- Constrained and risk-averse cases where the Bellman inequality incorporates Lagrange multipliers and coherent/convex risk mappings (Ahmadi et al., 2020).
Notably, in infinite-horizon discounted MDPs, the LP relaxation and Bellman recursion are strongly dual, with the LP optimal dual variables corresponding to stationary occupation measures and the primal variables to value functions.
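The Bellman recursion can be iterated to its fixed point (value iteration), since the Bellman optimality operator is a $\gamma$-contraction. A sketch on a hypothetical two-state, two-action MDP:

```python
import numpy as np

# Value iteration: iterate the Bellman optimality operator to its
# fixed point V*. Toy MDP with hypothetical numbers.
nS, nA, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s']
r = np.array([[1.0, 0.0], [0.0, 2.0]])

V = np.zeros(nS)
for _ in range(1000):
    # Q(s, a) = r(s, a) + gamma * sum_{s'} P(s'|s, a) V(s')
    Q = r + gamma * P @ V
    V_new = Q.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V)                     # fixed point of the Bellman operator
print(Q.argmax(axis=1))      # greedy (optimal) policy
```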
4. Generalized and Robust MDP Formulations
a. Constrained and Risk-Averse MDPs
Extended MDP formulations incorporate resource or risk constraints:
- Multi-modality cancer therapy optimization models state as a vector of flags and indices, action space as treatment types, and enforces constraints via state transitions and absorbing boundaries (Maass et al., 2017).
- Risk-averse MDPs: Nested dynamic coherent risk measures replace expectation, encoding time-consistent risk-averse objectives such as CVaR or entropic VaR, leading to Bellman equations with risk transition mapping and optimization over risk-augmented value functions. The resulting optimization problems are difference-convex (DC) and solved via disciplined convex-concave programming (DCCP) (Ahmadi et al., 2020).
b. Robust MDPs
Robust MDPs (RMDPs) address ambiguity in transition probabilities, utilizing rectangular (sa-rectangular and s-rectangular) or non-rectangular uncertainty sets (e.g., $L_p$-norm balls) (Grand-Clément et al., 2022, Kumar et al., 13 Feb 2025). The robust Bellman operator is
$$(T_{\mathcal{P}} V)(s) = \max_{a \in A(s)} \min_{P \in \mathcal{P}(s, a)} \Big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V(s') \Big],$$
with tractable convex reformulations achieved via entropic regularization and exponential change of variables, leading to conic programming problems involving exponential or quadratic cones. Non-rectangular $L_p$ uncertainty balls, while non-decomposable, can be expressed as unions of sa-rectangular sets, and their dual solutions exhibit rank-one perturbation structure and tight policy evaluation bounds (Kumar et al., 13 Feb 2025).
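A sketch of the robust Bellman backup under an sa-rectangular uncertainty set. For simplicity the uncertainty set is approximated here by a finite collection of candidate kernels (the adversary picks the worst per state-action pair), rather than a full conic reformulation over a continuous ball; all numbers are hypothetical:

```python
import numpy as np

# Robust value iteration with an sa-rectangular uncertainty set,
# approximated by a finite set of candidate transition kernels.
nS, nA, gamma = 2, 2, 0.9
r = np.array([[1.0, 0.0], [0.0, 2.0]])

# Two candidate kernels (hypothetical nominal and perturbed models).
P_nominal = np.array([[[0.8, 0.2], [0.1, 0.9]],
                      [[0.5, 0.5], [0.3, 0.7]]])
P_perturbed = np.array([[[0.9, 0.1], [0.2, 0.8]],
                        [[0.6, 0.4], [0.4, 0.6]]])
kernels = [P_nominal, P_perturbed]

V = np.zeros(nS)
for _ in range(1000):
    # Inner min over the uncertainty set, outer max over actions.
    Q_worst = np.min([r + gamma * P @ V for P in kernels], axis=0)
    V_new = Q_worst.max(axis=1)
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

print(V)   # robust value function (lower than the nominal V*)
```

Since the adversarial minimum includes the nominal kernel, the robust value function is dominated by the nominal optimal value, as the max-min structure of the operator implies.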
5. Structural and Algorithmic Insights
MDP formulations enable a range of structural and computational developments:
- Occupation measure LPs provide a unifying paradigm for finite-horizon, infinite-horizon, vector-valued, or risk-constrained problems (Ying et al., 2020, Dufour et al., 2019, Mifrani et al., 19 Feb 2025).
- Duality between values and occupancies: The primal (value-function) and dual (state-action occupancy) LPs are strongly dual; dual multipliers in the LP correspond to discounted visitation frequencies, and optimal policies are derived from primal-dual solutions.
- Exact policy recovery: Complementary slackness in the LP directly yields deterministic or randomized policies—e.g., supported on actions tight in Bellman inequalities.
- Dynamic programming and decomposition: Bellman fixed-point equations, multi-objective extensions, and scenario-based decompositions (e.g., for bilevel or robust MDPs) facilitate advanced synthesis and sensitivity analysis (Brown et al., 2023, Grand-Clément et al., 2022).
- Handling large or infinite spaces: Measure-theoretic LPs extend MDP formulations to Borel spaces, with technical conditions (measurability, semicontinuity, Lyapunov growth) ensuring equivalence and solvability (Costa et al., 2014, Dufour et al., 2019).
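Strong duality between the value-function (primal) and occupation-measure (dual) LPs can be checked numerically in the finite case; a sketch with hypothetical data:

```python
import numpy as np
from scipy.optimize import linprog

# Toy MDP (hypothetical numbers).
nS, nA, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],
              [[0.5, 0.5], [0.3, 0.7]]])   # P[s, a, s']
r = np.array([[1.0, 0.0], [0.0, 2.0]])
alpha = np.array([0.5, 0.5])

# Primal: min alpha.V  s.t.  V(s) >= r(s,a) + gamma * sum_{s'} P V(s')
# for all (s, a), rewritten as  gamma*P V - V <= -r.
A_ub = np.zeros((nS * nA, nS))
b_ub = np.zeros(nS * nA)
for s in range(nS):
    for a in range(nA):
        row = gamma * P[s, a].copy()
        row[s] -= 1.0
        A_ub[s * nA + a] = row
        b_ub[s * nA + a] = -r[s, a]
primal = linprog(c=alpha, A_ub=A_ub, b_ub=b_ub, bounds=(None, None))

# Dual: max mu.r over nonnegative occupancies with flow balance.
A_eq = np.zeros((nS, nS * nA))
for sp in range(nS):
    for s in range(nS):
        for a in range(nA):
            A_eq[sp, s * nA + a] = (s == sp) - gamma * P[s, a, sp]
dual = linprog(c=-r.flatten(), A_eq=A_eq, b_eq=alpha, bounds=(0, None))

print(primal.fun, -dual.fun)   # optimal values coincide (strong duality)
```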
6. Application-Driven MDP Formulations
The versatility of the MDP formulation is reflected in diverse applications:
- Continuous-time and hybrid models: PDMPs (piecewise deterministic Markov processes) use embedded discrete-time chains and infinite-dimensional LPs to capture systems with deterministic flows and stochastic jumps, relevant for stochastic hybrid systems and queuing networks (Costa et al., 2014).
- Finite-horizon, multi-objective engineering: Vector-LP MDPs address multi-criteria design and scheduling, explicitly characterizing the Pareto-efficient policy set via occupation measures (Mifrani et al., 19 Feb 2025).
- Resource allocation and social welfare: Water-filling convex programs for fair allocations under demand uncertainty leverage MDP structural properties to design policies with rigorous guarantees on fairness and efficiency (Hassanzadeh et al., 2023).
- Learning and control in complex settings: MDP reformulations of video generation (Yushchenko et al., 2019), branch-and-bound variable selection (Strang et al., 22 Oct 2025), and satellite scheduling (Eddy et al., 2019) demonstrate the methodological breadth of MDP modeling, including integration with RL and tree search.
7. Theoretical and Computational Guarantees
The equivalences between MDP dynamic programming, LP, and policy-gradient formulations ensure convergence properties and optimality of synthesized policies under broad conditions:
- Existence of an optimal policy, often stationary and deterministic (or suitably randomized), is guaranteed by strong duality and complementary slackness under mild regularity assumptions (Ying et al., 2020, Dufour et al., 2019, Mifrani et al., 19 Feb 2025).
- Robust, risk-averse, and bilevel generalizations preserve computational tractability through advanced convex programming, conic formulations, and decomposition schemes, with provable performance bounds (Grand-Clément et al., 2022, Ahmadi et al., 2020, Brown et al., 2023).
- Occupation measure and LP-based methods generalize seamlessly to constraints, multi-objective, risk, and robustness settings, underpinning both exact and scalable approximate solution algorithms across stochastic control and optimization domains.
References:
- "A Linear Programming Formulation for Constrained Discounted Continuous Control for Piecewise Deterministic Markov Processes" (Costa et al., 2014)
- "A convex programming approach for discrete-time Markov decision processes under the expected total reward criterion" (Dufour et al., 2019)
- "Linear programming for finite-horizon vector-valued Markov decision processes" (Mifrani et al., 19 Feb 2025)
- "A Note on Optimization Formulations of Markov Decision Processes" (Ying et al., 2020)
- "On the convex formulations of robust Markov decision processes" (Grand-Clément et al., 2022)
- "Dual Formulation for Non-Rectangular Lp Robust Markov Decision Processes" (Kumar et al., 13 Feb 2025)
- "Constrained Risk-Averse Markov Decision Processes" (Ahmadi et al., 2020)
- "Markov Decision Process Design: A Framework for Integrating Strategic and Operational Decisions" (Brown et al., 2023)
- "Sequential Fair Resource Allocation under a Markov Decision Process Framework" (Hassanzadeh et al., 2023)
- "Markov decision process approach to optimizing cancer therapy using multiple modalities" (Maass et al., 2017)
- "Markov Decision Process for Video Generation" (Yushchenko et al., 2019)
- "A Markov Decision Process for Variable Selection in Branch & Bound" (Strang et al., 22 Oct 2025)
- "Markov Decision Processes For Multi-Objective Satellite Task Planning" (Eddy et al., 2019)
- "Finite-Horizon Markov Decision Processes with Sequentially-Observed Transitions" (Chamie et al., 2015)