Markov Decision Process Formulation

Updated 27 January 2026
  • Markov decision process formulation is a rigorous mathematical framework that models sequential decision-making using defined state and action spaces, probabilistic transitions, and reward structures.
  • It integrates approaches such as linear programming and dynamic programming, leveraging Bellman recursions and occupation measures to derive optimal policies.
  • This formulation extends to robust, risk-averse, and multi-objective applications across domains like healthcare, engineering, and resource allocation.

A Markov decision process (MDP) formulation is a mathematically rigorous approach to modeling sequential decision-making under uncertainty, where an agent interacts dynamically with an environment characterized by probabilistic transitions and systematically optimizes cumulative objectives—often subject to a broad range of constraints and generalizations. The core of an MDP formulation is the explicit representation of state and action spaces, transition dynamics, reward/cost structures, and information flow, formalized via algebraic and measure-theoretic constructs to enable tractable optimization, policy synthesis, and structural analysis across a wide class of domains and problem variants.

1. Formal Structure and Elements

An MDP is defined as a tuple $(\mathcal{S}, \mathcal{A}, P, r, \gamma)$, where:

  • $\mathcal{S}$: Finite or measurable state space.
  • $\mathcal{A}$: Finite or Borel action space; possibly state-dependent $\mathcal{A}(s)$.
  • $P$: Transition kernel $P(s'|s,a)$, specifying the probability of moving to $s'$ from $s$ when $a$ is chosen.
  • $r$: Reward (or cost) function $r:\mathcal{S}\times \mathcal{A}\rightarrow \mathbb{R}$.
  • $\gamma$: Discount factor in $[0,1]$ controlling the time preference of rewards.

The agent selects actions (possibly randomized) according to a policy $\pi(a|s)$, generating a controlled (Markovian) stochastic process $(s_0,a_0,s_1,a_1,\ldots)$. The objective is typically to maximize (or minimize) an expected cumulative functional of the form

$$J^\pi = \mathbb{E}^\pi\left[\sum_{t=0}^\infty \gamma^t r(s_t,a_t)\right],$$

subject to various constraints or generalizations, such as risk aversion, constraints on accumulated costs, or robustness to model uncertainty.
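The objects above can be made concrete with a small numerical sketch. The two-state MDP below is hypothetical (all numbers illustrative), and $J^\pi$ is obtained exactly by solving the linear system $V^\pi = r_\pi + \gamma P_\pi V^\pi$ rather than by simulation:

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP illustrating the tuple (S, A, P, r, gamma).
# All numbers are illustrative, not taken from any cited paper.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],    # P[s, a, s'] = P(s' | s, a)
              [[0.5, 0.5], [0.1, 0.9]]])
r = np.array([[1.0, 0.0],                  # r[s, a]
              [0.0, 2.0]])
gamma = 0.95

def policy_value(pi):
    """Exact V^pi for a stationary policy, via V^pi = r_pi + gamma * P_pi V^pi."""
    P_pi = np.einsum('sa,sap->sp', pi, P)  # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, r)    # expected one-step reward under pi
    return np.linalg.solve(np.eye(P.shape[0]) - gamma * P_pi, r_pi)

uniform = np.full((2, 2), 0.5)             # pi(a|s) = 1/2 everywhere
V = policy_value(uniform)                  # J^pi = alpha . V for initial dist alpha
```

For any initial distribution $\alpha$, the scalar objective is simply $J^\pi = \alpha^\top V^\pi$.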

2. Occupation Measure and Linear Programming Formulations

The occupation measure approach recasts the MDP as a static optimization problem over nonnegative measures $\mu(s,a)$ satisfying balance constraints encoding the Markov and policy dynamics. For finite or Borel state/action spaces, the linear programming (LP) formulations are:

  • Discounted-cost, finite state-action: Minimize $\sum_{s,a} \mu(s,a)\,c(s,a)$, subject to

$$\sum_a \mu(s',a) = (1-\gamma)\,\alpha(s') + \gamma\sum_{s,a}P(s'|s,a)\,\mu(s,a), \qquad \forall s',$$

where $\alpha$ is the initial state distribution, together with nonnegativity constraints $\mu(s,a)\geq 0$ (Ying et al., 2020, Costa et al., 2014).
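For a toy instance, this LP can be assembled and solved directly. The sketch below assumes SciPy's `linprog` and uses hypothetical numbers throughout:

```python
import numpy as np
from scipy.optimize import linprog

# Toy 2-state, 2-action MDP (illustrative numbers, not from the cited papers).
S, A = 2, 2
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
c = np.array([[1.0, 0.0], [0.0, 2.0]])     # cost c(s, a)
gamma, alpha = 0.95, np.array([0.5, 0.5])  # discount, initial distribution

# Flatten mu(s, a) into a length-S*A vector and encode the balance constraints
#   sum_a mu(s', a) - gamma * sum_{s,a} P(s'|s,a) mu(s, a) = (1 - gamma) alpha(s').
A_eq = np.zeros((S, S * A))
for sp in range(S):
    for s in range(S):
        for a in range(A):
            A_eq[sp, s * A + a] = float(s == sp) - gamma * P[s, a, sp]
b_eq = (1 - gamma) * alpha

res = linprog(c.ravel(), A_eq=A_eq, b_eq=b_eq, bounds=[(0, None)] * (S * A))
mu = res.x.reshape(S, A)   # optimal occupation measure; total mass is 1
```

Summing the balance constraints over $s'$ shows any feasible $\mu$ has total mass one, and a stationary policy is read off from the ratios $\mu(s,a)/\sum_{a'}\mu(s,a')$.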

  • Infinite or expected-total-reward with constraints: Optimize over occupation measures or kernel pairs $(\varphi^0,\varphi^\infty)$, enforcing flow-balance and constraint satisfaction. The generalized convex program is

$$\max_{\Phi\in \mathcal{K}_p} \mu_\Phi(r) \quad \text{subject to} \quad \mu_\Phi(c_i)\geq \theta_i,$$

with

$$\mu_\Phi(\cdot) = \nu_0(\cdot) + \int_{X\times A}Q(\cdot\,|\,x,a)\,\mu_\Phi(dx,da)$$

for general Borel spaces (Dufour et al., 2019).

  • Vector-valued LPs for multi-objective MDPs: The vector LP formulation characterizes the Pareto-efficient policy set for finite-horizon, vector-reward MDPs via occupation measures and flow-balance constraints. Efficient deterministic policies correspond to efficient (Pareto) vertices of the feasible region (Mifrani et al., 19 Feb 2025).

3. Dynamic Programming and Bellman Equations

The fundamental principle underlying MDP formulations is the Bellman recursion

$$V(s) = \max_{a\in \mathcal{A}(s)} \left\{ r(s,a) + \gamma \sum_{s'}P(s'|s,a)\,V(s') \right\},$$

establishing a fixed-point relation characterizing the optimal value function $V^*$. This recursion generalizes to:

  • Average-reward (undiscounted) and entropy-regularized cases via appropriate modifications to the reward structure and the addition of entropy in the objective (Ying et al., 2020).
  • Constrained and risk-averse cases where the Bellman inequality incorporates Lagrange multipliers and coherent/convex risk mappings (Ahmadi et al., 2020).

Notably, in infinite-horizon discounted MDPs, the LP relaxation and Bellman recursion are strongly dual, with the LP optimal dual variables corresponding to stationary occupation measures and the primal variables to value functions.
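As a concrete sketch, the recursion can be run as a fixed-point iteration (value iteration); the two-state MDP below is hypothetical, with illustrative numbers:

```python
import numpy as np

# Value iteration for the Bellman recursion on an illustrative MDP.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])   # P[s, a, s']
r = np.array([[1.0, 0.0], [0.0, 2.0]])     # r[s, a]
gamma = 0.95

def value_iteration(P, r, gamma, tol=1e-10):
    V = np.zeros(P.shape[0])
    while True:
        Q = r + gamma * np.einsum('sap,p->sa', P, V)  # Q(s, a)
        V_new = Q.max(axis=1)                         # Bellman update
        if np.max(np.abs(V_new - V)) < tol:           # gamma-contraction => converges
            return V_new, Q.argmax(axis=1)            # approx V* and a greedy policy
        V = V_new

V_star, pi_star = value_iteration(P, r, gamma)
```

Because the Bellman operator is a $\gamma$-contraction in the sup norm, the iteration converges geometrically and the returned greedy policy is (near-)optimal.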

4. Generalized and Robust MDP Formulations

a. Constrained and Risk-Averse MDPs

Extended MDP formulations incorporate resource or risk constraints:

  • Multi-modality cancer therapy optimization models state as a vector of flags and indices, action space as treatment types, and enforces constraints via state transitions and absorbing boundaries (Maass et al., 2017).
  • Risk-averse MDPs: Nested dynamic coherent risk measures $\rho_t(\cdot)$ replace the expectation, encoding time-consistent risk-averse objectives such as CVaR or entropic VaR and leading to Bellman equations with a risk transition mapping and optimization over risk-augmented value functions. The resulting optimization problems are difference-convex (DC) and solved via disciplined convex-concave programming (DCCP) (Ahmadi et al., 2020).
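As a sketch of the one-stage risk mapping involved, the CVaR of a discrete cost distribution can be computed directly; the nested, time-consistent construction composes such mappings stage by stage. This helper is illustrative only and is not the DCCP formulation of the cited work:

```python
import numpy as np

def cvar(costs, probs, alpha):
    """CVaR_alpha: expected cost within the worst alpha-fraction of outcomes
    (higher cost = worse). Illustrative static version of the risk mapping."""
    order = np.argsort(costs)[::-1]             # atoms from worst to best
    c = np.asarray(costs, dtype=float)[order]
    p = np.asarray(probs, dtype=float)[order]
    tail, acc = 0.0, 0.0
    for ci, pi in zip(c, p):
        take = min(pi, alpha - acc)             # mass drawn from this atom
        tail += take * ci
        acc += take
        if acc >= alpha - 1e-12:
            break
    return tail / alpha
```

With $\alpha = 1$ this reduces to the plain expectation; as $\alpha \to 0$ it approaches the worst-case cost, interpolating between risk-neutral and robust objectives.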

b. Robust MDPs

Robust MDPs (RMDPs) address ambiguity in transition probabilities, utilizing rectangular ($sa$-rectangular and $s$-rectangular) or non-rectangular uncertainty sets (e.g., $L_p$ balls) (Grand-Clément et al., 2022, Kumar et al., 13 Feb 2025). The robust Bellman operator is

$$(Tv)_s = \max_a \min_{p\in U_{sa}} \left[ r_{sa} + \gamma\, p^\top v \right],$$

with tractable convex reformulations achieved via entropic regularization and an exponential change of variables, leading to conic programs involving exponential or quadratic cones. Non-rectangular $L_p$ uncertainty balls, while non-decomposable, can be expressed as unions of $sa$-rectangular sets, and their dual solutions exhibit rank-one perturbation structure and tight policy evaluation bounds (Kumar et al., 13 Feb 2025).
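A minimal sketch of one robust Bellman backup, using a hypothetical finite (hence trivially $sa$-rectangular) uncertainty set per state-action pair in place of the $L_p$ balls analyzed in the cited papers:

```python
import numpy as np

gamma = 0.95
r = np.array([[1.0, 0.0], [0.0, 2.0]])       # r[s, a] (illustrative)
# U[s][a] = finite list of candidate next-state distributions (illustrative).
U = [[[np.array([0.9, 0.1]), np.array([0.7, 0.3])],
      [np.array([0.2, 0.8]), np.array([0.4, 0.6])]],
     [[np.array([0.5, 0.5]), np.array([0.6, 0.4])],
      [np.array([0.1, 0.9]), np.array([0.3, 0.7])]]]

def robust_backup(v):
    """(T v)_s = max_a min_{p in U_sa} [ r_sa + gamma * p.v ]."""
    Tv = np.empty(len(U))
    for s in range(len(U)):
        q = [r[s, a] + gamma * min(p @ v for p in U[s][a])  # adversarial inner min
             for a in range(len(U[s]))]
        Tv[s] = max(q)                                      # agent's outer max
    return Tv
```

For continuous uncertainty sets the inner minimization becomes a convex program in $p$, which is where the conic reformulations above enter.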

5. Structural and Algorithmic Insights

MDP formulations enable a range of structural and computational developments:

  • Occupation measure LPs provide a unifying paradigm for finite-horizon, infinite-horizon, vector-valued, or risk-constrained problems (Ying et al., 2020, Dufour et al., 2019, Mifrani et al., 19 Feb 2025).
  • Duality between values and occupancies: The primal (value-function) and dual (state-action occupancy) LPs are strongly dual; dual multipliers in the LP correspond to discounted visitation frequencies, and optimal policies are derived from primal-dual solutions.
  • Exact policy recovery: Complementary slackness in the LP directly yields deterministic or randomized policies, e.g., $\pi^*(a|s)$ supported on actions that are tight in the Bellman inequalities.
  • Dynamic programming and decomposition: Bellman fixed-point equations, multi-objective extensions, and scenario-based decompositions (e.g., for bilevel or robust MDPs) facilitate advanced synthesis and sensitivity analysis (Brown et al., 2023, Grand-Clément et al., 2022).
  • Handling large or infinite spaces: Measure-theoretic LPs extend MDP formulations to Borel spaces, with technical conditions (measurability, semicontinuity, Lyapunov growth) ensuring equivalence and solvability (Costa et al., 2014, Dufour et al., 2019).
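The policy-recovery step above can be sketched as a small helper; the uniform fallback on zero-occupation states is an arbitrary, illustrative choice (any action is optimal on states never visited):

```python
import numpy as np

def policy_from_occupation(mu, eps=1e-12):
    """pi(a|s) = mu(s, a) / sum_a' mu(s, a'); uniform on states with no mass."""
    mu = np.asarray(mu, dtype=float)
    mass = mu.sum(axis=1, keepdims=True)
    return np.where(mass > eps,
                    mu / np.maximum(mass, eps),   # normalize occupied states
                    1.0 / mu.shape[1])            # arbitrary uniform fallback
```

An occupation measure supported on a single action per state yields a deterministic policy, matching the complementary-slackness picture.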

6. Application-Driven MDP Formulations

The versatility of MDP formulation is reflected in diverse applications:

  • Continuous-time and hybrid models: PDMPs (piecewise deterministic Markov processes) use embedded discrete-time chains and infinite-dimensional LPs to capture systems with deterministic flows and stochastic jumps, relevant for stochastic hybrid systems and queuing networks (Costa et al., 2014).
  • Finite-horizon, multi-objective engineering: Vector-LP MDPs address multi-criteria design and scheduling, explicitly characterizing the Pareto-efficient policy set via occupation measures (Mifrani et al., 19 Feb 2025).
  • Resource allocation and social welfare: Water-filling convex programs for fair allocations under demand uncertainty leverage MDP structural properties to design policies with rigorous guarantees on fairness and efficiency (Hassanzadeh et al., 2023).
  • Learning and control in complex settings: MDP reformulations of video generation (Yushchenko et al., 2019), branch-and-bound variable selection (Strang et al., 22 Oct 2025), and satellite scheduling (Eddy et al., 2019) demonstrate the methodological breadth of MDP modeling, including integration with RL and tree search.

7. Theoretical and Computational Guarantees

The equivalences between MDP dynamic programming, LP, and policy-gradient formulations ensure convergence properties and optimality of synthesized policies under broad conditions:

  • Existence of an optimal policy, often stationary and deterministic (or suitably randomized), is guaranteed by strong duality and complementary slackness under mild regularity assumptions (Ying et al., 2020, Dufour et al., 2019, Mifrani et al., 19 Feb 2025).
  • Robust, risk-averse, and bilevel generalizations preserve computational tractability through advanced convex programming, conic formulations, and decomposition schemes, with provable performance bounds (Grand-Clément et al., 2022, Ahmadi et al., 2020, Brown et al., 2023).
  • Occupation measure and LP-based methods generalize seamlessly to constraints, multi-objective, risk, and robustness settings, underpinning both exact and scalable approximate solution algorithms across stochastic control and optimization domains.
