Markov Decision Process Formulation
- MDP formulation is a mathematical framework that models sequential decision-making in stochastic environments using states, actions, probabilistic transitions, and rewards.
- It employs occupancy measures and flow constraints to encapsulate decision policies, enabling the enumeration of Pareto-optimal deterministic strategies.
- Advanced formulations extend to risk-sensitive, multi-objective, and entropy-regularized models, leveraging linear programming and scalable algorithms for efficient solutions.
A Markov Decision Process (MDP) provides a rigorous mathematical framework for modeling sequential decision-making in stochastic environments. It describes the interaction of a decision-maker, or agent, with an environment characterized by probabilistic state transitions and reward structures, enabling the formal synthesis and analysis of optimal policies under uncertainty. MDPs form the foundational model for many branches of dynamic optimization, reinforcement learning, and control theory.
1. Canonical MDP Structure and Notation
A classic MDP is defined as a tuple (S, A, P, r, γ), with the following components:
- State space S: A finite or Borel-measurable set representing all possible environment configurations.
- Action space A: A set of feasible control actions, possibly state-dependent, with A(s) available in state s.
- Transition kernels P: The probability law for state evolution, P(s' | s, a) = Pr(s_{t+1} = s' | s_t = s, a_t = a).
- (Vector-valued) reward function r: The immediate (possibly vector-valued) reward r(s, a) received when taking action a in state s; terminal rewards r_N(s) may also be specified for finite-horizon problems.
- Discount factor γ ∈ [0, 1): (optional) Governs the trade-off between immediate and long-run rewards (infinite-horizon discounted MDPs).
Policies π are mappings π_t(a | s) specifying a probability distribution over actions for each state (possibly non-stationary or randomized). Expectations and objectives are formed with respect to the induced Markov chains.
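As a concrete sketch, a small finite MDP can be stored as plain arrays and its discounted objective solved by iterating the Bellman optimality operator; the 2-state, 2-action instance below is made up purely for illustration:

```python
import numpy as np

# A minimal finite MDP: 2 states, 2 actions (all numbers illustrative).
# P[a, s, s'] = probability of moving from s to s' under action a.
P = np.array([
    [[0.9, 0.1], [0.2, 0.8]],   # action 0
    [[0.5, 0.5], [0.7, 0.3]],   # action 1
])
r = np.array([[1.0, 0.0], [0.0, 2.0]])   # r[s, a]: immediate reward
gamma = 0.95

def bellman_backup(v):
    """One step of the Bellman optimality operator:
    (Tv)(s) = max_a [ r(s, a) + gamma * sum_{s'} P(s'|s, a) v(s') ]."""
    q = r + gamma * np.einsum("ast,t->sa", P, v)   # q[s, a]
    return q.max(axis=1)

# Value iteration: T is a gamma-contraction, so iterating it converges to v*.
v = np.zeros(2)
for _ in range(1000):
    v_new = bellman_backup(v)
    if np.max(np.abs(v_new - v)) < 1e-10:
        break
    v = v_new
greedy_policy = (r + gamma * np.einsum("ast,t->sa", P, v)).argmax(axis=1)
```

The greedy policy read off the converged values is an optimal stationary deterministic policy for this discounted instance.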
2. Occupancy Measures, Linear Programming, and Flow Constraints
For both finite- and infinite-horizon settings, the effect of a policy π can be algebraically encoded via occupancy measures. In the finite-horizon case (horizon N, initial distribution α),
x_t(s, a) = P^π(s_t = s, a_t = a), t = 0, …, N−1,
and in the discounted infinite-horizon case,
μ^π(s, a) = Σ_{t ≥ 0} γ^t P^π(s_t = s, a_t = a).
Occupancy variables must satisfy coupled linear flow-balance equations:
- Initial occupation (t = 0):
Σ_a x_0(s, a) = α(s) for all s.
- For t = 1, …, N−1:
Σ_a x_t(s, a) = Σ_{s', a'} P(s | s', a') x_{t−1}(s', a') for all s.
- Terminal:
x_N(s) = Σ_{s', a'} P(s | s', a') x_{N−1}(s', a') for all s.
- Nonnegativity:
x_t(s, a) ≥ 0.
The feasible set of occupancy vectors is the polyhedron Q cut out by these equations, and the correspondence π ↦ x^π is a bijection between policies and points of Q. This construction is central for both single- and multi-objective MDPs (Mifrani et al., 19 Feb 2025).
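The flow-balance equations can be checked numerically: propagate an arbitrary randomized policy forward to obtain its occupancy measure, then verify each balance constraint. All sizes and data below are randomly generated for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, N = 3, 2, 4                                   # illustrative sizes
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)  # P[a, s, s']
alpha = np.full(S, 1.0 / S)                          # initial distribution

# A random time-varying, randomized policy pi[t, s, a].
pi = rng.random((N, S, A)); pi /= pi.sum(axis=2, keepdims=True)

# Forward recursion: x[t, s, a] = P^pi(s_t = s, a_t = a).
x = np.zeros((N, S, A))
state_dist = alpha.copy()
for t in range(N):
    x[t] = state_dist[:, None] * pi[t]
    # Next state distribution: sum_{s,a} P(s'|s,a) x_t(s,a).
    state_dist = np.einsum("sa,ast->t", x[t], P)

# Flow balance: initial occupation, then interior stages.
assert np.allclose(x[0].sum(axis=1), alpha)
for t in range(1, N):
    assert np.allclose(x[t].sum(axis=1), np.einsum("sa,ast->t", x[t - 1], P))
assert np.all(x >= 0)
```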
3. Optimization Formulations: Scalar and Vector-valued Objectives
Classic infinite-horizon, discounted-reward MDPs admit primal and dual linear programming (LP) formulations (Ying et al., 2020):
- Primal (value-function based):
minimize Σ_s α(s) v(s) subject to v(s) ≥ r(s, a) + γ Σ_{s'} P(s' | s, a) v(s') for all (s, a).
- Dual (state-action occupation):
maximize Σ_{s,a} r(s, a) μ(s, a) subject to Σ_a μ(s, a) − γ Σ_{s',a'} P(s | s', a') μ(s', a') = α(s) for all s, with μ ≥ 0.
- Multiobjective (vector-valued): For d objectives, the total expected reward vector is
V^π = (V_1^π, …, V_d^π), where V_i^π = Σ_{s,a} r_i(s, a) μ^π(s, a).
Optimality is defined in terms of Pareto efficiency: no other achievable vector is at least as large in all components and strictly greater in one. The set of achievable vectors is R(Q), the image of the feasible polyhedron Q under the linear reward map R (Mifrani et al., 19 Feb 2025).
- Entropy-regularized MDPs: Include additional negative-entropy terms, yielding soft-max Bellman operators and Fenchel-dual constraints (Ying et al., 2020).
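As a sketch of the dual formulation, the occupancy-measure LP for a small discounted MDP can be solved directly with scipy.optimize.linprog; the instance data below is randomly generated and purely illustrative:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
S, A, gamma = 3, 2, 0.9                              # illustrative sizes
P = rng.random((A, S, S)); P /= P.sum(axis=2, keepdims=True)  # P[a, s, s']
r = rng.random((S, A))
alpha = np.full(S, 1.0 / S)

# Variables mu(s, a), flattened with index s*A + a.
# Constraint for state s:
#   sum_a mu(s, a) - gamma * sum_{s', a'} P(s | s', a') mu(s', a') = alpha(s).
A_eq = np.zeros((S, S * A))
for s in range(S):
    for s2 in range(S):
        for a in range(A):
            A_eq[s, s2 * A + a] = (s == s2) - gamma * P[a, s2, s]

# linprog minimizes, so negate the reward vector to maximize mu . r.
res = linprog(c=-r.flatten(), A_eq=A_eq, b_eq=alpha,
              bounds=[(0, None)] * (S * A))
mu = res.x.reshape(S, A)

# An optimal deterministic policy is read off the support of mu.
policy = mu.argmax(axis=1)
```

Summing the equality constraints over s shows that any feasible μ has total mass 1/(1 − γ), a useful sanity check on the solver output.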
The vector LP formulation for a finite-horizon MDP with vector rewards is given by
vmax { R(x) : x ∈ Q },  where R_i(x) = Σ_{t=0}^{N−1} Σ_{s,a} r_{i,t}(s, a) x_t(s, a) + Σ_s r_{i,N}(s) x_N(s), i = 1, …, d.
Pareto-efficient solutions correspond precisely to LP Pareto optima: there is no x' ∈ Q with R(x') ≥ R(x) componentwise and R(x') ≠ R(x) (Mifrani et al., 19 Feb 2025).
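The componentwise dominance test underlying Pareto efficiency is straightforward to implement for a finite set of candidate reward vectors (the sample vectors below are hypothetical):

```python
import numpy as np

def is_pareto_efficient(vectors):
    """Boolean mask: vectors[i] is Pareto-efficient iff no vectors[j]
    dominates it (>= in every component and > in at least one)."""
    V = np.asarray(vectors, dtype=float)
    efficient = np.ones(len(V), dtype=bool)
    for i in range(len(V)):
        dominates = np.all(V >= V[i], axis=1) & np.any(V > V[i], axis=1)
        if dominates.any():
            efficient[i] = False
    return efficient

# Example: [0.5, 0.5] is dominated by both other vectors.
mask = is_pareto_efficient([[1.0, 2.0], [2.0, 1.0], [0.5, 0.5]])
```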
4. Characterization and Enumeration of Optimal Policies
Extreme points of the polyhedron Q (i.e., basic feasible solutions of the vector LP) correspond exactly to deterministic policies, and Pareto efficiency in policy space coincides with Pareto efficiency in occupancy space. An explicit graph search (ENUMEFFICIENT) traverses the adjacency structure of Q's vertices: starting from an initial efficient vertex (found by a scalarization), the algorithm pivots via adjacency (simplex steps), testing each adjacent vertex for efficiency. This process enumerates the complete set of Pareto-efficient deterministic policies (Mifrani et al., 19 Feb 2025).
Given a Pareto-optimal occupancy vector x, the deterministic policy π is reconstructed componentwise by normalizing: π_t(a | s) = x_t(s, a) / Σ_{a'} x_t(s, a') for each state s and stage t with positive total occupancy.
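This normalization takes a few lines; the array shapes and the uniform fallback for unvisited state-stage pairs are illustrative choices:

```python
import numpy as np

def policy_from_occupancy(x, tol=1e-12):
    """Recover pi_t(a | s) = x_t(s, a) / sum_{a'} x_t(s, a') from an
    occupancy array x of shape (N, S, A); state-stage pairs with zero
    total occupancy (never visited) get an arbitrary uniform choice."""
    N, S, A = x.shape
    totals = x.sum(axis=2, keepdims=True)                     # (N, S, 1)
    return np.where(totals > tol, x / np.maximum(totals, tol), 1.0 / A)

# Example: an occupancy vector supported on one action per visited state
# yields a deterministic (0/1) policy at those states.
x = np.zeros((2, 2, 2))
x[0, 0, 1] = 0.5; x[0, 1, 0] = 0.5; x[1, 0, 0] = 1.0
pi = policy_from_occupancy(x)
```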
This algebraic approach allows exhaustive characterization of all efficient deterministic strategies for finite-horizon, multiobjective MDPs, facilitating both completeness and explicit policy synthesis.
5. Extensions and Advanced Problem Classes
The LP-based paradigm and occupancy-measure framework extend naturally to a broad range of MDP variants:
- Distributionally Robust and Risk-sensitive MDPs: Models with parameter uncertainty, e.g., unknown transition kernels and random rewards, admit generalized convex or conic programming formulations (SOCP, MISOCP, copositive, or biconvex) under suitable uncertainty sets (Nguyen et al., 2022, Lin et al., 2021, Grand-Clément et al., 2022).
- Expectation and Constraints: Infinite-horizon expected-total-reward problems with Borel state-action spaces and constraints reduce to infinite-dimensional convex programs over generalized occupation measures, maintaining equivalence between convex-analytic and policy-based optima under mild regularity and Slater-type conditions (Dufour et al., 2019).
- Belief-MDPs and Privacy: Partially observable (e.g., privacy-preserving data sharing) domains are reformulated as belief-space MDPs, with particle-based approximations yielding finite-state tractable surrogates (Yu et al., 4 Feb 2026).
- Vector-valued and Multi-objective Control: Full characterization of trade-offs among conflicting objectives is afforded by the polyhedral structure of occupancy measures; efficient enumeration of all Pareto-optimal deterministic strategies becomes tractable (Mifrani et al., 19 Feb 2025).
- Augmented and Exogenous Processes: Externally driven non-stationarity, e.g., MDPs with exogenous temporal processes, are rigorously addressed via history-augmented state spaces and policy iteration over finite-memory truncations, with explicit bounds on the suboptimality introduced by memory truncation (Ayyagari et al., 2023).
6. Practical Algorithms and Scalability
Solving large-scale MDPs requires leveraging compressed representations and scalable algorithms:
- Tensor Decomposition: For high-dimensional finite MDPs, transition tensors are compressed via CP (CANDECOMP–PARAFAC) decomposition, reducing per-iteration and memory complexity by orders of magnitude and enabling the solution of problems with state spaces far beyond the reach of dense representations (Kuinchtner et al., 2021).
- Policy Iteration with Adaptive Enhancements: Heuristics such as adaptive edge-pruning in policy iteration address sudden non-stationarity or non-Markovian exogenous processes (Biemer et al., 2023, Ayyagari et al., 2023).
- Enumeration Algorithms: Graph traversal and simplex-pivoting enable full enumeration of the efficient deterministic policy set for multiobjective formulations (Mifrani et al., 19 Feb 2025).
- Entropy Regularization and Risk Measures: Sophisticated problem classes admit solution via primal-dual convex conic programs, regularized Bellman operators, and recursive convex approximation techniques for risk-sensitive objectives (Mifrani et al., 19 Feb 2025, Lin et al., 2021, Grand-Clément et al., 2022).
- Online and Approximate Scheduling: Practical low-complexity heuristics that closely track the LP-MDP optimum are derived by mapping predictive models (e.g., AR(1) for channel throughput) onto resource allocation policies (Chen et al., 2012).
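A minimal sketch of such a scheduling heuristic, assuming a known AR(1) coefficient and a greedy "serve the best predicted channel" rule (all parameters and the update model below are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
n_users, horizon, a = 3, 50, 0.8     # illustrative parameters
c = rng.random(n_users)              # current per-user throughput estimates

served = []
for t in range(horizon):
    prediction = a * c               # AR(1) one-step forecast (zero-mean noise)
    served.append(int(prediction.argmax()))   # greedy allocation
    # Environment evolves according to the assumed AR(1) dynamics.
    c = a * c + 0.1 * rng.standard_normal(n_users)
```

The point of such heuristics is that a cheap one-step predictor, plugged into a greedy allocation rule, can closely track the LP-MDP optimum without solving the full MDP online.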
7. Theoretical Foundations and Equivalence Results
Key theoretical guarantees underpin the MDP formulation:
- The occupancy measure mapping is a bijection between regular policies and the flow-balance polyhedron (Mifrani et al., 19 Feb 2025).
- All Pareto-efficient finite-horizon MDP policies are obtainable as vertices (extreme points) of the vector-LP feasible polyhedron.
- In infinite-horizon average or total-reward settings, policy and convex-analytic optima coincide under mild compactness and regularity (Dufour et al., 2019).
- For robust and risk-averse classes, entropic or risk-aggregation operations (e.g., CVaR, Wasserstein, φ-divergence) yield convex or conic programs with explicit solution structure (Lin et al., 2021, Nguyen et al., 2022, Grand-Clément et al., 2022).
- Policy iteration and Bellman recursion structures are preserved under augmented state spaces, external processes, and belief-MDP reductions (Ayyagari et al., 2023, Yu et al., 4 Feb 2026).
The mathematical structure of the MDP formulation—spanning stochastic processes, convex geometry, and optimization—enables rigorous synthesis and complete characterization of optimal, and Pareto-optimal, decision strategies across a wide range of domains and theoretical extensions.