Markov Decision Process Formulation

Updated 6 May 2026
  • MDP formulation is a mathematical framework that models sequential decision-making in stochastic environments using states, actions, probabilistic transitions, and rewards.
  • It employs occupancy measures and flow constraints to encapsulate decision policies, enabling the enumeration of Pareto-optimal deterministic strategies.
  • Advanced formulations extend to risk-sensitive, multi-objective, and entropy-regularized models, leveraging linear programming and scalable algorithms for efficient solutions.

A Markov Decision Process (MDP) provides a rigorous mathematical framework for modeling sequential decision-making in stochastic environments. It describes the interaction of a decision-maker, or agent, with an environment characterized by probabilistic state transitions and reward structures, enabling the formal synthesis and analysis of optimal policies under uncertainty. MDPs form the foundational model for many branches of dynamic optimization, reinforcement learning, and control theory.

1. Canonical MDP Structure and Notation

A classic MDP is defined as a tuple $(S, A, P, r, \gamma)$, with the following components:

  • State space $S$: A finite or Borel-measurable set representing all possible environment configurations.
  • Action space $A$: A set of feasible control actions, possibly state-dependent with $A_s$ available in state $s$.
  • Transition kernels $P(s' \mid s, a)$: The probability law for state evolution, $P(s' \mid s, a) = \Pr\{X_{t+1} = s' \mid X_t = s, Y_t = a\}$.
  • (Vector-valued) reward function $r_t(s, a) \in \mathbb{R}^k$: The immediate (possibly vector-valued) reward for taking action $a$ in state $s$ at time $t$; a terminal reward $r_T(s)$ may also be specified for finite-horizon problems.
  • Discount factor $\gamma \in [0, 1)$: (optional) Governs the trade-off between immediate and long-run rewards (infinite-horizon discounted MDPs).

Policies $\pi$ are mappings specifying a probability distribution over actions for each state (possibly non-stationary or randomized). Expectations and objectives are formed with respect to the induced Markov chains.
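To make the notation concrete, here is a minimal sketch (toy numbers, illustrative only; not taken from any cited paper) of these objects in NumPy, together with the Markov chain induced by a fixed randomized policy and that policy's discounted value:

import numpy as np

# A toy two-state, two-action MDP: S = {0, 1}, A = {0, 1}.
n_states, n_actions = 2, 2

# Transition kernel P[s, a, s'] = Pr{X_{t+1} = s' | X_t = s, Y_t = a}.
P = np.zeros((n_states, n_actions, n_states))
P[0, 0] = [0.9, 0.1]   # in state 0, action 0 mostly stays
P[0, 1] = [0.2, 0.8]   # in state 0, action 1 likely moves to state 1
P[1, 0] = [0.5, 0.5]
P[1, 1] = [0.0, 1.0]

# Immediate reward r(s, a) and discount factor gamma.
r = np.array([[0.0, 1.0],
              [2.0, 0.0]])
gamma = 0.95

# A stationary randomized policy pi[s, a] = Pr{Y_t = a | X_t = s};
# it induces a Markov chain with kernel P_pi[s, s'] = sum_a pi[s, a] P[s, a, s'].
pi = np.array([[0.5, 0.5],
               [1.0, 0.0]])
P_pi = np.einsum('sa,sap->sp', pi, P)
r_pi = (pi * r).sum(axis=1)

# Policy evaluation: v_pi = (I - gamma * P_pi)^{-1} r_pi.
v_pi = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
print(v_pi)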

2. Occupancy Measures, Linear Programming, and Flow Constraints

For both finite- and infinite-horizon settings, the effect of a policy $\pi$ can be algebraically encoded via occupancy measures:

$x_t(s, a) = \Pr^{\pi}\{X_t = s,\; Y_t = a\}$ (finite horizon, $t = 0, \dots, T-1$),

$x(s, a) = \sum_{t=0}^{\infty} \gamma^{t}\, \Pr^{\pi}\{X_t = s,\; Y_t = a\}$ (discounted infinite horizon).

Occupancy variables must satisfy coupled linear flow-balance equations:

  • At $t = 0$ (initial occupation):

$\sum_{a \in A_s} x_0(s, a) = \mu_0(s) \qquad \forall s \in S,$

where $\mu_0$ denotes the initial state distribution.

  • For $t = 1, \dots, T-1$:

$\sum_{a \in A_s} x_t(s, a) = \sum_{s', a'} P(s \mid s', a')\, x_{t-1}(s', a') \qquad \forall s \in S.$

  • Terminal:

$x_T(s) = \sum_{s', a'} P(s \mid s', a')\, x_{T-1}(s', a') \qquad \forall s \in S.$

  • Nonnegativity:

$x_t(s, a) \ge 0 \qquad \forall t, s, a.$

The feasible set of occupancy vectors $x = (x_t(s, a))$ is the polyhedron $X$ cut out by these constraints, and the map sending a policy to its occupancy measure is a bijection between (Markov) policies and points of $X$. This construction is central for both single- and multi-objective MDPs (Mifrani et al., 19 Feb 2025).
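The following sketch (random toy instance; the quantities $x_t$, $\mu_0$, and $P$ correspond to the definitions above) builds the finite-horizon occupancy measures by forward recursion under an arbitrary Markov policy and checks the flow-balance equations numerically:

import numpy as np

rng = np.random.default_rng(0)
n_s, n_a, T = 3, 2, 4                               # horizon t = 0, ..., T-1

# Random toy MDP and a random Markov (time-dependent) policy.
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))    # P[s, a, :] sums to 1
mu0 = rng.dirichlet(np.ones(n_s))                   # initial state distribution
pi = rng.dirichlet(np.ones(n_a), size=(T, n_s))     # pi[t, s, :] over actions

# Forward recursion for the occupancy measures x_t(s, a) = Pr{X_t = s, Y_t = a}.
x = np.zeros((T, n_s, n_a))
state_dist = mu0.copy()
for t in range(T):
    x[t] = state_dist[:, None] * pi[t]              # x_t(s, a) = Pr{X_t = s} * pi_t(a | s)
    state_dist = np.einsum('sa,sap->p', x[t], P)    # push the flow forward

# Flow-balance checks:
assert np.allclose(x[0].sum(axis=1), mu0)           # initial occupation
for t in range(1, T):
    inflow = np.einsum('sa,sap->p', x[t - 1], P)    # sum_{s',a'} P(s|s',a') x_{t-1}(s',a')
    assert np.allclose(x[t].sum(axis=1), inflow)    # equals sum_a x_t(s, a)
print("flow-balance constraints satisfied")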

3. Optimization Formulations: Scalar and Vector-valued Objectives

Classic infinite-horizon, discounted-reward MDPs admit primal and dual linear programming (LP) formulations (Ying et al., 2020):

  • Primal (value-function based):

$\min_{v \in \mathbb{R}^{S}} \;\; \sum_{s} \mu_0(s)\, v(s) \quad \text{subject to} \quad v(s) \;\ge\; r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, v(s') \qquad \forall s \in S,\; a \in A_s.$

  • Dual (state-action occupation):

$\max_{x \ge 0} \;\; \sum_{s, a} r(s, a)\, x(s, a) \quad \text{subject to} \quad \sum_{a} x(s', a) \;-\; \gamma \sum_{s, a} P(s' \mid s, a)\, x(s, a) \;=\; \mu_0(s') \qquad \forall s' \in S.$

  • Multiobjective (vector-valued): For $k$ objectives, the total expected reward vector is

$V^{\pi} \;=\; \mathbb{E}^{\pi}\!\Big[\sum_{t} r_t(X_t, Y_t)\Big] \;=\; \sum_{t, s, a} x_t(s, a)\, r_t(s, a) \;\in\; \mathbb{R}^{k}.$

Optimality is defined in terms of Pareto efficiency: no other achievable reward vector is at least as large in all components and strictly greater in at least one. The set of achievable vectors is the image of the feasible polyhedron $X$ under the linear reward map (Mifrani et al., 19 Feb 2025).

  • Entropy-regularized MDPs: Include additional negative entropy terms yielding soft-max Bellman/Fenchel dual constraints (Ying et al., 2020).
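As a hedged illustration of the dual (occupation-measure) formulation above, the LP for a small random discounted instance can be assembled and solved with scipy.optimize.linprog; the data are synthetic and the solver choice is not that of the cited works:

import numpy as np
from scipy.optimize import linprog

n_s, n_a, gamma = 3, 2, 0.9
rng = np.random.default_rng(1)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))    # P[s, a, s']
r = rng.random((n_s, n_a))
mu0 = np.full(n_s, 1.0 / n_s)

# Dual LP over occupation measures x(s, a), flattened to a vector of length n_s * n_a:
#   max  sum_{s,a} r(s,a) x(s,a)
#   s.t. sum_a x(s', a) - gamma * sum_{s,a} P(s'|s,a) x(s,a) = mu0(s'),  x >= 0.
A_eq = np.zeros((n_s, n_s * n_a))
for s_next in range(n_s):
    for s in range(n_s):
        for a in range(n_a):
            A_eq[s_next, s * n_a + a] = (s == s_next) - gamma * P[s, a, s_next]
res = linprog(c=-r.ravel(), A_eq=A_eq, b_eq=mu0, bounds=(0, None))

x = res.x.reshape(n_s, n_a)
policy = x.argmax(axis=1)    # action carrying the occupation mass in each state
print("optimal value:", -res.fun, "deterministic policy:", policy)

Because an optimal basic solution puts the occupation mass of each visited state on a single action, reading off the argmax recovers an optimal deterministic stationary policy.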

The vector LP formulation for a finite-horizon MDP with vector rewards is given by

$\operatorname{vmax}_{x \in X} \;\; \sum_{t, s, a} x_t(s, a)\, r_t(s, a),$

where the vector maximum is understood in the Pareto sense over the flow-balance polyhedron $X$. Pareto-efficient solutions correspond precisely to LP Pareto optima: there is no feasible $x' \in X$ whose reward vector weakly dominates that of $x$ and strictly improves some component (Mifrani et al., 19 Feb 2025).
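A standard way to exhibit one Pareto-efficient solution of such a vector LP (for example, the initial efficient vertex required by the enumeration procedure of the next section) is to maximize a strictly positive weighted combination of the objectives over the same polyhedron. The sketch below is illustrative and, for brevity, reuses the discounted flow polyhedron from the previous snippet rather than the time-indexed finite-horizon constraints:

import numpy as np
from scipy.optimize import linprog

n_s, n_a, k, gamma = 3, 2, 2, 0.9
rng = np.random.default_rng(2)
P = rng.dirichlet(np.ones(n_s), size=(n_s, n_a))         # P[s, a, s']
R = rng.random((k, n_s, n_a))                            # vector-valued rewards
mu0 = np.full(n_s, 1.0 / n_s)

# Same flow-balance polyhedron as in the scalar dual LP above.
A_eq = np.zeros((n_s, n_s * n_a))
for s_next in range(n_s):
    for s in range(n_s):
        for a in range(n_a):
            A_eq[s_next, s * n_a + a] = (s == s_next) - gamma * P[s, a, s_next]

# Scalarize with strictly positive weights: any optimum of w^T R x is Pareto-efficient.
w = np.array([0.7, 0.3])
c = -(w @ R.reshape(k, -1))                              # linprog minimizes
res = linprog(c=c, A_eq=A_eq, b_eq=mu0, bounds=(0, None))
x = res.x
print("achieved reward vector:", R.reshape(k, -1) @ x)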

4. Characterization and Enumeration of Optimal Policies

Extreme points of the polyhedron $X$ (i.e., basic feasible solutions of the vector LP) correspond exactly to deterministic policies. Pareto efficiency in policy space coincides with Pareto efficiency in the occupancy variable space. An explicit graph search (ENUMEFFICIENT) traverses the adjacency structure of $X$'s vertices: starting from an initial efficient vertex (found by a scalarization), the algorithm pivots via adjacency (simplex steps), testing each adjacent vertex for efficiency. This process enumerates the complete set of Pareto-efficient deterministic policies (Mifrani et al., 19 Feb 2025).

Given a Pareto-optimal occupancy vector $x^{*}$, the deterministic policy $\pi^{*}$ is reconstructed componentwise by normalizing: $\pi^{*}_t(a \mid s) = x^{*}_t(s, a) \big/ \sum_{a'} x^{*}_t(s, a')$ for each time $t$ and each state $s$ with positive occupancy.
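A minimal helper implementing this normalization (the tie-breaking convention for states with zero occupancy is an arbitrary choice, not prescribed by the cited work):

import numpy as np

def policy_from_occupancy(x, tol=1e-12):
    """Recover a Markov policy pi_t(a | s) = x_t(s, a) / sum_a' x_t(s, a')
    from a finite-horizon occupancy array x of shape (T, n_s, n_a).
    States with zero occupancy are never visited; action 0 is assigned arbitrarily."""
    pi = np.zeros_like(x)
    mass = x.sum(axis=2)                       # total occupancy of each (t, s)
    visited = mass > tol
    pi[visited] = x[visited] / mass[visited, None]
    pi[~visited, 0] = 1.0                      # arbitrary choice on unvisited states
    return pi

At a Pareto-efficient extreme point, each visited pair $(t, s)$ carries its occupancy on a single action, so the recovered policy is deterministic.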

This algebraic approach allows exhaustive characterization of all efficient deterministic strategies for finite-horizon, multiobjective MDPs, facilitating both completeness and explicit policy synthesis.

5. Extensions and Advanced Problem Classes

The LP-based paradigm and occupancy-measure framework extend naturally to extensive classes of MDP variations:

  • Distributionally Robust and Risk-sensitive MDPs: Models with parameter uncertainty, e.g., unknown transition kernels and random rewards, admit generalized convex or conic programming formulations (SOCP, MISOCP, copositive, or biconvex) under suitable uncertainty sets (Nguyen et al., 2022, Lin et al., 2021, Grand-Clément et al., 2022).
  • Expectation and Constraints: Infinite-horizon expected-total-reward problems with Borel state-action spaces and constraints reduce to infinite-dimensional convex programs over generalized occupation measures, maintaining equivalence between convex-analytic and policy-based optima under mild regularity and Slater-type conditions (Dufour et al., 2019).
  • Belief-MDPs and Privacy: Partially observable (e.g., privacy-preserving data sharing) domains are reformulated as belief-space MDPs, with particle-based approximations yielding finite-state tractable surrogates (Yu et al., 4 Feb 2026); a generic particle-update sketch follows this list.
  • Vector-valued and Multi-objective Control: Full characterization of trade-offs among conflicting objectives is afforded by the polyhedral structure of occupancy measures; efficient enumeration of all Pareto-optimal deterministic strategies becomes tractable (Mifrani et al., 19 Feb 2025).
  • Augmented and Exogenous Processes: Externally driven non-stationarity, e.g., MDPs with exogenous temporal processes, are rigorously addressed via history-augmented state spaces and policy iteration over finite-memory truncations, with explicit bounds on the suboptimality introduced by memory truncation (Ayyagari et al., 2023).
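As one concrete ingredient of the belief-MDP reduction mentioned in the list above, the following is a generic bootstrap particle-filter belief update on a hypothetical two-state toy model (the callables transition and obs_likelihood are assumptions for illustration); it sketches the particle idea, not the specific construction of (Yu et al., 4 Feb 2026):

import numpy as np

rng = np.random.default_rng(3)

def particle_belief_update(particles, action, observation, transition, obs_likelihood):
    """One bootstrap-filter step approximating the belief-MDP transition:
    propagate each particle through the transition sampler, weight by the
    observation likelihood, and resample to keep a fixed-size particle set."""
    propagated = np.array([transition(s, action) for s in particles])
    weights = np.array([obs_likelihood(observation, s) for s in propagated])
    if weights.sum() == 0.0:                  # observation incompatible with all particles
        weights = np.ones_like(weights)
    weights = weights / weights.sum()
    idx = rng.choice(len(propagated), size=len(propagated), p=weights)
    return propagated[idx]                    # new particle approximation of the belief

# Illustrative two-state hidden chain with noisy observations (toy model only).
def transition(s, a):
    return s if rng.random() < 0.8 else 1 - s   # toy dynamics ignore the action

def obs_likelihood(o, s):
    return 0.9 if o == s else 0.1

belief = rng.integers(0, 2, size=200)            # initial particle set
belief = particle_belief_update(belief, action=0, observation=1,
                                transition=transition, obs_likelihood=obs_likelihood)
print("estimated Pr{state = 1}:", belief.mean())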

6. Practical Algorithms and Scalability

Solving large-scale MDPs requires leveraging compressed representations and scalable algorithms:

  • Tensor Decomposition: For high-dimensional finite MDPs, the transition tensor is compressed via CP (CANDECOMP–PARAFAC) decomposition, reducing per-iteration and memory complexity by orders of magnitude and enabling the solution of problems with very large state spaces (Kuinchtner et al., 2021); a factored-backup sketch follows this list.
  • Policy Iteration with Adaptive Enhancements: Heuristics such as adaptive edge-pruning in policy iteration address sudden non-stationarity or non-Markovian exogenous processes (Biemer et al., 2023, Ayyagari et al., 2023).
  • Enumeration Algorithms: Graph traversal and simplex-pivoting enable full enumeration of the efficient deterministic policy set for multiobjective formulations (Mifrani et al., 19 Feb 2025).
  • Entropy Regularization and Risk Measures: Sophisticated problem classes admit solution via primal-dual convex conic programs, regularized Bellman operators, and recursive convex approximation techniques for risk-sensitive objectives (Mifrani et al., 19 Feb 2025, Lin et al., 2021, Grand-Clément et al., 2022).
  • Online and Approximate Scheduling: Practical low-complexity heuristics that closely track the LP-MDP optimum are derived by mapping predictive models (e.g., AR(1) for channel throughput) onto resource allocation policies (Chen et al., 2012).
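To make the tensor-compression idea concrete, the sketch below performs a Bellman backup directly on an assumed CP factorization of the transition kernel and checks it against a dense backup on a small instance; the factors are synthetic, not the fitted decomposition of (Kuinchtner et al., 2021):

import numpy as np

rng = np.random.default_rng(4)
n_s, n_a, rank, gamma = 50, 4, 6, 0.95

# Suppose the transition tensor is given (or approximated) in CP form,
#   P[s, a, s'] = sum_r U[s, r] * W[a, r] * V[s', r] / Z[s, a],
# with each V[:, r] a probability distribution over next states and Z the
# per-(s, a) normalization, so each P[s, a, :] sums to 1.
V = rng.dirichlet(np.ones(n_s), size=rank).T          # (n_s, rank), columns are distributions
U_raw, W_raw = rng.random((n_s, rank)), rng.random((n_a, rank))
Z = U_raw @ W_raw.T                                    # normalization per (s, a)
r = rng.random((n_s, n_a))
value = rng.random(n_s)

# Factored Bellman backup: never materializes the n_s * n_a * n_s tensor.
proj = V.T @ value                                     # sum_{s'} V[s', r] * value[s']
q_fact = r + gamma * (U_raw * proj) @ W_raw.T / Z

# Dense reference backup, materialized only to check the algebra on this small instance.
P = np.einsum('sr,ar,pr->sap', U_raw, W_raw, V) / Z[:, :, None]
q_dense = r + gamma * np.einsum('sap,p->sa', P, value)
print("max backup discrepancy:", np.abs(q_fact - q_dense).max())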

7. Theoretical Foundations and Equivalence Results

Key theoretical guarantees underpin the MDP formulation:

  • The occupancy measure mapping is a bijection between regular policies and the flow-balance polyhedron (Mifrani et al., 19 Feb 2025).
  • All Pareto-efficient finite-horizon MDP policies are obtainable as vertices (extreme points) of the vector-LP feasible polyhedron.
  • In infinite-horizon average or total-reward settings, policy and convex-analytic optima coincide under mild compactness and regularity (Dufour et al., 2019).
  • For robust and risk-averse classes, entropic or risk-aggregation operations (e.g., CVaR, Wasserstein, $\phi$-divergence) yield convex or conic programs with explicit solution structure (Lin et al., 2021, Nguyen et al., 2022, Grand-Clément et al., 2022).
  • Policy iteration and Bellman recursion structures are preserved under augmented state spaces, external processes, and belief-MDP reductions (Ayyagari et al., 2023, Yu et al., 4 Feb 2026).

The mathematical structure of the MDP formulation—spanning stochastic processes, convex geometry, and optimization—enables rigorous synthesis and complete characterization of optimal, and Pareto-optimal, decision strategies across a wide range of domains and theoretical extensions.
