Discrete Markov Decision Process

Updated 2 June 2026
  • A discrete MDP is a formal framework for sequential decision-making under uncertainty with countable state and action spaces.
  • Its solution is a policy that maximizes cumulative, typically discounted, reward over a potentially infinite horizon under stochastic transitions.
  • Recent advances include convex programming formulations, non-deterministic policy strategies, and mean-field approximations for large-scale systems.

A discrete Markov decision process (MDP) is a mathematical formalism for modeling sequential decision-making under uncertainty, where a controller observes the current state of a dynamic system, selects an admissible action, and the system evolves in a stochastic manner according to transition probabilities. The solution of an MDP prescribes a policy that optimizes a reward criterion over a potentially infinite time horizon, often subject to constraints. Discrete MDPs structure the state and action spaces as countable sets, though more general formulations allow for Borel subsets of metric spaces, with a stochastic kernel governing transitions. Recent advances address convex-analytic formulations, non-deterministic policies for human-in-the-loop contexts, and convergence to continuous optimization via mean-field approximations.

1. Formal Definition and Structure

A discrete MDP is commonly defined as a 5-tuple $(S, A, P, R, \gamma)$, where:

  • $S$: finite or countable state space;
  • $A$: finite or countable action space, with $A(s) \subset A$ the set of actions admissible in state $s$;
  • $P(s' \mid s, a)$: transition kernel specifying the probability of entering state $s'$ given current state $s$ and action $a$;
  • $R(s, a)$: expected immediate reward from taking action $a$ in state $s$;
  • $\gamma$: discount factor for future rewards.

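At each stage $t$ the controller selects action $a_t \in A(s_t)$, the state transitions to $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and an immediate reward $R(s_t, a_t)$ is accrued. The objective is to maximize an aggregate reward criterion, such as the expected total reward $\mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t)\right]$ under policy $\pi$ (Fard et al., 2014).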
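The definition above can be made concrete with a small finite example. The following is a minimal sketch, assuming a hypothetical two-state, two-action MDP with made-up transition probabilities and rewards and a discount factor of 0.9; it uses standard value iteration, which is one common way to maximize the discounted expected total reward and is not taken from the cited papers.

```python
# Hypothetical two-state, two-action MDP; every number here is illustrative.
S = [0, 1]
A = {0: [0, 1], 1: [0, 1]}            # admissible actions A(s)
P = {                                  # P[(s, a)][s'] = P(s' | s, a)
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.0, 1: 1.0},
}
R = {(0, 0): 0.0, (0, 1): -1.0, (1, 0): 2.0, (1, 1): 1.0}
GAMMA = 0.9                            # assumed discount factor


def q_value(s, a, V):
    """One-step lookahead: R(s, a) + gamma * sum_s' P(s' | s, a) V(s')."""
    return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items())


def value_iteration(tol=1e-10, max_iter=100_000):
    """Repeat the Bellman optimality backup until successive iterates agree."""
    V = {s: 0.0 for s in S}
    for _ in range(max_iter):
        V_new = {s: max(q_value(s, a, V) for a in A[s]) for s in S}
        converged = max(abs(V_new[s] - V[s]) for s in S) < tol
        V = V_new
        if converged:
            break
    # Greedy deterministic policy with respect to the converged values.
    policy = {s: max(A[s], key=lambda a: q_value(s, a, V)) for s in S}
    return V, policy


V_star, pi_star = value_iteration()
```

Dictionaries keyed by (state, action) keep the sketch close to the tuple notation; a larger model would more likely use arrays.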

Generalizing beyond finite spaces, $(S, A, P, R)$ denotes the system where $S$, $A$ are Borel subsets of complete, separable metric spaces, $P$ is the transition kernel on $S$ given $S \times A$, and $R$ is a possibly signed reward. Admissible actions $A(s)$ are specified per state. Policies can be general (history-dependent) or stationary randomized, the latter mapping the current state to a randomized action distribution (Dufour et al., 2019).

2. Optimality Criteria and Policy Classes

The most common performance criterion is the (discounted or undiscounted) expected total reward (ETR):

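$$V^{\pi}(s) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, s_0 = s\right]$$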

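Constraints on cumulative costs, such as $\mathbb{E}^{\pi}\left[\sum_{t} C_i(s_t, a_t)\right] \le d_i$ for cost functions $C_i$ and bounds $d_i$, may also be imposed, leading to constrained MDPs (Dufour et al., 2019).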

Policy spaces admit various subclasses:

  • Deterministic policies: mapping states to a single action.
  • Stationary randomized policies: at each state, select an action according to a fixed probability kernel $\pi(a \mid s)$, independent of history.
  • Non-deterministic policies: mapping each state to a nonempty subset of admissible actions $\Pi(s) \subseteq A(s)$, allowing for agent or human selection among allowed actions at execution time (Fard et al., 2014).
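The three policy classes differ mainly in what a state is mapped to. As a minimal sketch, assuming a hypothetical two-state MDP with made-up numbers and a discount factor of 0.9, the snippet below represents each class as a plain dictionary and evaluates a stationary randomized policy by iterating its Bellman equation; a deterministic policy is handled as the special case of a point-mass kernel.

```python
# Hypothetical two-state, two-action MDP; all numbers are illustrative.
S = [0, 1]
A = {0: [0, 1], 1: [0, 1]}
P = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.0, 1: 1.0},
}
R = {(0, 0): 0.0, (0, 1): -1.0, (1, 0): 2.0, (1, 1): 1.0}
GAMMA = 0.9

deterministic = {0: 1, 1: 0}                               # state -> one action
randomized = {0: {0: 0.5, 1: 0.5}, 1: {0: 1.0}}            # state -> distribution
non_deterministic = {0: {0, 1}, 1: {0}}                    # state -> nonempty action set


def as_kernel(det_policy):
    """View a deterministic policy as a point-mass probability kernel."""
    return {s: {a: 1.0} for s, a in det_policy.items()}


def evaluate(kernel, tol=1e-12, max_iter=100_000):
    """Iterate V(s) = sum_a pi(a|s) [R(s,a) + gamma * sum_s' P(s'|s,a) V(s')]."""
    V = {s: 0.0 for s in S}
    for _ in range(max_iter):
        V_new = {
            s: sum(
                w * (R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items()))
                for a, w in kernel[s].items()
            )
            for s in S
        }
        converged = max(abs(V_new[s] - V[s]) for s in S) < tol
        V = V_new
        if converged:
            break
    return V


V_randomized = evaluate(randomized)
V_deterministic = evaluate(as_kernel(deterministic))
```

A non-deterministic policy has no single value until a selection rule is fixed, which is why it is only represented here and not evaluated.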

A foundational result (Schäl's theorem) asserts that, under continuity-compactness and finiteness conditions, the supremum of the expected total reward over all admissible randomized policies equals that over stationary randomized policies for upper-semicontinuous reward/cost functions (Dufour et al., 2019).

3. Convex-Analytic and Occupation Measure Formulations

Convex programming provides a powerful framework for discrete-time MDPs under ETR, extending to Borel state/action spaces, signed rewards/costs, and multiple constraints (Dufour et al., 2019). The core construct is the occupation measure $\mu^{\pi}$, defined for measurable $\Gamma \subset S \times A$ by:

$$\mu^{\pi}(\Gamma) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \mathbf{1}\{(s_t, a_t) \in \Gamma\}\right]$$

which satisfies a balance (characteristic) equation.

Introducing pairs of nonnegative kernels allows representation of measures suitable for signed costs. The convex program (CP) is:

  • maximize the expected total reward, expressed as a functional of the kernel pair,
  • subject to bounds on the expected total costs for constraints,
  • with the kernel pair restricted to a feasible set that encodes the balance equation and sign conditions.

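Key results establish equivalence of the optimal constrained control value and the convex program optimum. Any optimizer for CP induces an optimal stationary randomized policy, which for countable spaces can be written in terms of the optimizer's occupation measure $\mu^{*}$ as

$$\pi^{*}(a \mid s) = \frac{\mu^{*}(s, a)}{\sum_{a' \in A(s)} \mu^{*}(s, a')}$$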

on the support set. The kernel-pair (CP) formulation generalizes previous LP approaches, supports signed costs, and weakens regularity/absolute continuity requirements (Dufour et al., 2019).
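To illustrate the occupation-measure idea in the simplest setting, the sketch below works with a finite, discounted MDP rather than the Borel-space, undiscounted kernel-pair program of Dufour et al. It assumes a hypothetical two-state model, an initial distribution `ALPHA`, and a fixed stationary randomized policy `pi`, all invented for illustration. It accumulates the discounted occupation measure, checks the discounted form of the balance equation, and recovers the policy by normalizing the measure on states with positive mass.

```python
# Hypothetical finite discounted MDP; all numbers are illustrative.
S = [0, 1]
A = [0, 1]
P = {
    (0, 0): {0: 0.9, 1: 0.1},
    (0, 1): {0: 0.2, 1: 0.8},
    (1, 0): {0: 0.5, 1: 0.5},
    (1, 1): {0: 0.0, 1: 1.0},
}
R = {(0, 0): 0.0, (0, 1): -1.0, (1, 0): 2.0, (1, 1): 1.0}
GAMMA = 0.9
ALPHA = {0: 1.0, 1: 0.0}                                   # assumed initial distribution
pi = {0: {0: 0.5, 1: 0.5}, 1: {0: 0.25, 1: 0.75}}          # assumed stationary policy


def occupation_measure(policy, horizon=2000):
    """mu(s, a) = sum_t gamma^t Pr(s_t = s, a_t = a), truncated at `horizon`."""
    mu = {(s, a): 0.0 for s in S for a in A}
    d = dict(ALPHA)                                        # state distribution at time t
    discount = 1.0
    for _ in range(horizon):
        d_next = {s2: 0.0 for s2 in S}
        for s in S:
            for a in A:
                weight = d[s] * policy[s][a]
                mu[(s, a)] += discount * weight
                for s2, p in P[(s, a)].items():
                    d_next[s2] += weight * p
        d = d_next
        discount *= GAMMA
    return mu


def balance_residual(mu):
    """Largest violation of sum_a mu(s',a) = alpha(s') + gamma sum_{s,a} P(s'|s,a) mu(s,a)."""
    return max(
        abs(
            sum(mu[(s2, a)] for a in A)
            - ALPHA[s2]
            - GAMMA * sum(P[(s, a)].get(s2, 0.0) * mu[(s, a)] for s in S for a in A)
        )
        for s2 in S
    )


def induced_policy(mu):
    """Normalize mu over actions; defined only where the state mass is positive."""
    out = {}
    for s in S:
        mass = sum(mu[(s, a)] for a in A)
        if mass > 0.0:
            out[s] = {a: mu[(s, a)] / mass for a in A}
    return out


mu = occupation_measure(pi)
objective = sum(R[key] * mu[key] for key in mu)            # linear in mu
recovered = induced_policy(mu)
```

Because the objective and the balance equation are both linear in the measure, optimizing over all measures satisfying the balance equation is a linear program in this finite discounted case; the kernel-pair formulation discussed above is more general.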

4. Non-Deterministic Policy Frameworks

Standard MDP algorithms yield deterministic or randomized policies, prescribing a unique action or a fixed probability distribution over actions in each state. For applications requiring greater adaptivity, such as clinical or assistive decision support, non-deterministic policies map states to sets of actions. The controller or user then selects within this set at each step.

For a non-deterministic policy $\Pi$, the value under worst-case selection is

$$V^{\Pi}(s) = \min_{a \in \Pi(s)} \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\Pi}(s') \right]$$

An $\epsilon$-optimal non-deterministic policy satisfies $V^{\Pi}(s) \ge (1 - \epsilon)\, V^{*}(s)$ for all $s$, ensuring that all action selections remain near-optimal in the worst case (Fard et al., 2014).
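The worst-case value can be computed by the same kind of fixed-point iteration as the optimal value, with a minimum over the allowed set in place of a maximum over all actions. The sketch below assumes a hypothetical two-state, three-action MDP with nonnegative rewards (so that a multiplicative tolerance is meaningful) and a discount factor of 0.9; the helper names and the candidate policy `Pi_example` are invented for illustration.

```python
# Hypothetical MDP with nonnegative rewards; all numbers are illustrative.
S = [0, 1]
A = {0: [0, 1, 2], 1: [0, 1, 2]}
P = {
    (0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.2, 1: 0.8}, (0, 2): {0: 0.5, 1: 0.5},
    (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {1: 1.0},         (1, 2): {0: 0.1, 1: 0.9},
}
R = {(0, 0): 1.0, (0, 1): 0.8, (0, 2): 0.2, (1, 0): 2.0, (1, 1): 1.9, (1, 2): 0.5}
GAMMA = 0.9


def backup(s, a, V):
    """R(s, a) + gamma * sum_s' P(s' | s, a) V(s')."""
    return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items())


def fixed_point(select, allowed, tol=1e-12, max_iter=100_000):
    """Iterate V(s) <- select over a in allowed[s] of backup(s, a, V)."""
    V = {s: 0.0 for s in S}
    for _ in range(max_iter):
        V_new = {s: select(backup(s, a, V) for a in allowed[s]) for s in S}
        converged = max(abs(V_new[s] - V[s]) for s in S) < tol
        V = V_new
        if converged:
            break
    return V


def optimal_value():
    return fixed_point(max, A)          # best action in every state


def worst_case_value(Pi):
    return fixed_point(min, Pi)         # adversarial choice inside Pi(s)


def is_eps_optimal(Pi, eps, slack=1e-9):
    """Check V^Pi(s) >= (1 - eps) V*(s) for every state, up to numerical slack."""
    V_star, V_Pi = optimal_value(), worst_case_value(Pi)
    return all(V_Pi[s] >= (1.0 - eps) * V_star[s] - slack for s in S)


Pi_example = {0: {0, 1}, 1: {0, 1}}     # hypothetical candidate policy
result = is_eps_optimal(Pi_example, eps=0.1)
```

Both the max and the min backups are contractions for a discount factor below one, so the same loop serves for the optimal and the worst-case values.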

Algorithmically, non-deterministic policies maximizing the total number of allowed actions while satisfying $\epsilon$-optimality are computed via:

  • Mixed-integer programming (MIP) over binary action inclusion variables;
  • Monotonic, depth-first recursive search exploiting the property that supersets of infeasible action sets cannot satisfy the bound.
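The monotonicity property used by the search can be sketched on a toy problem. The code below is an illustration in the spirit of the depth-first approach, not the authors' implementation: it assumes a hypothetical two-state, three-action MDP with nonnegative rewards, starts from a greedy optimal deterministic policy, and adds one state-action pair at a time, abandoning a branch as soon as the worst-case bound fails because every superset would fail too.

```python
# Hypothetical MDP with nonnegative rewards; all numbers are illustrative.
S = [0, 1]
A = {0: [0, 1, 2], 1: [0, 1, 2]}
P = {
    (0, 0): {0: 0.9, 1: 0.1}, (0, 1): {0: 0.2, 1: 0.8}, (0, 2): {0: 0.5, 1: 0.5},
    (1, 0): {0: 0.5, 1: 0.5}, (1, 1): {1: 1.0},         (1, 2): {0: 0.1, 1: 0.9},
}
R = {(0, 0): 1.0, (0, 1): 0.8, (0, 2): 0.2, (1, 0): 2.0, (1, 1): 1.9, (1, 2): 0.5}
GAMMA = 0.9


def backup(s, a, V):
    return R[(s, a)] + GAMMA * sum(p * V[s2] for s2, p in P[(s, a)].items())


def solve(select, allowed, tol=1e-12, max_iter=100_000):
    """Fixed point of V(s) = select_{a in allowed[s]} backup(s, a, V)."""
    V = {s: 0.0 for s in S}
    for _ in range(max_iter):
        V_new = {s: select(backup(s, a, V) for a in allowed[s]) for s in S}
        converged = max(abs(V_new[s] - V[s]) for s in S) < tol
        V = V_new
        if converged:
            break
    return V


def size(Pi):
    return sum(len(actions) for actions in Pi.values())


def largest_eps_optimal_policy(eps, slack=1e-9):
    V_star = solve(max, A)

    def feasible(Pi):
        V = solve(min, Pi)              # worst-case value of Pi
        return all(V[s] >= (1.0 - eps) * V_star[s] - slack for s in S)

    # Start from a greedy optimal deterministic policy, which is always feasible.
    current = {s: {max(A[s], key=lambda a: backup(s, a, V_star))} for s in S}
    candidates = [(s, a) for s in S for a in A[s] if a not in current[s]]
    best = {s: set(actions) for s, actions in current.items()}

    def dfs(start):
        nonlocal best
        if size(current) > size(best):
            best = {s: set(actions) for s, actions in current.items()}
        for i in range(start, len(candidates)):
            s, a = candidates[i]
            current[s].add(a)
            if feasible(current):       # otherwise prune: supersets also fail
                dfs(i + 1)
            current[s].remove(a)

    dfs(0)
    return best


Pi_best = largest_eps_optimal_policy(eps=0.05)
```

Enumerating candidates in a fixed order visits each action set at most once; since every subset of a feasible set is itself feasible, pruning never hides a feasible policy.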

Empirical studies show that even with tight optimality tolerances, users receive multiple near-optimal actions per state, enhancing flexibility without significant loss in performance (Fard et al., 2014).

5. Mean-Field and Scaling Limits

In large-scale systems consisting of a population of $N$ objects, each following symmetric transition rules, discrete MDPs can be approximated by deterministic optimal control of an ODE, the mean-field limit. The system state is described by the empirical measure $M^N(t)$.

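As $N \to \infty$, the evolution under properly scaled controls converges (with explicit error bounds) to the solution of the controlled ODE $\dot{m}(t) = f(m(t), a(t))$, optimizing the cost $\int_0^T c(m(t), a(t))\,dt$ (Gast et al., 2010). The value function solves a finite-horizon HJB PDE:

$$\partial_t v(m, t) + \min_{a} \left\{ \nabla_m v(m, t) \cdot f(m, a) + c(m, a) \right\} = 0$$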

Approximate policies for the discrete MDP can be constructed in two ways:

  • By computing the limiting drift and the HJB optimal feedback control, then instantiating it in the finite system;
  • By resetting the ODE initial condition upon each discrete state observation and recomputing the optimal control online.
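The first construction, computing a control on the deterministic limit and then applying it to the finite system, can be sketched with a toy model. Everything below is an assumption made for illustration: a two-state infection model with hypothetical rates `BETA` and `DELTA`, a time step `H`, and a hand-picked open-loop bang-bang schedule that is not obtained from an HJB equation. The code compares a simulated population of independent objects with the Euler recursion of the limiting drift under the same schedule.

```python
import random

# Hypothetical infection rate, recovery rate, and time step (illustrative only).
BETA, DELTA, H = 2.0, 0.5, 0.1
STEPS, SWITCH = 50, 25


def control(k):
    """Hypothetical open-loop bang-bang schedule: no intervention, then full."""
    return 0.0 if k < SWITCH else 1.0


def drift(m, a):
    """Limiting drift f(m, a) for the infected fraction m under control a."""
    return BETA * (1.0 - a) * m * (1.0 - m) - DELTA * m


def mean_field_path(m0):
    """Euler recursion m_{k+1} = m_k + H * f(m_k, a_k) for the deterministic limit."""
    path = [m0]
    for k in range(STEPS):
        path.append(path[-1] + H * drift(path[-1], control(k)))
    return path


def simulate(n_objects, m0, seed=0):
    """Finite system: each object changes state independently in every step."""
    rng = random.Random(seed)
    infected = round(m0 * n_objects)
    path = [infected / n_objects]
    for k in range(STEPS):
        m, a = infected / n_objects, control(k)
        p_infect = H * BETA * (1.0 - a) * m     # per susceptible object
        p_recover = H * DELTA                    # per infected object
        new_infected = sum(rng.random() < p_infect for _ in range(n_objects - infected))
        new_recovered = sum(rng.random() < p_recover for _ in range(infected))
        infected += new_infected - new_recovered
        path.append(infected / n_objects)
    return path


def max_gap(n_objects, m0=0.2):
    """Largest distance between the empirical fraction and the deterministic path."""
    limit_path = mean_field_path(m0)
    finite_path = simulate(n_objects, m0)
    return max(abs(x - y) for x, y in zip(limit_path, finite_path))


gap_large = max_gap(20_000)
```

The expected one-step change of the simulated fraction equals `H * drift(m, a)`, so the gap is driven by sampling noise and is expected to shrink as the population grows; the second construction would instead restart `mean_field_path` from each observed fraction.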

Numerical experiments confirm that mean-field derived policies are asymptotically optimal and, even for moderate $N$, yield near-optimal performance (Gast et al., 2010).

6. Empirical Applications and Illustrative Examples

Empirical evaluations underscore the practical implications of discrete MDP frameworks:

  • In medical sequential treatment planning (MDP with 19 actions), non-deterministic policies deliver actionable near-optimal sets even for tight optimality margins, supporting individualized care under efficacy and side-effect constraints (Fard et al., 2014).
  • Web navigation experiments with human subjects demonstrate that providing users with hints based on non-deterministic policies leads to significant improvements in task completion speed and success rates compared to deterministic or unguided strategies.
  • Population-level epidemic models with discrete agents show that applying the mean-field bang-bang control policy in the full MDP achieves system-level objectives close to the discrete optimum, substantially outperforming constant-parameter heuristics (Gast et al., 2010).
  • Constraint-handling examples highlight that classical occupation-measure-based LP approaches can admit spurious (“phantom”) infinite-reward solutions if signed costs are present, while the convex-analytic kernel formulation avoids these pathologies and supports broader cost structures (Dufour et al., 2019).

7. Research Directions and Applications

The discrete MDP formalism underpins a vast array of theoretical and practical research:

  • Convex-analytic approaches expand model expressiveness, accommodate weaker regularity conditions, and enable solution equivalence between primal/dual formulations.
  • Non-deterministic policies provide rigorous worst-case guarantees for flexible, human-in-the-loop, or robust planning environments.
  • Mean-field scaling offers computationally tractable approximations for extremely large MDP instances, linking discrete control with continuum optimal control and PDE methods.

Active domains of application include medical decision support, autonomous and semi-autonomous systems, financial decision systems with embedded human oversight, and large-scale resource allocation (Gast et al., 2010; Fard et al., 2014; Dufour et al., 2019).


References:

  • "A convex programming approach for discrete-time Markov decision processes under the expected total reward criterion" (Dufour et al., 2019)
  • "Non-Deterministic Policies in Markovian Decision Processes" (Fard et al., 2014)
  • "Mean field for Markov Decision Processes: from Discrete to Continuous Optimization" (Gast et al., 2010)
