Discrete Markov Decision Process
- A discrete MDP is a formal framework for sequential decision-making under uncertainty with countable state and action spaces.
- Its solution is a policy that maximizes cumulative, possibly discounted, reward over a potentially infinite horizon when transitions are stochastic.
- Recent advances include convex programming formulations, non-deterministic policy strategies, and mean-field approximations for large-scale systems.
A discrete Markov decision process (MDP) is a mathematical formalism for modeling sequential decision-making under uncertainty, where a controller observes the current state of a dynamic system, selects an admissible action, and the system evolves in a stochastic manner according to transition probabilities. The solution of an MDP prescribes a policy that optimizes a reward criterion over a potentially infinite time horizon, often subject to constraints. Discrete MDPs structure the state and action spaces as countable sets, though more general formulations allow for Borel subsets of metric spaces, with a stochastic kernel governing transitions. Recent advances address convex-analytic formulations, non-deterministic policies for human-in-the-loop contexts, and convergence to continuous optimization via mean-field approximations.
1. Formal Definition and Structure
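A discrete MDP is commonly defined as a 5-tuple $(S, A, P, r, \gamma)$, where:
- $S$: finite or countable state space;
- $A$: finite or countable action space, with $A(s) \subseteq A$ admissible in state $s$;
- $P(s' \mid s, a)$: transition kernel specifying the probability of entering state $s'$ given current state $s$ and action $a$;
- $r(s, a)$: expected immediate reward from taking action $a$ in state $s$;
- $\gamma \in (0, 1]$: discount factor for future rewards, with $\gamma = 1$ the undiscounted case.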
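At each stage $t = 0, 1, 2, \dots$ the controller selects action $a_t \in A(s_t)$, the state transitions to $s_{t+1} \sim P(\cdot \mid s_t, a_t)$, and an immediate reward $r(s_t, a_t)$ is accrued. The objective is to maximize an aggregate reward criterion, such as the expected total reward $\mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\right]$ under policy $\pi$ (Fard et al., 2014).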
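The definition above can be made concrete with a minimal sketch of value iteration on a finite discounted MDP. The two states, two actions, transition probabilities, rewards, and the discount factor 0.9 are all hypothetical values chosen for illustration; none of them come from the cited papers.

```python
# Hypothetical two-state, two-action MDP; every number is an illustrative assumption.
STATES = ["low", "high"]
ACTIONS = ["wait", "work"]
GAMMA = 0.9

# P[s][a] maps next state -> probability; R[s][a] is the expected immediate reward.
P = {
    "low":  {"wait": {"low": 1.0},              "work": {"low": 0.4, "high": 0.6}},
    "high": {"wait": {"low": 0.3, "high": 0.7}, "work": {"high": 1.0}},
}
R = {
    "low":  {"wait": 0.0, "work": -1.0},
    "high": {"wait": 2.0, "work": 1.0},
}


def q_value(V, s, a):
    """One-step lookahead: r(s, a) + gamma * sum_s' P(s' | s, a) * V(s')."""
    return R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a].items())


def value_iteration(tol=1e-10, max_iter=10_000):
    """Repeat the Bellman optimality update until successive iterates agree."""
    V = {s: 0.0 for s in STATES}
    for _ in range(max_iter):
        V_new = {s: max(q_value(V, s, a) for a in ACTIONS) for s in STATES}
        delta = max(abs(V_new[s] - V[s]) for s in STATES)
        V = V_new
        if delta < tol:
            break
    # A deterministic policy that is greedy with respect to the final values.
    policy = {s: max(ACTIONS, key=lambda a: q_value(V, s, a)) for s in STATES}
    return V, policy


V_star, pi_star = value_iteration()
print(V_star)
print(pi_star)
```

Because the Bellman update is a contraction for a discount factor below 1, the loop converges from any starting values; the greedy policy read off at the end is deterministic and stationary.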
Generalizing beyond finite spaces, $(X, A, Q, r)$ denotes the system where $X$, $A$ are Borel subsets of complete, separable metric spaces, $Q$ is the transition kernel on $X$ given $X \times A$, and $r$ is a possibly signed reward. Admissible actions $A(x)$ are specified per state. Policies can be general (history-dependent) or stationary randomized, the latter mapping the current state to a randomized action distribution (Dufour et al., 2019).
2. Optimality Criteria and Policy Classes
The most common performance criterion is the (discounted or undiscounted) expected total reward (ETR):
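$V(\pi) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(s_t, a_t)\right].$
Constraints on cumulative costs, $\mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \gamma^{t}\, c_i(s_t, a_t)\right] \le d_i$, may also be imposed, leading to constrained MDPs (Dufour et al., 2019).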
Policy spaces admit various subclasses:
- Deterministic policies: mapping states to a single action.
- Stationary randomized policies: at each state $s$, select an action according to a fixed probability kernel $\pi(\cdot \mid s)$, independent of history.
- Non-deterministic policies: mapping each state $s$ to a nonempty subset of admissible actions $\Pi(s) \subseteq A(s)$, allowing for agent or human selection among allowed actions at execution time (Fard et al., 2014).
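The three policy classes differ mainly in what they return for a state. The following sketch represents each as a plain Python mapping and checks it against the admissible action sets. The state names, action names, and helper functions are hypothetical and serve only to illustrate the distinction.

```python
import random

# Hypothetical admissible action sets A(s); names are illustrative assumptions.
ADMISSIBLE = {"s0": {"a", "b", "c"}, "s1": {"a", "b"}}

deterministic = {"s0": "a", "s1": "b"}                         # state -> one action
randomized = {"s0": {"a": 0.5, "b": 0.5}, "s1": {"b": 1.0}}    # state -> distribution
non_deterministic = {"s0": {"a", "b"}, "s1": {"b"}}            # state -> set of actions


def is_valid_deterministic(pi):
    return all(pi[s] in ADMISSIBLE[s] for s in ADMISSIBLE)


def is_valid_randomized(pi, tol=1e-12):
    return all(
        set(pi[s]) <= ADMISSIBLE[s]
        and all(p >= 0.0 for p in pi[s].values())
        and abs(sum(pi[s].values()) - 1.0) <= tol
        for s in ADMISSIBLE
    )


def is_valid_non_deterministic(pi):
    # Each state must map to a nonempty subset of its admissible actions.
    return all(len(pi[s]) > 0 and pi[s] <= ADMISSIBLE[s] for s in ADMISSIBLE)


def sample_action(pi, s, rng):
    """Draw one action from the stationary randomized policy at state s."""
    actions = list(pi[s])
    weights = [pi[s][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def is_consistent(det_pi, nd_pi):
    """True if the deterministic policy always picks an action the set allows."""
    return all(det_pi[s] in nd_pi[s] for s in nd_pi)


rng = random.Random(0)
print(sample_action(randomized, "s0", rng))
print(is_consistent(deterministic, non_deterministic))
```

A non-deterministic policy leaves the final choice to whoever executes it, so any deterministic policy that is consistent with it in the sense above is one possible realization.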
A foundational result (Schäl's theorem) asserts that, under continuity-compactness and finiteness conditions, the supremum of the expected total reward $V(\pi)$ over all admissible randomized policies equals that over stationary randomized policies for upper-semicontinuous reward/cost functions (Dufour et al., 2019).
3. Convex-Analytic and Occupation Measure Formulations
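Convex programming provides a powerful framework for discrete-time MDPs under ETR, extending to Borel state/action spaces, signed rewards/costs, and multiple constraints (Dufour et al., 2019). The core construct is the occupation measure $\mu_\pi$ of a policy $\pi$, defined for measurable sets $\Gamma$ of state-action pairs by
$\mu_\pi(\Gamma) = \mathbb{E}^{\pi}\left[\sum_{t=0}^{\infty} \mathbf{1}_{\Gamma}(x_t, a_t)\right],$
which satisfies a balance (characteristic) equation.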
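Introducing pairs of nonnegative kernels allows representation of measures suitable for signed costs. Writing $\mu$ for the resulting decision variable, the convex program (CP) is:
- maximize $\int r \, d\mu$
- subject to $\int c_i \, d\mu \le d_i$ for each constraint $i$ (cost $c_i$, bound $d_i$),
- $\mu \in \mathcal{C}$, where $\mathcal{C}$ encodes the balance equation and sign conditions.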
Key results establish equivalence of the optimal constrained control value and the convex program optimum. Any optimizer $\mu^*$ of CP induces an optimal stationary randomized policy $\pi^*$ through the disintegration
$\mu^*(dx, da) = \pi^*(da \mid x)\, \mu^*(dx \times A)$
on the support set. The kernel-pair (CP) formulation generalizes previous LP approaches, supports signed costs, and weakens regularity/absolute continuity requirements (Dufour et al., 2019).
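In the simplest setting the occupation-measure ideas can be checked numerically. The sketch below assumes a finite MDP with a discount factor below 1, which is a simplification of the Borel-space, total-reward setting of Dufour et al. (2019). It computes the discounted occupation measure of a fixed stationary randomized policy, verifies the balance equation, and recovers the policy by normalizing over actions. All numbers and function names are hypothetical.

```python
# Hypothetical finite discounted MDP; all numbers are illustrative assumptions.
GAMMA = 0.9
STATES = [0, 1]
ACTIONS = [0, 1]
P = [[[0.8, 0.2], [0.1, 0.9]],   # P[s][a][s2]: transition probabilities
     [[0.5, 0.5], [0.0, 1.0]]]
R = [[1.0, 0.0], [0.0, 2.0]]     # R[s][a]: expected immediate reward
NU = [1.0, 0.0]                  # initial state distribution
PI = [[0.3, 0.7], [0.6, 0.4]]    # PI[s][a]: stationary randomized policy


def occupation_measure(pi, horizon=2000):
    """mu[s][a] = sum_t gamma^t * Prob(s_t = s, a_t = a), truncated at `horizon`."""
    mu = [[0.0 for _ in ACTIONS] for _ in STATES]
    d = list(NU)  # d[s] = gamma^t * Prob(s_t = s) at the current step t
    for _ in range(horizon):
        for s in STATES:
            for a in ACTIONS:
                mu[s][a] += d[s] * pi[s][a]
        d = [GAMMA * sum(d[s] * pi[s][a] * P[s][a][s2]
                         for s in STATES for a in ACTIONS)
             for s2 in STATES]
    return mu


def balance_residual(mu):
    """Largest violation of sum_a mu(s2, a) = nu(s2) + gamma * sum_{s,a} P(s2|s,a) mu(s,a)."""
    worst = 0.0
    for s2 in STATES:
        lhs = sum(mu[s2])
        rhs = NU[s2] + GAMMA * sum(P[s][a][s2] * mu[s][a]
                                   for s in STATES for a in ACTIONS)
        worst = max(worst, abs(lhs - rhs))
    return worst


def induced_policy(mu):
    """Recover pi(a | s) = mu(s, a) / sum_a' mu(s, a') on states with positive mass."""
    return [[mu[s][a] / sum(mu[s]) for a in ACTIONS] for s in STATES]


def expected_reward(mu):
    """The objective is linear in mu: sum_{s,a} r(s, a) * mu(s, a)."""
    return sum(R[s][a] * mu[s][a] for s in STATES for a in ACTIONS)


mu = occupation_measure(PI)
print(balance_residual(mu))
print(induced_policy(mu))
print(expected_reward(mu))
```

The objective and the balance equation are both linear in the occupation measure, which is what makes linear and convex programming formulations possible in the first place.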
4. Non-Deterministic Policy Frameworks
Standard MDP algorithms yield deterministic or randomized policies, which prescribe a unique action or a fixed probability distribution over actions in each state. For applications requiring greater adaptivity, such as clinical or assistive decision support, non-deterministic policies instead map states to sets of actions. The controller or user then selects within this set at each step.
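For a non-deterministic policy $\Pi$, the value under worst-case selection is
$V^{\Pi}_{\min}(s) = \min_{a \in \Pi(s)} \left[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^{\Pi}_{\min}(s') \right].$
An $\epsilon$-optimal non-deterministic policy satisfies $V^{\Pi}_{\min}(s) \ge (1 - \epsilon)\, V^{*}(s)$ for every state $s$, where $V^{*}$ is the optimal value function, ensuring that all action selections remain near-optimal in the worst case (Fard et al., 2014).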
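The worst-case value can be computed with the same fixed-point iteration as ordinary value iteration, replacing the maximum over all actions with a minimum over the allowed set. The sketch below does this on a hypothetical two-state MDP with nonnegative optimal values, so the multiplicative tolerance is meaningful. The MDP, the policy `PI`, and the helper names are illustrative assumptions.

```python
# Hypothetical two-state MDP; every number is an illustrative assumption.
STATES = ["low", "high"]
ACTIONS = ["wait", "work"]
GAMMA = 0.9
P = {
    "low":  {"wait": {"low": 1.0},              "work": {"low": 0.4, "high": 0.6}},
    "high": {"wait": {"low": 0.3, "high": 0.7}, "work": {"high": 1.0}},
}
R = {
    "low":  {"wait": 0.0, "work": -1.0},
    "high": {"wait": 2.0, "work": 1.0},
}


def q_value(V, s, a):
    """One-step lookahead: r(s, a) + gamma * sum_s' P(s' | s, a) * V(s')."""
    return R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a].items())


def solve(action_sets, pick, tol=1e-12, max_iter=100_000):
    """Fixed point of V(s) = pick over a in action_sets[s] of Q(s, a); pick is max or min."""
    V = {s: 0.0 for s in STATES}
    for _ in range(max_iter):
        V_new = {s: pick(q_value(V, s, a) for a in action_sets[s]) for s in STATES}
        delta = max(abs(V_new[s] - V[s]) for s in STATES)
        V = V_new
        if delta < tol:
            break
    return V


def is_eps_optimal(policy_sets, eps):
    """Check V_min(s) >= (1 - eps) * V*(s) in every state."""
    v_star = solve({s: ACTIONS for s in STATES}, max)
    v_worst = solve(policy_sets, min)
    return all(v_worst[s] >= (1.0 - eps) * v_star[s] for s in STATES)


# A non-deterministic policy: one allowed action in "low", two in "high".
PI = {"low": {"work"}, "high": {"wait", "work"}}
print(solve(PI, min))
print(is_eps_optimal(PI, 0.15), is_eps_optimal(PI, 0.05))
```

In this toy instance the same policy passes the looser tolerance and fails the tighter one, which illustrates the trade-off between the number of allowed actions and the guaranteed performance.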
Algorithmically, non-deterministic policies maximizing the total number of allowed actions while satisfying $\epsilon$-optimality are computed via:
- Mixed-integer programming (MIP) over binary action inclusion variables;
- Monotonic, depth-first recursive search exploiting the property that supersets of infeasible action sets cannot satisfy the bound.
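The pruning idea behind the recursive search can be sketched on a toy problem. The code below is not the algorithm of Fard et al. (2014); it is a simplified exhaustive depth-first search, assuming a hypothetical two-state MDP with nonnegative optimal values. It grows action sets one action at a time and stops expanding a branch as soon as the worst-case bound fails, since adding actions can only lower the worst-case value.

```python
from itertools import product

# Hypothetical two-state MDP; every number is an illustrative assumption.
STATES = ["low", "high"]
ACTIONS = ["wait", "work"]
GAMMA = 0.9
P = {
    "low":  {"wait": {"low": 1.0},              "work": {"low": 0.4, "high": 0.6}},
    "high": {"wait": {"low": 0.3, "high": 0.7}, "work": {"high": 1.0}},
}
R = {
    "low":  {"wait": 0.0, "work": -1.0},
    "high": {"wait": 2.0, "work": 1.0},
}


def q_value(V, s, a):
    return R[s][a] + GAMMA * sum(p * V[s2] for s2, p in P[s][a].items())


def solve(action_sets, pick, tol=1e-12, max_iter=100_000):
    """Fixed point of V(s) = pick over a in action_sets[s] of Q(s, a)."""
    V = {s: 0.0 for s in STATES}
    for _ in range(max_iter):
        V_new = {s: pick(q_value(V, s, a) for a in action_sets[s]) for s in STATES}
        delta = max(abs(V_new[s] - V[s]) for s in STATES)
        V = V_new
        if delta < tol:
            break
    return V


def size(policy_sets):
    return sum(len(policy_sets[s]) for s in STATES)


def largest_eps_optimal(eps):
    """Depth-first search for an eps-optimal policy with the most allowed actions."""
    v_star = solve({s: frozenset(ACTIONS) for s in STATES}, max)
    best = None
    seen = set()

    def feasible(policy_sets):
        v_worst = solve(policy_sets, min)
        return all(v_worst[s] >= (1.0 - eps) * v_star[s] for s in STATES)

    def dfs(policy_sets):
        nonlocal best
        if best is None or size(policy_sets) > size(best):
            best = policy_sets
        for s in STATES:
            for a in ACTIONS:
                if a in policy_sets[s]:
                    continue
                child = dict(policy_sets)
                child[s] = policy_sets[s] | {a}
                key = tuple(child[t] for t in STATES)
                if key in seen:
                    continue
                seen.add(key)
                # Prune infeasible children: every superset is infeasible too.
                if feasible(child):
                    dfs(child)

    # Roots are all deterministic policies (one action per state).
    for choice in product(ACTIONS, repeat=len(STATES)):
        root = {s: frozenset([a]) for s, a in zip(STATES, choice)}
        key = tuple(root[t] for t in STATES)
        if key not in seen:
            seen.add(key)
            if feasible(root):
                dfs(root)
    return best


print(largest_eps_optimal(0.15))
print(largest_eps_optimal(0.05))
```

Starting from every feasible deterministic policy keeps the search exact on small instances, because every feasible non-deterministic policy contains a feasible deterministic one. On larger problems the number of candidate sets grows exponentially, which is why the pruning rule and the MIP formulation matter.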
Empirical studies show that even with tight optimality tolerances, users receive multiple near-optimal actions per state, enhancing flexibility without significant loss in performance (Fard et al., 2014).
5. Mean-Field and Scaling Limits
In large-scale systems consisting of a population of $N$ objects, each following symmetric transition rules, discrete MDPs can be approximated by deterministic optimal control of an ODE, the mean-field limit. The system state is described by the empirical measure $M^N(t)$, the vector of fractions of objects in each individual state.
As $N \to \infty$, the evolution under properly scaled controls converges (with explicit error bounds) to the solution of the controlled ODE $\dot{m}(t) = f(m(t), a(t))$, optimizing the reward $\int_0^T r(m(t), a(t))\,dt$ (Gast et al., 2010). The value function $v(m, t)$ solves a finite-horizon HJB PDE:
$\partial_t v(m, t) + \max_{a} \left\{ \nabla_m v(m, t) \cdot f(m, a) + r(m, a) \right\} = 0.$
Approximately optimal policies for the discrete MDP can be constructed in two ways:
- By computing the limiting drift and the HJB optimal feedback control, then instantiating it in the finite system;
- By resetting the ODE initial condition upon each discrete state observation and recomputing the optimal control online.
Numerical experiments confirm that mean-field derived policies are asymptotically optimal and, even for moderate $N$, yield near-optimal performance (Gast et al., 2010).
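The convergence of the empirical measure to the ODE solution can be illustrated with a small simulation. The sketch below assumes a hypothetical two-state epidemic model (susceptible or infected) with a constant control `a` that acts as extra recovery effort. The rates, the step size, and the model itself are illustrative assumptions and are not taken from Gast et al. (2010). It compares the infected fraction among `n` simulated objects with the Euler-discretized mean-field ODE.

```python
import random

BETA, DELTA = 2.0, 1.0   # hypothetical infection and recovery rates
H = 0.05                 # time step of the discrete system
STEPS = 200              # number of steps, so the horizon is STEPS * H


def drift(m, a):
    """Mean-field drift f(m, a) of the infected fraction m under control a."""
    return BETA * m * (1.0 - m) - (DELTA + a) * m


def ode_path(m0, a):
    """Euler scheme for dm/dt = f(m, a) with step H."""
    m, path = m0, [m0]
    for _ in range(STEPS):
        m += H * drift(m, a)
        path.append(m)
    return path


def simulate(n, m0, a, rng):
    """n objects: each susceptible is infected w.p. H*BETA*m, each infected recovers w.p. H*(DELTA+a)."""
    infected = round(n * m0)
    path = [infected / n]
    for _ in range(STEPS):
        m = infected / n
        p_inf = H * BETA * m
        p_rec = H * (DELTA + a)
        new_inf = sum(1 for _ in range(n - infected) if rng.random() < p_inf)
        new_rec = sum(1 for _ in range(infected) if rng.random() < p_rec)
        infected += new_inf - new_rec
        path.append(infected / n)
    return path


def max_gap(n, m0=0.1, a=0.0, seed=0):
    """Largest distance between the simulated fraction and the ODE path."""
    rng = random.Random(seed)
    ode = ode_path(m0, a)
    sim = simulate(n, m0, a, rng)
    return max(abs(x - y) for x, y in zip(ode, sim))


for n in (50, 500, 5000):
    print(n, round(max_gap(n), 4))
```

The expected one-step change of the infected fraction equals `H * drift(m, a)`, so the simulated path fluctuates around the ODE path, and the fluctuations typically shrink as `n` grows. A control policy designed on the ODE can then be applied to the finite system by evaluating it at the observed fraction.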
6. Empirical Applications and Illustrative Examples
Empirical evaluations underscore the practical implications of discrete MDP frameworks:
- In medical sequential treatment planning (MDP with 19 actions), non-deterministic policies deliver actionable near-optimal sets even for tight optimality margins, supporting individualized care under efficacy and side-effect constraints (Fard et al., 2014).
- Web navigation experiments with human subjects demonstrate that providing users with hints based on non-deterministic policies leads to significant improvements in task completion speed and success rates compared to deterministic or unguided strategies.
- Population-level epidemic models with discrete agents show that applying the mean-field bang-bang control policy in the full MDP achieves system-level objectives within a small margin of the discrete optimum, substantially outperforming constant-parameter heuristics (Gast et al., 2010).
- Constraint-handling examples highlight that classical occupation-measure-based LP approaches can admit spurious (“phantom”) infinite-reward solutions if signed costs are present, while the convex-analytic kernel formulation avoids these pathologies and supports broader cost structures (Dufour et al., 2019).
7. Research Directions and Applications
The discrete MDP formalism underpins a vast array of theoretical and practical research:
- Convex-analytic approaches expand model expressiveness, accommodate weaker regularity conditions, and enable solution equivalence between primal/dual formulations.
- Non-deterministic policies provide rigorous worst-case guarantees for flexible, human-in-the-loop, or robust planning environments.
- Mean-field scaling offers computationally tractable approximations for extremely large MDP instances, linking discrete control with continuum optimal control and PDE methods.
Active domains of application include medical decision support, autonomous and semi-autonomous systems, financial decision systems with embedded human oversight, and large-scale resource allocation (Gast et al., 2010; Fard et al., 2014; Dufour et al., 2019).
References:
- "A convex programming approach for discrete-time Markov decision processes under the expected total reward criterion" (Dufour et al., 2019)
- "Non-Deterministic Policies in Markovian Decision Processes" (Fard et al., 2014)
- "Mean field for Markov Decision Processes: from Discrete to Continuous Optimization" (Gast et al., 2010)