
Constrained Markov Decision Processes

Updated 27 October 2025
  • Constrained MDPs are sequential decision processes that optimize discounted rewards while enforcing limits on cumulative costs or risks.
  • They incorporate Lyapunov-type conditions to prevent explosion and ensure finiteness in systems with unbounded transition rates and cost functions.
  • The formulation utilizes occupation measures to recast the dynamic control problem as a convex linear program for efficient computational solutions.

A constrained Markov decision process (MDP) is a sequential decision-making framework in which an agent seeks to optimize an expected performance criterion (such as discounted reward or long-run average reward) subject to additional constraints expressed as bounds on expected cost or risk metrics accrued over controlled stochastic trajectories. In contrast to standard, unconstrained MDPs, constrained MDPs address practical requirements in resource allocation, safety, fairness, or risk, and are foundational for advanced models in operations research, engineering, economics, and machine learning.

1. Formal Definition and Mathematical Framework

A constrained (continuous-time) MDP is typically specified by a tuple

$$\{\, S,\ \{A(x) \subseteq A,\ x \in S\},\ q(\cdot\,|\,x,a),\ r(x,a),\ (c_n(x,a), d_n)_{1 \leq n \leq N} \,\}$$

where:

  • $S$ is the state space (a Polish space: complete, separable, metric).
  • $A$ is the global action space; the admissible actions at state $x$ form a set $A(x) \subseteq A$.
  • $q(\cdot\,|\,x,a)$ is the Borel-measurable transition rate kernel; for each $(x,a)$, $q(\cdot\,|\,x,a)$ is a signed measure on $S$ satisfying $q(S\,|\,x,a) = 0$ and $q(D\,|\,x,a) \geq 0$ whenever $x \notin D$, with the local boundedness property
$$q^*(x) := \sup_{a \in A(x)} |q(\{x\}\,|\,x,a)| < \infty \quad \text{for all } x \in S.$$
  • $r(x,a)$ is the (possibly unbounded) real-valued reward function.
  • $c_n(x,a)$, $n = 1, \ldots, N$, are real-valued, possibly unbounded cost functions, and $d_n$ are the corresponding upper bounds on expected discounted costs.
  • $\alpha > 0$ is the discount rate, and $\gamma$ is the initial distribution on $S$.

The objective is to select a policy $\pi$ (possibly history-dependent and randomized) maximizing the expected discounted reward

$$V_r(\pi) = \int_S V_\alpha(x, \pi, r)\, \gamma(dx),$$

with

$$V_\alpha(x, \pi, u) = \int_0^\infty e^{-\alpha t}\, \mathbb{E}_x^\pi\bigl[u(\xi_{t^-}, a)\bigr]\, dt,$$

while satisfying the discounted cost constraints

$$V_\alpha(\pi, c_n) = \int_S V_\alpha(x, \pi, c_n)\, \gamma(dx) \leq d_n, \quad n = 1, \ldots, N.$$

The admissible class of policies is broad: arbitrary measurable, potentially history-dependent, randomized mappings from observed paths to actions.
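
As a concrete instantiation of the tuple above (an illustrative example of ours, not one drawn from the paper), consider admission control of a single-server queue: the state is the queue length, the action is the admission rate, the reward trades throughput against holding cost, and a single constraint ($N = 1$) caps expected discounted congestion:

$$S = \{0, 1, 2, \ldots\}, \qquad A(x) \equiv [0, \lambda_{\max}],$$

$$q(\{x+1\}\,|\,x,a) = a, \qquad q(\{x-1\}\,|\,x,a) = \mu\, \mathbb{I}_{\{x > 0\}}, \qquad q(\{x\}\,|\,x,a) = -a - \mu\, \mathbb{I}_{\{x > 0\}},$$

$$r(x,a) = R\, a - h\, x, \qquad c_1(x,a) = x,$$

where the service rate $\mu$, reward and holding parameters $R, h$, admission cap $\lambda_{\max}$, and budget $d_1$ are modeling choices. Note that $r$ and $c_1$ are unbounded in $x$, which is precisely the regime the framework accommodates.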

2. Nonexplosion and Finiteness: Model Well-posedness

In continuous-time settings with unbounded transition rates and costs, model well-posedness, i.e., avoidance of process "explosion" (infinitely many jumps in finite time) and finiteness of expected rewards and costs, cannot be taken for granted. Sufficient conditions are provided via a Lyapunov-type inequality ("Assumption A"): there exist a continuous weight function $w \geq 1$, constants $\rho, b \geq 0$, and an increasing sequence of measurable subsets $S_k$ covering $S$ such that, for all $(x,a)$,

$$\int_S w(y)\, q(dy\,|\,x,a) \leq \rho\, w(x) + b.$$

Additionally, $\inf_{x \notin S_k} w(x) \to +\infty$ as $k \to \infty$, which ensures drift toward compact subsets and precludes explosion. These conditions guarantee that under any admissible policy the process $(\xi_t)$ satisfies $T_\infty = \infty$ almost surely, where $T_\infty$ denotes the limit of the jump times (so there is no explosion in finite time), while the occupation measure remains finite.
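
The drift condition is straightforward to verify in concrete models. Below is a minimal numeric spot-check (our own sketch, not code from the paper) for a variant of the queueing example of Section 1 with state-proportional departure rate $\mu x$, so the transition rates are unbounded; taking $w(x) = x + 1$ and $S_k = \{0, \ldots, k\}$ gives $\int_S w(y)\, q(dy\,|\,x,a) = a - \mu x \leq \rho\, w(x) + b$ with $\rho = 0$ and $b = \lambda_{\max}$, while $\inf_{x \notin S_k} w(x) = k + 2 \to \infty$.

```python
# Numeric spot-check (illustrative sketch, not from the paper) of the
# Lyapunov drift condition for a controlled birth-death process with
# birth rate a and *unbounded* death rate mu * x.
import numpy as np

mu, lam_max = 0.5, 2.0                # assumed service and admission-cap rates
w = lambda x: x + 1.0                 # continuous weight function, w >= 1
rho, b = 0.0, lam_max                 # candidate drift constants

def drift(x, a):
    """Compute sum_y w(y) * q({y} | x, a) for the birth-death kernel."""
    out = a * (w(x + 1) - w(x))            # admission (birth) term
    if x > 0:
        out += mu * x * (w(x - 1) - w(x))  # departure (death) term
    return out

ok = all(drift(x, a) <= rho * w(x) + b + 1e-9
         for x in range(2000)
         for a in np.linspace(0.0, lam_max, 11))
print("Drift condition holds on the test grid:", ok)  # expect: True
```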

3. Occupation Measures and Problem Reduction

A pivotal feature is the reduction of the dynamic, constrained control problem to a static optimization over the space of occupation measures. For any policy $\pi$, the occupation measure $\eta^\pi$ on $S \times A$ is defined by

$$\eta^\pi(D \times \Gamma) = \alpha \int_0^\infty e^{-\alpha t}\, \mathbb{E}_\gamma^\pi\bigl[\mathbb{I}_{\{\xi_t \in D\}}\, \pi(\Gamma\,|\,e, t)\bigr]\, dt$$

for measurable $D \subseteq S$ and $\Gamma \subseteq A$. This measure quantifies the discounted frequency with which state-action pairs are encountered. Its state marginal $\hat{\eta}^\pi(\cdot) = \eta^\pi(\cdot \times A)$ satisfies a generalized balance equation,

$$\alpha\, \hat{\eta}^\pi(D) = \alpha\, \gamma(D) + \int_{S \times A} q(D\,|\,x,a)\, \eta^\pi(dx, da),$$

analogous to the global balance equations of continuous-time jump processes. The original constrained MDP, formulated over trajectories and histories, is thereby recast equivalently as an optimization over a convex subset of probability measures.
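
Because $\eta^\pi$ is defined by a discounted time integral, it can be estimated directly by simulation. The sketch below (a toy two-state model with invented rates, not an example from the paper) estimates $\eta^\pi$ under a stationary randomized policy and confirms that its total mass is approximately one, as the definition implies:

```python
# Monte Carlo estimate (illustrative sketch) of the discounted occupation
# measure eta^pi for a two-state chain that always flips state on a jump.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1                                 # discount rate
rate = np.array([[1.0, 3.0],                # rate(x, a): jump rate out of x
                 [0.5, 2.0]])
pi = np.array([[0.5, 0.5],                  # pi(a | x): stationary policy
               [1.0, 0.0]])

def one_path(T=200.0):
    """Discounted occupation mass accumulated along one trajectory."""
    eta = np.zeros((2, 2))
    t, x = 0.0, 0                           # gamma = delta_0: start in state 0
    while t < T:
        lam = pi[x] @ rate[x]               # policy-mixed jump rate in state x
        dt = rng.exponential(1.0 / lam)     # exponential sojourn time
        # alpha * int_t^{t+dt} e^{-alpha s} ds, attributed to actions via pi
        eta[x] += (np.exp(-alpha * t) - np.exp(-alpha * (t + dt))) * pi[x]
        t, x = t + dt, 1 - x                # jump to the other state
    return eta

eta = sum(one_path() for _ in range(2000)) / 2000
print(eta.round(3), "| total mass:", eta.sum().round(3))  # mass close to 1
```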

4. Weighted Weak Topology and Existence of Solutions

For unbounded functions $r$ or $c_n$, the set of feasible occupation measures may fail to be compact in the standard weak topology. The paper therefore introduces $\bar{w}$-weak convergence: a sequence $\{\eta_k\}$ converges $\bar{w}$-weakly to $\eta$ if, for every continuous function $u$ satisfying $|u(x,a)| \leq L\, \bar{w}(x)$ for some constant $L$,

$$\lim_{k \to \infty} \int u(x, a)\, \eta_k(dx, da) = \int u(x, a)\, \eta(dx, da).$$

This topology, strictly stronger than standard weak convergence, provides the relative compactness needed to establish the existence of an optimal constrained policy under mild regularity and growth conditions.
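
A simple example (ours, under the definition above) shows what the refinement buys: on $S = \mathbb{R}$ with $\bar{w}(x) = 1 + |x|$, the measures $\eta_k = (1 - 1/k)\, \delta_0 + (1/k)\, \delta_k$ converge weakly to $\delta_0$, yet for the admissible test function $u(x) = x$ (which satisfies $|u| \leq \bar{w}$),

$$\int u\, d\eta_k = \frac{1}{k} \cdot k = 1 \not\to 0 = \int u\, d\delta_0,$$

so $\{\eta_k\}$ fails to converge $\bar{w}$-weakly. Mass escaping to infinity is invisible to the standard weak topology but not to the $\bar{w}$-weak topology, which is exactly why the latter is the right setting for unbounded rewards and costs.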

5. Linear Programming Reformulation and Computational Implications

The occupation measure reduction allows an explicit linear programming formulation. With $c_0 := -r$, the constrained optimization problem becomes:

$$\begin{aligned} \text{minimize} \quad & \frac{1}{\alpha} \int_{S \times A} c_0(x,a)\, \eta(dx, da) \\ \text{subject to} \quad & \int_{S \times A} c_n(x,a)\, \eta(dx, da) \leq \alpha\, d_n, \quad n = 1, \ldots, N, \\ & \alpha\, \hat{\eta}(D) = \alpha\, \gamma(D) + \int_{S \times A} q(D\,|\,x,a)\, \eta(dx, da) \quad \text{for all "small" } D. \end{aligned}$$

This linear structure is central for both theoretical analysis and practical computation. When $S$ and $A$ are finite, the problem reduces to a finite-dimensional LP solvable by standard algorithms. In general Polish spaces, the convex analytic structure facilitates the characterization of solutions and supports constructive computational schemes based on approximating finite models.
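
To make the finite-dimensional case concrete, the sketch below (reusing the illustrative two-state model from Section 3, with assumed rewards, costs, and cost budget) solves the LP with SciPy and recovers a stationary randomized policy from the optimal occupation measure via $\pi(a\,|\,x) = \eta(x,a) / \hat{\eta}(x)$:

```python
# Occupation-measure LP for a finite constrained continuous-time MDP
# (illustrative sketch; model data are assumptions, not from the paper).
import numpy as np
from scipy.optimize import linprog

nS, nA, alpha = 2, 2, 0.1
gamma = np.array([1.0, 0.0])                # initial distribution: delta_0
rate = np.array([[1.0, 3.0], [0.5, 2.0]])   # jump rate out of x under a

# Transition-rate kernel q[y, x, a] = q({y} | x, a): jumps flip the state.
q = np.zeros((nS, nS, nA))
for x in range(nS):
    for a in range(nA):
        q[1 - x, x, a] = rate[x, a]
        q[x, x, a] = -rate[x, a]

r = np.array([[1.0, 4.0], [0.0, 0.0]])      # reward r(x, a)
c1 = np.array([[0.0, 2.0], [0.0, 1.0]])     # cost c_1(x, a)
d1 = 5.0                                    # budget: V_alpha(pi, c_1) <= d1

# Variables eta(x, a) >= 0, flattened with index x * nA + a.
# Balance: alpha*sum_a eta(y,a) - sum_{x,a} q(y|x,a) eta(x,a) = alpha*gamma(y)
A_eq = np.array([[alpha * (x == y) - q[y, x, a]
                  for x in range(nS) for a in range(nA)]
                 for y in range(nS)])
b_eq = alpha * gamma
A_ub, b_ub = c1.reshape(1, -1), [alpha * d1]  # discounted-cost constraint

res = linprog(-r.ravel() / alpha,             # maximize (1/alpha) * <r, eta>
              A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (nS * nA))
eta = res.x.reshape(nS, nA)
policy = eta / eta.sum(axis=1, keepdims=True)  # pi(a|x) = eta(x,a)/eta_hat(x)
print("optimal discounted reward:", -res.fun)
print("pi(a|x):\n", policy.round(3))
```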

6. Applicability: Examples and Explicit Policies

The framework allows explicit treatment of multiple classes of constrained continuous-time models, including those with unbounded state and action spaces and unbounded cost or reward functions. For instance, models with $S = \mathbb{R}$, action sets $A(x) = [\beta_0, \beta(|x| + 1)]$, and Gaussian (or more general) transition dynamics are addressed, even when quadratic or higher-order growth appears in $r(x,a)$ or $c_n(x,a)$. Closed-form expressions for optimal occupation measures and stationary policies are provided in certain cases, demonstrating the constructive power of the theory.

7. Significance, Generality, and Impact

The described constrained MDP framework significantly generalizes existing theory by:

  • Allowing unbounded transition rates and cost/reward functions with only Lyapunov-type (nonexplosion) conditions.
  • Admitting Polish (non-finite, possibly infinite-dimensional) state and action spaces.
  • Accommodating randomized, history-dependent policies.
  • Reducing the constrained control objective to a convex program over measures and establishing an explicit equivalence to a linear program, thereby connecting stochastic process theory with convex analysis and mathematical programming.

Theoretical results, such as the existence and characterization of constrained-optimal stationary randomized policies and the availability of explicit linear programming solutions, apply broadly: they encompass classical bounded settings and extend to models with unbounded, continuous dynamics or objectives. Computable examples demonstrate practical implementation of the framework for complex continuous-time systems, in contrast with earlier approaches limited to bounded coefficients and finite or countable spaces.

This synthesis—merging nonexplosion analysis, occupation measure methods, advanced topological structures, and linear programming duality—provides a comprehensive, rigorously founded, and computationally tractable theory for constrained continuous-time MDPs on general state-action spaces (Guo et al., 2011).
