
Constrained Markov Decision Processes

Updated 27 October 2025
  • Constrained MDPs are sequential decision processes that optimize discounted rewards while enforcing limits on cumulative costs or risks.
  • They incorporate Lyapunov-type conditions to prevent explosion and ensure finiteness in systems with unbounded transition rates and cost functions.
  • The formulation utilizes occupation measures to recast the dynamic control problem as a convex linear program for efficient computational solutions.

A constrained Markov decision process (MDP) is a sequential decision-making framework in which an agent seeks to optimize an expected performance criterion (such as discounted reward or long-run average reward) subject to additional constraints expressed as bounds on expected cost or risk metrics accrued over controlled stochastic trajectories. In contrast to standard, unconstrained MDPs, constrained MDPs address practical requirements in resource allocation, safety, fairness, or risk, and are foundational for advanced models in operations research, engineering, economics, and machine learning.

1. Formal Definition and Mathematical Framework

A constrained (continuous-time) MDP is typically specified by a tuple

$$\{\, S,\ \{A(x) \subseteq A,\ x \in S\},\ q(\cdot\,|\,x,a),\ r(x,a),\ (c_n(x,a), d_n)_{1 \leq n \leq N} \,\}$$

where:

  • $S$ is the state space (a Polish space: complete, separable, metric).
  • $A$ is the global action space; the admissible actions at state $x$ form a set $A(x) \subseteq A$.
  • $q(\cdot\,|\,x,a)$ is the Borel-measurable transition rate kernel; for each $(x,a)$, $q(\cdot\,|\,x,a)$ is a signed measure on $S$ satisfying $q(S\,|\,x,a) = 0$ and $q(D\,|\,x,a) \geq 0$ whenever $x \notin D$, with the local boundedness property
$$q^*(x) := \sup_{a \in A(x)} |q(\{x\}\,|\,x,a)| < \infty \quad \text{for all } x \in S.$$
  • $r(x,a)$ is the (possibly unbounded) real-valued reward function.
  • $c_n(x,a)$, $n = 1, \ldots, N$, are real-valued, possibly unbounded cost functions, and $d_n$ are the corresponding upper bounds on expected discounted costs.
  • $\alpha > 0$ is the discount rate, and $\gamma$ is the initial distribution on $S$.

The objective is to select a policy $\pi$ (possibly history-dependent and randomized) maximizing the expected discounted reward

$$V_r(\pi) = \int_S V_\alpha(x, \pi, r)\, \gamma(dx),$$

with

$$V_\alpha(x, \pi, u) = \int_0^\infty e^{-\alpha t}\, \mathbb{E}_x^\pi\bigl[u(\xi_{t^-}, a)\bigr]\, dt,$$

while satisfying the discounted cost constraints

$$V_\alpha(\pi, c_n) = \int_S V_\alpha(x, \pi, c_n)\, \gamma(dx) \leq d_n, \quad n = 1, \ldots, N.$$

The admissible class of policies is broad: arbitrary measurable, potentially history-dependent, randomized mappings from observed paths to actions.
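
As a concrete instantiation of the tuple above (an illustrative example of ours, not one drawn from the paper), consider admission control of a single-server queue: the state is the queue length, the action is the admission rate, the reward trades throughput against holding cost, and a single constraint ($N = 1$) caps expected discounted congestion:

$$S = \{0, 1, 2, \ldots\}, \qquad A(x) \equiv [0, \lambda_{\max}],$$

$$q(\{x+1\}\,|\,x,a) = a, \qquad q(\{x-1\}\,|\,x,a) = \mu\, \mathbb{I}_{\{x > 0\}}, \qquad q(\{x\}\,|\,x,a) = -a - \mu\, \mathbb{I}_{\{x > 0\}},$$

$$r(x,a) = R\, a - h\, x, \qquad c_1(x,a) = x,$$

where the service rate $\mu$, reward and holding parameters $R, h$, admission cap $\lambda_{\max}$, and budget $d_1$ are modeling choices. Note that $r$ and $c_1$ are unbounded in $x$, which is precisely the regime the framework accommodates.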

2. Nonexplosion and Finiteness: Model Well-posedness

In continuous-time settings with unbounded transition rates and costs, model well-posedness, i.e., avoidance of process "explosion" (infinitely many jumps in finite time) and finiteness of expected rewards and costs, cannot be taken for granted. Sufficient conditions are provided via a Lyapunov-type inequality ("Assumption A"): there exist a continuous weight function $w \geq 1$, constants $\rho, b \geq 0$, and an increasing sequence of measurable subsets $S_k$ covering $S$ such that, for all $(x,a)$,

$$\int_S w(y)\, q(dy\,|\,x,a) \leq \rho\, w(x) + b.$$

Additionally, $\inf_{x \notin S_k} w(x) \to +\infty$ as $k \to \infty$, which ensures drift toward compact subsets and precludes explosion. These conditions guarantee that under any admissible policy the process $(\xi_t)$ satisfies $T_\infty = \infty$ almost surely, where $T_\infty$ denotes the limit of the jump times (so there is no explosion in finite time), while the occupation measure remains finite.
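
The drift condition is straightforward to verify in concrete models. Below is a minimal numeric spot-check (our own sketch, not code from the paper) for a variant of the queueing example of Section 1 with state-proportional departure rate $\mu x$, so the transition rates are unbounded; taking $w(x) = x + 1$ and $S_k = \{0, \ldots, k\}$ gives $\int_S w(y)\, q(dy\,|\,x,a) = a - \mu x \leq \rho\, w(x) + b$ with $\rho = 0$ and $b = \lambda_{\max}$, while $\inf_{x \notin S_k} w(x) = k + 2 \to \infty$.

```python
# Numeric spot-check (illustrative sketch, not from the paper) of the
# Lyapunov drift condition for a controlled birth-death process with
# birth rate a and *unbounded* death rate mu * x.
import numpy as np

mu, lam_max = 0.5, 2.0                # assumed service and admission-cap rates
w = lambda x: x + 1.0                 # continuous weight function, w >= 1
rho, b = 0.0, lam_max                 # candidate drift constants

def drift(x, a):
    """Compute sum_y w(y) * q({y} | x, a) for the birth-death kernel."""
    out = a * (w(x + 1) - w(x))            # admission (birth) term
    if x > 0:
        out += mu * x * (w(x - 1) - w(x))  # departure (death) term
    return out

ok = all(drift(x, a) <= rho * w(x) + b + 1e-9
         for x in range(2000)
         for a in np.linspace(0.0, lam_max, 11))
print("Drift condition holds on the test grid:", ok)  # expect: True
```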

3. Occupation Measures and Problem Reduction

A pivotal feature is the reduction of the dynamic, constrained control problem to a static optimization over the space of occupation measures. For any policy $\pi$, the occupation measure $\eta^\pi$ on $S \times A$ is defined by

$$\eta^\pi(D \times \Gamma) = \alpha \int_0^\infty e^{-\alpha t}\, \mathbb{E}_\gamma^\pi\bigl[\mathbb{I}_{\{\xi_t \in D\}}\, \pi(\Gamma\,|\,e, t)\bigr]\, dt$$

for measurable $D \subseteq S$ and $\Gamma \subseteq A$. This measure quantifies the discounted frequency with which state-action pairs are encountered. Its state marginal $\hat{\eta}^\pi(\cdot) = \eta^\pi(\cdot \times A)$ satisfies a generalized balance equation,

$$\alpha\, \hat{\eta}^\pi(D) = \alpha\, \gamma(D) + \int_{S \times A} q(D\,|\,x,a)\, \eta^\pi(dx, da),$$

analogous to the global balance equations of continuous-time jump processes. The original constrained MDP, formulated over trajectories and histories, is thereby recast equivalently as an optimization over a convex subset of probability measures.
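
Because $\eta^\pi$ is defined by a discounted time integral, it can be estimated directly by simulation. The sketch below (a toy two-state model with invented rates, not an example from the paper) estimates $\eta^\pi$ under a stationary randomized policy and confirms that its total mass is approximately one, as the definition implies:

```python
# Monte Carlo estimate (illustrative sketch) of the discounted occupation
# measure eta^pi for a two-state chain that always flips state on a jump.
import numpy as np

rng = np.random.default_rng(0)
alpha = 0.1                                 # discount rate
rate = np.array([[1.0, 3.0],                # rate(x, a): jump rate out of x
                 [0.5, 2.0]])
pi = np.array([[0.5, 0.5],                  # pi(a | x): stationary policy
               [1.0, 0.0]])

def one_path(T=200.0):
    """Discounted occupation mass accumulated along one trajectory."""
    eta = np.zeros((2, 2))
    t, x = 0.0, 0                           # gamma = delta_0: start in state 0
    while t < T:
        lam = pi[x] @ rate[x]               # policy-mixed jump rate in state x
        dt = rng.exponential(1.0 / lam)     # exponential sojourn time
        # alpha * int_t^{t+dt} e^{-alpha s} ds, attributed to actions via pi
        eta[x] += (np.exp(-alpha * t) - np.exp(-alpha * (t + dt))) * pi[x]
        t, x = t + dt, 1 - x                # jump to the other state
    return eta

eta = sum(one_path() for _ in range(2000)) / 2000
print(eta.round(3), "| total mass:", eta.sum().round(3))  # mass close to 1
```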

4. Weighted Weak Topology and Existence of Solutions

For unbounded functions $r$ or $c_n$, the set of feasible occupation measures may fail to be compact in the standard weak topology. The paper therefore introduces $\bar{w}$-weak convergence: a sequence $\{\eta_k\}$ converges $\bar{w}$-weakly to $\eta$ if, for every continuous function $u$ satisfying $|u(x,a)| \leq L\, \bar{w}(x)$ for some constant $L$,

$$\lim_{k \to \infty} \int u(x, a)\, \eta_k(dx, da) = \int u(x, a)\, \eta(dx, da).$$

This topology, strictly stronger than standard weak convergence, provides the relative compactness needed to establish the existence of an optimal constrained policy under mild regularity and growth conditions.
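
A simple example (ours, under the definition above) shows what the refinement buys: on $S = \mathbb{R}$ with $\bar{w}(x) = 1 + |x|$, the measures $\eta_k = (1 - 1/k)\, \delta_0 + (1/k)\, \delta_k$ converge weakly to $\delta_0$, yet for the admissible test function $u(x) = x$ (which satisfies $|u| \leq \bar{w}$),

$$\int u\, d\eta_k = \frac{1}{k} \cdot k = 1 \not\to 0 = \int u\, d\delta_0,$$

so $\{\eta_k\}$ fails to converge $\bar{w}$-weakly. Mass escaping to infinity is invisible to the standard weak topology but not to the $\bar{w}$-weak topology, which is exactly why the latter is the right setting for unbounded rewards and costs.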

5. Linear Programming Reformulation and Computational Implications

The occupation measure reduction allows an explicit linear programming formulation. With $c_0 := -r$, the constrained optimization problem becomes:

$$\begin{aligned} \text{minimize} \quad & \frac{1}{\alpha} \int_{S \times A} c_0(x,a)\, \eta(dx, da) \\ \text{subject to} \quad & \int_{S \times A} c_n(x,a)\, \eta(dx, da) \leq \alpha\, d_n, \quad n = 1, \ldots, N, \\ & \alpha\, \hat{\eta}(D) = \alpha\, \gamma(D) + \int_{S \times A} q(D\,|\,x,a)\, \eta(dx, da) \quad \text{for all "small" } D. \end{aligned}$$

This linear structure is central for both theoretical analysis and practical computation. When $S$ and $A$ are finite, the problem reduces to a finite-dimensional LP solvable by standard algorithms. In general Polish spaces, the convex analytic structure facilitates the characterization of solutions and supports constructive computational schemes based on approximating finite models.
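
To make the finite-dimensional case concrete, the sketch below (reusing the illustrative two-state model from Section 3, with assumed rewards, costs, and cost budget) solves the LP with SciPy and recovers a stationary randomized policy from the optimal occupation measure via $\pi(a\,|\,x) = \eta(x,a) / \hat{\eta}(x)$:

```python
# Occupation-measure LP for a finite constrained continuous-time MDP
# (illustrative sketch; model data are assumptions, not from the paper).
import numpy as np
from scipy.optimize import linprog

nS, nA, alpha = 2, 2, 0.1
gamma = np.array([1.0, 0.0])                # initial distribution: delta_0
rate = np.array([[1.0, 3.0], [0.5, 2.0]])   # jump rate out of x under a

# Transition-rate kernel q[y, x, a] = q({y} | x, a): jumps flip the state.
q = np.zeros((nS, nS, nA))
for x in range(nS):
    for a in range(nA):
        q[1 - x, x, a] = rate[x, a]
        q[x, x, a] = -rate[x, a]

r = np.array([[1.0, 4.0], [0.0, 0.0]])      # reward r(x, a)
c1 = np.array([[0.0, 2.0], [0.0, 1.0]])     # cost c_1(x, a)
d1 = 5.0                                    # budget: V_alpha(pi, c_1) <= d1

# Variables eta(x, a) >= 0, flattened with index x * nA + a.
# Balance: alpha*sum_a eta(y,a) - sum_{x,a} q(y|x,a) eta(x,a) = alpha*gamma(y)
A_eq = np.array([[alpha * (x == y) - q[y, x, a]
                  for x in range(nS) for a in range(nA)]
                 for y in range(nS)])
b_eq = alpha * gamma
A_ub, b_ub = c1.reshape(1, -1), [alpha * d1]  # discounted-cost constraint

res = linprog(-r.ravel() / alpha,             # maximize (1/alpha) * <r, eta>
              A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
              bounds=[(0, None)] * (nS * nA))
eta = res.x.reshape(nS, nA)
policy = eta / eta.sum(axis=1, keepdims=True)  # pi(a|x) = eta(x,a)/eta_hat(x)
print("optimal discounted reward:", -res.fun)
print("pi(a|x):\n", policy.round(3))
```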

6. Applicability: Examples and Explicit Policies

The framework allows explicit treatment of multiple classes of constrained continuous-time models, including those with unbounded state and action spaces and unbounded cost or reward functions. For instance, models with $S = \mathbb{R}$, action sets $A(x) = [\beta_0, \beta(|x| + 1)]$, and Gaussian (or more general) transition dynamics are addressed, even when quadratic or higher-order growth appears in $r(x,a)$ or $c_n(x,a)$. Closed-form expressions for optimal occupation measures and stationary policies are provided in certain cases, demonstrating the constructive power of the theory.

7. Significance, Generality, and Impact

The described constrained MDP framework significantly generalizes existing theory by:

  • Allowing unbounded transition rates and cost/reward functions with only Lyapunov-type (nonexplosion) conditions.
  • Admitting Polish (non-finite, possibly infinite-dimensional) state and action spaces.
  • Accommodating randomized, history-dependent policies.
  • Reducing the constrained control objective to a convex program over measures and establishing an explicit equivalence to a linear program, thereby connecting stochastic process theory with convex analysis and mathematical programming.

Theoretical results, such as the existence and characterization of constrained-optimal stationary randomized policies and the availability of explicit linear programming solutions, apply broadly: they encompass classical bounded settings and extend to models with unbounded, continuous dynamics or objectives. Computable examples demonstrate practical implementation of the framework for complex continuous-time systems, in contrast with earlier approaches limited to bounded coefficients and finite or countable spaces.

This synthesis—merging nonexplosion analysis, occupation measure methods, advanced topological structures, and linear programming duality—provides a comprehensive, rigorously founded, and computationally tractable theory for constrained continuous-time MDPs on general state-action spaces (Guo et al., 2011).
