Continuous-Time Markov Decision Process
- CTMDP is a stochastic control model that integrates continuous-time dynamics with decision-induced transitions, essential for analyzing system behaviors under uncertainty.
- The framework employs uniformization to convert continuous-time models into discrete equivalents, streamlining the computation of time-bounded reachability probabilities.
- For time-bounded reachability, optimal scheduling policies in CTMDPs combine a finite-memory preamble with a memoryless greedy strategy, enabling effective solutions in reliability, manufacturing, and cyber-physical applications.
A continuous-time Markov decision process (CTMDP) is a stochastic control model that describes the evolution of systems where transitions between states occur randomly in continuous time, with transition rates and transition probabilities modulated by a decision maker's actions. CTMDPs generalize continuous-time Markov chains by introducing nondeterminism through controlled actions, providing a foundational framework for modeling, analyzing, and optimizing decision-making in complex temporal and stochastic environments. CTMDPs are central tools in reliability theory, dependability analysis, manufacturing, queueing systems, cyber-physical systems, and formal verification under real-time and uncertainty constraints.
1. Formal Definition and Core Structure
A CTMDP is formally defined as a tuple $\mathcal{M} = (S, \mathrm{Act}, \mathbf{R}, \nu, G)$, where:
- $S$ is a (countable or continuous, often Polish) state space;
- $\mathrm{Act}$ specifies, for each state $s \in S$, the set of admissible actions $\mathrm{Act}(s)$;
- $\mathbf{R} : S \times \mathrm{Act} \times S \to \mathbb{R}_{\ge 0}$ is the transition rate kernel, where $\mathbf{R}(s, a, s')$ gives the rate of transitioning from state $s$ to state $s'$ under action $a$;
- $\nu$ is an initial state distribution;
- $G \subseteq S$ may denote a designated goal region (for reachability analysis).
At each time, the decision maker selects an action $a \in \mathrm{Act}(s)$ at the current state $s$, which determines the sojourn time (typically exponentially distributed with total exit rate $E(s, a)$) and the jump distribution over successor states, where $E(s, a) = \sum_{s'} \mathbf{R}(s, a, s')$ and $P(s, a, s') = \mathbf{R}(s, a, s') / E(s, a)$.
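As a concrete illustration of this construction, the following Python sketch (not from any source implementation; the rate values and the array layout `R[s, a, t]` are hypothetical) derives the exit rate $E(s, a)$ and the jump distribution $P(s, a, \cdot)$ from a finite rate kernel and simulates a single transition:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 3-state, 2-action rate kernel; R[s, a, t] encodes R(s, a, t).
R = np.zeros((3, 2, 3))
R[0, 0] = [0.0, 2.0, 1.0]   # action 0 in state 0: rate 2 to state 1, rate 1 to state 2
R[0, 1] = [0.0, 0.5, 2.5]   # action 1 in state 0 favors state 2

def step(s, a):
    """Simulate one CTMDP transition from state s under action a."""
    E = R[s, a].sum()                    # total exit rate E(s, a)
    sojourn = rng.exponential(1.0 / E)   # sojourn time ~ Exp(E(s, a))
    P = R[s, a] / E                      # embedded jump distribution P(s, a, .)
    s_next = rng.choice(len(P), p=P)     # sample the successor state
    return sojourn, s_next

print(step(0, 0))
```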
Scheduler types reflect differing information patterns, including:
- History-dependent (H): schedulers as mappings $\mathrm{Paths} \to \mathrm{Decisions}$, where a path records the (time-abstract) sequence of states visited so far;
- Counting or hop-counting (C): schedulers based on the current state and the number of transitions taken so far, i.e., mappings $S \times \mathbb{N} \to \mathrm{Decisions}$;
- Memoryless/positional (P): schedulers as functions $S \to \mathrm{Decisions}$.
Each decision may select an action deterministically or specify a probability distribution over actions (randomized strategies).
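The following typed sketch (interfaces and names are assumptions for illustration, not from the source) contrasts the three information patterns and shows the preamble-then-greedy shape that Section 3 establishes as sufficient:

```python
# Hypothetical scheduler interfaces, distinguished by what part of the
# (time-abstract) history each class may observe.
from typing import Callable, Sequence

State, Action = int, int
Path = Sequence[State]  # time-abstract history: the sequence of visited states

HistorySched    = Callable[[Path], Action]        # class H: sees the full path
CountingSched   = Callable[[State, int], Action]  # class C: state + hop count
PositionalSched = Callable[[State], Action]       # class P: current state only

def with_preamble(preamble: CountingSched, greedy: PositionalSched,
                  k: int) -> CountingSched:
    """A counting scheduler that may act arbitrarily for the first k hops,
    then switches to a memoryless greedy scheduler."""
    def sigma(s: State, hops: int) -> Action:
        return preamble(s, hops) if hops < k else greedy(s)
    return sigma
```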
2. Time-Bounded Reachability and Uniformization Techniques
Time-bounded reachability problems, central in dependability and system performance analysis, seek the maximal probability that a trajectory reaches a goal set $G$ within a time bound $T$. For a CTMDP $\mathcal{M}$ and a scheduler $\sigma$, the probability of reaching $G$ within time $T$ is
$$\mathrm{Pr}^{\sigma}\big(\Diamond^{\le T} G\big),$$
where this probability is calculated recursively according to the chosen action and the time evolution.
In uniform CTMDPs, where all actions have the same exit rate $\lambda$, the time-bounded reachability problem has a canonical reduction via uniformization: the number of discrete transitions by time $T$ is Poisson distributed, i.e.,
$$\mathrm{Pr}[N_T = n] = e^{-\lambda T} \frac{(\lambda T)^n}{n!},$$
so the reachability probability can be written as
$$\mathrm{Pr}^{\sigma}\big(\Diamond^{\le T} G\big) = \sum_{n=0}^{\infty} e^{-\lambda T} \frac{(\lambda T)^n}{n!} \, P_n^{\sigma},$$
where $P_n^{\sigma}$ is the step probability vector, i.e., the probability to reach $G$ in at most $n$ steps from a given state under $\sigma$, completely abstracting from actual timing.
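A minimal sketch of evaluating this sum, assuming the step probabilities $P_n^{\sigma}$ are supplied by a callable (names and the truncation strategy are illustrative, not from the source):

```python
import math

def time_bounded_reachability(step_prob, lam, T, eps=1e-12, max_steps=10**6):
    """Evaluate sum_n pois(n; lam*T) * step_prob(n), truncated once the
    remaining Poisson tail (an upper bound on the error) drops below eps."""
    total, mass, n = 0.0, 0.0, 0
    w = math.exp(-lam * T)  # Poisson pmf at n = 0 (use log-space weights if lam*T is large)
    while 1.0 - mass > eps and n < max_steps:
        total += w * step_prob(n)
        mass += w
        n += 1
        w *= lam * T / n    # pmf recurrence: w_n = w_{n-1} * (lam*T) / n
    return total            # underestimates the true value by less than eps
```

Because each $P_n^{\sigma} \le 1$, the neglected Poisson tail mass bounds the truncation error, which is what makes the reduction to finitely many step vectors sound.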
Uniformization also underpins other analysis tasks (such as weak bisimulation), allowing CTMDPs to be treated as embedded discrete-time MDPs (DTMDPs) for selected objectives.
3. Existence and Structure of Optimal Scheduling Policies
A central theoretical result is the constructive existence and computability of optimal schedulers for time-bounded reachability in the time-abstract scheduler classes CD, CR, HD, and HR (counting or history-dependent, each deterministic or randomized) for arbitrary CTMDPs (Rabe et al., 2010). For every CTMDP, there exists an optimal scheduler with the following properties:
- It uses only finite memory—a finite preamble—before “converging” to a memoryless (positional) greedy scheduler.
- The greedy scheduler, after this preamble, selects at each location $s \notin G$ an action that maximizes, lexicographically, the step probability vector, using the "shifted" vector criterion:
$$a \in \arg\max_{a' \in \mathrm{Act}(s)} \; \mathrm{shift}\big(P^{a'}(s)\big) \quad \text{(lexicographic order)},$$
with $\mathrm{shift}\big((v_0, v_1, v_2, \ldots)\big) = (v_1, v_2, \ldots)$.
The optimal scheduler thus needs to be non-greedy for at most the first $k$ steps, where $k$ is computable (via comparison of marginal gains against Poisson tails), and is greedy thereafter.
This memoryless convergence greatly simplifies algorithms for policy computation—leading to finite comparison among candidate schedulers in the initial phase and relegating the infinite tail to efficient CTMC analysis.
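As a small illustration of the greedy criterion, the following sketch (the array layout `P[n, s, a]` and the finite truncation horizon are assumptions) picks the action whose truncated step probability vector is lexicographically maximal:

```python
import numpy as np

def greedy_action(P, s, horizon):
    """P[n, s, a]: probability of reaching G within n steps when a is played
    first in s (greedy afterwards). Python tuples compare lexicographically,
    so taking max over the truncated vectors realizes the greedy criterion."""
    num_actions = P.shape[2]
    return max(range(num_actions), key=lambda a: tuple(P[1:horizon + 1, s, a]))
```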
4. Extensions: Markov Games and Further Models
The existence results for optimal policies extend to Markov games, where nondeterminism is "split" between two antagonistic players (angelic and demonic locations). For time-bounded reachability in uniform Markov games, both players possess deterministic memoryless optimal strategies after a finite preliminary phase, and the game value satisfies
$$\sup_{\sigma} \inf_{\tau} \mathrm{Pr}^{\sigma,\tau}\big(\Diamond^{\le T} G\big) \;=\; \inf_{\tau} \sup_{\sigma} \mathrm{Pr}^{\sigma,\tau}\big(\Diamond^{\le T} G\big).$$
These values can be computed as finite sums of the form $\sum_{n=0}^{N} e^{-\lambda T} \frac{(\lambda T)^n}{n!} \, P_n$, with the Poisson tail truncated at a computable depth $N$.
Extensions to the non-uniform case are addressed via a uniformization procedure, although time-abstract history may then reveal more structure, and further investigation into quantitative improvements and generalized schedulers is suggested (Rabe et al., 2010).
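A sketch of the step-vector recursion for a uniform Markov game (the array layout `P_jump[s, a, t]` and the state partition are hypothetical): angelic states maximize and demonic states minimize over actions at each step.

```python
import numpy as np

def game_step_vectors(P_jump, angelic, goal, N):
    """P_jump[s, a, t]: jump probabilities of the uniformized game.
    Returns Q[n, s] = optimal probability of reaching goal within n steps."""
    num_states = P_jump.shape[0]
    Q = np.zeros((N + 1, num_states))
    Q[:, goal] = 1.0                      # goal states count as reached
    for n in range(1, N + 1):
        vals = P_jump @ Q[n - 1]          # vals[s, a] = sum_t P(s, a, t) * Q[n-1, t]
        for s in range(num_states):
            if s in goal:
                continue
            Q[n, s] = vals[s].max() if s in angelic else vals[s].min()
    return Q
```

Weighting $Q[n, s_0]$ with the Poisson probabilities then yields the time-bounded game value, exactly as in the uniformization sum of Section 2.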
5. Methodological Approaches and Verification
Canonical algorithms for CTMDPs leverage uniformization, finite-memory policy enumeration, and CTMC model checking. For the time-bounded reachability problem:
- A candidate optimal scheduler is constructed by selecting, for each history up to $k$ steps, the action with maximal partial progress, with a switch to memoryless greedy choices thereafter. The process uses performance vectors and Poisson probability mass decays for cut-off estimation.
- Each strategy's probability is exactly computed using sums of the form $\sum_{n=0}^{N} e^{-\lambda T} \frac{(\lambda T)^n}{n!} \, P_n^{\sigma}$ (see the sketch after this list).
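A sketch of this exact computation for a fixed memoryless scheduler, combining the step-vector recursion with the Poisson weights (array layout and names are assumptions; extending to the finite-preamble case adds a hop-indexed action choice):

```python
import math
import numpy as np

def strategy_value(P_jump, sigma, goal, s0, lam, T, N):
    """P_jump[s, a, t]: uniformized jump probabilities; sigma[s]: action of a
    positional scheduler. Returns the Poisson-weighted sum of step
    probabilities from s0, truncated at N transitions."""
    num_states = P_jump.shape[0]
    P_n = np.zeros(num_states)
    P_n[goal] = 1.0                      # P_0: goal already reached
    value, w = 0.0, math.exp(-lam * T)   # Poisson pmf at n = 0
    for n in range(N + 1):
        value += w * P_n[s0]
        nxt = np.array([P_jump[s, sigma[s]] @ P_n for s in range(num_states)])
        nxt[goal] = 1.0                  # goal states remain absorbing
        P_n = nxt
        w *= lam * T / (n + 1)           # Poisson pmf recurrence
    return value
```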
Bisimulation and logical characterization efforts have introduced strong/weak bisimulation relations for CTMDPs, tightly relating state-space reductions to satisfaction of temporal logics such as continuous-time stochastic logic (CSL) and its extensions (Song et al., 2012). For broad subclasses (notably, non 2-step recurrent CTMDPs), strong and weak bisimulation coincide with CSL and its “no next” sublogic, offering exact reductions for model checking.
6. Applications and Implications
CTMDPs underpin modeling and synthesis in:
- Manufacturing systems: optimizing the probability of timely completion of production steps.
- Queueing systems: controlling admission and service to maximize (or minimize) the probability of hitting occupancy thresholds within deadlines.
- Dependability analysis: computing the maximal probability of safe/failure states being reached within time bounds.
- Verification of real-time and stochastic systems: integration with model checking tools for temporal-logic–based system verification (e.g., for safety or liveness in dense time).
The structural results imply that policy representation needs only finite memory prior to a transition to a greedy, memoryless regime, enabling algorithmic tractability and reduction in implementation complexity.
7. Limitations, Challenges, and Future Directions
The complexity of computing optimal policies is controlled by the size of the finite preamble (the bound $k$ above), tied closely to the decay of Poisson tail probabilities and the greedy advantage. When $k$ is large, brute-force search over the expanding candidate sets becomes computationally nontrivial.
Uniformization techniques are essential for non-uniform CTMDPs but introduce subtleties in scheduler observability and may reveal timing information absent in uniform models, indicating a need for future research into refined policies and further generalization.
Quantitative improvement of algorithms, reduction of search space for initial histories, and extension to broader classes of scheduling policies remain open research areas (Rabe et al., 2010).
In summary, the CTMDP framework formalizes the continuous-time decision-making problem under uncertainty and nondeterminism, with foundational results proving the sufficiency of finite-memory, eventually memoryless optimal policies for time-bounded reachability, and extending to Markov games. These insights directly enable practical optimization and verification in real-world stochastic, real-time systems subject to reliability, performance, and safety constraints.