Continuous-Time Discrete Diffusion

Updated 20 November 2025
  • Continuous-Time Discrete Diffusion is a framework defined via CTMCs that enables probabilistic modeling and generative sampling in discrete spaces with rigorous error control.
  • The approach employs forward noising and backward denoising processes, using score function estimation and efficient methods like uniformization and τ-leaping.
  • Empirical benchmarks in image, graph, and sequence generation underscore its competitive performance, scalability, and theoretical robustness.

Continuous-Time Discrete Diffusion models constitute a rigorous and versatile methodology for probabilistic modeling, estimation, and generative sampling in discrete state spaces governed by continuous-time stochastic dynamics. Mathematically, these processes are formalized as Continuous-Time Markov Chains (CTMCs)—finite or countable-state jump processes with well-defined generator matrices—which enable analytically tractable forward noising and theoretically principled backward (denoising or generative) sampling. These frameworks achieve provable error control, efficient implementation, and support for high-dimensional discrete data such as sequences, images, and graphs, and they establish deep correspondences with their continuous state-space diffusion counterparts.

1. Mathematical Construction and Stochastic Integral Framework

Continuous-time discrete diffusion models embed data in a finite (or countably infinite) configuration space $\mathcal{X}$, evolving according to a CTMC with generator $Q_t(x, y)$ that specifies the instantaneous rate of jumping from $x$ to $y$. The evolution of the law $p_t$ is described by the Kolmogorov forward (master) equation

$$\frac{d}{dt}\,p_t(x) = \sum_{y\in\mathcal{X}} Q_t(y, x)\,p_t(y),$$

with the usual constraints $Q_t(x, y)\geq 0$ for $x\neq y$ and $Q_t(x, x) = -\sum_{y \ne x} Q_t(x, y)$. This evolution admits a stochastic pathwise representation using compensated Poisson random measures: the sample path $X_t$ can be written as

$$X_t = X_0 + \int_0^t \int_{\mathcal{X}} (y - X_{s^-})\,N[Q_s](ds, dy),$$

where $N[Q_s](ds, dy)$ is a Poisson random measure specified by the jump intensities $Q_s(y, X_{s^-})$. The compensated form splits into a martingale component and a compensator, yielding a Lévy-type stochastic integral parallel to the Itô calculus for SDEs (Ren et al., 4 Oct 2024).
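To make the pathwise picture concrete, the following minimal Gillespie-style simulator draws exact CTMC trajectories from a given generator. This is an illustrative sketch: the function and the uniform-rate toy generator are ours, not code from the cited papers.

```python
import numpy as np

def simulate_ctmc(Q, x0, T, rng=None):
    """Exact (Gillespie-style) simulation of a time-homogeneous CTMC.

    Q  : (S, S) generator with Q[x, y] >= 0 for y != x and rows summing to zero.
    x0 : initial state index; T : time horizon.
    Returns the piecewise-constant path as (jump_time, state) pairs.
    """
    rng = rng or np.random.default_rng()
    t, x, path = 0.0, x0, [(0.0, x0)]
    while True:
        exit_rate = -Q[x, x]                    # total rate of leaving state x
        if exit_rate <= 0:                      # absorbing state: nothing more happens
            break
        t += rng.exponential(1.0 / exit_rate)   # Exp(exit_rate) holding time
        if t >= T:
            break
        jump_probs = np.maximum(Q[x], 0.0)      # off-diagonal rates only
        jump_probs[x] = 0.0
        x = int(rng.choice(len(jump_probs), p=jump_probs / exit_rate))
        path.append((t, x))
    return path

# Toy run with the uniform generator Q = beta * (11^T / S - I) on S = 4 states.
S, beta = 4, 1.0
Q = beta * (np.ones((S, S)) / S - np.eye(S))
print(simulate_ctmc(Q, x0=0, T=3.0))
```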

A crucial structural property is the Girsanov theorem for Poisson random measures: a change of measure weighted by a positive predictable process $h_t(y)$ modifies the jump intensity from $Q_t(y, X_t)$ to $Q_t(y, X_t)\,h_t(y)$ with an explicit Radon–Nikodym derivative, supporting the analysis of likelihood ratios, KL divergence, and pathwise variational objectives (Ren et al., 4 Oct 2024).
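Written out for this setting, the density takes the standard exponential-martingale form for marked point processes (stated here schematically, with notation adapted to the present framework rather than quoted from the source):

$$\left.\frac{d\mathbb{P}^h}{d\mathbb{P}}\right|_{\mathcal{F}_T} = \exp\!\left( \int_0^T\!\!\int_{\mathcal{X}} \log h_s(y)\, N[Q_s](ds, dy) \;-\; \int_0^T\!\!\int_{\mathcal{X}} \big(h_s(y) - 1\big)\, Q_s(y, X_{s^-})\,\nu(dy)\, ds \right),$$

where $\nu$ is the reference counting measure on $\mathcal{X}$ and $\mathbb{P}^h$ denotes the tilted path law.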

2. Forward and Reverse Processes: Generators and Time-Reversal

The forward noising process is constructed as a CTMC whose generator $Q_t$ is typically factorized coordinate-wise for high-dimensional $\mathcal{X} = [S]^d$ or $\{0,1\}^n$, often instantiating uniform or birth–death jump rates to ensure ergodic mixing toward a simple reference law (e.g., the uniform distribution or an absorbing state). For time-homogeneous chains, the transition kernels are given by $P(t) = \exp(Qt)$ (Chen et al., 12 Feb 2024, Kirkby et al., 2021).
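For intuition, the following short check compares the matrix exponential against the closed-form kernel of the uniform generator $Q = \beta(\mathbf{1}\mathbf{1}^\top/S - I)$; this is an illustrative sketch, not code from the cited works:

```python
import numpy as np
from scipy.linalg import expm

S, beta, t = 4, 1.0, 0.7
Q = beta * (np.ones((S, S)) / S - np.eye(S))     # uniform forward generator

P_numeric = expm(Q * t)                          # P(t) = exp(Qt)
# Closed form for this generator: P(t) = e^{-bt} I + (1 - e^{-bt}) 11^T / S,
# which follows because 11^T / S is idempotent.
P_closed = (np.exp(-beta * t) * np.eye(S)
            + (1.0 - np.exp(-beta * t)) * np.ones((S, S)) / S)
assert np.allclose(P_numeric, P_closed)
print(P_numeric[0])   # each row is a probability vector mixing toward uniform
```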

The exact time-reversal of a CTMC remains a CTMC with a modified (generally time-inhomogeneous) generator

$$Q_t^{\mathrm{rev}}(x, y) = Q_{T-t}(y, x)\,\frac{p_{T-t}(y)}{p_{T-t}(x)},$$

where $p_{T-t}$ is the marginal of the forward chain at time $T-t$ (Campbell et al., 2022). This induces a dependence of the reverse jump rates on intractable density ratios. In high dimension, both forward and reverse generators are typically implemented coordinate-wise, with all transitions restricted to Hamming distance 1 (Zhang et al., 3 Oct 2024).
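The reversal formula is mechanical once the marginals are known, as the small sketch below shows for a toy time-homogeneous chain (illustrative only: in generative models $p_{T-t}$ is intractable and the density ratio is replaced by a learned score estimate):

```python
import numpy as np
from scipy.linalg import expm

def reverse_generator(Q, p0, t, T):
    """Exact reverse-time generator Q_rev_t(x, y) = Q(y, x) p_{T-t}(y) / p_{T-t}(x)
    for a time-homogeneous forward CTMC with generator Q and initial law p0."""
    p = p0 @ expm(Q * (T - t))                 # forward marginal at time T - t
    Qrev = Q.T * p[None, :] / p[:, None]       # Q(y, x) * p(y) / p(x)
    np.fill_diagonal(Qrev, 0.0)
    np.fill_diagonal(Qrev, -Qrev.sum(axis=1))  # generator rows sum to zero
    return Qrev

S, beta, T = 4, 1.0, 2.0
Q = beta * (np.ones((S, S)) / S - np.eye(S))   # uniform forward generator
p0 = np.array([0.7, 0.1, 0.1, 0.1])            # toy data distribution
print(reverse_generator(Q, p0, t=0.5, T=T))
```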

3. Score Function Estimation and Variational Objectives

The discrete score function, playing the role analogous to $\nabla_x \log p_t(x)$ in continuous diffusion models, is the conditional ratio $s_t(x)_{i,\hat{x}^i} = p_t(x^{\backslash i} \odot \hat{x}^i)/p_t(x)$, where $x^{\backslash i} \odot \hat{x}^i$ denotes $x$ with coordinate $i$ replaced by $\hat{x}^i$; it represents the conditional probability of a coordinate-$i$ replacement. The reverse transition rates are parameterized using these score estimates.

Training is based on minimizing a continuous-time variational lower bound (ELBO), path-space KL, or score-entropy losses. For instance, the pathwise KL between the forward and learned reverse process decomposes as

$$\mathrm{KL}(p_{0:T} \,\Vert\, q_{0:T}) = \mathrm{KL}(p_0 \Vert q_0) + \mathbb{E}_{p}\left[ \int_{0}^{T}\!\!\int \left( \mu_s(y)\log\frac{\mu_s(y)}{\hat\mu_{\lfloor s\rfloor}(y)} - \mu_s(y) + \hat\mu_{\lfloor s\rfloor}(y) \right) \nu(dy)\, ds \right],$$

quantifying both score-matching error and discretization error from approximating the reverse intensity (Ren et al., 4 Oct 2024, Zhao et al., 6 Feb 2024, Sun et al., 2022). The practical loss typically collapses to coordinate-wise cross-entropy between the predicted and true singleton conditionals (Campbell et al., 2022, Zhang et al., 3 Oct 2024).

For parameter learning, the score or ratio estimates can be trained using time-integrated expected cross-entropy or KL divergence between model and true conditionals for all coordinates and all times, ensuring unbiased estimation in the sense of Hyvärinen's score matching (Sun et al., 2022, Zhao et al., 6 Feb 2024).
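A minimal PyTorch sketch of such a coordinate-wise cross-entropy objective follows, assuming the uniform forward kernel from Section 2; the `model` signature, the uniform time sampling, and the unweighted loss are illustrative simplifications, not the papers' exact objectives:

```python
import torch
import torch.nn.functional as F

def discrete_diffusion_loss(model, x0, S, beta=1.0):
    """Coordinate-wise cross-entropy surrogate for continuous-time discrete
    diffusion with the uniform forward kernel (illustrative sketch; the exact
    time weighting of the ELBO / score-entropy losses is omitted).

    model : callable (x_t, t) -> logits of shape (B, d, S) over clean states.
    x0    : (B, d) long tensor of clean discrete data.
    """
    B, d = x0.shape
    t = torch.rand(B, 1)                           # diffusion times ~ U(0, 1)
    keep_prob = torch.exp(-beta * t)               # P(no jump hits a coordinate)
    resample = torch.rand(B, d) >= keep_prob       # coordinates hit by a jump
    noise = torch.randint(0, S, (B, d))            # uniform replacement values
    x_t = torch.where(resample, noise, x0)         # forward-corrupted sample
    logits = model(x_t, t.squeeze(1))              # predicted singleton conditionals
    return F.cross_entropy(logits.reshape(B * d, S), x0.reshape(B * d))

# Toy usage: a stateless "model" that ignores its inputs and outputs zeros.
model = lambda x_t, t: torch.zeros(x_t.shape[0], x_t.shape[1], 8)
x0 = torch.randint(0, 8, (16, 10))
print(discrete_diffusion_loss(model, x0, S=8))     # = log(8) for uniform logits
```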

4. Sampling Algorithms and Error Analysis

Several sampling schemes solve the reverse (generative) CTMC. The uniformization algorithm provides exact simulation: transitions are scheduled at the event times of a Poisson process with a fixed upper-bound rate $\lambda$, and at each event a move is made according to rescaled generator entries (Chen et al., 12 Feb 2024, Huang et al., 28 May 2025). An alternative is the $\tau$-leaping method, in which the process is approximated by freezing the jump rates over intervals of length $\tau$; jumps are then applied based on Poisson counts per interval, as in the sketch below (Ren et al., 4 Oct 2024, Campbell et al., 2022, Xu et al., 19 May 2024).
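The following sketch shows one common way to realize $\tau$-leaping for a scalar discrete state; it is an assumption-laden illustration (the collision-resolution rule and the constant-rate toy `rate_fn` are ours), not the exact algorithm of any single cited paper:

```python
import numpy as np

def tau_leaping_sample(rate_fn, x0, T, tau, rng=None):
    """tau-leaping simulation of a CTMC (illustrative sketch).

    rate_fn(x, t) returns the length-S vector of rates Q_t(x, y) over
    destinations y, with rate_fn(x, t)[x] == 0. Rates are frozen on each
    interval [t, t + tau); per-destination Poisson counts are drawn, and if
    several destinations fire, one is applied (a simple collision rule).
    """
    rng = rng or np.random.default_rng()
    x, t = x0, 0.0
    while t < T:
        rates = rate_fn(x, t)               # frozen over the current leap
        counts = rng.poisson(rates * tau)   # independent Poisson counts per destination
        fired = np.flatnonzero(counts)
        if fired.size > 0:
            x = int(rng.choice(fired))      # resolve simultaneous jumps
        t += tau
    return x

# Toy usage with constant uniform rates (a stand-in for learned reverse rates).
S, beta = 4, 1.0
rate_fn = lambda x, t: np.where(np.arange(S) == x, 0.0, beta / S)
print(tau_leaping_sample(rate_fn, x0=0, T=2.0, tau=0.05))
```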

The theoretical error analysis decomposes the pathwise KL between the model and true sampling process into:

  • Truncation error: due to stopping at finite $T$ instead of running the forward process to stationarity;
  • Score approximation error: incurred by replacing true density ratios with model estimates in the reverse generator;
  • Discretization error: arising from piecewise-constant approximations to time-inhomogeneous intensities (of order $O(\tau)$ for $\tau$-leaping) (Ren et al., 4 Oct 2024).
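Schematically, these three contributions combine into a bound of the form (a paraphrase of the structure of such results, with constants, exact norms, and regularity conditions suppressed):

$$\mathrm{KL}\big(p_0 \,\Vert\, \hat q_T\big) \;\lesssim\; \underbrace{e^{-cT}\,\mathrm{KL}(p_0 \Vert \pi)}_{\text{truncation}} \;+\; \underbrace{\epsilon_{\mathrm{score}}}_{\text{score approximation}} \;+\; \underbrace{C\,\tau}_{\text{discretization}},$$

where $\pi$ is the stationary law of the forward chain, $\hat q_T$ is the law of the generated samples, and $\epsilon_{\mathrm{score}}$ aggregates the time-integrated score-matching error.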

Explicit complexity bounds are established for uniformization and $\tau$-leaping. For grid models of dimension $d$, uniformization reaches TV or KL error $\epsilon$ in $O(d \log d)$ steps (Chen et al., 12 Feb 2024), and in Quantized Transition Diffusion (QTD), $O(d \ln^2(d/\epsilon))$ neural-network evaluations suffice under minimal score assumptions (Huang et al., 28 May 2025).

5. Empirical Benchmarks and Applied Domains

Continuous-time discrete diffusion models are empirically validated across domains:

  • Image generation: Models such as Blackout Diffusion and score-based CT-DDMs achieve Inception scores and FID metrics competitive with continuous-state or discrete-time baselines (Santos et al., 2023, Sun et al., 2022, Campbell et al., 2022, Huang et al., 28 May 2025). In detailed studies, $\tau$-leaping and predictor–corrector samplers markedly reduce the number of function evaluations per high-quality sample (Campbell et al., 2022).
  • Graph generation: Discrete-state continuous-time diffusion models (e.g., Cometh, DisCo) attain state-of-the-art Valid-Unique-Novel scores and invariance properties on molecular and non-molecular graph tasks (Siraudin et al., 10 Jun 2024, Xu et al., 19 May 2024).
  • Language and sequence modeling: Hybrid continuous-time models with jump processes (DiffuSeq-v2, DNDM) accelerate text generation by roughly $10\times$ to $800\times$ relative to discrete-time solvers while preserving quality (Gong et al., 2023, Chen et al., 2023).
  • Special cases: The Ehrenfest process bridges the discrete CTMC with the continuous Ornstein–Uhlenbeck SDE, supporting transfer of denoising score learning between discrete and continuous spaces (Winkler et al., 6 May 2024).

6. Extensions, Limitations, and Open Theoretical Issues

Continuous-time discrete diffusion provides a unifying framework for generative modeling in discrete domains, offering exact simulation (via uniformization), analytic marginal distributions for certain rate matrices, and theoretically controlled error. However, practical deployment raises open questions:

  • Controlling the mixing time and global jump rate $\lambda$ in nonuniform or stiff generators, especially for high-dimensional or structured data (Chen et al., 12 Feb 2024);
  • Accelerating samplers beyond worst-case $\lambda$ bounds via state-adaptive leap methods or more rapidly mixing bases, such as expanders or nonlocal kernels (Ren et al., 4 Oct 2024);
  • Handling unbounded or ill-conditioned densities near $t \downarrow 0$ (early stopping, score clipping) (Zhang et al., 3 Oct 2024);
  • Generalizing error analysis to arbitrary non-local discrete transitions and sparse or structured state spaces (Huang et al., 28 May 2025);
  • Efficient score learning for non-factorized or permutation-invariant objects (e.g., graphs, sequences), including exploiting equivariant architectures and novel positional encodings (Siraudin et al., 10 Jun 2024, Xu et al., 19 May 2024).

7. Connections to Continuous Diffusion and Hybrid Models

Continuous-time discrete diffusion provides discrete-state analogues of continuous SDE-based score models, with Poisson random measure noise replacing Brownian motion. In the scaling limit (fine discretization, large state cardinality), discrete CTMCs converge to standard SDE models (e.g., the Ehrenfest-to-O-U bridge) (Winkler et al., 6 May 2024). Several frameworks blend discrete diffusions with softmax or simplex-valued continuous representations or with discrete jumps in hybrid spaces (e.g., for categorical or bounded data, or for embedding discrete text into continuous spaces) (Floto et al., 2023, Gong et al., 2023). This hybridization facilitates theoretical unification and comparison of sampling and error properties across paradigms.

