
Multinomial Diffusion Process

Updated 21 January 2026
  • Multinomial Diffusion Process is a stochastic discrete-state Markov chain that gradually corrupts categorical data via scheduled resampling.
  • It features analytically tractable marginal and posterior distributions with methods like Gumbel–Max sampling and discrete inversion (DICE) for controlled editing.
  • Widely applied in generative modeling, semantic inpainting, text/speech synthesis, and particle-level simulations, it bridges discrete diffusion with continuous PDE models.

A multinomial diffusion process is a stochastic discrete-state Markov chain defined over categorical data, in which information is gradually corrupted by randomly resampling each token, symbol, or state according to a schedule of probabilities. This formulation encompasses a range of phenomena and applications, from microscopic particle diffusion in statistical physics to generative modeling in machine learning for categorical data such as semantic maps, text, speech, and quantized image codes. By construction, each step in multinomial diffusion applies a categorical transition kernel that interpolates between the current state and the uniform distribution, yielding analytically tractable posteriors and well-defined variational bounds for training neural denoisers, in contrast to the continuous Gaussian case. The process admits closed-form marginal and posterior distributions and exact ancestral sampling by Gumbel–Max, and it enables discrete inversion and controllable editing via approaches such as DICE. The framework spans both physical and computational models, including the Multinomial Diffusion Equation for particle-level simulation and coarse-to-fine latent generation for high-dimensional data.

1. Mathematical Construction and Transition Kernels

A multinomial diffusion process operates over a categorical space $\mathcal{X} = \{1,\dots,K\}^N$, where each dimension is a one-hot (or integer) encoding of a categorical variable. The forward process defines a $T$-step corruption chain,

$$q(x_{1:T}\mid x_0) = \prod_{t=1}^T q(x_t\mid x_{t-1})$$

where the transition kernel at step $t$ is

$$q(x_t\mid x_{t-1}) = \mathrm{Cat}\!\left(x_t;\ \pi^t = (1-\beta_t)\,x_{t-1} + \frac{\beta_t}{K}\mathbf{1}\right)$$

with $\beta_t\in(0,1)$ the noise schedule, $\alpha_t = 1-\beta_t$, and $\mathbf{1}$ the all-ones vector. At each step $t$, a token is retained with probability $\alpha_t$ or replaced by a uniformly sampled category with probability $\beta_t$ (Chen et al., 2023; Baas et al., 2022; Esser et al., 2021; Hoogeboom et al., 2021; He et al., 2024).

The cumulative marginal from $x_0$ to $x_t$ admits a closed form:

$$q(x_t\mid x_0) = \mathrm{Cat}\!\left(x_t;\ \bar\alpha_t\, x_0 + (1-\bar\alpha_t)\frac{1}{K}\mathbf{1}\right), \qquad \bar\alpha_t = \prod_{s=1}^t \alpha_s.$$

Thus the probability that a token survives uncorrupted through step $t$ is $\bar\alpha_t$; otherwise it is uniformly random. This forward process underlies both classical and modern multinomial diffusion models (Hoogeboom et al., 2021).
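Because the marginal is available in closed form, a corrupted sample at any step can be drawn in one shot, without simulating the chain step by step. A minimal NumPy sketch (function and schedule names are illustrative, not taken from the cited papers):

```python
import numpy as np

def forward_marginal_sample(x0, t, alpha_bar, K, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot via the closed-form marginal.

    Each token survives with probability alpha_bar[t]; otherwise it is
    replaced by a uniform draw from {0, ..., K-1}.
    """
    keep = rng.random(x0.shape) < alpha_bar[t]
    uniform = rng.integers(0, K, size=x0.shape)
    return np.where(keep, x0, uniform)

# Illustrative schedule: linear betas, cumulative retention alpha_bar.
rng = np.random.default_rng(0)
betas = np.linspace(1e-3, 0.2, 100)
alpha_bar = np.cumprod(1.0 - betas)
x0 = rng.integers(0, 10, size=2000)
x_early = forward_marginal_sample(x0, 0, alpha_bar, 10, rng)   # barely corrupted
x_late = forward_marginal_sample(x0, 99, alpha_bar, 10, rng)   # near-uniform
```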

2. Reverse Process and Posterior Computation

The reverse (denoising) process seeks to reconstruct uncorrupted data by learning a backward Markov chain $p_\theta(x_{t-1} \mid x_t) = \mathrm{Cat}\!\left(x_{t-1};\ p_{\mathrm{post}}(x_t, \hat x_0)\right)$, where the exact posterior given $x_0$ is

$$q(x_{t-1}\mid x_t, x_0) = \mathrm{Cat}\!\left(x_{t-1};\ \tilde p(k) \propto u_k v_k\right), \qquad u_k = \alpha_t [x_t]_k + \frac{1-\alpha_t}{K}, \qquad v_k = \bar\alpha_{t-1} [x_0]_k + \frac{1-\bar\alpha_{t-1}}{K}$$

and $p_{\mathrm{post}}$ is obtained by normalizing $u_k v_k$ over $k$ (Chen et al., 2023; Hoogeboom et al., 2021; He et al., 2024).

In practice, $x_0$ is unknown; thus, a neural network predicts $\hat x_0 = \mu_\theta(x_t, t)$, a softmax estimate. Generation proceeds via backward ancestral sampling: at each step, sample $x_{t-1} \sim \mathrm{Cat}(p_{\mathrm{post}}(x_t, \hat x_0))$, commonly using the Gumbel–Max trick for categorical sampling (Chen et al., 2023; He et al., 2024).
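The posterior formula and Gumbel–Max sampling above can be sketched in a few lines of NumPy (a hedged illustration; array shapes and function names are this sketch's own conventions):

```python
import numpy as np

def posterior_probs(x_t, x0_hat, t, alphas, alpha_bar, K):
    """Exact categorical posterior q(x_{t-1} | x_t, x0_hat), normalized over k.

    x_t: one-hot states, shape (N, K); x0_hat: softmax estimates, shape (N, K).
    """
    u = alphas[t] * x_t + (1.0 - alphas[t]) / K
    v = alpha_bar[t - 1] * x0_hat + (1.0 - alpha_bar[t - 1]) / K
    p = u * v
    return p / p.sum(axis=-1, keepdims=True)

def gumbel_max_sample(probs, rng, eps=1e-12):
    """Exact categorical sampling: argmax of log-probs plus Gumbel(0,1) noise."""
    g = rng.gumbel(size=probs.shape)
    return np.argmax(np.log(probs + eps) + g, axis=-1)

rng = np.random.default_rng(0)
alphas = np.array([0.9, 0.9])
alpha_bar = np.cumprod(alphas)
x_t = np.eye(3)[[2]]       # current state: category 2
x0_hat = np.eye(3)[[2]]    # (perfect) prediction of x_0
p = posterior_probs(x_t, x0_hat, 1, alphas, alpha_bar, 3)
```

With `x0_hat` produced by a denoiser network, one reverse step is simply `gumbel_max_sample(posterior_probs(...), rng)`.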

3. Training Objectives and Variational Bounds

Multinomial diffusion models are trained by minimizing a time-decomposed evidence lower bound (ELBO):

$$\mathcal{L}_{\mathrm{vlb}} = \mathbb{E}_{x_0 \sim q} \sum_{t=1}^T \mathrm{KL}\!\left(q(x_{t-1}\mid x_t, x_0)\,\middle\|\,p_\theta(x_{t-1}\mid x_t)\right)$$

where both arguments are categorical distributions, so each KL term reduces (up to a constant independent of $\theta$) to the cross-entropy between the true and predicted posteriors (Chen et al., 2023; Hoogeboom et al., 2021; Esser et al., 2021). An additional reconstruction term at $t=0$ ensures $\hat x_0$ matches the ground truth:

$$\mathcal{L}_0 = -\sum_k [x_0]_k \log [\hat x_0]_k$$
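Both loss terms are elementary to compute for categorical distributions; a minimal NumPy sketch (the clipping constant `eps` is a numerical-stability convention of this sketch):

```python
import numpy as np

def categorical_kl(q, p, eps=1e-12):
    """KL(q || p) per row for batches of categorical distributions, shape (N, K)."""
    q = np.clip(q, eps, 1.0)
    p = np.clip(p, eps, 1.0)
    return np.sum(q * (np.log(q) - np.log(p)), axis=-1)

def reconstruction_loss(x0_onehot, x0_hat, eps=1e-12):
    """L_0 = -sum_k [x0]_k log [x0_hat]_k, the t = 0 cross-entropy term."""
    return -np.sum(x0_onehot * np.log(x0_hat + eps), axis=-1)
```

In a training loop, `q` would be the exact posterior $q(x_{t-1}\mid x_t, x_0)$ and `p` the model's predicted posterior.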

Noise schedules (cosine or position-dependent variants) and importance sampling over $t$ are employed to stabilize training; all computations leverage log-space numerics for categorical probabilities (Hoogeboom et al., 2021).
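For concreteness, one common parameterization of a cosine cumulative schedule (this follows the widely used squared-cosine convention with offset `s`; it is an assumption of this sketch, not quoted from the cited papers):

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative retention probabilities alpha_bar_t on a cosine schedule.

    Returns an array of length T+1 that decreases from 1 toward 0,
    so tokens are corrupted gradually at first and heavily near t = T.
    """
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]
```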

4. Discrete Inversion and Editing in Categorical Space

Discrete inversion (DICE) records categorical noise residuals during forward corruption and re-injects them at generation time. The inversion follows the actual noise sequence by Gumbel–Max reparameterization,

$$x_t = \arg\max\!\left(\log\!\left(\overline{Q}_t\, v(x_0)\right) + g\right)$$

with $g\sim \mathrm{Gumbel}(0,1)$, then logs the difference between the sampled logits and the model logits as $z_t = y_{t-1} - \hat y_{t-1}$ (He et al., 2024).

At editing time, the injected residuals and controlled Gumbel noise allow smooth transitions between pure reconstruction and creative synthesis, modulated by parameters $\lambda_1, \lambda_2$. No masks or continuous relaxations are needed; all information is preserved in discrete logit corrections (He et al., 2024).
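A toy illustration of the reconstruction property that discrete inversion relies on, not the full DICE algorithm: reusing recorded Gumbel noise with the same logits reproduces a sample deterministically, while blending in fresh noise (the weight `lam` below is purely illustrative and stands in loosely for the $\lambda$-style controls; DICE's actual mechanism injects logit residuals $z_t$):

```python
import numpy as np

def gumbel_sample(logits, g):
    """Deterministic given (logits, g): argmax of Gumbel-perturbed logits."""
    return np.argmax(logits + g, axis=-1)

rng = np.random.default_rng(1)
logits = rng.normal(size=(8, 10))
g_rec = rng.gumbel(size=logits.shape)   # noise recorded during inversion

x = gumbel_sample(logits, g_rec)
# Re-injecting the identical recorded noise reproduces x exactly.
assert np.array_equal(gumbel_sample(logits, g_rec), x)

# Blending recorded and fresh noise interpolates between reconstruction
# (lam = 1) and unconstrained resampling (lam = 0).
lam = 0.7
g_mix = lam * g_rec + (1 - lam) * rng.gumbel(size=logits.shape)
x_edit = gumbel_sample(logits, g_mix)
```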

5. Physical Multinomial Diffusion Equation

In physical systems, the Multinomial Diffusion Equation (MDE) is a microscopic model of particle-level diffusion: space is discretized into voxels containing $N_i(t)$ particles, each of which can hop left or right with probability $\kappa = D\Delta t/\Delta x^2$ per time step. Updates are governed by multinomial draws:

$$N_i(t+\Delta t) = N_i(t) + L_{i+1}^t - R_i^t - L_i^t + R_{i-1}^t$$

with $(L_i^t,\ R_i^t,\ N_i - L_i^t - R_i^t)$ multinomially distributed with probabilities $(\kappa, \kappa, 1-2\kappa)$ (Balter et al., 2010).

Under suitable scaling ($N_i \gg 1$, $\kappa \ll 1$), the MDE converges in law to the classical stochastic diffusion PDE, but it remains accurate even in low-density regimes where the PDE fails. Simulation comparisons verify precise mass conservation and correct fluctuation statistics, in contrast to the SDE at low $N_i$ (Balter et al., 2010).
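The MDE update can be simulated directly; a minimal sketch on a 1D grid with periodic boundaries (the boundary choice is an assumption of this sketch), which conserves total particle count exactly by construction:

```python
import numpy as np

def mde_step(N, kappa, rng):
    """One MDE step: per voxel, (L_i, R_i, stay_i) ~ Multinomial(N_i; k, k, 1-2k).

    Requires kappa <= 0.5 so the stay probability is non-negative.
    Periodic boundaries are implemented with np.roll.
    """
    moves = np.stack([rng.multinomial(n, [kappa, kappa, 1.0 - 2.0 * kappa])
                      for n in N])
    L, R = moves[:, 0], moves[:, 1]
    # N_i(t + dt) = N_i + L_{i+1} - R_i - L_i + R_{i-1}
    return N - L - R + np.roll(L, -1) + np.roll(R, 1)

rng = np.random.default_rng(0)
N = rng.integers(0, 50, size=64)
total = N.sum()
for _ in range(200):
    N = mde_step(N, 0.2, rng)
```

After any number of steps the total mass is unchanged and every voxel count stays non-negative, unlike a naive SDE discretization at low density.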

6. Multinomial Diffusion in Generative Modeling

Semantic Inpainting

In semantic map inpainting, multinomial diffusion fills missing regions by conditioning on observed data. The Look-Back Condition (LB-Con) merges known and unknown regions at every step, enforcing bidirectional consistency via forward–reverse cycles (Chen et al., 2023).

Speech and Text Synthesis

TransFusion trains a multinomial diffusion to denoise random symbol sequences into valid transcriptions of speech, using classifier-free guidance and advanced sampling (resampling, progressive noise) for alignment (Baas et al., 2022). Discrete diffusion exhibits robustness to noise and avoids mode collapse found in continuous text models.

Image Generation

ImageBART employs multinomial diffusion in discrete latent space, synthesized via coarse-to-fine autoregressive transformers. At sampling, global context is provided at each scale, overcoming the unidirectional bias of classic AR models. Multinomial kernels guarantee tractable posteriors and discrete consistency, supporting mask-free inpainting and local image edits (Esser et al., 2021).

Categorical Data Modeling

The multinomial diffusion process yields exact ELBO-based generative models for categorical data, directly optimizing cross-entropy with no need for continuous relaxations. Key tricks include cosine noise schedules, importance sampling, and log-space numerics (Hoogeboom et al., 2021).

7. Connections to Markov Velocity Chains and Generalizations

Path diffusion models, with Markovian velocity processes on discrete grids, instantiate multinomial diffusion as coupled binomial or multinomial chains over possible velocity states. The continuum limit recovers the damped Telegraph and Klein–Gordon equations, with fine grids yielding hyperbolic-function kernels. Generalizations to multi-dimensional velocity spaces under specific rate matrices allow the mean motion to obey Newton's law, providing links between random walks and PDEs governing physical transport phenomena (Beumee et al., 2014).

Table: Forward/Reverse Kernel Formulations

| Model/Context | Forward Kernel | Reverse Posterior (Exact) |
|---|---|---|
| Generative modeling | $\mathrm{Cat}\!\left((1-\beta_t)\,x_{t-1} + \frac{\beta_t}{K}\mathbf{1}\right)$ | $\mathrm{Cat}\!\left(u_k v_k / \sum_j u_j v_j\right)$ |
| Physical MDE | $\mathrm{Multinomial}(N_i;\ \kappa, \kappa, 1-2\kappa)$ | — |
| Markov velocity chain | Matrix kernel with reversal rates $(\alpha, \beta)$ | — |

Applications and Limitations

Multinomial diffusion is employed in generative modeling of semantic data, speech/text, and discrete latent image codes; physical simulation of low-density diffusion; controllable editing in categorical spaces; and as a theoretical bridge between discrete Markov chains and PDE limits. Its strengths are analytic tractability, pure categorical consistency, mass conservation in particle models, and the elimination of continuous-surrogate biases. Its main limitations are slow inference (large $T$), the requirement of tractable categorical sampling, and, in physical models, step-size constraints for mass conservation and numerical stability.

References

  • Semantic map inpainting: "SePaint: Semantic Map Inpainting via Multinomial Diffusion" (Chen et al., 2023)
  • Speech recognition: "TransFusion: Transcribing Speech with Multinomial Diffusion" (Baas et al., 2022)
  • Image generation: "ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis" (Esser et al., 2021)
  • Theory and categorical data applications: "Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions" (Hoogeboom et al., 2021)
  • Discrete inversion and editing: "DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models" (He et al., 2024)
  • Microscopic physical model: "Multinomial Diffusion Equation" (Balter et al., 2010)
  • Markov velocity process and hyperbolic PDE connection: "Path Diffusion, Part I" (Beumee et al., 2014)
