
Multinomial Diffusion Process

Updated 21 January 2026
  • Multinomial Diffusion Process is a stochastic discrete-state Markov chain that gradually corrupts categorical data via scheduled resampling.
  • It features analytically tractable marginal and posterior distributions with methods like Gumbel–Max sampling and discrete inversion (DICE) for controlled editing.
  • Widely applied in generative modeling, semantic inpainting, text/speech synthesis, and particle-level simulations, it bridges discrete diffusion with continuous PDE models.

A multinomial diffusion process is a stochastic discrete-state Markov chain defined over categorical data, in which information is gradually corrupted by randomly resampling each token, symbol, or state according to a schedule of probabilities. This formulation encompasses a range of phenomena and applications, from microscopic particle diffusion in statistical physics to generative modeling in machine learning for categorical data such as semantic maps, text, speech, and quantized image codes. By construction, each step in multinomial diffusion applies a categorical transition kernel that interpolates between the current state and the uniform distribution, yielding analytically tractable posteriors and well-defined variational bounds for training neural denoisers, in contrast to the continuous Gaussian case. The process admits closed-form marginal and posterior distributions and exact ancestral sampling by Gumbel–Max, and it enables discrete inversion and controllable editing via approaches such as DICE. The framework spans both physical and computational models, including the Multinomial Diffusion Equation for particle-level simulation and coarse-to-fine latent generation for high-dimensional data.

1. Mathematical Construction and Transition Kernels

A multinomial diffusion process operates over a categorical space $\mathcal{X} = \{1,\dots,K\}^N$, where each dimension is a one-hot (or integer) encoding of a categorical variable. The forward process defines a $T$-step corruption chain,

$$q(x_{1:T}\mid x_0) = \prod_{t=1}^T q(x_t\mid x_{t-1})$$

where the transition kernel at step $t$ is

$$q(x_t\mid x_{t-1}) = \mathrm{Cat}\!\left(x_t;\ \pi^t = (1-\beta_t)\,x_{t-1} + \frac{\beta_t}{K}\mathbf{1}\right)$$

with $\beta_t\in(0,1)$ the noise schedule, $\alpha_t = 1-\beta_t$, and $\mathbf{1}$ the all-ones vector. At each step $t$, a token is retained with probability $\alpha_t$ or replaced by a uniformly sampled category with probability $\beta_t$ (Chen et al., 2023; Baas et al., 2022; Esser et al., 2021; Hoogeboom et al., 2021; He et al., 2024).

The cumulative marginal from $x_0$ to $x_t$ admits a closed form:

$$q(x_t\mid x_0) = \mathrm{Cat}\!\left(x_t;\ \bar\alpha_t\, x_0 + (1-\bar\alpha_t)\frac{1}{K}\mathbf{1}\right), \qquad \bar\alpha_t = \prod_{s=1}^t \alpha_s.$$

Thus the probability that a token survives uncorrupted through step $t$ is $\bar\alpha_t$; otherwise it is uniformly random. This forward process underlies both classical and modern multinomial diffusion models (Hoogeboom et al., 2021).
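Because the marginal is available in closed form, a corrupted sample at any step can be drawn in one shot, without simulating the chain step by step. A minimal NumPy sketch (function and schedule names are illustrative, not taken from the cited papers):

```python
import numpy as np

def forward_marginal_sample(x0, t, alpha_bar, K, rng):
    """Sample x_t ~ q(x_t | x_0) in one shot via the closed-form marginal.

    Each token survives with probability alpha_bar[t]; otherwise it is
    replaced by a uniform draw from {0, ..., K-1}.
    """
    keep = rng.random(x0.shape) < alpha_bar[t]
    uniform = rng.integers(0, K, size=x0.shape)
    return np.where(keep, x0, uniform)

# Illustrative schedule: linear betas, cumulative retention alpha_bar.
rng = np.random.default_rng(0)
betas = np.linspace(1e-3, 0.2, 100)
alpha_bar = np.cumprod(1.0 - betas)
x0 = rng.integers(0, 10, size=2000)
x_early = forward_marginal_sample(x0, 0, alpha_bar, 10, rng)   # barely corrupted
x_late = forward_marginal_sample(x0, 99, alpha_bar, 10, rng)   # near-uniform
```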

2. Reverse Process and Posterior Computation

The reverse (denoising) process seeks to reconstruct uncorrupted data by learning a backward Markov chain $p_\theta(x_{t-1} \mid x_t) = \mathrm{Cat}\!\left(x_{t-1};\ p_{\mathrm{post}}(x_t, \hat x_0)\right)$, where the exact posterior given $x_0$ is

$$q(x_{t-1}\mid x_t, x_0) = \mathrm{Cat}\!\left(x_{t-1};\ \tilde p(k) \propto u_k v_k\right), \qquad u_k = \alpha_t [x_t]_k + \frac{1-\alpha_t}{K}, \qquad v_k = \bar\alpha_{t-1} [x_0]_k + \frac{1-\bar\alpha_{t-1}}{K}$$

and $p_{\mathrm{post}}$ is obtained by normalizing $u_k v_k$ over $k$ (Chen et al., 2023; Hoogeboom et al., 2021; He et al., 2024).

In practice, $x_0$ is unknown; thus, a neural network predicts $\hat x_0 = \mu_\theta(x_t, t)$, a softmax estimate. Generation proceeds via backward ancestral sampling: at each step, sample $x_{t-1} \sim \mathrm{Cat}(p_{\mathrm{post}}(x_t, \hat x_0))$, commonly using the Gumbel–Max trick for categorical sampling (Chen et al., 2023; He et al., 2024).
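The posterior formula and Gumbel–Max sampling above can be sketched in a few lines of NumPy (a hedged illustration; array shapes and function names are this sketch's own conventions):

```python
import numpy as np

def posterior_probs(x_t, x0_hat, t, alphas, alpha_bar, K):
    """Exact categorical posterior q(x_{t-1} | x_t, x0_hat), normalized over k.

    x_t: one-hot states, shape (N, K); x0_hat: softmax estimates, shape (N, K).
    """
    u = alphas[t] * x_t + (1.0 - alphas[t]) / K
    v = alpha_bar[t - 1] * x0_hat + (1.0 - alpha_bar[t - 1]) / K
    p = u * v
    return p / p.sum(axis=-1, keepdims=True)

def gumbel_max_sample(probs, rng, eps=1e-12):
    """Exact categorical sampling: argmax of log-probs plus Gumbel(0,1) noise."""
    g = rng.gumbel(size=probs.shape)
    return np.argmax(np.log(probs + eps) + g, axis=-1)

rng = np.random.default_rng(0)
alphas = np.array([0.9, 0.9])
alpha_bar = np.cumprod(alphas)
x_t = np.eye(3)[[2]]       # current state: category 2
x0_hat = np.eye(3)[[2]]    # (perfect) prediction of x_0
p = posterior_probs(x_t, x0_hat, 1, alphas, alpha_bar, 3)
```

With `x0_hat` produced by a denoiser network, one reverse step is simply `gumbel_max_sample(posterior_probs(...), rng)`.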

3. Training Objectives and Variational Bounds

Multinomial diffusion models are trained by minimizing a time-decomposed evidence lower bound (ELBO):

$$\mathcal{L}_{\mathrm{vlb}} = \mathbb{E}_{x_0 \sim q} \sum_{t=1}^T \mathrm{KL}\!\left(q(x_{t-1}\mid x_t, x_0)\,\middle\|\,p_\theta(x_{t-1}\mid x_t)\right)$$

where both arguments are categorical distributions, so each KL term reduces (up to a constant independent of $\theta$) to the cross-entropy between the true and predicted posteriors (Chen et al., 2023; Hoogeboom et al., 2021; Esser et al., 2021). An additional reconstruction term at $t=0$ ensures $\hat x_0$ matches the ground truth:

$$\mathcal{L}_0 = -\sum_k [x_0]_k \log [\hat x_0]_k$$
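Both loss terms are elementary to compute for categorical distributions; a minimal NumPy sketch (the clipping constant `eps` is a numerical-stability convention of this sketch):

```python
import numpy as np

def categorical_kl(q, p, eps=1e-12):
    """KL(q || p) per row for batches of categorical distributions, shape (N, K)."""
    q = np.clip(q, eps, 1.0)
    p = np.clip(p, eps, 1.0)
    return np.sum(q * (np.log(q) - np.log(p)), axis=-1)

def reconstruction_loss(x0_onehot, x0_hat, eps=1e-12):
    """L_0 = -sum_k [x0]_k log [x0_hat]_k, the t = 0 cross-entropy term."""
    return -np.sum(x0_onehot * np.log(x0_hat + eps), axis=-1)
```

In a training loop, `q` would be the exact posterior $q(x_{t-1}\mid x_t, x_0)$ and `p` the model's predicted posterior.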

Noise schedules (cosine or position-dependent variants) and importance sampling over $t$ are employed to stabilize training; all computations leverage log-space numerics for categorical probabilities (Hoogeboom et al., 2021).
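For concreteness, one common parameterization of a cosine cumulative schedule (this follows the widely used squared-cosine convention with offset `s`; it is an assumption of this sketch, not quoted from the cited papers):

```python
import numpy as np

def cosine_alpha_bar(T, s=0.008):
    """Cumulative retention probabilities alpha_bar_t on a cosine schedule.

    Returns an array of length T+1 that decreases from 1 toward 0,
    so tokens are corrupted gradually at first and heavily near t = T.
    """
    t = np.arange(T + 1) / T
    f = np.cos((t + s) / (1 + s) * np.pi / 2) ** 2
    return f / f[0]
```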

4. Discrete Inversion and Editing in Categorical Space

Discrete inversion (DICE) records categorical noise residuals during forward corruption and re-injects them at generation time. The inversion follows the actual noise sequence by Gumbel–Max reparameterization,

$$x_t = \arg\max\!\left(\log\!\left(\overline{Q}_t\, v(x_0)\right) + g\right)$$

with $g\sim \mathrm{Gumbel}(0,1)$, then logs the difference between the sampled logits and the model logits as $z_t = y_{t-1} - \hat y_{t-1}$ (He et al., 2024).

At editing time, the injected residuals and controlled Gumbel noise allow smooth transitions between pure reconstruction and creative synthesis, modulated by parameters $\lambda_1, \lambda_2$. No masks or continuous relaxations are needed; all information is preserved in discrete logit corrections (He et al., 2024).
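A toy illustration of the reconstruction property that discrete inversion relies on, not the full DICE algorithm: reusing recorded Gumbel noise with the same logits reproduces a sample deterministically, while blending in fresh noise (the weight `lam` below is purely illustrative and stands in loosely for the $\lambda$-style controls; DICE's actual mechanism injects logit residuals $z_t$):

```python
import numpy as np

def gumbel_sample(logits, g):
    """Deterministic given (logits, g): argmax of Gumbel-perturbed logits."""
    return np.argmax(logits + g, axis=-1)

rng = np.random.default_rng(1)
logits = rng.normal(size=(8, 10))
g_rec = rng.gumbel(size=logits.shape)   # noise recorded during inversion

x = gumbel_sample(logits, g_rec)
# Re-injecting the identical recorded noise reproduces x exactly.
assert np.array_equal(gumbel_sample(logits, g_rec), x)

# Blending recorded and fresh noise interpolates between reconstruction
# (lam = 1) and unconstrained resampling (lam = 0).
lam = 0.7
g_mix = lam * g_rec + (1 - lam) * rng.gumbel(size=logits.shape)
x_edit = gumbel_sample(logits, g_mix)
```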

5. Physical Multinomial Diffusion Equation

In physical systems, the Multinomial Diffusion Equation (MDE) is a microscopic model of particle-level diffusion: space is discretized into voxels containing $N_i(t)$ particles, each of which can hop left or right with probability $\kappa = D\Delta t/\Delta x^2$ per time step. Updates are governed by multinomial draws:

$$N_i(t+\Delta t) = N_i(t) + L_{i+1}^t - R_i^t - L_i^t + R_{i-1}^t$$

with $(L_i^t,\ R_i^t,\ N_i - L_i^t - R_i^t)$ multinomially distributed with probabilities $(\kappa, \kappa, 1-2\kappa)$ (Balter et al., 2010).

Under suitable scaling ($N_i \gg 1$, $\kappa \ll 1$), the MDE converges in law to the classical stochastic diffusion PDE, but it remains accurate even in low-density regimes where the PDE fails. Simulation comparisons verify precise mass conservation and correct fluctuation statistics, in contrast to the SDE at low $N_i$ (Balter et al., 2010).
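The MDE update can be simulated directly; a minimal sketch on a 1D grid with periodic boundaries (the boundary choice is an assumption of this sketch), which conserves total particle count exactly by construction:

```python
import numpy as np

def mde_step(N, kappa, rng):
    """One MDE step: per voxel, (L_i, R_i, stay_i) ~ Multinomial(N_i; k, k, 1-2k).

    Requires kappa <= 0.5 so the stay probability is non-negative.
    Periodic boundaries are implemented with np.roll.
    """
    moves = np.stack([rng.multinomial(n, [kappa, kappa, 1.0 - 2.0 * kappa])
                      for n in N])
    L, R = moves[:, 0], moves[:, 1]
    # N_i(t + dt) = N_i + L_{i+1} - R_i - L_i + R_{i-1}
    return N - L - R + np.roll(L, -1) + np.roll(R, 1)

rng = np.random.default_rng(0)
N = rng.integers(0, 50, size=64)
total = N.sum()
for _ in range(200):
    N = mde_step(N, 0.2, rng)
```

After any number of steps the total mass is unchanged and every voxel count stays non-negative, unlike a naive SDE discretization at low density.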

6. Multinomial Diffusion in Generative Modeling

Semantic Inpainting

In semantic map inpainting, multinomial diffusion fills missing regions by conditioning on observed data. The Look-Back Condition (LB-Con) merges known and unknown regions at every step, enforcing bidirectional consistency via forward–reverse cycles (Chen et al., 2023).

Speech and Text Synthesis

TransFusion trains a multinomial diffusion to denoise random symbol sequences into valid transcriptions of speech, using classifier-free guidance and advanced sampling (resampling, progressive noise) for alignment (Baas et al., 2022). Discrete diffusion exhibits robustness to noise and avoids mode collapse found in continuous text models.

Image Generation

ImageBART employs multinomial diffusion in discrete latent space, synthesized via coarse-to-fine autoregressive transformers. At sampling, global context is provided at each scale, overcoming the unidirectional bias of classic AR models. Multinomial kernels guarantee tractable posteriors and discrete consistency, supporting mask-free inpainting and local image edits (Esser et al., 2021).

Categorical Data Modeling

The multinomial diffusion process yields exact ELBO-based generative models for categorical data, directly optimizing cross-entropy with no need for continuous relaxations. Key tricks include cosine noise schedules, importance sampling, and log-space numerics (Hoogeboom et al., 2021).

7. Connections to Markov Velocity Chains and Generalizations

Path diffusion models, with Markovian velocity processes on discrete grids, instantiate multinomial diffusion as coupled binomial or multinomial chains over possible velocity states. The continuum limit recovers the damped Telegraph and Klein–Gordon equations, with fine grids yielding hyperbolic-function kernels. Generalizations to multi-dimensional velocity spaces under specific rate matrices allow the mean motion to obey Newton's law, providing links between random walks and PDEs governing physical transport phenomena (Beumee et al., 2014).

Table: Forward/Reverse Kernel Formulations

| Model/Context | Forward Kernel | Reverse Posterior (Exact) |
|---|---|---|
| Generative modeling | $\mathrm{Cat}\!\left((1-\beta_t)\,x_{t-1} + \frac{\beta_t}{K}\mathbf{1}\right)$ | $\mathrm{Cat}\!\left(u_k v_k / \sum_j u_j v_j\right)$ |
| Physical MDE | $\mathrm{Multinomial}(N_i;\ \kappa, \kappa, 1-2\kappa)$ | — |
| Markov velocity chain | Matrix kernel with reversal rates $(\alpha, \beta)$ | — |

Applications and Limitations

Multinomial diffusion is employed in generative modeling of semantic data, speech/text, and discrete latent image codes; physical simulation of low-density diffusion; controllable editing in categorical spaces; and as a theoretical bridge between discrete Markov chains and PDE limits. Its strengths are analytic tractability, pure categorical consistency, mass conservation in particle models, and the elimination of continuous-surrogate biases. Its main limitations are slow inference (large $T$), the requirement of tractable categorical sampling, and, in physical models, step-size constraints for mass conservation and numerical stability.

References

  • Semantic map inpainting: "SePaint: Semantic Map Inpainting via Multinomial Diffusion" (Chen et al., 2023)
  • Speech recognition: "TransFusion: Transcribing Speech with Multinomial Diffusion" (Baas et al., 2022)
  • Image generation: "ImageBART: Bidirectional Context with Multinomial Diffusion for Autoregressive Image Synthesis" (Esser et al., 2021)
  • Theory and categorical data applications: "Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions" (Hoogeboom et al., 2021)
  • Discrete inversion and editing: "DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models" (He et al., 2024)
  • Microscopic physical model: "Multinomial Diffusion Equation" (Balter et al., 2010)
  • Markov velocity process and hyperbolic PDE connection: "Path Diffusion, Part I" (Beumee et al., 2014)
