
Gaussian–Softmax Diffusion

Updated 20 November 2025
  • Gaussian–Softmax diffusion is a framework that relaxes categorical data by injecting Gaussian noise into its logit representation and mapping it to the probability simplex via the softmax function.
  • It employs an Ornstein–Uhlenbeck process in logit space and uses an analytic logistic-normal distribution to enable tractable score matching and stable reverse dynamics.
  • The method has demonstrated practical gains in tasks like CAD sketch generation and uncertainty estimation, outperforming traditional discrete diffusion approaches in key metrics.

Gaussian–Softmax diffusion is a generative modeling framework for discrete and mixed discrete-continuous data, in which Gaussian noise is injected into the logit representation of one-hot (categorical) variables and mapped onto the probability simplex by the softmax function. This process yields a continuous relaxation of categorical distributions, enabling the use of diffusion-based models for tasks where the data are inherently discrete yet benefit from continuous geometric modeling, such as classification, uncertainty estimation, or generative design. The technique is closely related to score-based generative modeling, the geometry of the probability simplex, and the logistic-normal distribution, and has been developed and applied in contexts including diffusion on the probability simplex (Floto et al., 2023), uncertainty estimation in neural networks (Lu et al., 2020), and CAD sketch generation (Chereddy et al., 15 Jul 2025).

1. Principles of Gaussian–Softmax Diffusion

The central construction begins by lifting discrete data into a continuous space via the logit (pre-softmax) representation. A continuous-time Ornstein–Uhlenbeck (OU) process is applied in logit space:

$$dz_t = -\theta z_t\,dt + \sigma\,dW_t$$

for $z_t \in \mathbb{R}^d$, $t \in [0,T]$, where $\theta, \sigma > 0$. The solution at time $t$ is

$$z_t \sim \mathcal{N}\big(\alpha(t)\,z_0,\ \beta^2(t)\,I\big), \qquad \alpha(t) = e^{-\theta t}, \quad \beta^2(t) = \frac{1 - e^{-2\theta t}}{2\theta}.$$

The noisy logit vector $z_t$ is then mapped to the $(d-1)$-simplex by

$$x_t = \mathrm{softmax}(z_t).$$

This pushes the (potentially one-hot) initial data through a process that, at intermediate times, produces “superpositions” (i.e., dense vectors in the simplex), while at $t=0$ the original categorical values are recovered.
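
Because the OU marginal is available in closed form, $x_t$ can be drawn directly from $x_0$ without simulating the SDE step by step. The NumPy sketch below illustrates this; the function names and the small label-smoothing constant (used to keep $\log x_0$ finite) are illustrative assumptions rather than settings from the cited papers.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def sample_xt(x0_onehot, t, theta=1.0, sigma=1.0, smoothing=1e-4, rng=None):
    """Draw x_t = softmax(z_t) from the closed-form OU marginal in logit space.

    With sigma = 1 the noise variance matches beta^2(t) = (1 - e^{-2 theta t}) / (2 theta)
    as written in the text; the smoothing constant keeps log(x0) finite for one-hot x0.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = x0_onehot.shape[-1]
    x0 = (1.0 - smoothing) * x0_onehot + smoothing / d   # avoid log(0) on one-hot inputs
    z0 = np.log(x0)                                      # logit (pre-softmax) representation
    alpha = np.exp(-theta * t)                           # mean contraction alpha(t)
    beta2 = sigma**2 * (1.0 - np.exp(-2.0 * theta * t)) / (2.0 * theta)
    zt = alpha * z0 + np.sqrt(beta2) * rng.standard_normal(x0_onehot.shape)
    return softmax(zt)

# A 5-class one-hot vector relaxes toward a dense point on the simplex as t grows.
x0 = np.eye(5)[2]
for t in (0.01, 0.5, 2.0):
    print(t, np.round(sample_xt(x0, t, rng=np.random.default_rng(0)), 3))
```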

The marginal distribution of $x_t$ given $x_0$ is the logistic-normal:

$$p_t(x \mid x_0) = (2\pi)^{-(d-1)/2}\, |\Sigma(t)|^{-1/2} \prod_{i=1}^d x_i^{-1} \exp\!\left[-\tfrac{1}{2}\,\big(u - \mu(t)\big)^\top \Sigma(t)^{-1}\big(u - \mu(t)\big)\right], \qquad u = \big(\log(x_1/x_d), \ldots, \log(x_{d-1}/x_d)\big)^\top,$$

for $\Sigma(t) = \beta^2(t)\,I$ and logit-mean $\mu(t) = \mathrm{logit}(x_0)\, e^{-\theta t}$ (Floto et al., 2023).
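
For reference, the density above can be evaluated directly in these log-ratio coordinates. The sketch below transcribes the formula as written, assuming the isotropic covariance $\Sigma(t) = \beta^2(t) I$; the helper name and smoothing constant are illustrative assumptions.

```python
import numpy as np

def logistic_normal_logpdf(x, x0, t, theta=1.0, smoothing=1e-4):
    """log p_t(x | x0) for the logistic-normal marginal, as written above.

    Uses log-ratio coordinates u_i = log(x_i / x_d), i = 1..d-1, with mean
    mu(t) = exp(-theta * t) * logit(x0) and covariance Sigma(t) = beta^2(t) I.
    """
    d = x.shape[-1]
    x0 = (1.0 - smoothing) * x0 + smoothing / d          # keep logit(x0) finite
    beta2 = (1.0 - np.exp(-2.0 * theta * t)) / (2.0 * theta)
    u = np.log(x[:-1] / x[-1])                           # log-ratio coordinates of x
    mu = np.exp(-theta * t) * np.log(x0[:-1] / x0[-1])   # contracted log-ratio mean
    quad = np.sum((u - mu) ** 2) / beta2                 # isotropic quadratic form
    log_norm = -0.5 * (d - 1) * np.log(2.0 * np.pi * beta2)
    log_jac = -np.sum(np.log(x))                         # prod_i x_i^{-1} factor
    return log_norm + log_jac - 0.5 * quad

# Example: density of a dense point x given a (smoothed) one-hot x0 at t = 0.5.
print(logistic_normal_logpdf(np.array([0.7, 0.2, 0.1]), np.eye(3)[0], t=0.5))
```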

2. Forward and Reverse Dynamics

The forward process consists of iteratively compounding Gaussian noising in logit space followed by projection onto the simplex. For discrete modeling tasks, such as in SketchDNN, the forward step for a one-hot vector $y_t$ is:

$$y_{t+1} = \mathrm{softmax}\!\left( \sqrt{\alpha_{t+1}}\,\log y_t + \sqrt{1-\alpha_{t+1}}\,\epsilon_x \right), \qquad \epsilon_x \sim \mathcal{N}(0, I),$$

with cumulative form

$$y_t = \mathrm{softmax}\!\left(\sqrt{\bar{\alpha}_t}\,\log y_0' + \sqrt{1-\bar{\alpha}_t}\,\epsilon\right),$$

where $y_0'$ is label-smoothed to avoid zeros (Chereddy et al., 15 Jul 2025).
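
A minimal NumPy sketch of this cumulative forward step follows; the smoothing value is an illustrative assumption rather than the exact SketchDNN setting.

```python
import numpy as np

def discrete_forward(y0_onehot, alpha_bar_t, smoothing=1e-3, rng=None):
    """Cumulative Gaussian-softmax noising of a one-hot label y0.

    y0 is label-smoothed so log(y0') is finite, noised in logit space with the
    cumulative coefficient alpha_bar_t, then projected back to the simplex.
    """
    rng = np.random.default_rng() if rng is None else rng
    d = y0_onehot.shape[-1]
    y0_s = (1.0 - smoothing) * y0_onehot + smoothing / d        # label smoothing
    eps = rng.standard_normal(y0_onehot.shape)
    z = np.sqrt(alpha_bar_t) * np.log(y0_s) + np.sqrt(1.0 - alpha_bar_t) * eps
    z -= z.max(axis=-1, keepdims=True)                          # stable softmax
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Example: noising a 4-class one-hot label at decreasing alpha_bar levels.
y0 = np.eye(4)[1]
for ab in (0.99, 0.5, 0.05):
    print(ab, np.round(discrete_forward(y0, ab, rng=np.random.default_rng(0)), 3))
```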

The reverse process (sampling/generation) uses an analytically derived posterior, maintaining the same form as the forward process but conditioning on both $y_t$ and the network's prediction of the original (“clean”) $y_0$:

$$\mu_{t-1} = \frac{\sqrt{\alpha_t}\,(1-\bar{\alpha}_{t-1})\,\log y_t' + \sqrt{\bar{\alpha}_{t-1}}\,(1-\alpha_t)\,\log y_0'}{1-\bar{\alpha}_t}$$

followed by

$$y_{t-1} = \mathrm{softmax}\!\left( \mu_{t-1} + \sigma_{t-1}\,\epsilon \right), \qquad \sigma_{t-1}^2 = \frac{(1-\alpha_t)(1-\bar{\alpha}_{t-1})}{1-\bar{\alpha}_t},$$

where each component is evaluated in logit-space (Chereddy et al., 15 Jul 2025).
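
Put together, a single reverse step can be sketched as follows (NumPy; `y0_pred` stands for the network's estimate of the clean label, and the smoothing constant is an illustrative assumption):

```python
import numpy as np

def discrete_reverse_step(y_t, y0_pred, alpha_t, alpha_bar_t, alpha_bar_prev,
                          smoothing=1e-3, rng=None):
    """One reverse step of the discrete chain, using the posterior mean and
    variance given above; all arithmetic is done on (smoothed) logits."""
    rng = np.random.default_rng() if rng is None else rng
    d = y_t.shape[-1]
    yt_s = (1.0 - smoothing) * y_t + smoothing / d        # keep logs finite
    y0_s = (1.0 - smoothing) * y0_pred + smoothing / d
    mu = (np.sqrt(alpha_t) * (1.0 - alpha_bar_prev) * np.log(yt_s)
          + np.sqrt(alpha_bar_prev) * (1.0 - alpha_t) * np.log(y0_s)) / (1.0 - alpha_bar_t)
    var = (1.0 - alpha_t) * (1.0 - alpha_bar_prev) / (1.0 - alpha_bar_t)
    z = mu + np.sqrt(var) * rng.standard_normal(y_t.shape)
    z -= z.max(axis=-1, keepdims=True)                    # stable softmax projection
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# One step from t to t-1 with illustrative schedule values.
y_t = np.array([0.3, 0.4, 0.2, 0.1])
y0_hat = np.array([0.05, 0.85, 0.05, 0.05])   # network's prediction of the clean label
print(discrete_reverse_step(y_t, y0_hat, alpha_t=0.95, alpha_bar_t=0.5, alpha_bar_prev=0.5 / 0.95))
```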

3. Score-Based Training and Practical Discretization

Gaussian–Softmax diffusion admits an exact transition kernel, enabling tractable score-matching:

$$\mathcal{L}(\theta) = \mathbb{E}_{t,\,x_0,\,x_t}\, \lambda(t)\, \left\| G(x_t, t)^\top s_\theta(x_t, t) - G(x_t, t)^\top \nabla_x \log p_t(x_t \mid x_0) \right\|^2$$

Here $s_\theta$ is the learned score, $G$ is the simplex-compatible diffusion matrix, and the true score is derived from the analytic logistic-normal. Prediction and training are implemented either via constrained SDEs or by direct optimization of denoising objectives involving cross-entropy for discrete variables and MSE for continuous variables (Floto et al., 2023, Chereddy et al., 15 Jul 2025).
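
The second (denoising) route is straightforward to sketch: a hybrid objective mixing cross-entropy on the predicted labels with MSE on the predicted continuous parameters. Shapes, names, and weights below are illustrative assumptions, not the exact SketchDNN loss.

```python
import numpy as np

def hybrid_denoising_loss(y0_true, y0_pred, g0_true, g0_pred, w_ce=1.0, w_mse=1.0):
    """Denoising objective for mixed data: cross-entropy for categorical parts,
    MSE for continuous parts.

    y0_true: (N, d) one-hot labels;     y0_pred: (N, d) predicted probabilities.
    g0_true: (N, k) continuous targets; g0_pred: (N, k) predictions.
    """
    eps = 1e-12                                            # avoid log(0)
    ce = -np.mean(np.sum(y0_true * np.log(y0_pred + eps), axis=-1))
    mse = np.mean((g0_true - g0_pred) ** 2)
    return w_ce * ce + w_mse * mse

# Toy usage with random targets and predictions.
rng = np.random.default_rng(0)
y_true = np.eye(4)[rng.integers(0, 4, size=8)]
y_pred = rng.dirichlet(np.ones(4), size=8)
g_true, g_pred = rng.standard_normal((8, 3)), rng.standard_normal((8, 3))
print(hybrid_denoising_loss(y_true, y_pred, g_true, g_pred))
```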

The sampling algorithm uses standard Euler–Maruyama discretization, often complemented by higher-order solvers (e.g., Heun's method) for stability. The reverse process omits explicit divergence terms when absorbed into the learned network or when using variance-preserving schemes.
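
For the score-based route, a generic Euler–Maruyama step of the reverse-time OU SDE in logit space looks roughly as follows; `score_fn` stands in for the learned score $s_\theta$, and this is a structural sketch rather than the exact discretization used in the cited papers.

```python
import numpy as np

def em_reverse_step(z, t, dt, score_fn, theta=1.0, sigma=1.0, rng=None):
    """One Euler-Maruyama step of the reverse-time SDE dz = [f - g^2 * score] dt + g dW,
    integrated backward in time; samples on the simplex are softmax(z) at the end."""
    rng = np.random.default_rng() if rng is None else rng
    drift = -theta * z - sigma**2 * score_fn(z, t)        # reverse drift f(z,t) - g^2 * score
    noise = sigma * np.sqrt(dt) * rng.standard_normal(z.shape)
    return z - drift * dt + noise                         # step from t to t - dt

# Toy usage with the analytic score of a standard normal prior (score(z) = -z).
z = np.random.default_rng(0).standard_normal(5)
z = em_reverse_step(z, t=1.0, dt=0.01, score_fn=lambda z, t: -z)
print(z)
```

A Heun-style corrector would re-evaluate the drift at the proposed point and average the two drift estimates before committing the step.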

4. Discrete, Continuous, and Hybrid Applications

Gaussian–Softmax diffusion is critical in domains demanding structured discrete data modeling, such as:

  • CAD sketch generation: In "SketchDNN," two parallel diffusion chains are run: continuous Gaussian diffusion for geometric parameters and Gaussian–Softmax diffusion for multi-class primitive and attribute labels, with both chains permutation-equivariant to handle variable CAD part orderings (Chereddy et al., 15 Jul 2025).
  • Probability simplex and unit cube modeling: The framework generalizes to tasks where outputs must stay within a simplex or cube, extending via independent 1D OU → sigmoid chains, crucial for bounded-value generation—e.g., image modeling on $[0, 1]^d$ (Floto et al., 2023).
  • Predictive uncertainty: In the context of deep neural networks, Gaussian–Softmax integrals connect to uncertainty estimation, offering closed-form, mean-field approximations for predictive distributions, effectively simulating deep ensembles using only first and second moments of the parameter-induced Gaussian on logits (Lu et al., 2020).
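
For the uncertainty-estimation setting just mentioned, one widely used closed-form device is a probit-style mean-field approximation to the Gaussian–softmax integral, which rescales each logit by its variance before the softmax. The sketch below illustrates the idea generically; it is not the exact estimator of Lu et al. (2020).

```python
import numpy as np

def mean_field_softmax(mu, var, lam=np.pi / 8.0):
    """Approximate E[softmax(z)] for z ~ N(mu, diag(var)) by scaling each
    logit with 1 / sqrt(1 + lam * var) before the softmax (probit-style
    correction; lam = pi/8 is a common choice)."""
    z = mu / np.sqrt(1.0 + lam * var)
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Larger logit variance flattens the predictive distribution (more uncertainty).
mu = np.array([2.0, 0.5, -1.0])
print(mean_field_softmax(mu, np.zeros(3)))        # ~ ordinary softmax
print(mean_field_softmax(mu, 10.0 * np.ones(3)))  # noticeably flatter
```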

5. Comparison with Alternative Discrete Diffusion Approaches

Compared to alternative discrete diffusion techniques:

  • Categorical SDEs (Simplex Diffusion) typically use Cox–Ingersoll–Ross processes with non-Dirichlet transient laws, lacking closed-form transitions at arbitrary times. Gaussian–Softmax maintains logistic-normal marginals at all times (Floto et al., 2023).
  • Reflected diffusion applies Brownian motion with boundary reflection to enforce $[0,1]^d$ constraints, but yields no analytic score and incurs complex boundary treatment requirements (Floto et al., 2023).
  • Classical multinomial or categorical diffusion lacks superposition: one-hot samples are merely permuted, without intermediate soft mixing—Gaussian–Softmax yields true convex superpositions at intermediate times (Chereddy et al., 15 Jul 2025).

The method is fundamentally grounded in the mapping between Ornstein–Uhlenbeck processes in logit space and the logistic-normal distribution on the simplex, a construction that is both analytically tractable and empirically effective.

6. Numerical and Empirical Considerations

Empirical evaluation in SketchDNN demonstrates significant improvement over previous CAD generative models, with Fréchet Inception Distance reduced from 16.04 to 7.80 and negative log-likelihood improved from 84.8 to 81.33 on SketchGraphs (Chereddy et al., 15 Jul 2025). In uncertainty estimation tasks, mean-field Gaussian–Softmax approximations match or outperform deep ensembles and other single-model methods, especially for out-of-distribution detection (Lu et al., 2020).

Diffusion schedules require augmentation for the discrete chain: naive continuous cosine schedules applied in logit space can destroy class identity too quickly. An "augmented" schedule $\{\bar b_t\}$, analytically linked to the probability of recovering the correct argmax, is adopted so that one-hotness is relaxed gradually (Chereddy et al., 15 Jul 2025).
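
The effect the augmented schedule corrects for can be visualized with a quick Monte Carlo estimate of the argmax-recovery probability under a given cumulative coefficient; this sketch is illustrative and is not the paper's analytic derivation.

```python
import numpy as np

def argmax_recovery_prob(alpha_bar, d=10, smoothing=1e-3, n=20000, seed=0):
    """Estimate P(argmax of the noised logits equals the true class) for a
    d-class one-hot label noised with cumulative coefficient alpha_bar."""
    rng = np.random.default_rng(seed)
    y0 = (1.0 - smoothing) * np.eye(d)[0] + smoothing / d      # smoothed one-hot
    z = (np.sqrt(alpha_bar) * np.log(y0)
         + np.sqrt(1.0 - alpha_bar) * rng.standard_normal((n, d)))
    return float(np.mean(z.argmax(axis=-1) == 0))              # softmax preserves argmax

for ab in (0.99, 0.9, 0.5, 0.1):
    print(ab, argmax_recovery_prob(ab))
```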

7. Implementation and Modeling Details

Architecturally, state-of-the-art models such as SketchDNN deploy deep transformers without positional encoding to ensure permutation-equivariance over unordered primitives. Typical training uses cosine schedules for drift coefficients, label smoothing near the simplex boundary, and a balanced loss function that weights the MSE on geometric variables heavily at early steps and reduces that weight later.

The denoising process for each forward/backward step factorizes across primitives and features, allowing efficient parallelization. Sampling proceeds by alternating between continuous and discrete chains, with the latter always performed in logit-space before projection to the simplex.

This overall framework provides analytic, flexible, and highly effective generative diffusion machinery for discrete and mixed domains. By blending continuous stochasticity with simplex projection, Gaussian–Softmax diffusion resolves the fundamental incompatibility between continuous diffusion and the categorical structure of discrete data, and offers a general foundation for generative modeling of structured objects (Floto et al., 2023, Chereddy et al., 15 Jul 2025, Lu et al., 2020).
