Distributional Diffusion Models with Scoring Rules (2502.02483v2)

Published 4 Feb 2025 in cs.LG and stat.ML

Abstract: Diffusion models generate high-quality synthetic data. They operate by defining a continuous-time forward process which gradually adds Gaussian noise to data until fully corrupted. The corresponding reverse process progressively "denoises" a Gaussian sample into a sample from the data distribution. However, generating high-quality outputs requires many discretization steps to obtain a faithful approximation of the reverse process. This is expensive and has motivated the development of many acceleration methods. We propose to accomplish sample generation by learning the posterior distribution of clean data samples given their noisy versions, instead of only the mean of this distribution. This allows us to sample from the probability transitions of the reverse process on a coarse time scale, significantly accelerating inference with minimal degradation of the quality of the output. This is accomplished by replacing the standard regression loss used to estimate conditional means with a scoring rule. We validate our method on image and robot trajectory generation, where we consistently outperform standard diffusion models at few discretization steps.

Authors

Valentin De Bortoli
Alexandre Galashov
J. Swaroop Guntupalli
Guangyao Zhou
Kevin Murphy
Arthur Gretton
Arnaud Doucet

Summary

Distributional Diffusion Models, as introduced in "Distributional Diffusion Models with Scoring Rules" (De Bortoli et al., 4 Feb 2025), modify the standard diffusion model training paradigm to accelerate inference, particularly in the regime of few discretization steps. The core idea is to learn the full conditional posterior distribution $p_{0|t}(x_0 | x_t)$ of the clean data $x_0$ given the noisy data $x_t$ at time $t$, rather than only its conditional mean $\mathbb{E}[X_0 | X_t]$, as is typically done via Mean Squared Error (MSE) minimization. This richer representation of the reverse process transition probabilities allows for more accurate sampling when using larger time steps, thereby reducing the number of network evaluations required for generation.

Methodology: Learning Posteriors with Scoring Rules

Standard diffusion models, like DDPM, are trained to predict the added noise or, equivalently, the original data $x_0$ from its noisy version $x_t$. The objective function commonly simplifies to an MSE loss:

$\mathcal{L}_{\text{diff}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}(0,1), X_0 \sim p_0, \epsilon \sim \mathcal{N}(0, I)} \| \hat{x}_{\theta}(t, \sqrt{\bar{\alpha}_t}X_0 + \sqrt{1-\bar{\alpha}_t}\epsilon) - X_0 \|^2$

Minimizing this loss trains the network $\hat{x}_{\theta}$ to approximate $\mathbb{E}[X_0 | X_t=x_t]$. This point estimate works well when the gap $\Delta t$ between consecutive sampling times is small, but it becomes a poor stand-in for the true posterior $p_{0|t}(x_0 | x_t)$ when $\Delta t$ is large, which hinders few-step sampling performance.
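To make the gap between the conditional mean and the posterior concrete, here is a minimal sketch on a toy case that is not taken from the paper: scalar data equal to $-1$ or $+1$ with equal probability. The function name `posterior_two_point` and the example values of $\bar{\alpha}_t$ are illustrative assumptions.

```python
import math

def posterior_two_point(x_t, alpha_bar):
    """Exact posterior for x0 in {-1, +1} (equal prior) given
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, 1).

    Returns (P(x0 = +1 | x_t), E[x0 | x_t]). Toy example, not from the paper.
    """
    a = math.sqrt(alpha_bar)
    noise_var = 1.0 - alpha_bar
    # Log-likelihood ratio of x0 = +1 versus x0 = -1 is 2 * a * x_t / noise_var.
    p_plus = 1.0 / (1.0 + math.exp(-2.0 * a * x_t / noise_var))
    mean = 2.0 * p_plus - 1.0  # algebraically equal to tanh(a * x_t / noise_var)
    return p_plus, mean

p_plus, mean = posterior_two_point(0.0, 0.5)
```

At $x_t = 0$ the posterior puts probability one half on each of $\pm 1$, yet its mean is $0$, a value the data never takes. A regression network trained with MSE would be pushed toward that mean, whereas a sample from the posterior is always a valid data point.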

Distributional Diffusion Models propose learning this posterior distribution directly. They introduce a modified network $\hat{x}_{\theta}(t, x_t, \xi)$, where $\xi \sim \mathcal{N}(0, I_d)$ is an auxiliary random variable. By varying $\xi$, the network outputs samples intended to approximate the distribution $p_{0|t}(x_0 | x_t)$. To train this network, the MSE loss is replaced by a loss derived from a strictly proper scoring rule, which evaluates the quality of the predicted distribution $p^{\theta}_{0|t}(\cdot | x_t)$ (generated by varying $\xi$ for fixed $t, x_t$) against the ground truth outcome $x_0$.

The primary loss function explored is based on the Conditional Generalized Energy Score:

$S_{\lambda, \beta}(p^{\theta}_{0|t}(\cdot|x_t), x_0) = -\mathbb{E}_{X \sim p^{\theta}_{0|t}(\cdot|x_t)}[\|X - x_0\|^{\beta}] + \frac{\lambda}{2} \mathbb{E}_{X, X' \sim p^{\theta}_{0|t}(\cdot|x_t)}[\|X - X'\|^{\beta}]$

where $\beta \in (0, 2]$ and $\lambda \in [0, 1]$. The first term encourages the predicted samples $X$ to be close to the true $x_0$ (confinement), while the second term encourages diversity among the predicted samples (interaction), preventing distributional collapse. The overall training objective, termed the Energy Diffusion Loss, integrates the expectation of the negative score over time and the data distribution:

$\mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathcal{U}(0,1), (X_0, X_t) \sim p_{0,t}} [-S_{\lambda, \beta}(p^{\theta}_{0|t}(\cdot|X_t), X_0)]$

In practice, the expectations within the score are estimated using Monte Carlo sampling, drawing $m$ samples $\xi^1, ..., \xi^m \sim \mathcal{N}(0, I_d)$ to generate $\hat{x}_{\theta}^j = \hat{x}_{\theta}(t, X_t, \xi^j)$ for $j=1, ..., m$. The empirical loss is:

$\hat{\mathcal{L}}(\theta) = \frac{1}{m} \sum_{j=1}^m \|\hat{x}_{\theta}^j - X_0\|^{\beta} - \frac{\lambda}{2m^2} \sum_{j, k=1}^m \|\hat{x}_{\theta}^j - \hat{x}_{\theta}^k\|^{\beta}$
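As a minimal sketch, the empirical loss above can be written for a single $(X_t, X_0)$ pair as follows. The function name `energy_loss` and the list-of-floats representation are assumptions made for illustration; a real implementation would use batched tensors in an autodiff framework.

```python
import math

def energy_loss(samples, x0, lam=1.0, beta=1.0):
    """Empirical energy diffusion loss for one (x_t, x0) pair.

    samples: list of m network outputs, each a list of floats.
    x0: the clean data point, a list of floats of the same length.
    """
    m = len(samples)
    # Confinement term: mean of ||x_hat^j - x0||^beta over the m samples.
    confinement = sum(math.dist(x, x0) ** beta for x in samples) / m
    # Interaction term: mean of ||x_hat^j - x_hat^k||^beta over all m^2
    # ordered pairs (j == k contributes zero), matching the 1/m^2 factor.
    interaction = sum(
        math.dist(xj, xk) ** beta for xj in samples for xk in samples
    ) / (m * m)
    return confinement - 0.5 * lam * interaction

loss = energy_loss([[0.0, 0.0], [3.0, 4.0]], [0.0, 0.0], lam=1.0, beta=1.0)
```

Calling the same function with `lam=0.0, beta=2.0` returns the mean squared distance between the samples and `x0`, since the interaction term drops out.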

Notably, setting $\lambda \to 0$ and $\beta \to 2$ recovers the standard MSE loss (up to constants), positioning standard diffusion as a special case. The framework also supports Kernel Diffusion Losses using other characteristic kernels $\rho$ (like Inverse Multiquadric or RBF) instead of the energy distance $\| \cdot \|^{\beta}$.

Implementation Details

Training (Algorithm 1):

The training procedure requires modification from standard diffusion model training. For each training step:

1. Sample a data point $X_0 \sim p_0$.
2. Sample a time $t \sim \mathcal{U}(0,1)$.
3. Generate the noisy version $X_t = \sqrt{\bar{\alpha}_t}X_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
4. Sample $m$ independent noise vectors $\xi^1, ..., \xi^m \sim \mathcal{N}(0, I_d)$.
5. Compute $m$ posterior samples using the network: $\hat{x}_{\theta}^j = \hat{x}_{\theta}(t, X_t, \xi^j)$ for $j=1, ..., m$.
6. Calculate the empirical loss $\hat{\mathcal{L}}(\theta)$ using $X_0$ and the set $\{\hat{x}_{\theta}^j\}_{j=1}^m$.
7. Compute gradients $\nabla_{\theta} \hat{\mathcal{L}}(\theta)$ and update parameters $\theta$.
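The loss computation in the procedure above can be sketched as follows for a single data point. This is an illustration under stated assumptions, not the paper's code: the cosine schedule in `alpha_bar`, the affine `toy_network`, and its parameter tuple `theta` are all hypothetical stand-ins, and the gradient update is left to an autodiff framework.

```python
import math
import random

def alpha_bar(t):
    # Assumed cosine noise schedule: alpha_bar(0) = 1, alpha_bar(1) close to 0.
    return math.cos(0.5 * math.pi * t) ** 2

def toy_network(t, x_t, xi, theta):
    # Hypothetical stand-in for x_hat_theta(t, x_t, xi): an affine map.
    w, b, s = theta
    return [w * x + b + s * z for x, z in zip(x_t, xi)]

def energy_loss(samples, x0, lam, beta):
    m = len(samples)
    confinement = sum(math.dist(x, x0) ** beta for x in samples) / m
    interaction = sum(
        math.dist(xj, xk) ** beta for xj in samples for xk in samples
    ) / (m * m)
    return confinement - 0.5 * lam * interaction

def training_loss(x0, theta, m=2, lam=1.0, beta=1.0, rng=random):
    d = len(x0)
    t = rng.random()                                  # time t ~ U(0, 1)
    eps = [rng.gauss(0.0, 1.0) for _ in range(d)]     # forward-process noise
    a = math.sqrt(alpha_bar(t))
    s = math.sqrt(1.0 - alpha_bar(t))
    x_t = [a * x + s * e for x, e in zip(x0, eps)]    # noisy version of x0
    # m independent auxiliary noise vectors, one network call per vector.
    xis = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(m)]
    samples = [toy_network(t, x_t, xi, theta) for xi in xis]
    return energy_loss(samples, x0, lam, beta)        # empirical loss

loss = training_loss([3.0, 4.0], theta=(0.5, 0.0, 0.1))
```

Note that the same $X_t$ is shared by all $m$ network calls; only the auxiliary noise differs between them.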

A key implementation consideration is the computational cost. The network forward pass is performed $m$ times per training step, and the interaction term requires pairwise comparisons of the $m$ network outputs, an $O(m^2)$ cost. The paper typically uses $m=2$, making training roughly twice as expensive as standard diffusion training.

Sampling (Algorithm 2):

Inference uses a modified DDIM-like sampling procedure. Given a sequence of time steps $1 = t_N > t_{N-1} > ... > t_1 > t_0 = 0$:

1. Start with $X_{t_N} \sim \mathcal{N}(0, I)$.
2. For $k = N-1, ..., 0$:
a. Sample an auxiliary noise vector $\xi \sim \mathcal{N}(0, I_d)$.
b. Obtain a sample from the learned posterior: $\hat{X}_0 = \hat{x}_{\theta}(t_{k+1}, X_{t_{k+1}}, \xi)$.
c. Compute the predicted noise direction: $\epsilon_{\theta} = (X_{t_{k+1}} - \sqrt{\bar{\alpha}_{t_{k+1}}} \hat{X}_0) / \sqrt{1 - \bar{\alpha}_{t_{k+1}}}$.
d. Compute the next sample $X_{t_k}$ using the standard DDIM update rule:

$X_{t_k} = \sqrt{\bar{\alpha}_{t_k}} \hat{X}_0 + \sqrt{1 - \bar{\alpha}_{t_k}} \epsilon_{\theta}$

(This assumes $\eta=0$, so the DDIM update itself adds no fresh noise; stochastic sampling with $\eta>0$ is also possible.)

The crucial difference from standard DDIM is step 2b: instead of using $\hat{x}_{\theta}$ directly as the estimate of $x_0$, a sample $\hat{X}_0$ is drawn from the learned posterior $p^{\theta}_{0|t_{k+1}}$ by feeding random noise $\xi$ into the network. This injection of stochasticity at each step leverages the learned distributional information.
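A minimal sketch of this sampling loop, with $\eta=0$, might look as follows. The cosine schedule in `alpha_bar`, the uniform time grid, and the `posterior_sampler` callable (standing in for the trained network $\hat{x}_{\theta}$) are assumptions made for illustration.

```python
import math
import random

def alpha_bar(t):
    # Assumed cosine noise schedule: alpha_bar(0) = 1, alpha_bar(1) close to 0.
    return math.cos(0.5 * math.pi * t) ** 2

def sample(posterior_sampler, d, n_steps, rng=random):
    """Draw one d-dimensional sample using n_steps coarse reverse steps.

    posterior_sampler(t, x_t, xi) is a hypothetical stand-in for the trained
    network; it should return one draw from the learned posterior of x0.
    """
    # Uniform grid t_0 = 0 < ... < t_N = 1 (an assumed choice of time steps).
    ts = [k / n_steps for k in range(n_steps + 1)]
    x = [rng.gauss(0.0, 1.0) for _ in range(d)]          # X_{t_N} ~ N(0, I)
    for k in range(n_steps - 1, -1, -1):
        t_hi, t_lo = ts[k + 1], ts[k]
        xi = [rng.gauss(0.0, 1.0) for _ in range(d)]     # auxiliary noise
        x0_hat = posterior_sampler(t_hi, x, xi)          # posterior sample
        a_hi, a_lo = alpha_bar(t_hi), alpha_bar(t_lo)
        # Noise direction implied by the current state and the x0 sample.
        eps = [(xv - math.sqrt(a_hi) * x0v) / math.sqrt(1.0 - a_hi)
               for xv, x0v in zip(x, x0_hat)]
        # DDIM update with eta = 0.
        x = [math.sqrt(a_lo) * x0v + math.sqrt(1.0 - a_lo) * e
             for x0v, e in zip(x0_hat, eps)]
    return x

# Sanity check with a degenerate "posterior" that always returns one point.
out = sample(lambda t, x_t, xi: [1.5, -2.0], d=2, n_steps=5)
```

With this schedule $\bar{\alpha}_{t_0} = 1$, so the final iterate equals the last posterior sample; the degenerate sampler in the usage line therefore returns its constant point.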

Architecture:

The paper adapts standard diffusion architectures (e.g., UNets used in image generation). The auxiliary noise $\xi$ is typically incorporated by concatenating it to the time embedding $t$ before processing or by adding it spatially after projecting it to match feature map dimensions.

Hyperparameters:

The key hyperparameters are $\lambda \in [0, 1]$ and $\beta \in (0, 2]$. $\lambda=1$ corresponds to the energy score, which is strictly proper for $\beta \in (0, 2)$ and is therefore the theoretically principled choice for learning the true posterior. However, empirical results suggest $\lambda < 1$ (e.g., $\lambda=0.5$) can offer more robust performance across different numbers of inference steps $N$. $\beta=2$ connects to MSE, while $\beta < 2$ (e.g., $\beta=1$, which uses the unsquared Euclidean distance) provides an alternative. The choice of $m$ (typically 2) affects training cost and gradient variance. For Kernel Diffusion, the choice of kernel and its parameters (e.g., bandwidth) is relevant.

Theoretical Analysis

The paper provides analysis supporting the methodology:

Gaussian Case: Analyzing a simple Gaussian case ($p_0 = \mathcal{N}(\mu_0, \Sigma_0)$) reveals that using the energy score with $\lambda=1$ allows recovery of the true posterior variance. In contrast, $\lambda < 1$ leads to underestimation of the posterior variance, effectively shrinking the learned distribution. This shrinkage might explain the robustness of $\lambda < 1$ by mitigating error accumulation over many steps.
Conditional vs. Joint Scores: The paper argues for using conditional scores $S(p^{\theta}_{0|t}(\cdot|x_t), x_0)$ over joint scores $S(p^{\theta}_{0,t}, p_{0,t})$, suggesting the conditional approach has better Signal-to-Noise Ratio (SNR) properties during training, especially at low noise levels (small $t$).
Diffusion Compatibility: A scoring rule (or kernel) is deemed "diffusion compatible" if the corresponding diffusion loss converges to the standard MSE loss in some limit (e.g., $\lambda \to 0, \beta \to 2$ for energy score). This property ensures a smooth connection to the well-established standard diffusion framework. The paper shows that Energy Score, IMQ kernel, and RBF kernel possess this property.
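For the Gaussian case, the true posterior that the learned distribution is compared against has a closed form. The sketch below computes it in the scalar setting using the standard conjugate-Gaussian formula; the function name `gaussian_posterior` and the restriction to one dimension are assumptions made for illustration, and the code does not reproduce the paper's analysis of how $\lambda$ affects the learned variance.

```python
import math

def gaussian_posterior(x_t, mu0, sigma0, alpha_bar):
    """Mean and variance of p(x0 | x_t) for scalar x0 ~ N(mu0, sigma0^2) and
    x_t = sqrt(alpha_bar) * x0 + sqrt(1 - alpha_bar) * eps, eps ~ N(0, 1).

    Requires 0 < alpha_bar < 1.
    """
    noise_var = 1.0 - alpha_bar
    # Posterior precision = prior precision + likelihood precision.
    precision = 1.0 / sigma0 ** 2 + alpha_bar / noise_var
    var = 1.0 / precision
    mean = var * (mu0 / sigma0 ** 2 + math.sqrt(alpha_bar) * x_t / noise_var)
    return mean, var

mean, var = gaussian_posterior(x_t=1.0, mu0=0.0, sigma0=1.0, alpha_bar=0.5)
```

The returned variance is the target that, according to the analysis summarized above, a model trained with $\lambda=1$ can recover and a model trained with $\lambda<1$ would underestimate.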

Experimental Results and Performance

The primary claim is validated empirically: Distributional Diffusion Models significantly outperform standard diffusion models (trained with MSE loss and sampled with DDIM) in the few-step inference regime (roughly $N = 10$ to $50$ steps).

Image Generation: On CIFAR-10 (32x32), CelebA (64x64), and LSUN Churches (256x256), models trained with Energy Diffusion Loss ($\lambda=1, \beta=1$ or $\lambda=0.5, \beta=2$) achieve substantially lower FID scores than baseline DDIM for $N \leq 50$. For example, on CIFAR-10 with $N=10$ steps, the FID improves significantly compared to the baseline. Similar gains are observed on latent diffusion models for CelebA-HQ (512x512).
Performance Trade-offs: While often optimal at very few steps ($N=10, 20$), the configuration $\lambda=1, \beta=1$ can sometimes slightly underperform standard diffusion ($\lambda=0, \beta=2$) or models with $\lambda<1$ when using many steps ($N \approx 100$). This suggests potential issues with accumulating approximation errors from the learned posterior over many iterations. Using $\lambda=0.5$ often provides a good balance, performing well at both few and many steps. Interestingly, even $\lambda=0, \beta=1$ (a modified regression loss, not learning a full distribution) sometimes outperforms the standard $\lambda=0, \beta=2$ baseline, suggesting benefits beyond just distributional learning.
Robotics: On the Libero benchmark for robot trajectory generation (simulating sequences of joint positions), the distributional approach again shows improved performance (lower Fréchet Distance between generated and real trajectories) compared to standard diffusion, especially for few inference steps.

These results strongly support the claim that learning the posterior distribution enables more effective coarse-grained sampling.

Practical Implications and Applications

The main practical implication is the potential for significant inference acceleration of diffusion models. By achieving comparable or better sample quality with roughly an order of magnitude fewer steps (e.g., 20-50 steps instead of 200-1000), Distributional Diffusion Models can make diffusion-based generation feasible in applications sensitive to latency or computational cost.

Real-time Synthesis: Potential applications include faster generation of images, audio snippets, or other media where near real-time performance is desirable.
Resource-Constrained Environments: Reduced computational requirements per sample could facilitate deployment on edge devices or mobile platforms.
Robotics and Control: Faster trajectory planning or policy generation could be beneficial in dynamic robotics tasks.

However, some limitations exist:

Training Cost: Training is moderately more expensive (roughly 2x with $m=2$) due to multiple forward passes and the $O(m^2)$ loss term calculation.
Hyperparameter Sensitivity: Performance can depend on the choice of $\lambda$, $\beta$, and potentially the kernel type. Finding optimal settings might require careful tuning.
Approximation Quality: The quality of the generated samples still relies on the network's ability to accurately approximate the posterior distribution $p_{0|t}$. Architectural choices for incorporating $\xi$ might influence this.

Future work might explore adaptive hyperparameter settings, designing architectures better suited for distributional output, or combining this approach with other acceleration techniques like knowledge distillation.

Conclusion

Distributional Diffusion Models (De Bortoli et al., 4 Feb 2025) offer a principled modification to the standard diffusion training objective by leveraging scoring rules to learn the conditional posterior distribution $p_{0|t}(x_0|x_t)$. In the reported experiments, this approach improves sample quality in the few-step inference regime compared to traditional methods relying solely on conditional mean estimation. By enabling high-fidelity generation with significantly fewer network evaluations, this work provides a valuable technique for accelerating diffusion models and broadening their applicability in computationally constrained scenarios.
