Distributional Diffusion Models with Scoring Rules (2502.02483v2)

Published 4 Feb 2025 in cs.LG and stat.ML

Abstract: Diffusion models generate high-quality synthetic data. They operate by defining a continuous-time forward process which gradually adds Gaussian noise to data until fully corrupted. The corresponding reverse process progressively "denoises" a Gaussian sample into a sample from the data distribution. However, generating high-quality outputs requires many discretization steps to obtain a faithful approximation of the reverse process. This is expensive and has motivated the development of many acceleration methods. We propose to accomplish sample generation by learning the posterior {\em distribution} of clean data samples given their noisy versions, instead of only the mean of this distribution. This allows us to sample from the probability transitions of the reverse process on a coarse time scale, significantly accelerating inference with minimal degradation of the quality of the output. This is accomplished by replacing the standard regression loss used to estimate conditional means with a scoring rule. We validate our method on image and robot trajectory generation, where we consistently outperform standard diffusion models at few discretization steps.

Authors (7)
  1. Valentin De Bortoli (50 papers)
  2. Alexandre Galashov (21 papers)
  3. J. Swaroop Guntupalli (6 papers)
  4. Guangyao Zhou (19 papers)
  5. Kevin Murphy (87 papers)
  6. Arthur Gretton (127 papers)
  7. Arnaud Doucet (161 papers)

Summary

Distributional Diffusion Models, introduced in "Distributional Diffusion Models with Scoring Rules" (De Bortoli et al., 4 Feb 2025), modify the standard diffusion model training paradigm to accelerate inference, particularly in the regime of few discretization steps. The core idea is to learn the full conditional posterior distribution $p_{0|t}(x_0 | x_t)$ of the clean data $x_0$ given the noisy data $x_t$ at time $t$, rather than only its conditional mean $\mathbb{E}[X_0 | X_t]$, as is typically done via Mean Squared Error (MSE) minimization. This richer representation of the reverse process transition probabilities allows for more accurate sampling with larger time steps, thereby reducing the number of network evaluations required for generation.

Methodology: Learning Posteriors with Scoring Rules

Standard diffusion models, like DDPM, are trained to predict the added noise or, equivalently, the original data $x_0$ from its noisy version $x_t$. The objective function commonly simplifies to an MSE loss:

$$\mathcal{L}_{\text{diff}}(\theta) = \mathbb{E}_{t \sim \mathcal{U}(0,1), X_0 \sim p_0, \epsilon \sim \mathcal{N}(0, I)} \| \hat{x}_{\theta}(t, \sqrt{\bar{\alpha}_t}X_0 + \sqrt{1-\bar{\alpha}_t}\epsilon) - X_0 \|^2$$

Minimizing this loss trains the network $\hat{x}_{\theta}$ to approximate $\mathbb{E}[X_0 | X_t=x_t]$. While effective for small time steps $\Delta t$, this point estimate becomes a poor approximation of the true posterior $p_{0|t}(x_0 | x_t)$ when $\Delta t$ is large, hindering few-step sampling performance.

Distributional Diffusion Models propose learning this posterior distribution directly. They introduce a modified network $\hat{x}_{\theta}(t, x_t, \xi)$, where $\xi \sim \mathcal{N}(0, I_d)$ is an auxiliary random variable. By varying $\xi$, the network outputs samples intended to approximate the distribution $p_{0|t}(x_0 | x_t)$. To train this network, the MSE loss is replaced by a loss derived from a strictly proper scoring rule, which evaluates the quality of the predicted distribution $p^{\theta}_{0|t}(\cdot | x_t)$ (generated by varying $\xi$ for fixed $t, x_t$) against the ground truth outcome $x_0$.

The primary loss function explored is based on the Conditional Generalized Energy Score:

$$S_{\lambda, \beta}(p^{\theta}_{0|t}(\cdot|x_t), x_0) = -\mathbb{E}_{X \sim p^{\theta}_{0|t}(\cdot|x_t)}[\|X - x_0\|^{\beta}] + \frac{\lambda}{2} \mathbb{E}_{X, X' \sim p^{\theta}_{0|t}(\cdot|x_t)}[\|X - X'\|^{\beta}]$$

where $\beta \in (0, 2]$ and $\lambda \in [0, 1]$. The first term encourages the predicted samples $X$ to be close to the true $x_0$ (confinement), while the second term encourages diversity among the predicted samples (interaction), preventing distributional collapse. The overall training objective, termed the Energy Diffusion Loss, averages the negative score over time and the data distribution:

$$\mathcal{L}(\theta) = \mathbb{E}_{t \sim \mathcal{U}(0,1), (X_0, X_t) \sim p_{0,t}} [-S_{\lambda, \beta}(p^{\theta}_{0|t}(\cdot|X_t), X_0)]$$

In practice, the expectations within the score are estimated using Monte Carlo sampling, drawing $m$ samples $\xi^1, ..., \xi^m \sim \mathcal{N}(0, I_d)$ to generate $\hat{x}_{\theta}^j = \hat{x}_{\theta}(t, X_t, \xi^j)$ for $j=1, ..., m$. The empirical loss is:

$$\hat{\mathcal{L}}(\theta) = \frac{1}{m} \sum_{j=1}^m \|\hat{x}_{\theta}^j - X_0\|^{\beta} - \frac{\lambda}{2m^2} \sum_{j, k=1}^m \|\hat{x}_{\theta}^j - \hat{x}_{\theta}^k\|^{\beta}$$

Notably, setting $\lambda \to 0$ and $\beta \to 2$ recovers the standard MSE loss (up to constants), positioning standard diffusion as a special case. The framework also supports Kernel Diffusion Losses using other characteristic kernels $\rho$ (like Inverse Multiquadric or RBF) instead of the energy distance $\| \cdot \|^{\beta}$.
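The empirical loss can be sketched in a few lines of NumPy. This is a minimal illustration for a single $(X_0, X_t)$ pair, not the paper's implementation: the function name `energy_diffusion_loss` and the array layout (one row per sample $\hat{x}_{\theta}^j$) are assumptions made here. It follows the formula as written in this summary, with the $1/m^2$ normalization over all pairs; an unbiased variant would instead average over the $m(m-1)$ pairs with $j \neq k$.

```python
import numpy as np

def energy_diffusion_loss(samples, x0, lam=1.0, beta=1.0):
    """Empirical energy diffusion loss for one (x0, x_t) pair.

    samples: array of shape (m, d), the m network outputs x_hat^j.
    x0:      array of shape (d,), the clean data point.
    """
    samples = np.asarray(samples, dtype=float)
    x0 = np.asarray(x0, dtype=float)
    m = samples.shape[0]
    # Confinement term: (1/m) * sum_j ||x_hat^j - x0||^beta
    confinement = np.mean(np.linalg.norm(samples - x0, axis=1) ** beta)
    # Interaction term: lam / (2 m^2) * sum_{j,k} ||x_hat^j - x_hat^k||^beta
    diffs = samples[:, None, :] - samples[None, :, :]   # shape (m, m, d)
    pairwise = np.linalg.norm(diffs, axis=2) ** beta     # shape (m, m)
    interaction = lam / (2.0 * m**2) * np.sum(pairwise)
    return confinement - interaction

# Two predicted samples on either side of the true point.
loss_energy = energy_diffusion_loss([[0.0, 0.0], [2.0, 0.0]], [1.0, 0.0])
# lam = 0, beta = 2 reduces to the mean squared error of the samples.
loss_mse = energy_diffusion_loss([[0.0, 0.0], [2.0, 0.0]], [1.0, 0.0], lam=0.0, beta=2.0)
```

In the first call the confinement term is 1 and the interaction term is $\frac{1}{8} \cdot 4 = 0.5$, so the loss is 0.5. In the second call the interaction term vanishes and the loss is the plain squared error, 1.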

Implementation Details

Training (Algorithm 1):

The training procedure requires modification from standard diffusion model training. For each training step:

  1. Sample a data point $X_0 \sim p_0$.
  2. Sample a time $t \sim \mathcal{U}(0,1)$.
  3. Generate the noisy version $X_t = \sqrt{\bar{\alpha}_t}X_0 + \sqrt{1-\bar{\alpha}_t}\epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$.
  4. Sample $m$ independent noise vectors $\xi^1, ..., \xi^m \sim \mathcal{N}(0, I_d)$.
  5. Compute $m$ posterior samples using the network: $\hat{x}_{\theta}^j = \hat{x}_{\theta}(t, X_t, \xi^j)$ for $j=1, ..., m$.
  6. Calculate the empirical loss $\hat{\mathcal{L}}(\theta)$ using $X_0$ and the set $\{\hat{x}_{\theta}^j\}_{j=1}^m$.
  7. Compute gradients $\nabla_{\theta} \hat{\mathcal{L}}(\theta)$ and update parameters $\theta$.

A key implementation consideration is the computational cost. Computing the interaction term requires a pairwise comparison of $m$ network outputs, incurring an $O(m^2)$ cost. The network forward pass is performed $m$ times per training step. The paper typically uses $m=2$, making the training roughly twice as expensive as standard diffusion training.
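Steps 1 to 6 of the training procedure can be sketched as follows. Everything concrete here is an assumption for illustration: the cosine schedule in `alpha_bar`, the linear `toy_network` standing in for $\hat{x}_{\theta}$, and the dictionary `theta`. Step 7 (gradients and parameter update) is omitted, since it would be handled by an automatic differentiation framework rather than plain NumPy.

```python
import numpy as np

def alpha_bar(t):
    # Assumed cosine noise schedule; the text does not specify one.
    return np.cos(0.5 * np.pi * t) ** 2

def toy_network(theta, t, x_t, xi):
    # Hypothetical stand-in for x_hat_theta(t, x_t, xi): linear in x_t and xi.
    return theta["a"] * x_t + theta["b"] * xi

def empirical_loss(samples, x0, lam, beta):
    # Empirical energy diffusion loss with the 1/m^2 pairwise normalization.
    m = samples.shape[0]
    confinement = np.mean(np.linalg.norm(samples - x0, axis=1) ** beta)
    pairwise = np.linalg.norm(samples[:, None] - samples[None, :], axis=2) ** beta
    return confinement - lam / (2.0 * m**2) * pairwise.sum()

def training_step_loss(theta, data, rng, m=2, lam=1.0, beta=1.0):
    d = data.shape[1]
    x0 = data[rng.integers(len(data))]                 # step 1: data point
    t = rng.uniform(0.0, 1.0)                           # step 2: time
    eps = rng.standard_normal(d)                        # step 3: noisy version
    x_t = np.sqrt(alpha_bar(t)) * x0 + np.sqrt(1.0 - alpha_bar(t)) * eps
    xis = rng.standard_normal((m, d))                   # step 4: m noise vectors
    samples = np.stack([toy_network(theta, t, x_t, xi) for xi in xis])  # step 5
    return empirical_loss(samples, x0, lam, beta)       # step 6

rng = np.random.default_rng(0)
data = rng.standard_normal((128, 3))                    # synthetic stand-in dataset
theta = {"a": 1.0, "b": 0.1}
loss = training_step_loss(theta, data, rng)
```

The list comprehension makes the $m$ forward passes explicit; a batched implementation would typically stack the $m$ copies along the batch axis and call the network once.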

Sampling (Algorithm 2):

Inference uses a modified DDIM-like sampling procedure. Given a sequence of time steps $1 = t_N > t_{N-1} > ... > t_1 > t_0 = 0$:

  1. Start with $X_{t_N} \sim \mathcal{N}(0, I)$.
  2. For $k = N-1, ..., 0$:
     a. Sample an auxiliary noise vector $\xi \sim \mathcal{N}(0, I_d)$.
     b. Obtain a sample from the learned posterior: $\hat{X}_0 = \hat{x}_{\theta}(t_{k+1}, X_{t_{k+1}}, \xi)$.
     c. Compute the predicted noise direction: $\epsilon_{\theta} = (X_{t_{k+1}} - \sqrt{\bar{\alpha}_{t_{k+1}}} \hat{X}_0) / \sqrt{1 - \bar{\alpha}_{t_{k+1}}}$.
     d. Compute the next sample $X_{t_k}$ using the standard DDIM update rule:

    $$X_{t_k} = \sqrt{\bar{\alpha}_{t_k}} \hat{X}_0 + \sqrt{1 - \bar{\alpha}_{t_k}} \epsilon_{\theta}$$

    (This assumes $\eta=0$ for deterministic DDIM sampling; stochastic sampling with $\eta>0$ is also possible.)

The crucial difference from standard DDIM is step 2b: instead of using $\hat{x}_{\theta}$ directly as the estimate of $x_0$, a sample $\hat{X}_0$ is drawn from the learned posterior $p^{\theta}_{0|t_{k+1}}$ by feeding random noise $\xi$ into the network. This injection of stochasticity at each step leverages the learned distributional information.
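The sampling loop can be sketched as below. To keep the example checkable without a trained model, the network is replaced by `gaussian_posterior_sampler`, a hypothetical stand-in that samples the exact posterior for the toy case $p_0 = \mathcal{N}(0, I)$. The cosine schedule and the uniform time grid are also assumptions, not choices taken from the paper.

```python
import numpy as np

def alpha_bar(t):
    # Assumed cosine noise schedule (not specified in the text).
    return np.cos(0.5 * np.pi * t) ** 2

def gaussian_posterior_sampler(t, x_t, xi):
    # Hypothetical stand-in for the trained network x_hat_theta(t, x_t, xi).
    # For p_0 = N(0, I), the exact posterior is N(sqrt(ab) * x_t, (1 - ab) * I).
    ab = alpha_bar(t)
    return np.sqrt(ab) * x_t + np.sqrt(1.0 - ab) * xi

def distributional_ddim_sample(posterior_sampler, shape, num_steps, rng):
    ts = np.linspace(0.0, 1.0, num_steps + 1)        # t_0 = 0 < ... < t_N = 1
    x = rng.standard_normal(shape)                    # step 1: X_{t_N} ~ N(0, I)
    for k in range(num_steps - 1, -1, -1):
        t_cur, t_next = ts[k + 1], ts[k]
        ab_cur, ab_next = alpha_bar(t_cur), alpha_bar(t_next)
        xi = rng.standard_normal(shape)               # 2a: auxiliary noise
        x0_hat = posterior_sampler(t_cur, x, xi)      # 2b: posterior sample
        eps = (x - np.sqrt(ab_cur) * x0_hat) / np.sqrt(1.0 - ab_cur)   # 2c
        x = np.sqrt(ab_next) * x0_hat + np.sqrt(1.0 - ab_next) * eps   # 2d, eta = 0
    return x

rng = np.random.default_rng(0)
samples = distributional_ddim_sample(gaussian_posterior_sampler, (20000, 2), 4, rng)
```

With the exact posterior plugged in, each update keeps the marginal of $X_{t_k}$ at $\mathcal{N}(0, I)$, so even four steps should return samples with mean close to 0 and variance close to 1 in this toy setting.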

Architecture:

The paper adapts standard diffusion architectures (e.g., UNets used in image generation). The auxiliary noise $\xi$ is typically incorporated by concatenating it to the embedding of the time $t$ before processing or by adding it spatially after projecting it to match feature map dimensions.
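The two ways of injecting $\xi$ can be illustrated with plain arrays. This sketch only shows the shapes involved; the sinusoidal embedding, the function names, and the projection matrix `proj` are hypothetical and do not describe the paper's actual UNet.

```python
import numpy as np

def sinusoidal_embedding(t, dim):
    # Assumed sinusoidal time embedding with dim // 2 frequencies (dim even).
    freqs = np.exp(-np.log(10000.0) * np.arange(dim // 2) / (dim // 2))
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def conditioning_vector(t, xi, time_dim=8):
    # Option 1: concatenate xi to the time embedding.
    return np.concatenate([sinusoidal_embedding(t, time_dim), xi.ravel()])

def add_noise_spatially(features, xi, proj):
    # Option 2: project xi to the channel dimension, then broadcast over space.
    # features: (C, H, W); xi: (k,); proj: (C, k), a hypothetical learned matrix.
    return features + (proj @ xi)[:, None, None]

xi = np.arange(4.0)
cond = conditioning_vector(0.3, xi)                    # shape (8 + 4,)
out = add_noise_spatially(np.zeros((3, 5, 5)), xi, np.ones((3, 4)))
```

In the first option the downstream layers see $\xi$ only through the conditioning pathway, while in the second it perturbs the feature maps directly; which works better is an empirical question this sketch does not address.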

Hyperparameters:

The key hyperparameters are $\lambda \in [0, 1]$ and $\beta \in (0, 2]$. $\lambda=1$ corresponds to using a strictly proper score (energy score), theoretically optimal for learning the true posterior. However, empirical results suggest $\lambda < 1$ (e.g., $\lambda=0.5$) can offer more robust performance across different numbers of inference steps $N$. $\beta=2$ connects to MSE, while $\beta < 2$ (e.g., $\beta=1$ for Laplacian-like distance) provides an alternative. The choice of $m$ (typically 2) affects training cost and gradient variance. For Kernel Diffusion, the choice of kernel and its parameters (e.g., bandwidth) is relevant.

Theoretical Analysis

The paper provides analysis supporting the methodology:

  • Gaussian Case: Analyzing a simple Gaussian case ($p_0 = \mathcal{N}(\mu_0, \Sigma_0)$) reveals that using the energy score with $\lambda=1$ allows recovery of the true posterior variance. In contrast, $\lambda < 1$ leads to underestimation of the posterior variance, effectively shrinking the learned distribution. This shrinkage might explain the robustness of $\lambda < 1$ by mitigating error accumulation over many steps.
  • Conditional vs. Joint Scores: The paper argues for using conditional scores $S(p^{\theta}_{0|t}(\cdot|x_t), x_0)$ over joint scores $S(p^{\theta}_{0,t}, p_{0,t})$, suggesting the conditional approach has better Signal-to-Noise Ratio (SNR) properties during training, especially at low noise levels (small $t$).
  • Diffusion Compatibility: A scoring rule (or kernel) is deemed "diffusion compatible" if the corresponding diffusion loss converges to the standard MSE loss in some limit (e.g., $\lambda \to 0, \beta \to 2$ for the energy score). This property ensures a smooth connection to the well-established standard diffusion framework. The paper shows that the Energy Score, IMQ kernel, and RBF kernel possess this property.
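The variance shrinkage described in the Gaussian case can be checked on a one-dimensional toy problem. The calculation below is an illustration written for this summary, not the paper's analysis. It assumes $\beta = 1$, a true posterior $\mathcal{N}(\mu, \sigma^2)$, and a predicted distribution $\mathcal{N}(\mu, s^2)$ with the correct mean. Using $\mathbb{E}|Z| = \sqrt{2v/\pi}$ for $Z \sim \mathcal{N}(0, v)$, the expected loss is $\sqrt{2/\pi}\,(\sqrt{s^2 + \sigma^2} - \lambda s / \sqrt{2})$, which is minimized at $s = \lambda \sigma / \sqrt{2 - \lambda^2}$.

```python
import numpy as np

def expected_energy_loss(s, sigma, lam):
    # Closed-form expected loss (beta = 1) when predicting N(mu, s^2) while
    # the true posterior is N(mu, sigma^2); uses E|Z| = sqrt(2 v / pi).
    confinement = np.sqrt(2.0 / np.pi) * np.sqrt(s**2 + sigma**2)
    interaction = 0.5 * lam * np.sqrt(2.0 / np.pi) * np.sqrt(2.0) * s
    return confinement - interaction

def optimal_std(sigma, lam):
    # Stationary point of the loss above: s = lam * sigma / sqrt(2 - lam^2).
    return lam * sigma / np.sqrt(2.0 - lam**2)

sigma = 1.5
grid = np.linspace(0.0, 3.0, 30001)
grid_minimizers = {}
for lam in (1.0, 0.5):
    # Numerical minimizer on a fine grid, to compare with the closed form.
    grid_minimizers[lam] = grid[np.argmin(expected_energy_loss(grid, sigma, lam))]
    print(lam, grid_minimizers[lam], optimal_std(sigma, lam))
```

For $\lambda = 1$ the minimizer equals $\sigma$, so the true spread is recovered, while $\lambda = 0.5$ gives $s = \sigma / \sqrt{7}$, a markedly narrower distribution. This matches the qualitative statement above within this restricted family.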

Experimental Results and Performance

The primary claim is validated empirically: Distributional Diffusion Models significantly outperform standard diffusion models (trained with MSE loss and sampled with DDIM) in the few-step inference regime ($N \approx 10$ to $50$).

  • Image Generation: On CIFAR-10 (32x32), CelebA (64x64), and LSUN Churches (256x256), models trained with Energy Diffusion Loss ($\lambda=1, \beta=1$ or $\lambda=0.5, \beta=2$) achieve substantially lower FID scores than baseline DDIM for $N \leq 50$. For example, on CIFAR-10 with $N=10$ steps, the FID improves significantly compared to the baseline. Similar gains are observed on latent diffusion models for CelebA-HQ (512x512).
  • Performance Trade-offs: While often optimal at very few steps ($N=10, 20$), the configuration $\lambda=1, \beta=1$ can sometimes slightly underperform standard diffusion ($\lambda=0, \beta=2$) or models with $\lambda<1$ when using many steps ($N \approx 100$). This suggests potential issues with accumulating approximation errors from the learned posterior over many iterations. Using $\lambda=0.5$ often provides a good balance, performing well at both few and many steps. Interestingly, even $\lambda=0, \beta=1$ (a modified regression loss, not learning a full distribution) sometimes outperforms the standard $\lambda=0, \beta=2$ baseline, suggesting benefits beyond just distributional learning.
  • Robotics: On the Libero benchmark for robot trajectory generation (simulating sequences of joint positions), the distributional approach again shows improved performance (lower Fréchet Distance between generated and real trajectories) compared to standard diffusion, especially for few inference steps.

These results strongly support the claim that learning the posterior distribution enables more effective coarse-grained sampling.

Practical Implications and Applications

The main practical implication is the potential for significant inference acceleration of diffusion models. By achieving comparable or better sample quality with 5-10x fewer steps (e.g., 20-50 steps instead of 200-1000), Distributional Diffusion Models can make diffusion-based generation feasible in applications sensitive to latency or computational cost.

  • Real-time Synthesis: Potential applications include faster generation of images, audio snippets, or other media where near real-time performance is desirable.
  • Resource-Constrained Environments: Reduced computational requirements per sample could facilitate deployment on edge devices or mobile platforms.
  • Robotics and Control: Faster trajectory planning or policy generation could be beneficial in dynamic robotics tasks.

However, some limitations exist:

  • Training Cost: Training is moderately more expensive (roughly 2x with $m=2$) due to multiple forward passes and the $O(m^2)$ loss term calculation.
  • Hyperparameter Sensitivity: Performance can depend on the choice of $\lambda$, $\beta$, and potentially the kernel type. Finding optimal settings might require careful tuning.
  • Approximation Quality: The quality of the generated samples still relies on the network's ability to accurately approximate the posterior distribution $p_{0|t}$. Architectural choices for incorporating $\xi$ might influence this.

Future work might explore adaptive hyperparameter settings, designing architectures better suited for distributional output, or combining this approach with other acceleration techniques like knowledge distillation.

Conclusion

Distributional Diffusion Models (De Bortoli et al., 4 Feb 2025) offer a principled modification to the standard diffusion training objective by leveraging scoring rules to learn the conditional posterior distribution $p_{0|t}(x_0|x_t)$. In the reported experiments, this approach improves sample quality in the few-step inference regime compared to traditional methods relying solely on conditional mean estimation. By enabling high-fidelity generation with significantly fewer network evaluations, this work provides a valuable technique for accelerating diffusion models and broadening their applicability in computationally constrained scenarios.