Denoising Score-Matching Loss for EBMs

Updated 7 October 2025
  • Denoising score-matching loss is a method that trains energy-based models by estimating the gradient of log-densities from noise-corrupted data.
  • Multi-scale denoising score matching uses a schedule of noise levels to capture both local details and global structure, improving sample quality and restoration.
  • Empirical tests on datasets like CIFAR-10 show that this approach achieves competitive inception scores and effective image inpainting compared to GANs.

Denoising score-matching loss is a fundamental training criterion for energy-based models (EBMs) and a cornerstone of recent advances in generative modeling with EBMs and score-based diffusion models. The key idea is to train a parametric model to match the score (gradient of log-density) of a noise-perturbed data distribution, thereby circumventing the intractability of likelihood-based learning in high-dimensional settings. Notably, (Li et al., 2019) introduces and analyzes a multiscale extension—multi-scale denoising score matching (MDSM)—arguing that multiple noise levels are required for generative modeling of high-dimensional data, and demonstrating strong empirical performance both in sample synthesis quality and in downstream tasks such as image inpainting and density estimation.

1. Denoising Score-Matching Loss: Definition and Principle

Denoising score matching (DSM) [Vincent 2011] is an estimator for the score function ∇ₓ̃ log q₍σ₎(𝑥̃), where q₍σ₎(𝑥̃|x) is a corruption process (e.g., additive Gaussian noise) and q₍σ₎(𝑥̃) = ∫ q₍σ₎(𝑥̃|x) p(x) dx is the noisy data distribution. DSM formalizes the learning objective as follows:

Given a parameterized energy model E(·; θ), DSM trains θ by minimizing

\mathbb{E}_{p(x)}\,\mathbb{E}_{q_{\sigma}(\tilde{x} \mid x)} \left[ \left\| x - \tilde{x} + \sigma^2 \nabla_{\tilde{x}} E(\tilde{x}; \theta) \right\|^2 \right],

where typically q₍σ₎(𝑥̃|x) = 𝒩(𝑥̃; x, σ²I).

This loss achieves two objectives: (1) it enables fast and stable training by turning the intractable score-matching loss (which involves the model's second derivatives) into a tractable regression against the denoising direction x – 𝑥̃, and (2) it imparts a denoising oracle property, making ∇ₓ̃ E(𝑥̃; θ) directly usable for data restoration.
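
As a concrete illustration, the following is a minimal PyTorch sketch of this single-noise-level objective; the EnergyNet architecture, its layer sizes, and the dsm_loss helper are illustrative assumptions rather than specifics from the source.

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Toy scalar energy E(x; theta); any network returning one value per sample would do."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 128), nn.SiLU(), nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x).squeeze(-1)          # shape: (batch,)

def dsm_loss(energy, x, sigma):
    """E_{p(x)} E_{q_sigma(x_tilde|x)} || x - x_tilde + sigma^2 * grad_{x_tilde} E(x_tilde) ||^2."""
    x_tilde = (x + sigma * torch.randn_like(x)).requires_grad_(True)   # x_tilde ~ N(x, sigma^2 I)
    grad_E = torch.autograd.grad(energy(x_tilde).sum(), x_tilde, create_graph=True)[0]
    residual = x - x_tilde + sigma ** 2 * grad_E                       # regress toward the denoising direction
    return residual.flatten(1).pow(2).sum(dim=1).mean()
```

Note that the loss value itself only requires first derivatives of E; create_graph=True is needed solely so that the subsequent parameter update can backpropagate through ∇ₓ̃ E.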

2. Multi-scale Denoising Score Matching (MDSM) in High Dimensions

In high-dimensional settings, where data concentrates in thin shells at distance ~√dσ from the data manifold (for ambient dimension d), score estimation at a single noise level only probes a narrow region around the manifold and typically under-covers the generative support necessary for sample synthesis. MDSM addresses this by training with a schedule of multiple noise levels {σ₁, …, σ_K}:

L(\theta) = \sum_{\sigma \in \{\sigma_1, \dots, \sigma_K\}} \mathbb{E}_{p(x)\, q_\sigma(\tilde{x} \mid x)} \left[ l(\sigma) \left\| x - \tilde{x} + \sigma_0^2 \nabla_{\tilde{x}} E(\tilde{x}; \theta) \right\|^2 \right],

where l(σ) is a decreasing function (e.g., l(σ) = 1/σ²) that compensates for scale-induced variance in the loss term, and σ₀ is a single fixed reference noise scale multiplying the energy gradient while the corruption level σ ranges over the schedule. This aggregates training signals from broad to fine scales, ensuring the model sees a representative cross-section of the ambient space and learns the data geometry over the entire support.
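
A hedged sketch of how the multi-scale sum could be assembled, reusing the same residual as above; the weighting l(σ) = 1/σ² and the geometric σ grid are illustrative choices, and sigma0 plays the role of the fixed reference scale σ₀ in the formula.

```python
import torch

def mdsm_loss(energy, x, sigmas, sigma0):
    """Sum over noise levels of l(sigma) * || x - x_tilde + sigma0^2 * grad E(x_tilde) ||^2,
    with l(sigma) = 1 / sigma^2 as an assumed variance-compensating weight."""
    total = 0.0
    for sigma in sigmas:
        x_tilde = (x + sigma * torch.randn_like(x)).requires_grad_(True)
        grad_E = torch.autograd.grad(energy(x_tilde).sum(), x_tilde, create_graph=True)[0]
        residual = x - x_tilde + sigma0 ** 2 * grad_E
        total = total + residual.flatten(1).pow(2).sum(dim=1).mean() / sigma ** 2
    return total

# Example schedule: a geometric grid of K = 10 noise levels (illustrative values only).
sigmas = [float(s) for s in torch.logspace(-2, 0, steps=10)]
```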

3. Empirical and Quantitative Evidence

Extensive experiments on datasets such as MNIST, Fashion MNIST, CelebA, and CIFAR-10 demonstrate that single-noise-level DSM fails to produce high-quality samples, with the learned score effectively being accurate only in an annulus surrounding the data manifold. The MDSM approach, by contrast, achieves visually superior and diverse synthesis, with quantitative performance matching or surpassing GANs. For instance, on CIFAR-10, MDSM-trained EBMs achieve an Inception score of 8.31 and FID of 31.7, comparable to modern GAN baselines. Furthermore, image inpainting experiments (clamping known parts of an image and sampling the rest) show that MDSM-trained models yield coherent and realistic completions, directly leveraging the model’s learned score field.

The following table summarizes comparative generation metrics:

Dataset     MDSM (Inception score)        MDSM (FID)   Best GAN (FID)
CIFAR-10    8.31                          31.7         ≈25
CelebA      high quality (qualitative)    N/A          N/A

(GAN numbers are referenced only comparatively; exact figures vary with the baseline architecture.)

4. Comparative Analysis with Score-based Models and GANs

MDSM-trained EBMs are set apart from both GANs and score-based NCSN models:

  • Against GANs: MDSM delivers comparable visual and quantitative performance, with the added benefit of providing explicit (unnormalized) density estimates—a property lacking in GANs and critical for tasks such as anomaly detection and density-based reasoning. The EBM naturally supports inpainting and restoration tasks by sampling under partial evidence.
  • Against score-based models (e.g., NCSN): Score-based models like NCSN estimate ∇ₓ̃ log q₍σ₎(𝑥̃) directly, requiring a network that ingests both the noisy data and the noise magnitude σ. MDSM instead learns a single scalar potential whose gradient supplies the score, yielding a more parsimonious model whose score field is conservative by construction and consistent with the EBM formulation. However, NCSN's approach offers flexibility when explicit noise-level conditioning is needed, and performance differences may trace back to approximation steps or to whether the score is constrained to be the gradient of a potential. A schematic contrast of the two parameterizations is sketched after this list.
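
To make the parameterization difference concrete, here is a schematic contrast; both modules are simplified stand-ins (names and layer sizes are assumptions), not the architectures used in the cited work.

```python
import torch
import torch.nn as nn

# NCSN-style: a network outputs the score vector directly, conditioned on the noise level.
class ScoreNet(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim + 1, 128), nn.SiLU(), nn.Linear(128, dim))

    def forward(self, x_tilde, sigma):
        sigma_col = torch.full((x_tilde.shape[0], 1), float(sigma))   # noise level as an extra input
        return self.net(torch.cat([x_tilde, sigma_col], dim=1))       # unconstrained vector field

# MDSM-style EBM: a single scalar potential; the score is its negative gradient,
# so the resulting vector field is conservative by construction.
def ebm_score(energy, x_tilde):
    x_tilde = x_tilde.detach().requires_grad_(True)
    return -torch.autograd.grad(energy(x_tilde).sum(), x_tilde)[0]
```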

5. Applications, Efficiency, and Broader Implications

The MDSM framework enables a spectrum of applications:

  • Sampling and Generation: Data generation from noise via annealed Langevin dynamics, yielding high-fidelity samples.
  • Denoising and Restoration: Direct gradient-based denoising and inpainting, exploiting the vector field structure of the learned score (a one-step denoising sketch follows this list).
  • Density Estimation: Availability of (unnormalized) density evaluations enables anomaly detection and plug-in Bayesian reasoning.
  • Computational Efficiency: By eliminating the need for explicit MCMC from the model distribution in training (required for maximum likelihood in EBMs), MDSM accelerates training by approximately an order of magnitude. This makes high-dimensional EBM training tractable on contemporary hardware.
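
As an example of the restoration use case, a one-step denoiser follows from the denoising-oracle property noted in Section 1; inpainting can then be approximated by applying such updates only to unknown pixels while the observed ones stay clamped. The helper below is a hypothetical sketch, not the authors' exact procedure.

```python
import torch

def denoise_one_step(energy, x_tilde, sigma):
    """The DSM objective drives sigma^2 * grad E(x_tilde) toward x_tilde - x,
    so x_tilde - sigma^2 * grad E(x_tilde) is a one-step estimate of the clean x."""
    x_tilde = x_tilde.detach().requires_grad_(True)
    grad_E = torch.autograd.grad(energy(x_tilde).sum(), x_tilde)[0]
    return (x_tilde - sigma ** 2 * grad_E).detach()
```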

The multi-scale approach is particularly effective at overcoming measure concentration in high dimensions, where local learning at a single noise shell fails to “cover” the generative data geometry. By training scores at multiple radii, the model resolves both local noise-driven variation and global structure—key for robust generative modeling.

6. Theoretical and Methodological Notes

MDSM is motivated by the observation that the DSM loss aligns the energy gradient with the “return path” from noisy samples to the data. In high d, single-scale DSM sees only the thin shell at radius ∼√d σ, neglecting all regions farther from or closer to the manifold. The multi-scale extension corrects this by training across a range of σ, with weights compensating for variance scaling. This aligns the approach with the theoretical results from recent score-based models yet retains the benefits of energy-based parameterization.
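
A quick numerical check of the thin-shell concentration this argument relies on; the dimension (CIFAR-10-sized) and the value of σ are arbitrary illustration choices.

```python
import torch

d, sigma = 32 * 32 * 3, 0.5                      # CIFAR-10-sized ambient dimension, arbitrary sigma
radii = (sigma * torch.randn(10_000, d)).norm(dim=1)
print(f"mean radius: {radii.mean().item():.2f}  (sqrt(d)*sigma = {d ** 0.5 * sigma:.2f})")
print(f"relative spread: {(radii.std() / radii.mean()).item():.4f}")   # ~1/sqrt(2d), i.e. a thin shell
```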

The loss function for a Gaussian kernel is explicitly

\left\| x - \tilde{x} + \sigma^2 \nabla_{\tilde{x}} E(\tilde{x}; \theta) \right\|^2,

with generalization to the multi-scale regime by summing and weighting over noise scales.

7. Practical Guidance

For implementation, use the following sequence:

  1. Define a set of σ spanning the range needed for the dataset (e.g., from small to large).
  2. At each iteration, sample a σ, corrupt data x with noise 𝑥̃ ~ 𝒩(x, σ²I).
  3. Compute the loss

    l(σ) * || x – 𝑥̃ + σ² ∇ₓ̃ E(𝑥̃; θ) ||²

where l(σ) = 1/σ² or similar.

  4. Backpropagate and update the model parameters.
  5. For generation, initialize samples with standard normal noise and run annealed Langevin dynamics from high to low σ, using the learned energy gradients to iteratively denoise (a minimal sketch of steps 2–5 follows this list).
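
The following is a minimal sketch of one training iteration (steps 2–4) and of annealed Langevin generation (step 5); the step sizes, step counts, and the step-size rescaling heuristic are assumptions for illustration rather than values from the source.

```python
import torch

def training_step(energy, optimizer, x, sigmas, sigma0):
    """One stochastic step of the multi-scale DSM objective (cf. mdsm_loss above)."""
    sigma = sigmas[torch.randint(len(sigmas), (1,)).item()]            # step 2: sample a noise level
    x_tilde = (x + sigma * torch.randn_like(x)).requires_grad_(True)   # corrupt the data
    grad_E = torch.autograd.grad(energy(x_tilde).sum(), x_tilde, create_graph=True)[0]
    loss = (x - x_tilde + sigma0 ** 2 * grad_E).flatten(1).pow(2).sum(1).mean() / sigma ** 2
    optimizer.zero_grad()
    loss.backward()                                                    # step 4: update parameters
    optimizer.step()
    return loss.item()

@torch.no_grad()
def annealed_langevin(energy, shape, sigmas, n_steps=20, eps=2e-5):
    """Step 5: anneal from the largest to the smallest sigma, running Langevin updates at each level."""
    x = torch.randn(shape)                                             # initialize from standard normal noise
    for sigma in sorted(sigmas, reverse=True):
        step = eps * (sigma / min(sigmas)) ** 2                        # common step-size rescaling heuristic
        for _ in range(n_steps):
            with torch.enable_grad():
                xg = x.clone().requires_grad_(True)
                grad_E = torch.autograd.grad(energy(xg).sum(), xg)[0]
            x = x - step * grad_E + (2 * step) ** 0.5 * torch.randn_like(x)
    return x
```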

MDSM's balance of expressivity, sample synthesis quality, and computational tractability makes it a strong baseline for high-dimensional EBM training and a strong alternative to GANs and score-based models in real-world applications involving image synthesis, restoration, and density-based anomaly detection (Li et al., 2019).
