Denoising-EBM: Energy-Based Models

Updated 16 April 2026

Denoising-EBM is a class of energy-based models that employs denoising score matching to learn unnormalized densities across multiple noise scales.
It decomposes energy into texture and semantic components, enabling high-fidelity synthesis and improved adversarial robustness.
Advanced sampling techniques like two-stage Langevin MCMC and moment-matching Gibbs sampling facilitate efficient inference in high-dimensional settings.

Denoising-EBM refers to a class of energy-based models (EBMs) in which the energy function is constructed or trained via denoising principles, specifically leveraging denoising score matching or denoising autoencoders. These models have exhibited state-of-the-art performance in generative modeling, adversarial purification, image denoising, and out-of-distribution (OOD) detection, and can be trained at scale in high-dimensional data regimes. The core mathematical foundation links the energy function with the denoising operator, establishing a bridge between EBMs and score-based generative models.

1. Theoretical Foundation: Energy-Based Models and Denoising Score Matching

Energy-based models define a density up to a normalizing constant:

$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z_\theta}$

where $E_\theta(x)$ is a learned scalar energy function and $Z_\theta$ is the (typically intractable) partition function. Training EBMs by maximum likelihood is computationally prohibitive in high dimensions because estimating the gradient of $\log Z_\theta$ requires sampling from the model.

Denoising score matching (DSM) circumvents these difficulties by regressing the model's score (the gradient of the log density, i.e. the negative energy gradient) onto the score of a known corrupted data density. Specifically, for data corrupted by isotropic Gaussian noise $q_\sigma(\tilde{x}|x)=\mathcal{N}(\tilde{x};x,\sigma^2 I)$, whose score is $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x}|x) = (x-\tilde{x})/\sigma^2$, the DSM objective at noise level $\sigma$ is:

$J_{\mathrm{DSM}}(\theta) = \mathbb{E}_{p_{\mathrm{data}}(x)q_\sigma(\tilde{x}|x)} \left[ \left\|\nabla_{\tilde{x}} E_\theta(\tilde{x}) - \frac{\tilde{x}-x}{\sigma^2}\right\|^2 \right]$

Minimizing this loss ensures that $-\nabla_{\tilde{x}} E_\theta(\tilde{x})$ learns the score of the corrupted data distribution $p_\sigma(\tilde{x})$ (Li et al., 2019), providing a tractable and stable alternative to maximum likelihood and forming the basis for denoising-EBM training.
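As a rough illustration of the objective, the sketch below estimates the DSM loss by Monte Carlo for a one-dimensional toy problem. It regresses the energy gradient onto $(\tilde{x}-x)/\sigma^2$, which is the negative of the Gaussian corruption score. The quadratic energy, the function names `energy_grad` and `dsm_loss`, and all numeric settings are assumptions made for this example, not part of any cited method.

```python
import random

def energy_grad(x, mean, var):
    """Gradient of the toy quadratic energy E(x) = (x - mean)^2 / (2 * var)."""
    return (x - mean) / var

def dsm_loss(data, sigma, mean, var, rng):
    """Monte Carlo estimate of the DSM objective at a single noise level."""
    total = 0.0
    for x in data:
        x_noisy = x + sigma * rng.gauss(0.0, 1.0)  # draw from q_sigma(x_noisy | x)
        target = (x_noisy - x) / sigma**2          # negative score of q_sigma
        residual = energy_grad(x_noisy, mean, var) - target
        total += residual**2
    return total / len(data)

rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # toy p_data = N(0, 1)
sigma = 0.5

# The corrupted density is N(0, 1 + sigma^2), so among these candidates
# the energy with var = 1.25 should give the lowest loss.
losses = {var: dsm_loss(data, sigma, 0.0, var, random.Random(1))
          for var in (0.5, 1.25, 3.0)}
best_var = min(losses, key=losses.get)
```

In this toy setting the minimizer recovers the variance of the noisy distribution rather than that of the clean data, which is the sense in which the energy learns $p_\sigma(\tilde{x})$.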

Multi-scale denoising extends DSM over a set of noise levels $\{\sigma_j\}$, addressing measure concentration and ensuring the learned score field is accurate over a broad region of sample space, which is critical in high-dimensional settings (Li et al., 2019).

2. Model Construction: From Denoising to Structured Energy Functions

Recent work has extended the denoising-EBM paradigm to build richer and more expressive energy functions by leveraging denoising architectures. In "How to Construct Energy for Images? Denoising Autoencoder Can Be Energy Based Model" (Zeng, 2023), the energy is explicitly decomposed into:

Texture Energy ( $E_{\mathrm{tex}}$ ): Defined as the scaled reconstruction error from a DAE or U-Net denoiser operating on Gaussian-corrupted images.

Semantic Energy ( $E_{\mathrm{sem}}$ ): Defined in the latent DAE code space, measuring mismatch between semantic decodings passed through denoising.

The total joint energy combines $E_{\mathrm{tex}}$ and $E_{\mathrm{sem}}$, with the semantic term tied to the DAE encoder, enabling a modeling formalism that captures both global structure (semantics) and local detail (texture) (Zeng, 2023).

This energy parameterization yields a vector-valued energy output prior to reduction by squared norm, thus enabling high expressivity and preservation of local details—distinct from scalar-output EBMs.
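The idea of a vector-valued residual reduced to a scalar by a squared norm can be sketched as follows. This is only an illustration: `shrink_denoiser` is a hypothetical stand-in for a trained DAE or U-Net, and the `scale` argument is an assumed placeholder for whatever scaling the actual model uses.

```python
def texture_energy(x, denoiser, scale=1.0):
    """Scalar energy: scaled squared norm of the vector-valued denoising residual."""
    residual = [xi - di for xi, di in zip(x, denoiser(x))]  # vector output
    return scale * sum(r * r for r in residual)            # reduction to a scalar

def shrink_denoiser(x, factor=0.8):
    """Hypothetical stand-in for a trained denoiser: shrinks inputs toward 0."""
    return [factor * xi for xi in x]

clean_like = [0.0, 0.0, 0.0]   # a fixed point of the toy denoiser
noisy_like = [1.0, -2.0, 0.5]  # an input the denoiser would change

e_clean = texture_energy(clean_like, shrink_denoiser)
e_noisy = texture_energy(noisy_like, shrink_denoiser)
```

Inputs that the denoiser leaves unchanged receive zero energy, while inputs it has to move receive energy proportional to the squared size of the correction.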

3. Training Algorithms and Optimization

The denoising-EBM is optimized by maximizing a variational lower bound on the log density of noisy (corrupted) data. This corresponds to multi-scale DSM, with the expectations under the data and model distributions approximated by short-run Langevin MCMC (Zeng, 2023; Li et al., 2019).

For each noise level $\sigma_j$ in a geometric sequence, the loss incorporates the denoising error at that scale.
Texture-EBM and Semantic-EBM components are trained alternately, with the latter also including a direct reconstruction loss.
The model distribution expectations are estimated via a two-stage MCMC: short-run sampling in latent space, followed by refinement in pixel space.
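A geometric sequence of noise levels, and one way to combine per-level losses, might look like the sketch below. The helper names `geometric_sigmas` and `multiscale_loss`, the endpoints 1.0 and 0.01, and the $\sigma^2$ weighting are all assumptions chosen for illustration; the cited works may use different schedules and weights.

```python
def geometric_sigmas(sigma_max, sigma_min, num_levels):
    """Noise levels from sigma_max down to sigma_min with a constant ratio."""
    if num_levels < 2:
        raise ValueError("need at least two noise levels")
    ratio = (sigma_min / sigma_max) ** (1.0 / (num_levels - 1))
    return [sigma_max * ratio**j for j in range(num_levels)]

def multiscale_loss(per_level_losses, sigmas):
    """Average of per-level DSM losses, each weighted by sigma^2 (assumed weighting)."""
    weighted = [s**2 * loss for s, loss in zip(sigmas, per_level_losses)]
    return sum(weighted) / len(weighted)

sigmas = geometric_sigmas(1.0, 0.01, 5)
# Placeholder per-level losses; in practice each would come from a DSM estimate.
total = multiscale_loss([1.0] * len(sigmas), sigmas)
```

Weighting by $\sigma^2$ is a common way to keep the contributions of different noise levels on a comparable scale, since the raw DSM residual grows like $1/\sigma^2$ as the noise shrinks.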

Vector outputs of the denoising network are aggregated via squared norms, forming a class of energies capable of modeling highly non-conservative residual fields while retaining the favorable optimization properties of score-matching.

4. Sampling and Inference Procedures

Denoising-EBMs employ sampling algorithms that leverage the learned score field for efficient generation and purification.

Two-Stage Langevin MCMC: First, latent codes $z$ are sampled with Langevin dynamics under the semantic energy to obtain high-level structure. Then, starting from the decoded image, standard pixel-level Langevin refinement under $E_{\mathrm{tex}}$ adds realistic texture, as in:

$x_{t+1} = x_t - \frac{\epsilon}{2}\nabla_x E_{\mathrm{tex}}(x_t) + \sqrt{\epsilon}\,\eta_t, \quad \eta_t \sim \mathcal{N}(0, I)$

This decouples global and local structure, accelerating mixing and improving sample fidelity (Zeng, 2023).
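The two-stage procedure can be sketched with toy quadratic energies in place of learned networks. Everything here is hypothetical: the energies, the `decode` function, the step sizes, and the step counts are chosen only to make the coarse-then-fine structure visible, and do not reproduce the settings of Zeng (2023).

```python
import math
import random

def langevin(x, grad_energy, step, n_steps, rng):
    """Unadjusted Langevin: x <- x - (step / 2) * grad E(x) + sqrt(step) * noise."""
    x = list(x)
    for _ in range(n_steps):
        g = grad_energy(x)
        x = [xi - 0.5 * step * gi + math.sqrt(step) * rng.gauss(0.0, 1.0)
             for xi, gi in zip(x, g)]
    return x

def grad_semantic(z):
    """Gradient of the toy latent energy E_sem(z) = ||z - 2||^2 / 2."""
    return [zi - 2.0 for zi in z]

def grad_texture(x):
    """Gradient of the toy pixel energy E_tex(x) = ||x - 2||^2 / (2 * 0.01)."""
    return [(xi - 2.0) / 0.01 for xi in x]

def decode(z):
    """Hypothetical decoder: copies the scalar latent into 4 'pixels'."""
    return [z[0]] * 4

def two_stage_sample(rng, latent_steps=200, pixel_steps=100):
    # Stage 1: sample coarse structure in the low-dimensional latent space.
    z = langevin([rng.gauss(0.0, 1.0)], grad_semantic, 0.1, latent_steps, rng)
    # Stage 2: refine the decoded sample in pixel space with a smaller step.
    return langevin(decode(z), grad_texture, 0.001, pixel_steps, rng)

sample = two_stage_sample(random.Random(0))
```

The latent chain uses a large step on a wide, low-dimensional energy, while the pixel chain uses a small step on a sharply peaked energy, which is the intuition behind running the cheap stage first.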

Adversarial Purification: A DSM-trained EBM's deterministic score updates can rapidly project adversarial examples back to the data manifold within 10–50 steps, using only the output of the score network, rather than the thousands of MCMC steps required by earlier EBM purification methods (Yoon et al., 2021). Gaussian noise injection before purification further improves robustness and enables certified $\ell_2$ smoothing.
Moment-Matching Denoising Gibbs Sampling: For DSM-trained EBMs, a pseudo-Gibbs chain leveraging analytic first and second moments of the clean distribution recovers sharp samples and circumvents the mismatch between learned noisy and clean densities. This moment-matching is practically implemented using the Tweedie mean for conditional expectations (Zhang et al., 2023).
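The Tweedie mean mentioned above gives the posterior mean of the clean sample from the noisy one, $\mathbb{E}[x \mid \tilde{x}] = \tilde{x} - \sigma^2 \nabla_{\tilde{x}} E_\theta(\tilde{x})$ when $E_\theta$ models the noisy density. The sketch below checks this on a one-dimensional Gaussian, where the exact answer is known in closed form. The function name `tweedie_mean` and the toy parameters are assumptions for illustration, and only the first moment is shown, not the second-moment matching used by the full sampler.

```python
def tweedie_mean(x_noisy, sigma, grad_energy):
    """Posterior mean E[x | x_noisy] = x_noisy - sigma^2 * grad E(x_noisy)."""
    return x_noisy - sigma**2 * grad_energy(x_noisy)

# Toy setting: clean data ~ N(m, s2), so the noisy density is N(m, s2 + sigma^2).
m, s2, sigma = 1.0, 4.0, 1.0

def grad_noisy_energy(x):
    """Exact energy gradient of the noisy Gaussian density."""
    return (x - m) / (s2 + sigma**2)

x_noisy = 3.0
denoised = tweedie_mean(x_noisy, sigma, grad_noisy_energy)

# Closed-form Gaussian posterior mean for comparison.
exact = (s2 * x_noisy + sigma**2 * m) / (s2 + sigma**2)
```

The noisy observation is pulled toward the data mean by an amount that depends on the ratio of noise variance to total variance.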

5. Empirical Performance and Applications

Denoising-EBMs demonstrate state-of-the-art or highly competitive results in several image modeling and robustness tasks:

Task	Denoising-EBM Result	Competing EBM Result
CIFAR-10 IS	7.86	IGEBM: 6.02, JEM: 8.76
CIFAR-10 FID	21.24	MDSM: 31.7, IGEBM: 40.58
CelebA FID (64x64)	14.1	BiDVL: 17.24
OOD AUROC (SVHN)	0.99	Others: 0.63–0.83

On adversarial purification, the denoising-EBM classifier pipeline achieves higher robust accuracy on CIFAR-10 under $\ell_\infty$ attacks with $\epsilon = 8/255$ (PGD blind) than both long-run MCMC-EBMs and adversarial training (Yoon et al., 2021).

For OOD detection, the texture-based energy function achieves nearly perfect AUROC on SVHN and interpolations (0.99), substantially exceeding previous EBM baselines (Zeng, 2023).
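For reference, AUROC for energy-based OOD detection is the probability that a randomly chosen OOD input receives a higher energy than a randomly chosen in-distribution input. A minimal sketch, using made-up energy values rather than any reported results:

```python
def auroc(in_scores, out_scores):
    """Fraction of (OOD, in-distribution) pairs where the OOD score is higher."""
    wins = 0.0
    for o in out_scores:
        for i in in_scores:
            if o > i:
                wins += 1.0
            elif o == i:
                wins += 0.5  # ties count as half
    return wins / (len(in_scores) * len(out_scores))

in_energy = [0.1, 0.3, 0.2, 0.4]  # made-up energies for in-distribution inputs
ood_energy = [0.9, 0.35, 1.2]     # made-up energies for OOD inputs
score = auroc(in_energy, ood_energy)
```

A value of 1.0 means every OOD input has higher energy than every in-distribution input, and 0.5 corresponds to chance.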

In high-dimensional synthesis, multi-scale DSM-trained EBMs reach Inception/FID scores competitive with GAN-based models and provide explicit density estimates (Li et al., 2019).

6. Technical and Practical Insights

Energy Decomposition: Separating energy into semantic and texture terms aligns model optimization with both coarse structure and fine detail, improving both sample quality and mode coverage (Zeng, 2023).
Multi-scale Training: Integrating denoising losses over multiple noise scales prevents mode-dropping and concentration away from the data manifold (Li et al., 2019).
Vector Output Energies: Vector-valued denoising errors preserve high-frequency texture via U-Net skip connections and allow richer score fields than strictly scalar, conservative EBMs (Zeng, 2023).
Efficient Sampling: Two-phase MCMC (semantic→texture) dramatically reduces mixing time; moment-matching Gibbs sampling further improves recovery of the clean distribution (Zhang et al., 2023).
Robustness: Randomized Gaussian injection during purification increases adversarial robustness and facilitates certified $\ell_2$ smoothing (Yoon et al., 2021).
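The purification recipe (optional Gaussian injection followed by a few deterministic steps along the negative energy gradient) can be sketched as below. The toy energy, whose "data manifold" is a single point, and the names `purify` and `grad_energy` are hypothetical; step size, step count, and noise level are illustrative choices, not the settings of Yoon et al. (2021).

```python
import random

def purify(x, grad_energy, step, n_steps, noise_std=0.0, rng=None):
    """Optionally inject Gaussian noise, then follow -grad E deterministically."""
    if noise_std > 0.0:
        rng = rng or random.Random()
        x = [xi + noise_std * rng.gauss(0.0, 1.0) for xi in x]
    for _ in range(n_steps):
        g = grad_energy(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]  # no noise in the update
    return x

def grad_energy(x):
    """Gradient of the toy energy E(x) = ||x - 1||^2 / 2, minimized at (1, 1)."""
    return [xi - 1.0 for xi in x]

adversarial = [1.3, 0.6]  # a perturbed input away from the toy manifold
purified = purify(adversarial, grad_energy, step=0.2, n_steps=20,
                  noise_std=0.1, rng=random.Random(0))
```

Because the updates are deterministic after the initial injection, a few tens of steps are enough in this toy case, in contrast to a long stochastic chain.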

Boltzmann Machines and denoising autoencoders have a long-standing relationship to denoising-EBMs. Deep Gaussian–Bernoulli Boltzmann Machines can outperform autoencoders in high-noise denoising tasks, suggesting the general utility of energy-based approaches even in patchwise settings (Cho, 2013).
Plug-and-play methods for denoising in scientific imaging (e.g., beam micrographs) combine explicit noise modeling with ADMM-style inversion and black-box DAE denoisers, further demonstrating the flexibility of the denoising-based energy approach across domains (Peng et al., 2022).
In protein folding, a DSM-trained EBM operates on distance matrices, bypassing equivariant coordinate modeling and enabling fully differentiable, end-to-end folding pipelines competitive with state-of-the-art protocols (Wu et al., 2021).

Denoising-EBMs thus form a versatile, theoretically sound, and empirically strong modeling paradigm for high-dimensional generative modeling, adversarial purification, scientific imaging, and biomolecular structure prediction. The key technical motifs are learned score/energy fields via multi-scale denoising objectives, efficient and robust sampling, and the use of denoising architectures as energy surrogates.