
Energy-Based Models (EBMs)

Updated 6 July 2025
  • Energy-Based Models (EBMs) are probabilistic models that assign lower energy to high-probability data configurations using neural network parameterizations.
  • They rely on MCMC techniques, such as Langevin dynamics, to draw negative samples and estimate likelihood gradients without computing the intractable partition function.
  • EBMs are applied in image generation, OOD detection, adversarial robustness, and continual learning, offering strong mode coverage and robustness relative to traditional likelihood models and GANs.

Energy-Based Models (EBMs) are a class of probabilistic models that define a probability distribution over data configurations via a scalar energy function. The central principle underlying EBMs is the association of lower energies with higher probabilities, with densities expressed in Boltzmann–Gibbs form as $p_\theta(x) = \exp(-E_\theta(x)) / Z_\theta$, where $E_\theta(x)$ is a learnable energy function, typically parameterized by a neural network, and $Z_\theta$ is the partition function that ensures the model can be interpreted probabilistically. Unlike explicit or normalized models, EBMs do not require tractable computation of $Z_\theta$, which grants considerable modeling flexibility at the cost of significant computational and algorithmic challenges, particularly during training and sampling (1903.08689).

1. Probabilistic Formulation and Energy Function Parameterization

In EBMs, the energy function $E_\theta(x)$ defines an unnormalized log-probability landscape over the observation space. This function can be highly expressive and is commonly implemented by deep neural networks, allowing for the capture of complex, high-dimensional, and multimodal data distributions. The key density formula is:

$$p_\theta(x) = \frac{\exp(-E_\theta(x))}{Z(\theta)}, \qquad Z(\theta) = \int \exp(-E_\theta(x))\, dx$$

This modeling framework enables the direct incorporation of domain knowledge or compositional structure via the summation of energy functions—properties that have no direct analogue in feedforward generator models. For high-dimensional domains such as ImageNet, CIFAR-10, or robotic trajectory spaces, the approach provides a probabilistic model that can theoretically cover all modes of the data distribution (1903.08689).
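
As a concrete illustration, the following is a minimal sketch, assuming PyTorch, of how such an energy function might be parameterized. The `EnergyNet` class, the layer widths, and the flattened-image input dimensionality are illustrative choices rather than the architecture used in (1903.08689).

```python
import torch
import torch.nn as nn

class EnergyNet(nn.Module):
    """Maps an input x to a scalar energy E_theta(x); lower energy means
    higher unnormalized probability exp(-E_theta(x))."""
    def __init__(self, in_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, hidden), nn.SiLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(-1)  # one energy value per example

# The unnormalized log-density is -E_theta(x); Z(theta) is never computed.
energy = EnergyNet(in_dim=32 * 32 * 3)   # e.g., flattened 32x32 RGB inputs
x = torch.randn(8, 32 * 32 * 3)
unnormalized_log_prob = -energy(x)       # shape: (8,)
```

For image domains the energy network would in practice be convolutional, with the normalization and regularization described in the next section.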

2. Training Procedures and Algorithmic Strategies

Maximum Likelihood and MCMC-Based Sampling

The log-likelihood maximization objective leads to gradients of the form:

$$\nabla_\theta \mathcal{L}_{ML} = \mathbb{E}_{x^+ \sim p_D}\left[\nabla_\theta E_\theta(x^+)\right] - \mathbb{E}_{x^- \sim p_\theta}\left[\nabla_\theta E_\theta(x^-)\right]$$

Here, $x^+$ are data samples and $x^-$ are negative samples from the current model, which must be efficiently generated due to the intractable $Z(\theta)$. Gradient-based Markov Chain Monte Carlo (MCMC) methods are employed, specifically Langevin dynamics:

$$x^{(k)} = x^{(k-1)} - \frac{\lambda}{2} \nabla_x E_\theta\left(x^{(k-1)}\right) + \omega^{(k)}, \qquad \omega^{(k)} \sim \mathcal{N}(0, \lambda)$$

As $K \to \infty$ (with small $\lambda$), this process samples from $p_\theta(x)$. For scalability, persistent chains and a replay buffer are used, reinitializing MCMC chains from previously sampled states to improve mixing and mode coverage. Spectral normalization is applied to the energy function to bound its Lipschitz constant and stabilize the gradients, supplemented by a weak $L_2$ penalty on the energy output to guarantee integrability and numerical stability (1903.08689).
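
The sampler and the gradient estimate above can be sketched as follows, assuming PyTorch and the illustrative `EnergyNet` from Section 1. The step count and step size are placeholder values, and the noise scale is tied to the step size exactly as in the update formula (implementations often tune the two separately).

```python
import torch

def langevin_sample(energy, x_init, n_steps=60, step_size=1e-2):
    """Langevin dynamics: x_k = x_{k-1} - (lambda/2) * grad_x E(x_{k-1}) + omega_k,
    with omega_k ~ N(0, lambda)."""
    x = x_init.clone().detach()
    for _ in range(n_steps):
        x.requires_grad_(True)
        grad = torch.autograd.grad(energy(x).sum(), x)[0]
        x = (x - 0.5 * step_size * grad
             + (step_size ** 0.5) * torch.randn_like(x)).detach()
    return x

def ml_surrogate_loss(energy, x_pos, x_neg):
    """Differentiating this scalar w.r.t. theta reproduces the gradient above:
    E[grad_theta E(x+)] - E[grad_theta E(x-)]."""
    return energy(x_pos).mean() - energy(x_neg).mean()
```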

Training and Regularization Workflow

The typical workflow for large-scale EBM training includes:

  • Defining $E_\theta(x)$ as a multi-layer neural network (with spectral normalization)
  • Using persistent MCMC with sample replay buffers for efficient negative sampling
  • Applying spectral normalization and $L_2$ regularization during optimization
  • Updating $\theta$ via stochastic gradient descent using the sample-based gradient estimate

This enables such models to be trained on challenging domains, including ImageNet (at both 32×32 and 128×128 resolution), CIFAR-10, and robotic hand trajectory data.
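
Putting these pieces together, the sketch below shows one schematic training step with a persistent replay buffer and the $L_2$ energy penalty. It assumes the `EnergyNet` and `langevin_sample` sketches above; the re-initialization probability and penalty weight are illustrative, and spectral normalization would additionally be applied to the network's layers (e.g., via `torch.nn.utils.spectral_norm`).

```python
import random
import torch

def train_step(energy, optimizer, x_pos, replay_buffer, reinit_prob=0.05, alpha=1.0):
    """One maximum-likelihood update with persistent-chain negative sampling
    and an L2 penalty on energy magnitudes (hyperparameters are illustrative)."""
    batch = x_pos.shape[0]

    # Draw negatives from the replay buffer, re-initializing a small fraction
    # of chains from noise; a real implementation would also cap the buffer size.
    if len(replay_buffer) >= batch:
        x_neg = torch.stack(random.sample(replay_buffer, batch))
        reinit = torch.rand(batch) < reinit_prob
        x_neg[reinit] = torch.rand_like(x_neg[reinit])
    else:
        x_neg = torch.rand_like(x_pos)

    x_neg = langevin_sample(energy, x_neg)   # short MCMC run from buffer states
    replay_buffer.extend(x_neg.unbind(0))    # persist chains for later reuse

    e_pos, e_neg = energy(x_pos), energy(x_neg)
    loss = e_pos.mean() - e_neg.mean() + alpha * (e_pos ** 2 + e_neg ** 2).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```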

3. Implicit Sample Generation and Unique Capabilities

A distinguishing feature of EBMs is implicit generation: samples are generated not by a deterministic feedforward process but by MCMC-based optimization toward regions of low energy. This permits:

  • Compositionality: Energy functions can be summed to combine distributions, effectively modeling a product of experts. For instance, composing separate models for shape and color enables the generation of images with novel attribute combinations, without an explicit generator architecture for each factor (see the sketch after this list) (1903.08689).
  • Corrupt Image Reconstruction and Inpainting: When Langevin sampling is initiated from corrupted or masked images, the iterative process naturally denoises or inpaints, moving the input toward a mode of the data distribution. This approach is not tied to a single deterministic completion and produces diverse, plausible reconstructions (1903.08689).
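
As an illustration of the compositionality point above, summing independently trained energy functions corresponds to a product of experts over the underlying unnormalized densities. In the sketch below, `shape_energy` and `color_energy` are hypothetical factor models, and sampling reuses the `langevin_sample` sketch from Section 2.

```python
def composed_energy(x, energy_fns):
    """Summing energies multiplies the unnormalized densities:
    exp(-sum_i E_i(x)) = prod_i exp(-E_i(x))  (a product of experts)."""
    return sum(e(x) for e in energy_fns)

# Hypothetical factor models, e.g., one capturing shape and one capturing color:
# shape_energy, color_energy = EnergyNet(...), EnergyNet(...)
# x0 = torch.rand(16, 32 * 32 * 3)
# samples = langevin_sample(lambda x: composed_energy(x, [shape_energy, color_energy]), x0)
```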

4. Empirical Performance and Comparison to Likelihood Models and GANs

EBMs have demonstrated strong empirical results on datasets such as CIFAR-10 and ImageNet. They consistently yield higher sample fidelity than other likelihood models (such as PixelCNN, PixelIQN) and closely approach state-of-the-art GANs in image quality. Notably, EBMs show superior mode coverage: because negative sampling via MCMC encourages the model to allocate nonzero likelihood across all observed modes, EBMs effectively avoid mode collapse, a persistent problem in GAN-based models. This results in favorable Inception and FID scores and allows for robust differentiation between in-distribution and out-of-distribution data (1903.08689).

5. Advanced Applications: Out-of-Distribution Detection, Robustness, and Continual Learning

Out-of-Distribution (OOD) Detection

As the negative sampling mechanism forces the energy function to assign high energies (low likelihoods) outside the data manifold, EBMs excel at OOD detection, outperforming typical likelihood models in identifying atypical or adversarial inputs.
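
A minimal sketch of this usage, under the same PyTorch assumptions as above: the energy value itself serves as the OOD score, and the detection threshold (a 95th-percentile calibration on held-out in-distribution data is an illustrative choice, not the paper's protocol) flags high-energy inputs.

```python
import torch

@torch.no_grad()
def energy_score(energy, x):
    """Higher energy corresponds to lower unnormalized likelihood,
    i.e., inputs the model places off the data manifold."""
    return energy(x)

def calibrate_threshold(energy, x_val, quantile=0.95):
    """Choose a threshold from held-out in-distribution data."""
    return torch.quantile(energy_score(energy, x_val), quantile)

def is_ood(energy, x, threshold):
    return energy_score(energy, x) > threshold
```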

Adversarial Robustness

Training EBMs with MCMC-based negative sampling yields models with robust gradients. Under adversarial attacks (e.g., PGD), EBMs show notable resistance, attributed to the implicit, diffusion-driven generation process that mitigates the sharp, feedforward gradient pathways that adversaries exploit in other models.

Continual Online Class Learning

During continual learning scenarios (e.g., Split MNIST), conditional EBMs update only with respect to the currently active class distribution, resulting in local, rather than global, forgetting. This permits the addition of new classes with minimal interference with previously learned data—surpassing methods such as EWC or SI and outperforming baseline VAEs (1903.08689).

Long-Term Sequential Prediction

EBMs trained on trajectory data demonstrate the ability to generate coherent long-term rollouts, even under multimodal transition dynamics such as those in robotic hand manipulation. The iterative sampling process allows for stabilization and refinement over many timesteps, yielding lower trajectory error (as measured by Fréchet distance) than feedforward dynamical predictors (1903.08689).

6. Summary and Ongoing Challenges

The advances in (1903.08689) demonstrate that EBMs, when equipped with scalable MCMC sampling, persistent replay, and principled regularization, can be trained successfully with continuous neural network energy functions on diverse, high-dimensional tasks. They offer a robust alternative to explicit generative models, with advantages in sample diversity, compositionality, OOD detection, adversarial robustness, and continual learning. Key technical contributions include the design of efficient Langevin-based sampling procedures, the integration of spectral normalization into the energy network, and the development of training frameworks that both stabilize optimization and generalize across domains.

Remaining challenges for EBMs include computational efficiency, particularly in the negative sampling steps; scaling MCMC to even higher dimensions; and integrating structural prior knowledge into the energy landscape without compromising model expressivity. Nonetheless, the evidence supports EBMs as a versatile and performant foundation for both generative and discriminative modeling in high-dimensional, structured domains.

References (1)