
Score-based Generative Models

Updated 17 September 2025
  • Score-based generative models learn the gradient of the data log-density (the score) from noise-corrupted samples, enabling new samples to be generated by simulating reverse-time dynamics.
  • The latent formulation integrates VAEs to reduce dimensionality, improve expressivity, and model non-continuous data efficiently.
  • Innovative variance reduction, mixed parameterization, and reverse SDE techniques yield state-of-the-art sample quality and efficiency.

Score-based generative models (SGMs) are a leading paradigm for modern generative modeling, combining the principles of stochastic differential equations (SDEs), score-matching, and probabilistic inference. An SGM learns to approximate the gradient of the log-probability density (“score”) of a complex data distribution by modeling the time evolution of noise-corrupted samples, then utilizes this learned field to generate new samples by simulating the reverse dynamics. The recent work on Latent Score-based Generative Models (LSGM) (Vahdat et al., 2021) introduced a new framework that migrates SGMs from high-dimensional data space into a learned latent space, addressing both sample quality and computational inefficiency of classical approaches.

1. Latent Space Formulation and Key Model Structure

LSGM lifts SGMs into latent space by coupling a variational autoencoder (VAE) with a score-based prior defined through a reverse-time SDE. The generative process becomes:

  • The data point $x \in \mathbb{R}^D$ is encoded to a latent code $z_0$ by the encoder $q(z_0|x)$.
  • A forward latent diffusion process generates $z_t$ for $t \in [0,1]$ by

$$dz_t = f(t)\,z_t\,dt + g(t)\,dW_t,$$

where $f$ and $g$ parameterize the SDE and $W_t$ is Brownian motion.

  • The reverse process, parameterized by a learned neural score function $s_t(z_t)$, reconstructs clean latent variables.
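
As a concrete illustration of the forward latent diffusion above, the following minimal sketch simulates $dz_t = f(t)\,z_t\,dt + g(t)\,dW_t$ with Euler–Maruyama. The linear $\beta(t)$ schedule and the variance-preserving choice $f(t) = -\beta(t)/2$, $g(t) = \sqrt{\beta(t)}$ are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import torch

def beta(t, beta_min=0.1, beta_max=20.0):
    # Illustrative linear noise schedule (an assumption, not necessarily the paper's choice).
    return beta_min + t * (beta_max - beta_min)

def forward_diffuse(z0, n_steps=1000):
    """Euler-Maruyama simulation of dz_t = f(t) z_t dt + g(t) dW_t on t in [0, 1],
    with the variance-preserving choice f(t) = -beta(t)/2 and g(t) = sqrt(beta(t))."""
    dt = 1.0 / n_steps
    z = z0.clone()
    for i in range(n_steps):
        t = i * dt
        drift = -0.5 * beta(t) * z                  # f(t) * z_t
        diffusion = beta(t) ** 0.5                  # g(t)
        z = z + drift * dt + diffusion * (dt ** 0.5) * torch.randn_like(z)
    return z                                        # approximately standard normal at t = 1

z1 = forward_diffuse(torch.randn(16, 64))           # batch of 16 latent codes of dimension 64
```

In practice such step-by-step simulation is only for intuition: because the SDE is linear, $q(z_t|z_0)$ is Gaussian and samples at any $t$ can be drawn in closed form during training.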

This VAE-based split offers three essential advantages:

  • Expressivity: The learned latent space can be shaped to accommodate intricate modes and semantics.
  • Dimensionality reduction: Lower dimensionality in latent space eases training and allows SGMs to focus on minor discrepancies between $q(z_0)$ (the VAE's aggregate posterior) and a standard normal prior.
  • Non-continuous data: The decoder $p(x|z_0)$ enables direct modeling of binary, discrete, or structured data via dedicated output heads (e.g., Bernoulli or mixture of logistics).

2. Novel Score-Matching Objective for Latent Diffusions

Standard SGM training employs denoising score-matching (DSM), typically:

$$\mathbb{E}_t\big[\lambda(t)\, \mathbb{E}_{q(x)q(x_t|x)}\big\|\nabla_{x_t}\log q(x_t|x) - s_t(x_t)\big\|^2\big]$$

For LSGM, this is not directly applicable: the latent marginal $q(z_t)$ has no closed form because the encoder evolves during training. The LSGM objective instead reformulates the cross-entropy between the aggregate posterior $q(z_0)$ and the SGM prior $p(z_0)$ as:

$$\operatorname{CE}\big(q(z_0)\,\|\,p(z_0)\big) = \mathbb{E}_{t \sim U[0,1]}\Big[\frac{g(t)^2}{2}\,\mathbb{E}_{q(z_t, z_0)}\big\|\nabla_{z_t} \log q(z_t|z_0) - \nabla_{z_t} \log p(z_t)\big\|^2\Big] + \frac{D}{2}\log(2\pi e)$$

where the transition-kernel score $\nabla_{z_t}\log q(z_t|z_0)$ is available in closed form (the kernel is Gaussian) and $\nabla_{z_t}\log p(z_t)$ is the learned prior score, enabling scalable, maximum-likelihood-aligned joint training of the VAE and the latent SGM.
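
The following sketch shows how this objective can be estimated by Monte Carlo, assuming a variance-preserving kernel $q(z_t|z_0) = \mathcal{N}(\mu_t z_0, \sigma_t^2 I)$ whose conditional score is $-(z_t - \mu_t z_0)/\sigma_t^2$. The schedule constants and the `score_model` callable are placeholders for illustration, not the paper's exact setup.

```python
import torch

def vp_kernel(t, beta_min=0.1, beta_max=20.0):
    """Mean scale mu_t and std sigma_t of q(z_t | z_0) for a linear VPSDE (illustrative)."""
    log_mean = -0.25 * t ** 2 * (beta_max - beta_min) - 0.5 * t * beta_min
    mu_t = torch.exp(log_mean)
    sigma_t = torch.sqrt(1.0 - torch.exp(2.0 * log_mean))
    return mu_t, sigma_t

def latent_dsm_loss(score_model, z0, beta_min=0.1, beta_max=20.0):
    """Monte Carlo estimate of the weighted score-matching term in the cross-entropy bound.
    score_model(z_t, t) is a placeholder for the learned prior score network."""
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device).clamp(min=1e-5)       # t ~ U[0, 1]
    mu_t, sigma_t = vp_kernel(t, beta_min, beta_max)
    eps = torch.randn_like(z0)
    z_t = mu_t[:, None] * z0 + sigma_t[:, None] * eps          # sample from q(z_t | z_0)
    target = -eps / sigma_t[:, None]                           # grad_{z_t} log q(z_t | z_0)
    g2 = beta_min + t * (beta_max - beta_min)                  # g(t)^2 = beta(t) for the VPSDE
    sq_err = ((score_model(z_t, t) - target) ** 2).sum(dim=1)
    return 0.5 * (g2 * sq_err).mean()
```

In the full model, a term of this kind is combined with the VAE reconstruction and entropy terms so that encoder, decoder, and latent score network can be trained jointly.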

3. Mixed Parameterization of the Score Function

Rather than requiring the network to learn $\nabla_{z_t}\log p(z_t)$ directly, LSGM introduces a mixed parameterization:

$$s_t(z_t) = \sigma_t(1-\alpha) \odot z_t + \alpha \odot s_t'(z_t, t)$$

Here, $\sigma_t$ is the current noise scale and $\alpha \in [0,1]$ is a (potentially learned) mixing coefficient. The network outputs only the residual correction required to bridge $q(z_0)$ and the standard normal prior. This decomposition exploits VAE pretraining, which already brings the aggregate posterior close to the standard normal prior, so the residual the network must learn is small; the resulting score function is nearly linear and smooth, which improves sample quality and sampling efficiency.
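
A minimal module implementing this mixed parameterization might look as follows; the residual MLP, the per-dimension learnable $\alpha$, and the way $t$ is fed to the network are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixedScore(nn.Module):
    """Mixed parameterization sketch: a fixed linear term plus a learned residual network.
    The residual MLP and per-dimension mixing logits are illustrative assumptions."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.alpha_logit = nn.Parameter(torch.zeros(dim))      # mixing coefficient, kept in [0, 1]
        self.residual = nn.Sequential(                          # stand-in for the residual score network
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z_t, t, sigma_t):
        alpha = torch.sigmoid(self.alpha_logit)                 # elementwise alpha in [0, 1]
        linear_part = sigma_t[:, None] * (1.0 - alpha) * z_t    # analytic standard-normal component
        inp = torch.cat([z_t, t[:, None]], dim=1)               # condition the residual net on time
        return linear_part + alpha * self.residual(inp)         # residual corrects the prior mismatch
```

With $\alpha = 0$ the output reduces to the analytic term associated with a standard normal prior, so the network only has to learn the departure from that baseline.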

4. Variance Reduction Techniques in Training

Estimating the DSM loss for LSGM over continuous diffusion times suffers from high variance, leading to instability. LSGM proposes three methods to reduce this variance:

  • Geometric VPSDE: A variance-preserving SDE schedule chosen so that $d(\log \sigma_t^2)/dt$ is constant over $t$, yielding a time-homogeneous integrand for the score-matching loss.
  • Importance sampling: The optimal distribution for $t$ is derived to minimize estimation variance; in particular, the importance weights are matched to the integrand magnitude of each loss variant, and sampling is performed via inverse-transform methods.
  • Reweighting schemes: For maximum-likelihood versus other weightings, optimal $t$-proposal distributions are designed for further variance reduction.

These strategies jointly enable stable end-to-end training, fewer network evaluations, and improved scalability.
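
As one way to realize the schedule and the inverse-transform sampling mentioned above, the sketch below defines a geometric variance schedule and a generic importance sampler over diffusion times. The grid-based CDF inversion and the `weight_fn` placeholder are assumptions for illustration, not the paper's exact derivation.

```python
import torch

def geometric_sigma2(t, s2_min=1e-4, s2_max=0.99):
    """Geometric VPSDE variance schedule: log sigma_t^2 is linear in t, so
    d(log sigma_t^2)/dt is constant (time-homogeneous integrand)."""
    return s2_min * (s2_max / s2_min) ** t

def sample_t_importance(weight_fn, batch, n_grid=1000, eps=1e-5):
    """Inverse-transform sampling of t from a proposal proportional to weight_fn(t).
    weight_fn is a placeholder for the per-time integrand magnitude (an assumption here)."""
    grid = torch.linspace(eps, 1.0, n_grid)
    w = weight_fn(grid)
    cdf = torch.cumsum(w, dim=0)
    cdf = cdf / cdf[-1]                                    # normalize to a valid CDF
    u = torch.rand(batch)
    idx = torch.searchsorted(cdf, u).clamp(max=n_grid - 1)  # invert the empirical CDF
    t = grid[idx]
    # Importance weight (uniform density / proposal density) keeps the estimator unbiased.
    pdf = w / (w.sum() * (grid[1] - grid[0]))
    return t, 1.0 / pdf[idx]
```

The returned weights rescale each sample's loss contribution so that the estimator of the uniform-time expectation remains unbiased under the non-uniform proposal.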

5. Empirical Sample Quality and Efficiency

LSGM achieves state-of-the-art performance across multiple contexts:

  • On CIFAR-10: FID = 2.10, surpassing prior diffusion-based and generative adversarial models.
  • On CelebA-HQ-256: Comparable FID to leading pixel-space SGMs, but with 20–30 neural function evaluations (NFE) instead of thousands.
  • On OMNIGLOT and binarized MNIST: New best likelihoods, with natural adaptation to binary image modeling.

The acceleration in sampling arises from the lower latent dimensionality and the nearly linear mixed score, which allow the reverse-time dynamics to be integrated with large ODE steps without loss of sample fidelity or coverage.
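
A deterministic sampler along these lines can be sketched as a few Euler steps of the probability-flow ODE associated with the reverse dynamics. The step count, schedule constants, and `score_model` interface are assumptions, and in practice an adaptive black-box ODE solver would typically replace the fixed Euler steps.

```python
import torch

def probability_flow_sample(score_model, shape, n_steps=30,
                            beta_min=0.1, beta_max=20.0):
    """Deterministic sampling sketch: integrate the probability-flow ODE
    dz/dt = f(t) z - 0.5 g(t)^2 score(z, t) backward from t = 1 to t = 0
    with a small number of Euler steps. score_model is a placeholder network."""
    z = torch.randn(shape)                                  # start from the standard-normal prior
    ts = torch.linspace(1.0, 1e-3, n_steps + 1)
    for i in range(n_steps):
        t, t_next = ts[i], ts[i + 1]
        dt = t_next - t                                     # negative step (reverse time)
        beta_t = beta_min + t * (beta_max - beta_min)
        t_batch = torch.full((shape[0],), float(t))
        drift = -0.5 * beta_t * z - 0.5 * beta_t * score_model(z, t_batch)
        z = z + drift * dt
    return z                                                # approximate sample from q(z_0)
```

The returned latent would then be passed through the VAE decoder $p(x|z_0)$ to produce a sample in data space.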

6. Model Applicability and Extensions

By leveraging flexible encoder–decoder choices, LSGM generalizes to:

  • Non-continuous data: Through Bernoulli or other non-Gaussian decoders.
  • Modalities beyond images: Music, sequences, graphs, or text can be modeled whenever a suitable latent representation can be learned.
  • Semi-supervised and representation learning: The VAE backbone provides efficient inference and supports tasks reliant on learnt latent codes.
  • Resource-constrained generative tasks: The orders-of-magnitude reduction in NFE makes LSGM suitable for high-throughput and real-time applications.
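
Expanding on the non-continuous-data point above, a minimal sketch of a Bernoulli decoder head for binary data is shown below; the MLP architecture and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BernoulliDecoder(nn.Module):
    """Sketch of a non-Gaussian decoder head p(x | z_0) for binary data
    (e.g., binarized MNIST); the MLP architecture is an illustrative assumption."""

    def __init__(self, latent_dim, data_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, data_dim),                     # logits of independent Bernoullis
        )

    def log_prob(self, x, z0):
        logits = self.net(z0)
        # Per-pixel Bernoulli log-likelihood, summed over data dimensions.
        return -F.binary_cross_entropy_with_logits(
            logits, x, reduction="none").sum(dim=1)
```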

7. Practical and Theoretical Implications

The LSGM framework unifies maximum-likelihood generative modeling with scalable, flexible, and computationally efficient sampling by fusing latent-variable inference and continuous-time score matching:

  • The cross-entropy-based denoising objective avoids the intractability of marginal scores prevalent in hierarchical latent models.
  • The mixed score parameterization targets only the true prior mismatch rather than re-learning standard Gaussian scores.
  • Stable and practical training ensues from geometric noise scheduling and optimal importance sampling for time selection.

LSGM’s demonstrated gains in sample quality and sampling speed, along with its architectural flexibility, underscore latent diffusion as a leading strategy for deploying SGMs in high-performance, multimodal generative modeling.

References

  • Vahdat, A., Kreis, K., and Kautz, J. (2021). "Score-based Generative Modeling in Latent Space." Advances in Neural Information Processing Systems (NeurIPS 2021).
