
Score-based Generative Models

Updated 17 September 2025
  • Score-based generative models learn the gradient of the data log-density (the score) from noise-corrupted samples, enabling new samples to be generated by simulating reverse-time dynamics.
  • The latent formulation integrates VAEs to reduce dimensionality, improve expressivity, and model non-continuous data efficiently.
  • Innovative variance reduction, mixed parameterization, and reverse SDE techniques yield state-of-the-art sample quality and efficiency.

Score-based generative models (SGMs) are a leading paradigm for modern generative modeling, combining the principles of stochastic differential equations (SDEs), score-matching, and probabilistic inference. An SGM learns to approximate the gradient of the log-probability density (“score”) of a complex data distribution by modeling the time evolution of noise-corrupted samples, then utilizes this learned field to generate new samples by simulating the reverse dynamics. The recent work on Latent Score-based Generative Models (LSGM) (Vahdat et al., 2021) introduced a new framework that migrates SGMs from high-dimensional data space into a learned latent space, addressing both sample quality and computational inefficiency of classical approaches.

1. Latent Space Formulation and Key Model Structure

LSGM lifts SGMs into latent space by coupling a variational autoencoder (VAE) with a score-based prior defined through a reverse-time SDE. The generative process becomes:

  • The data point $x \in \mathbb{R}^D$ is encoded to a latent code $z_0$ by the encoder $q(z_0|x)$.
  • A forward latent diffusion process generates $z_t$ for $t \in [0,1]$ by

$$dz_t = f(t)\,z_t\,dt + g(t)\,dW_t,$$

where $f$ and $g$ parameterize the SDE and $W_t$ is Brownian motion.

  • The reverse process, parameterized by a learned neural score function $s_t(z_t)$, reconstructs clean latent variables.
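
As a concrete illustration of the forward latent diffusion above, the following minimal sketch simulates $dz_t = f(t)\,z_t\,dt + g(t)\,dW_t$ with Euler–Maruyama. The linear $\beta(t)$ schedule and the variance-preserving choice $f(t) = -\beta(t)/2$, $g(t) = \sqrt{\beta(t)}$ are illustrative assumptions, not necessarily the paper's exact configuration.

```python
import torch

def beta(t, beta_min=0.1, beta_max=20.0):
    # Illustrative linear noise schedule (an assumption, not necessarily the paper's choice).
    return beta_min + t * (beta_max - beta_min)

def forward_diffuse(z0, n_steps=1000):
    """Euler-Maruyama simulation of dz_t = f(t) z_t dt + g(t) dW_t on t in [0, 1],
    with the variance-preserving choice f(t) = -beta(t)/2 and g(t) = sqrt(beta(t))."""
    dt = 1.0 / n_steps
    z = z0.clone()
    for i in range(n_steps):
        t = i * dt
        drift = -0.5 * beta(t) * z                  # f(t) * z_t
        diffusion = beta(t) ** 0.5                  # g(t)
        z = z + drift * dt + diffusion * (dt ** 0.5) * torch.randn_like(z)
    return z                                        # approximately standard normal at t = 1

z1 = forward_diffuse(torch.randn(16, 64))           # batch of 16 latent codes of dimension 64
```

In practice such step-by-step simulation is only for intuition: because the SDE is linear, $q(z_t|z_0)$ is Gaussian and samples at any $t$ can be drawn in closed form during training.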

This VAE-based split offers three essential advantages:

  • Expressivity: The learned latent space can be shaped to accommodate intricate modes and semantics.
  • Dimensionality reduction: Lower dimensionality in latent space eases training and allows SGMs to focus on minor discrepancies between $q(z_0)$ (the VAE's aggregate posterior) and a standard normal prior.
  • Non-continuous data: The decoder $p(x|z_0)$ enables direct modeling of binary, discrete, or structured data via dedicated output heads (e.g., Bernoulli or mixture of logistics).

2. Novel Score-Matching Objective for Latent Diffusions

Standard SGM training employs denoising score-matching (DSM), typically:

$$\mathbb{E}_t\big[\lambda(t)\, \mathbb{E}_{q(x)q(x_t|x)}\big\|\nabla_{x_t}\log q(x_t|x) - s_t(x_t)\big\|^2\big]$$

For LSGM, this is not directly applicable: the latent marginal $q(z_t)$ has no closed form because the encoder evolves during training. The LSGM objective instead reformulates the cross-entropy between the aggregate posterior $q(z_0)$ and the SGM prior $p(z_0)$ as:

$$\operatorname{CE}\big(q(z_0)\,\|\,p(z_0)\big) = \mathbb{E}_{t \sim U[0,1]}\Big[\frac{g(t)^2}{2}\,\mathbb{E}_{q(z_t, z_0)}\big\|\nabla_{z_t} \log q(z_t|z_0) - \nabla_{z_t} \log p(z_t)\big\|^2\Big] + \frac{D}{2}\log(2\pi e)$$

where the transition-kernel score $\nabla_{z_t}\log q(z_t|z_0)$ is available in closed form (the kernel is Gaussian) and $\nabla_{z_t}\log p(z_t)$ is the learned prior score, enabling scalable, maximum-likelihood-aligned joint training of the VAE and the latent SGM.
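
The following sketch shows how this objective can be estimated by Monte Carlo, assuming a variance-preserving kernel $q(z_t|z_0) = \mathcal{N}(\mu_t z_0, \sigma_t^2 I)$ whose conditional score is $-(z_t - \mu_t z_0)/\sigma_t^2$. The schedule constants and the `score_model` callable are placeholders for illustration, not the paper's exact setup.

```python
import torch

def vp_kernel(t, beta_min=0.1, beta_max=20.0):
    """Mean scale mu_t and std sigma_t of q(z_t | z_0) for a linear VPSDE (illustrative)."""
    log_mean = -0.25 * t ** 2 * (beta_max - beta_min) - 0.5 * t * beta_min
    mu_t = torch.exp(log_mean)
    sigma_t = torch.sqrt(1.0 - torch.exp(2.0 * log_mean))
    return mu_t, sigma_t

def latent_dsm_loss(score_model, z0, beta_min=0.1, beta_max=20.0):
    """Monte Carlo estimate of the weighted score-matching term in the cross-entropy bound.
    score_model(z_t, t) is a placeholder for the learned prior score network."""
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device).clamp(min=1e-5)       # t ~ U[0, 1]
    mu_t, sigma_t = vp_kernel(t, beta_min, beta_max)
    eps = torch.randn_like(z0)
    z_t = mu_t[:, None] * z0 + sigma_t[:, None] * eps          # sample from q(z_t | z_0)
    target = -eps / sigma_t[:, None]                           # grad_{z_t} log q(z_t | z_0)
    g2 = beta_min + t * (beta_max - beta_min)                  # g(t)^2 = beta(t) for the VPSDE
    sq_err = ((score_model(z_t, t) - target) ** 2).sum(dim=1)
    return 0.5 * (g2 * sq_err).mean()
```

In the full model, a term of this kind is combined with the VAE reconstruction and entropy terms so that encoder, decoder, and latent score network can be trained jointly.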

3. Mixed Parameterization of the Score Function

Rather than requiring the network to learn $\nabla_{z_t}\log p(z_t)$ directly, LSGM introduces a mixed parameterization:

$$s_t(z_t) = \sigma_t(1-\alpha) \odot z_t + \alpha \odot s_t'(z_t, t)$$

Here, $\sigma_t$ is the current noise scale and $\alpha \in [0,1]$ is a (potentially learned) mixing coefficient. The network outputs only the residual correction required to bridge $q(z_0)$ and the standard normal prior. This decomposition exploits VAE pretraining, which already brings the aggregate posterior close to the standard normal prior, so the residual the network must learn is small; the resulting score function is nearly linear and smooth, which improves sample quality and sampling efficiency.
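
A minimal module implementing this mixed parameterization might look as follows; the residual MLP, the per-dimension learnable $\alpha$, and the way $t$ is fed to the network are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MixedScore(nn.Module):
    """Mixed parameterization sketch: a fixed linear term plus a learned residual network.
    The residual MLP and per-dimension mixing logits are illustrative assumptions."""

    def __init__(self, dim, hidden=256):
        super().__init__()
        self.alpha_logit = nn.Parameter(torch.zeros(dim))      # mixing coefficient, kept in [0, 1]
        self.residual = nn.Sequential(                          # stand-in for the residual score network
            nn.Linear(dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, dim),
        )

    def forward(self, z_t, t, sigma_t):
        alpha = torch.sigmoid(self.alpha_logit)                 # elementwise alpha in [0, 1]
        linear_part = sigma_t[:, None] * (1.0 - alpha) * z_t    # analytic standard-normal component
        inp = torch.cat([z_t, t[:, None]], dim=1)               # condition the residual net on time
        return linear_part + alpha * self.residual(inp)         # residual corrects the prior mismatch
```

With $\alpha = 0$ the output reduces to the analytic term associated with a standard normal prior, so the network only has to learn the departure from that baseline.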

4. Variance Reduction Techniques in Training

Estimating the DSM loss for LSGM over continuous diffusion times suffers from high variance, leading to instability. LSGM proposes three methods to reduce this variance:

  • Geometric VPSDE: A variance-preserving SDE schedule chosen so that $d(\log \sigma_t^2)/dt$ is constant over $t$, yielding a time-homogeneous integrand for the score-matching loss.
  • Importance sampling: The optimal distribution for $t$ is derived to minimize estimation variance; in particular, the importance weights are matched to the integrand magnitude of each loss variant, and sampling is performed via inverse-transform methods.
  • Reweighting schemes: For maximum-likelihood versus other weightings, optimal $t$-proposal distributions are designed for further variance reduction.

These strategies jointly enable stable end-to-end training, fewer network evaluations, and improved scalability.
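
As one way to realize the schedule and the inverse-transform sampling mentioned above, the sketch below defines a geometric variance schedule and a generic importance sampler over diffusion times. The grid-based CDF inversion and the `weight_fn` placeholder are assumptions for illustration, not the paper's exact derivation.

```python
import torch

def geometric_sigma2(t, s2_min=1e-4, s2_max=0.99):
    """Geometric VPSDE variance schedule: log sigma_t^2 is linear in t, so
    d(log sigma_t^2)/dt is constant (time-homogeneous integrand)."""
    return s2_min * (s2_max / s2_min) ** t

def sample_t_importance(weight_fn, batch, n_grid=1000, eps=1e-5):
    """Inverse-transform sampling of t from a proposal proportional to weight_fn(t).
    weight_fn is a placeholder for the per-time integrand magnitude (an assumption here)."""
    grid = torch.linspace(eps, 1.0, n_grid)
    w = weight_fn(grid)
    cdf = torch.cumsum(w, dim=0)
    cdf = cdf / cdf[-1]                                    # normalize to a valid CDF
    u = torch.rand(batch)
    idx = torch.searchsorted(cdf, u).clamp(max=n_grid - 1)  # invert the empirical CDF
    t = grid[idx]
    # Importance weight (uniform density / proposal density) keeps the estimator unbiased.
    pdf = w / (w.sum() * (grid[1] - grid[0]))
    return t, 1.0 / pdf[idx]
```

The returned weights rescale each sample's loss contribution so that the estimator of the uniform-time expectation remains unbiased under the non-uniform proposal.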

5. Empirical Sample Quality and Efficiency

LSGM achieves state-of-the-art performance across multiple contexts:

  • On CIFAR-10: FID = 2.10, surpassing prior diffusion-based and generative adversarial models.
  • On CelebA-HQ-256: Comparable FID to leading pixel-space SGMs, but with 20–30 neural function evaluations (NFE) instead of thousands.
  • On OMNIGLOT and binarized MNIST: New best likelihoods, with natural adaptation to binary image modeling.

The acceleration in sampling arises from the lower latent dimensionality and the nearly linear mixed score, which allow the reverse-time dynamics to be integrated with large ODE steps without loss of sample fidelity or coverage.
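
A deterministic sampler along these lines can be sketched as a few Euler steps of the probability-flow ODE associated with the reverse dynamics. The step count, schedule constants, and `score_model` interface are assumptions, and in practice an adaptive black-box ODE solver would typically replace the fixed Euler steps.

```python
import torch

def probability_flow_sample(score_model, shape, n_steps=30,
                            beta_min=0.1, beta_max=20.0):
    """Deterministic sampling sketch: integrate the probability-flow ODE
    dz/dt = f(t) z - 0.5 g(t)^2 score(z, t) backward from t = 1 to t = 0
    with a small number of Euler steps. score_model is a placeholder network."""
    z = torch.randn(shape)                                  # start from the standard-normal prior
    ts = torch.linspace(1.0, 1e-3, n_steps + 1)
    for i in range(n_steps):
        t, t_next = ts[i], ts[i + 1]
        dt = t_next - t                                     # negative step (reverse time)
        beta_t = beta_min + t * (beta_max - beta_min)
        t_batch = torch.full((shape[0],), float(t))
        drift = -0.5 * beta_t * z - 0.5 * beta_t * score_model(z, t_batch)
        z = z + drift * dt
    return z                                                # approximate sample from q(z_0)
```

The returned latent would then be passed through the VAE decoder $p(x|z_0)$ to produce a sample in data space.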

6. Model Applicability and Extensions

By leveraging flexible encoder–decoder choices, LSGM generalizes to:

  • Non-continuous data: Through Bernoulli or other non-Gaussian decoders.
  • Modalities beyond images: Music, sequences, graphs, or text can be modeled whenever a suitable latent representation can be learned.
  • Semi-supervised and representation learning: The VAE backbone provides efficient inference and supports tasks reliant on learnt latent codes.
  • Resource-constrained generative tasks: The orders-of-magnitude reduction in NFE makes LSGM suitable for high-throughput and real-time applications.
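
Expanding on the non-continuous-data point above, a minimal sketch of a Bernoulli decoder head for binary data is shown below; the MLP architecture and dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BernoulliDecoder(nn.Module):
    """Sketch of a non-Gaussian decoder head p(x | z_0) for binary data
    (e.g., binarized MNIST); the MLP architecture is an illustrative assumption."""

    def __init__(self, latent_dim, data_dim, hidden=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.SiLU(),
            nn.Linear(hidden, data_dim),                     # logits of independent Bernoullis
        )

    def log_prob(self, x, z0):
        logits = self.net(z0)
        # Per-pixel Bernoulli log-likelihood, summed over data dimensions.
        return -F.binary_cross_entropy_with_logits(
            logits, x, reduction="none").sum(dim=1)
```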

7. Practical and Theoretical Implications

The LSGM framework unifies maximum-likelihood generative modeling with scalable, flexible, and computationally efficient sampling by fusing latent-variable inference and continuous-time score matching:

  • The cross-entropy-based denoising objective avoids the intractability of marginal scores prevalent in hierarchical latent models.
  • The mixed score parameterization targets only the true prior mismatch rather than re-learning standard Gaussian scores.
  • Stable and practical training ensues from geometric noise scheduling and optimal importance sampling for time selection.

LSGM’s demonstrated gains in sample quality and sampling speed, along with its architectural flexibility, underscore latent diffusion as a leading strategy for deploying SGMs in high-performance, multimodal generative modeling.

References

  • Vahdat, A., Kreis, K., and Kautz, J. (2021). "Score-based Generative Modeling in Latent Space." Advances in Neural Information Processing Systems (NeurIPS 2021).
