
VAE-based Latent SGM Framework

Updated 14 December 2025
  • The paper presents a VAE-based latent SGM framework that integrates variational autoencoders with nonlinear latent SDEs using LNDSM loss to enhance generative modeling.
  • The methodology leverages learned latent representations, drift control via score networks, and variance stabilization to improve metrics like FID, IS, and mode coverage.
  • Empirical results on MNIST show that LNDSM-SGM achieves superior synthesis quality and speed compared to traditional LSGM and full-space SGMs, especially in low-data regimes.

A VAE-based latent SGM framework is a generative modeling paradigm that integrates variational autoencoders (VAEs) with score-based generative models (SGMs) operating in latent space. This approach exploits the compactness and structure of learned latent representations to facilitate efficient, expressive generative modeling. Recent work extends this framework by incorporating nonlinear latent stochastic differential equations (SDEs) and advanced score-matching objectives, such as the Latent Nonlinear Denoising Score Matching (LNDSM) loss, to enhance the modeling of structured, multimodal distributions and improve sampling quality and efficiency (Shen et al., 7 Dec 2025, Vahdat et al., 2021).

1. Model Components and Latent Diffusion Structure

The canonical VAE-based latent SGM framework consists of three main components:

  • Encoder $q_{\phi}(z_0 \mid x)$: maps a data point $x$ to a latent code $z_0 \in \mathbb{R}^{d_{\text{lat}}}$, typically using a deep neural network that captures the data's salient features.
  • Decoder $p_{\psi}(x \mid z_0)$: reconstructs $x$ from the latent $z_0$, enabling end-to-end training via the reconstruction error.
  • Score-based latent prior $p_{\theta}(z_0)$: rather than a simple analytic prior (e.g., a Gaussian), this prior is realized by a learned SGM defined through an SDE in latent space:

$$dz_t = f(z_t, t)\,dt + g(t)\,dw_t, \quad t \in [0, T],$$

where $f(z, t)$ controls the drift toward a target invariant measure (e.g., a Gaussian mixture model, GMM) and $g(t)$ modulates the noise injection.

A score network $s_{\theta}(z_t, t)$ parametrizes the gradient of the time-dependent log-density, $\nabla_{z_t} \log p_t(z_t)$, and is trained to approximate the prior score at varying diffusion times. The encoder, decoder, and score network are optimized jointly using a modified evidence lower bound (ELBO), in which the latent prior's KL divergence is replaced by a denoising score-matching term (Shen et al., 7 Dec 2025, Vahdat et al., 2021).
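To make this wiring concrete, the following PyTorch-style sketch shows how a single training step could combine the reconstruction term, the (closed-form) negative posterior entropy, and a score-matching cross-entropy surrogate in place of an analytic KL. The module interfaces (`decoder.log_prob`, `latent_cross_entropy`) are illustrative assumptions, not the reference implementation of (Shen et al., 7 Dec 2025).

```python
import math
import torch
import torch.nn as nn

class LatentSGMVAE(nn.Module):
    """Minimal sketch of a VAE with a score-based latent prior (assumed interfaces)."""

    def __init__(self, encoder, decoder, score_net):
        super().__init__()
        self.encoder, self.decoder, self.score_net = encoder, decoder, score_net

    def training_loss(self, x, latent_cross_entropy):
        # Encoder q_phi(z0 | x): amortized diagonal-Gaussian posterior.
        mu, log_var = self.encoder(x)
        z0 = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)  # reparameterization

        # Reconstruction term  E_q[-log p_psi(x | z0)]  (assumed log_prob interface).
        recon = -self.decoder.log_prob(x, z0).mean()

        # Negative posterior entropy  E_q[log q_phi(z0 | x)], closed form for a Gaussian.
        neg_entropy = -0.5 * (log_var + 1.0 + math.log(2 * math.pi)).sum(dim=1).mean()

        # Cross-entropy CE(q_phi(z0 | x) || p_theta(z0)), estimated with a
        # score-matching surrogate (e.g., the LNDSM estimator of Section 2)
        # rather than an analytic KL term.
        ce = latent_cross_entropy(z0, self.score_net)

        return recon + neg_entropy + ce
```

Gradients from all three terms flow into the encoder through the reparameterized $z_0$, so the latent geometry and the score-based prior are shaped jointly.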

2. LNDSM Loss: Derivation and Stability

The core innovation introduced by LNDSM is the reformulation of the ELBO's latent-prior cross-entropy using an SDE with nonlinear drift, accounting for structure in latent space:

$$\mathcal{L}(x; \phi, \theta, \psi) = \mathbb{E}_{z_0 \sim q_{\phi}(z_0|x)}\big[-\log p_{\psi}(x|z_0)\big] + \mathbb{E}_{z_0 \sim q_{\phi}(z_0|x)}\big[\log q_{\phi}(z_0|x)\big] + CE\big(q_{\phi}(z_0|x) \,\|\, p_{\theta}(z_0)\big)$$

The cross-entropy $CE(q_0 \| p_0)$ is, after time discretization (Euler–Maruyama with $n_f$ steps of size $\Delta t_n$), given by (Shen et al., 7 Dec 2025):

$$\begin{aligned} CE(q_0 \| p_0) \approx{}& H(q_{n_f}) + \frac{1}{2}\,\mathbb{E}_{N, z_N}\!\left[ g^2(t_N)\, \| s_\theta(z_N, t_N) \|^2 \right] \\ &+ \mathbb{E}_{N, z_{N-1}, z_N}\!\left[ g^2(t_N)\, \frac{U_{N-1}}{\sigma_{N-1}} \big[ s_{\theta}(z_N, t_N) - s_{\theta}(\mu_{N-1}, t_N) \big] - \frac{U_{N-1}}{\sigma_{N-1}} \big[ f(z_N, t_N) - f(\mu_{N-1}, t_N) \big] \right] \end{aligned}$$

where $U_{N-1} \sim \mathcal{N}(0, I)$, $\sigma_{N-1} = g(t_{N-1}) \sqrt{\Delta t_{N-1}}$, and $\mu_{N-1} = z_{N-1} + f(z_{N-1}, t_{N-1})\,\Delta t_{N-1}$.

To achieve numerical stability, the loss eliminates the variance-exploding (control-variate) terms originating from the discretized SDE: each such term is zero-mean but scales as $1/\sqrt{\Delta t}$ and would otherwise destabilize optimization as $\Delta t \to 0$ (Shen et al., 7 Dec 2025).
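As a concrete reading of this estimator, the sketch below draws a single step index $N$ uniformly, simulates the latent Euler–Maruyama chain up to that step, and evaluates the stabilized terms above. The callables `drift_f`, `score_net`, and `g`, the uniform draw of $N$, and the omission of the $\theta$-independent constant $H(q_{n_f})$ are assumptions made for illustration, not details taken from the paper.

```python
import torch

def lndsm_cross_entropy(z0, score_net, drift_f, g, t_grid):
    """Naive single-sample sketch of the discretized LNDSM cross-entropy.

    Assumes drift_f(z, t) and score_net(z, t) are batched callables, g(t)
    returns a scalar tensor, and t_grid holds times t_0 < ... < t_{n_f}.
    """
    n_f = t_grid.numel() - 1
    N = int(torch.randint(1, n_f + 1, (1,)))       # uniform step index (no importance sampling)

    # Euler–Maruyama simulation of the forward latent SDE up to step N.
    z = z0
    for n in range(N):
        dt = t_grid[n + 1] - t_grid[n]
        sigma = g(t_grid[n]) * dt.sqrt()           # sigma_n = g(t_n) sqrt(dt_n)
        U = torch.randn_like(z)                    # Gaussian increment U_n
        mu = z + drift_f(z, t_grid[n]) * dt        # EM mean mu_n
        z = mu + sigma * U                         # next state z_{n+1}

    # After the loop: z = z_N, while (mu, sigma, U) belong to step N-1.
    t_N = t_grid[N]
    g2 = g(t_N) ** 2
    s_zN, s_mu = score_net(z, t_N), score_net(mu, t_N)

    quad = 0.5 * g2 * (s_zN ** 2).sum(dim=1)                                        # quadratic score term
    cross_score = g2 * ((U / sigma) * (s_zN - s_mu)).sum(dim=1)                     # stabilized score difference
    cross_drift = ((U / sigma) * (drift_f(z, t_N) - drift_f(mu, t_N))).sum(dim=1)   # stabilized drift difference

    return (quad + cross_score - cross_drift).mean()
```

The uniform draw of $N$ matches the observation in Section 3 that LNDSM requires no importance sampling over diffusion times.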

3. Implementation and Architectural Details

Key architectural and training details for state-of-the-art systems employing this framework include:

  • Structured latent reference: A GMM, with components corresponding to class labels (e.g., 10 modes for MNIST), is adopted as the reference distribution in latent space.
  • Drift/score parametrizations: The drift $f(z, t)$ is set as the negative gradient of a reference potential $V(z) = -\log \pi(z)$, ensuring the target invariant measure is achieved (a minimal sketch of this drift appears after this list). The score network is typically a downsized U-Net (e.g., an NCSN++ variant), and the VAE is built on a simplified NVAE backbone.
  • Typical parameter counts: for MNIST-scale tasks, $\sim$207k parameters (VAE) and $\sim$895k (score network), totaling $\sim$1.1M.
  • Training protocol:
    • Pretrain the VAE under the latent diffusion prior.
    • Fit GMM in latent space; optionally fine-tune encoder at low learning rate to preserve mode structure.
    • Jointly train decoder and score network under LNDSM loss, with batch sizes and step counts chosen for efficiency (e.g., batch size 60, 100 steps).
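A minimal sketch of the GMM-referenced drift follows; it assumes a time-independent drift obtained by differentiating the latent GMM log-density with `torch.distributions`, and all shapes and names are illustrative only.

```python
import torch
from torch.distributions import Categorical, Independent, MixtureSameFamily, Normal

def make_gmm_drift(means, log_scales, logits):
    """Sketch: drift f(z, t) = -grad V(z), with V(z) = -log pi(z) for a latent GMM pi.

    means: (K, d) component means; log_scales: (K, d) log standard deviations;
    logits: (K,) mixture weights. All assumed shapes are illustrative.
    """
    gmm = MixtureSameFamily(
        Categorical(logits=logits),
        Independent(Normal(means, log_scales.exp()), 1),
    )

    def drift_f(z, t):
        # f(z, t) = grad_z log pi(z): pushes latents toward the GMM modes,
        # independently of t in this simplified sketch.
        z = z.detach().requires_grad_(True)
        log_pi = gmm.log_prob(z).sum()
        return torch.autograd.grad(log_pi, z)[0]

    return drift_f
```

With this choice, the forward SDE is a Langevin-type diffusion whose drift points toward the GMM reference, which is what makes the multimodal latent prior the target of the noising process.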

Notably, no importance sampling over diffusion times is necessary in LNDSM, simplifying optimization relative to earlier LSGM variants (Shen et al., 7 Dec 2025, Vahdat et al., 2021).

4. Relation to Other Latent SGM Variants

LSGM (Vahdat et al., 2021) originally proposed training SGMs directly on VAE latents, optimizing a denoising score-matching objective adapted to the latent manifold, with various variance-reduction techniques (geometric VPSDE, importance sampling in diffusion time) to promote scalable, stable optimization. LNDSM generalizes this approach by allowing nonlinear (and learnable) latent drift, which enables targeting multimodal or symmetric latent priors such as GMMs, and by introducing a refined variance-stabilizing loss (Shen et al., 7 Dec 2025).

Other approaches, such as diffusion-based VampPriors (Kuzina et al., 2 Dec 2024), act over VAE context variables or hierarchical latents, but do not instantiate a nonlinear SDE-based learned prior on the full latent representation; their diffusion dynamics focus on auxiliary variables and do not exploit explicit latent manifold structure.

5. Empirical Performance and Advantages

LNDSM has demonstrated state-of-the-art performance on structured and symmetric data sets:

MNIST, full data (60k samples; FID↓ / IS↑):

  • NDSM-SGM: 36.1 / 8.93
  • LSGM: 6.7 / 9.42
  • LNDSM-SGM: 2.2 / 9.75

MNIST, low-data regime (3k samples):

  • LSGM: 27.1 / 9.12
  • LNDSM-SGM: 15.0 / 9.57

Mode balance (KL divergence):

  • LSGM: 0.008
  • LNDSM-SGM: 0.0011

Sampling speed (per batch):

  • Latent EM (128-dim, 100 steps): 0.068 s
  • Full-space NDSM-SGM (784-dim, 1000 steps): 0.35 s

LNDSM thus offers substantial improvements in FID, IS, and mode coverage metrics compared to baseline latent and full-space SGMs, with orders-of-magnitude gains in synthesis speed. This suggests that the explicit integration of structured priors and variance stabilization techniques in latent score matching is critical for scalable, high-fidelity structured generative modeling (Shen et al., 7 Dec 2025).
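A plausible reading of the latent EM sampler timed above is a reverse-time Euler–Maruyama integration of the latent SDE followed by a single decoder pass; the sketch below uses the standard reverse-SDE drift $f(z, t) - g^2(t)\, s_{\theta}(z, t)$ and placeholder interfaces throughout, so it should be read as an assumption-laden illustration rather than the paper's exact sampler.

```python
import torch

@torch.no_grad()
def sample_latent_em(decoder, score_net, drift_f, g, t_grid, z_T):
    """Sketch: reverse-time Euler–Maruyama sampling in latent space, then decoding.

    Integrates dz = [f(z, t) - g^2(t) s_theta(z, t)] dt + g(t) dw backward from
    t_{n_f} to t_0 over t_grid, starting from z_T drawn from the latent reference
    (e.g., the GMM). All interfaces are assumed for illustration.
    """
    z = z_T
    for n in reversed(range(t_grid.numel() - 1)):
        t = t_grid[n + 1]
        dt = t_grid[n + 1] - t_grid[n]
        rev_drift = drift_f(z, t) - g(t) ** 2 * score_net(z, t)
        z = z - rev_drift * dt + g(t) * dt.sqrt() * torch.randn_like(z)
    return decoder(z)   # map the final latent z_0 back to data space
```

Because the integration runs in a 128-dimensional latent space for ~100 steps rather than in 784-dimensional pixel space for ~1000 steps, the speedup reported above follows directly from the reduced dimensionality and step count.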

6. Significance and Comparative Structural Insights

The principal insight of the VAE-based latent SGM framework is that learning and sampling are substantially more efficient and stable when the prior is learned in latent space rather than pixel space. This is enabled both by the smoothness and lower dimensionality of VAEs' latent manifolds and by variance reduction mechanisms in the SDE-driven score matching loss (Vahdat et al., 2021). LNDSM extends these advantages to structured and multimodal distributions by enabling nonlinear latent dynamics, demonstrating robustness in low-data regimes and improved diversity.

In contrast, diffusion-based VampPrior approaches target flexible priors over context or auxiliary variables, with strengths in hierarchical aggregation and stability (Kuzina et al., 2 Dec 2024). LSGM and LNDSM, however, are specifically designed to operationalize the SGM machinery on the full VAE latent space, incorporating prior structure directly and learning denoising dynamics tailored to the geometry and symmetries of the data manifold.

7. Concluding Remarks

VAE-based latent SGM frameworks, particularly those leveraging LNDSM, define a rigorous, empirically validated pathway for scalable, high-quality structured generative modeling. They synthesize advances in latent variable modeling, denoising score matching, and diffusion processes to yield efficient training and sampling, superior sample quality, and precise distributional coverage—especially when explicit structure (e.g., multimodal, symmetric priors) is essential (Shen et al., 7 Dec 2025, Vahdat et al., 2021).
