
Latent Video Diffusion Model (LVDM)

Updated 2 April 2026
  • LVDM is a generative video model that compresses videos using a variational autoencoder and applies denoising diffusion in a low-dimensional latent space for scalable synthesis.
  • It employs a two-stage process: first, encoding videos with spatiotemporal VAEs, and then refining latent representations via diffusion, significantly reducing computational overhead.
  • LVDMs enable versatile applications such as text-to-video generation, frame interpolation, and long-form video synthesis through robust hierarchical conditioning and guidance techniques.

A Latent Video Diffusion Model (LVDM) is a class of generative video models that integrates a video variational autoencoder (VAE) with a denoising diffusion process in a compressed, low-dimensional latent space. Operating in this space enables the efficient modeling of high-resolution, long-duration, or perceptually complex video content that would be computationally prohibitive in pixel space. LVDMs are structured around the two-stage workflow of (1) mapping input videos to a compact latent representation using a video VAE, and (2) training a denoising diffusion probabilistic model (DDPM) to generate or manipulate videos within this latent space. This paradigm underpins a wide range of state-of-the-art video synthesis systems, including text-to-video, video frame interpolation, super-resolution, and long-form video generation.

1. Core Architecture: Latent Autoencoding and Diffusion Process

The LVDM pipeline centers on a VAE that encodes a video $\mathbf{x} \in \mathbb{R}^{T \times H \times W \times 3}$ to latents $\mathbf{z}_0 = E(\mathbf{x}) \in \mathbb{R}^{T' \times H' \times W' \times C}$; $\mathbf{z}_0$ is then the target of a diffusion generative model. Typical choices for the autoencoder backbone are convolutional VAEs, hierarchical VQ-VAEs, or specialized spatiotemporal architectures. Compression rates often reach 8–32× spatially and 2–4× temporally.
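
As a back-of-the-envelope illustration of that compression arithmetic (factors chosen from the ranges above; the shapes are illustrative, not tied to any specific model):

```python
import numpy as np

# Illustrative video and compression factors: 8x spatial, 4x temporal, C = 4 latent channels.
T, H, W = 16, 256, 256          # input frames, height, width (RGB)
fs, ft, C = 8, 4, 4             # spatial factor, temporal factor, latent channels

pixel_elems = T * H * W * 3
latent_shape = (T // ft, H // fs, W // fs, C)
latent_elems = int(np.prod(latent_shape))

print(latent_shape)                   # (4, 32, 32, 4)
print(pixel_elems / latent_elems)     # 192.0 -- ~two orders of magnitude fewer elements
```

The diffusion model then only ever sees tensors of `latent_shape`, which is what makes high-resolution video tractable.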

The forward (noising) process in latent space, as standardized in DDPMs, is formulated as

$$q(\mathbf{z}_t \mid \mathbf{z}_{t-1}) = \mathcal{N}\!\left(\mathbf{z}_t;\ \sqrt{1-\beta_t}\,\mathbf{z}_{t-1},\ \beta_t I\right)$$

with $(\beta_t)$ a variance schedule and $\mathbf{z}_0$ the original VAE-encoded latent. The reverse step is learned by a deep denoising network:

$$\mathbf{z}_{t-1} = \frac{1}{\sqrt{\alpha_t}} \left( \mathbf{z}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\, \epsilon_\theta(\mathbf{z}_t, t) \right) + \sigma_t \zeta, \quad \zeta \sim \mathcal{N}(0, I),$$

where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, and $\epsilon_\theta$ predicts the added noise. The diffusion model operates on $\mathbf{z} \in \mathbb{R}^{T' \times H' \times W' \times C}$, implemented using either 3D convolutional U-Nets or hybrid spatiotemporal transformers with temporal convolution and attention layers (Blattmann et al., 2023, Wu et al., 2024, He et al., 2022). At inference, a sample $\mathbf{z}_T \sim \mathcal{N}(0, I)$ is iteratively denoised and then decoded by the VAE decoder, $\hat{\mathbf{x}} = D(\mathbf{z}_0)$.
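
A minimal numpy sketch of the forward noising and one reverse update; the trained denoiser is replaced here by a toy stand-in that returns the true noise, purely to keep the sketch runnable:

```python
import numpy as np

rng = np.random.default_rng(0)

# Linear variance schedule and its cumulative products.
Tsteps = 1000
betas = np.linspace(1e-4, 0.02, Tsteps)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

z0 = rng.standard_normal((4, 32, 32, 4))   # a VAE latent (toy values)

# Forward process: sample z_t directly from q(z_t | z_0).
t = 500
eps = rng.standard_normal(z0.shape)
zt = np.sqrt(alpha_bars[t]) * z0 + np.sqrt(1.0 - alpha_bars[t]) * eps

def eps_theta(z, step):
    # Stand-in for the trained spatiotemporal denoiser; here we cheat with the true noise.
    return eps

# One reverse (ancestral) step, matching the update formula above.
zeta = rng.standard_normal(z0.shape)
sigma_t = np.sqrt(betas[t])
z_prev = (zt - (1 - alphas[t]) / np.sqrt(1 - alpha_bars[t]) * eps_theta(zt, t)) \
         / np.sqrt(alphas[t]) + sigma_t * zeta
```

With a perfect noise prediction, the closed-form inversion $(\mathbf{z}_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon)/\sqrt{\bar{\alpha}_t}$ recovers $\mathbf{z}_0$ exactly, which is a useful sanity check when implementing the schedule.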

2. Variational Autoencoder Design for Latent Space Compression

The choice and engineering of the video VAE critically determine the efficiency and upper-bound fidelity of the LVDM. Multiple architectural innovations have emerged:

  • Omni-dimensional video VAE (OD-VAE): Simultaneous compression in spatial and temporal axes using 3D-causal convolutions, with multiple variants trading off between 2D and 3D convs for efficiency and quality. Techniques like tail initialization from 2D VAE weights and overlap-and-drop chunking address training speed and memory constraints (Chen et al., 2024).
  • Wavelet-flow VAE (WF-VAE): Integrates multi-level 3D Haar wavelet decomposition, focusing channel bandwidth on low-frequency subbands (main "energy flow" pathway) and using inflow/outflow blocks to bypass heavy 3D convs for efficiency. Supports lossless block-wise inference via a causal cache system (Li et al., 2024).
  • LeanVAE: Employs non-overlapping 3D patch embedding, ultra-lightweight neighborhood-aware feedforward (NAF) modules, and compressed sensing (CS) bottlenecks, reducing FLOPs by up to 50× and shrinking the memory footprint while maintaining competitive reconstruction (Cheng et al., 18 Mar 2025).
  • Improved Video VAE (IV-VAE): Decomposes latents into a keyframe-inherited branch (initialized from a 2D image VAE) and a temporal-branch (3D group causal convolution), achieving better spatial-temporal trade-offs and faster convergence (Wu et al., 2024).
  • Spectral-Structured VAE (SSVAE): Regularizes VAE latents with local correlation regularization (LCR) to bias power spectral density toward low frequencies, and latent masked reconstruction (LMR) to enforce a few-mode channel eigenspectrum, accelerating diffusion convergence and improving downstream sample quality (Liu et al., 5 Dec 2025).
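
WF-VAE's "energy flow" design rests on Haar wavelet decomposition, which splits a signal into low- and high-frequency subbands with perfect reconstruction; a single-level, 1D numpy sketch (WF-VAE itself uses multi-level 3D Haar transforms):

```python
import numpy as np

def haar_1d(x):
    """Single-level Haar decomposition along the first (time) axis; even length assumed."""
    lo = (x[0::2] + x[1::2]) / np.sqrt(2)   # low-frequency ("energy flow") subband
    hi = (x[0::2] - x[1::2]) / np.sqrt(2)   # high-frequency detail subband
    return lo, hi

def inv_haar_1d(lo, hi):
    """Exact inverse of haar_1d."""
    x = np.empty((2 * lo.shape[0],) + lo.shape[1:])
    x[0::2] = (lo + hi) / np.sqrt(2)
    x[1::2] = (lo - hi) / np.sqrt(2)
    return x

x = np.random.default_rng(5).standard_normal((8, 3))   # 8 frames, 3 channels
lo, hi = haar_1d(x)
assert np.allclose(inv_haar_1d(lo, hi), x)             # perfect reconstruction
```

For natural video most energy concentrates in the `lo` subband, which is why WF-VAE routes the bulk of its channel bandwidth there.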

A summary of VAE innovations is organized below:

| VAE Variant | Key Feature | Efficiency Highlight |
| --- | --- | --- |
| OD-VAE (Chen et al., 2024) | Omni-dimensional compression, tail initialization | 2× memory, 2× speed vs. baseline |
| WF-VAE (Li et al., 2024) | Wavelet energy-flow, causal cache | 4× memory, 2× speed vs. OD-VAE |
| LeanVAE (Cheng et al., 18 Mar 2025) | Patch + NAF + CS, DWT | 44× faster, 50× fewer FLOPs |
| IV-VAE (Wu et al., 2024) | Keyframe + temporal split (GCConv) | 2× faster, SOTA reconstruction, ~½ params |
| SSVAE (Liu et al., 5 Dec 2025) | Spectral bias (LCR + LMR) | 3× faster diffusion training, higher reward |

3. Conditioning and Guidance in Latent Video Diffusion

LVDMs support multiple forms of temporal and semantic conditioning:

  • Conditional Video Generation: Text/image-conditioned LVDMs use CLIP embeddings, prompt encodings, or direct image features injected into the denoising U-Net using cross-attention along both spatial and temporal axes (Blattmann et al., 2023, Blattmann et al., 2023, Reynaud et al., 2024).
  • Frame Interpolation: LDMVFI conditions the diffusion model on the two neighboring-frame latents, interpolating the target frame as a conditional generative task. Conditioning latents are input by concatenation or cross-attention at multiple U-Net stages (Danier et al., 2023).
  • Hierarchical Sampling: For long-form or high-frame-rate generation, hierarchical strategies employ a mask-conditional diffusion model for keyframe prediction followed by interpolation diffusion for intermediate frames, enabling synthesis of sequences on the order of a thousand frames with moderate drift (He et al., 2022).
  • Perceptual and GAN Losses: Reconstruction quality is regulated using LPIPS, adversarial patch-GAN, or frequency-domain losses; recent work emphasizes that pixel MSE/PSNR is a poor proxy for perceptual fidelity in video, necessitating generative/contrastive objectives (Danier et al., 2023, Li et al., 2024, Liu et al., 5 Dec 2025).
  • Robustness Enhancement: Corruption-Aware Training (CAT-LVDM) injects structured, low-rank noise (Batch-Centered Noise Injection and Spectrum-Aware Contextual Noise) into conditioning embeddings, improving generative consistency under noisy prompts (Maduabuchi et al., 24 May 2025).
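
The cross-attention injection used for text/image conditioning above can be sketched in a few lines of numpy; the dimensions and weight matrices here are illustrative, not any model's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(1)

def cross_attention(z_tokens, cond_tokens, Wq, Wk, Wv):
    """z_tokens: (N, d) latent tokens; cond_tokens: (M, d_c) prompt embeddings."""
    Q, K, V = z_tokens @ Wq, cond_tokens @ Wk, cond_tokens @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])          # scaled dot-product
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)         # softmax over prompt tokens
    return attn @ V      # each latent token becomes a mixture of conditioning values

N, M, d, d_c = 8, 4, 16, 32          # latent tokens, prompt tokens, feature dims
z = rng.standard_normal((N, d))
c = rng.standard_normal((M, d_c))     # stand-in for CLIP text embeddings
Wq = rng.standard_normal((d, d))
Wk = rng.standard_normal((d_c, d))
Wv = rng.standard_normal((d_c, d))
out = cross_attention(z, c, Wq, Wk, Wv)
assert out.shape == (N, d)
```

In a real LVDM this layer is repeated at multiple U-Net resolutions, with the latent tokens flattened along spatial and/or temporal axes.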

4. Denoising Network and Temporal Modeling

The denoising backbone in modern LVDMs is typically a spatiotemporal U-Net or transformer, with the following features:

  • Temporal Layering: After each spatial block, temporal convolutions or attention modules are inserted. 3D residual blocks and temporal self-attention facilitate cross-frame dependency modeling, sometimes with sinusoidal or learned positional encodings (Blattmann et al., 2023, Blattmann et al., 2023, Reynaud et al., 2024).
  • MaxViT and Efficient Attention: LDMVFI employs multi-axis MaxViT-based self-attention in the latent diffusion U-Net for favorable compute scaling with resolution while maintaining performance (Danier et al., 2023).
  • Denoising Loss: Training universally utilizes the noise-prediction loss on latents, $\mathcal{L} = \mathbb{E}_{\mathbf{z}_0,\, \epsilon \sim \mathcal{N}(0, I),\, t}\left[ \lVert \epsilon - \epsilon_\theta(\mathbf{z}_t, t) \rVert^2 \right]$. Hybrid models (e.g., JVID) combine latent image- and video-diffusion models during sampling for improved spatiotemporal trade-offs (Reynaud et al., 2024).
  • Temporal Smoothing and Guidance: To further reduce flicker, adaptive smoothing or entropy-reduction is applied to latent sequences post-denoising (Reynaud et al., 2024, Gu et al., 2023).
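
The spatial/temporal interleaving described above is commonly implemented by reshaping the latent grid so that the same attention routine alternates between per-frame token sets and per-location frame sequences; a shape-only numpy sketch with an identity stand-in for attention:

```python
import numpy as np

B, T, H, W, C = 2, 4, 8, 8, 16
z = np.random.default_rng(3).standard_normal((B, T, H, W, C))

def attend(tokens):
    # Stand-in for self-attention over axis 1; identity keeps the sketch runnable.
    return tokens

# Spatial block: each frame attends over its H*W tokens independently.
spatial = attend(z.reshape(B * T, H * W, C)).reshape(B, T, H, W, C)

# Temporal block: each spatial location attends across the T frames.
zt = spatial.transpose(0, 2, 3, 1, 4)              # (B, H, W, T, C)
temporal = attend(zt.reshape(B * H * W, T, C)).reshape(B, H, W, T, C)
out = temporal.transpose(0, 3, 1, 2, 4)            # back to (B, T, H, W, C)
assert out.shape == z.shape
```

This factorization keeps attention cost linear in the number of frames times the per-frame cost, rather than quadratic in the full spatiotemporal token count.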

5. Efficiency, Scaling, and Training Protocols

LVDMs are designed to address scaling bottlenecks in video synthesis:

  • Compressed Latent Space: Reduces token count and computational overhead by an order of magnitude or more compared to pixel-space diffusion, since the denoiser operates on a $T' \times H' \times W'$ latent grid rather than the full $T \times H \times W$ pixel grid (He et al., 2022).
  • Block-wise and Causal Encoding: WF-VAE and other models implement causal convolutions and caching strategies for lossless block-wise inference on arbitrarily long videos, eliminating temporal discontinuities observed with naive chunking (Li et al., 2024, Chen et al., 2024).
  • Hierarchical Sampling: Enables tractable and stable long-sequence generation beyond the length seen during training (He et al., 2022).
  • Large-Scale Pretraining: Three-stage recipes—(1) text/image pretraining, (2) large-scale video pretraining with temporal layers, and (3) high-quality video finetuning—lead to stronger spatiotemporal and semantic priors (Blattmann et al., 2023, Blattmann et al., 2023).
  • Resource Impact: WF-VAE reduces 512×512 encoding time and memory by 4× over OD-VAE; LeanVAE can be up to 44× faster than baseline VAEs at high resolution (Cheng et al., 18 Mar 2025, Li et al., 2024).
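
The causal-cache idea behind lossless block-wise inference can be illustrated with a 1D causal convolution over time: carrying the trailing k−1 frames between chunks reproduces the full-sequence output exactly (a simplified sketch, not WF-VAE's actual implementation):

```python
import numpy as np

def causal_conv(x, kernel, cache=None):
    """Causal 1D conv over time. `cache` carries the last k-1 frames of the previous chunk."""
    k = len(kernel)
    pad = np.zeros((k - 1, x.shape[1])) if cache is None else cache
    xp = np.concatenate([pad, x], axis=0)
    out = np.stack([kernel @ xp[t:t + k] for t in range(x.shape[0])])
    return out, xp[-(k - 1):]            # output and the new cache

rng = np.random.default_rng(4)
x = rng.standard_normal((12, 3))         # 12 frames, 3 channels
kern = np.array([0.2, 0.3, 0.5])

full, _ = causal_conv(x, kern)           # whole sequence at once
a, cache = causal_conv(x[:5], kern)      # first chunk
b, _ = causal_conv(x[5:], kern, cache)   # second chunk reuses the cache
assert np.allclose(np.concatenate([a, b]), full)   # lossless block-wise inference
```

Naive chunking (re-padding each chunk with zeros) would instead perturb the first frames of every chunk, producing the temporal discontinuities mentioned above.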

6. Evaluation, Empirical Findings, and Benchmarks

LVDMs have advanced state-of-the-art synthesis quality and efficiency on video generation tasks:

  • Perceptual Metrics: LDMVFI achieves the best LPIPS/FloLPIPS across standard VFI benchmarks and is preferred in user studies for visual sharpness (Danier et al., 2023).
  • Distributional Metrics: Stable Video Diffusion achieves superior FVD on UCF-101 (242.0, lower is better) and outperforms previous open methods in human preference on zero-shot image-to-video synthesis (Blattmann et al., 2023).
  • Efficiency Metrics: WF-VAE and LeanVAE substantially reduce throughput bottlenecks without quality loss; e.g., LeanVAE is up to 44× faster with only marginal FVD/LPIPS degradation (Cheng et al., 18 Mar 2025).
  • Robustness: CAT-LVDM’s data-aligned corruptions yield FVD reductions of 31.9% on WebVid-2M and 12–16% on UCF-101 relative to standard Gaussian/Uniform corruption (Maduabuchi et al., 24 May 2025).
  • Convergence: SSVAE demonstrates up to 3× faster generation convergence and 10% higher UnifiedReward compared to vanilla VAE baselines, supporting the importance of spectral biasing (Liu et al., 5 Dec 2025).
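
FVD figures like those above are Fréchet distances between Gaussian fits to video-feature statistics; a minimal numpy version of the underlying formula (the feature extractor, typically an I3D network, is outside the scope of this sketch):

```python
import numpy as np

def sqrtm_sym(A):
    """Matrix square root of a symmetric PSD matrix via eigendecomposition."""
    w, V = np.linalg.eigh(A)
    return (V * np.sqrt(np.clip(w, 0, None))) @ V.T

def frechet_distance(mu1, cov1, mu2, cov2):
    """||mu1-mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^{1/2})."""
    s1h = sqrtm_sym(cov1)
    covmean = sqrtm_sym(s1h @ cov2 @ s1h)   # symmetric rewriting of (cov1 cov2)^{1/2}
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(cov1 + cov2 - 2 * covmean))

rng = np.random.default_rng(6)
real = rng.standard_normal((500, 8))          # stand-in for features of real videos
fake = rng.standard_normal((500, 8)) + 0.5    # shifted "generated" features
d = frechet_distance(real.mean(0), np.cov(real, rowvar=False),
                     fake.mean(0), np.cov(fake, rowvar=False))
d0 = frechet_distance(real.mean(0), np.cov(real, rowvar=False),
                      real.mean(0), np.cov(real, rowvar=False))
assert d > 0.0 and d0 < 1e-6   # identical statistics give (numerically) zero distance
```

Lower is better because the distance shrinks as the generated feature distribution approaches the real one.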

7. Open Challenges and Frontiers

Despite substantial progress, LVDMs present active challenges and innovation frontiers:

  • Video VAE Design: Latent structure, spectral regularization, and balancing spatial/temporal compression remain critical for scaling and ultimate sample fidelity (Liu et al., 5 Dec 2025, Wu et al., 2024).
  • Long-Horizon Coherence: Hierarchical and iterative denoising mechanisms (e.g., Reuse-and-Diffuse (Gu et al., 2023)) extend feasible output lengths, but boundary artifacts and motion drift remain.
  • Robust Conditioning: Ensuring stability under noisy/misaligned prompts, leveraging structured corruption-aware objectives, and efficient conditioning at scale are under rapid exploration (Maduabuchi et al., 24 May 2025).
  • Efficiency–Quality Trade-offs: Patchification, wavelet-flow, and ultra-light bottlenecks yield major efficiency gains, but optimizing for generalization and fidelity in complex dynamic scenes can be challenging (Cheng et al., 18 Mar 2025, Li et al., 2024).
  • Cross-domain Application: Applications range from frame interpolation and personalized text-to-video (Danier et al., 2023, Blattmann et al., 2023) to multi-view and 3D synthesis (Blattmann et al., 2023). Extensions to multimodal and autoregressive video understanding, as well as integration with large video LLMs, are emerging directions (Maduabuchi et al., 24 May 2025).

The LVDM paradigm is the dominant generative modeling framework for high-fidelity, scalable video synthesis, underpinning current state-of-the-art systems and driving innovations in model architecture, training methodology, and evaluation (Blattmann et al., 2023, He et al., 2022, Li et al., 2024, Liu et al., 5 Dec 2025, Danier et al., 2023).
