Latent Video Diffusion Backbone
- Latent Video Diffusion Backbone is the core framework that decouples appearance content and motion through denoising diffusion in a low-dimensional latent space.
- It employs dual U-Net denoisers with attention and positional group normalization to enhance spatial-temporal coherence in video generation.
- Autoregressive synthesis with robust loss functions and latent motion conditioning leads to state-of-the-art performance on large-scale, high-resolution video datasets.
A latent video diffusion backbone is the core architectural and mathematical infrastructure that enables generative modeling of videos via denoising diffusion processes in a learned, typically compressed, latent space instead of pixels. By decoupling appearance content and motion, reducing dimensionality, and leveraging tailored conditioning mechanisms, such backbones make high-fidelity and temporally coherent video synthesis tractable on large-scale datasets and at high resolutions. The backbone as realized in VIDM ("Video Implicit Diffusion Models") comprises framewise convolutional encoding, dual U-Net–style denoisers with attention and specialized normalization, robustness-enhanced objectives, and explicit latent motion conditioning, organized into an autoregressive video generation pipeline that yields state-of-the-art quality and efficiency (Mei et al., 2022).
1. Mathematical Formulation in Latent Space
The backbone relies on a forward noising process in a low-dimensional latent space. Given a frame-wise latent representation $z_0$, each video frame undergoes a $T$-step Gaussian diffusion process

$$q(z_t \mid z_{t-1}) = \mathcal{N}\!\left(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t \mathbf{I}\right),$$

or, expressed in closed form,

$$q(z_t \mid z_0) = \mathcal{N}\!\left(z_t;\ \sqrt{\bar\alpha_t}\, z_0,\ (1-\bar\alpha_t)\mathbf{I}\right), \qquad \bar\alpha_t = \prod_{s=1}^{t}(1-\beta_s).$$
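As a concrete reference, here is a minimal PyTorch sketch of the closed-form noising step; the schedule start value and the latent shape are illustrative assumptions, while the endpoint $0.02$ matches the schedule quoted in Section 5.

```python
import torch

# Linear beta schedule (start value is a common DDPM default, used here only for illustration).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)            # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def q_sample(z0: torch.Tensor, t: torch.Tensor, noise: torch.Tensor) -> torch.Tensor:
    """Draw z_t ~ q(z_t | z_0) using the closed-form expression."""
    ab = alpha_bars[t].view(-1, *([1] * (z0.dim() - 1)))  # broadcast \bar{alpha}_t over latent dims
    return ab.sqrt() * z0 + (1.0 - ab).sqrt() * noise

# Example: noise a batch of 4 frame latents (illustrative shape) at random timesteps.
z0 = torch.randn(4, 128, 16, 16)
t = torch.randint(0, T, (4,))
z_t = q_sample(z0, t, torch.randn_like(z0))
```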
The reverse process is learned via deep neural networks $\epsilon_\theta$, with two key variants:
- A content denoiser $\epsilon_\theta^{c}$ for initial-frame generation,
- A motion denoiser $\epsilon_\theta^{m}$ that additionally incorporates a motion latent $m$ and a residual term $r$.
The reverse kernel is

$$p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\!\left(z_{t-1};\ \mu_\theta(z_t, t),\ \sigma_t^2 \mathbf{I}\right), \qquad \text{where} \qquad \mu_\theta(z_t, t) = \frac{1}{\sqrt{1-\beta_t}}\left(z_t - \frac{\beta_t}{\sqrt{1-\bar\alpha_t}}\, \epsilon_\theta(z_t, t)\right).$$
The loss objective replaces the standard DDPM mean-squared error with a robust Charbonnier penalty,

$$\mathcal{L} = \mathbb{E}_{z_0,\, \epsilon,\, t}\!\left[\sqrt{\left\lVert \epsilon - \epsilon_\theta(z_t, t) \right\rVert^{2} + \eta^{2}}\right],$$

with a small constant $\eta$.
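A minimal training-loss sketch with the Charbonnier penalty in place of the squared error; the value of $\eta$ and the per-sample reduction are assumptions, not the paper's exact settings.

```python
import torch

def charbonnier_diffusion_loss(eps_pred: torch.Tensor, eps: torch.Tensor, eta: float = 1e-3) -> torch.Tensor:
    """Charbonnier penalty on the noise-prediction residual.

    Replaces the squared error of standard DDPM training with a smooth, outlier-robust root;
    eta and the reduction over the batch are illustrative choices.
    """
    sq_err = (eps - eps_pred).pow(2).flatten(1).sum(dim=1)  # per-sample squared residual
    return torch.sqrt(sq_err + eta ** 2).mean()

# Inside a training step (q_sample and a denoiser eps_theta assumed defined elsewhere):
# t = torch.randint(0, T, (z0.size(0),))
# eps = torch.randn_like(z0)
# loss = charbonnier_diffusion_loss(eps_theta(q_sample(z0, t, eps), t), eps)
```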
2. Content and Motion-Focused U-Net Architecture
- Per-frame Encoder: Four strided convolutional blocks (GroupNorm + ReLU) downsample each frame into a compact latent representation. This representation conditions both the content and motion branches.
- U-Net Denoisers: Both content and motion denoisers share a U-Net backbone:
- Down path: four spatial resolutions; each level employs two convolutions, GroupNorm, and SiLU activations, with multi-head self-attention at the coarsest resolution. Downsampling is via strided convolution.
- Up path: nearest-neighbor upsampling with mirrored convolution and normalization blocks.
- The diffusion timestep $t$ is injected through sinusoidal embeddings at each block. The motion network additionally injects the implicit motion code $m$ via MLP-based FiLM gating.
- A learnable truncation constant is channel-wise concatenated with the denoiser input at every step.
- Positional Group Normalization (PosGN): each GroupNorm layer is replaced by a position-aware variant whose affine scale and shift are modulated by spatial-temporal positional information, facilitating spatial and temporal modulation; this is especially important in the motion branch (a sketch of one possible parameterization follows this list).
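The exact PosGN parameterization is not reproduced in this summary. The sketch below assumes GroupNorm whose affine scale and shift are learned per spatial position, combined with FiLM gating from a timestep/motion embedding; this matches the description above but may differ in detail from the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PositionalGroupNorm(nn.Module):
    """GroupNorm whose scale/shift are modulated per spatial position (assumed parameterization)."""
    def __init__(self, groups: int, channels: int, height: int, width: int):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)
        self.gamma = nn.Parameter(torch.ones(1, channels, height, width))   # position-dependent scale
        self.beta = nn.Parameter(torch.zeros(1, channels, height, width))   # position-dependent shift

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return self.gamma * self.norm(h) + self.beta

class FiLMResBlock(nn.Module):
    """Residual conv block with FiLM gating from a timestep (and optional motion) embedding."""
    def __init__(self, channels: int, cond_dim: int, height: int, width: int):
        super().__init__()
        self.norm1 = PositionalGroupNorm(8, channels, height, width)
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.norm2 = PositionalGroupNorm(8, channels, height, width)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.film = nn.Linear(cond_dim, 2 * channels)   # produces per-channel (scale, shift)

    def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        scale, shift = self.film(cond).chunk(2, dim=1)
        out = self.conv1(F.silu(self.norm1(h)))
        out = out * (1 + scale[:, :, None, None]) + shift[:, :, None, None]   # FiLM gating
        out = self.conv2(F.silu(self.norm2(out)))
        return h + out
```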
3. Latent Motion Conditioning and Implicit Dynamics
- Motion Latent $m$: computed via a pretrained SpyNet network, which estimates an optical-flow-like representation from adjacent frames and shares the spatial resolution of the U-Net's bottleneck; $m$ is injected into all denoising blocks via FiLM layers.
- Residual Term $r$: an adaptive residual obtained through a separate encoder applied to the first frame and the timestep, augmenting the reverse kernel of the motion denoiser (see the conditioning sketch below).
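The sketch below shows how these conditioning signals could be assembled; `flow_net` stands in for any pretrained SpyNet-style flow estimator, `residual_encoder` and the pooling to the bottleneck resolution are placeholder assumptions rather than the paper's actual interfaces.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def motion_conditioning(flow_net: nn.Module, prev_frame: torch.Tensor, cur_frame: torch.Tensor,
                        residual_encoder: nn.Module, first_frame: torch.Tensor,
                        t_emb: torch.Tensor, bottleneck_size: int = 16):
    """Assemble the conditioning signals for the motion denoiser.

    flow_net          -- pretrained SpyNet-style optical-flow estimator (placeholder interface)
    residual_encoder  -- small network producing the adaptive residual from frame 0 and the timestep embedding
    bottleneck_size   -- assumed spatial size of the U-Net bottleneck
    """
    with torch.no_grad():                                   # the flow network stays frozen
        m = flow_net(prev_frame, cur_frame)                 # optical-flow-like motion latent
    m = F.adaptive_avg_pool2d(m, bottleneck_size)           # match the U-Net bottleneck resolution
    r = residual_encoder(first_frame, t_emb)                # adaptive residual term
    return m, r

# The motion denoiser then consumes (z_t, t, m, r), with m injected at every block via FiLM.
```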
4. Regularization and Sampling Improvements
- Sampling-Space Truncation: a learnable constant tensor is concatenated with the U-Net input, constraining the generative noise space in a manner analogous to StyleGAN's truncation trick; the constant is kept fixed during inference (a minimal sketch follows this list).
- Robustness Penalty: The Charbonnier loss function prevents overfitting and eliminates the need for dropout.
- Positional GroupNorm: as described in Section 2, it provides coordinate-aware normalization across space and time.
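A minimal sketch of the truncation input, assuming channel-wise concatenation of a learnable constant; the number of constant channels and the spatial shape are illustrative.

```python
import torch
import torch.nn as nn

class TruncationInput(nn.Module):
    """Channel-wise concatenation of a learnable constant tensor with the denoiser input.

    The constant is learned jointly with the denoiser during training and simply reused
    (i.e., kept fixed) at inference time.
    """
    def __init__(self, const_channels: int, height: int, width: int):
        super().__init__()
        self.const = nn.Parameter(torch.zeros(1, const_channels, height, width))

    def forward(self, z_t: torch.Tensor) -> torch.Tensor:
        const = self.const.expand(z_t.size(0), -1, -1, -1)  # broadcast over the batch
        return torch.cat([z_t, const], dim=1)               # channel-wise concatenation
```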
5. Autoregressive Video Generation and Training Protocols
- Autoregressive Synthesis:
- The first frame is generated via the content denoiser from pure noise.
- Subsequent frames are sampled by the motion denoiser, conditioned on the computed motion latent $m$, the residual $r$, and the already generated frames.
- Training Hyperparameters:
- 1,000 diffusion steps with a linearly increasing noise schedule ending at $\beta_T = 0.02$.
- Batch size 32 per GPU.
- Adam optimizer with a fixed learning rate and no weight decay.
- The content and motion networks are trained for approximately 1 million steps each, sequentially.
- Efficiency:
- For $N$ frames and $T$ diffusion steps, the total sampling cost is $N \cdot T$ U-Net forward passes.
- At the evaluated resolution, generating 16 frames with 1,000 steps per frame takes about 800 seconds on an A100 (a back-of-the-envelope check follows this list).
- Inference is commonly reduced to 50–100 steps using distillation or accelerated samplers.
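The per-forward cost implied by the numbers quoted above can be checked with simple arithmetic; the 50-step figure in the last line below is one of the accelerated-sampler settings mentioned, applied at the same per-forward cost.

```python
# Back-of-the-envelope check of the quoted sampling cost.
frames = 16
steps_per_frame = 1000
total_forwards = frames * steps_per_frame                   # 16,000 U-Net forward passes
total_seconds = 800                                          # reported wall-clock time on an A100
print(total_seconds / total_forwards)                        # ~0.05 s per U-Net forward

# With an accelerated sampler at 50 steps per frame, the same per-forward cost would give
print(frames * 50 * total_seconds / total_forwards)          # ~40 s for a 16-frame clip
```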
6. Summary Data Flow and Implementation Blueprint
Content Training: encode the first frame into the latent space, apply the forward noising process at a random timestep, and train the content denoiser with the Charbonnier objective.
Motion Training: encode subsequent frames, compute the motion latent $m$ (via SpyNet) and the adaptive residual $r$, and train the motion denoiser conditioned on $m$ and $r$ with the same objective.
Generation (Autoregressive): sample the first frame from pure noise with the content denoiser, then repeatedly sample each next frame with the motion denoiser conditioned on $m$, $r$, and the previously generated frames.
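A compact sketch of this autoregressive data flow. All module names (`content_denoiser`, `motion_denoiser`, `flow_net`, `residual_encoder`, `ddpm_sample`) are placeholder interfaces, the latent shape is illustrative, and how the motion latent is formed for the very first transition is an assumption; `ddpm_sample` is assumed to run the full reverse diffusion chain for a given denoiser.

```python
import torch

@torch.no_grad()
def generate_video(content_denoiser, motion_denoiser, flow_net, residual_encoder,
                   ddpm_sample, num_frames: int, latent_shape=(1, 128, 16, 16)):
    """Autoregressive latent video generation (all module names are placeholder interfaces)."""
    # 1) Content stage: the first frame latent is denoised from pure Gaussian noise.
    z_T = torch.randn(latent_shape)
    frames = [ddpm_sample(content_denoiser, z_T)]

    # 2) Motion stage: each subsequent frame is denoised conditioned on the motion latent m,
    #    the adaptive residual r, and the already generated frames.
    for _ in range(num_frames - 1):
        prev = frames[-2] if len(frames) > 1 else frames[-1]
        m = flow_net(prev, frames[-1])                        # motion latent from the latest frames
        r = residual_encoder(frames[0])                       # residual from the first frame (timestep handling folded into the sampler)
        z_T = torch.randn(latent_shape)
        frames.append(ddpm_sample(lambda z_t, t: motion_denoiser(z_t, t, m, r), z_T))

    return torch.stack(frames, dim=1)                         # (batch, time, channels, height, width)
```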
This modular backbone allows for experimentation with U-Net capacity, attention depth, noise schedule, and conditioning schemes.
7. Empirical Outcomes and Usability
Experiments demonstrate that VIDM significantly outperforms GAN-based methods on Fréchet Video Distance (FVD) and visual coherence, with improvements attributed to the four key strategies: latent-space diffusion, explicit motion/appearance separation, positional normalization, and the truncation/robustness enhancements. This enables tractable and scalable state-of-the-art video synthesis, providing a robust backbone for further research in latent video diffusion (Mei et al., 2022).