Latent Video Diffusion Models for High-Fidelity Long Video Generation

Published 23 Nov 2022 in cs.CV and cs.AI | (2211.13221v2)

Abstract: AI-generated content has attracted lots of attention recently, but photo-realistic video synthesis is still challenging. Although many attempts using GANs and autoregressive models have been made in this area, the visual quality and length of generated videos are far from satisfactory. Diffusion models have shown remarkable results recently but require significant computational resources. To address this, we introduce lightweight video diffusion models by leveraging a low-dimensional 3D latent space, significantly outperforming previous pixel-space video diffusion models under a limited computational budget. In addition, we propose hierarchical diffusion in the latent space such that longer videos with more than one thousand frames can be produced. To further overcome the performance degradation issue for long video generation, we propose conditional latent perturbation and unconditional guidance that effectively mitigate the accumulated errors during the extension of video length. Extensive experiments on small domain datasets of different categories suggest that our framework generates more realistic and longer videos than previous strong baselines. We additionally provide an extension to large-scale text-to-video generation to demonstrate the superiority of our work. Our code and models will be made publicly available.


Authors (5)

Citations (125)


Summary

The paper presents a novel hierarchical latent video diffusion model that leverages a 3D video autoencoder and diffusion processes to generate high-fidelity, long-duration videos.
The model reduces computational load by operating in a low-dimensional latent space and uses hierarchical generation to mitigate error accumulation over extended sequences.
Experimental results show significant improvements over previous approaches, with robust performance in both short and long video generation and applicability to text-to-video synthesis.


Introduction

The paper "Latent Video Diffusion Models for High-Fidelity Long Video Generation" addresses the challenge of generating videos that are both high-quality and long, a need in AI-generated content, gaming, and film production. GANs and autoregressive models have fallen short on both visual fidelity and video length, owing to limitations such as training instability and mode collapse. The paper introduces the Latent Video Diffusion Model (LVDM) framework, which uses a low-dimensional 3D latent space and hierarchical diffusion to generate longer videos without prohibitive computational cost.

Methodology

Hierarchical Latent Video Diffusion Model (LVDM)

The LVDM framework introduces a novel hierarchical generative architecture to overcome the constraints of existing diffusion models.

Video Autoencoder: At the heart of LVDM is a 3D autoencoder that compresses a video into a low-dimensional latent representation. Running diffusion in this latent space rather than in pixel space significantly reduces the computational load.
Diffusion Processes: The diffusion process gradually adds noise to the latent video representations at each timestep, allowing the reverse process to learn to synthesize video by denoising in latent space.
Hierarchical Generation: By adopting a hierarchical setup, LVDM generates sparse frames initially and then employs an interpolation model to infill missing frames. This approach helps manage performance degradation typical in long autoregressive video prediction tasks (Figure 1).
Figure 1: Hierarchical LVDM Framework illustrating the generation of longer videos beyond initial training constraints.
Conditional Latent Perturbation: To limit error accumulation as a video is extended, LVDM perturbs the latents it conditions on by adding a small amount of noise, which makes prediction less sensitive to imperfections in previously generated frames.
Unconditional Guidance: LVDM also uses an unconditional prediction to guide conditional generation, which improves coherence in long video prediction and reduces the impact of accumulated errors over time.
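The components listed above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the linear noise schedule, the latent shape, and the guidance weighting (written in the common classifier-free-guidance form) are all assumptions chosen for illustration, and the paper's exact formulation may differ.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative signal level for a linear beta schedule (assumed schedule)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def q_sample(z0, t, alpha_bar, noise):
    """Forward diffusion: noise a clean video latent z0 to timestep t."""
    return np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * noise

def perturb_condition(z_cond, s, alpha_bar, rng):
    """Conditional latent perturbation (sketch): noise the conditioning
    latents to a small timestep s before they are fed to the model."""
    return q_sample(z_cond, s, alpha_bar, rng.standard_normal(z_cond.shape))

def guided_noise(eps_cond, eps_uncond, w):
    """Guidance (assumed classifier-free-guidance form): w=0 gives the
    unconditional prediction, w=1 the conditional one, w>1 extrapolates."""
    return eps_uncond + w * (eps_cond - eps_uncond)

def hierarchical_plan(total_frames, stride):
    """Hierarchical generation (sketch): sparse keyframe indices produced
    first, and the remaining indices left for an interpolation model."""
    keyframes = list(range(0, total_frames, stride))
    missing = [i for i in range(total_frames) if i not in set(keyframes)]
    return keyframes, missing

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = make_alpha_bar()
    # Hypothetical latent layout: (channels, frames, height, width).
    z0 = rng.standard_normal((4, 16, 32, 32))
    zt = q_sample(z0, 500, alpha_bar, rng.standard_normal(z0.shape))
    z_cond = perturb_condition(z0[:, :4], 10, alpha_bar, rng)
    print(zt.shape, z_cond.shape, hierarchical_plan(9, 4))
```

In a real system the two noise predictions passed to `guided_noise` would come from a trained denoising network evaluated with and without the conditioning latents; here they are plain arrays so the arithmetic can be checked in isolation.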

Experimental Results

The paper validates LVDM empirically against state-of-the-art video generation techniques on datasets such as UCF-101, Sky Timelapse, and Taichi.

Short Video Generation: The paper reports significant improvements over prior models, with markedly lower Fréchet Video Distance (FVD) and Kernel Video Distance (KVD) across benchmarks and resolutions. Lower values are better for both metrics.
Figure 2: Qualitative comparison highlighting LVDM's superior spatiotemporal consistency versus state-of-the-art models in short video synthesis.
Long Video Generation: LVDM outperforms TATS when generating videos of more than 1000 frames, with negligible degradation in evaluation metrics over time (Figure 3, Figure 4).
Figure 3: LVDM's long video generation showing improved quality retention compared to TATS.

Figure 4: Quantitative analysis illustrating LVDM's performance over TATS on extended video generation tasks.
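For readers unfamiliar with FVD, the quantity behind it is the Fréchet distance between two Gaussians fitted to feature vectors of real and generated videos. The sketch below shows only that distance computation. It assumes features have already been extracted by some pretrained video network (that extraction step is not shown), and the function names are hypothetical rather than taken from the paper's code.

```python
import numpy as np

def gaussian_stats(features):
    """Mean and covariance of an (N, D) array of per-video feature vectors."""
    mu = features.mean(axis=0)
    sigma = np.cov(features, rowvar=False)
    return mu, sigma

def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * sqrt(sigma1 @ sigma2))."""
    diff = mu1 - mu2
    # Tr(sqrt(A)) equals the sum of square roots of A's eigenvalues; for a
    # product of covariance matrices these are real and non-negative up to
    # numerical error, so small negative or imaginary parts are discarded.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * tr_sqrt)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    real = rng.standard_normal((256, 8))         # placeholder "real" features
    fake = rng.standard_normal((256, 8)) + 0.5   # placeholder "generated" features
    print(frechet_distance(*gaussian_stats(real), *gaussian_stats(fake)))
```

A distance of zero means the two fitted Gaussians coincide, which is why lower FVD indicates generated videos whose feature statistics are closer to those of real videos.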

Extension to Text-to-Video

The paper also extends LVDM to text-to-video synthesis, building on pre-trained text-to-image models to generate coherent video sequences from textual descriptions. The extension is validated on subsets of datasets such as WebVid, demonstrating the framework's applicability in multimodal settings (Figure 5).

Figure 5: Results of LVDM's extension to text-guided video generation.

Conclusion

The LVDM framework offers an efficient and scalable approach to generating high-fidelity, long-duration videos. By running diffusion in a compact latent space and generating hierarchically, it extends video generation beyond what prior models achieved under a comparable computational budget. The approach also carries over to text-to-video generation, and it leaves room for further gains in training efficiency and architecture design.
