LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models
The paper "LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models" addresses a central challenge in generative modeling: text-to-video (T2V) generation. The proposed framework, LaVie, synthesizes visually realistic and temporally coherent videos by building on pre-trained text-to-image (T2I) models while preserving their creative generation capabilities.
Core Contributions
The authors propose a video generation framework built on cascaded latent diffusion models: a base T2V model, a temporal interpolation model, and a video super-resolution model. This design avoids training T2V systems from scratch, which requires vast computational resources to capture spatio-temporal joint distributions effectively.
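As a rough illustration of how the three stages compose, the sketch below chains hypothetical modules in the order the paper describes. All function names, signatures, and shape comments are assumptions for illustration, not the authors' API.

```python
import torch

# Hypothetical three-stage cascade mirroring the paper's design:
#   (1) base T2V model    -> short, low-frame-rate latent video,
#   (2) interpolation model -> higher frame rate,
#   (3) super-resolution model -> higher spatial resolution.
# All names, signatures, and shape comments are illustrative assumptions.

@torch.no_grad()
def generate_video(prompt: str, base_t2v, interpolator, upscaler, decoder):
    base_latents = base_t2v(prompt)             # (B, C, T, h, w) latent clip
    dense_latents = interpolator(base_latents)  # more frames: (B, C, T', h, w)
    frames = decoder(dense_latents)             # VAE decode to pixel space
    return upscaler(frames)                     # upscale to final resolution
```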
The authors emphasize two primary insights:
- Temporal correlation capture: Simple temporal self-attention combined with rotary positional encoding (RoPE) effectively captures the temporal correlations inherent in video data, without requiring significant architectural changes (see the first sketch after this list).
- Joint image-video fine-tuning: Fine-tuning jointly on images and videos is pivotal for producing high-quality, creative outputs. It prevents catastrophic forgetting of the T2I model's knowledge and enables concept mixing, effectively transferring large-scale image knowledge to video (see the second sketch after this list).
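To make the first insight concrete, here is a minimal PyTorch sketch of temporal self-attention with RoPE applied along the frame axis. The class, tensor shapes, and interleaved pairing convention are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

def apply_rope(x: torch.Tensor) -> torch.Tensor:
    """Rotary positional encoding along the frame axis.

    x: (batch, heads, frames, head_dim), head_dim assumed even.
    """
    b, h, t, d = x.shape
    # Standard RoPE frequencies: theta_i = 10000^(-2i/d).
    freqs = 10000.0 ** (-torch.arange(0, d, 2, device=x.device).float() / d)
    angles = torch.arange(t, device=x.device).float()[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()        # each (frames, head_dim/2)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    # Rotate each 2D feature pair by its frame-dependent angle.
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(-2)

class TemporalSelfAttention(nn.Module):
    """Attend across frames at each spatial location (illustrative sketch)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.heads, self.head_dim = heads, dim // heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch * spatial_positions, frames, dim)
        b, t, _ = x.shape
        q, k, v = self.to_qkv(x).chunk(3, dim=-1)
        q, k, v = (y.view(b, t, self.heads, self.head_dim).transpose(1, 2)
                   for y in (q, k, v))
        q, k = apply_rope(q), apply_rope(k)  # encode frame order via rotation
        out = nn.functional.scaled_dot_product_attention(q, k, v)
        return self.proj(out.transpose(1, 2).reshape(b, t, -1))
```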
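The second insight can be sketched as a training step that mixes the two data sources: treating images as single-frame videos lets one denoising objective cover both modalities. The `denoising_loss` method and the mixing weight below are hypothetical, shown only to indicate the overall shape of such a step.

```python
import torch

def joint_finetune_step(model, video_batch, image_batch, image_weight=0.5):
    """One hedged sketch of a joint image-video fine-tuning step.

    video_batch: (B, C, T, H, W); image_batch: (B, C, H, W).
    `model.denoising_loss` is a hypothetical diffusion training loss.
    """
    pseudo_videos = image_batch.unsqueeze(2)          # (B, C, 1, H, W)
    loss_video = model.denoising_loss(video_batch)    # loss on real clips
    loss_image = model.denoising_loss(pseudo_videos)  # images as 1-frame clips
    loss = loss_video + image_weight * loss_image     # assumed mixing scheme
    loss.backward()
    return loss.detach()
```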
Dataset Contributions
To support LaVie, the authors introduce Vimeo25M, a dataset of 25 million high-resolution text-video pairs curated for quality, diversity, and aesthetic appeal. It surpasses prior datasets such as WebVid10M in both resolution and aesthetic scores.
Experimental Results
Experimental evaluations show that LaVie achieves state-of-the-art T2V performance both quantitatively (metric-based evaluation) and qualitatively (visual comparison with existing models such as Make-A-Video and VideoLDM). Generated samples exhibit notable visual fidelity and temporal coherence.
The paper also demonstrates two downstream applications of the LaVie models: generating longer video sequences and personalized video generation. These results highlight LaVie's ability to adapt to varied generation requirements.
Limitations and Future Directions
While LaVie marks a significant stride in T2V generation, the authors acknowledge limitations: generating scenes with multiple subjects and rendering fine human details such as hands remain difficult. Future work targets synthesizing longer, more complex videos with cinematic quality, guided by intricate textual descriptions.
Implications and Future Work
The implications of this research are broad, spanning filmmaking, interactive media, and virtual content creation. The methodological framework and dataset offer significant potential for further development of AI-driven video synthesis. Integrating advanced LLMs for nuanced text understanding and improving multi-subject handling are viable pathways for extending LaVie's capabilities.
In summary, the paper presents a robust approach to high-quality video generation that stands on the shoulders of successful T2I models, introducing methods and a dataset that address longstanding challenges in T2V synthesis. LaVie represents a significant technological advance, laying a foundation for future exploration and innovation in AI-assisted video generation.