
CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer (2408.06072v2)

Published 12 Aug 2024 in cs.CV

Abstract: We present CogVideoX, a large-scale text-to-video generation model based on a diffusion transformer, which can generate 10-second continuous videos aligned with a text prompt, at a frame rate of 16 fps and a resolution of 768 × 1360 pixels. Previous video generation models often had limited movement and short durations, and it was difficult to generate videos with coherent narratives based on text. We propose several designs to address these issues. First, we propose a 3D Variational Autoencoder (VAE) to compress videos along both spatial and temporal dimensions, improving both the compression rate and video fidelity. Second, to improve text-video alignment, we propose an expert transformer with expert adaptive LayerNorm to facilitate deep fusion between the two modalities. Third, by employing progressive training and a multi-resolution frame pack technique, CogVideoX is adept at producing coherent, long-duration videos of varying shapes characterized by significant motion. In addition, we develop an effective text-video data processing pipeline that includes various data preprocessing strategies and a video captioning method, greatly contributing to generation quality and semantic alignment. Results show that CogVideoX demonstrates state-of-the-art performance across both multiple machine metrics and human evaluations. The model weights of the 3D Causal VAE, the video captioning model, and CogVideoX are publicly available at https://github.com/THUDM/CogVideo.

CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer

The paper "CogVideoX: Text-to-Video Diffusion Models with An Expert Transformer" introduces CogVideoX, a sophisticated text-to-video generation model leveraging large-scale diffusion transformers. The proposed system integrates a plethora of advancements in video modeling architecture, data processing pipelines, and training methodologies tailored for long-term temporally consistent video generation. Below is a nuanced exploration into the core contributions, results, and forward-looking implications of this model.

Core Contributions

The authors of CogVideoX tackle several significant challenges in text-to-video generation, namely efficient video data modeling, superior text-video alignment, and the creation of high-quality text-video datasets for effective model training. To address these issues, the paper presents three key technological developments: the 3D Variational Autoencoder (VAE), the Expert Transformer with adaptive LayerNorm, and a comprehensive text-video data processing pipeline.

3D Causal VAE:

The 3D VAE designed for CogVideoX compresses video data not only spatially but also temporally, which significantly reduces the sequence length and training computational requirements. This approach also alleviates the flicker often observed in videos generated by models using 2D VAEs. By employing temporally causal convolutions and context parallelism, the 3D VAE maintains temporal causality while managing large video datasets efficiently.
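To make the causality constraint concrete, the sketch below shows a temporally causal 3D convolution of the kind such a VAE is built from: spatial padding is symmetric, while temporal padding is applied only toward the past, so no frame depends on future frames. The class name and hyperparameters are illustrative assumptions, not the released implementation.

```python
# Minimal sketch of a temporally causal 3D convolution (assumed building block
# of a 3D Causal VAE); names and sizes are illustrative, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv3d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size=3):
        super().__init__()
        self.time_pad = kernel_size - 1    # pad only toward the past along time
        self.space_pad = kernel_size // 2  # symmetric padding in space
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size)

    def forward(self, x):
        # x: (batch, channels, frames, height, width)
        # Pad width/height symmetrically, but time only on the left (past),
        # so frame t never sees information from frames > t.
        x = F.pad(
            x,
            (self.space_pad, self.space_pad,   # width
             self.space_pad, self.space_pad,   # height
             self.time_pad, 0),                # time: causal (past only)
        )
        return self.conv(x)

# Example: a 17-frame clip at 64x64 keeps its frame count and causal structure.
x = torch.randn(1, 3, 17, 64, 64)
y = CausalConv3d(3, 16)(x)
print(y.shape)  # torch.Size([1, 16, 17, 64, 64])
```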

Expert Transformer:

For the enhancement of text-video alignment, the authors propose an Expert Transformer with Expert Adaptive LayerNorm (AdaLN). This component facilitates the deep fusion between video and text modalities, thus improving the coherence and quality of the generated videos. The model applies 3D full attention to accommodate large temporal motions, diverging from the commonly used separated spatial and temporal attention mechanisms that often lead to inferior video consistency.
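The following is a hedged sketch of how expert adaptive LayerNorm and 3D full attention can fit together: text and video tokens are normalized, modulated by modality-specific (shift, scale) "experts" conditioned on the timestep embedding, and then processed by a single attention pass over the concatenated sequence. Module names and dimensions are assumptions for illustration, not the paper's exact architecture.

```python
# Hedged sketch of an expert-AdaLN transformer block with joint (3D full)
# attention over text and video tokens; shapes and names are assumptions.
import torch
import torch.nn as nn

class ExpertAdaLNBlock(nn.Module):
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        # Separate "experts" predict (shift, scale) per modality from the timestep embedding.
        self.text_adaln = nn.Linear(dim, 2 * dim)
        self.video_adaln = nn.Linear(dim, 2 * dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, text_tokens, video_tokens, t_emb):
        # Modality-specific modulation of the shared (affine-free) LayerNorm output.
        t_shift, t_scale = self.text_adaln(t_emb).chunk(2, dim=-1)
        v_shift, v_scale = self.video_adaln(t_emb).chunk(2, dim=-1)
        text = self.norm(text_tokens) * (1 + t_scale.unsqueeze(1)) + t_shift.unsqueeze(1)
        video = self.norm(video_tokens) * (1 + v_scale.unsqueeze(1)) + v_shift.unsqueeze(1)
        # Full attention over the concatenated sequence: every text token and
        # every video patch token attend to each other in one pass.
        seq = torch.cat([text, video], dim=1)
        out, _ = self.attn(seq, seq, seq)
        return out

# Example shapes: 226 text tokens, 1024 video patch tokens, hidden size 512.
block = ExpertAdaLNBlock(512)
out = block(torch.randn(2, 226, 512), torch.randn(2, 1024, 512), torch.randn(2, 512))
print(out.shape)  # torch.Size([2, 1250, 512])
```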

Progressive and Mixed-Duration Training:

In training, the model employs mixed-duration training by packing videos of varying lengths into the same batch, a technique referred to as Frame Pack. Together with progressive training across resolutions, this enables the model to generalize effectively to different video lengths and resolutions. Additionally, Explicit Uniform Sampling of diffusion timesteps yields a more stable training loss curve, further improving training efficiency.
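As a concrete illustration of Explicit Uniform Sampling, the sketch below partitions the diffusion timestep range into one interval per data-parallel rank and has each rank sample only within its own interval, keeping the timesteps covered at every optimization step close to uniform. Function and variable names are illustrative assumptions.

```python
# Hedged sketch of explicit uniform timestep sampling across data-parallel ranks.
import torch

def explicit_uniform_timesteps(batch_size, rank, world_size, num_train_timesteps=1000):
    interval = num_train_timesteps / world_size
    low = int(round(rank * interval))
    high = int(round((rank + 1) * interval))
    # Each rank draws only from its own slice of the diffusion timestep range,
    # so the union over ranks covers [0, num_train_timesteps) evenly per step.
    return torch.randint(low, high, (batch_size,))

# Example: 4 ranks, 1000 timesteps -> rank 2 samples from [500, 750).
print(explicit_uniform_timesteps(batch_size=8, rank=2, world_size=4))
```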

Empirical Evaluation

The CogVideoX model exhibits state-of-the-art performance across numerous automated metrics. Specifically, it excels in metrics such as Human Action, Scene, Dynamic Degree, Multiple Objects, Appearance Style, Dynamic Quality, and GPT4o-MTScore. Comparative evaluations indicate that CogVideoX surpasses multiple leading models including T2V-Turbo, AnimateDiff, VideoCrafter-2.0, OpenSora V1.2, Show-1, Gen-2, Pika, and LaVie-2.

Table 1 in the paper illustrates the quantitative performance of CogVideoX, with the 5B parameter model notably outperforming existing models in several evaluation categories. Additionally, human evaluations covering aspects like Sensory Quality, Instruction Following, Physics Simulation, and Cover Quality further affirm the superiority of CogVideoX over competitors, reflecting its capability to generate videos that are not only visually coherent but also semantically aligned with the input prompts.

Implications and Future Directions

CogVideoX's prominent achievements underscore several critical implications for both theoretical and practical advancements in AI-driven video generation:

Practical Applications:

The model's capability to generate coherent, high-quality videos from text prompts has potential applications in a variety of domains including entertainment, education, marketing, and content creation. The open availability of the model weights for the 3D Causal VAE, the video captioning model, and CogVideoX also facilitates reproducibility and further exploration by the research community.
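As an illustration of how the released weights can be used in practice, below is a minimal inference sketch assuming the Hugging Face diffusers integration (CogVideoXPipeline); the linked repository also provides the authors' own inference scripts, and the specific settings shown here (frame count, guidance scale, steps) are assumptions rather than recommended values.

```python
# Minimal inference sketch, assuming the diffusers CogVideoXPipeline integration
# of the released weights; requires a CUDA GPU with sufficient memory.
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.to("cuda")

prompt = "A panda playing a guitar by a campfire at night, cinematic lighting."
video_frames = pipe(
    prompt=prompt,
    num_frames=49,            # clip length in decoded frames (assumed setting)
    num_inference_steps=50,   # diffusion denoising steps
    guidance_scale=6.0,       # classifier-free guidance strength
).frames[0]

export_to_video(video_frames, "cogvideox_sample.mp4", fps=8)
```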

Theoretical Advancements:

The integration of 3D VAEs and Expert Transformers sets a new benchmark in the efficiency and effectiveness of video diffusion models. Future research can build on these foundations to further enhance model architecture and training routines, potentially extending these techniques to even higher resolutions and longer video durations.

Future Developments:

Looking ahead, scaling CogVideoX by training on larger datasets and refining the model could lead to even more impressive results. In particular, exploring the scaling laws of video generation and focusing on generating longer and higher-resolution videos can push the boundaries of text-to-video generation capabilities.

In conclusion, the paper presents CogVideoX as a robust advancement in the field of text-to-video generation, showcasing both practical and theoretical innovations that pave the way for future exploration and application in multimedia AI.

Authors (19)
  1. Zhuoyi Yang
  2. Jiayan Teng
  3. Wendi Zheng
  4. Ming Ding
  5. Shiyu Huang
  6. Jiazheng Xu
  7. Yuanming Yang
  8. Wenyi Hong
  9. Xiaohan Zhang
  10. Guanyu Feng
  11. Da Yin
  12. Xiaotao Gu
  13. Yuxuan Zhang
  14. Weihan Wang
  15. Yean Cheng
  16. Ting Liu
  17. Bin Xu
  18. Yuxiao Dong
  19. Jie Tang
Citations (105)