An Overview of CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers
The paper "CogVideo: Large-scale Pretraining for Text-to-Video Generation via Transformers" presents a notable advancement in the application of large-scale pretrained transformers, specifically targeting the domain of text-to-video generation. The paper confronts prevalent challenges such as computational expense and the scarcity of quality text-video datasets, by inheriting and fine-tuning a previously successful text-to-image model, CogView2, to develop CogVideo. This initiative positions CogVideo as a significant contribution to the domain of text-to-video generation, outpacing existing models in terms of performance metrics.
Key Contributions and Methodology
- Utilization of Pretrained Models: One of the pivotal contributions of this research is leveraging a large-scale pretrained text-to-image model to bootstrap text-to-video generation, sidestepping the cost of training from scratch. By adapting CogView2, CogVideo inherits substantial pre-existing knowledge of spatial semantics, which is crucial for translating textual cues into coherent video frames.
- Hierarchical Multi-frame-rate Training Strategy: To keep textual prompts aligned with video semantics, the paper proposes a multi-frame-rate hierarchical training strategy. The frame rate is varied so that a clip of fixed length can cover the full action described in the text, and generation proceeds hierarchically from low-frame-rate key frames to denser interpolated frames (a minimal conditioning sketch follows this list).
- Dual-channel Attention Mechanism: The proposed dual-channel attention lets CogVideo add temporal attention without overwriting the spatial knowledge inherited from CogView2: a frozen pretrained spatial channel is paired with a newly trained temporal channel, so the model captures both static appearance and motion (see the attention-block sketch after this list).
- Shifted Window Attention for Efficiency: The research adapts Swin-style shifted window attention to autoregressive video generation. Restricting each token's attention to local, shifted windows lets parts of different frames be generated in parallel, reducing both memory use and generation time (the toy mask constructor after this list illustrates the idea).
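To make the frame-rate conditioning concrete, the following is a minimal sketch in PyTorch. The names FRAME_RATE_TOKENS, build_prompt, hierarchical_generate, and the caller-supplied sampler generate_tokens are all illustrative assumptions, not CogVideo's actual vocabulary or API, and the two-stage flow simplifies the paper's recursive interpolation.

```python
# Illustrative sketch only: FRAME_RATE_TOKENS and generate_tokens are
# hypothetical placeholders, not CogVideo's real vocabulary or interface.
import torch

# Assumed special-token ids marking how many frames per second the clip samples.
FRAME_RATE_TOKENS = {1: 50001, 2: 50002, 4: 50003, 8: 50004}

def build_prompt(text_ids: torch.Tensor, fps: int) -> torch.Tensor:
    """Prepend a frame-rate token so the transformer is explicitly conditioned
    on how densely the generated clip samples the described action."""
    marker = torch.tensor([FRAME_RATE_TOKENS[fps]], dtype=text_ids.dtype)
    return torch.cat([marker, text_ids], dim=0)

def hierarchical_generate(text_ids: torch.Tensor, generate_tokens):
    """Two-stage generation: first produce key frames at a low frame rate so
    the whole action fits into a short clip, then condition on those key
    frames and re-generate at a higher frame rate to fill in motion detail."""
    key_frames = generate_tokens(build_prompt(text_ids, fps=1))
    dense_clip = generate_tokens(build_prompt(text_ids, fps=4),
                                 prefix_frames=key_frames)
    return dense_clip
```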
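The dual-channel idea can likewise be sketched as a single attention block. The sketch below assumes the spatial channel is an ordinary multi-head attention whose weights would be copied from the pretrained image model and frozen, while the temporal channel is trained from scratch; the learnable sigmoid gate used to mix the two outputs is one plausible choice, not necessarily the paper's exact formulation.

```python
import torch
import torch.nn as nn

class DualChannelAttention(nn.Module):
    """Frozen pretrained spatial attention plus a trainable temporal channel,
    mixed by a learnable gate (an assumed mixing scheme for illustration)."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        # Spatial channel: in practice its weights would be copied from the
        # pretrained text-to-image model (CogView2) and kept frozen.
        self.spatial_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        for p in self.spatial_attn.parameters():
            p.requires_grad = False
        # Temporal channel: newly initialized and trained on video data.
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gate starts strongly negative so the block initially behaves almost
        # exactly like the pretrained spatial attention.
        self.alpha = nn.Parameter(torch.tensor(-4.0))

    def forward(self, x: torch.Tensor, attn_mask=None) -> torch.Tensor:
        # x: (batch, num_tokens, dim), tokens covering all frames of the clip.
        spatial_out, _ = self.spatial_attn(x, x, x, attn_mask=attn_mask)
        temporal_out, _ = self.temporal_attn(x, x, x, attn_mask=attn_mask)
        gate = torch.sigmoid(self.alpha)
        return (1 - gate) * spatial_out + gate * temporal_out
```

In actual use, attn_mask would enforce the autoregressive ordering over frame and pixel tokens; freezing the spatial channel is what preserves the image-level knowledge while only the temporal channel and the gate are updated.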
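Finally, the efficiency argument behind the shifted-window attention can be illustrated with a toy mask constructor. This is a simplified 1D analogue, not the paper's actual 3D formulation: window and shift are made-up hyperparameters, and each token may only see a causal prefix of its own frame plus a shifted local window in the previous frame, which is what allows tokens of different frames to be computed partly in parallel.

```python
import torch

def shifted_window_mask(num_frames: int, tokens_per_frame: int,
                        window: int, shift: int) -> torch.Tensor:
    """Boolean mask of shape (F*L, F*L); True means the query token (row)
    may attend to the key token (column)."""
    F, L = num_frames, tokens_per_frame
    mask = torch.zeros(F * L, F * L, dtype=torch.bool)
    for f in range(F):
        for i in range(L):
            q = f * L + i
            # Causal attention over the prefix of the current frame.
            mask[q, f * L: f * L + i + 1] = True
            if f > 0:
                # A local window in the previous frame, shifted a little more
                # for each later frame.  Because a token never needs the whole
                # previous frame, generation of the next frame can begin before
                # the previous one is finished, enabling partial parallelism.
                center = (i + f * shift) % L
                lo = max(0, center - window // 2)
                hi = min(L, center + window // 2 + 1)
                mask[q, (f - 1) * L + lo: (f - 1) * L + hi] = True
    return mask

# Example: 4 frames of 16 tokens each, a 5-token window, and a shift of 2.
mask = shifted_window_mask(num_frames=4, tokens_per_frame=16, window=5, shift=2)
```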
Results and Evaluation
CogVideo's efficacy is established through extensive machine and human evaluations. The model substantially outperforms publicly available models such as TGANv2 and VideoGPT on benchmarks including UCF-101 and Kinetics-600. Human evaluations likewise rate CogVideo highly on frame texture quality and semantic relevance.
Implications and Future Directions
This research holds considerable implications for future work on AI video generation. By demonstrating a robust method for adapting pretrained models from one domain (text-to-image) to another (text-to-video), it opens pathways for more efficient cross-domain reuse of large models. Moreover, the multi-frame-rate hierarchical training may prompt further exploration of adaptive frame-rate strategies for improving semantic consistency in generated video.
The successful deployment of CogVideo also highlights potential reductions in the computational resources needed to train large generative models. This efficiency can broaden access to state-of-the-art video generation across varied applications, from automated film production to virtual reality environments.
While CogVideo sets a new performance standard, future exploration could focus on further improving the model's capacity to handle lengthy or particularly complex action sequences, overcoming current limitations related to sequence length and model scale. Furthermore, concerns about misinformation through highly realistic video synthesis could spark research into robust detection mechanisms to safeguard against malicious uses.
In conclusion, the CogVideo framework represents a significant step forward in video synthesis, particularly in its methodological choices and its strategic reuse of pretrained models, with implications for both AI research and practical applications.