Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer
This paper presents a method for generating long videos by combining a Vector Quantized Generative Adversarial Network (VQGAN) with transformers, tailored to maintain quality and coherence over thousands of frames. The proposed system is named Time-Agnostic VQGAN and Time-Sensitive Transformer (TATS).
Methodology Overview
The approach begins by extending the 2D VQGAN to a 3D version that encodes videos along both spatial and temporal dimensions, producing a discrete, compressed latent representation for each video. A key challenge addressed by the paper is the temporal dependency introduced by zero padding in these convolutional networks: latent tokens near the temporal boundaries learn to rely on the padding values, which degrades quality when the model is asked to generate sequences longer than those seen during training. The authors replace zero padding with replicate padding so that the encoder becomes time-agnostic, a property that is crucial for applying a sliding-window strategy to generate long sequences.
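A minimal sketch of the padding change is shown below. This is not the authors' code: the layer and argument names are illustrative, and whether replicate padding is applied to all spatio-temporal boundaries or only the temporal axis is an assumption here; only the replace-zero-padding-with-replicate-padding idea is taken from the paper.

```python
import torch.nn as nn
import torch.nn.functional as F


class TimeAgnosticConv3d(nn.Module):
    """3D convolution that pads with replicate values instead of zeros,
    so boundary frames are not mixed with artificial zeros that encode
    absolute temporal position."""

    def __init__(self, in_ch, out_ch, kernel_size=3, stride=1):
        super().__init__()
        self.pad = kernel_size // 2
        # No built-in padding; we pad manually before the convolution.
        self.conv = nn.Conv3d(in_ch, out_ch, kernel_size, stride=stride, padding=0)

    def forward(self, x):  # x: (B, C, T, H, W)
        # Replicate padding on the last three dims (W, H, T).
        x = F.pad(x, (self.pad,) * 6, mode="replicate")
        return self.conv(x)
```

Because the padded values are copies of real boundary frames rather than zeros, the encoding of a clip does not depend on where it sits in the full video, which is what makes sliding-window encoding and decoding of arbitrarily long sequences consistent.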
For the generation process, the paper introduces a hierarchical transformer architecture. A first autoregressive transformer generates sparse latent frames that establish the global structure of the video. An interpolation transformer then autoregressively predicts the intermediate frames, conditioning on the sparse frames as anchors.
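The sketch below illustrates this two-stage sampling loop under stated assumptions: `ar_transformer` and `interp_transformer` are hypothetical stand-ins for the sampling routines of the two transformers, each returning one frame's worth of discrete token indices, and conditioning details are simplified relative to the paper.

```python
import torch


def generate_long_video_tokens(ar_transformer, interp_transformer,
                               num_anchor_frames, frames_between):
    """Hierarchical sampling sketch: lay down sparse anchor frames for global
    structure, then fill the frames between each pair of anchors."""
    # 1) Sample anchor-frame tokens autoregressively (global structure).
    anchors = []
    context = torch.empty(1, 0, dtype=torch.long)
    for _ in range(num_anchor_frames):
        frame = ar_transformer(context)            # (1, tokens_per_frame)
        anchors.append(frame)
        context = torch.cat([context, frame], dim=1)

    # 2) For each pair of neighbouring anchors, fill in the intermediate
    #    frames, conditioning on both anchors so motion stays consistent.
    video = [anchors[0]]
    for left, right in zip(anchors[:-1], anchors[1:]):
        cond = torch.cat([left, right], dim=1)
        filled = torch.empty(1, 0, dtype=torch.long)
        for _ in range(frames_between):
            frame = interp_transformer(cond, filled)   # (1, tokens_per_frame)
            filled = torch.cat([filled, frame], dim=1)
        video.append(filled)
        video.append(right)

    # Token indices to be decoded back to pixels by the 3D VQGAN decoder.
    return torch.cat(video, dim=1)
```

The design choice this illustrates is that the anchor transformer only has to model long-range structure over a short token sequence, while the interpolation transformer only ever works on a local segment, which keeps both context windows manageable even for very long videos.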
Results and Evaluation
The method is evaluated on benchmarks including the UCF-101, Sky Time-lapse, Taichi-HD, and AudioSet-Drum datasets. The paper reports state-of-the-art results for short video synthesis and robust performance for long video generation, sustaining visual quality and coherence across extended temporal spans.
The authors also propose metrics for assessing long video generation, such as the Class Coherence Score (CCS) and the Inception Coherence Score (ICS), which measure thematic and qualitative consistency throughout the generated sequences. The results indicate that TATS maintains higher coherence, and that quality degrades significantly later than in other models such as MoCoGAN-HD and DIGAN.
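The exact definitions of CCS and ICS are not reproduced here. Purely as an illustration of the general evaluation idea, the sketch below scores fixed-length windows of a long generated video with a pretrained video classifier (a hypothetical `classifier` callable) and tracks how each window's class distribution drifts from the first window; a falling similarity curve would indicate degrading thematic coherence over time.

```python
import torch
import torch.nn.functional as F


def coherence_over_time(video, classifier, window=16, stride=16):
    """Illustrative sketch (not the paper's exact CCS/ICS): compare each
    temporal window's predicted class distribution against the first window's."""
    # video: (C, T, H, W); classifier maps (1, C, window, H, W) -> class logits
    ref = None
    scores = []
    for start in range(0, video.shape[1] - window + 1, stride):
        clip = video[:, start:start + window].unsqueeze(0)
        probs = torch.softmax(classifier(clip), dim=-1)
        if ref is None:
            ref = probs
        # Cosine similarity between this window's class distribution
        # and the first window's distribution.
        scores.append(F.cosine_similarity(probs, ref).item())
    return scores
```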
Implications and Future Directions
The research contributes to advancing long video generation technologies, emphasizing temporal coherence and thematic fidelity. The methodology and insights regarding padding strategies and hierarchical architectures could be instrumental in applications ranging from autonomous video generation to creative storytelling and virtual reality.
Future work could explore richer hierarchical models or the incorporation of conditional information with stronger narrative structure directly into the generative process, enabling videos with dynamic storytelling elements. Additionally, optimizing transformer inference to reduce sampling time would further enhance practical applicability.
In conclusion, this paper broadens the understanding and capability of long video synthesis using generative models, setting a foundation for continued exploration into high-fidelity video generation with complex temporal dependencies.