GODIVA: Generating Open-DomaIn Videos from nAtural Descriptions (2104.14806v1)

Published 30 Apr 2021 in cs.CV

Abstract: Generating videos from text is a challenging task due to its high computational requirements for training and infinite possible answers for evaluation. Existing works typically experiment on simple or small datasets, where the generalization ability is quite limited. In this work, we propose GODIVA, an open-domain text-to-video pretrained model that can generate videos from text in an auto-regressive manner using a three-dimensional sparse attention mechanism. We pretrain our model on Howto100M, a large-scale text-video dataset that contains more than 136 million text-video pairs. Experiments show that GODIVA not only can be fine-tuned on downstream video generation tasks, but also has a good zero-shot capability on unseen texts. We also propose a new metric called Relative Matching (RM) to automatically evaluate the video generation quality. Several challenges are listed and discussed as future work.

GODIVA: Generating Open-Domain Videos from Natural Descriptions

The task of text-to-video (T2V) generation presents unique challenges given the high computational cost and the variability of potential outputs. The research paper "GODIVA: Generating Open-Domain Videos from Natural Descriptions" introduces GODIVA, a pioneering model designed to address these challenges by leveraging a large-scale dataset and adopting a novel architecture.

Key Contributions

  1. Pre-training and Dataset Utilization: GODIVA is trained on the extensive Howto100M dataset, comprising over 136 million text-video pairs. This large dataset provides a robust foundation for pretraining, enhancing the model's generalization capabilities and allowing for effective fine-tuning on specific tasks.
  2. Architectural Framework: GODIVA uses a variant of the Vector Quantized Variational Autoencoder (VQ-VAE) to encode video frames into discrete tokens (see the tokenization sketch after this list). These tokens form the basis for an autoregressive model that generates open-domain videos and incorporates a three-dimensional sparse attention mechanism, which narrows the range of dependencies each token must attend to, reducing computational load while maintaining performance.
  3. Evaluation Metrics: The paper presents a novel Relative Matching (RM) metric specifically designed for automated evaluation of generated video quality. This metric accounts for both the visual fidelity and the semantic alignment of generated content with the input text.
  4. Zero-shot Capabilities: Beyond fine-tuning on domain-specific tasks, GODIVA demonstrates strong zero-shot performance, producing coherent videos for previously unseen text inputs and showcasing its potential for broad application.
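
To make the tokenization step concrete, the following is a minimal sketch of how a VQ-VAE encoder might map video frames to discrete codebook indices. The module names, layer sizes, and codebook size are illustrative assumptions, not the configuration reported in the paper.

```python
# Hedged sketch: how a VQ-VAE-style encoder might turn video frames into
# discrete tokens. Shapes and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn

class FrameTokenizer(nn.Module):
    def __init__(self, num_codes=10000, code_dim=256):
        super().__init__()
        # Convolutional encoder that downsamples each frame to a grid of latents.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, code_dim, 4, stride=2, padding=1),
        )
        # Learnable codebook: each spatial latent is snapped to its nearest code.
        self.codebook = nn.Embedding(num_codes, code_dim)

    def forward(self, frames):                      # frames: (B, T, 3, H, W)
        b, t, c, h, w = frames.shape
        z = self.encoder(frames.view(b * t, c, h, w))  # (B*T, D, h', w')
        d = z.shape[1]
        z = z.permute(0, 2, 3, 1).reshape(-1, d)       # (B*T*h'*w', D)
        # Nearest-neighbour lookup in the codebook yields discrete token ids.
        dist = torch.cdist(z, self.codebook.weight)
        tokens = dist.argmin(dim=-1)
        return tokens.view(b, t, -1)                   # (B, T, tokens per frame)
```

The resulting grid of token ids per frame is what the downstream autoregressive transformer consumes, in place of raw pixels.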

Implementation and Results

GODIVA's architecture leverages a pretrained VQ-VAE to encode continuous pixel data into discrete tokens. This discrete representation serves as the input to a text-conditioned transformer that employs a sparse attention mechanism spanning the spatial and temporal dimensions, substantially improving computational efficiency. The attention focuses on regions relevant to the input text, keeping video generation coherent across frames.
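
As an illustration of what such a three-axis sparse pattern could look like, the sketch below builds a boolean attention mask in which each video token attends to the text tokens plus the tokens sharing its row, its column, or its spatial position across frames. The exact connectivity used in GODIVA may differ; this is a hedged reconstruction, not the paper's implementation.

```python
# Hedged sketch of a three-dimensional sparse attention mask.
# Assumption: each video token attends to (a) all text tokens, (b) tokens in
# its own row, (c) tokens in its own column, and (d) the same spatial position
# in other frames, restricted to a causal (autoregressive) ordering.
import torch

def sparse_attention_mask(n_text, n_frames, height, width):
    n_video = n_frames * height * width
    n = n_text + n_video
    mask = torch.zeros(n, n, dtype=torch.bool)
    mask[:, :n_text] = True  # every position can attend to the conditioning text

    def idx(t, r, c):  # flat index of the video token at (frame t, row r, col c)
        return n_text + (t * height + r) * width + c

    for t in range(n_frames):
        for r in range(height):
            for c in range(width):
                q = idx(t, r, c)
                mask[q, [idx(t, r, cc) for cc in range(width)]] = True     # same row
                mask[q, [idx(t, rr, c) for rr in range(height)]] = True    # same column
                mask[q, [idx(tt, r, c) for tt in range(n_frames)]] = True  # same position over time
    causal = torch.tril(torch.ones(n, n, dtype=torch.bool))
    return mask & causal
```

Under a pattern like this, the number of keys each token attends to grows roughly with width + height + frames rather than with the full frames × height × width sequence length, which is what makes autoregressive generation over long token sequences tractable.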

In empirical evaluations, GODIVA surpasses several baselines on both quantitative metrics (such as SIM and RM) and qualitative assessments. Its performance is particularly notable in zero-shot settings, where it generates coherent, textually accurate videos without domain-specific retraining.
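
This summary does not reproduce the RM formula, but a plausible reading of "Relative Matching" is a text-video matching score for the generated video normalized by the score obtained for the ground-truth video. The sketch below follows that assumption; `match_score` is a hypothetical stand-in for whatever pretrained text-video matcher is used.

```python
# Hedged sketch of a Relative Matching (RM)-style score.
# Assumption: RM compares how well the generated video matches the text
# relative to how well the real (ground-truth) video matches the same text.
def relative_matching(match_score, text, generated_frames, real_frames):
    gen = match_score(text, generated_frames)  # alignment of the generation
    ref = match_score(text, real_frames)       # alignment of the ground truth
    return gen / max(ref, 1e-8)                # ~1.0: matched as well as the real video
```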

Discussion and Future Directions

The introduction of GODIVA highlights the importance of large-scale pretraining and architectural innovation in handling complex generation tasks like T2V. The model's reliance on efficient sparse attention mechanisms suggests a pathway for overcoming the computational barriers typical of high-dimensional tasks. Moreover, the Relative Matching metric could become a standard tool for future T2V evaluations, providing a nuanced gauge for semantic coherence.

Future research might expand upon GODIVA in several ways:

  • Resolution and Length: Enhancing the model to support high-resolution, long-duration videos would be a natural next step, likely involving new strategies for memory and computational efficiency.
  • Cross-modal Integration: Incorporating additional modalities (such as audio) could further broaden the scope and applicability of generated videos.
  • Refinement of Evaluation Metrics: Continued refinement of evaluation metrics can ensure more precise assessments of generative quality, essential for application across varied real-world settings.

In conclusion, GODIVA represents a significant advancement in the field of T2V generation, combining robust mathematical frameworks with large-scale language-video data to produce vivid and contextually appropriate visual content. Its contributions lay the groundwork for future exploration and innovation in automated video generation from textual descriptions.

Authors (8)
  1. Chenfei Wu (32 papers)
  2. Lun Huang (5 papers)
  3. Qianxi Zhang (6 papers)
  4. Binyang Li (10 papers)
  5. Lei Ji (33 papers)
  6. Fan Yang (877 papers)
  7. Guillermo Sapiro (101 papers)
  8. Nan Duan (172 papers)
Citations (192)