An In-depth Analysis of VidGen-1M: A Large-Scale Dataset for Text-to-Video Generation
The paper "VidGen-1M: A Large-Scale Dataset for Text-to-video Generation" presents a novel dataset designed to address the prevalent challenges in existing video-text datasets used for training text-to-video (T2V) models. By focusing on quality and consistency, VidGen-1M aims to elevate the training efficacy and output quality of T2V models.
Key Contributions
- High-Quality Dataset Creation: The authors introduce VidGen-1M, a dataset curated through a meticulous multi-stage process ensuring high-quality videos and captions.
- Coarse-to-Fine Curation Process: A three-stage method, comprising coarse curation, captioning, and fine curation, efficiently produces high-quality, balanced, and temporally consistent data (a high-level sketch follows this list).
- Superior Performance: Experimental results show that T2V models trained on VidGen-1M outperform those trained on existing datasets by a significant margin.
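As a rough mental model, the pipeline can be summarized as three composable stages. The helper names below (`split_scenes`, `tag_clip`, `clip_score`, and so on) are hypothetical placeholders for illustration, not the authors' released code, and the 0.25 threshold is an assumed value:

```python
# Illustrative outline of the three-stage curation pipeline; all helper
# functions and thresholds are hypothetical stand-ins, not the paper's code.

def coarse_curation(videos):
    """Split videos into single-scene clips, tag them, and sample a balanced subset."""
    clips = [clip for v in videos for clip in split_scenes(v)]
    tagged = [tag_clip(c) for c in clips]          # aesthetics, consistency, motion, category
    return sample_balanced(tagged)

def captioning(clips, clip_threshold=0.25):
    """Generate a descriptive synthetic caption per clip, then drop low-similarity pairs."""
    pairs = [(c, generate_caption(c)) for c in clips]
    return [(c, cap) for c, cap in pairs if clip_score(c, cap) >= clip_threshold]

def fine_curation(pairs):
    """Use an LLM to fix residual caption issues such as scene transitions or repetition."""
    return [(c, refine_caption_with_llm(cap)) for c, cap in pairs]

def build_vidgen_1m(raw_videos):
    return fine_curation(captioning(coarse_curation(raw_videos)))
```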
Dataset Construction and Methodology
Coarse Curation
The initial stage involves scene splitting, video tagging, and sampling, which reduces the computational load of subsequent stages. Using tools such as PySceneDetect for scene detection and the RAFT optical-flow model for motion estimation, the authors exclude temporally inconsistent videos, clips with poor aesthetics, and inappropriate scenes.
Scene Splitting
PySceneDetect is used to identify scene transitions so that videos can be split into single-scene clips, removing cuts and other temporal discontinuities that could impair training.
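As an illustration, a minimal scene-splitting step with PySceneDetect's modern API might look like the following; the detector threshold shown is the library default, not a value reported in the paper:

```python
# Minimal sketch of scene splitting with PySceneDetect (scenedetect >= 0.6 API).
from scenedetect import detect, ContentDetector
from scenedetect.video_splitter import split_video_ffmpeg

def split_into_clips(video_path: str, threshold: float = 27.0):
    """Detect cut points and write each single-scene clip to disk via ffmpeg."""
    scene_list = detect(video_path, ContentDetector(threshold=threshold))
    split_video_ffmpeg(video_path, scene_list)
    return scene_list  # list of (start, end) FrameTimecode pairs

clips = split_into_clips("example_video.mp4")
```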
Tagging
The dataset's videos are tagged for visual quality, temporal consistency, category balance, and motion level. The LAION Aesthetics model and CLIP are used to evaluate visual quality and temporal consistency, while RAFT provides motion estimates. Sampling on these tags yields a curated, balanced, and high-quality dataset.
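One plausible way to realize the motion tag is to average RAFT optical-flow magnitudes between consecutive frames. The sketch below uses torchvision's RAFT implementation; the exact checkpoint, frame sampling, and thresholds are assumptions, as the paper does not specify them:

```python
# Sketch of a motion-level tag computed from RAFT optical flow (torchvision RAFT).
import torch
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
raft = raft_large(weights=weights).eval()
preprocess = weights.transforms()

@torch.no_grad()
def motion_score(frames: torch.Tensor) -> float:
    """frames: [T, 3, H, W] uint8 tensor with H and W divisible by 8.

    Returns the mean optical-flow magnitude between consecutive frames,
    which can be thresholded to drop near-static or overly jittery clips.
    """
    prev, nxt = preprocess(frames[:-1], frames[1:])
    flow = raft(prev, nxt)[-1]             # final refinement: [T-1, 2, H, W]
    return flow.norm(dim=1).mean().item()  # average per-pixel displacement
```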
Captioning
The quality of video captions strongly influences T2V model performance. The authors use VILA to generate descriptive synthetic captions (DSC), yielding detailed and informative descriptions. The resulting text-video pairs are then vetted with CLIP scores, and pairs with low text-video similarity are removed.
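A CLIP-score filter of this kind can be sketched as follows, assuming frames are sampled from each clip; the checkpoint choice and the 0.25 threshold are illustrative assumptions, not values from the paper:

```python
# Hedged sketch of CLIP-based filtering of caption/clip pairs.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def clip_score(frames, caption: str) -> float:
    """frames: list of PIL images sampled from the clip.

    Returns the mean cosine similarity between the caption embedding and the
    frame embeddings. Long synthetic captions are truncated to CLIP's 77-token limit.
    """
    inputs = processor(text=[caption], images=frames, return_tensors="pt",
                       padding=True, truncation=True)
    out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).mean().item()

def keep_pair(frames, caption, threshold=0.25) -> bool:
    return clip_score(frames, caption) >= threshold
```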
Fine Curation
To resolve residual issues in the dataset, the fine curation stage leverages LLMs such as LLAMA3.1. This stage corrects caption errors, such as descriptions of scene transitions and repeated content, ensuring high-quality, temporally consistent, and accurately aligned text-video pairs.
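A hedged sketch of such an LLM-based caption check is shown below; the prompt wording and the specific Llama 3.1 checkpoint are assumptions rather than the authors' exact setup:

```python
# Illustrative LLM-based caption check; prompt and model choice are assumptions.
from transformers import pipeline

judge = pipeline("text-generation", model="meta-llama/Meta-Llama-3.1-8B-Instruct")

PROMPT = (
    "You are reviewing a caption for a single-shot video clip.\n"
    "Answer KEEP if the caption describes one continuous scene without repeated "
    "sentences, otherwise answer DROP.\n\nCaption: {caption}\nAnswer:"
)

def passes_fine_curation(caption: str) -> bool:
    out = judge(PROMPT.format(caption=caption), max_new_tokens=5, do_sample=False)
    verdict = out[0]["generated_text"].split("Answer:")[-1]
    return "KEEP" in verdict.upper()
```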
Experimental Evaluation
Implementation Details
The evaluation involves extensive pre-training on low-resolution videos, followed by joint training on high-resolution videos from VidGen-1M. The model is built from spatial and temporal attention blocks.
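For intuition, an interleaved spatial/temporal attention block of the kind referred to here can be sketched as follows; dimensions, normalization placement, and ordering are illustrative choices, not the paper's exact architecture:

```python
# Minimal sketch of an interleaved spatial/temporal attention block for video latents.
import torch
from torch import nn

class SpatioTemporalBlock(nn.Module):
    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1, self.norm2, self.norm3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """x: [batch, frames, tokens, dim] video latents."""
        b, t, n, d = x.shape
        # Spatial attention: tokens within each frame attend to one another.
        s = x.reshape(b * t, n, d)
        h = self.norm1(s)
        s = s + self.spatial_attn(h, h, h)[0]
        x = s.reshape(b, t, n, d)
        # Temporal attention: the same spatial location attends across frames.
        tseq = x.permute(0, 2, 1, 3).reshape(b * n, t, d)
        h = self.norm2(tseq)
        tseq = tseq + self.temporal_attn(h, h, h)[0]
        x = tseq.reshape(b, n, t, d).permute(0, 2, 1, 3)
        return x + self.mlp(self.norm3(x))
```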
Results
The qualitative evaluation showcases the model's ability to generate high-quality, photorealistic videos that adhere closely to their textual prompts. Notably, the generated videos exhibit strong temporal consistency and realism, reflecting the quality of the training data.
Table comparisons show that VidGen-1M's captions are substantially richer in content and vocabulary than those of previous datasets. This is reflected in metrics such as FVD on zero-shot UCF101, where the model trained on VidGen-1M outperforms state-of-the-art T2V models.
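As a toy illustration of how caption richness might be quantified, the snippet below computes average caption length and vocabulary size; these particular statistics are assumptions for illustration, not the paper's exact reporting:

```python
# Toy caption-richness statistics: average word count and vocabulary size.
from collections import Counter

def caption_stats(captions: list[str]) -> dict:
    tokens = [w.lower() for c in captions for w in c.split()]
    return {
        "num_captions": len(captions),
        "avg_words_per_caption": len(tokens) / max(len(captions), 1),
        "vocabulary_size": len(Counter(tokens)),
    }

print(caption_stats(["A red fox runs across a snowy field at dawn.",
                     "A chef slices vegetables on a wooden cutting board."]))
```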
Practical and Theoretical Implications
VidGen-1M's careful design and high-quality curation address the core issues found in existing video-text datasets. This leads to more effective T2V model training, resulting in better alignment between text and generated videos.
Practical Implications:
- Enhanced Training Efficiency: The dataset's high quality allows for more efficient training, potentially reducing the computational resources required.
- Improved Output Quality: Higher quality and temporally consistent training data lead to the generation of more realistic and coherent videos.
Theoretical Implications:
- Dataset Design Principles: VidGen-1M establishes a robust framework for curating large-scale, high-quality multi-modal datasets.
- Model Evaluation Standards: The dataset can serve as a benchmark for future T2V models, setting new standards for performance assessments.
Future Directions
VidGen-1M's release includes the dataset, associated code, and trained models, providing a valuable resource for the research community. Looking forward, the framework established for VidGen-1M could be adapted and expanded to other multi-modal datasets, potentially leading to advancements in related fields such as video understanding and generation.
In conclusion, VidGen-1M represents a significant advancement in the field of text-to-video generation, providing a high-quality dataset that addresses the limitations of existing datasets and sets new benchmarks for future research.