LVD-2M: A Long-take Video Dataset with Temporally Dense Captions

Published 14 Oct 2024 in cs.CV, cs.AI, and cs.LG | (2410.10816v1)

Abstract: The efficacy of video generation models heavily depends on the quality of their training datasets. Most previous video generation models are trained on short video clips, while recently there has been increasing interest in training long video generation models directly on longer videos. However, the lack of such high-quality long videos impedes the advancement of long video generation. To promote research in long video generation, we desire a new dataset with four key features essential for training long video generation models: (1) long videos covering at least 10 seconds, (2) long-take videos without cuts, (3) large motion and diverse contents, and (4) temporally dense captions. To achieve this, we introduce a new pipeline for selecting high-quality long-take videos and generating temporally dense captions. Specifically, we define a set of metrics to quantitatively assess video quality including scene cuts, dynamic degrees, and semantic-level quality, enabling us to filter high-quality long-take videos from a large amount of source videos. Subsequently, we develop a hierarchical video captioning pipeline to annotate long videos with temporally-dense captions. With this pipeline, we curate the first long-take video dataset, LVD-2M, comprising 2 million long-take videos, each covering more than 10 seconds and annotated with temporally dense captions. We further validate the effectiveness of LVD-2M by fine-tuning video generation models to generate long videos with dynamic motions. We believe our work will significantly contribute to future research in long video generation.


Authors (6)

Summary

The paper introduces LVD-2M to overcome the limitations of short video datasets by providing long-take videos with rich temporal captions.
It filters source videos with scene cut detection, optical flow estimation, and multimodal large language models (MLLMs) to keep high-quality, dynamic long-take content.
Hierarchical captioning with vision-language models (VLMs) and LLMs produces temporally dense narrative captions, and the dataset is validated through dataset statistics and human evaluation.

Analysis of LVD-2M: A Long-take Video Dataset with Temporally Dense Captions

The paper introduces a new dataset, LVD-2M, designed to advance long video generation by addressing the limitations of existing datasets. The authors argue that most video generation models are trained on short video clips, which do not provide the temporal consistency and dynamic motion needed to generate longer videos. LVD-2M is built to support the development of long video models by providing long-take videos annotated with temporally dense captions.

Dataset Composition and Methodology

LVD-2M consists of videos longer than 10 seconds without scene cuts, with large motion and diverse content. A key contribution is an automatic pipeline that filters high-quality long-take videos from a large pool of source videos and then captions them:

Video Filtering: The pipeline applies both low-level and semantic-level filtering. Scene cut detection and optical flow estimation are used to ensure the temporal consistency and dynamic motion of selected videos. Multimodal Large Language Models (MLLMs) are then used for semantic filtering, removing low-quality videos that lack diversity or contain extensive text overlays.
Hierarchical Video Captioning: The pipeline takes a hierarchical approach to generating temporally dense captions. Vision-language models (VLMs) first caption individual video segments to capture fine-grained temporal dynamics, and LLMs then refine and merge the segment captions into a single coherent narrative caption.
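
The two stages above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: simple frame differencing stands in for both the scene cut detector and the optical flow estimator, the MLLM-based semantic filter is omitted, and the thresholds, the 5-second clip length, and all function names are hypothetical. The captioning model and the merging LLM are passed in as plain callables so that any model could be plugged in.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a frame is a flattened grayscale image with pixel values in 0..255.
Frame = Sequence[int]


def frame_diffs(frames: Sequence[Frame]) -> List[float]:
    """Mean absolute pixel change between consecutive frames, scaled to [0, 1]."""
    diffs = []
    for prev, cur in zip(frames, frames[1:]):
        total = sum(abs(a - b) for a, b in zip(prev, cur))
        diffs.append(total / (255.0 * len(cur)))
    return diffs


def passes_low_level_filter(
    frames: Sequence[Frame],
    fps: float,
    min_seconds: float = 10.0,   # from the paper: videos cover at least 10 seconds
    cut_threshold: float = 0.3,  # hypothetical: a larger jump is treated as a scene cut
    min_motion: float = 0.01,    # hypothetical: a lower average change is "too static"
) -> bool:
    """Keep a video only if it is long enough, has no abrupt cut, and moves enough."""
    if len(frames) / fps < min_seconds:
        return False
    diffs = frame_diffs(frames)
    if not diffs:
        return False
    if max(diffs) > cut_threshold:  # crude stand-in for scene cut detection
        return False
    # Crude stand-in for an optical-flow-based dynamic degree score.
    return sum(diffs) / len(diffs) >= min_motion


def split_into_clips(num_frames: int, fps: float, clip_seconds: float) -> List[Tuple[int, int]]:
    """Half-open frame index ranges covering the video in consecutive clips."""
    step = max(1, int(round(fps * clip_seconds)))
    return [(start, min(start + step, num_frames)) for start in range(0, num_frames, step)]


def hierarchical_caption(
    frames: Sequence[Frame],
    fps: float,
    caption_clip: Callable[[Sequence[Frame]], str],   # stands in for a VLM
    merge_captions: Callable[[List[str]], str],       # stands in for an LLM
    clip_seconds: float = 5.0,                        # hypothetical clip length
) -> str:
    """Caption each short clip, then merge the clip captions into one narrative."""
    spans = split_into_clips(len(frames), fps, clip_seconds)
    clip_captions = [caption_clip(frames[a:b]) for a, b in spans]
    return merge_captions(clip_captions)
```

In this sketch a video is first passed through `passes_low_level_filter`, and only surviving videos reach `hierarchical_caption`. Keeping the two model calls behind callables keeps the control flow (split, caption per clip, merge) separate from any particular model choice.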

Evaluation and Comparisons

The paper compares LVD-2M with existing datasets on dataset statistics. Its average video length and optical flow scores are substantially higher, and its captions are significantly longer than those in datasets such as WebVid. Human evaluations support these results: raters preferred LVD-2M videos for dynamic degree and long-take consistency.

Implications and Future Directions

LVD-2M has several implications for long video generation. Because the dataset better reflects the temporal dynamics of real-world footage, the authors posit that it could significantly improve existing models. In their experiments, fine-tuning video generation models on LVD-2M improved the generation of long videos with coherent temporal transitions and dynamic motion.

Looking ahead, datasets of this kind could support applications that require analysis and synthesis of extended video. Future work may refine the captioning methodology to improve the narrative coherence and contextual relevance of the annotations.

In summary, the construction and validation of LVD-2M mark a step forward in modeling long-range temporal dependencies in video generation. The dataset and its curation pipeline give subsequent research a foundation for models that understand and generate longer, more complex video sequences.
