ShareGPT4Video: Enhancing Video Understanding and Generation through Advanced Captioning Strategies
The paper "ShareGPT4Video: Improving Video Understanding and Generation with Better Captions" addresses the challenge of improving large video-language models (LVLMs) and text-to-video models (T2VMs) through superior captioning methodologies. Its core contribution is a novel dataset and model series built around high-quality, detailed video captions.
Contribution and Dataset
The paper introduces the ShareGPT4Video series, which encompasses:
- ShareGPT4Video: A dataset of 40,000 dense captions generated with GPT4V for videos of diverse lengths and sources, curated with careful filtering and annotation strategies.
- ShareCaptioner-Video: An efficient and capable video captioning model, used to annotate 4.8M videos with high-quality captions.
- ShareGPT4Video-8B: A state-of-the-art (SOTA) LVLM that demonstrates impressive performance across multiple video benchmarks.
Differential Sliding-window Captioning Strategy (DiffSW)
Creating detailed video captions at scale is non-trivial. Traditional approaches that feed multi-frame or frame-concatenation inputs to GPT4V yield captions that are often temporally disjointed and lacking in detail. To overcome these limitations, the paper presents a Differential Sliding-window Captioning (DiffSW) strategy that focuses on three core aspects:
- Inter-frame precise temporal change understanding.
- Intra-frame detailed content description.
- Frame-number scalability for arbitrary-length videos.
DiffSW generates detailed captions for sequential keyframes in a differential manner: GPT4V is given two adjacent frames and asked to describe what has changed between them, which preserves both temporal order and content detail.
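To make the sliding-window idea concrete, here is a minimal sketch of such a differential captioning loop against an OpenAI-style chat API. The model name, the wording of `DIFF_PROMPT`, and the base64 helper are illustrative assumptions rather than the authors' implementation, and the full pipeline would also produce a standalone caption for the first keyframe, which this sketch omits.

```python
# Minimal sketch of a DiffSW-style captioning loop. The model name and prompt
# wording are illustrative assumptions, not taken from the paper.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

DIFF_PROMPT = (
    "You are given the previous keyframe and the current keyframe of a video. "
    "Describe in detail what has changed between them: camera motion, object "
    "motion, appearing or disappearing objects, and any other temporal change."
)

def encode_image(path: str) -> str:
    """Base64-encode a keyframe so it can be sent as an image_url payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def diff_caption(prev_frame: str, curr_frame: str) -> str:
    """Ask a GPT-4V-class model to describe the change between two adjacent keyframes."""
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder for a GPT-4V-class vision model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": DIFF_PROMPT},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(prev_frame)}"}},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{encode_image(curr_frame)}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def diffsw_captions(keyframes: list[str]) -> list[str]:
    """Slide a two-frame window over the keyframes and collect differential captions."""
    return [diff_caption(prev, curr) for prev, curr in zip(keyframes, keyframes[1:])]
```

Because each call only sees two frames plus a short prompt, the cost per window stays constant, which is what gives the strategy its frame-number scalability.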
Methodology and Data Processing
The ShareGPT4Video dataset is constructed through meticulous source selection and filtering:
- Data Collection: The dataset sources diverse content from platforms like Panda-70M, Pexels, Pixabay, MixKit, Ego4D, and BDD100K, focusing on aesthetic quality and content complexity.
- Semantic-Based Data Filtering: Reduces content redundancy and maintains diversity by ensuring that selected video candidates vary substantially in theme.
- Semantic-aware Key-Frame Extraction: A CLIP-Large image encoder guides sparse keyframe sampling so that the selected frames still capture the crucial semantic changes (a minimal sketch of both embedding-based steps follows this list).
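The sketch below illustrates one plausible way to realize these two embedding-based steps with a CLIP-Large image encoder via Hugging Face `transformers`. The similarity thresholds, the adjacent-frame differencing rule, and the mean-pooled video-level embedding are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of semantic-aware keyframe extraction and semantic-based filtering with
# CLIP-Large. Thresholds and pooling choices are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def frame_embedding(frame: Image.Image) -> torch.Tensor:
    """L2-normalized CLIP image embedding for a single frame."""
    inputs = processor(images=frame, return_tensors="pt")
    feat = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feat, dim=-1).squeeze(0)

def select_keyframes(frames: list[Image.Image], sim_threshold: float = 0.9) -> list[int]:
    """Keep a frame only when it differs enough (in CLIP space) from the last kept
    keyframe, so sparse keyframes still track the video's semantic changes."""
    keep = [0]
    last_emb = frame_embedding(frames[0])
    for i, frame in enumerate(frames[1:], start=1):
        emb = frame_embedding(frame)
        if torch.dot(emb, last_emb).item() < sim_threshold:
            keep.append(i)
            last_emb = emb
    return keep

def is_redundant(candidate_emb: torch.Tensor,
                 selected_embs: list[torch.Tensor],
                 sim_threshold: float = 0.85) -> bool:
    """Semantic-based filtering sketch: reject a candidate video whose L2-normalized
    video-level embedding (e.g. a normalized mean of frame embeddings) is too close
    to any video already kept, preserving thematic diversity."""
    return any(torch.dot(candidate_emb, emb).item() > sim_threshold for emb in selected_embs)
```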
Captioning Pipeline
The DiffSW captioning pipeline feeds paired keyframes to GPT4V together with a Differential Prompt that highlights inter-frame changes. GPT-4 then compiles the resulting differential captions into a comprehensive temporal narrative, retaining the rich temporal and spatial information needed for advanced LVLM training; a sketch of this compilation step follows.
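The following is a minimal sketch of how the ordered differential captions might be merged into one narrative with a GPT-4-class text model. The summary prompt wording and model name are assumptions, and `diffsw_captions` refers to the earlier DiffSW sketch.

```python
# Sketch of the caption-compilation step. Prompt wording and model name are
# illustrative assumptions; `diffsw_captions` is the earlier DiffSW sketch.
from openai import OpenAI

client = OpenAI()

SUMMARY_PROMPT = (
    "Below are ordered descriptions of the changes between consecutive keyframes "
    "of one video. Merge them into a single detailed caption that preserves the "
    "temporal order of events and all spatial details."
)

def compile_video_caption(differential_captions: list[str]) -> str:
    """Fuse ordered differential captions into one comprehensive video caption."""
    numbered = "\n".join(
        f"{i + 1}. {caption}" for i, caption in enumerate(differential_captions)
    )
    response = client.chat.completions.create(
        model="gpt-4",  # placeholder for a GPT-4-class text model
        messages=[{"role": "user", "content": f"{SUMMARY_PROMPT}\n\n{numbered}"}],
    )
    return response.choices[0].message.content
```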
Experimental Results
Video Understanding: Trained with the ShareGPT4Video dataset, the ShareGPT4Video-8B model shows consistent improvements over existing LVLM architectures. On benchmarks such as VideoBench, MVBench, and TempCompass it achieves substantial gains, demonstrating the tighter video-language alignment enabled by detailed captions.
Video Generation: When the high-quality captions produced by ShareCaptioner-Video were used to train text-to-video models, generation quality improved markedly: the resulting models exhibited stronger semantic control and better temporal coherence than models trained on less detailed captions.
Implications and Future Directions
Practically, the ShareGPT4Video series provides a robust dataset and methodology for advancing video understanding and generation tasks. Theoretically, the differential captioning strategy highlights the importance of nuanced temporal understanding in video captioning and could inspire further research into similarly fine-grained approaches in other multi-modal learning contexts.
Future developments are likely to focus on incorporating additional modalities such as audio to further refine video captions and enhance model performance across a broader range of real-world applications. The research underscores the critical role of high-quality, detailed annotations in advancing the state of LVLMs and T2VMs.
In conclusion, "ShareGPT4Video: Improving Video Understanding and Generation with Better Captions" makes significant contributions to LVLM and T2VM research by presenting advanced methodologies for generating high-fidelity video captions, thereby enabling more sophisticated video understanding and generation models. The dataset and models introduced by this paper are expected to serve as pivotal resources for future advancements in multi-modal AI research.