LAVIC: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation
The paper presents LAVIC, a comprehensive video-centric multimodal dataset designed to foster the development of robust video-text representation models. As demand for models that integrate video and natural language has intensified, so has the need for large-scale, high-quality datasets that enable this integration. LAVIC addresses this gap with over 7 million videos, comprising around 234 million video clips, each annotated with textual descriptions generated primarily by LLMs.
Key Contributions
- Dataset Composition and Scale: LAVIC sets itself apart through its vast scale and detailed textual descriptions, encompassing 4.1 billion words across diverse contexts and content types. Previous datasets such as HowTo100M and WebVid10M fell short either in scale or in the quality of video-text alignment, a gap LAVIC directly addresses.
- Innovative Annotation Methodology: The dataset relies on a multi-scale, LLM-driven captioning approach to automatically generate video descriptions, ensuring high-quality video-text alignment at scale (a sketch of such a pipeline follows this list). This is particularly valuable given the limitations of the ASR-generated transcripts commonly used in existing datasets.
- Introduction of the ViCLIP Model: The work also introduces ViCLIP, a video-text representation learning model built on the Vision Transformer (ViT-L). It is trained with contrastive learning on the LAVIC dataset and demonstrates its efficacy through strong zero-shot action recognition and competitive video retrieval (a contrastive-training sketch also appears after this list).
- Practical Applications: Beyond standard tasks such as video retrieval and recognition, LAVIC and ViCLIP are well suited to producing interleaved video-text data for training video-centric dialogue systems, and to advancing video-to-text and text-to-video generation research.
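The multi-scale annotation idea can be illustrated with a small, hypothetical pipeline: an image captioner describes sparsely sampled frames, and an LLM then condenses those frame captions into a single clip-level description. The function names, prompt, and sampling stride below are illustrative assumptions, not the authors' actual implementation.

```python
from typing import Any, Callable, List

def describe_clip(
    frames: List[Any],
    caption_frame: Callable[[Any], str],  # placeholder: any image-captioning model
    summarize: Callable[[str], str],      # placeholder: any LLM completion call
    stride: int = 8,
) -> str:
    """Turn sparse per-frame captions into one clip-level description."""
    # Fine scale: caption every `stride`-th frame to keep the prompt short.
    frame_captions = [caption_frame(f) for f in frames[::stride]]
    # Coarse scale: ask the LLM to fuse the ordered frame captions into a
    # single, temporally coherent description of the whole clip.
    prompt = (
        "Captions of frames sampled in order from one video clip:\n"
        + "\n".join(f"- {c}" for c in frame_captions)
        + "\nWrite one fluent sentence describing the clip."
    )
    return summarize(prompt)
```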
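ViCLIP-style training pairs a video encoder with a text encoder and pulls matching video-text embeddings together with a symmetric contrastive (InfoNCE) objective. The sketch below shows only that objective in PyTorch; the encoders, batch size, embedding dimension, and temperature are stand-ins rather than the paper's exact configuration.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(video_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired (video, text) embeddings."""
    # L2-normalize so the dot product is cosine similarity.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    # Similarity matrix: logits[i, j] compares video i with text j.
    logits = v @ t.T / temperature
    # Matched pairs lie on the diagonal.
    targets = torch.arange(len(v), device=v.device)
    loss_v2t = F.cross_entropy(logits, targets)    # video -> text
    loss_t2v = F.cross_entropy(logits.T, targets)  # text -> video
    return 0.5 * (loss_v2t + loss_t2v)

# Usage with random stand-ins for encoder outputs (batch of 8, dim 768):
video_emb = torch.randn(8, 768)
text_emb = torch.randn(8, 768)
loss = clip_contrastive_loss(video_emb, text_emb)
```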
Numerical Outcomes and Performance
Trained on LAVIC, ViCLIP achieves notable zero-shot performance: 75.7%, 73.5%, and 66.4% top-1 accuracy on the K400, K600, and K700 action recognition benchmarks, respectively. These results indicate stronger generalization than other video CLIP variants, which is particularly significant for video understanding and retrieval tasks.
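Zero-shot action recognition with a CLIP-style model typically renders each class name as a text prompt, embeds the prompts once, and assigns each video to the class whose text embedding is most similar. The following sketch assumes precomputed embeddings from a ViCLIP-like model; the shapes and prompt wording are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(video_emb: torch.Tensor,      # (num_videos, dim)
                       class_text_emb: torch.Tensor  # (num_classes, dim)
                       ) -> torch.Tensor:
    """Return the predicted class index for each video."""
    v = F.normalize(video_emb, dim=-1)
    c = F.normalize(class_text_emb, dim=-1)
    similarity = v @ c.T            # cosine similarity to every class prompt
    return similarity.argmax(dim=-1)  # top-1 prediction per video

# Example: prompts such as "a video of a person riding a bike" would be encoded
# by the text tower; random tensors stand in for real embeddings here.
preds = zero_shot_classify(torch.randn(4, 768), torch.randn(400, 768))
```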
Implications and Future Directions
The implications of LAVIC extend beyond academic research into practical domains such as human-computer interaction, autonomous driving, and intelligent surveillance, where integrating video understanding into real-world applications holds substantial potential. The dataset's design and use also support advances in multimodal dialogue systems, pushing the boundaries of what AI can achieve in understanding and generating multimodal content.
Moreover, LAVIC's assembly and success hint at future trajectories in AI, where generating plausible multimodal narratives could become a hallmark of sophisticated AI systems. The interplay between visual data and language in LAVIC sets a precedent for future datasets, enabling more intuitive and contextually aware AI models.
In conclusion, LAVIC emerges as a significant resource for the AI research community, spotlighting the symbiosis between large-scale data and advanced learning models to drive the evolution of video-text comprehension and generation capabilities in AI.