Overview of InternVideo: General Video Foundation Models
The paper "InternVideo: General Video Foundation Models via Generative and Discriminative Learning" advances the research landscape in video foundation modeling by introducing InternVideo, a model leveraging both generative and discriminative self-supervised learning techniques. This work addresses the limitations prevalent in current vision foundation models, which predominantly focus on image-level pretraining, thereby failing to capture the dynamic nature of video content.
InternVideo adopts masked video modeling (generative) and video-language contrastive learning (discriminative) as its core pretraining objectives. With these, the model achieves significant improvements across a spectrum of video-related tasks, establishing state-of-the-art results on 39 diverse datasets spanning video action recognition and detection, video-language alignment, and open-world video applications, which highlights its broad applicability.
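To make the two objectives concrete, here is a minimal PyTorch sketch, not the paper's implementation: the function names, the 90% mask ratio, the temperature value, and the equal loss weighting are illustrative assumptions. The generative loss reconstructs masked spacetime patches; the discriminative loss is a symmetric InfoNCE over paired clips and captions.

```python
import torch
import torch.nn.functional as F

def random_tube_mask(B, T, P, mask_ratio=0.9, device="cpu"):
    # Tube masking: one spatial mask per clip, shared across all T frames.
    # Bernoulli sampling approximates the fixed-ratio masking used in VideoMAE.
    spatial = (torch.rand(B, P, device=device) < mask_ratio).float()  # (B, P)
    return spatial.unsqueeze(1).expand(B, T, P).reshape(B, T * P)     # (B, T*P)

def masked_video_modeling_loss(pred, target, mask):
    # Generative objective: MSE computed only on masked spacetime patches.
    per_patch = ((pred - target) ** 2).mean(dim=-1)   # (B, N)
    return (per_patch * mask).sum() / mask.sum().clamp(min=1.0)

def video_text_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # Discriminative objective: symmetric InfoNCE over paired clips and captions.
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = v @ t.T / temperature
    targets = torch.arange(len(v), device=v.device)
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets))

# Toy usage with random tensors standing in for real encoder/decoder outputs.
B, T, P, D = 2, 8, 196, 768
mask = random_tube_mask(B, T, P)                      # 1 = masked position
gen_loss = masked_video_modeling_loss(torch.randn(B, T * P, D), torch.randn(B, T * P, D), mask)
dis_loss = video_text_contrastive_loss(torch.randn(B, 512), torch.randn(B, 512))
total = gen_loss + dis_loss                           # equal weighting is an assumption
```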
Key Numerical Results and Claims
InternVideo reports top-1 accuracies of 91.1% on Kinetics-400 and 77.2% on Something-Something V2, underscoring its strength in video action recognition. Beyond action recognition, the model also outperforms prior methods in video retrieval and video question answering, as evidenced by its results on benchmarks such as MSR-VTT and MSVD.
Technical Approach
InternVideo employs a Unified Video Representation (UVR) paradigm that integrates masked autoencoders with multimodal contrastive learning. The generative branch builds on a scalable VideoMAE framework, while the multimodal branch extends image-pretrained backbones to video through local and global spatiotemporal modules; supervised action classification is then applied to further tune the video representations. Finally, cross-model attention aligns the features learned under the two distinct self-supervised paradigms, as sketched below.
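The following is a hedged sketch of the cross-model attention idea under stated assumptions: tokens from one branch query tokens from the other, and a residual connection preserves each branch's own features. The class name, dimensions, and residual design are illustrative choices, not the paper's code.

```python
import torch
import torch.nn as nn

class CrossModelAttention(nn.Module):
    """One direction of cross-attention between the two backbones' token features."""
    def __init__(self, dim=768, num_heads=12):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, query_tokens, context_tokens):
        # query_tokens: (B, Nq, dim) from one branch (e.g., masked-modeling encoder)
        # context_tokens: (B, Nk, dim) from the other branch (e.g., contrastive encoder)
        q = self.norm_q(query_tokens)
        kv = self.norm_kv(context_tokens)
        fused, _ = self.attn(q, kv, kv)
        return query_tokens + fused  # residual keeps the querying branch's features

# Usage: fuse the two branches' clip features into a unified representation.
mim_tokens = torch.randn(2, 196, 768)   # placeholder masked-modeling features
clip_tokens = torch.randn(2, 196, 768)  # placeholder contrastive features
unified = CrossModelAttention()(mim_tokens, clip_tokens)  # (2, 196, 768)
```

Running the module in both directions (each branch querying the other) is one natural way to harmonize the two feature spaces before a downstream task head.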
Implications and Future Developments
InternVideo contributes significantly to the field by setting a strong performance baseline across multiple domains of video understanding. This has practical implications for industries that depend on robust video analysis, such as surveillance, entertainment, and autonomous systems. Theoretically, it pushes the boundaries of multimodal learning and broadens the horizon for future research on video foundation models.
Future research might extend InternVideo's capabilities to long-term video tasks and higher-order cognitive challenges, such as anticipatory video understanding. Systematically coordinating multiple foundation models trained on varied modalities is another promising avenue, one that could further enhance model generality and adaptability in even broader contexts.
In conclusion, InternVideo represents a significant stride in video foundation modeling, notable for its training efficiency and versatility. Its success across an extensive array of datasets marks a pivotal contribution to the video understanding community, with potential long-term impact on both theoretical investigation and practical application.