Expanding Language-Image Pretrained Models for General Video Recognition
The paper "Expanding Language-Image Pretrained Models for General Video Recognition" presents an innovative approach to adapting language-image pretrained models for video recognition tasks, focusing on transferring the robust capabilities of language-image models to video understanding. The proposed method leverages the powerful representation abilities of large-scale language-image pairs, exemplified by models like CLIP and Florence, and effectively extends these to capture the temporal dynamics inherent in video data.
The primary challenge the paper addresses is that training language-video models from scratch is not computationally feasible at scale, owing to the substantial resource requirements of video-text pretraining. Instead, the authors introduce an adaptation mechanism that reuses existing pretrained language-image models and tailors them to the video domain. This adaptation rests on two key innovations: a cross-frame attention mechanism and a video-specific prompting scheme.
Methodology
The cross-frame attention mechanism is a central contribution of this work. It enables explicit information exchange across frames, allowing the model to capture long-range temporal dependencies without dense attention over the full video sequence. As a result, temporal modeling improves while computational cost stays low, reflected in a substantial reduction in FLOPs relative to prior video models such as Swin-L and ViViT-H.
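To make the idea concrete, the following is a minimal PyTorch-style sketch of cross-frame communication via per-frame message tokens. It illustrates the general technique rather than the authors' implementation; the class name CrossFrameAttention, the mean-pooled message tokens, and the residual fusion step are assumptions made for the example.

```python
import torch
import torch.nn as nn

class CrossFrameAttention(nn.Module):
    """Illustrative sketch: each frame contributes one message token; the
    message tokens attend to each other so information can flow across
    frames without full spatio-temporal attention over all patch tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.msg_proj = nn.Linear(dim, dim)   # frame summary -> message token
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.fuse = nn.Linear(dim, dim)       # fold messages back into frame tokens

    def forward(self, frame_tokens: torch.Tensor) -> torch.Tensor:
        # frame_tokens: (batch, num_frames, num_patches, dim)
        b, t, n, d = frame_tokens.shape
        # One message token per frame (here: mean-pooled patch tokens).
        msgs = self.msg_proj(frame_tokens.mean(dim=2))        # (b, t, d)
        # Messages exchange information across frames; attention cost is
        # O(t^2) rather than O((t * n)^2) for full spatio-temporal attention.
        msgs, _ = self.attn(msgs, msgs, msgs)                 # (b, t, d)
        # Broadcast each frame's updated message back to its patch tokens.
        return frame_tokens + self.fuse(msgs).unsqueeze(2)    # (b, t, n, d)

# Example: 2 clips, 8 frames, 7x7 patch tokens, 512-dim features.
x = torch.randn(2, 8, 49, 512)
y = CrossFrameAttention(dim=512)(x)   # same shape, now with cross-frame context
```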
The authors further introduce a video-specific prompting scheme that uses video content to generate discriminative textual prompts, improving the model's ability to classify actions and scenes in context. Injecting this video-specific context into the language-image framework notably strengthens the model's generalization, particularly in scenarios lacking abundant labeled data.
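Below is a rough sketch of how video-conditioned prompting could be wired up, assuming CLIP-style class-name embeddings as queries and per-frame video features as keys and values; the name VideoSpecificPrompt and the residual weight alpha are illustrative choices, not the paper's exact design.

```python
import torch
import torch.nn as nn

class VideoSpecificPrompt(nn.Module):
    """Illustrative sketch: class-name text embeddings query the video
    features via cross-attention, yielding prompts conditioned on the clip."""

    def __init__(self, dim: int, num_heads: int = 8, alpha: float = 0.1):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.alpha = alpha  # small residual weight keeps the original text embedding dominant

    def forward(self, text_emb: torch.Tensor, video_feats: torch.Tensor) -> torch.Tensor:
        # text_emb:    (batch, num_classes, dim)  -- class-name embeddings
        # video_feats: (batch, num_frames, dim)   -- per-frame video features
        ctx, _ = self.cross_attn(text_emb, video_feats, video_feats)
        return text_emb + self.alpha * ctx      # video-conditioned textual prompts
```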
Numerical Results
The empirical evaluation demonstrates the effectiveness of the proposed approach across fully-supervised, zero-shot, and few-shot settings. Under fully-supervised training, the method achieves 87.1% top-1 accuracy on Kinetics-400 while using 12 times fewer FLOPs than Swin-L and ViViT-H. In zero-shot experiments, it surpasses the prior state of the art by +7.6% and +14.9% under two widely adopted protocols, underscoring its strong transfer capabilities. The few-shot results are equally compelling, with improvements of +32.1% and +23.1% over previous methods when labeled data is extremely limited. Together, these results highlight the approach's generality across different video recognition tasks.
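For context on how such a model transfers zero-shot, the sketch below shows the standard CLIP-style scoring step: cosine similarity between a pooled video embedding and the text embeddings of candidate class names. The function and variable names are illustrative, not taken from the paper's codebase.

```python
import torch
import torch.nn.functional as F

def classify_zero_shot(video_emb: torch.Tensor, class_text_emb: torch.Tensor) -> torch.Tensor:
    """CLIP-style zero-shot scoring.
    video_emb:      (batch, dim) pooled video embeddings
    class_text_emb: (num_classes, dim) text embeddings of class names
    Returns predicted class indices of shape (batch,)."""
    v = F.normalize(video_emb, dim=-1)
    t = F.normalize(class_text_emb, dim=-1)
    logits = v @ t.T                 # cosine similarities: (batch, num_classes)
    return logits.argmax(dim=-1)     # highest-similarity class per video
```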
Implications and Future Directions
The implications of this research are multifaceted. Practically, the ability to adapt language-image models to video domains efficiently extends the applicability of these pretrained models beyond static image recognition to more complex temporal tasks. This approach could significantly enhance the scalability of deploying AI in fields reliant on video data, such as autonomous driving, video surveillance, and media content analysis.
Theoretically, the findings encourage further exploration of cross-modal transfer learning and the development of more robust frameworks that transfer across data modalities. The demonstrated success of video-specific prompting and cross-frame attention opens new avenues for training models that must balance computational efficiency with rich feature extraction.
Future developments may explore optimizing these models further for real-time video processing or extending the pretrained frameworks to handle multi-modal inputs beyond text and video, integrating audio and other sensory data streams. As AI continues to progress toward more sophisticated and integrated learning paradigms, such cross-modal approaches offer a promising direction for robust, scalable, and versatile AI applications.