Insights on Vision-Language Model Transfer for Video Recognition
The paper "Revisiting Classifier: Transferring Vision-LLMs for Video Recognition" presents an innovative approach by leveraging pretrained vision-LLMs for improved video classification performance. The authors focus on addressing the underutilization of linguistic components within current transfer learning paradigms, specifically for video recognition tasks.
Summary of Methodology
The authors critically examine the role of the linear classifier, which is randomly initialized in standard vision-centric models, and propose a paradigm that replaces it with embeddings derived from a pretrained vision-language model such as CLIP. The semantically rich embeddings produced by language-supervised pretraining improve both the accuracy and the efficiency of video model training. Concretely, the pretrained textual embeddings of the class names serve as a fixed classifier, so the decision layer is semantically informed from the start rather than learned from scratch.
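To make the paradigm concrete, here is a minimal PyTorch sketch of a video classifier whose final layer is built from frozen CLIP text embeddings rather than random weights. It assumes the open-source `clip` package from OpenAI; the `TextEmbeddingClassifier` name, the prompt template, and the `video_backbone` argument are illustrative placeholders, not the authors' exact implementation.

```python
# Sketch: use frozen CLIP text embeddings of the class names as the classifier,
# instead of a randomly initialized linear head. Assumes the open-source `clip`
# package (https://github.com/openai/CLIP); the video backbone is a stand-in
# whose output dimension must match CLIP's text embedding size (512 for ViT-B/32).
import torch
import torch.nn as nn
import clip


class TextEmbeddingClassifier(nn.Module):
    """Video classifier whose decision layer is a fixed matrix of text embeddings."""

    def __init__(self, class_names, video_backbone,
                 prompt="a video of a person {}"):
        super().__init__()
        self.backbone = video_backbone  # any module mapping clips -> (B, 512)

        # Encode the class names once with the pretrained text encoder.
        clip_model, _ = clip.load("ViT-B/32", device="cpu")
        with torch.no_grad():
            tokens = clip.tokenize([prompt.format(name) for name in class_names])
            text_emb = clip_model.encode_text(tokens).float()
            text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)

        # Keep the embeddings as a fixed (non-trainable) classifier weight matrix.
        self.register_buffer("classifier", text_emb)          # (num_classes, 512)
        self.logit_scale = nn.Parameter(torch.tensor(2.659))  # log(1/0.07), CLIP-style temperature

    def forward(self, video_clips):
        feats = self.backbone(video_clips)                     # (B, 512)
        feats = feats / feats.norm(dim=-1, keepdim=True)
        # Cosine similarity between video features and class-name embeddings.
        return self.logit_scale.exp() * feats @ self.classifier.t()
```

The key design choice is that classes with semantically related names start with correlated decision boundaries, which is what allows the head to stay frozen, and even to be reused for unseen labels, rather than being learned from scratch.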
Empirical Results
The empirical results demonstrate substantial performance improvements. The proposed framework reaches a state-of-the-art 87.8% top-1 accuracy on Kinetics-400, a widely used benchmark in the field. In zero-shot and few-shot settings, the method surpasses existing approaches by 20% to 50% absolute top-1 accuracy across five video datasets. These results underscore the paradigm's effectiveness when labeled data are scarce, a substantial advance for the field.
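As an illustration of why this helps when labels are scarce, the hypothetical `TextEmbeddingClassifier` sketched above can be handed an entirely new label set without retraining a classification head; `my_backbone` and `video_batch` below are placeholders, not objects from the paper.

```python
# Hypothetical zero-shot usage: swap in an unseen dataset's class names.
# The classifier matrix is rebuilt from their text embeddings; no head is trained.
unseen_classes = ["archery", "juggling balls", "playing cello"]
zero_shot_model = TextEmbeddingClassifier(unseen_classes, video_backbone=my_backbone)

logits = zero_shot_model(video_batch)   # (B, len(unseen_classes))
predictions = logits.argmax(dim=-1)     # predicted class index per video
```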
Implications and Future Developments
The approach delineated in this paper has far-reaching implications. A semantically enriched classifier for video recognition paves the way for robust models that perform well under constrained data settings, with practical value in fields where data collection is difficult or expensive and annotations are limited. The reported improvement in training speed also suggests lower computational cost, making the methodology both greener and more efficient.
The fusion of vision and language in model training signals a shift toward more holistic representations across domains. Future work could refine initialization strategies for the embeddings or extend this transfer approach to other computer vision tasks, such as object detection or instance segmentation. A theoretical exploration of the complementary roles of vision and language in neural architectures could unlock further efficiency and performance gains.
By opening new paths for integrating pretrained language representations into vision tasks, this research offers insights and methods that push the boundaries of video recognition accuracy. In transfer learning especially, making full use of pretrained representations is the key to higher performance with less data.