An Analysis of Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting
The paper "Vita-CLIP: Video and Text Adaptive CLIP via Multimodal Prompting" explores an innovative approach to extending the capabilities of the Contrastive LanguageāImage Pre-training (CLIP) model to video classification tasks. The authors address the challenge of balancing the trade-off between supervised performance and zero-shot generalization, which has been a persistent issue in adapting image-text pretrained models like CLIP to the video domain.
The core contribution of the paper lies in the development of a multimodal prompting method that effectively balances the fine-tuning of the pretrained models for both supervised and zero-shot tasks without compromising their generalization abilities. This is achieved through a novel prompting strategy that operates on both the vision and text components of CLIP.
Technical Overview
The authors introduce a prompting scheme on the vision side comprising three elements: global video-level prompts, local frame-level prompts, and a summary prompt. The global prompts provide adaptability to new video data distributions, while the local prompts ensure frame-level discriminative information is captured. The summary prompt condenses information across frames, helping the model capture temporal dynamics.
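To make this concrete, the sketch below shows one way the three prompt types could be attached to the frame tokens of a frozen vision encoder. The dimensions, the number of prompts, and the choice to condition local prompts on per-frame [CLS] tokens via a linear projection are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of the vision-side prompting idea; shapes and names are assumptions.
import torch
import torch.nn as nn

class VisionPromptSketch(nn.Module):
    def __init__(self, dim=768, num_global=8):
        super().__init__()
        # Global video-level prompts: shared learnable tokens appended to the sequence.
        self.global_prompts = nn.Parameter(torch.randn(num_global, dim) * 0.02)
        # Local frame-level prompts: here conditioned on each frame's [CLS] token
        # through a small projection (an assumption for illustration).
        self.frame_proj = nn.Linear(dim, dim)
        # Summary prompt: a single token meant to aggregate information across frames.
        self.summary_prompt = nn.Parameter(torch.randn(1, dim) * 0.02)

    def forward(self, patch_tokens, frame_cls):
        # patch_tokens: (B, T * N, dim) patch embeddings from the frozen encoder
        # frame_cls:    (B, T, dim) per-frame [CLS] tokens
        B = patch_tokens.shape[0]
        local_prompts = self.frame_proj(frame_cls)               # (B, T, dim)
        global_prompts = self.global_prompts.expand(B, -1, -1)   # (B, G, dim)
        summary = self.summary_prompt.expand(B, -1, -1)          # (B, 1, dim)
        # Concatenate all prompts with the patch tokens before the next transformer block.
        return torch.cat([summary, global_prompts, local_prompts, patch_tokens], dim=1)

# Example with dummy tensors: 8 frames of 14x14 patches.
sketch = VisionPromptSketch()
tokens = torch.randn(2, 8 * 196, 768)
cls = torch.randn(2, 8, 768)
print(sketch(tokens, cls).shape)  # torch.Size([2, 1585, 768])
```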
On the text side, the authors propose learnable text context vectors in place of manually crafted prompt templates, allowing the text encoder to adapt better to new video tasks. This enhances the textual encoding, which is crucial for matching video representations with the relevant text descriptions.
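The text-side idea can be sketched in the spirit of CoOp-style prompt learning: shared, learnable context vectors are prepended to the token embeddings of each class name before they pass through the frozen text encoder. The context length and initialization below are assumptions for illustration.

```python
# Illustrative sketch of learnable text context vectors; not the paper's exact code.
import torch
import torch.nn as nn

class TextPromptSketch(nn.Module):
    def __init__(self, ctx_len=8, dim=512):
        super().__init__()
        # Learnable context replaces a hand-written template such as
        # "a video of a person doing {class}".
        self.context = nn.Parameter(torch.randn(ctx_len, dim) * 0.02)

    def forward(self, class_embeddings):
        # class_embeddings: (num_classes, L, dim) token embeddings of the class names
        K = class_embeddings.shape[0]
        ctx = self.context.expand(K, -1, -1)          # (K, ctx_len, dim)
        # Prepend the shared learnable context to every class-name embedding;
        # the result would then be fed to CLIP's frozen text encoder (not shown).
        return torch.cat([ctx, class_embeddings], dim=1)

prompts = TextPromptSketch()
class_emb = torch.randn(400, 16, 512)
print(prompts(class_emb).shape)  # torch.Size([400, 24, 512])
```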
Results
The model demonstrates state-of-the-art zero-shot performance on standard benchmarks such as Kinetics-600, HMDB51, and UCF101, with reported gains of roughly 2-4% over previous approaches. In supervised settings, Vita-CLIP remains competitive with models that fine-tune the entire CLIP backbone.
A notable aspect of Vita-CLIP is its efficiency. By keeping the CLIP backbone frozen and training only a minimal set of parameters (comprising the prompt learning modules), it achieves strong performance with significantly reduced computational overhead. This efficiency does not come at the cost of generalization, a typical pitfall in methods that fine-tune extensive parts of the model.
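In practice, this kind of parameter efficiency amounts to freezing the pretrained weights and leaving only the prompt parameters trainable. The helper below is a minimal illustration of that recipe using a stand-in module; it is not the authors' training code, and the keyword-based selection is an assumption.

```python
# Sketch of the parameter-efficiency recipe: freeze everything except prompt modules.
import torch
import torch.nn as nn

def freeze_backbone_keep_prompts(model: nn.Module, prompt_keywords=("prompt", "context")):
    # Disable gradients everywhere, then re-enable them only for prompt parameters.
    for name, param in model.named_parameters():
        param.requires_grad = any(k in name for k in prompt_keywords)
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    print(f"trainable: {trainable / 1e6:.3f}M of {total / 1e6:.3f}M "
          f"({100 * trainable / total:.1f}%)")

class Demo(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = nn.Linear(512, 512)                      # stands in for frozen CLIP
        self.video_prompt = nn.Parameter(torch.randn(8, 512))    # stands in for prompt tokens

freeze_backbone_keep_prompts(Demo())
```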
Practical and Theoretical Implications
Practically, Vita-CLIP offers a unified framework for video understanding tasks that necessitate both high supervised accuracy and robust zero-shot generalization. This makes it a versatile tool for applications where labeled video data might be scarce, yet large-scale deployment is required.
Theoretically, this work underscores the potential of prompt learning for multimodal tasks, extending the utility of pre-trained language-image models into more complex domains like video. It challenges conventional approaches that rely heavily on model fine-tuning, suggesting instead that strategic parameter addition through prompts can suffice to adapt models to new contexts efficiently.
Future Directions
The paper hints at several avenues for future work:
- Extension of the proposed framework to other multi-modal tasks beyond video classification, potentially including more complex scenarios like video retrieval or captioning.
- Exploration of more sophisticated prompt conditioning techniques, which could leverage additional contextual information or learned dynamics in the data.
In sum, Vita-CLIP introduces a compelling method for adapting powerful language-image models to video tasks, striking a balance between parameter efficiency and performance. This work not only advances the state of the art in video understanding but also provides a blueprint for similar adaptations in other domains.