Building an Open-Vocabulary Video CLIP Model with Better Architectures, Optimization and Data (2310.05010v1)

Published 8 Oct 2023 in cs.CV

Abstract: Despite the significant results achieved by Contrastive Language-Image Pretraining (CLIP) in zero-shot image recognition, limited effort has been made to explore its potential for zero-shot video recognition. This paper presents Open-VCLIP++, a simple yet effective framework that adapts CLIP into a strong zero-shot video classifier capable of identifying novel actions and events at test time. Open-VCLIP++ minimally modifies CLIP to capture spatial-temporal relationships in videos, creating a specialized video classifier while still striving for generalization. We formally demonstrate that training Open-VCLIP++ is tantamount to continual learning with zero historical data. To address this problem, we introduce Interpolated Weight Optimization, a technique that leverages the advantages of weight interpolation during both training and testing. Furthermore, we build upon large language models (LLMs) to produce fine-grained video descriptions. These detailed descriptions are then aligned with video features, facilitating a better transfer of CLIP to the video domain. Our approach is evaluated on three widely used action recognition datasets, following a variety of zero-shot evaluation protocols. The results demonstrate that our method surpasses existing state-of-the-art techniques by significant margins. Specifically, we achieve zero-shot accuracy scores of 88.1%, 58.7%, and 81.2% on the UCF, HMDB, and Kinetics-600 datasets respectively, outpacing the best-performing alternative methods by 8.5%, 8.2%, and 12.3%. We also evaluate our approach on the MSR-VTT video-text retrieval dataset, where it delivers competitive video-to-text and text-to-video retrieval performance while using substantially less fine-tuning data than other methods. Code is released at https://github.com/wengzejia1/Open-VCLIP.
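
The abstract names weight interpolation as the core of Interpolated Weight Optimization but does not spell out the procedure. As a rough illustration only, the sketch below shows generic linear interpolation between a pretrained set of weights and a fine-tuned one, which is the basic operation such methods build on. It is not the authors' implementation: the function name `interpolate_weights`, the mixing coefficient `lam`, and the toy parameter dictionaries are all assumptions made for this example, and plain Python lists stand in for real model tensors.

```python
from typing import Dict, List

# Toy stand-in for a model state: parameter name -> flat list of values.
Weights = Dict[str, List[float]]


def interpolate_weights(pretrained: Weights, finetuned: Weights, lam: float) -> Weights:
    """Return (1 - lam) * pretrained + lam * finetuned, parameter by parameter.

    lam = 0 keeps the pretrained weights; lam = 1 keeps the fine-tuned ones.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if pretrained.keys() != finetuned.keys():
        raise ValueError("both weight sets must have the same parameter names")

    merged: Weights = {}
    for name, base in pretrained.items():
        tuned = finetuned[name]
        if len(base) != len(tuned):
            raise ValueError(f"shape mismatch for parameter {name!r}")
        merged[name] = [(1.0 - lam) * b + lam * t for b, t in zip(base, tuned)]
    return merged


# Hypothetical toy weights: "clip" plays the pretrained image model,
# "video" the same model after fine-tuning on video data.
clip = {"proj": [0.0, 2.0], "bias": [1.0]}
video = {"proj": [4.0, 6.0], "bias": [3.0]}

blend = interpolate_weights(clip, video, 0.5)
print(blend)
```

With `lam` between 0 and 1, the blended model sits on the line segment between the two solutions in weight space, which is the intuition behind trading specialization on video against the generalization of the original CLIP weights.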

Authors (7)

Zuxuan Wu
Zejia Weng
Wujian Peng
Xitong Yang
Ang Li
Larry S. Davis
Yu-Gang Jiang
