Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition (2403.01560v2)
Abstract: Building upon the impressive success of CLIP (Contrastive Language-Image Pretraining), recent pioneering works have proposed to adapt the powerful CLIP to video data, leading to efficient and effective video learners for open-vocabulary action recognition. Motivated by the fact that humans perform actions in diverse environments, our work delves into an intriguing question: Can CLIP-based video learners effectively generalize to video domains they have not encountered during training? To answer this, we establish a CROSS-domain Open-Vocabulary Action recognition benchmark named XOV-Action, and conduct a comprehensive evaluation of five state-of-the-art CLIP-based video learners under various types of domain gaps. The evaluation demonstrates that previous methods exhibit limited action recognition performance in unseen video domains, revealing potential challenges of the cross-domain open-vocabulary action recognition task. In this paper, we focus on one critical challenge of the task, namely scene bias, and accordingly contribute a novel scene-aware video-text alignment method. Our key idea is to push video representations away from scene-encoded text representations, aiming to learn scene-agnostic video representations for recognizing actions across domains. Extensive experiments demonstrate the effectiveness of our method. The benchmark and code will be available at https://github.com/KunyuLin/XOV-Action/.
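To make the key idea concrete, below is a minimal PyTorch sketch of what such a scene-aware alignment objective could look like. This is an illustration under assumptions, not the paper's actual implementation: the function name `scene_aware_alignment_loss`, the `scene_weight` hyperparameter, and the availability of a per-video scene description encoded by the CLIP text encoder are all hypothetical.

```python
import torch
import torch.nn.functional as F

def scene_aware_alignment_loss(video_emb, action_text_emb, scene_text_emb,
                               labels, temperature=0.07, scene_weight=0.5):
    """Hypothetical sketch of scene-aware video-text alignment.

    video_emb:       (B, D) video representations from a CLIP video learner
    action_text_emb: (C, D) text embeddings of the C action class names
    scene_text_emb:  (B, D) text embeddings describing each video's scene
    labels:          (B,)   ground-truth action class indices
    """
    video_emb = F.normalize(video_emb, dim=-1)
    action_text_emb = F.normalize(action_text_emb, dim=-1)
    scene_text_emb = F.normalize(scene_text_emb, dim=-1)

    # Standard CLIP-style alignment: classify each video against
    # the action class text embeddings via cosine similarity.
    logits = video_emb @ action_text_emb.t() / temperature
    align_loss = F.cross_entropy(logits, labels)

    # Scene-debiasing term: penalize similarity between each video and
    # the text embedding of its own scene, pushing the video features
    # toward scene-agnostic representations.
    debias_loss = (video_emb * scene_text_emb).sum(dim=-1).mean()

    return align_loss + scene_weight * debias_loss
```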
Authors: Kun-Yu Lin, Henghui Ding, Jiaming Zhou, Yi-Xing Peng, Zhilin Zhao, Chen Change Loy, Wei-Shi Zheng, Yu-Ming Tang