Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition (2403.01560v2)
Abstract: Building upon the impressive success of CLIP (Contrastive Language-Image Pretraining), recent pioneering works have proposed to adapt the powerful CLIP to video data, leading to efficient and effective video learners for open-vocabulary action recognition. Motivated by the fact that humans perform actions in diverse environments, our work delves into an intriguing question: Can CLIP-based video learners effectively generalize to video domains they have not encountered during training? To answer this, we establish a CROSS-domain Open-Vocabulary Action recognition benchmark named XOV-Action, and conduct a comprehensive evaluation of five state-of-the-art CLIP-based video learners under various types of domain gaps. The evaluation demonstrates that previous methods exhibit limited action recognition performance in unseen video domains, revealing potential challenges of the cross-domain open-vocabulary action recognition task. In this paper, we focus on one critical challenge of the task, namely scene bias, and accordingly contribute a novel scene-aware video-text alignment method. Our key idea is to push video representations away from scene-encoded text representations, aiming to learn scene-agnostic video representations for recognizing actions across domains. Extensive experiments demonstrate the effectiveness of our method. The benchmark and code will be available at https://github.com/KunyuLin/XOV-Action/.
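To make the key idea concrete, below is a minimal PyTorch sketch of what such a scene-aware alignment objective could look like. This is an illustration under assumptions, not the paper's actual implementation: the function name `scene_aware_alignment_loss`, the `scene_weight` hyperparameter, and the availability of a per-video scene description encoded by the CLIP text encoder are all hypothetical.

```python
import torch
import torch.nn.functional as F

def scene_aware_alignment_loss(video_emb, action_text_emb, scene_text_emb,
                               labels, temperature=0.07, scene_weight=0.5):
    """Hypothetical sketch of scene-aware video-text alignment.

    video_emb:       (B, D) video representations from a CLIP video learner
    action_text_emb: (C, D) text embeddings of the C action class names
    scene_text_emb:  (B, D) text embeddings describing each video's scene
    labels:          (B,)   ground-truth action class indices
    """
    video_emb = F.normalize(video_emb, dim=-1)
    action_text_emb = F.normalize(action_text_emb, dim=-1)
    scene_text_emb = F.normalize(scene_text_emb, dim=-1)

    # Standard CLIP-style alignment: classify each video against
    # the action class text embeddings via cosine similarity.
    logits = video_emb @ action_text_emb.t() / temperature
    align_loss = F.cross_entropy(logits, labels)

    # Scene-debiasing term: penalize similarity between each video and
    # the text embedding of its own scene, pushing the video features
    # toward scene-agnostic representations.
    debias_loss = (video_emb * scene_text_emb).sum(dim=-1).mean()

    return align_loss + scene_weight * debias_loss
```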
Authors: Kun-Yu Lin, Henghui Ding, Jiaming Zhou, Yi-Xing Peng, Zhilin Zhao, Chen Change Loy, Wei-Shi Zheng, Yu-Ming Tang