Rethinking CLIP-based Video Learners in Cross-Domain Open-Vocabulary Action Recognition (2403.01560v2)

Published 3 Mar 2024 in cs.CV

Abstract: Building upon the impressive success of CLIP (Contrastive Language-Image Pretraining), recent pioneering works have proposed to adapt the powerful CLIP to video data, leading to efficient and effective video learners for open-vocabulary action recognition. Inspired by the fact that humans perform actions in diverse environments, our work delves into an intriguing question: Can CLIP-based video learners effectively generalize to video domains they have not encountered during training? To answer this, we establish a CROSS-domain Open-Vocabulary Action recognition benchmark named XOV-Action, and conduct a comprehensive evaluation of five state-of-the-art CLIP-based video learners under various types of domain gaps. The evaluation demonstrates that previous methods exhibit limited action recognition performance in unseen video domains, revealing potential challenges of the cross-domain open-vocabulary action recognition task. In this paper, we focus on one critical challenge of the task, namely scene bias, and accordingly contribute a novel scene-aware video-text alignment method. Our key idea is to push video representations apart from scene-encoded text representations, aiming to learn scene-agnostic video representations for recognizing actions across domains. Extensive experiments demonstrate the effectiveness of our method. The benchmark and code will be available at https://github.com/KunyuLin/XOV-Action/.
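
The key idea admits a compact illustration. Below is a minimal, hypothetical PyTorch sketch of scene-aware video-text alignment: video features are pulled toward their action-class text features via standard CLIP-style contrastive alignment, while being pushed away from text features that encode each video's scene, so the learned video representation becomes scene-agnostic. All names, the temperature, the loss weighting, and the specific repulsion term are illustrative assumptions, not the paper's exact formulation (see the authors' repository for the real objective).

```python
# Minimal sketch of scene-aware video-text alignment.
# Illustrative only; not the authors' exact objective. Assumes a CLIP-style
# setup where video and text features live in a shared embedding space.
import torch
import torch.nn.functional as F

def scene_aware_alignment_loss(video_feats,        # (B, D) video embeddings
                               action_text_feats,  # (C, D) action-prompt embeddings
                               scene_text_feats,   # (B, D) per-video scene-text embeddings
                               labels,             # (B,)  ground-truth action indices
                               tau=0.07,           # temperature (assumed value)
                               lam=0.5):           # repulsion weight (assumed value)
    v = F.normalize(video_feats, dim=-1)
    t = F.normalize(action_text_feats, dim=-1)
    s = F.normalize(scene_text_feats, dim=-1)

    # Standard alignment: classify each video against the action-class prompts.
    logits = v @ t.T / tau                          # (B, C) scaled cosine similarities
    align = F.cross_entropy(logits, labels)

    # Scene repulsion: penalize similarity between a video and the text
    # encoding of its own scene, discouraging scene-based shortcuts.
    repel = (v * s).sum(dim=-1).clamp(min=0).mean()

    return align + lam * repel

# Example with random tensors (B=4 videos, C=10 actions, D=512 dims).
B, C, D = 4, 10, 512
loss = scene_aware_alignment_loss(torch.randn(B, D), torch.randn(C, D),
                                  torch.randn(B, D),
                                  torch.randint(0, C, (B,)))
```

The hinge on cosine similarity is just one simple way to realize the repulsion; other choices, such as folding the scene text in as a negative inside the contrastive softmax, would serve the same scene-debiasing purpose.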

Authors (8)
  1. Kun-Yu Lin (24 papers)
  2. Henghui Ding (87 papers)
  3. Jiaming Zhou (41 papers)
  4. Yi-Xing Peng (9 papers)
  5. Zhilin Zhao (12 papers)
  6. Chen Change Loy (288 papers)
  7. Wei-Shi Zheng (148 papers)
  8. Yu-Ming Tang (11 papers)
Citations (8)