M2-CLIP: A Multimodal, Multi-task Adapting Framework for Video Action Recognition (2401.11649v1)

Published 22 Jan 2024 in cs.CV

Abstract: Recently, the rise of large-scale vision-language pretrained models like CLIP, coupled with the technology of Parameter-Efficient Fine-Tuning (PEFT), has attracted substantial attention in video action recognition. Nevertheless, prevailing approaches tend to prioritize strong supervised performance at the expense of compromising the models' generalization capabilities during transfer. In this paper, we introduce a novel Multimodal, Multi-task CLIP adapting framework named M2-CLIP to address these challenges, preserving both high supervised performance and robust transferability. Firstly, to enhance the individual modality architectures, we introduce multimodal adapters to both the visual and text branches. Specifically, we design a novel visual TED-Adapter that performs global Temporal Enhancement and local temporal Difference modeling to improve the temporal representation capabilities of the visual encoder. Moreover, we adopt text encoder adapters to strengthen the learning of semantic label information. Secondly, we design a multi-task decoder with a rich set of supervisory signals to adeptly satisfy the need for strong supervised performance and generalization within a multimodal framework. Experimental results validate the efficacy of our approach, demonstrating exceptional performance in supervised learning while maintaining strong generalization in zero-shot scenarios.
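To make the TED-Adapter idea in the abstract more concrete, below is a minimal, hedged PyTorch sketch of an adapter that combines a global temporal-enhancement term with local temporal-difference modeling on per-frame patch tokens. The class name, bottleneck width, and the specific operators (mean pooling for the global term, a depthwise 1-D convolution over frame differences for the local term) are illustrative assumptions, not the authors' implementation.

```python
# Illustrative sketch only: a TED-Adapter-style module inserted into a video
# vision transformer block. Names and design details are hypothetical.
import torch
import torch.nn as nn


class TEDAdapter(nn.Module):
    def __init__(self, dim: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # standard adapter down-projection
        self.up = nn.Linear(bottleneck, dim)     # up-projection back to token dim
        self.act = nn.GELU()
        # Depthwise 1-D conv over the frame axis for local temporal modeling.
        self.diff_conv = nn.Conv1d(bottleneck, bottleneck, kernel_size=3,
                                   padding=1, groups=bottleneck)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) = batch, frames, patch tokens, channel dim
        B, T, N, _ = x.shape
        h = self.act(self.down(x))                      # (B, T, N, b)

        # Global Temporal Enhancement: add a clip-level summary (mean over
        # frames) so every frame token sees video-level context.
        h = h + h.mean(dim=1, keepdim=True)

        # Local temporal Difference: frame-to-frame differences (zero for the
        # first frame) processed by a depthwise conv to capture short-term motion.
        diff = torch.cat([torch.zeros_like(h[:, :1]),
                          h[:, 1:] - h[:, :-1]], dim=1)  # (B, T, N, b)
        d = diff.permute(0, 2, 3, 1).reshape(B * N, -1, T)   # (B*N, b, T)
        d = self.diff_conv(d).reshape(B, N, -1, T).permute(0, 3, 1, 2)
        h = h + d

        return x + self.up(h)                           # residual adapter output


# Example usage on dummy CLIP ViT-B/16-like tokens: 8 frames, 197 tokens, dim 768.
tokens = torch.randn(2, 8, 197, 768)
out = TEDAdapter(dim=768)(tokens)
assert out.shape == tokens.shape
```

The point of the sketch is the split the abstract describes: one cheap global term that injects clip-level context into every frame, plus one local term operating on frame differences, both wrapped in a bottleneck adapter so the frozen CLIP visual encoder is only lightly tuned.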

Authors (9)
  1. Mengmeng Wang (73 papers)
  2. Jiazheng Xing (12 papers)
  3. Boyuan Jiang (22 papers)
  4. Jun Chen (374 papers)
  5. Jianbiao Mei (19 papers)
  6. Xingxing Zuo (36 papers)
  7. Guang Dai (38 papers)
  8. Jingdong Wang (236 papers)
  9. Yong Liu (721 papers)
Citations (1)