DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval (2401.10588v1)
Abstract: Text-video retrieval is a critical multi-modal task to find the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully finetuning these models due to increasing model size continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) The visual encoder could only encode frame-level features and failed to extract global-level general video information. (2) Equipping the visual and text encoder with separated prompts failed to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ the shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we propose modeling video in a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to fully finetuning methods on MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at https://github.com/knightyxp/DGL
- Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 1728–1738.
- A CLIP-Hitchhiker’s Guide to Long Video Retrieval. arXiv preprint arXiv:2205.08508.
- Cross modal retrieval with querybank normalisation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5194–5205.
- Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097.
- X-pool: Cross-modal language-video attention for text-video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5006–5015.
- Activitynet: A large-scale video benchmark for human activity understanding. In 2015 IEEE conference on computer vision and pattern recognition (CVPR), 961–970. IEEE.
- Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, 2790–2799. PMLR.
- VoP: Text-Video Co-Operative Prompt Tuning for Cross-Modal Retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 6565–6574.
- Visual prompt tuning. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXIII, 709–727. Springer.
- Cross-modal adapter for text-video retrieval. arXiv preprint arXiv:2211.09623.
- Video-text as game players: Hierarchical banzhaf interaction for cross-modal representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2472–2482.
- Diffusionret: Generative text-video retrieval with diffusion model. arXiv preprint arXiv:2303.09867.
- Prompting visual-language models for efficient video understanding. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXXV, 105–124. Springer.
- Maple: Multi-modal prompt learning. arXiv preprint arXiv:2210.03117.
- Less is more: Clipbert for video-and-language learning via sparse sampling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 7331–7341.
- The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691.
- BLIP: Bootstrapping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. In ICML.
- Local-global context aware transformer for language-guided video segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. arXiv preprint arXiv:2110.07602.
- Ts2-net: Token shift and selection transformer for text-video retrieval. In Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XIV, 319–335. Springer.
- Token Mixing: Parameter-Efficient Transfer Learning from Image-Language to Video-Language. In Proceedings of the 37th AAAI Conference on Artificial Intelligence (AAAI).
- CLIP4Clip: An empirical study of CLIP for end to end video clip retrieval and captioning. Neurocomputing, 508: 293–304.
- X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, 638–647.
- St-adapter: Parameter-efficient image-to-video transfer learning. Advances in Neural Information Processing Systems, 35: 26462–26477.
- Learning transferable visual models from natural language supervision. In International conference on machine learning, 8748–8763. PMLR.
- A dataset for movie description. In Proceedings of the IEEE conference on computer vision and pattern recognition, 3202–3212.
- Videobert: A joint model for video and language representation learning. In Proceedings of the IEEE/CVF international conference on computer vision, 7464–7473.
- Vl-adapter: Parameter-efficient transfer learning for vision-and-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5227–5237.
- Disentangled representation learning for text-video retrieval. arXiv preprint arXiv:2203.07111.
- Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442.
- Vatex: A large-scale, high-quality multilingual dataset for video-and-language research. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 4581–4591.
- T2vlad: global-local sequence alignment for text-video retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 5079–5088.
- Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 10704–10713.
- Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, 5288–5296.
- CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment. arXiv preprint arXiv:2209.06430.
- Unified vision and language prompt learning. arXiv preprint arXiv:2210.07225.
- Multimodal video adapter for parameter efficient video text retrieval. arXiv preprint arXiv:2301.07868.
- Centerclip: Token clustering for efficient text-video retrieval. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, 970–981.
- Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 16816–16825.
- Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9): 2337–2348.
- Actbert: Learning global-local video-text representations. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 8746–8755.
- Xiangpeng Yang (3 papers)
- Linchao Zhu (78 papers)
- Xiaohan Wang (91 papers)
- Yi Yang (855 papers)