DGL: Dynamic Global-Local Prompt Tuning for Text-Video Retrieval (2401.10588v1)

Published 19 Jan 2024 in cs.CV

Abstract: Text-video retrieval is a critical multi-modal task that finds the most relevant video for a text query. Although pretrained models like CLIP have demonstrated impressive potential in this area, the rising cost of fully fine-tuning these models as they grow in size continues to pose a problem. To address this challenge, prompt tuning has emerged as an alternative. However, existing works still face two problems when adapting pretrained image-text models to downstream video-text tasks: (1) the visual encoder can only encode frame-level features and fails to extract global-level, general video information; (2) equipping the visual and text encoders with separate prompts fails to mitigate the visual-text modality gap. To this end, we propose DGL, a cross-modal Dynamic prompt tuning method with Global-Local video attention. In contrast to previous prompt tuning methods, we employ a shared latent space to generate local-level text and frame prompts that encourage inter-modal interaction. Furthermore, we model video with a global-local attention mechanism to capture global video information from the perspective of prompt tuning. Extensive experiments reveal that when only 0.67% of the parameters are tuned, our cross-modal prompt tuning strategy DGL outperforms or is comparable to full fine-tuning methods on the MSR-VTT, VATEX, LSMDC, and ActivityNet datasets. Code will be available at https://github.com/knightyxp/DGL
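The abstract highlights two mechanisms: prompts for both encoders generated from a single shared latent space, and a global-local attention in which learnable global tokens pool video-level information from frame-level tokens. The sketch below illustrates both ideas in PyTorch; all module names, dimensions, and the prompt count are hypothetical assumptions for illustration, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class SharedPromptGenerator(nn.Module):
    """Hypothetical sketch: one shared latent space produces both text-side
    and frame-side prompts, so the two modalities are steered by common
    parameters (encouraging inter-modal interaction)."""
    def __init__(self, latent_dim=256, text_dim=512, visual_dim=768, n_prompts=4):
        super().__init__()
        # Shared learnable latent prompts used by both modalities.
        self.latent = nn.Parameter(torch.randn(n_prompts, latent_dim))
        # Modality-specific projections out of the shared space.
        self.to_text = nn.Linear(latent_dim, text_dim)
        self.to_frame = nn.Linear(latent_dim, visual_dim)

    def forward(self):
        # Returns (n_prompts, text_dim) and (n_prompts, visual_dim) prompts
        # to prepend to the text and frame token sequences, respectively.
        return self.to_text(self.latent), self.to_frame(self.latent)


class GlobalLocalAttention(nn.Module):
    """Hypothetical sketch: learnable global tokens attend over all local
    frame tokens to pool video-level information, while the frame tokens
    retain local detail."""
    def __init__(self, dim=768, n_heads=8, n_global=1):
        super().__init__()
        self.global_tokens = nn.Parameter(torch.randn(n_global, dim))
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)

    def forward(self, frame_tokens):
        # frame_tokens: (batch, n_frames * n_patches, dim) local features.
        b = frame_tokens.size(0)
        queries = self.global_tokens.expand(b, -1, -1)  # (batch, n_global, dim)
        # Global queries aggregate over every local frame token.
        video_tokens, _ = self.attn(queries, frame_tokens, frame_tokens)
        return video_tokens, frame_tokens


# Usage: generate shared prompts, then pool a dummy clip of frame tokens.
gen = SharedPromptGenerator()
text_prompts, frame_prompts = gen()
pool = GlobalLocalAttention()
video_repr, local_repr = pool(torch.randn(2, 12 * 50, 768))
print(video_repr.shape)  # torch.Size([2, 1, 768])
```

Only the prompt parameters and the small projection/attention modules would be trained in such a scheme, which is consistent with the abstract's claim of tuning well under 1% of the model's parameters.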

Authors (4)
  1. Xiangpeng Yang (3 papers)
  2. Linchao Zhu (78 papers)
  3. Xiaohan Wang (91 papers)
  4. Yi Yang (855 papers)
Citations (11)