Progressive Multi-modal Conditional Prompt Tuning (2404.11864v2)

Published 18 Apr 2024 in cs.CV

Abstract: Pre-trained vision-language models (VLMs) have shown remarkable generalization capabilities via prompting, which leverages VLMs as knowledge bases to extract information beneficial for downstream tasks. However, existing methods primarily employ uni-modal prompting, which engages only a single modality branch and fails to adjust vision-language (V-L) features simultaneously. Additionally, the one-pass forward pipeline in VLM encoding struggles to align V-L features that are separated by a large gap. Confronting these challenges, we propose a novel method, Progressive Multi-modal conditional Prompt Tuning (ProMPT). ProMPT exploits a recurrent structure, optimizing and aligning V-L features by iteratively utilizing the image and the current encoding information. It comprises an initialization module and a multi-modal iterative evolution (MIE) module. Initialization is responsible for encoding the image and text using a VLM, followed by a feature filter that selects the text features most similar to the image. MIE then facilitates multi-modal prompting through class-conditional vision prompting, instance-conditional text prompting, and feature filtering. In each MIE iteration, vision prompts are obtained from the filtered text features via a vision generator, encouraging image features to focus more on the target object during vision prompting. The encoded image features are then fed into a text generator to produce text prompts that are more robust to class shifts. Thus, V-L features are progressively aligned, enabling the prediction to advance from coarse to exact. Extensive experiments are conducted in three settings to evaluate the efficacy of ProMPT. The results indicate that ProMPT outperforms existing methods on average across all settings, demonstrating its superior generalization and robustness. Code is available at https://github.com/qiuxiaoyu9954/ProMPT.
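The recurrent pipeline described above can be summarized in a short sketch: initialize from plain VLM features, filter the text features by similarity to the image, then alternate class-conditional vision prompting and instance-conditional text prompting for a few iterations. The snippet below is a minimal, self-contained illustration under assumptions: the feature dimension, iteration count, linear prompt generators, and the placeholder "re-encoders" standing in for the frozen VLM encoders are all hypothetical choices for illustration, not the authors' released implementation (see the linked repository for that).

```python
# Minimal sketch of a ProMPT-style recurrent prompting loop. All module shapes
# and the placeholder re-encoders are illustrative assumptions, not the
# authors' code (https://github.com/qiuxiaoyu9954/ProMPT).
import torch
import torch.nn as nn
import torch.nn.functional as F


class ProMPTSketch(nn.Module):
    def __init__(self, dim=512, top_k=16, iterations=3):
        super().__init__()
        self.top_k, self.iterations = top_k, iterations
        # Hypothetical generators: filtered text features -> vision prompts,
        # encoded image features -> text prompts.
        self.vision_generator = nn.Linear(dim, dim)
        self.text_generator = nn.Linear(dim, dim)
        # Placeholder "prompted" re-encoders that fuse features with prompts;
        # in the paper this role is played by the frozen VLM encoders.
        self.image_reencoder = nn.Linear(2 * dim, dim)
        self.text_reencoder = nn.Linear(2 * dim, dim)

    def filter_text(self, image_feat, text_feats):
        # Feature filter: keep the class text features most similar to the image.
        sims = F.cosine_similarity(image_feat.unsqueeze(0), text_feats, dim=-1)
        idx = sims.topk(min(self.top_k, text_feats.size(0))).indices
        return text_feats[idx]

    def forward(self, image_feat, text_feats):
        # Initialization: start from plain VLM features, then filter once.
        filtered = self.filter_text(image_feat, text_feats)
        for _ in range(self.iterations):
            # Class-conditional vision prompting: filtered text -> vision prompt.
            v_prompt = self.vision_generator(filtered).mean(dim=0)
            image_feat = self.image_reencoder(torch.cat([image_feat, v_prompt]))
            # Instance-conditional text prompting: image feature -> text prompt.
            t_prompt = self.text_generator(image_feat).expand_as(text_feats)
            text_feats = self.text_reencoder(torch.cat([text_feats, t_prompt], dim=-1))
            # Re-filter so the next iteration conditions on refined features.
            filtered = self.filter_text(image_feat, text_feats)
        # Coarse-to-exact prediction via cosine similarity over all classes.
        return F.cosine_similarity(image_feat.unsqueeze(0), text_feats, dim=-1)


# Toy usage with random CLIP-like features: one image, 100 candidate classes.
model = ProMPTSketch()
logits = model(torch.randn(512), torch.randn(100, 512))
print(logits.argmax().item())
```

In this sketch, each iteration re-filters the text features so that the next round of prompts is conditioned on progressively better-aligned V-L features, mirroring the coarse-to-exact progression described in the abstract.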

Authors (5)
  1. Xiaoyu Qiu (5 papers)
  2. Hao Feng (83 papers)
  3. Yuechen Wang (9 papers)
  4. Wengang Zhou (153 papers)
  5. Houqiang Li (236 papers)