
APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning (2401.06827v2)

Published 12 Jan 2024 in cs.CV, cs.AI, and cs.CL

Abstract: Pre-trained Vision-Language (V-L) models set the benchmark for generalization to downstream tasks. Existing research has explored many characteristics of these models, including their sensitivity to text input and the difficulty of tuning prompts across multiple modalities. Building on V-L models such as CLIP, recent approaches deploy learnable prompts instead of hand-crafted prompts to boost generalization performance and address these challenges. Inspired by layer-wise training, which is widely used in image fusion, we note that a sequential training process for adapting the different modality branches of CLIP efficiently improves generalization. To address the multi-modal prompting challenge, we propose Token-wise Adaptive for Multi-modal Prompt Learning (APLe), which tunes the prompts of both modalities, vision and language, as tokens in a sequential manner. APLe promotes prompt learning across both modalities and achieves generalization performance competitive with the state of the art. Notably, APLe shows robustness and favourable performance in prompt-length experiments, a clear advantage when adapting V-L models.
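The central idea in the abstract is tuning vision and language prompts as tokens in a sequential order, inspired by layer-wise training, while the V-L backbone stays frozen. The sketch below is a minimal, hypothetical illustration of that schedule in PyTorch; the module names, dimensions, gradient-masking trick, and training loop are assumptions for exposition only, not the authors' released implementation, and real prompt tokens would be prepended to the CLIP encoders' token sequences rather than added to pooled features.

```python
# Minimal sketch (assumption-laden): token-wise, sequential prompt tuning
# across the two modality branches of a frozen CLIP-like model.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedVLSketch(nn.Module):
    def __init__(self, embed_dim=512, n_prompt_tokens=4):
        super().__init__()
        # Stand-ins for the frozen text/image encoders of a V-L backbone.
        self.text_encoder = nn.Linear(embed_dim, embed_dim)
        self.image_encoder = nn.Linear(embed_dim, embed_dim)
        for p in self.parameters():
            p.requires_grad_(False)  # the backbone is never updated
        # Learnable prompt tokens, one set per modality branch.
        self.text_prompts = nn.Parameter(0.02 * torch.randn(n_prompt_tokens, embed_dim))
        self.vision_prompts = nn.Parameter(0.02 * torch.randn(n_prompt_tokens, embed_dim))

    def forward(self, text_feat, image_feat):
        # Inject (pooled) prompt tokens into each branch, then encode and normalize.
        t = self.text_encoder(text_feat + self.text_prompts.mean(dim=0))
        v = self.image_encoder(image_feat + self.vision_prompts.mean(dim=0))
        return F.normalize(t, dim=-1), F.normalize(v, dim=-1)

def train_token_wise(model, loader, steps_per_token=10, lr=2e-3, temperature=0.07):
    """Adapt one prompt token of one modality at a time, in sequence."""
    n_tokens = model.text_prompts.shape[0]
    for k in range(n_tokens):
        for prompts in (model.text_prompts, model.vision_prompts):
            mask = torch.zeros_like(prompts)
            mask[k] = 1.0                      # only token k of this branch moves
            opt = torch.optim.SGD([prompts], lr=lr)
            for step, (text_feat, image_feat) in enumerate(loader):
                if step >= steps_per_token:
                    break
                t, v = model(text_feat, image_feat)
                logits = t @ v.t() / temperature
                labels = torch.arange(logits.size(0))
                # Symmetric contrastive (CLIP-style) objective.
                loss = 0.5 * (F.cross_entropy(logits, labels) +
                              F.cross_entropy(logits.t(), labels))
                opt.zero_grad()
                loss.backward()
                prompts.grad *= mask           # keep all other prompt tokens fixed
                opt.step()
```

A toy loader of random paired features, e.g. `loader = [(torch.randn(8, 512), torch.randn(8, 512)) for _ in range(10)]`, is enough to run the sketch end to end with `train_token_wise(PromptedVLSketch(), loader)`.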

Authors (5)
  1. Guiming Cao (1 paper)
  2. Kaize Shi (9 papers)
  3. Hong Fu (6 papers)
  4. Huaiwen Zhang (9 papers)
  5. Guandong Xu (93 papers)
Citations (1)