APLe: Token-Wise Adaptive for Multi-Modal Prompt Learning (2401.06827v2)
Abstract: Pre-trained Vision-Language (V-L) models set the benchmark for generalization to downstream tasks among noteworthy contenders. Existing research has explored many characteristics of V-L models, including their sensitivity to text input and the tuning process across multi-modal prompts. Building on V-L models such as CLIP, recent approaches deploy learnable prompts instead of hand-crafted prompts to boost generalization performance and address these challenges. Inspired by layer-wise training, which is widely used in image fusion, we observe that a sequential training process that adapts the different modality branches of CLIP efficiently facilitates improved generalization. To address the multi-modal prompting challenge, we propose Token-wise Adaptive for Multi-modal Prompt Learning (APLe), which tunes the prompts of both modalities, vision and language, as tokens in a sequential manner. APLe addresses the challenges in V-L models to promote prompt learning across both modalities, achieving generalization performance competitive with the state of the art. Notably, APLe shows robustness and favourable performance in prompt-length experiments, giving it a clear advantage when adopting V-L models.
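The core idea described above, sequentially adapting learnable prompt tokens for each modality branch while the backbone stays frozen, can be sketched as a toy example. This is a minimal illustration, not the authors' implementation: the feature vectors, the alignment loss, and the finite-difference gradients are all stand-in assumptions for the frozen CLIP encoders and backpropagation used in practice.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 8  # toy embedding dimension

# Frozen encoder outputs for one image-text pair (stand-ins for CLIP features).
img_feat = rng.normal(size=D)
txt_feat = rng.normal(size=D)

# Learnable prompt tokens for each modality (hypothetical names, not from the paper).
vision_prompt = rng.normal(size=D) * 0.01
text_prompt = rng.normal(size=D) * 0.01

def alignment_loss(v_tok, t_tok):
    """Negative cosine similarity between the prompted features."""
    v = img_feat + v_tok
    t = txt_feat + t_tok
    return -float(v @ t) / (np.linalg.norm(v) * np.linalg.norm(t) + 1e-8)

def grad(f, x, eps=1e-5):
    """Numerical gradient of f at x (finite differences, for illustration only)."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

lr = 0.2
losses = [alignment_loss(vision_prompt, text_prompt)]
for step in range(30):
    # Sequential adaptation: update the language prompt with the vision
    # prompt held fixed, then update the vision prompt in turn.
    text_prompt -= lr * grad(lambda t: alignment_loss(vision_prompt, t), text_prompt)
    vision_prompt -= lr * grad(lambda v: alignment_loss(v, text_prompt), vision_prompt)
    losses.append(alignment_loss(vision_prompt, text_prompt))
```

After the loop, the alignment loss is lower than at initialization, illustrating that alternating (sequential) updates of the two modality prompts can jointly improve image-text alignment without touching the frozen features.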
Authors: Guiming Cao, Kaize Shi, Hong Fu, Huaiwen Zhang, Guandong Xu