Cross-Modal Adapter: Parameter-Efficient Transfer Learning Approach for Vision-Language Models (2404.12588v1)

Published 19 Apr 2024 in cs.CV and cs.LG

Abstract: Adapter-based parameter-efficient transfer learning has achieved promising results in vision-language models. Traditional adapter methods require training or fine-tuning, which is difficult when samples are insufficient or resources are limited. Some methods avoid training altogether by caching and retrieving image-modality features, but they overlook the text modality and the cross-modal cues it offers for parameter-efficient adaptation of vision-language models. This work introduces XMAdapter, a cross-modal parameter-efficient approach. XMAdapter builds cache models for both the text and image modalities and retrieves over bimodal visual-language information to gather cues for inference. By dynamically adjusting the affinity ratio, it achieves cross-modal fusion, decoupling the similarities of the two modalities to assess their respective contributions. It also mines hard samples based on differences in cross-modal affinity and adaptively adjusts their learning intensity to enhance model performance. Extensive experiments on benchmark datasets show that XMAdapter significantly outperforms previous adapter-based methods in accuracy, generalization, and efficiency.
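
The abstract does not include code, but the cache-and-retrieve mechanism it describes resembles a Tip-Adapter-style key-value cache extended to two modalities. Below is a minimal PyTorch sketch under stated assumptions: L2-normalized CLIP features, one-hot cache labels, and a fixed fusion ratio standing in for the paper's dynamically adjusted affinity ratio. All names (fused_cache_logits, hard_sample_weights, alpha, beta, ratio, gamma) and the exact weighting scheme are illustrative assumptions, not the paper's actual formulation.

```python
import torch

def fused_cache_logits(query, img_keys, txt_keys, cache_values,
                       zero_shot_logits, alpha=1.0, beta=5.5, ratio=0.5):
    """Dual-cache inference in the spirit of XMAdapter (hypothetical sketch).

    query:            (B, D) L2-normalized CLIP features of test images
    img_keys:         (N, D) cached image features of the few-shot set
    txt_keys:         (N, D) cached text features paired with the same samples
    cache_values:     (N, C) one-hot labels of the cached samples
    zero_shot_logits: (B, C) CLIP zero-shot logits for the test batch
    """
    img_aff = query @ img_keys.t()   # (B, N) image-modality affinity
    txt_aff = query @ txt_keys.t()   # (B, N) cross-modal (image-to-text) affinity
    # Fuse the two affinities; `ratio` stands in for the paper's dynamically
    # adjusted affinity ratio (fixed here for simplicity).
    fused = ratio * img_aff + (1.0 - ratio) * txt_aff
    # Tip-Adapter-style sharpened kNN vote over the cached labels.
    cache_logits = torch.exp(-beta * (1.0 - fused)) @ cache_values
    return zero_shot_logits + alpha * cache_logits

def hard_sample_weights(img_aff, txt_aff, gamma=1.0):
    """Per-sample weights from cross-modal disagreement (assumed scheme):
    samples whose image and text affinities diverge most are treated as
    'hard' and up-weighted during adaptation."""
    disagreement = (img_aff - txt_aff).abs().mean(dim=1)  # (B,)
    return 1.0 + gamma * disagreement
```

The design intuition, as the abstract presents it, is that the image cache captures visual similarity to the few-shot set while the text cache contributes complementary cross-modal evidence; decoupling the two affinities before fusing them lets each modality's contribution be weighted, and their disagreement serves as a signal for which samples deserve more learning intensity.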
