Low-Rank Few-Shot Adaptation of Vision-Language Models

Published 28 May 2024 in cs.CV (arXiv:2405.18541v2)

Abstract: Recent progress in the few-shot adaptation of Vision-Language Models (VLMs) has further pushed their generalization capabilities, at the expense of just a few labeled samples within the target downstream task. However, this promising, already quite abundant few-shot literature has focused principally on prompt learning and, to a lesser extent, on adapters, overlooking the recent advances in Parameter-Efficient Fine-Tuning (PEFT). Furthermore, existing few-shot learning methods for VLMs often rely on heavy training procedures and/or carefully chosen, task-specific hyper-parameters, which might impede their applicability. In response, we introduce Low-Rank Adaptation (LoRA) in few-shot learning for VLMs, and show its potential on 11 datasets, in comparison to current state-of-the-art prompt- and adapter-based approaches. Surprisingly, our simple CLIP-LoRA method exhibits substantial improvements, while reducing the training times and keeping the same hyper-parameters in all the target tasks, i.e., across all the datasets and numbers of shots. Certainly, our surprising results do not dismiss the potential of prompt-learning and adapter-based research. However, we believe that our strong baseline could be used to evaluate progress in these emergent subjects in few-shot VLMs.


Summary

  • The paper applies Low-Rank Adaptation (LoRA) to few-shot learning in vision-language models, establishing a strong and simple baseline.
  • It shows that adapting both the vision and text encoders with low-rank matrices yields the largest gains, while merging the learned matrices into the pre-trained weights avoids any extra inference overhead.
  • Evaluation on 11 datasets, including ImageNet, demonstrates that CLIP-LoRA outperforms prompt- and adapter-based methods in accuracy while reducing training time and keeping hyper-parameters fixed.

Low-Rank Few-Shot Adaptation of Vision-Language Models

Introduction

The paper "Low-Rank Few-Shot Adaptation of Vision-LLMs" (2405.18541) introduces a novel approach to enhancing few-shot learning capabilities in Vision-LLMs (VLMs) using Low-Rank Adaptation (LoRA). Traditional few-shot learning methodologies have predominantly focused on prompt learning and, to a lesser extent, adapters, often requiring extensive computational resources and task-specific hyper-parameter tuning. In contrast, LoRA represents a more efficient alternative, offering significant performance improvements across diverse datasets while maintaining a consistent set of hyper-parameters. This simplification facilitates broader applicability of VLMs without extensive customization, positioning LoRA as a robust baseline for measuring advancements in few-shot learning strategies.

Parameter Efficient Fine-Tuning Strategies

The paper surveys the landscape of Parameter-Efficient Fine-Tuning (PEFT) methods, which alleviate the computational burden of tuning large-scale models by optimizing only a small subset of parameters. PEFT encompasses several families: selective tuning methods, adapters, prompt tuning, and low-rank adaptation methods such as LoRA.

Selective methods fine-tune a subset of the existing model weights, while adapter strategies insert trainable modules, often increasing inference latency. Prompt tuning optimizes input token representations, albeit at substantial computational overhead. LoRA instead introduces trainable low-rank matrices whose product augments the pre-trained weights. This approach not only reduces training complexity but also adds no inference overhead: once adaptation is complete, the low-rank matrices are merged into the original weights, eliminating the extra parameters (Figure 1).

Figure 1: Prompt, Adapter and Low-rank techniques introduce extra parameters for training, which may increase training time and/or memory footprint compared to selective methods. However, they are more flexible and often easier to use.
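
To make the mechanism concrete, below is a minimal PyTorch sketch of a LoRA-wrapped linear layer. The zero-initialization of B and the merge helper follow the general LoRA recipe; the class name, default rank, and scaling are illustrative assumptions, not the paper's exact settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W x + (alpha/r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 2, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # pre-trained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at step 0
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.t() @ self.B.t())

    @torch.no_grad()
    def merge(self) -> None:
        """Fold the low-rank update into the frozen weight: no extra inference cost remains."""
        self.base.weight += self.scaling * (self.B @ self.A)
```

After calling merge(), the adapted model runs at exactly the original model's cost, which is the key advantage over adapter- and prompt-based methods noted above.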

Few-Shot Learning for VLMs

The authors conducted an extensive evaluation of few-shot learning techniques across 11 datasets, including ImageNet and various fine-grained classification sets. The study compared LoRA-based adaptation to prompt- and adapter-based methods such as CoOp, CoCoOp, and Tip-Adapter-F. CLIP-LoRA outperformed these methods by significant margins on key datasets, achieving higher average accuracy across three visual backbones (ViT-B/16, ViT-B/32, and ViT-L/14).

The results underscore LoRA's potential to streamline few-shot training processes while simultaneously achieving state-of-the-art performance, particularly noteworthy on datasets like ImageNet and UCF101, which require effective cross-modal feature alignment (Figure 2).

Figure 2: Detailed few-shot learning results on the 10 fine-grained datasets and ImageNet with the ViT-B/16 visual backbone. Average performance for the ViT-B/16, ViT-B/32 and ViT-L/14 on the same 11 datasets is reported in the last three plots, respectively.
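
As a rough illustration of the few-shot objective typically used in this setting (a sketch, not the authors' released code), the step below fine-tunes the LoRA factors with cross-entropy between image embeddings and class text embeddings. It assumes the model exposes encode_image and encode_text methods, as in OpenAI's CLIP; the function name and temperature value are illustrative.

```python
import torch
import torch.nn.functional as F

def few_shot_step(clip_model, images, labels, class_prompt_tokens, optimizer, temperature=0.01):
    """One few-shot training step. Only the LoRA factors receive gradients,
    since the pre-trained weights were frozen when the modules were wrapped."""
    img = F.normalize(clip_model.encode_image(images), dim=-1)               # (batch, dim)
    txt = F.normalize(clip_model.encode_text(class_prompt_tokens), dim=-1)   # (num_classes, dim)
    logits = img @ txt.t() / temperature                                     # cosine similarities as logits
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because both towers are wrapped with LoRA, the text embeddings are recomputed each step rather than cached, so the text encoder adapts alongside the vision encoder.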

Design Considerations for LoRA

Critical design decisions guide the deployment of LoRA modules within VLMs: which encoders to adapt, which weight matrices to target, and what rank to assign the low-rank matrices. Empirical results suggest that adapting both the vision and text encoders, particularly their value and output projection matrices, provides the most substantial performance gains, as sketched below. The placement and parametrization of LoRA components are thus pivotal, suggesting further research opportunities in dynamically adjusting ranks and optimizing encoder interactions for specific tasks.
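
A hedged sketch of how such wrapping could be done, reusing the LoRALinear class from above. The target attribute names v_proj and out_proj match Hugging Face-style CLIP implementations; implementations with fused QKV projections (such as OpenAI's original code) would need a different hook. The helper name and defaults are assumptions for illustration.

```python
import torch.nn as nn

def apply_lora(model: nn.Module, target_names=("v_proj", "out_proj"), r=2, alpha=1.0):
    """Wrap every matching Linear submodule (e.g. attention value/output projections)
    in both the vision and text towers with a LoRALinear adapter."""
    targets = [
        (parent, name, child)
        for parent in model.modules()
        for name, child in parent.named_children()
        if isinstance(child, nn.Linear) and name in target_names
    ]
    for parent, name, child in targets:
        setattr(parent, name, LoRALinear(child, r=r, alpha=alpha))
    return model
```

Only the parameters left with requires_grad set to True (the A and B factors) would then be handed to the optimizer, keeping the trainable parameter count small.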

Conclusion

The paper presents CLIP-LoRA as a strong baseline in few-shot adaptation for VLMs, outperforming traditional methods while simplifying model training through fixed hyper-parameters. This approach offers practical implications for deploying VLMs in diverse application domains, facilitating efficient adaptation without compromising accuracy or requiring intricate manual configuration.

Future research should explore dynamically adapting the rank of the LoRA matrices and investigate further cross-modal tuning techniques, as both promise gains in robustness and generalization in few-shot learning scenarios.
