
LAMM: Label Alignment for Multi-Modal Prompt Learning (2312.08212v1)

Published 13 Dec 2023 in cs.CV

Abstract: With the success of pre-trained vision-language (VL) models such as CLIP on visual representation tasks, transferring pre-trained models to downstream tasks has become a crucial paradigm. Recently, the prompt tuning paradigm, which draws inspiration from NLP, has made significant progress in the VL field. However, preceding methods mainly focus on constructing prompt templates for text and visual inputs, neglecting the gap in class label representations between VL models and downstream tasks. To address this challenge, we introduce an innovative label alignment method named LAMM, which can dynamically adjust the category embeddings of downstream datasets through end-to-end training. Moreover, to achieve a more appropriate label distribution, we propose a hierarchical loss, encompassing alignment of the parameter space, feature space, and logits space. We conduct experiments on 11 downstream vision datasets and demonstrate that our method significantly improves the performance of existing multi-modal prompt learning models in few-shot scenarios, with an average accuracy improvement of 2.31% over state-of-the-art methods at 16 shots. Our method also outperforms other prompt tuning methods in continual learning. Importantly, it is complementary to existing prompt tuning methods and can boost their performance. Our code and dataset will be publicly available at https://github.com/gaojingsheng/LAMM.
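The abstract describes two ideas: replacing frozen class-name text features with trainable category embeddings, and a hierarchical loss that aligns the parameter, feature, and logits spaces. The sketch below shows how such label alignment could look on top of a CLIP-like dual encoder in PyTorch; the class, function, and weight names and the specific alignment penalties are illustrative assumptions, not the paper's exact formulation.

```python
# Hypothetical sketch of LAMM-style label alignment for a CLIP-like model.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LabelAlignment(nn.Module):
    """Trainable category embeddings, initialized from frozen class-name
    text features and tuned end-to-end (illustrative sketch)."""

    def __init__(self, init_text_features: torch.Tensor):
        super().__init__()
        # init_text_features: (num_classes, embed_dim), e.g. CLIP text
        # embeddings of "a photo of a {class}" prompts.
        self.label_embeddings = nn.Parameter(init_text_features.clone())

    def forward(self, image_features: torch.Tensor,
                temperature: float = 0.01) -> torch.Tensor:
        # Cosine-similarity logits between image features and the
        # learnable label embeddings.
        img = F.normalize(image_features, dim=-1)
        lbl = F.normalize(self.label_embeddings, dim=-1)
        return img @ lbl.t() / temperature


def hierarchical_loss(model: LabelAlignment,
                      logits: torch.Tensor,
                      targets: torch.Tensor,
                      frozen_text_features: torch.Tensor,
                      frozen_logits: torch.Tensor,
                      w_param: float = 1.0,
                      w_feat: float = 1.0,
                      w_logit: float = 1.0) -> torch.Tensor:
    # Downstream classification term.
    ce = F.cross_entropy(logits, targets)
    # Parameter-space alignment: keep tuned embeddings close to the
    # frozen text embeddings (an L2 penalty; the paper's form may differ).
    param_align = F.mse_loss(model.label_embeddings, frozen_text_features)
    # Feature-space alignment: cosine distance between tuned and frozen
    # label features.
    feat_align = 1.0 - F.cosine_similarity(
        model.label_embeddings, frozen_text_features, dim=-1).mean()
    # Logits-space alignment: distill toward the zero-shot CLIP logits.
    logit_align = F.kl_div(F.log_softmax(logits, dim=-1),
                           F.softmax(frozen_logits, dim=-1),
                           reduction="batchmean")
    return ce + w_param * param_align + w_feat * feat_align + w_logit * logit_align
```

In this reading, frozen_text_features and frozen_logits come from the original frozen CLIP text encoder, anchoring the tuned label embeddings to the pretrained representation while they adapt to the downstream classes.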

Authors (8)
  1. Jingsheng Gao (16 papers)
  2. Jiacheng Ruan (20 papers)
  3. Suncheng Xiang (27 papers)
  4. Zefang Yu (4 papers)
  5. Ke Ji (27 papers)
  6. Mingye Xie (10 papers)
  7. Ting Liu (329 papers)
  8. Yuzhuo Fu (24 papers)
Citations (10)