AMU-Tuning: Effective Logit Bias for CLIP-based Few-shot Learning (2404.08958v1)

Published 13 Apr 2024 in cs.CV, cs.CL, and cs.LG

Abstract: Recently, pre-trained vision-language models (e.g., CLIP) have shown great potential in few-shot learning and attracted a lot of research interest. Although efforts have been made to improve the few-shot ability of CLIP, the key factors behind the effectiveness of existing methods have not been well studied, limiting further exploration of CLIP's potential in few-shot learning. In this paper, we first introduce a unified formulation that analyzes CLIP-based few-shot learning methods from the perspective of logit bias, which encourages us to learn an effective logit bias for further improving the performance of CLIP-based few-shot learning methods. To this end, we disassemble the three key components involved in the computation of logit bias (i.e., logit features, logit predictor, and logit fusion) and empirically analyze their effect on few-shot classification performance. Based on this analysis of the key components, this paper proposes a novel AMU-Tuning method to learn an effective logit bias for CLIP-based few-shot classification. Specifically, AMU-Tuning predicts the logit bias by exploiting appropriate $\underline{\textbf{A}}$uxiliary features, which are fed into an efficient feature-initialized linear classifier with $\underline{\textbf{M}}$ulti-branch training. Finally, an $\underline{\textbf{U}}$ncertainty-based fusion is developed to incorporate the logit bias into CLIP for few-shot classification. Experiments are conducted on several widely used benchmarks, and the results show that AMU-Tuning clearly outperforms its counterparts while achieving state-of-the-art performance in CLIP-based few-shot learning without bells and whistles.

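The abstract frames CLIP-based few-shot methods as adding a learned logit bias to CLIP's zero-shot logits, where AMU-Tuning predicts that bias from auxiliary features and combines it with the CLIP logits through an uncertainty-based fusion. The snippet below is a minimal PyTorch sketch of that fused-logit view; the entropy-based confidence weight, the function name `fuse_logits`, and the tensor shapes are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def fuse_logits(clip_logits: torch.Tensor,
                aux_logits: torch.Tensor,
                temperature: float = 1.0) -> torch.Tensor:
    """Sketch of logit-bias fusion for CLIP-based few-shot classification.

    clip_logits: [B, C] zero-shot logits from CLIP image-text similarities.
    aux_logits:  [B, C] logit bias predicted by a classifier on auxiliary
                 features (the "A" and "M" components); assumed precomputed.
    """
    # Uncertainty proxy (assumption): entropy of CLIP's zero-shot prediction.
    probs = F.softmax(clip_logits / temperature, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1, keepdim=True)  # [B, 1]

    # Normalize entropy to [0, 1] by the maximum entropy log(C), so that the
    # auxiliary bias contributes more when CLIP is uncertain.
    num_classes = clip_logits.shape[-1]
    weight = entropy / torch.log(torch.tensor(float(num_classes)))

    return clip_logits + weight * aux_logits

# Example usage with random tensors (batch of 4, 100 classes):
# clip_logits = torch.randn(4, 100); aux_logits = torch.randn(4, 100)
# fused = fuse_logits(clip_logits, aux_logits)
```

In the paper, the auxiliary branch is a feature-initialized linear classifier trained with multiple branches; here it is abstracted into the precomputed `aux_logits` tensor for brevity.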
Authors (5)
  1. Yuwei Tang (1 paper)
  2. Zhenyi Lin (5 papers)
  3. Qilong Wang (34 papers)
  4. Pengfei Zhu (76 papers)
  5. Qinghua Hu (83 papers)
Citations (4)