
LP++: A Surprisingly Strong Linear Probe for Few-Shot CLIP (2404.02285v1)

Published 2 Apr 2024 in cs.CV

Abstract: In a recent, strongly emergent literature on few-shot CLIP adaptation, the Linear Probe (LP) has often been reported as a weak baseline. This has motivated intensive research building convoluted prompt-learning or feature-adaptation strategies. In this work, we propose and examine, from convex-optimization perspectives, a generalization of the standard LP baseline in which the linear classifier weights are learnable functions of the text embedding, with class-wise multipliers blending image and text knowledge. As our objective function depends on two types of variables, i.e., the class visual prototypes and the learnable blending parameters, we propose a computationally efficient block coordinate Majorize-Minimize (MM) descent algorithm. In our full-batch MM optimizer, which we coin LP++, step sizes are implicit, unlike standard gradient descent practices in which learning rates are intensively searched over validation sets. By examining the mathematical properties of our loss (e.g., Lipschitz gradient continuity), we build majorizing functions yielding data-driven learning rates and derive approximations of the loss's minima, which provide data-informed initialization of the variables. Our image-language objective function, together with these non-trivial optimization insights and ingredients, yields surprisingly competitive few-shot CLIP performance. Furthermore, LP++ operates in a black-box setting, relaxes intensive validation searches for the optimization hyper-parameters, and runs orders of magnitude faster than state-of-the-art few-shot CLIP adaptation methods. Our code is available at: \url{https://github.com/FereshteShakeri/FewShot-CLIP-Strong-Baseline.git}.
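The general recipe the abstract describes — class weights formed as a visual prototype plus a class-wise multiple of the text embedding, optimized by block coordinate descent with step sizes derived from Lipschitz smoothness bounds rather than a searched learning rate — can be sketched as follows. This is a minimal NumPy illustration under stated assumptions (softmax cross-entropy loss, class-mean prototype initialization, conservative smoothness bounds), not the authors' exact LP++ majorize-minimize formulation; the function and variable names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(X, Y, W):
    # mean softmax cross-entropy with logits X @ W.T
    P = softmax(X @ W.T)
    return -np.mean(np.sum(Y * np.log(P + 1e-12), axis=1))

def blended_probe(X, y, T, n_iters=200):
    """Block coordinate descent on softmax cross-entropy with class
    weights w_c = v_c + alpha_c * t_c (visual prototype plus a
    class-wise multiple of the text embedding t_c). Step sizes come
    from Lipschitz smoothness bounds, so no learning-rate search."""
    n, d = X.shape
    K = T.shape[0]
    Y = np.eye(K)[y]
    # data-informed initialization: class means as visual prototypes
    V = np.stack([X[y == c].mean(axis=0) for c in range(K)])
    alpha = np.zeros(K)
    # softmax CE is (1/2)-smooth in the logits; chain through the data:
    L_v = 0.5 * np.linalg.norm(X, 2) ** 2 / n   # bound for the V block
    S = X @ T.T                                  # per-sample text logits
    L_a = 0.5 * np.sum(S ** 2) / n               # bound for the alpha block
    for _ in range(n_iters):
        # block 1: visual prototypes, implicit step 1 / L_v
        G = (softmax(X @ (V + alpha[:, None] * T).T) - Y) / n
        V -= (G.T @ X) / L_v
        # block 2: class-wise blending multipliers, implicit step 1 / L_a
        G = (softmax(X @ (V + alpha[:, None] * T).T) - Y) / n
        alpha -= (G * S).sum(axis=0) / L_a
    return V, alpha
```

Because each block objective is convex and the step is the inverse of a valid smoothness bound for that block, every update is guaranteed not to increase the loss, which is what makes the optimizer learning-rate-free in this sketch.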

