
CLAMP: Contrastive LAnguage Model Prompt-tuning (2312.01629v2)

Published 4 Dec 2023 in cs.CV

Abstract: Large language models (LLMs) have emerged as powerful general-purpose interfaces for many machine learning problems. Recent work has adapted LLMs to generative visual tasks like image captioning, visual question answering, and visual chat, using a relatively small amount of instruction-tuning data. In this paper, we explore whether modern LLMs can also be adapted to classifying an image into a set of categories. First, we evaluate multimodal LLMs (mLLMs) that are tuned for generative tasks on zero-shot image classification and find that their performance is far below that of specialized models like CLIP. We then propose an approach for light fine-tuning of LLMs using the same contrastive image-caption matching objective as CLIP. Our results show that LLMs can, indeed, achieve good image classification performance when adapted this way. Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model, while also retaining the LLM's generative abilities. LLM initialization appears to particularly help classification in domains under-represented in the visual pre-training data.
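The contrastive image-caption matching objective mentioned in the abstract is the symmetric InfoNCE loss popularized by CLIP: within a batch, each image embedding should be most similar to its own caption's embedding, and vice versa. A minimal NumPy sketch of that loss (illustrative only; the function name, temperature value, and input shapes are assumptions, not the paper's implementation):

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired image/text embeddings.

    image_emb, text_emb: arrays of shape (batch, dim), row i of each is a
    matched image-caption pair.
    """
    # L2-normalize so the dot product is cosine similarity
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)

    # Pairwise similarity logits, sharpened by the temperature
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    labels = np.arange(n)  # matched pairs sit on the diagonal

    def cross_entropy(l):
        # Numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # Average the image->text and text->image directions
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With perfectly aligned, mutually orthogonal embeddings the loss approaches zero; for random embeddings it sits near log(batch size). In the paper's setting, this objective is used to lightly fine-tune the LLM so that its pooled text representations align with a visual encoder, replacing the from-scratch text tower CLIP normally trains.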

Authors (5)
  1. Piotr Teterwak
  2. Ximeng Sun
  3. Bryan A. Plummer
  4. Kate Saenko
  5. Ser-Nam Lim