CLAMP: Contrastive LAnguage Model Prompt-tuning (arXiv:2312.01629v2)
Abstract: LLMs have emerged as powerful general-purpose interfaces for many machine learning problems. Recent work has adapted LLMs to generative visual tasks such as image captioning, visual question answering, and visual chat, using a relatively small amount of instruction-tuning data. In this paper, we explore whether modern LLMs can also be adapted to classify an image into a set of categories. First, we evaluate multimodal LLMs (mLLMs) that are tuned for generative tasks on zero-shot image classification and find that their performance falls far below that of specialized models like CLIP. We then propose an approach for lightly fine-tuning LLMs using the same contrastive image-caption matching objective as CLIP. Our results show that LLMs can, indeed, achieve good image classification performance when adapted this way. Our approach beats state-of-the-art mLLMs by 13% and slightly outperforms contrastive learning with a custom text model, while also retaining the LLM's generative abilities. LLM initialization appears to be particularly helpful for classification in domains that are under-represented in the visual pre-training data.
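For readers unfamiliar with the objective the abstract refers to, below is a minimal PyTorch sketch of a CLIP-style symmetric contrastive (InfoNCE) loss over a batch of matched image-caption pairs. The function name, temperature value, and random stand-in embeddings are illustrative assumptions rather than the paper's implementation; in CLAMP the text embeddings would come from the lightly fine-tuned LLM and the image embeddings from a visual encoder.

```python
# Minimal sketch of a CLIP-style contrastive image-caption matching loss.
# The encoders are not shown; random tensors stand in for their outputs.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb: torch.Tensor,
                          text_emb: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE loss over a batch of matched image/caption pairs.

    image_emb, text_emb: (batch, dim) embeddings; row i of each is a pair.
    """
    # L2-normalize so the dot product is a cosine similarity.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # (batch, batch) similarity matrix; diagonal entries are the positives.
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)

    # Cross-entropy in both directions: image-to-text and text-to-image.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Toy usage with random embeddings standing in for encoder outputs.
if __name__ == "__main__":
    batch, dim = 8, 512
    img = torch.randn(batch, dim)
    txt = torch.randn(batch, dim)
    print(clip_contrastive_loss(img, txt))
```

At inference time, the same similarity scores enable zero-shot classification: each category name is embedded as a caption, and the image is assigned to the highest-scoring one.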
Authors: Piotr Teterwak, Ximeng Sun, Bryan A. Plummer, Kate Saenko, Ser-Nam Lim