What Do You See? Enhancing Zero-Shot Image Classification with Multimodal Large Language Models
Abstract: Large language models (LLMs) have been applied effectively to many computer vision tasks, including image classification. In this paper, we present a simple yet effective approach to zero-shot image classification using multimodal LLMs. A multimodal LLM generates comprehensive textual representations of each input image; these representations are then mapped to fixed-dimensional features in a cross-modal embedding space, fused, and classified with a linear classifier. Our method requires no per-dataset prompt engineering: a single, straightforward set of prompts is used across all datasets. Evaluated on several datasets, our method proves remarkably effective, surpassing benchmark accuracy on multiple datasets. Averaged over ten benchmarks, it gains 6.2 percentage points in accuracy, including a 6.8-point gain on ImageNet, compared to prior methods re-evaluated under the same setup. These findings highlight the potential of multimodal LLMs to enhance computer vision tasks such as zero-shot image classification, offering a significant improvement over traditional methods.
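The pipeline described in the abstract can be sketched in a few lines. The sketch below is illustrative only: `embed_text` is a toy bag-of-words stand-in for a real cross-modal text encoder (e.g. CLIP's text tower), the `descs` list stands in for descriptions a multimodal LLM might return to generic prompts, and averaging is one simple fusion choice; the paper's exact components may differ.

```python
import numpy as np

# Tiny fixed vocabulary for the toy encoder (illustrative only).
VOCAB = ["cat", "dog", "whiskers", "fur", "bark", "tail", "photo", "animal"]

def embed_text(text: str) -> np.ndarray:
    """Toy stand-in for a cross-modal text encoder: a normalized
    bag-of-words vector. A real system would use the encoder's
    fixed-dimensional embeddings instead."""
    words = text.lower().split()
    v = np.array([float(words.count(w)) for w in VOCAB])
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def fuse(descriptions: list[str]) -> np.ndarray:
    """Fuse per-prompt textual representations by averaging their
    embeddings (one simple fusion choice, assumed here)."""
    e = np.mean([embed_text(d) for d in descriptions], axis=0)
    return e / np.linalg.norm(e)

def zero_shot_classify(descriptions: list[str], class_names: list[str]) -> str:
    """Linear (inner-product) classifier whose weights are the
    class-name embeddings, so no training data is needed."""
    image_feat = fuse(descriptions)
    weights = np.stack([embed_text(c) for c in class_names])  # one row per class
    scores = weights @ image_feat
    return class_names[int(np.argmax(scores))]

# Hypothetical descriptions a multimodal LLM might produce for one image
# when asked generic prompts such as "What do you see?":
descs = ["a photo of a cat with whiskers", "small animal with fur and a tail"]
print(zero_shot_classify(descs, ["cat", "dog"]))  # → cat
```

In a real deployment, `embed_text` would be replaced by the actual cross-modal encoder, and the class-name rows of the linear classifier could likewise be built from prompt-ensembled class descriptions.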