Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models (2405.11301v1)
Abstract: Fine-grained image classification, particularly in zero-/few-shot scenarios, presents a significant challenge for vision-language models (VLMs) such as CLIP. These models often struggle to distinguish between semantically similar classes because their pre-training recipe lacks supervision signals for fine-grained categorization. This paper introduces CascadeVLM, a framework that overcomes the constraints of previous CLIP-based methods by effectively leveraging the granular knowledge encapsulated in large vision-language models (LVLMs). Experiments across several fine-grained image datasets demonstrate that CascadeVLM significantly outperforms existing models, achieving 85.6% zero-shot accuracy on the Stanford Cars dataset. Performance-gain analysis shows that the LVLM produces more accurate predictions on challenging images that CLIP is uncertain about, which drives the overall accuracy boost. Our framework sheds light on a holistic integration of VLMs and LVLMs for effective and efficient fine-grained image classification.
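The cascade described above can be sketched in a few lines: CLIP's zero-shot prediction is accepted when its class distribution is confident, and otherwise the top-k candidate classes are handed to an LVLM for a second opinion. This is a minimal illustration, not the paper's implementation; the entropy criterion, the threshold value, and the `lvlm_predict` interface (standing in for a prompted model such as GPT-4V) are all assumptions.

```python
import math


def entropy(probs):
    """Shannon entropy of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def cascade_classify(image, clip_probs, lvlm_predict, k=5, threshold=0.5):
    """Cascaded fine-grained classification sketch.

    clip_probs   -- dict mapping class name -> CLIP zero-shot probability.
    lvlm_predict -- hypothetical callable (image, candidate_classes) -> class,
                    standing in for a prompted LVLM.
    k            -- number of top CLIP candidates forwarded to the LVLM.
    threshold    -- entropy cutoff below which CLIP's top-1 is trusted.
    """
    # Rank classes by CLIP confidence, highest first.
    ranked = sorted(clip_probs, key=clip_probs.get, reverse=True)
    # Confident CLIP prediction: return its top-1 without calling the LVLM.
    if entropy(clip_probs.values()) < threshold:
        return ranked[0]
    # Uncertain case: defer the top-k candidates to the (more costly) LVLM.
    return lvlm_predict(image, ranked[:k])
```

Gating on CLIP's uncertainty keeps the expensive LVLM call off the easy majority of images, which is what makes the cascade efficient as well as accurate.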