Enhancing Fine-Grained Image Classifications via Cascaded Vision Language Models (2405.11301v1)

Published 18 May 2024 in cs.CL and cs.CV

Abstract: Fine-grained image classification, particularly in zero/few-shot scenarios, presents a significant challenge for vision-language models (VLMs) such as CLIP. These models often struggle with the nuanced task of distinguishing between semantically similar classes because their pre-training recipe lacks supervision signals for fine-grained categorization. This paper introduces CascadeVLM, an innovative framework that overcomes the constraints of previous CLIP-based methods by effectively leveraging the granular knowledge encapsulated within large vision-language models (LVLMs). Experiments across various fine-grained image datasets demonstrate that CascadeVLM significantly outperforms existing models, notably achieving 85.6% zero-shot accuracy on the Stanford Cars dataset. Performance gain analysis validates that LVLMs produce more accurate predictions on challenging images about which CLIP is uncertain, yielding the overall accuracy gain. Our framework sheds light on a holistic integration of VLMs and LVLMs for effective and efficient fine-grained image classification.
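The cascade idea in the abstract (accept CLIP's answer when it is confident, otherwise hand the top candidates to an LVLM) can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function names, the fixed confidence threshold, and the `lvlm_pick` callback (standing in for a real LVLM query over the image and candidate labels) are all assumptions for illustration.

```python
def cascade_classify(clip_probs, lvlm_pick, k=3, threshold=0.9):
    """Illustrative CLIP->LVLM cascade (hypothetical names and threshold).

    clip_probs: dict mapping class name -> CLIP softmax probability.
    lvlm_pick:  callback choosing among candidate class names; stands in
                for querying an LVLM with the image and top-k labels.
    """
    # Rank classes by CLIP confidence.
    ranked = sorted(clip_probs.items(), key=lambda kv: kv[1], reverse=True)
    top_class, top_prob = ranked[0]

    # Confident CLIP prediction: accept it and skip the costly LVLM call.
    if top_prob >= threshold:
        return top_class

    # Uncertain prediction: let the LVLM decide among the top-k candidates.
    candidates = [name for name, _ in ranked[:k]]
    return lvlm_pick(candidates)


# Toy usage: CLIP is torn between two similar car models, so the
# (mocked) LVLM breaks the tie among the shortlisted candidates.
probs = {"BMW M3 2013": 0.41, "BMW M4 2014": 0.38, "Audi A4 2012": 0.21}
print(cascade_classify(probs, lvlm_pick=lambda cands: cands[1]))
```

The design point the paper's analysis suggests is efficiency: most images are resolved by the cheap CLIP pass, and only the uncertain minority incurs an LVLM call.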

Authors (1)
  1. Canshi Wei