Finer: Investigating and Enhancing Fine-Grained Visual Concept Recognition in Large Vision Language Models (2402.16315v3)

Published 26 Feb 2024 in cs.CV and cs.CL

Abstract: Recent advances in instruction-tuned Large Vision-Language Models (LVLMs) have imbued the models with the ability to generate high-level, image-grounded explanations with ease. While such capability is largely attributed to the rich world knowledge contained within the underlying Large Language Models (LLMs), our work reveals their shortcomings in fine-grained visual categorization (FGVC) across six different benchmark settings. Recent state-of-the-art LVLMs such as LLaVA-1.5, InstructBLIP, and GPT-4V not only deteriorate severely in classification performance (e.g., an average drop of 65.58 in exact match (EM) on Stanford Dogs for LLaVA-1.5), but also struggle to generate an accurate explanation with detailed attributes for the concept that appears in an input image, despite their ability to generate holistic image-level descriptions. In-depth analyses show that instruction-tuned LVLMs exhibit a modality gap: given textual and visual inputs that correspond to the same concept, they respond inconsistently, which prevents the image modality from leveraging the rich parametric knowledge within the LLMs. In an effort to further the community's endeavor in this direction, we propose Finer, a multiple-granularity, attribute-centric evaluation benchmark that aims to establish a ground for evaluating LVLMs' fine-grained visual comprehension ability and to provide significantly improved explainability.
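
The classification results above are reported in exact match (EM). As a rough, self-contained sketch of how such a score might be computed over model outputs (the string normalization below is an assumption made for illustration, not the paper's exact protocol):

```python
import re

def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace (assumed normalization)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())

def exact_match(prediction: str, gold_label: str) -> bool:
    """True if the normalized prediction equals the normalized gold label."""
    return normalize(prediction) == normalize(gold_label)

def em_score(predictions: list[str], gold_labels: list[str]) -> float:
    """Average exact-match accuracy over a benchmark split, in percent."""
    hits = sum(exact_match(p, g) for p, g in zip(predictions, gold_labels))
    return 100.0 * hits / len(predictions)

# Toy fine-grained dog-breed example (illustrative labels, not the paper's data):
preds = ["Samoyed", "a fluffy white dog", "Shiba Inu"]
golds = ["Samoyed", "Samoyed", "Shiba Inu"]
print(f"EM = {em_score(preds, golds):.2f}")  # EM = 66.67
```

Under this kind of metric, a coarse but plausible answer like "a fluffy white dog" scores zero against a fine-grained gold label, which is why holistic description ability alone does not translate into FGVC performance.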

Authors (2)
  1. Jeonghwan Kim
  2. Heng Ji
Citations (1)