Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck
The paper titled "Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck" presents a rigorous investigation into the limitations of current Vision Large Language Models (VLLMs) regarding hierarchical understanding in visual classification tasks. The researchers focus on VLLMs' ability to process and interpret structured semantic hierarchies found in images, highlighting a significant bottleneck in their performance: Large Language Models (LLMs) themselves.
Hierarchical Understanding in VLLMs
Standard image classification maps visual inputs to categorical labels using pre-trained models. Hierarchical image classification extends this ability: a model must not only identify the fine-grained leaf category but also remain consistent along the entire taxonomic path leading to that leaf. For example, an image of a golden retriever should be labeled correctly at every level, from the coarsest category down to the species; a minimal sketch of this requirement follows. To assess VLLMs, the paper uses taxonomies spanning several domains, including biological categories such as animals and plants as well as man-made artifacts.
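To make the consistency requirement concrete, here is a minimal sketch over a toy two-level taxonomy. The taxonomy, level structure, and labels are hypothetical illustrations, not the datasets used in the paper.

```python
# Each leaf maps to its full root-to-leaf path, coarsest level first.
# This toy taxonomy is a made-up example for illustration.
TAXONOMY = {
    "golden retriever": ["animal", "dog", "golden retriever"],
    "siamese cat":      ["animal", "cat", "siamese cat"],
    "oak":              ["plant", "tree", "oak"],
}

def is_consistent(predicted_path: list[str]) -> bool:
    """A prediction is hierarchically consistent iff the predicted
    coarse labels match the taxonomy's ancestors of the predicted leaf."""
    leaf = predicted_path[-1]
    gold_path = TAXONOMY.get(leaf)
    return gold_path is not None and predicted_path == gold_path

# A model that answers "dog" at the middle level but "siamese cat" at
# the leaf level is inconsistent, even though each answer alone is a
# valid category:
print(is_consistent(["animal", "dog", "siamese cat"]))       # False
print(is_consistent(["animal", "dog", "golden retriever"]))  # True
```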
Key Findings
The authors evaluate VLLMs on a structured set of approximately one million visual question answering tasks drawn from six hierarchical classification datasets. Across these tasks, they measure two quantities: hierarchical consistent accuracy (HCA), which counts a sample as correct only if every label along its taxonomic path is correct, and leaf-level accuracy \(\text{Acc}_\text{leaf}\), which scores only the finest-grained label. A minimal sketch of the two metrics is given below.
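To illustrate the gap between the two metrics, here is a short sketch assuming predictions and ground truth are represented as per-level label paths, ordered coarse to fine. The function names and toy labels are illustrative, not taken from the paper's evaluation code.

```python
def leaf_accuracy(preds: list[list[str]], golds: list[list[str]]) -> float:
    """Fraction of samples whose finest-grained (leaf) label is correct."""
    return sum(p[-1] == g[-1] for p, g in zip(preds, golds)) / len(golds)

def hierarchical_consistent_accuracy(preds: list[list[str]],
                                     golds: list[list[str]]) -> float:
    """Fraction of samples whose *entire* taxonomic path is correct;
    one wrong level anywhere on the path makes the whole sample wrong."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

golds = [["animal", "dog", "golden retriever"],
         ["animal", "cat", "siamese cat"]]
preds = [["animal", "dog", "golden retriever"],  # fully correct path
         ["animal", "dog", "siamese cat"]]       # leaf right, path wrong

print(leaf_accuracy(preds, golds))                     # 1.0
print(hierarchical_consistent_accuracy(preds, golds))  # 0.5
```

By construction, HCA can never exceed \(\text{Acc}_\text{leaf}\); the size of the gap measures how often a model gets the leaf right while contradicting itself at coarser levels.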
The paper highlights several crucial findings:

Deficient Hierarchical Consistency: Despite performing reasonably well at identifying fine-grained classes, VLLMs show a marked deficiency in preserving hierarchical consistency. State-of-the-art models, including open-source families such as the Qwen2.5-VL series, exhibit significant drops in HCA relative to \(\text{Acc}_\text{leaf}\). This gap underscores a fundamental limitation in maintaining semantic coherence across taxonomy levels.
Influence of Model Scale: Scaling up model size improves both hierarchical consistency and leaf-level accuracy. However, even large models such as Qwen2.5-VL-72B remain far from consistent over full taxonomic paths.
Visual Embeddings Are Not the Bottleneck: By probing visual embeddings at different stages of the VLLM architecture, the authors show that visual tokens retain rich, discriminative features before they ever reach the language layers. The limitation therefore lies not in the visual representation but in how the LLM subsequently handles it; a sketch of the probing setup appears after this list.
LLMs Lack Hierarchical Knowledge: When evaluated independently on text-only classification tasks, the underlying LLMs reason inconsistently across taxonomy levels. Their low performance points to missing taxonomy knowledge, reaffirming the hypothesis that the LLM is the bottleneck in hierarchical visual understanding.
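The probing result behind the third finding can be illustrated with a standard linear-probe recipe: freeze the model, extract visual-token embeddings at some layer, and fit one linear classifier per taxonomy level. The sketch below uses scikit-learn's LogisticRegression as the probe; how features are extracted from a particular VLLM is left abstract, and all names here are assumptions of this summary, not the paper's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_level(train_feats: np.ndarray, train_labels: np.ndarray,
                test_feats: np.ndarray, test_labels: np.ndarray) -> float:
    """Fit a linear probe on frozen features and return test accuracy.
    High accuracy at every taxonomy level would indicate the visual
    embeddings already encode the hierarchy, placing the bottleneck
    downstream in the LLM."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)

# Hypothetical usage: one probe per taxonomy level on the same frozen
# features. feats_* have shape (num_images, dim); labels_by_level is a
# list of (train_labels, test_labels) pairs, one pair per level.
#
# for level, (y_tr, y_te) in enumerate(labels_by_level):
#     acc = probe_level(feats_train, y_tr, feats_test, y_te)
#     print(f"level {level}: probe accuracy {acc:.3f}")
```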
Finetuning and Future Directions
Although finetuning VLLMs on hierarchical tasks improves consistency, the researchers speculate that closing the taxonomy knowledge gap may require more thorough interventions at the pre-training stage rather than post-hoc adjustments. This opens an avenue for enriching LLMs with explicit taxonomy-oriented data during initial training; one possible shape such data could take is sketched below. Stronger taxonomy knowledge in LLMs would likely carry over to the VLLMs built on top of them, embedding hierarchical consistency deeply within multimodal language models.
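As one illustration of taxonomy-oriented training data, the sketch below emits one question-answer pair per taxonomy level for each image, so a model is supervised on the whole path rather than only the leaf. The question template, level names, and record format are assumptions for illustration, not the paper's actual data pipeline.

```python
# Toy taxonomy reused from the earlier sketch; purely illustrative.
TAXONOMY = {
    "golden retriever": ["animal", "dog", "golden retriever"],
    "oak":              ["plant", "tree", "oak"],
}
LEVEL_NAMES = ["coarse category", "family-level category",
               "species-level category"]

def make_examples(image_path: str, leaf: str) -> list[dict]:
    """Build one VQA-style record per taxonomy level for a single image."""
    return [
        {
            "image": image_path,
            "question": f"What is the {LEVEL_NAMES[i]} of the object "
                        f"in this image?",
            "answer": label,
        }
        for i, label in enumerate(TAXONOMY[leaf])
    ]

for ex in make_examples("images/retriever_001.jpg", "golden retriever"):
    print(ex)
```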
Broader Implications
Understanding and resolving this hierarchical inconsistency is crucial for deploying VLLMs in applications that need both fine-grained and coarse-grained predictions, such as biodiversity assessment, medical imaging, and autonomous systems whose layered decision-making depends on semantic hierarchies. It also prompts a rethink of how language models are trained: enriching datasets with structured, taxonomy-based annotations and exploring model architectures that prioritize semantic coherence.
This paper serves as a clarion call to the AI research community, emphasizing the need to bridge gaps in taxonomy knowledge within language models to advance hierarchical visual understanding in current AI systems.