Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck
The paper titled "Vision LLMs Are Bad at Hierarchical Visual Understanding, and LLMs Are the Bottleneck" presents a rigorous investigation into the limitations of current Vision Large Language Models (VLLMs) regarding hierarchical understanding in visual classification tasks. The researchers focus on VLLMs' ability to process and interpret structured semantic hierarchies found in images, highlighting a significant bottleneck in their performance: Large Language Models (LLMs) themselves.
Hierarchical Understanding in VLLMs
Standard image classification maps visual inputs to categorical labels using pre-trained models. Hierarchical image classification extends this ability: a model must not only identify the fine-grained leaf category but also remain consistent along the entire taxonomic path leading to that leaf. For example, an image of a golden retriever should be labeled correctly at every level, from the coarsest category down to the species; a minimal sketch of this requirement follows. To assess VLLMs, the paper uses taxonomies spanning several domains, including biological categories such as animals and plants as well as man-made artifacts.
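To make the consistency requirement concrete, here is a minimal sketch over a toy two-level taxonomy. The taxonomy, level structure, and labels are hypothetical illustrations, not the datasets used in the paper.

```python
# Each leaf maps to its full root-to-leaf path, coarsest level first.
# This toy taxonomy is a made-up example for illustration.
TAXONOMY = {
    "golden retriever": ["animal", "dog", "golden retriever"],
    "siamese cat":      ["animal", "cat", "siamese cat"],
    "oak":              ["plant", "tree", "oak"],
}

def is_consistent(predicted_path: list[str]) -> bool:
    """A prediction is hierarchically consistent iff the predicted
    coarse labels match the taxonomy's ancestors of the predicted leaf."""
    leaf = predicted_path[-1]
    gold_path = TAXONOMY.get(leaf)
    return gold_path is not None and predicted_path == gold_path

# A model that answers "dog" at the middle level but "siamese cat" at
# the leaf level is inconsistent, even though each answer alone is a
# valid category:
print(is_consistent(["animal", "dog", "siamese cat"]))       # False
print(is_consistent(["animal", "dog", "golden retriever"]))  # True
```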
Key Findings
The authors evaluate VLLMs on a structured set of approximately one million visual question answering tasks drawn from six hierarchical classification datasets. Across these tasks, they measure two quantities: hierarchical consistent accuracy (HCA), which counts a sample as correct only if every label along its taxonomic path is correct, and leaf-level accuracy \(\text{Acc}_\text{leaf}\), which scores only the finest-grained label. A minimal sketch of the two metrics is given below.
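To illustrate the gap between the two metrics, here is a short sketch assuming predictions and ground truth are represented as per-level label paths, ordered coarse to fine. The function names and toy labels are illustrative, not taken from the paper's evaluation code.

```python
def leaf_accuracy(preds: list[list[str]], golds: list[list[str]]) -> float:
    """Fraction of samples whose finest-grained (leaf) label is correct."""
    return sum(p[-1] == g[-1] for p, g in zip(preds, golds)) / len(golds)

def hierarchical_consistent_accuracy(preds: list[list[str]],
                                     golds: list[list[str]]) -> float:
    """Fraction of samples whose *entire* taxonomic path is correct;
    one wrong level anywhere on the path makes the whole sample wrong."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

golds = [["animal", "dog", "golden retriever"],
         ["animal", "cat", "siamese cat"]]
preds = [["animal", "dog", "golden retriever"],  # fully correct path
         ["animal", "dog", "siamese cat"]]       # leaf right, path wrong

print(leaf_accuracy(preds, golds))                     # 1.0
print(hierarchical_consistent_accuracy(preds, golds))  # 0.5
```

By construction, HCA can never exceed \(\text{Acc}_\text{leaf}\); the size of the gap measures how often a model gets the leaf right while contradicting itself at coarser levels.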
The paper highlights several crucial findings:

Deficient Hierarchical Consistency: Despite performing reasonably well at identifying fine-grained classes, VLLMs show a marked deficiency in preserving hierarchical consistency. State-of-the-art models, including open-source families such as the Qwen2.5-VL series, exhibit significant drops in HCA relative to \(\text{Acc}_\text{leaf}\). This gap underscores a fundamental limitation in maintaining semantic coherence across taxonomy levels.
Influence of Model Scale: Scaling up model size improves both hierarchical consistency and leaf-level accuracy. However, even large models such as Qwen2.5-VL-72B remain far from consistent over full taxonomic paths.
Visual Embeddings Are Not the Bottleneck: By probing visual embeddings at different stages of the VLLM architecture, the authors show that visual tokens retain rich, discriminative features before they ever reach the language layers. The limitation therefore lies not in the visual representation but in how the LLM subsequently handles it; a sketch of the probing setup appears after this list.
LLMs Lack Hierarchical Knowledge: When evaluated independently on text-only classification tasks, the underlying LLMs reason inconsistently across taxonomy levels. Their low performance points to missing taxonomy knowledge, reaffirming the hypothesis that the LLM is the bottleneck in hierarchical visual understanding.
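The probing result behind the third finding can be illustrated with a standard linear-probe recipe: freeze the model, extract visual-token embeddings at some layer, and fit one linear classifier per taxonomy level. The sketch below uses scikit-learn's LogisticRegression as the probe; how features are extracted from a particular VLLM is left abstract, and all names here are assumptions of this summary, not the paper's code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_level(train_feats: np.ndarray, train_labels: np.ndarray,
                test_feats: np.ndarray, test_labels: np.ndarray) -> float:
    """Fit a linear probe on frozen features and return test accuracy.
    High accuracy at every taxonomy level would indicate the visual
    embeddings already encode the hierarchy, placing the bottleneck
    downstream in the LLM."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(train_feats, train_labels)
    return clf.score(test_feats, test_labels)

# Hypothetical usage: one probe per taxonomy level on the same frozen
# features. feats_* have shape (num_images, dim); labels_by_level is a
# list of (train_labels, test_labels) pairs, one pair per level.
#
# for level, (y_tr, y_te) in enumerate(labels_by_level):
#     acc = probe_level(feats_train, y_tr, feats_test, y_te)
#     print(f"level {level}: probe accuracy {acc:.3f}")
```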
Finetuning and Future Directions
Although finetuning VLLMs on hierarchical tasks improves consistency, the researchers speculate that closing the taxonomy knowledge gap may require more thorough interventions at the pre-training stage rather than post-hoc adjustments. This opens an avenue for enriching LLMs with explicit taxonomy-oriented data during initial training; one possible shape such data could take is sketched below. Stronger taxonomy knowledge in LLMs would likely carry over to the VLLMs built on top of them, embedding hierarchical consistency deeply within multimodal language models.
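As one illustration of taxonomy-oriented training data, the sketch below emits one question-answer pair per taxonomy level for each image, so a model is supervised on the whole path rather than only the leaf. The question template, level names, and record format are assumptions for illustration, not the paper's actual data pipeline.

```python
# Toy taxonomy reused from the earlier sketch; purely illustrative.
TAXONOMY = {
    "golden retriever": ["animal", "dog", "golden retriever"],
    "oak":              ["plant", "tree", "oak"],
}
LEVEL_NAMES = ["coarse category", "family-level category",
               "species-level category"]

def make_examples(image_path: str, leaf: str) -> list[dict]:
    """Build one VQA-style record per taxonomy level for a single image."""
    return [
        {
            "image": image_path,
            "question": f"What is the {LEVEL_NAMES[i]} of the object "
                        f"in this image?",
            "answer": label,
        }
        for i, label in enumerate(TAXONOMY[leaf])
    ]

for ex in make_examples("images/retriever_001.jpg", "golden retriever"):
    print(ex)
```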
Broader Implications
Understanding and resolving this hierarchical inconsistency is crucial for deploying VLLMs in applications that need both fine-grained and coarse-grained predictions, such as biodiversity assessment, medical imaging, and autonomous systems whose layered decision-making depends on semantic hierarchies. It also prompts a rethink of how language models are trained: enriching datasets with structured, taxonomy-based annotations and exploring model architectures that prioritize semantic coherence.
This paper serves as a clarion call to the AI research community, emphasizing the need to bridge gaps in taxonomy knowledge within language models to advance hierarchical visual understanding in current AI systems.