Analyzing the Image Classification Capabilities of Visually-Grounded LLMs
The paper "Why are Visually-Grounded LLMs Bad at Image Classification?" addresses a significant gap in the performance of visually-grounded LLMs (VLMs) compared to traditional image classification models such as CLIP. Despite the integration of powerful vision encoders and large parameter bases, these VLMs underperform in standard image classification tasks, such as those found in ImageNet.
Evaluation and Findings
The analysis begins with a comparative evaluation of ten prominent VLMs, both proprietary and open-source, including GPT-4V and LLaVA, against CLIP models on datasets such as ImageNet and Flowers102. The results consistently reveal a substantial gap: the best-performing VLM reaches only 60.6% accuracy on ImageNet, while a CLIP model attains 79.2%.
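To make the comparison concrete, the following is a minimal sketch of the zero-shot CLIP classification protocol that serves as the baseline, assuming the open_clip library; the model variant, prompt template, class list, and image path are illustrative rather than the paper's exact configuration.

```python
import torch
import open_clip
from PIL import Image

# Illustrative CLIP variant; the paper compares against several CLIP models.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model.eval()

# Illustrative subset of class names; a real run would use all ImageNet classes.
class_names = ["golden retriever", "tabby cat", "red fox"]
text = tokenizer([f"a photo of a {c}" for c in class_names])

image = preprocess(Image.open("example.jpg")).unsqueeze(0)  # illustrative path

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text)
    # Cosine similarity between the image and each class prompt.
    image_feat /= image_feat.norm(dim=-1, keepdim=True)
    text_feat /= text_feat.norm(dim=-1, keepdim=True)
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print(class_names[probs.argmax().item()])
```

In this protocol the class set is fixed in advance and classification reduces to a nearest-prompt lookup, whereas a VLM must produce the class name as free-form text, which is part of what the paper probes.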
Investigating the Causes
The paper then investigates why VLMs fall short on classification. The analysis is structured around three candidate explanations: inference, training, and data:
- Inference: The paper examines the impact of prompt variations and alternative inference strategies. Techniques such as probabilistic inference, in which the model scores each candidate label instead of generating free-form text, improve VLM accuracy but do not close the gap (see the sketch after this list).
- Training Approach: A striking finding is that VLMs do retain classification-relevant information in their latent representations, but the generative text objective used during training fails to exploit it. Moreover, fine-tuning VLMs on classification data with an appropriate training objective brings their performance on par with traditional classifiers.
- Data: The paper identifies data as the core issue, showing a strong correlation between how often a class is seen during training and how accurately it is classified (a rank-correlation check of this kind is sketched below). VLMs whose training data carries sufficient classification signal perform well, highlighting the critical role of comprehensive, label-rich datasets.
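The probabilistic inference strategy mentioned above can be illustrated as follows: instead of letting the VLM generate an open-ended answer, each candidate class name is scored by the log-probability the model assigns to it, and the highest-scoring class is predicted. This is a minimal sketch assuming the Hugging Face transformers implementation of LLaVA-1.5; the prompt wording, class list, image path, and scoring details are illustrative, not the paper's exact procedure.

```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)
model.eval()

def label_log_likelihood(image, prompt, label):
    """Sum the log-probabilities the model assigns to the tokens of `label`
    appended to the prompt (assumes the tokenizer does not merge tokens
    across the prompt/label boundary)."""
    inputs = processor(
        text=prompt + " " + label, images=image, return_tensors="pt"
    ).to(model.device, torch.float16)
    label_ids = processor.tokenizer(label, add_special_tokens=False).input_ids
    with torch.no_grad():
        logits = model(**inputs).logits  # (1, seq_len, vocab_size)
    log_probs = torch.log_softmax(logits.float(), dim=-1)
    seq_len = logits.shape[1]
    total = 0.0
    for i, tok in enumerate(label_ids):
        # The logit at position p predicts the token at position p + 1,
        # so the label tokens occupy the final len(label_ids) positions.
        total += log_probs[0, seq_len - len(label_ids) - 1 + i, tok].item()
    return total

image = Image.open("example.jpg")  # illustrative path
prompt = "USER: <image>\nWhat type of object is in this photo?\nASSISTANT:"
classes = ["golden retriever", "tabby cat", "red fox"]  # illustrative label set
scores = {c: label_log_likelihood(image, prompt, c) for c in classes}
print(max(scores, key=scores.get))
```

Constraining the output space this way removes errors from free-form generation, which is why it helps, but as the paper reports it still does not bridge the gap to CLIP.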
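The frequency-versus-accuracy relationship noted under the data hypothesis can be checked with a simple rank-correlation analysis. The sketch below assumes per-class accuracies and training-text frequencies have already been computed; the dictionaries shown are illustrative placeholders, not values from the paper.

```python
from scipy.stats import spearmanr

# class name -> how often the class name appears in the VLM's training text,
# e.g. counted by string-matching class names against image captions (illustrative values).
train_frequency = {"tench": 1200, "goldfish": 45000, "hammerhead": 800}
# class name -> the VLM's classification accuracy on that class's test images (illustrative values).
per_class_accuracy = {"tench": 0.31, "goldfish": 0.92, "hammerhead": 0.18}

classes = sorted(train_frequency)
rho, p_value = spearmanr(
    [train_frequency[c] for c in classes],
    [per_class_accuracy[c] for c in classes],
)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```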
Enhancing VLM Capabilities
To address these challenges, the paper proposes integrating classification data into VLM training. This simple but effective strategy improves both classification accuracy and the models' general capabilities: on the newly created ImageWikiQA dataset, it yields an 11.8% improvement on complex visual question answering, underscoring the benefit of grounding the training regimen in foundational classification tasks. A sketch of how classification labels can be recast as training data follows below.
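One way to realize this integration is to rewrite (image, label) pairs from a classification dataset as instruction-tuning conversations. The following sketch uses a LLaVA-style JSON record layout; the question templates, field names, and file paths are illustrative assumptions rather than the paper's exact recipe.

```python
import json
import random

# Illustrative question templates; varying the phrasing avoids overfitting to one prompt.
QUESTION_TEMPLATES = [
    "What type of object is shown in this image?",
    "Identify the main subject of this photo.",
    "Which category best describes this image?",
]

def to_instruction_sample(image_path: str, class_name: str, sample_id: str) -> dict:
    """Turn one (image, label) pair into a single-turn conversation record."""
    question = random.choice(QUESTION_TEMPLATES)
    return {
        "id": sample_id,
        "image": image_path,
        "conversations": [
            {"from": "human", "value": f"<image>\n{question}"},
            {"from": "gpt", "value": class_name},
        ],
    }

# Example: wrap an (image, label) listing into a training JSON file (illustrative path and label).
records = [
    to_instruction_sample("train/n01440764/ILSVRC2012_val_00000293.JPEG", "tench", "imagenet-0"),
]
with open("classification_instructions.json", "w") as f:
    json.dump(records, f, indent=2)
```

Records in this form can simply be mixed into the existing instruction-tuning data, so the generative objective itself is unchanged; only the data distribution shifts toward explicit classification supervision.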
Implications and Future Directions
These findings are significant in suggesting that robust classification serves as a foundation for more complex visual reasoning tasks. The research lays the groundwork for further exploration of data-efficient training strategies and informs future work aimed at improving not only VLMs' classification performance but also their broader applicability.
Conclusion
The systematic analysis provided in the paper clarifies the current limitations of VLMs and offers practical approaches to enhance their capabilities. It underscores the importance of incorporating targeted classification data and suggests paths for advancing AI applications that rely on visual understanding. Future research might explore zero-shot learning strategies or hybrid models that reduce the need for exhaustive data while maintaining strong performance in both classification and more advanced tasks.