Analyzing Small Visual Detail Perception in Zero-shot VQA with MLLMs
The paper “Towards Perceiving Small Visual Details in Zero-shot Visual Question Answering with Multimodal LLMs” presents a detailed examination of the capabilities and limitations of Multimodal LLMs (MLLMs) on zero-shot Visual Question Answering (VQA). The researchers seek to understand and address these models' difficulty in accurately interpreting small visual details within images, despite their broad success on general VQA tasks.
Core Issues and Observations
The paper critically explores whether MLLMs, despite their vast pretraining on diverse datasets, falter when required to perceive fine visual details pertinent to specific questions. The authors highlight that the zero-shot accuracy of MLLMs, such as the BLIP-2 model, is notably sensitive to the sizes of objects of interest within images. Specifically, they document accuracy declines of up to 46% when dealing with smaller visual subjects, suggesting a substantial gap in the current perceptual capabilities of these models.
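To make the size-sensitivity analysis concrete, the sketch below shows one way such an evaluation could be run: compute each question's relative object size from an annotated bounding box and bucket accuracy by that size. The record fields (`bbox`, `image_size`, `correct`) and the bin edges are illustrative assumptions, not the paper's exact protocol.

```python
# Hedged sketch: bucket zero-shot VQA accuracy by the relative size of the
# object of interest (bounding-box area / image area). Field names and bin
# edges are illustrative assumptions, not the paper's protocol.
from collections import defaultdict

def relative_size(bbox, image_w, image_h):
    """bbox = (x, y, w, h) in pixels; fraction of the image area it covers."""
    _, _, w, h = bbox
    return (w * h) / float(image_w * image_h)

def accuracy_by_size(records, bins=(0.005, 0.02, 0.1, 1.0)):
    """records: dicts with 'bbox', 'image_size' (w, h), and boolean 'correct'."""
    totals, hits = defaultdict(int), defaultdict(int)
    for r in records:
        frac = relative_size(r["bbox"], *r["image_size"])
        bucket = next(b for b in bins if frac <= b)  # first bin that contains it
        totals[bucket] += 1
        hits[bucket] += int(r["correct"])
    return {b: hits[b] / totals[b] for b in sorted(totals)}
```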
Furthermore, the paper presents human visual cropping as an intervention that significantly reduces the aforementioned accuracy gap. This indicates that the perceptual limitation is causally linked to object size and suggests that explicitly focusing the model on the relevant image region can mitigate the issue. The efficacy of such approaches is tested on several datasets, including a tailored subset of VQAv2 focusing on fine visual details, as well as TextVQA, which highlights the importance of Optical Character Recognition (OCR) and attention to textual components within images.
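The human-cropping intervention can be pictured as an oracle that crops the image around the ground-truth region before querying the model. The following is a minimal sketch under assumptions: `vqa_model` is a stand-in for any zero-shot VQA callable (e.g., a BLIP-2 wrapper), and the 20% margin is an arbitrary illustrative choice rather than the paper's setting.

```python
# Hypothetical oracle cropping: enlarge the annotated box by a margin, crop,
# and answer from the cropped view. `vqa_model` is an assumed callable.
from PIL import Image

def crop_to_region(image: Image.Image, bbox, margin: float = 0.2) -> Image.Image:
    """bbox = (x, y, w, h); expand by `margin` on each side, clipped to the image."""
    x, y, w, h = bbox
    dx, dy = margin * w, margin * h
    left = max(0, x - dx)
    top = max(0, y - dy)
    right = min(image.width, x + w + dx)
    bottom = min(image.height, y + h + dy)
    return image.crop((int(left), int(top), int(right), int(bottom)))

def answer_with_oracle_crop(vqa_model, image_path, question, bbox):
    image = Image.open(image_path).convert("RGB")
    cropped = crop_to_region(image, bbox)
    # The cropped view is fed to the MLLM in place of the full image.
    return vqa_model(image=cropped, question=question)
```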
Proposed Methodology
To address this limitation in perceiving small details, the authors propose an array of automatic visual cropping methods that leverage external models such as CLIP, YOLO, and SAM, or the native decision processes of the MLLMs themselves. These cropping strategies are aimed at improving the zero-shot VQA performance:
- clip-CROP: Employs a CLIP-based similarity metric to prioritize the image regions most aligned with the question, via progressive cropping (a rough sketch of the underlying idea follows this list).
- yolo-CROP: Applies YOLO for object detection to preemptively filter out nonsalient regions, then determines region relevance using CLIP.
- sam-CROP: Utilizes SAM for extensive segmentation, considering a broader set of potential regions pertinent to the question.
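As a rough illustration of the CLIP-guided idea behind clip-CROP, the sketch below scores a set of candidate crops against the question with an off-the-shelf CLIP model and keeps the best-matching one. It uses a simple multi-scale sliding window rather than the paper's progressive cropping procedure, and the checkpoint, scales, and stride are assumptions.

```python
# Minimal CLIP-guided cropping sketch (not the paper's exact algorithm):
# rank candidate crops by CLIP image-text similarity with the question.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def candidate_crops(image: Image.Image, scales=(0.75, 0.5), stride_frac=0.5):
    """Multi-scale sliding windows; the full image is always a candidate."""
    W, H = image.size
    crops = [image]
    for s in scales:
        w, h = int(W * s), int(H * s)
        step_x, step_y = max(1, int(w * stride_frac)), max(1, int(h * stride_frac))
        for top in range(0, H - h + 1, step_y):
            for left in range(0, W - w + 1, step_x):
                crops.append(image.crop((left, top, left + w, top + h)))
    return crops

@torch.no_grad()
def best_crop(image: Image.Image, question: str) -> Image.Image:
    crops = candidate_crops(image)
    inputs = processor(text=[question], images=crops, return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image  # shape: (num_crops, 1)
    return crops[int(logits.squeeze(1).argmax())]
```

The selected crop (or the crop alongside the original image) would then be passed to the MLLM for answering; yolo-CROP and sam-CROP differ mainly in how the candidate regions are proposed.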
In addition to these externally guided techniques, native methods such as grad-CROP and att-CROP are employed to examine whether the models themselves can guide region localization through their inference-time attention and gradient dynamics.
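The native att-CROP idea can be sketched as follows, assuming an image-patch attention (or gradient-based relevance) map has already been extracted from the MLLM; how that map is obtained is model-specific and not shown, and `keep_frac` is an arbitrary illustrative choice.

```python
# Hedged sketch of attention-guided cropping: given a 2-D relevance map over
# image patches (assumed precomputed from the MLLM's attention or gradients),
# crop a window centered on the attention-weighted centroid.
import numpy as np
from PIL import Image

def crop_from_attention(image: Image.Image, attn_map: np.ndarray, keep_frac: float = 0.3):
    """attn_map: 2-D array over patches; keep a window covering `keep_frac` of each side."""
    H, W = attn_map.shape
    ys, xs = np.mgrid[0:H, 0:W]
    weights = attn_map / attn_map.sum()
    cy, cx = (ys * weights).sum(), (xs * weights).sum()
    # Map patch-grid coordinates back to pixel coordinates.
    px = (cx + 0.5) / W * image.width
    py = (cy + 0.5) / H * image.height
    half_w, half_h = keep_frac * image.width / 2, keep_frac * image.height / 2
    left = int(np.clip(px - half_w, 0, image.width - 2 * half_w))
    top = int(np.clip(py - half_h, 0, image.height - 2 * half_h))
    return image.crop((left, top, int(left + 2 * half_w), int(top + 2 * half_h)))
```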
Results and Implications
The empirical results underscore the merit of visual cropping in recovering the performance lost to poor small-detail perception in MLLMs. While human-guided cropping unsurprisingly yields the largest accuracy improvements (up to 45.35% on TextVQA), the att-CROP method comes closest to human performance by harnessing the model's own attention flows, demonstrating that MLLMs have an intrinsic ability to approximate where their focus should lie.
Moreover, the findings reveal that the proposed cropping methods enhance performance on questions requiring detail-oriented reasoning without sacrificing global reasoning on other tasks, such as counting and general localization. This versatility suggests that dynamic cropping could become a key component of future MLLM architectures and inference pipelines, highlighting a promising direction for improving VQA systems.
Future Directions
This research opens several avenues for further inquiry:
- It raises the question of which architectural and training biases lead to impaired perception of small visual elements.
- Expanding similar investigations across diverse MLLMs would help validate the generalizability of the proposed interventions.
- It encourages developing new strategies that more closely match the efficacy of human cropping, perhaps through model architectures that can adapt their focus to task demands without relying on external models.
In conclusion, the paper offers both a diagnostic insight into a crucial limitation of current MLLMs on VQA tasks and practical interventions that significantly alleviate this issue through targeted visual cropping strategies. This work is a meaningful step toward more perceptually precise AI systems that can automate and augment real-world vision-related applications, ultimately advancing how AI models interact with multimodal data.