Analysis of Vision LLMs on Simple Visual Tasks
This paper presents a diagnostic evaluation of state-of-the-art vision language models (VLMs) using a series of simple visual tasks designed to examine their capacity to precisely perceive and interpret basic geometric primitives. The authors evaluate notable models from the GPT, Gemini, and Claude families (including the Claude Sonnet variants, referred to below as Sonnet) across tasks that are trivial for humans but prove deceptively difficult for the VLMs.
Key Findings
- Intersections of Lines and Circles:
- Most VLMs exhibit significant difficulty in counting the intersections between two simple 2D line plots: accuracy ranges from roughly 47% to 85%, far below what such an elementary task would suggest. On detecting whether two circles overlap, the best observed accuracy is 92.78%, yet no model reaches 100%, indicating deficiencies in the models' visual acuity.
- These results suggest that the models' visual recognition is rudimentary: they struggle with tasks that should be straightforward, pointing to a fundamental limitation in fine-detail perception.
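One appeal of this task is that the ground truth is cheap to compute. As a minimal sketch (the function name and the sign-change approach are illustrative, not taken from the paper), the number of crossings between two line plots sampled at the same x positions can be counted from sign changes in the difference of their y-values:

```python
def count_intersections(ys1, ys2):
    """Count crossings between two line plots sampled at the same x positions.
    A crossing occurs where the difference in y-values changes sign;
    an exact shared point counts as one intersection."""
    diffs = [a - b for a, b in zip(ys1, ys2)]
    count = 0
    prev = None  # sign of the previous nonzero difference
    for d in diffs:
        if d == 0:
            count += 1
            prev = None  # avoid double-counting the sign flip around a touch point
            continue
        sign = d > 0
        if prev is not None and sign != prev:
            count += 1
        prev = sign
    return count
```

For example, a zigzag crossing a horizontal line three times yields a count of 3, matching the visual ground truth.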
- Circled Letter Identification:
- Despite being able to recognize individual letters and simple shapes, the models often fail to identify which letter within a word has been circled. The best accuracy is around 92.81% (Gemini); other models frequently confuse adjacent letters or misinterpret the red oval as part of the letter itself.
- This underscores a critical issue: VLM vision falters on tasks requiring precise localization within an image, potentially due to insufficient granularity in visual processing.
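The ground truth for this task is likewise mechanical: given the bounding box of each rendered letter, the circled letter is the one whose centre lies nearest the oval's centre. A minimal sketch (the names and the nearest-centre heuristic are illustrative assumptions, not the paper's method):

```python
import math

def circled_letter(word, letter_boxes, circle_center):
    """Identify which letter a red oval marks: the letter whose
    bounding-box centre lies closest to the oval's centre.
    letter_boxes: list of (x0, y0, x1, y1), one box per character of `word`."""
    def centre(box):
        x0, y0, x1, y1 = box
        return ((x0 + x1) / 2, (y0 + y1) / 2)
    dists = [math.dist(centre(b), circle_center) for b in letter_boxes]
    return word[dists.index(min(dists))]
```

The point of the sketch is that the answer is fully determined by coordinates, so any model error reflects perception rather than ambiguity in the stimulus.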
- Counting Overlapping and Nested Shapes:
- In tests involving counting overlapping shapes, such as circles or pentagons, and nested squares, VLM accuracy declines markedly as the number of shapes increases. Sonnet performs best, though far from flawlessly: it achieves 87.50% accuracy counting nested squares but only 20.83% counting overlapping pentagons.
- The models' tendency to predict the number '5' for circles suggests a bias toward the familiar Olympic-rings pattern, highlighting how familiar training data shapes model behavior.
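When generating such stimuli, the benchmark designer must verify that the drawn shapes actually overlap. A hedged sketch of the standard geometric test for circles (illustrative, not code from the paper):

```python
import math

def circles_overlap(c1, c2):
    """Two circles given as (x, y, r) overlap when the distance between
    their centres is less than the sum of their radii."""
    (x1, y1, r1), (x2, y2, r2) = c1, c2
    return math.dist((x1, y1), (x2, y2)) < r1 + r2
```

With this predicate, a generator can keep sampling centre positions until every adjacent pair of circles overlaps, so the intended count is unambiguous in the rendered image.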
- Grid Counting:
- When tasked with counting rows and columns in grids, both empty and text-containing, VLMs perform inconsistently, with Sonnet reaching up to 88.68% accuracy on text-containing grids. However, this performance drops substantially for empty grids, showcasing that the presence of textual content aids the models in maintaining spatial consistency.
- The difficulty with simply counting rows and columns further underscores the models' weakness in judging spatial arrangement in the absence of semantic content.
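The ground truth here follows directly from the grid geometry: an R x C grid is delimited by R+1 distinct horizontal lines and C+1 distinct vertical lines. A small illustrative helper (not from the paper):

```python
def grid_size(hline_ys, vline_xs):
    """Infer (rows, cols) of a grid from the coordinates of its lines.
    An R x C grid is bounded by R+1 horizontal and C+1 vertical lines;
    duplicate coordinates are collapsed before counting."""
    rows = len(set(hline_ys)) - 1
    cols = len(set(vline_xs)) - 1
    return rows, cols
```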
- Path Tracing:
- On tasks requiring the identification of single-color paths in simplified subway maps, VLMs again demonstrate a high error rate, particularly as the complexity of the maps increases (i.e., higher number of paths). Models are often off by one to three paths in their count, with Sonnet performing the best but still with considerable error.
- This indicates a profound difficulty in tracing a continuous path through an image, an essential capability for real-world applications such as reading maps and diagrams.
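One way to compute the ground truth for such maps is to treat the map as a graph with colour-labelled edges and check, colour by colour, whether a route connects the two stations. The sketch below is a simplification of my own (names and the connectivity criterion are assumptions, not the paper's exact counting rule): it counts how many line colours offer a connected route.

```python
from collections import defaultdict

def count_single_color_paths(edges, start, goal):
    """Count how many line colours connect `start` to `goal`, following
    edges of one colour at a time (as in simplified subway maps).
    edges: list of (station_a, station_b, colour) tuples."""
    by_color = defaultdict(list)
    for a, b, color in edges:
        by_color[color].append((a, b))
    reachable_colors = 0
    for color, links in by_color.items():
        # Build an undirected adjacency map for this colour only.
        adj = defaultdict(set)
        for a, b in links:
            adj[a].add(b)
            adj[b].add(a)
        # Depth-first search restricted to this colour's edges.
        seen, frontier = {start}, [start]
        while frontier:
            node = frontier.pop()
            for nxt in adj[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(nxt)
        if goal in seen:
            reachable_colors += 1
    return reachable_colors
```

For a map where a red line runs A-B-C, a blue line runs A-C directly, and a green line runs A-D, two colours connect A to C, so the ground-truth count is 2.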
Implications and Future Directions
The findings in this paper have profound implications for the development of more sophisticated VLMs:
- Enhancing Granular Visual Perception: There is a clear need to improve the granularity of visual perception in VLMs to ensure they can accurately perceive and interpret fine details. This might involve integrating more advanced vision encoders that are capable of retaining high-resolution visual information.
- Bias Mitigation: Addressing model bias, as evidenced by the tendency to default to familiar patterns such as the Olympic logo, is crucial. This can be achieved through more diverse and balanced training datasets to prevent overfitting to specific patterns.
- Testing on Synthetic Benchmarks: The authors stress the importance of synthetic benchmarks that remove background knowledge and focus purely on visual capabilities. This avoids data leakage issues and more accurately reflects a model's intrinsic ability to interpret visual information.
- Early Fusion Techniques: Given the limitations identified in late-fusion approaches, exploring early-fusion techniques wherein visual and textual information are integrated at an earlier stage in the model architecture may yield better results in tasks requiring precise visual understanding.
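A virtue of the synthetic benchmarks advocated above is that every sample carries an exact, programmatically generated label. A toy illustration for the nested-squares task (the function name, parameters, and shrink heuristic are my assumptions, not the paper's generator):

```python
import random

def nested_squares_spec(n, seed=0, shrink=0.75):
    """Generate a specification for `n` nested squares with a known answer.
    Returns side lengths from outermost to innermost; the ground-truth
    label is simply `n`, requiring no background knowledge to verify.
    Seeding makes each sample exactly reproducible."""
    rng = random.Random(seed)
    side = rng.uniform(80, 100)  # outermost square's side length
    sides = []
    for _ in range(n):
        sides.append(round(side, 2))
        side *= shrink * rng.uniform(0.9, 1.0)  # each inner square strictly smaller
    return {"label": n, "sides": sides}
```

Because the label is fixed by construction, any discrepancy between model output and label is attributable to perception, not annotation noise or memorized data.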
Conclusion
In conclusion, the paper effectively highlights the fundamental limitations in the visual acuity of current VLMs through systematically designed low-level visual tasks. The observed deficiencies point to an essential area for future research, aiming to develop VLMs that can process and understand visual data with the same accuracy and granularity as human vision, thereby enhancing their applicability across a broader range of real-world tasks.