
Vision language models are blind: Failing to translate detailed visual features into words

Published 9 Jul 2024 in cs.AI and cs.CV | (arXiv:2407.06581v6)

Abstract: While large language models with vision capabilities (VLMs), e.g., GPT-4o and Gemini 1.5 Pro, score high on many vision-understanding benchmarks, they are still struggling with low-level vision tasks that are easy to humans. Specifically, on BlindTest, our suite of 7 very simple tasks, including identifying (a) whether two circles overlap; (b) how many times two lines intersect; (c) which letter is being circled in a word; and (d) the number of circles in an Olympic-like logo, four state-of-the-art VLMs are only 58.07% accurate on average. Claude 3.5 Sonnet performs the best at 77.84% accuracy, far from the human expected accuracy of 100%. Across different image resolutions and line widths, VLMs including slow-thinking models consistently struggle with those tasks that require precise spatial information when geometric primitives overlap or are close. Yet, VLMs perform at near-100% accuracy when much more space is added to separate shapes and letters. Linear probing experiments show that vision encoders contain sufficient visual information to solve BlindTest and that LLMs fail to decode this information into correct answers. Code and data are at: https://vlmsareblind.github.io

Summary

  • The paper demonstrates that vision-language models struggle with basic visual tasks, showing accuracy as low as 20.83% in complex shape counting.
  • It evaluates GPT-4o, Gemini-1.5 Pro, Claude 3 Sonnet, and Claude 3.5 Sonnet, highlighting consistent errors in interpreting intersections and circled letters.
  • The study suggests enhancements in early fusion techniques and advanced encoders to improve fine-detail recognition and reduce training biases.

Analysis of Vision Language Models on Simple Visual Tasks

The paper presents a diagnostic evaluation of state-of-the-art Vision Language Models (VLMs) on a series of simple visual tasks designed to probe their capacity to precisely perceive and interpret basic geometric primitives. The authors evaluate four notable models, GPT-4o, Gemini-1.5 Pro, Claude 3 Sonnet, and Claude 3.5 Sonnet, on tasks that are trivial for humans but prove deceptively difficult for the VLMs.

Key Findings

  1. Intersections of Lines and Circles:
    • Most VLMs have significant difficulty counting the intersections of two simple 2D line plots, with accuracy ranging from roughly 47% to 85%, far below what the task's simplicity would suggest. On detecting whether two circles overlap, the best observed accuracy is 92.78%, but no model reaches 100%, pointing to deficiencies in the models' visual acuity.
    • These results suggest that VLMs possess only a rudimentary level of visual recognition and often struggle with tasks that should be straightforward, indicating a fundamental limitation in fine-detail recognition.
  2. Circled Letter Identification:
    • Despite being able to recognize individual letters and simple shapes, the models consistently fail to identify which letter in a word is circled. The maximum accuracy, around 92.81%, is achieved by Gemini-1.5 Pro; models frequently confuse adjacent letters or misinterpret the red circle as part of the letter itself.
    • This underscores a critical issue: VLM vision falters on tasks requiring precise localization within an image, potentially due to insufficient granularity in visual processing.
  3. Counting Overlapping and Nested Shapes:
    • In tests involving counting overlapping shapes (circles or pentagons) and nested squares, VLMs show a marked decline in accuracy as the number of shapes increases. Claude 3.5 Sonnet performs best, albeit still imperfectly: it reaches 87.50% accuracy in counting nested squares but only 20.83% in counting overlapping pentagons.
    • The models' tendency to predict the number '5' for overlapping circles suggests a bias towards the familiar Olympic-logo pattern, highlighting the influence of training data on their predictions.
  4. Grid Counting:
    • When tasked with counting the rows and columns of grids, both empty and text-containing, VLMs perform inconsistently, with Claude 3.5 Sonnet reaching up to 88.68% accuracy on text-containing grids. Performance drops substantially on empty grids, suggesting that the presence of textual content helps the models maintain spatial consistency.
    • The difficulty observed in simply counting the rows and columns further elucidates the models' shortcomings in understanding spatial arrangement devoid of semantic content.
  5. Path Tracing:
    • On tasks requiring the identification of single-color paths in simplified subway maps, VLMs again show a high error rate, which grows with map complexity (i.e., more paths). Models are often off by one to three paths in their count, with Claude 3.5 Sonnet performing best but still making considerable errors.
    • This indicates a profound challenge in following and interpreting lines that necessitate tracing a continuous path—an essential functionality for real-world applications.
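Each BlindTest task has exact geometric ground truth, which is what makes the benchmark easy to score automatically. As an illustrative sketch (not the authors' released code), the ground truth for the circle-overlap and line-intersection tasks can be computed with elementary geometry: two circles overlap when the distance between their centers is at most the sum of their radii, and intersections between two polylines can be counted with a standard segment-crossing test.

```python
import math

def circles_overlap(c1, r1, c2, r2):
    """Two circles overlap iff the distance between their centers
    is at most the sum of their radii (touching counts as overlap here)."""
    return math.dist(c1, c2) <= r1 + r2

def _orient(p, q, r):
    """Signed area of triangle (p, q, r); sign gives turn direction."""
    return (q[0] - p[0]) * (r[1] - p[1]) - (q[1] - p[1]) * (r[0] - p[0])

def segments_cross(a1, a2, b1, b2):
    """Proper crossing test for two segments (assumes general position,
    i.e., no collinear or endpoint-touching cases)."""
    d1 = _orient(b1, b2, a1)
    d2 = _orient(b1, b2, a2)
    d3 = _orient(a1, a2, b1)
    d4 = _orient(a1, a2, b2)
    return d1 * d2 < 0 and d3 * d4 < 0

def count_intersections(line_a, line_b):
    """Count crossings between two polylines given as lists of (x, y) points."""
    count = 0
    for i in range(len(line_a) - 1):
        for j in range(len(line_b) - 1):
            if segments_cross(line_a[i], line_a[i + 1],
                              line_b[j], line_b[j + 1]):
                count += 1
    return count

# Two diagonal segments crossing once at (1, 1):
print(count_intersections([(0, 0), (2, 2)], [(0, 2), (2, 0)]))  # 1
```

The same predicates can drive stimulus generation: sample parameters, compute the ground-truth answer, render the image, and compare against the model's reply.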

Implications and Future Directions

The findings in this paper have significant implications for the development of more capable VLMs:

  • Enhancing Granular Visual Perception: There is a clear need to improve the granularity of visual perception in VLMs to ensure they can accurately perceive and interpret fine details. This might involve integrating more advanced vision encoders that are capable of retaining high-resolution visual information.
  • Bias Mitigation: Addressing model bias, as evidenced by the tendency to default to familiar patterns such as the Olympic logo, is crucial. This can be achieved through more diverse and balanced training datasets to prevent overfitting to specific patterns.
  • Testing on Synthetic Benchmarks: The authors stress the importance of synthetic benchmarks that remove background knowledge and focus purely on visual capabilities. This avoids data leakage issues and more accurately reflects a model's intrinsic ability to interpret visual information.
  • Early Fusion Techniques: Given the limitations identified in late-fusion approaches, exploring early-fusion techniques wherein visual and textual information are integrated at an earlier stage in the model architecture may yield better results in tasks requiring precise visual understanding.
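The late-vs-early fusion contrast can be sketched in a few lines. The toy NumPy example below is a sketch under loose assumptions, not the architecture of any specific model: the dimensions are arbitrary, and a mean-pooling step stands in for the compression that late-fusion pipelines often apply to visual features before handing them to the LLM. The point it illustrates is that late fusion collapses per-patch spatial detail before the language model ever sees it, while early fusion keeps every patch in the joint token sequence.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative only).
num_patches, d_vision, d_model = 16, 32, 64

patches = rng.normal(size=(num_patches, d_vision))  # vision-encoder patch features
text_tokens = rng.normal(size=(8, d_model))         # embedded text tokens
w_proj = rng.normal(size=(d_vision, d_model))       # vision-to-LLM projection

def late_fusion(patches, text_tokens, w_proj):
    """Late fusion (toy): visual features are pooled/compressed first,
    so per-patch spatial detail is lost before reaching the LLM."""
    pooled = patches.mean(axis=0, keepdims=True)    # spatial detail collapses here
    visual_tokens = pooled @ w_proj                 # (1, d_model)
    return np.concatenate([visual_tokens, text_tokens], axis=0)

def early_fusion(patches, text_tokens, w_proj):
    """Early fusion (toy): every patch is projected and placed in the
    joint token sequence, preserving per-patch spatial information."""
    visual_tokens = patches @ w_proj                # (num_patches, d_model)
    return np.concatenate([visual_tokens, text_tokens], axis=0)

print(late_fusion(patches, text_tokens, w_proj).shape)   # (9, 64)
print(early_fusion(patches, text_tokens, w_proj).shape)  # (24, 64)
```

In the early-fusion sequence the joint model can still attend to individual patches, which is the kind of fine-grained spatial access the paper's linear-probing results suggest current VLM decoders fail to exploit.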

Conclusion

In conclusion, the paper effectively highlights the fundamental limitations in the visual acuity of current VLMs through systematically designed low-level visual tasks. The observed deficiencies point to an essential area for future research, aiming to develop VLMs that can process and understand visual data with the same accuracy and granularity as human vision, thereby enhancing their applicability across a broader range of real-world tasks.
