VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information (2412.00947v2)

Published 1 Dec 2024 in cs.CL and cs.CV

Abstract: Large Vision Language Models (LVLMs) have achieved remarkable performance in various vision-language tasks. However, it is still unclear how accurately LVLMs can perceive visual information in images. In particular, the capability of LVLMs to perceive geometric information, such as shape, angle, and size, remains insufficiently analyzed, although the perception of these properties is crucial for tasks that require a detailed visual understanding. In this work, we introduce VisOnlyQA, a dataset for evaluating the geometric perception of LVLMs, and reveal that LVLMs often cannot accurately perceive basic geometric information in images, while human performance is nearly perfect. VisOnlyQA consists of 12 tasks that directly ask about geometric information in geometric shapes, charts, chemical structures, and 3D shapes. Our experiments highlight the following findings: (i) State-of-the-art LVLMs struggle with basic geometric perception -- 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, work poorly on VisOnlyQA. (ii) Additional training data does not resolve this issue -- fine-tuning on the training set of VisOnlyQA is not always effective, even for in-distribution tasks. (iii) Bottleneck in the architecture -- LVLMs using stronger LLMs exhibit better geometric perception on VisOnlyQA, while it does not require complex reasoning, suggesting that the way LVLMs process information from visual encoders is a bottleneck. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.

Summary

  • The paper reveals a significant performance gap between LVLMs and humans, with models achieving 51-54% accuracy against 93.5% human accuracy.
  • The paper finds that fine-tuning on 70k synthetic instances leads to inconsistent improvements, highlighting persistent challenges in visual processing.
  • The paper highlights that larger language model architectures enhance visual data interpretation, emphasizing the need for integrated vision-language design.

Analysis of Visual Perception Capabilities in Large Vision Language Models: Insights from VisOnlyQA

This paper addresses a fundamental issue in the application of Large Vision Language Models (LVLMs): their propensity for visual perception errors when interpreting geometric and numerical information in images. Despite iterative improvements in LVLM architectures, as evidenced by models such as GPT-4o and Gemini 1.5 Pro, the research reveals a persistent gap between their visual information processing and human performance.

The authors introduce VisOnlyQA, a novel dataset engineered to evaluate the visual perception capabilities of LVLMs independently of their reasoning and knowledge-based abilities. VisOnlyQA comprises 1,200 carefully curated questions organized into twelve tasks spanning four figure types: geometric shapes, chemical structures, charts, and 3D shapes. The authors additionally supply a synthetic training set of 70k instances to support fine-tuning experiments and LVLM development.
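
For readers who want to inspect the benchmark directly, the following sketch shows one way the evaluation data could be loaded and examined in Python. The dataset identifier, split name, and field names are assumptions made for illustration, not taken from the paper; the authoritative loading instructions are in the GitHub repository linked in the abstract.

```python
# Minimal sketch of loading and inspecting VisOnlyQA-style evaluation data.
# NOTE: the dataset id, split, and column names are illustrative assumptions;
# see https://github.com/psunlpgroup/VisOnlyQA for the actual instructions.
from datasets import load_dataset

eval_data = load_dataset("psunlpgroup/VisOnlyQA-Eval", split="test")  # hypothetical id

example = eval_data[0]
print(example["task"])      # one of the twelve tasks (geometry, charts, chemistry, 3D)
print(example["question"])  # question about geometric information in the figure
print(example["options"])   # multiple-choice or true/false options
print(example["answer"])    # gold label
example["image"].save("figure.png")  # the figure the question refers to (PIL image)
```

Because each question asks only about information visible in the figure, an evaluation loop just needs to render the image, pose the question with its options, and compare the model's choice against the gold label.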

Key Findings

  1. Performance Discrepancy: The evaluation reveals a significant performance gap between state-of-the-art LVLMs and human respondents on VisOnlyQA. Even advanced models like GPT-4o and Gemini 1.5 Pro reach only 51.4% and 54.2% accuracy, respectively, well below the 93.5% human accuracy (a minimal scoring sketch follows this list).
  2. Limited Improvements from Fine-tuning: While fine-tuning LVLMs on synthetic data shows potential in enhancing visual perception for specific models and tasks, the improvements are neither consistent nor universal. Certain tasks and models benefited more from fine-tuning, suggesting that dataset-specific training can aid but not fully resolve existing deficiencies.
  3. Influence of the LLM: The language model inside an LVLM significantly impacts its visual processing capabilities. Models built on larger or stronger LLM backbones perform better, indicating that the LLM plays a substantial role in interpreting the information produced by the visual encoder.
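
To make the accuracy figures in finding 1 concrete, the snippet below sketches how scores on a multiple-choice benchmark such as VisOnlyQA are typically computed: extract the predicted option from each free-form model response, compare it with the gold label, and aggregate per task. The response parsing and record schema are illustrative assumptions, not the paper's evaluation code.

```python
import re
from collections import defaultdict

def extract_choice(response: str) -> str | None:
    """Pull the first option letter (A-E) or True/False out of a free-form response."""
    match = re.search(r"\b([A-E]|True|False)\b", response, flags=re.IGNORECASE)
    return match.group(1).capitalize() if match else None

def accuracy_by_task(records) -> dict[str, float]:
    """records: dicts with 'task', 'response', and 'answer' keys (assumed schema)."""
    correct, total = defaultdict(int), defaultdict(int)
    for r in records:
        total[r["task"]] += 1
        if extract_choice(r["response"]) == r["answer"]:
            correct[r["task"]] += 1
    return {task: correct[task] / total[task] for task in total}

# Toy example; the roughly 40-point gap reported in the paper comes from comparing
# model accuracies of about 51-54% with the 93.5% human baseline on the real data.
records = [
    {"task": "Geometry-Angle", "response": "The answer is B.", "answer": "B"},
    {"task": "Geometry-Angle", "response": "True, the two lines intersect.", "answer": "False"},
]
print(accuracy_by_task(records))  # {'Geometry-Angle': 0.5}
```

A benchmark-level score can then be obtained by aggregating these per-task accuracies; the paper contrasts such aggregate model scores with the near-perfect human baseline.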

Implications for Future Research

The outcomes from the VisOnlyQA dataset highlight the need for targeted advancements in both dataset construction and model architecture to address visual perception challenges in LVLMs. For practitioners and researchers in AI, these findings underscore the need to refine training paradigms, for example by incorporating more diverse and comprehensive visual examples, and to rethink model architectures. Moreover, it is clear that simply scaling up model parameters or fine-tuning on synthetic datasets is not sufficient; fundamental changes in how visual data is encoded and processed are essential.

Future Directions

The paper suggests several plausible directions for future research. First, expanding the dataset to include even more varied scientific figures may better expose weaknesses in LVLMs, prompting further optimization. Second, exploring novel model architectures that inherently fuse language and vision modalities more effectively could provide substantial improvements. Lastly, deeper analysis of how visual and language components interact within these models could lead to groundbreaking insights into the design of LVLMs with enhanced visual perception capabilities.

In conclusion, VisOnlyQA represents a significant step toward understanding and resolving the challenges faced by LVLMs in visual perception tasks. This research provides a focal point for future studies aiming to bridge the performance gap between AI and human visual understanding, ultimately contributing to the development of more robust and capable vision-language systems.