
VisOnlyQA: Large Vision Language Models Still Struggle with Visual Perception of Geometric Information (2412.00947v1)

Published 1 Dec 2024 in cs.CL and cs.CV

Abstract: Errors in understanding visual information in images (i.e., visual perception errors) remain a major source of mistakes in Large Vision Language Models (LVLMs). While further analysis is essential, there is a deficiency in datasets for evaluating the visual perception of LVLMs. In this work, we introduce VisOnlyQA, a new dataset designed to directly evaluate the visual perception capabilities of LVLMs on questions about geometric and numerical information in scientific figures. Our dataset enables us to analyze the visual perception of LVLMs for fine-grained visual information, independent of other capabilities such as reasoning. The evaluation set of VisOnlyQA includes 1,200 multiple-choice questions in 12 tasks on four categories of figures. We also provide synthetic training data consisting of 70k instances. Our experiments on VisOnlyQA highlight the following findings: (i) 20 LVLMs we evaluate, including GPT-4o and Gemini 1.5 Pro, work poorly on the visual perception tasks in VisOnlyQA, while human performance is nearly perfect. (ii) Fine-tuning on synthetic training data demonstrates the potential for enhancing the visual perception of LVLMs, but observed improvements are limited to certain tasks and specific models. (iii) Stronger language models improve the visual perception of LVLMs. In summary, our experiments suggest that both training data and model architectures should be improved to enhance the visual perception capabilities of LVLMs. The datasets, code, and model responses are provided at https://github.com/psunlpgroup/VisOnlyQA.

Authors (5)
  1. Ryo Kamoi (14 papers)
  2. Yusen Zhang (30 papers)
  3. Sarkar Snigdha Sarathi Das (17 papers)
  4. Ranran Haoran Zhang (10 papers)
  5. Rui Zhang (1138 papers)
Citations (1)

Summary

Analysis of Visual Perception Capabilities in Large Vision Language Models: Insights from VisOnlyQA

This paper addresses a fundamental issue in the application of Large Vision Language Models (LVLMs): their propensity for visual perception errors when interpreting geometric and numerical information in images. Despite iterative improvements in LVLM architectures, exemplified by models such as GPT-4o and Gemini 1.5 Pro, the research reveals a persistent gap between their visual information processing and human performance.

The authors introduce VisOnlyQA, a dataset engineered to evaluate the visual perception capabilities of LVLMs independently of their reasoning and knowledge-based abilities. VisOnlyQA comprises 1,200 multiple-choice questions organized into twelve tasks spanning four categories of figures: geometric shapes, chemical structures, charts, and 3D shapes. The authors also supply synthetic training data of 70k instances to support LVLM development.
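The released evaluation set can be inspected programmatically. The following is a minimal sketch, assuming the questions are exported as a JSON Lines file with hypothetical fields `category`, `task`, `question`, `options`, and `answer`; the actual schema and file names in the GitHub repository may differ.

```python
import json
from collections import Counter

def load_visonlyqa(path: str) -> list[dict]:
    """Load VisOnlyQA-style multiple-choice questions from a JSON Lines file.

    The field names (`category`, `task`, `question`, `options`, `answer`) are
    assumptions for illustration; check the released data for the real schema.
    """
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f]

if __name__ == "__main__":
    examples = load_visonlyqa("visonlyqa_eval.jsonl")  # hypothetical file name
    print(f"{len(examples)} questions loaded")
    # Count questions per task to check the twelve-task structure described above.
    print(Counter(ex["task"] for ex in examples))
```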

Key Findings

  1. Performance Discrepancy: There is a significant performance gap between state-of-the-art LVLMs and human respondents on VisOnlyQA. Even advanced models such as GPT-4o and Gemini 1.5 Pro reach only 51.4% and 54.2% accuracy, respectively, far below the human accuracy of 93.5% (a scoring sketch follows this list).
  2. Limited Improvements from Fine-tuning: Fine-tuning LVLMs on the synthetic training data can enhance visual perception for specific models and tasks, but the improvements are neither consistent nor universal. Some tasks and models benefit more than others, suggesting that dataset-specific training helps but does not fully resolve the underlying deficiencies.
  3. Influence of LLMs: The language model within an LVLM significantly affects its visual processing capabilities. Models built on stronger LLMs perform better, indicating that the language model contributes substantially to interpreting the encoded visual information.
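
To make the reported gap concrete, the sketch below shows one straightforward way to score multiple-choice predictions per task and overall. The `examples` records and the `model_answers` mapping are hypothetical inputs for illustration, not the authors' evaluation code.

```python
from collections import defaultdict

def per_task_accuracy(examples: list[dict], model_answers: dict[int, str]) -> dict[str, float]:
    """Score multiple-choice predictions per task.

    `examples` are records with assumed `task` and `answer` fields;
    `model_answers` maps an example index to the predicted option label.
    """
    correct: dict[str, int] = defaultdict(int)
    total: dict[str, int] = defaultdict(int)
    for i, ex in enumerate(examples):
        total[ex["task"]] += 1
        if model_answers.get(i) == ex["answer"]:
            correct[ex["task"]] += 1
    return {task: correct[task] / total[task] for task in total}

def overall_accuracy(examples: list[dict], model_answers: dict[int, str]) -> float:
    """Fraction of all questions answered correctly; the kind of aggregate behind
    comparisons such as 51.4% (GPT-4o) versus 93.5% (human)."""
    hits = sum(model_answers.get(i) == ex["answer"] for i, ex in enumerate(examples))
    return hits / len(examples)
```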

Implications for Future Research

The outcomes from the VisOnlyQA dataset highlight the need for targeted advances in both dataset construction and model architecture to address visual perception challenges in LVLMs. For practitioners and researchers, these findings underscore the need to refine training paradigms, incorporating more diverse and comprehensive visual examples, and to rethink model architectures. Moreover, simply scaling up model parameters or fine-tuning on synthetic datasets is not sufficient; fundamental changes in how visual information is encoded and processed are needed.

Future Directions

The paper suggests several plausible directions for future research. First, expanding the dataset to include more varied scientific figures may better expose weaknesses in LVLMs and guide further optimization. Second, exploring model architectures that fuse the language and vision modalities more effectively could yield substantial improvements. Lastly, a deeper analysis of how the vision and language components interact within these models could inform the design of LVLMs with stronger visual perception.

In conclusion, VisOnlyQA represents a significant step toward understanding and resolving the challenges faced by LVLMs in visual perception tasks. This research provides a focal point for future studies aiming to bridge the performance gap between AI and human visual understanding, ultimately contributing to the development of more robust and capable vision-language systems.