ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models (2505.13444v1)

Published 19 May 2025 in cs.CL and cs.CV

Abstract: Chart understanding presents a unique challenge for large vision-language models (LVLMs), as it requires the integration of sophisticated textual and visual reasoning capabilities. However, current LVLMs exhibit a notable imbalance between these skills, falling short on visual reasoning that is difficult to perform in text. We conduct a case study using a synthetic dataset solvable only through visual reasoning and show that model performance degrades significantly with increasing visual complexity, while human performance remains robust. We then introduce ChartMuseum, a new Chart Question Answering (QA) benchmark containing 1,162 expert-annotated questions spanning multiple reasoning types, curated from real-world charts across 184 sources, specifically built to evaluate complex visual and textual reasoning. Unlike prior chart understanding benchmarks -- where frontier models perform similarly and near saturation -- our benchmark exposes a substantial gap between model and human performance, while effectively differentiating model capabilities: although humans achieve 93% accuracy, the best-performing model Gemini-2.5-Pro attains only 63.0%, and the leading open-source LVLM Qwen2.5-VL-72B-Instruct achieves only 38.5%. Moreover, on questions requiring primarily visual reasoning, all models experience a 35%-55% performance drop from text-reasoning-heavy question performance. Lastly, our qualitative error analysis reveals specific categories of visual reasoning that are challenging for current LVLMs.

Overview of ChartMuseum: Visual Reasoning in Large Vision-Language Models

The paper, "ChartMuseum: Testing Visual Reasoning Capabilities of Large Vision-Language Models," presents the development and evaluation of a new benchmark for assessing the visual reasoning capabilities of large vision-language models (LVLMs). Its primary focus is the gap between textual and visual reasoning skills within these models, highlighting that visual reasoning remains a challenging area for LVLMs.

ChartMuseum is introduced as a benchmark specifically curated to evaluate complex visual and textual reasoning over a diverse set of charts and questions. The benchmark comprises 1,162 expert-annotated questions over real-world charts drawn from 184 sources. Importantly, all questions in ChartMuseum were created by human annotators without machine assistance, ensuring high-quality, realistic assessments. The benchmark reveals that even the best-performing models fall significantly short of human-level accuracy: humans achieve 93% accuracy, while the leading model, Gemini-2.5-Pro, reaches only 63.0%, and the best open-source LVLM, Qwen2.5-VL-72B-Instruct, only 38.5%.
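
To make the evaluation setup concrete, the sketch below shows one way a benchmark of this form could be scored: compare a model's short answer against the expert annotation with a lenient exact-match check and report accuracy. This is a minimal illustration, not the paper's official pipeline; the example fields ("image_path", "question", "answer"), the model_answer callable, and the normalization rule are all assumptions.

# Minimal evaluation sketch, not the paper's official ChartMuseum pipeline.
# The example fields ("image_path", "question", "answer"), the model_answer
# callable, and the lenient normalized exact-match scoring are assumptions.
from typing import Callable, Iterable

def normalize(text: str) -> str:
    # Lowercase and keep only alphanumerics/spaces for lenient matching.
    return "".join(ch for ch in text.lower().strip() if ch.isalnum() or ch.isspace())

def accuracy(examples: Iterable[dict], model_answer: Callable[[str, str], str]) -> float:
    # Fraction of questions where the model's answer matches the annotation.
    correct = total = 0
    for ex in examples:
        pred = model_answer(ex["image_path"], ex["question"])
        correct += int(normalize(pred) == normalize(ex["answer"]))
        total += 1
    return correct / total if total else 0.0

if __name__ == "__main__":
    # Tiny inline stand-in for the real question set.
    examples = [
        {"image_path": "chart_001.png", "question": "Which region peaks last?", "answer": "Europe"},
        {"image_path": "chart_002.png", "question": "How many bars exceed 50?", "answer": "3"},
    ]
    dummy_model = lambda image_path, question: "Europe"  # stand-in for a real LVLM call
    print(f"accuracy: {accuracy(examples, dummy_model):.3f}")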

Key Findings and Contributions

  • Visual Reasoning Limitation: The paper identifies a major performance drop in LVLMs on questions requiring visual reasoning compared to those answerable mainly through textual reasoning; GPT-4.1, for example, shows roughly a 50% absolute accuracy decline from textual to visual questions (see the breakdown sketch after this list). This underlines a persistent area for improvement in LVLM development.
  • ChartMuseum Benchmark: The benchmark sets itself apart by exposing performance disparities across models and emphasizing the difficulty in visual reasoning. Its design prevents the saturation seen in previous benchmarks, offering a more rigorous evaluation of model capabilities.
  • Synthetic Dataset Experiment: Through experiments with synthetic charts requiring visual reasoning, the paper demonstrates model performance degradation as chart complexity increases, contrasting with consistent human accuracy. This reinforces the need for benchmarks testing beyond basic text extraction tasks.
  • Error Analysis: The error analysis categorizes common shortcomings in model visual reasoning tasks, such as symbol selection, visual comparison, trajectory tracking, and x/y value identification. Models tend to over-rely on textual strategies even when visual reasoning is more efficient.
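
The textual-versus-visual gap noted in the first bullet can be illustrated with a small per-category breakdown like the one below. It assumes each scored item records the question's primary reasoning type and whether the model answered correctly; the field names and the two-way textual/visual split are illustrative simplifications of the paper's finer-grained categories.

# Hedged sketch of a per-reasoning-type accuracy breakdown; field names and
# the two-way textual/visual split are assumptions, not the paper's schema.
from collections import defaultdict

def accuracy_by_type(results):
    # Map reasoning type -> accuracy over all results of that type.
    buckets = defaultdict(list)
    for r in results:
        buckets[r["reasoning_type"]].append(int(r["correct"]))
    return {t: sum(v) / len(v) for t, v in buckets.items()}

# Toy results standing in for per-question model scores.
results = [
    {"reasoning_type": "textual", "correct": True},
    {"reasoning_type": "textual", "correct": True},
    {"reasoning_type": "visual", "correct": False},
    {"reasoning_type": "visual", "correct": True},
]
acc = accuracy_by_type(results)
drop = acc["textual"] - acc["visual"]
print(f"textual: {acc['textual']:.2f}  visual: {acc['visual']:.2f}  drop: {drop:.2f}")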

Implications and Future Directions

This research has implications for future LVLM design, especially for applications requiring robust visual reasoning capabilities. The large performance gap between models and humans suggests a need for improved vision encoders and alignment of visual features within models. Insights drawn from ChartMuseum could inform more effective training paradigms and the development of specialized architectures to enhance visual reasoning.

In conclusion, ChartMuseum sets a new standard for evaluating LVLMs by focusing on the nuanced challenges of visual reasoning. The benchmark can guide future advancements in multimodal AI, driving research to develop models that approach human-level performance in both textual and visual reasoning tasks.

Authors (15)
  1. Liyan Tang (12 papers)
  2. Grace Kim (5 papers)
  3. Xinyu Zhao (54 papers)
  4. Thom Lake (5 papers)
  5. Wenxuan Ding (14 papers)
  6. Fangcong Yin (8 papers)
  7. Prasann Singhal (7 papers)
  8. Manya Wadhwa (8 papers)
  9. Zeyu Leo Liu (5 papers)
  10. Zayne Sprague (10 papers)
  11. Ramya Namuduri (2 papers)
  12. Bodun Hu (6 papers)
  13. Juan Diego Rodriguez (12 papers)
  14. Puyuan Peng (21 papers)
  15. Greg Durrett (117 papers)