Evaluation of Chart Understanding Capabilities in Multimodal LLMs
The paper "ChCharting Gaps in Realistic Chart Understanding in Multimodal LLMs" presents a thorough empirical paper of the chart comprehension capabilities of Multimodal LLMs (MLLMs). The researchers introduce CharXiv, a benchmark specifically designed to address the limitations seen in earlier benchmarks that assess chart understanding through homogeneous, template-driven questions.
The CharXiv benchmark comprises 2,323 real-world charts drawn from a diverse array of academic subjects published on arXiv. Each chart is paired with descriptive and reasoning questions that require nuanced interpretation of complex visual elements and the associated numerical data, challenging existing MLLMs to demonstrate genuine reasoning capabilities.
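To make the benchmark's structure concrete, the sketch below shows how one might represent a single CharXiv-style example and score a model against it. This is a minimal sketch under assumptions: the field names, the `model.answer` interface, and the exact-match scoring are illustrative and not the benchmark's actual schema or evaluation protocol.

```python
from dataclasses import dataclass

@dataclass
class CharXivExample:
    """Hypothetical record layout for one benchmarked chart (field names are illustrative)."""
    chart_image_path: str               # rendered figure extracted from an arXiv paper
    subject: str                        # arXiv subject area the figure came from
    descriptive_questions: list[str]    # questions about visible elements (titles, labels, ticks)
    reasoning_question: str             # question requiring synthesis of visual and numerical info
    answers: dict[str, str]             # gold answers keyed by question text

def evaluate(model, examples: list[CharXivExample]) -> dict[str, float]:
    """Toy accuracy loop: counts exact matches between model output and gold answers."""
    desc_correct = desc_total = reas_correct = 0
    for ex in examples:
        for q in ex.descriptive_questions:
            desc_total += 1
            if model.answer(ex.chart_image_path, q).strip() == ex.answers[q].strip():
                desc_correct += 1
        pred = model.answer(ex.chart_image_path, ex.reasoning_question).strip()
        if pred == ex.answers[ex.reasoning_question].strip():
            reas_correct += 1
    return {
        "descriptive_acc": desc_correct / max(desc_total, 1),
        "reasoning_acc": reas_correct / max(len(examples), 1),
    }
```

In practice the paper's evaluation is more forgiving than raw string matching (e.g., it tolerates formatting differences in numerical answers), so the exact-match check above should be read only as a placeholder for the scoring step.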
Key Findings
- Benchmark Design and Purpose: CharXiv is curated with meticulous attention to test diverse chart-related understanding, using questions that go beyond superficial attributes to genuinely probe MLLMs' reasoning abilities. It aims to provide a more representative measure of multimodal model capabilities than earlier evaluation datasets such as FigureQA and DVQA.
- Evaluation Results: The results demonstrate marked performance gaps between strong proprietary models and their open-source counterparts. The proprietary GPT-4o model leads with 47.1% accuracy on reasoning questions, yet still falls far short of human performance, which stands at 80.5%. The best-performing open-source model (InternVL Chat V1.5) achieves only 29.2% accuracy. These figures show that while proprietary models display superior chart understanding, all models remain significantly deficient relative to humans.
- Response Analysis: Detailed analysis reveals that descriptive capabilities often underpin reasoning success; models with high accuracy on descriptive tasks tend to perform better on reasoning tasks (a simple way to quantify this is sketched after this list). This correlation suggests that foundational visual understanding is key to complex reasoning in MLLMs.
- Robustness to Subplot Complexity: Further analysis shows that the accuracy of both proprietary and open-source models deteriorates significantly on charts with many subplots, indicating a weakness in handling compositional layouts.
- Sensitivity to Simple Changes: The paper highlights marked reductions in model accuracies when subjected to changes in evaluation components, such as perturbations in question phrasing or chart elements, underscoring the limited robustness of current MLLMs.
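As a concrete illustration of the response analysis above, the snippet below computes the Pearson correlation between per-model descriptive and reasoning accuracies. The scores are placeholder values chosen for illustration, not figures reported in the paper; only the shape of the computation is the point.

```python
from statistics import correlation  # Pearson correlation (Python 3.10+)

# Placeholder per-model accuracies (not from the paper) to illustrate the analysis.
descriptive_acc = {"model_a": 0.84, "model_b": 0.58, "model_c": 0.45, "model_d": 0.31}
reasoning_acc   = {"model_a": 0.47, "model_b": 0.29, "model_c": 0.21, "model_d": 0.12}

models = sorted(descriptive_acc)
r = correlation([descriptive_acc[m] for m in models],
                [reasoning_acc[m] for m in models])
print(f"Pearson r between descriptive and reasoning accuracy: {r:.2f}")
```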
Implications and Future Directions
The research implies that while MLLMs are becoming proficient at handling structured data in controlled settings, their ability to generalize and perform complex reasoning on diverse, real-world data remains limited. CharXiv facilitates progress in this domain by exposing these weaknesses, pointing to concrete targets for future improvements in MLLM design and training.
This paper encourages researchers to enhance the robustness of MLLMs through innovative training methodologies, which could include diverse and domain-spanning data augmentation strategies or the cultivation of cross-domain reasoning abilities via multi-task learning frameworks. Moreover, the incorporation of fine-grained, explanation-driven evaluation methods, such as Chain-of-Thought prompting, may aid in analyzing and improving reasoning paths in model predictions.
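As one possible direction, a minimal sketch of explanation-driven (Chain-of-Thought-style) evaluation is shown below. The prompt template and the `vlm_client` interface are illustrative assumptions standing in for whichever multimodal API is used; they are not part of the paper's methodology.

```python
# Hypothetical chain-of-thought prompt for chart reasoning; asking the model to
# surface its intermediate reading of the chart makes reasoning paths inspectable.
COT_TEMPLATE = (
    "You are given a chart image. Answer the question step by step.\n"
    "1. Describe the relevant axes, legends, and subplots.\n"
    "2. Read off the values needed to answer the question.\n"
    "3. Reason about those values and state the final answer on a line "
    "beginning with 'Final answer:'.\n\n"
    "Question: {question}"
)

def ask_with_cot(vlm_client, image_path: str, question: str) -> tuple[str, str]:
    """Return (full reasoning trace, extracted final answer) for later analysis."""
    trace = vlm_client.generate(image=image_path,
                                prompt=COT_TEMPLATE.format(question=question))
    final = next((line.removeprefix("Final answer:").strip()
                  for line in trace.splitlines()
                  if line.startswith("Final answer:")), trace.strip())
    return trace, final
```

Keeping the full trace alongside the extracted answer allows error analysis to distinguish perception failures (misread values in step 2) from reasoning failures (incorrect synthesis in step 3).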
In conclusion, the CharXiv benchmark establishes a foundational step toward realistic chart understanding in the domain of MLLMs. Its dataset and evaluation metrics provide critical insights into the existing limitations of such models, paving the way for future research aiming to close the gap between machine and human chart reasoning performance.