CharTool: Tool-Integrated Visual Reasoning for Chart Understanding

Published 3 Apr 2026 in cs.AI | (2604.02794v1)

Abstract: Charts are ubiquitous in scientific and financial literature for presenting structured data. However, chart reasoning remains challenging for multimodal LLMs (MLLMs) due to the lack of high-quality training data, as well as the need for fine-grained visual grounding and precise numerical computation. To address these challenges, we first propose DuoChart, a scalable dual-source data pipeline that combines synthesized charts with real-world charts to construct diverse, high-quality chart training data. We then introduce CharTool, which equips MLLMs with external tools, including image cropping for localized visual perception and code-based computation for accurate numerical reasoning. Through agentic reinforcement learning on DuoChart, CharTool learns tool-integrated reasoning grounded in chart content. Extensive experiments on six chart benchmarks show that our method consistently improves over strong MLLM baselines across model scales. Notably, CharTool-7B outperforms the base model by +8.0% on CharXiv (Reasoning) and +9.78% on ChartQAPro, while achieving competitive performance with substantially larger or proprietary models. Moreover, CharTool demonstrates positive generalization to out-of-domain visual math reasoning benchmarks.


Authors (9)

Summary

The paper introduces CharTool, a tool-integrated framework that augments multimodal reasoning with image cropping and code computation for detailed chart analysis.
It leverages the DuoChart dataset, combining synthetic and real-world charts to capture diverse visual layouts and improve training fidelity.
Experimental results demonstrate that CharTool outperforms leading models on multiple benchmarks and generalizes to out-of-domain visual-mathematical tasks.


Motivation and Problem Setting

Chart understanding poses persistent challenges for multimodal LLMs (MLLMs) due to the dual necessity of fine-grained visual grounding and precise numerical reasoning. Conventional approaches, often based on synthetic template-driven data, lack the visual diversity and complexity encountered in real-world scientific and financial charts. Furthermore, typical MLLM architectures relying purely on internal textual reasoning exhibit limited capacity for localized perceptual analysis and explicit computation, resulting in poor generalization to complex chart layouts and dense multi-subplot figures.

Figure 1: Chart reasoning mandates both localized visual perception and intricate numerical analysis; synthetic datasets often lack chart diversity, and text-only reasoning produces errors on complex charts, in contrast to tool-grounded approaches.

DuoChart: Dual-Source Chart Data Synthesis

The paper introduces DuoChart, a data engine that draws on both synthesized and real-world chart distributions to construct training datasets for chart reasoning. Synthetic charts, generated via an LLM-based code synthesis pipeline, provide controllable data fidelity and precise labels, while figures mined from arXiv contribute the visual richness and diverse spatial and structural configurations that code-driven methods lack. After an LLM-assisted, multi-stage filtering and alignment process, the authors obtain a high-entropy data resource combining both chart types.
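The merge-then-filter idea can be sketched in a few lines. This is only an illustration under stated assumptions: the `ChartSample` fields, the `min_quality` threshold, and the three filter stages are hypothetical stand-ins, since the summary does not specify the actual criteria or their order.

```python
from dataclasses import dataclass


@dataclass
class ChartSample:
    chart_id: str
    source: str            # "synthetic" or "real"
    question: str
    answer: str
    visual_quality: float  # hypothetical judge score in [0, 1]
    answer_valid: bool     # hypothetical judge verdict on the answer


def build_dataset(synthetic, real, min_quality=0.7):
    """Merge both chart sources, then apply the filters one after another."""
    merged = list(synthetic) + list(real)

    # Stage 1 (assumed): drop charts the judge scored as visually poor.
    stage1 = [s for s in merged if s.visual_quality >= min_quality]

    # Stage 2 (assumed): drop QA pairs whose answer was judged invalid.
    stage2 = [s for s in stage1 if s.answer_valid]

    # Stage 3 (assumed): remove duplicate questions about the same chart.
    seen = set()
    kept = []
    for s in stage2:
        key = (s.chart_id, s.question.strip().lower())
        if key not in seen:
            seen.add(key)
            kept.append(s)
    return kept
```

Running the stages in sequence keeps each criterion independent, so a stage can be swapped out or reordered without touching the others.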

Figure 2: The DuoChart pipeline constructs charts through (A) dual-source synthesis/mining, (B) guided QA generation with multi-stage filtering, and (C) synthesis of demonstration trajectories for cold-start training.

Figure 3: DuoChart achieves substantial coverage in chart type and question diversity, supporting broad generalization.

This hybrid construction yields DuoChart-100k, a dataset of challenging QA pairs grounded in both synthetic and authentic figures. Evaluation metrics (e.g., GPT-5.2 judgments of visual quality, answer validity, and reasoning complexity) indicate higher fidelity and harder visual and numerical tasks than prior synthetic chart datasets.

Figure 4: DuoChart outperforms previous datasets in visual quality, chart entropy, answer correctness, and reasoning complexity.

CharTool: Tool-Augmented Multimodal Reasoning

CharTool enables MLLMs to invoke external tools for explicit, symbolic manipulation during chart question answering. Two core tool modalities are employed:

Image Cropping Tool: Enables fine-grained perceptual grounding by extracting contextually relevant subregions of the chart, supporting localized question answering and disambiguation in multi-subplot/multi-panel situations.
Code Computation Tool: Provides executable, programmatic interfaces for precise numerical manipulation, supporting aggregation, statistical calculation, and structured data parsing from chart content.
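To make the two tool roles concrete, here is a minimal sketch using only the Python standard library. It is not the paper's interface: the names `crop_tool` and `code_tool`, the `(left, top, right, bottom)` box convention, and the nested-list image representation are assumptions chosen for illustration.

```python
import contextlib
import io


def crop_tool(image, box):
    """Return the sub-image inside box = (left, top, right, bottom).

    `image` is a list of pixel rows; right and bottom are exclusive.
    Coordinates are clamped to the image bounds.
    """
    left, top, right, bottom = box
    height = len(image)
    width = len(image[0]) if height else 0
    left, right = max(0, left), min(width, right)
    top, bottom = max(0, top), min(height, bottom)
    if left >= right or top >= bottom:
        raise ValueError("empty crop region")
    return [row[left:right] for row in image[top:bottom]]


def code_tool(source):
    """Run a Python snippet and return what it printed, or an error string.

    NOTE: plain exec is NOT a sandbox; a real system would isolate execution.
    """
    buffer = io.StringIO()
    try:
        with contextlib.redirect_stdout(buffer):
            exec(source, {})
    except Exception as exc:
        return f"error: {type(exc).__name__}: {exc}"
    return buffer.getvalue().strip()


# Toy 4x6 "image" whose pixel value encodes (row, column).
image = [[r * 10 + c for c in range(6)] for r in range(4)]
print(crop_tool(image, (1, 1, 3, 3)))   # [[11, 12], [21, 22]]
print(code_tool("values = [12.5, 15.0, 9.5]\nprint(max(values) - min(values))"))  # 5.5
```

Returning errors as strings rather than raising lets the model read the failure as an ordinary tool response and retry in a later turn.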

Interaction between internal MLLM states and external tool responses is cast as a multi-turn trajectory optimized by agentic reinforcement learning. The policy $\pi_\theta$ alternates internal reasoning steps with tool invocation, receiving trajectory-level rewards based on answer correctness, structural compliance, and productive tool use.
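A trajectory-level reward with these three components might look like the sketch below. The weights `w_format` and `w_tool`, the exact-match correctness check, and the rule that the tool bonus is paid only on correct answers are all assumptions made for illustration; the summary does not give the actual reward formula.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    final_answer: str
    well_formed: bool   # every turn followed the expected output structure
    tool_calls: int     # number of tool invocations in the trajectory
    tool_errors: int    # how many of those invocations failed


def trajectory_reward(traj, gold, w_format=0.1, w_tool=0.1):
    """Score a whole multi-turn trajectory with a single scalar."""
    # Answer correctness (assumed here to be case-insensitive exact match).
    correct = traj.final_answer.strip().lower() == gold.strip().lower()
    reward = 1.0 if correct else 0.0

    # Structural compliance bonus.
    if traj.well_formed:
        reward += w_format

    # Productive tool use: rewarded only when the answer is correct and at
    # least one tool call succeeded, so calling tools for their own sake
    # earns nothing.
    if correct and traj.tool_calls > traj.tool_errors:
        reward += w_tool

    return reward
```

Tying the tool bonus to correctness is one way to discourage a policy from padding trajectories with unnecessary calls.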

Figure 5: Distribution of tool calls by benchmark shows adaptive use—perception-dominant tasks trigger cropping, while computation-heavy benchmarks favor code-based execution.

Example tool-use trajectories further show how CharTool invokes the cropping and calculation tools strategically during complex multi-step reasoning.

Figure 6: Example of CharTool using the Crop Tool for precise visual localization in a chart.

Figure 7: Example of CharTool using the Code Computation Tool for explicit numerical reasoning.

Figure 8: CharTool example with both Crop Tool and Code Computation Tool in tandem for a multi-step analysis.

Experimental Evaluation

CharTool is benchmarked on six chart QA datasets—spanning both challenging real-world tasks (e.g., CharXiv, ChartQAPro) and synthetic environments (e.g., ReachQA, ChartBench)—as well as on out-of-domain visual math reasoning tasks (MathVista, WeMath, MathVerse).

Key results:

On CharXiv (Reasoning), CharTool-7B outperforms the Qwen2.5-VL-7B backbone by +8.0% absolute accuracy, and on ChartQAPro by +9.78%.
CharTool-7B (66.52 avg) matches or exceeds proprietary systems like GPT-4o (61.52) and the much larger Qwen2.5-VL-72B (66.02), demonstrating parameter efficiency.
On out-of-domain benchmarks, tool-integrated training also improves visual mathematical reasoning (for example, +2.2% on WeMath).
Ablations show that neither chain-of-thought prompting nor language-only RL comes close to the gains of tool-integrated RL.
Analysis of tool usage reveals dynamic adaptation: Crop Tool is favored in visually dense, subplot-rich settings; Code Computation dominates in tasks requiring numerical analysis.
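As a quick arithmetic check on the averages quoted above, the margins work out to 5.00 points over GPT-4o and 0.50 points over Qwen2.5-VL-72B. The helper name `margin_over` is hypothetical; only the three average scores come from the text.

```python
# Average scores as quoted in the key results.
averages = {
    "CharTool-7B": 66.52,
    "GPT-4o": 61.52,
    "Qwen2.5-VL-72B": 66.02,
}


def margin_over(baseline, system="CharTool-7B"):
    """Difference in average score, in points, rounded to two decimals."""
    return round(averages[system] - averages[baseline], 2)


for name in ("GPT-4o", "Qwen2.5-VL-72B"):
    print(f"{name}: {margin_over(name):+.2f}")
```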

Theoretical and Practical Implications

CharTool demonstrates that explicit external tool integration, particularly when optimized with trajectory-level RL, yields significant gains over MLLM self-reasoning alone. This supports a modular, agentic view of multimodal reasoning: rather than grounding all visual and numerical reasoning implicitly within the LLM backbone, a judicious division of labor between internal "cognitive" states and external, executable operations produces more robust, interpretable, and generalizable systems.

Practically, such approaches facilitate reliable chart interpretation in scientific and financial settings where both localized perception and rigorous calculations are critical. Moreover, the demonstrable transfer of tool-integrated reasoning to out-of-domain tasks suggests potential for extensibility in other structured visual domains (e.g., tables, schematic diagrams).

Speculation on Future Directions

Prospective extensions include:

Expansion of the toolset to include OCR, region-cluster detection, or formula parsing, improving reasoning over more heterogeneous vision-language tasks.
Finer-grained, iterative reward schemes for longer multi-step reasoning, potentially incorporating subgoal-level supervision.
Porting the tool-integrated reasoning paradigm to related domains such as table-based QA, document reading, or complex instructional image understanding, with appropriate interface abstraction.
Investigating architectures for end-to-end trainable tool parameter selection or composition, reducing hardcoded tool interfaces in favor of higher-level learning.

Conclusion

This work establishes a robust methodology for tool-integrated chart reasoning, combining dual-source high-quality data, targeted agentic training, and adaptive external tool interfaces. CharTool consistently advances performance in chart question answering and demonstrates improved generalization to broader visual reasoning tasks, evidencing the utility of explicit, agent-augmented multimodal frameworks for structured data understanding (2604.02794).
