ChartRQA: Interpretable Chart Reasoning

Updated 3 July 2026

ChartRQA is a vision-language chart reasoning framework that integrates programmatic data synthesis, chain-of-thought supervision, and numerically sensitive reinforcement learning.
It employs transformer-based architectures and modular compositional reasoning to decompose complex chart queries into verifiable, stepwise logical operations.
ChartRQA establishes state-of-the-art performance on chart analysis benchmarks and offers a scalable ecosystem for robust, interpretable reasoning in scientific charts.

ChartRQA is a vision-language chart reasoning framework and dataset ecosystem for complex question answering over scientific charts with stepwise, interpretable reasoning. Recent work around ChartRQA emphasizes fine-grained chain-of-thought (CoT) supervision, numerically sensitive reinforcement learning, programmatic data synthesis, and compositional, verifiable reasoning architectures. ChartRQA and its related models sit at the intersection of multimodal machine learning, mathematical reasoning, and robust evaluation for real-world chart analysis.

1. Problem Definition and Motivation

Chart reasoning question answering (Chart QA; CQA) is the task of generating answers to complex natural language questions that require extracting and reasoning over information embedded in the visual and structural elements of charts (bar, line, pie, etc.), often originating from scientific publications and reports. Traditional methods follow the pipeline: chart → table (via OCR or parsing) → logical form → answer. However, such approaches tend to struggle with context dependence, multi-step logical operations, textless or visually ambiguous charts, and lack of transparency in intermediate reasoning steps.

The ChartRQA ecosystem addresses these weaknesses by introducing:

Large-scale, programmatically synthesized datasets with ground-truth stepwise reasoning traces,
Explicit CoT supervision, making each atomic subgoal in reasoning learnable,
Numerically precise reward shaping for reinforcement learning,
Modular architectures capable of compositional, interpretable reasoning over multimodal chart content (Chen et al., 21 Jul 2025).

2. Programmatic Data Synthesis and ChartRQA Corpus

The ChartRQA corpus is created using a programmatic synthesis pipeline that leverages real-world arXiv tables, Matplotlib code templates, and state-of-the-art LLMs to generate chart images and associated reasoning tasks:

Seed Data: Real tabular data from arXiv papers.
LLM-Generated Matplotlib Code: Prompted to produce validated charts across 24 types, including complex multi-subplot layouts.
Reasoning Traces: LLMs generate a question, a detailed stepwise CoT (enclosed in <think>...</think>), and a final answer (enclosed in <answer>...</answer>), explicitly requiring cross-chart and multi-hop operations for multi-subplot figures.
Quality Control: Automated filtering for syntactic/semantic consistency, with human benchmarking (1,702 samples) for accuracy, difficulty, and reasoning fidelity.

Notable statistics:
- 258,000 training instances (93,300 unique images)
- Split into SFT/CoT (228,000) and RL (30,000) sets
- Average CoT trace: 196–240 tokens
- Test set: 933 single-chart and 769 multi-chart human-verified samples (Chen et al., 21 Jul 2025).
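
The tagged trace format described above lends itself to a mechanical well-formedness check during filtering. The following is a minimal sketch, assuming each synthesized response consists of exactly one <think>...</think> block followed by one <answer>...</answer> block. The function name `parse_trace` and the sample string are hypothetical and are not taken from the ChartRQA release.

```python
import re
from typing import Optional, Tuple

# Assumed layout: one <think> block, then one <answer> block, nothing else.
TRACE_RE = re.compile(
    r"\A\s*<think>(?P<think>.*?)</think>\s*<answer>(?P<answer>.*?)</answer>\s*\Z",
    re.DOTALL,
)

def parse_trace(response: str) -> Optional[Tuple[str, str]]:
    """Return (reasoning, answer) if the response is well formed, else None."""
    # Reject repeated tags up front so the regex cannot span two blocks.
    if response.count("<think>") != 1 or response.count("<answer>") != 1:
        return None
    match = TRACE_RE.match(response)
    if match is None:
        return None
    return match.group("think").strip(), match.group("answer").strip()

# Hypothetical example response.
sample = "<think>Step 1: read both bars. Step 2: add 3 and 4.</think><answer>7</answer>"
print(parse_trace(sample))   # well-formed: a (reasoning, answer) pair
print(parse_trace("7"))      # missing tags: None
```

A filter of this kind only checks syntax; the semantic consistency checks and human benchmarking mentioned above would still be needed on top of it.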

3. Chain-of-Thought Supervision and Model Architectures

ChartRQA models employ explicit chain-of-thought supervision ("Chart-COT") via SFT on annotated CoT traces. Key architectural and training points:
Model: Transformer-based decoder (Qwen2.5-VL-7B) with a lightweight vision encoder.
Training: The model outputs the entire <think>...</think> reasoning chain, segmented into atomic operations (e.g., "Identify bars for category Q1," "Sum their values"), followed by the final answer. Instruction templates guide both single- and multi-chart decomposition.
Loss Function: Negative log-likelihood over concatenated reasoning/answer tokens.

The CoT phase initializes policies that support deeper, more consistent RL-stage outputs. Ablations show that RL without SFT yields shallow, less coherent reasoning (Chen et al., 21 Jul 2025).
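
Written out, a token-level negative log-likelihood of this kind takes the standard autoregressive form. The notation here is an assumption made for illustration: $I$ is the chart image, $q$ the question, $y = (y_1, \dots, y_T)$ the concatenated reasoning and answer tokens, and $\theta$ the model parameters.

$$
\mathcal{L}_{\mathrm{SFT}}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid I, q, y_{<t}\right)
$$

Because the reasoning tokens are part of $y$, every atomic step in the trace contributes to the loss, not only the final answer.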

4. Numerically Sensitive Reinforcement Learning

After SFT, ChartRQA models are further fine-tuned using a numerically sensitive RL regime ("Chart-RFT"), leveraging Group Relative Policy Optimization (GRPO):

GRPO Objective: Samples multiple responses per prompt, scoring via reward functions and updating the policy using clipped advantages and KL penalties to anchor policies to the SFT baseline.
Reward Functions:
- r_format: Ensures correct CoT and answer tag structure.
- r_semantic: Normalized edit-distance between output and reference.
- r_numeric: Soft numeric reward; full score if numeric answer is within 5% of ground-truth.
- Combined as $R(\tau) = \alpha\, r_{\mathrm{semantic}} + \beta\, r_{\mathrm{numeric}}$, with the weights $\alpha$ and $\beta$ tuned to balance string-level and numeric accuracy.
Policy Update: Only responses passing both format and accuracy constraints receive positive reinforcement.
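
The reward terms listed above can be sketched in a few lines. This is an illustrative sketch, not the authors' implementation, and it rests on several assumptions: r_semantic is taken to be one minus the length-normalized Levenshtein distance so that higher is better, r_numeric is a hard 0/1 threshold at 5% relative error (the source calls it "soft" without giving the shape outside the tolerance), only the format check is used as a gate, and the weights `alpha = beta = 0.5` are placeholders.

```python
import re

def r_format(response: str) -> float:
    """1.0 if the response is <think>...</think> followed by <answer>...</answer>."""
    pattern = r"\A\s*<think>.*?</think>\s*<answer>.*?</answer>\s*\Z"
    return 1.0 if re.match(pattern, response, re.DOTALL) else 0.0

def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic two-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def r_semantic(pred: str, ref: str) -> float:
    """Assumed form: 1 minus the length-normalized edit distance, in [0, 1]."""
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(pred, ref) / longest

def r_numeric(pred: str, ref: str, tol: float = 0.05) -> float:
    """1.0 if both parse as numbers and pred is within tol (relative) of ref."""
    try:
        p, r = float(pred), float(ref)
    except ValueError:
        return 0.0
    if r == 0.0:
        return 1.0 if p == 0.0 else 0.0
    return 1.0 if abs(p - r) <= tol * abs(r) else 0.0

def total_reward(response: str, pred: str, ref: str,
                 alpha: float = 0.5, beta: float = 0.5) -> float:
    """Format acts as a gate; otherwise R = alpha * r_semantic + beta * r_numeric."""
    if r_format(response) == 0.0:
        return 0.0
    return alpha * r_semantic(pred, ref) + beta * r_numeric(pred, ref)

# Hypothetical example: prediction 103 against reference 100 (3% off).
print(total_reward("<think>add bars</think><answer>103</answer>", "103", "100"))
```

In this example the numeric term gives full credit because 103 is within 5% of 100, while the string term is penalized for the one differing character, which illustrates why the two terms are weighted separately.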

This design explicitly prioritizes numerically correct reasoning, critical for scientific or quantitative chart tasks (Chen et al., 21 Jul 2025).
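
The group-relative part of GRPO can be illustrated with a small sketch. It assumes the commonly described formulation in which rewards for the responses sampled from one prompt are standardized within that group and then used in a PPO-style clipped surrogate. The function names, the use of the population standard deviation, and the clip range of 0.2 are assumptions for illustration, and the KL penalty toward the SFT policy is omitted.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one prompt's group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; eps guards against a zero spread
    return [(r - mu) / (sigma + eps) for r in rewards]

def clipped_term(ratio: float, advantage: float, clip: float = 0.2) -> float:
    """PPO-style clipped surrogate for a single probability ratio."""
    clipped_ratio = max(1.0 - clip, min(1.0 + clip, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)

# Hypothetical group of four sampled responses: two rewarded, two not.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
print(clipped_term(1.5, advantages[0]))
```

Because advantages are computed relative to the group mean, a response is reinforced only when it scores better than its siblings for the same prompt, which removes the need for a separate value network.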

5. Compositional and Verifiable Reasoning Paradigms

ChartRQA integrates state-of-the-art compositional reasoning modules and verifiable stepwise supervision:

Graph-of-Thought (GoT) Guidance: As proposed in GoT-CQA, questions are decomposed into directed acyclic graphs of operator nodes—localization (Loc), numerical (Num), and logical (Log)—each block realized as a small transformer acting on learned chart features and textual prompts (Zhang et al., 2024).
Visual Premise Proving (VPP): RealCQA-V2 frames reasoning as a chain of first-order logic premises (structural, data, reasoning, mathematical), with models required to verify each premise and achieve intermediate logical consistency, evaluated via stepwise correctness metrics (e.g., Acc_VPP, DCP), exposing failures in data extraction or logical combination (Ahmed et al., 2024).
Architectural Modularity: Systems can replace monolithic LLM parsing with compositional pipelines—chart encoding, operator-specific transformer blocks, supervision at each intermediate output—yielding both interpretability and extensibility (e.g., addition of new operator types such as “Trend” or “Filter”).
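
The idea of decomposing a question into a directed acyclic graph of Loc, Num, and Log operators can be illustrated with a toy executor. This sketch swaps the small transformer blocks described above for plain Python functions and replaces learned chart features with a dictionary of values, so it shows only the control flow. The chart data, node names, and `run_graph` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Toy chart content standing in for learned chart features (hypothetical data).
chart = {"Q1": 12.0, "Q2": 18.0, "Q3": 9.0}

# Question: "Is the combined value of Q1 and Q2 greater than 25?"
# Each node: (operator type, function of the chart and upstream results).
nodes = {
    "loc_q1": ("Loc", lambda chart, deps: chart["Q1"]),
    "loc_q2": ("Loc", lambda chart, deps: chart["Q2"]),
    "sum": ("Num", lambda chart, deps: deps["loc_q1"] + deps["loc_q2"]),
    "check": ("Log", lambda chart, deps: deps["sum"] > 25.0),
}

# Edges: each node maps to the set of nodes it depends on.
edges = {
    "loc_q1": set(),
    "loc_q2": set(),
    "sum": {"loc_q1", "loc_q2"},
    "check": {"sum"},
}

def run_graph(chart, nodes, edges):
    """Evaluate nodes in dependency order, keeping every intermediate result."""
    results = {}
    for name in TopologicalSorter(edges).static_order():
        _, fn = nodes[name]
        deps = {d: results[d] for d in edges[name]}
        results[name] = fn(chart, deps)
    return results

trace = run_graph(chart, nodes, edges)
for name, value in trace.items():
    print(nodes[name][0], name, value)
```

Keeping every intermediate result in `trace` is what makes the pipeline inspectable: each operator output can be supervised or verified on its own, and a new operator type only requires a new node function.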

6. Empirical Performance and Comparative Results

ChartRQA-based systems report consistent gains over their base model across multiple benchmarks, although Gemini-2.5-Flash still scores highest on the ChartRQA benchmark itself:

ChartRQA Benchmark (Single/Multi-chart):
- Chart-R1-7B (Chart-COT + Chart-RFT): 52.09 / 49.93
- Qwen2.5-VL-7B baseline: 44.59 / 40.57
- GPT-4o (proprietary): 44.37 / 46.55
- Gemini-2.5-Flash: 59.12 / 59.17 (Chen et al., 21 Jul 2025)
ChartQA: 91.04% (Chart-R1) vs. 87.3% (Qwen2.5-VL-7B)
CharXiv-RQ (reasoning split): 46.2% (Chart-R1) vs. 42.5% (baseline)
Generalization and Robustness: Chart-RL demonstrates strong robustness under perturbed chart styles/layouts (performing best in 18/25 categories), and improved transfer even with few exemplars if complex reasoning is present. For the hard task setting, RL delivers a 16.7% improvement on MultiChartQA and 11.5% on ChartInsights (Zhang et al., 7 Mar 2026).
Ablations: Chain-of-thought SFT followed by RL is critical—without SFT, policies are shallow; overlapping SFT and RL data leads to overfitting; hard tasks induce stronger generalization than simply increasing data volume (Chen et al., 21 Jul 2025, Zhang et al., 7 Mar 2026).
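
For a quick comparison, the absolute gains of Chart-R1 over its baseline can be computed directly from the scores listed above. The snippet is a small arithmetic sketch; the dictionary simply copies the reported numbers, and the `gains` helper is a hypothetical name.

```python
# Scores copied from the results listed above: (Chart-R1, baseline).
scores = {
    "ChartRQA single-chart": (52.09, 44.59),
    "ChartRQA multi-chart": (49.93, 40.57),
    "ChartQA": (91.04, 87.3),
    "CharXiv-RQ": (46.2, 42.5),
}

def gains(scores):
    """Absolute improvement in points, rounded to two decimals."""
    return {name: round(model - base, 2) for name, (model, base) in scores.items()}

for name, delta in gains(scores).items():
    print(f"{name}: +{delta:.2f} points")
```

The gains are larger on the ChartRQA splits (about 7.5 and 9.4 points) than on ChartQA and CharXiv-RQ (about 3.7 points each).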

7. Extensions, Limitations, and Future Directions

ChartRQA frameworks provide a basis for extensible, reproducible, and interpretable chart question answering:

Integrations and Adaptations: ChartRQA pipelines can directly adopt GoT-style decomposition (Zhang et al., 2024), VPP-style premise-level supervision (Ahmed et al., 2024), or dual-phase visual-language alignment from mChartQA (Wei et al., 2024), with tailored plug-ins for chart parsing or decoding as needed.
Limitations:
- Most ChartRQA instances are synthesized using scientific (Matplotlib-styled) charts, potentially limiting generalizability to “in the wild” dashboards (e.g., PowerPoint, web plots).
- Reward or supervision structures relying on verifiable numeric ground-truth may not address open-ended or interpretive chart questions.
- Complex chart-query pairs for hard reasoning require manual or LLM-aided curation.
Emerging Directions:
- Expansion to real-world and user-created charts for greater stylistic diversity.
- Learned critics for reward refinement, especially for subjective or ambiguous outcomes.
- Integration of metadata retrieval and dynamic operator sets for general-purpose compositional reasoning.
- Application of programmatic data synthesis and CoT supervision beyond charts (e.g., to scientific figures, diagrammatic math problems) (Chen et al., 21 Jul 2025, Zhang et al., 7 Mar 2026, Wei et al., 2024).

In summary, ChartRQA represents a convergence of code-centric data generation, interpretable multimodal supervision, targeted reinforcement learning, and symbolic-compositional reasoning—anchored by stepwise, verifiable program annotations—as the foundation for advanced, reliable chart question answering.