ChartMaster: Multimodal Chart-to-Code Model

Updated 3 July 2026

ChartMaster is a multimodal LLM system for chart-to-code tasks, integrating advanced vision–language modeling with reinforcement learning to capture detailed visual features.
It leverages the ReChartPrompt-240K dataset of 242K chart–code pairs from scientific literature, ensuring diverse, high-fidelity chart reproduction through rigorous quality control.
By combining supervised fine-tuning with ChartSimRL, ChartMaster achieves state-of-the-art execution and visual similarity metrics, competing with methods like GPT-4o.

ChartMaster is a multimodal LLM (MLLM) system designed specifically for chart-to-code generation. It brings together recent advances in vision–language modeling, large-scale harvesting of real-world data, and reinforcement learning driven by perceptual similarity metrics. ChartMaster sets state-of-the-art performance among open-source 7B-parameter models and performs competitively with commercial closed-source systems such as GPT-4o, notably on complex, visually diverse chart inputs drawn from scientific literature (Tan et al., 25 Aug 2025).

1. Real-World Dataset Construction: ReChartPrompt-240K

ChartMaster’s training regime is anchored in the ReChartPrompt-240K dataset, which consists of 242,479 high-fidelity chart–code pairs harvested from arXiv papers. The construction process involved:

Crawling and Extraction: 30,071 arXiv papers were parsed for images (.pdf, .png, .jpg), yielding 288,992 raw images.
Chart Filtering: Images were classified among 12 supported chart types using Qwen2.5-VL-72B and a chart-type prompt; only images recognized as charts were retained.
Code Generation via Vision-Language LLMs: For each chart, a randomly selected prompt from a 20-template pool (“Please generate Python matplotlib code to recreate the picture shown,” etc.) was paired with the image and passed to Qwen2.5-VL-72B, which generated candidate Python/matplotlib code.
Execution and Quality Control: Generated code was executed headlessly. (Image, prompt, code) triplets where code execution failed or the rendered output diverged visibly from the source chart were discarded. Quality metrics included execution success and visual fidelity.
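
The execution half of the quality-control step can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `runs_headlessly`, the timeout, and the use of a subprocess are assumptions. It covers only execution success; the visual-divergence check is not shown.

```python
import os
import subprocess
import sys
import tempfile


def runs_headlessly(code: str, timeout_s: float = 60.0) -> bool:
    """Return True if `code` exits cleanly in a fresh, display-less Python process.

    Hypothetical helper: the name and the timeout value are illustrative only.
    """
    # MPLBACKEND=Agg selects matplotlib's non-interactive backend, so no GUI is needed.
    env = dict(os.environ, MPLBACKEND="Agg")
    with tempfile.TemporaryDirectory() as workdir:
        script = os.path.join(workdir, "candidate.py")
        with open(script, "w", encoding="utf-8") as fh:
            fh.write(code)
        try:
            result = subprocess.run(
                [sys.executable, script],
                cwd=workdir,          # any files the script saves land in the temp dir
                env=env,
                capture_output=True,  # keep stdout/stderr out of the caller's console
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return False
    # A non-zero exit code means the generated code raised or crashed.
    return result.returncode == 0
```

Running each candidate in its own process keeps a crashing or hanging script from taking down the filtering job, which matters when most of the input is machine-generated code.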

Dataset diversity was explicitly measured: ReChartPrompt-160K exhibited 30–50% more unique attributes (numeric, color, text, layout) than the prior synthetic Chart2Code-160K. All code is in Python/matplotlib; no image augmentations were applied (Tan et al., 25 Aug 2025).

2. Model Architecture and Supervised Fine-Tuning

ChartMaster uses Qwen2.5-VL-7B as its backbone: a multimodal transformer with ~24 layers, incorporating both a vision projection head (to process image patches) and a standard language head (for code token decoding).

Supervised Fine-Tuning (SFT) Objective: The primary SFT loss maximizes the conditional likelihood of the ground-truth code $Y_i$ given chart $I_i$ and instruction $T_i$ :

$J_{\mathrm{SFT}}(\theta) = -\frac{1}{N}\sum_{i=1}^N \log \pi_\theta(Y_i | I_i, T_i)$
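
As a rough illustration of this objective, the sketch below evaluates it on toy numbers. It assumes an autoregressive model, so that $\log \pi_\theta(Y_i | I_i, T_i)$ is the sum of per-token log-probabilities; the function name `sft_loss` and the toy probabilities are hypothetical.

```python
import math
from typing import Sequence


def sft_loss(token_log_probs: Sequence[Sequence[float]]) -> float:
    """Mean negative log-likelihood over N samples.

    token_log_probs[i][t] is the log-probability the model assigns to token t
    of the ground-truth code Y_i, so summing over t gives log pi(Y_i | I_i, T_i).
    """
    if not token_log_probs:
        raise ValueError("need at least one sample")
    sequence_log_probs = [sum(sample) for sample in token_log_probs]
    return -sum(sequence_log_probs) / len(sequence_log_probs)


# Two toy samples with made-up per-token probabilities.
batch = [
    [math.log(0.5), math.log(0.5)],  # sequence probability 0.25
    [math.log(0.25)],                # sequence probability 0.25
]
loss = sft_loss(batch)  # equals log(4), since both sequences have probability 0.25
```

Practical training code often averages per token instead of per sequence; the sketch follows the per-sequence form of the formula as written.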

Hyperparameters: Learning rate $2 \times 10^{-5}$ , batch size 128, cosine-annealed for 1 epoch over all 242K samples. The resulting checkpoint is reused as the reference policy for reinforcement learning.
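
The schedule implied by these hyperparameters might look like the sketch below. The source says only "cosine-annealed", so the absence of warmup and the decay to zero are assumptions, as is the helper name `cosine_lr`.

```python
import math


def cosine_lr(step: int, total_steps: int,
              peak_lr: float = 2e-5, min_lr: float = 0.0) -> float:
    """Cosine-annealed learning rate, from peak_lr at step 0 to min_lr at the end.

    Assumes no warmup phase and a floor of min_lr (both are illustrative choices).
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step, 0), total_steps) / total_steps
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# One epoch over 242,479 samples at batch size 128, rounding the last batch up.
total_steps = math.ceil(242_479 / 128)
```

With these figures one epoch comes to 1,895 optimizer steps, assuming one update per batch.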

3. Chart Similarity Reinforcement Learning (ChartSimRL)

ChartMaster addresses the core challenge of visual–semantic consistency through ChartSimRL, a reinforcement learning algorithm based on Group Relative Policy Optimization (GRPO).

MDP Structure:

State: The model’s latent state given $(I, T)$ and the code generated so far.
Action: Next code token.
Reward: Assigned after executing the full code. Zero if execution fails; otherwise, the reward is a function of chart similarity (multimodal).

Reward Structure:

$R_i = R_i^{\mathrm{attr}} + R_i^{\mathrm{vis}}$

Attribute Similarity ( $R_i^{\mathrm{attr}}$ ): Jaccard similarity between sets of chart attributes (numeric, color, text, layout), extracted from ground-truth and generated charts. Numeric values $a, b$ are considered equal if $|a-b| \leq 0.01|b|$ .
Visual Similarity ( $R_i^{\mathrm{vis}}$ ): Cosine similarity over feature maps from each of four ResNet-18 residual blocks, averaged across all blocks, between the original and generated chart images.
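
A minimal sketch of the two reward terms follows. Several details are assumptions rather than the paper's method: numeric attributes are matched greedily one-to-one under the 1% tolerance, an empty attribute set scores zero, and the ResNet-18 features are taken as already extracted and flattened, one vector per residual block. All function names are hypothetical.

```python
import math
from typing import Iterable, Sequence


def numeric_match(a: float, b: float, rel_tol: float = 0.01) -> bool:
    """True if generated value a is within 1% of ground-truth value b."""
    return abs(a - b) <= rel_tol * abs(b)


def attribute_similarity(gt_numeric: Iterable[float], gen_numeric: Iterable[float],
                         gt_other: Iterable[str], gen_other: Iterable[str]) -> float:
    """Jaccard-style overlap of chart attributes (greedy numeric matching is an assumption)."""
    gt_set, gen_set = set(gt_other), set(gen_other)
    matched = len(gt_set & gen_set)  # colors, text, layout: exact match
    unused = list(gt_numeric)
    gen_list = list(gen_numeric)
    total = len(gt_set) + len(gen_set) + len(unused) + len(gen_list)
    for a in gen_list:
        for idx, b in enumerate(unused):
            if numeric_match(a, b):
                matched += 1
                del unused[idx]  # each ground-truth value can be matched once
                break
    union = total - matched
    return matched / union if union else 0.0  # empty-vs-empty scores 0 (assumption)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def visual_similarity(gt_feats: Sequence[Sequence[float]],
                      gen_feats: Sequence[Sequence[float]]) -> float:
    """Mean cosine similarity over per-block feature vectors (one per residual block)."""
    if not gt_feats or len(gt_feats) != len(gen_feats):
        raise ValueError("need the same non-zero number of feature blocks")
    return sum(cosine(g, h) for g, h in zip(gt_feats, gen_feats)) / len(gt_feats)


def chart_reward(executed: bool, attr: float, vis: float) -> float:
    """R = R_attr + R_vis, or zero when the generated code fails to execute."""
    return attr + vis if executed else 0.0
```

Because both terms are bounded above by 1, the summed reward for a successfully executed candidate cannot exceed 2, while a failed execution always scores 0.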

GRPO Update: For each prompt, $G$ candidate codes are generated. Each is executed, scored, and assigned a normalized group advantage:

$A_i = \frac{R_i - \mathrm{mean}(\{R_j\}_{j=1}^{G})}{\mathrm{std}(\{R_j\}_{j=1}^{G})}$

Parameters are updated with a PPO-style surrogate, clipped with threshold $\epsilon$ and regularized by a KL-divergence penalty to the SFT reference.
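
The group-relative step can be sketched in a few lines. The mean/standard-deviation normalization is the usual GRPO form and is assumed here, as are the population standard deviation, the small stabilizing constant, and the clip value 0.2 used in the example. The KL penalty is omitted.

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-8) -> List[float]:
    """Normalize each candidate's reward against its own prompt group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an illustrative choice
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_surrogate(ratio: float, advantage: float, clip_eps: float) -> float:
    """PPO-style clipped objective for one sample (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# Four candidates for one prompt: one failed execution (reward 0) and three scored charts.
advantages = group_advantages([0.0, 1.2, 1.5, 1.7])
objective = clipped_surrogate(ratio=1.5, advantage=1.0, clip_eps=0.2)  # clip_eps is hypothetical
```

If every candidate in a group receives the same reward, all advantages are zero and that prompt contributes no gradient, which is why reward signals that separate near-identical candidates matter for this kind of training.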

Training Regime: RL is run for one pass over a 10% subset (24K samples) with a batch of 32 prompts (128 candidates), using the learning rate reported in the paper (Tan et al., 25 Aug 2025).
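
The figures in this regime imply a group size and an approximate step count, worked out below. The calculation assumes one policy update per batch and takes the rounded 24K figure at face value.

```python
import math

rl_samples = 24_000          # rounded 10% subset quoted above
prompts_per_batch = 32
candidates_per_batch = 128

# Candidates sampled per prompt (the GRPO group size).
candidates_per_prompt = candidates_per_batch // prompts_per_batch

# Policy updates in one pass, assuming one update per batch of prompts.
update_steps = math.ceil(rl_samples / prompts_per_batch)
```

This gives 4 candidates per prompt and roughly 750 updates for the single RL pass.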

4. Evaluation Benchmarks and Quantitative Results

ChartMaster’s performance was rigorously assessed on multiple chart-to-code benchmarks:

Benchmark	Metric / Description	ChartMaster-7B	GPT-4o	Best Prior Open-Source
ChartMimic	Execution Rate / F1 (text, color, layout, type) / GPT-4o score	93.8% / 78.2% / 77.3	93.2% / 79.0% / 83.5	91.4% / 77.4% / 74.0
Plot2Code	Pass Rate / Text Match / GPT-4V rating	88.2% / 62.6% / 5.65	—	67.4% / 43.8% / 4.60
ChartX	GPT-4 human rating (0-5)	2.46	2.53	2.18

Ablation studies show that SFT on ReChartPrompt accounts for the bulk of the gains (boosting execution/F1/rating to 91.1/73.7/73.3), while ChartSimRL further increases all metrics (up to 93.8/78.2/77.3). Both reward components contribute additively; ResNet-18 feature similarity clearly outperformed MSE, SSIM, PSNR, and other perceptual metrics (Tan et al., 25 Aug 2025).

5. Qualitative Analysis

ChartMaster models, after RL, faithfully replicate not only the structure but also the layout, color, axis scales, and annotation positions of input charts. Qualitative inspection demonstrates:

Accurate reproduction of complex visual cues—tick locations, legend placements, marker styles—across varied scientific domains.
Minor residual errors on highly complex figures, e.g., multi-panel layouts or uncommon annotation forms.
Rare overfitting to low-frequency visual texture details (e.g., background grid lines), indicating that multimodal RL emphasizes pixel-level detail (Tan et al., 25 Aug 2025).

6. Significance, Limitations, and Future Directions

Significance: ChartMaster combines large-scale, visually diverse real-world data from arXiv, an RL stage that enforces multimodal similarity, and quantitative evaluation across several benchmarks, yielding state-of-the-art results among open-source models of comparable scale. The joint attribute–visual reward is especially important for achieving perceptual, not merely syntactic, fidelity in chart reproduction.

Limitations:

All code is Python/matplotlib-specific; no current support for R/ggplot2 or JS/D3.
Some failure cases remain for charts with complex subplots or uncommon legend forms.
Only chart types in the 12 annotated categories are covered; diagrams and non-statistical figures are excluded (Tan et al., 25 Aug 2025).

Future Directions:

Expansion to additional charting grammars (R/ggplot2, JavaScript/D3).
Incorporation of higher-level “layout planning” reasoning for multi-panel figures.
Extension of RL to diagram-to-code tasks (flowcharts, UML).
Human-in-the-loop RL for rare corner-case coverage.

ChartMaster’s methodology—leveraging dataset realism, vision-language code generation, and fine-grained multimodal RL—constitutes a reference blueprint for future chart-to-code and visual reasoning systems.


References (1)

ChartMaster: Advancing Chart-to-Code Generation with Real-World Charts and Chart Similarity Reinforcement Learning (2025)
