- The paper introduces ChartMaster, which combines a real-world chart dataset (ReChartPrompt) with a novel reinforcement learning approach (ChartSimRL) to boost chart-to-code generation.
- It demonstrates that the ChartMaster-7B model achieves state-of-the-art performance, rivaling larger models like GPT-4o on multiple evaluation benchmarks.
- The methodology employs a dual reward design, using Jaccard similarity for attributes and ResNet-18 features for visual similarity, to ensure both semantic accuracy and visual fidelity.

      ChartMaster: Advancing Chart-to-Code Generation with Real-World Charts and Chart Similarity Reinforcement Learning
Introduction
ChartMaster addresses the chart-to-code generation task, which requires multimodal large language models (MLLMs) to convert chart images into executable code, typically Python matplotlib scripts for scientific visualization. The paper identifies two primary challenges: (1) limited data diversity in existing datasets, which are predominantly synthesized from attribute seeds, and (2) the failure of standard training objectives to maintain visual consistency between generated and original charts. ChartMaster introduces two key innovations: ReChartPrompt, a large-scale dataset constructed from real-world charts extracted from arXiv papers, and ChartSimRL, a reinforcement learning algorithm leveraging a multimodal chart similarity reward. The resulting ChartMaster-7B model achieves state-of-the-art performance among open-source 7B-scale models and approaches the capabilities of GPT-4o on multiple benchmarks.
Dataset Construction: ReChartPrompt
The ReChartPrompt pipeline systematically extracts chart images from 30,071 arXiv papers, yielding 288,992 candidate images. A chart type classifier based on Qwen2.5-VL-72B filters non-chart images, retaining only those belonging to 12 predefined chart categories. For each chart, Qwen2.5-VL-72B is prompted with one of 20 diverse chart-to-code instructions to generate Python code intended to reproduce the chart. Generated code is executed, and only successful runs are retained, resulting in 242,479 high-quality chart-image/code/instruction triplets (ReChartPrompt-240K).
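The retain-only-successful-runs step can be sketched as follows. `passes_execution_filter` is a hypothetical helper name, and this is a minimal in-process sketch; the paper does not specify its execution environment, and a production pipeline would sandbox each candidate in a subprocess with matplotlib's headless Agg backend forced.

```python
import contextlib
import io

def passes_execution_filter(code: str) -> bool:
    """Keep a generated code sample only if it executes without raising.

    Hypothetical helper sketching ReChartPrompt's filter step; a real
    pipeline would run candidates in a sandboxed subprocess and force
    matplotlib's headless Agg backend before execution.
    """
    try:
        # Silence stray print output from candidate scripts.
        with contextlib.redirect_stdout(io.StringIO()):
            exec(code, {"__name__": "__main__"})
        return True
    except Exception:
        return False
```

Only chart-image/code/instruction triplets whose code passes this check would enter the final dataset.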
This approach yields a dataset with substantially greater diversity in chart attributes (numeric, color, text, layout) compared to prior synthetic datasets such as Chart2Code-160K. Attribute extraction and analysis confirm that ReChartPrompt-240K contains richer and less redundant attribute distributions, which directly translates to improved model generalization and robustness.
Model Training: SFT and ChartSimRL
ChartMaster is built on the Qwen2.5-VL-7B architecture and trained in two stages:
Supervised Fine-Tuning (SFT)
SFT is performed on the ReChartPrompt-240K dataset using standard next-token prediction, maximizing the likelihood of ground-truth code given the chart image and instruction. This establishes a strong baseline for chart-to-code generation, with significant improvements over models trained on synthetic data.
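The SFT objective is ordinary next-token cross-entropy over the reference code, conditioned on the chart image and instruction. A minimal sketch of the per-sample loss, assuming the multimodal encoding upstream has already produced per-position logits:

```python
import numpy as np

def next_token_nll(logits: np.ndarray, target_ids: np.ndarray) -> float:
    """Mean negative log-likelihood of the ground-truth code tokens.

    logits: (seq_len, vocab_size) scores produced by the MLLM conditioned
    on the chart image and instruction (both assumed encoded upstream).
    target_ids: (seq_len,) ground-truth token ids of the reference code.
    """
    # Numerically stable log-softmax over the vocabulary axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Gather the log-prob assigned to each ground-truth token and average.
    return float(-log_probs[np.arange(len(target_ids)), target_ids].mean())
```

Minimizing this quantity over ReChartPrompt-240K maximizes the likelihood of the ground-truth code, as described above.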
ChartSimRL: Multimodal Reinforcement Learning
To address the limitations of SFT in maintaining visual fidelity, ChartSimRL applies Group Relative Policy Optimization (GRPO) with a novel chart similarity reward. For each training sample, the model samples M candidate codes, executes them, and compares the resulting chart images to the reference using two reward components:
- Attribute Similarity (R_i^attr): Jaccard similarity between sets of extracted chart attributes (text, numeric values, layout, color), with tolerance for minor numerical discrepancies.
 
- Visual Similarity (R_i^vis): average cosine similarity between feature vectors extracted from the original and generated chart images using a pretrained ResNet-18, taken across its four residual blocks.
 
The total reward is R_i = R_i^attr + R_i^vis. Rewards are normalized within each candidate group, and the GRPO objective is optimized with KL regularization for stability. This multimodal reward design is shown to be superior to conventional text-based or pixel-level metrics, with ablation studies demonstrating that both the attribute and visual similarity components are necessary for optimal performance.
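A minimal sketch of this reward computation. The attribute extractor and the ResNet-18 feature hooks are assumed to run upstream, and the numeric tolerance value is illustrative rather than taken from the paper:

```python
import numpy as np

def jaccard_with_tolerance(ref: set, gen: set, tol: float = 0.05) -> float:
    """Attribute reward R_i^attr: Jaccard similarity over extracted attributes.

    Float attributes match within relative tolerance `tol` (illustrative
    value); text/color/layout tokens require exact match.
    """
    def matches(a, b):
        if isinstance(a, float) and isinstance(b, float):
            return abs(a - b) <= tol * max(abs(a), abs(b), 1e-8)
        return a == b

    # Generated attributes that match some reference attribute; the union
    # size approximates |ref ∪ gen| under the tolerance.
    matched = {g for g in gen if any(matches(r, g) for r in ref)}
    union = len(ref) + len(gen) - len(matched)
    return len(matched) / union if union else 1.0

def visual_similarity(ref_feats, gen_feats) -> float:
    """Visual reward R_i^vis: mean cosine similarity across the four
    ResNet-18 residual-block feature vectors (extraction assumed upstream)."""
    def cos(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    return float(np.mean([cos(a, b) for a, b in zip(ref_feats, gen_feats)]))

def group_normalized_rewards(attr: list, vis: list) -> np.ndarray:
    """Total reward R_i = R_i^attr + R_i^vis, normalized within the group
    of M sampled candidates, GRPO-style (mean-centered, std-scaled)."""
    r = np.asarray(attr) + np.asarray(vis)
    return (r - r.mean()) / (r.std() + 1e-8)
```

The normalized group rewards then serve as advantages in the KL-regularized GRPO update.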
Experimental Results
ChartMaster-7B is evaluated on multiple benchmarks: ChartMimic, Plot2Code, and ChartX. It achieves the highest scores among open-source 7B-scale models and rivals GPT-4o on several metrics. Notably, ChartMaster-7B outperforms the larger Qwen2.5-VL-72B model on key benchmarks, even though that model produced ReChartPrompt's code labels (a distillation-like setting in which the student surpasses its teacher). Ablation studies confirm the additive benefits of ReChartPrompt and ChartSimRL, with the latter providing consistent gains in both semantic and visual alignment.
Further analysis of reward components reveals that Jaccard similarity is the most effective attribute metric, and ResNet-18-based visual similarity outperforms standard metrics such as MSE, SSIM, and PSNR. Qualitative results demonstrate that ChartMaster-7B produces charts with fine-grained visual details and semantic accuracy, closely matching ground truth references.
Implementation Considerations
- Data Pipeline: Automated extraction and filtering of chart images from arXiv papers, leveraging open-source MLLMs for chart classification and code generation.
 
- Model Training: SFT with large batch sizes and cosine annealing, followed by ChartSimRL with diverse candidate sampling and group-based reward normalization.
 
- Reward Computation: Efficient attribute extraction and CNN-based feature comparison, with robust handling of code execution failures.
 
- Scaling: The framework is designed for extensibility to larger models and datasets, with modular reward components that can be adapted to other chart formats or programming languages.
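Robust handling of code execution failures can be sketched by isolating each sampled candidate in its own subprocess, so an exception or a runaway loop in generated code yields a zero-reward rollout instead of crashing the trainer. The function name and the timeout value are assumptions, not details from the paper:

```python
import os
import subprocess
import sys
import tempfile

def run_candidate_code(code: str, timeout_s: float = 30.0) -> bool:
    """Execute one sampled candidate in an isolated subprocess.

    Returns False on any failure (non-zero exit, exception, or hitting
    the timeout), so the caller can assign zero reward and move on.
    """
    # Write the candidate to a temp script so the child gets a clean state.
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(
            [sys.executable, path],
            capture_output=True,
            timeout=timeout_s,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False  # runaway candidate: killed, treated as failed
    finally:
        os.unlink(path)
```

Process isolation also keeps candidate scripts from corrupting the trainer's interpreter state, which matters when executing M candidates per sample at scale.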
 
Implications and Future Directions
ChartMaster demonstrates that leveraging real-world chart data and multimodal RL can substantially improve chart-to-code generation, enabling practical applications in automated scientific reporting, data analysis, and intelligent question answering. The framework's modularity allows for extension to additional chart types, programming languages, and more sophisticated reward functions (e.g., incorporating human feedback or domain-specific metrics).
Theoretically, the work highlights the importance of multimodal reward design in RL for MLLMs, suggesting that future research should further explore joint semantic-visual alignment. Practically, ChartMaster provides a scalable blueprint for dataset construction and model optimization in domains where visual fidelity and code correctness are both critical.
Conclusion
ChartMaster integrates a diverse, real-world chart dataset and a multimodal RL algorithm to advance chart-to-code generation. The approach yields state-of-the-art results for open-source 7B-scale models, with performance approaching that of GPT-4o. The combination of ReChartPrompt and ChartSimRL establishes a new standard for both dataset diversity and model fidelity in chart-to-code tasks. Future work should focus on expanding chart formats, supporting additional programming languages, and refining reward functions for even greater generalization and applicability.