ReChartPrompt-240K: Chart-to-Code Dataset

Updated 3 July 2026
  • ReChartPrompt-240K is a comprehensive, multimodal dataset linking real-world chart images to executable Python/Matplotlib code.
  • It aggregates 242,479 validated image-prompt-code triplets from over 30,000 arXiv papers, ensuring diversity across 12 chart types.
  • The dataset underpins advanced training pipelines like ChartMaster, offering a scalable blueprint for chart-to-code generation and evaluation.

ReChartPrompt-240K is a large-scale, automatically curated dataset designed for the chart-to-code generation task, with a particular emphasis on diversity and realism. Centered on real-world charts extracted from arXiv papers, ReChartPrompt-240K enables the supervised training and evaluation of multimodal LLMs tasked with generating executable code (typically Python/Matplotlib) that reproduces a given chart image. It is integral to the development pipeline of ChartMaster, a state-of-the-art chart-to-code model (Tan et al., 25 Aug 2025).

1. Data Acquisition and Chart Extraction

ReChartPrompt-240K was constructed by processing 30,071 arXiv papers, including LaTeX source and embedded figures in PDF, PNG, or JPEG formats. The selection drew from diverse subject areas by targeting papers published in leading venues such as ICLR and TPAMI. From an initial pool of 288,992 images, the Qwen2.5-VL-72B vision-language model was used to classify each image into one of 12 predefined chart categories, including bar, line, scatter, histogram, pie, box plot, heatmap, area, violin, and radar charts. Images not assigned to any of these chart types were discarded, which restricts the dataset to genuine charts while preserving the visual heterogeneity of the source figures.

2. Prompt and Code Generation Protocol

Each classified chart image was paired with one of 20 instruction templates (P_rechart), such as “<Real-World Chart> Please generate Python matplotlib code to recreate the picture shown.” The Qwen2.5-VL-72B model was prompted using vLLM (temperature 0.1, top-p 0.9) to generate candidate Python/Matplotlib snippets for each image–prompt pair. The generated code was systematically tested in an execution sandbox; only code that ran successfully without import, syntax, or data errors was retained. No additional human annotation or manual cleaning was employed beyond the success criterion of clean execution.
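The execution filter described above can be sketched as follows. This is a minimal illustration, not the authors' sandbox: the function names `runs_cleanly` and `filter_candidates`, the 30-second timeout, and the use of a plain subprocess are assumptions. A production sandbox would also need isolation from the network and file system.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def runs_cleanly(code: str, timeout_s: float = 30.0) -> bool:
    """Return True if `code` exits with status 0 in a fresh interpreter.

    Hypothetical helper: the timeout and subprocess isolation are
    assumptions, not details taken from the source.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "candidate.py"
        script.write_text(code, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,
                capture_output=True,
                timeout=timeout_s,
                # Headless Matplotlib backend so plt.show() does not block.
                env={**os.environ, "MPLBACKEND": "Agg"},
            )
        except subprocess.TimeoutExpired:
            return False
    # Import, syntax, and runtime errors all produce a non-zero exit status.
    return result.returncode == 0


def filter_candidates(candidates):
    """Keep only the candidate snippets that execute without error."""
    return [code for code in candidates if runs_cleanly(code)]
```

Running each candidate in its own interpreter keeps a crash or a hang in one snippet from affecting the others, which matters when hundreds of thousands of generated scripts are screened.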

3. Dataset Structure, Instances, and Filtering

The resulting dataset consists of 242,479 triplets:

  • $I_i$: the original chart image (after possible PDF-to-raster conversion);
  • $T_i$: a single prompt from the P_rechart template pool;
  • $Y_i$: the self-contained executable Python/Matplotlib code that reproduces the chart.

Formally, the dataset is represented as

$$\mathcal{D} = \{ (I_i, T_i, Y_i) \}_{i=1}^{N}, \quad N = 242{,}479$$

where each code snippet includes import statements, data definitions, plotting calls (e.g., plt.bar, plt.plot), style customization (colors, markers, legends), and ends with plt.show() or its equivalent. Approximately 46,500 initial image–prompt–code triplets were discarded due to runtime failures, leaving 242,479 valid examples. The dataset contains rich metadata per instance—chart type label, prompt ID, image size/aspect ratio, and the generated code—but no explicit annotations of chart semantics or ground-truth data series.
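One way to picture a single instance is as a small record holding the triplet plus its metadata. The sketch below is illustrative only: the class name `ChartTriplet`, the field names, the file path, and the sample values are hypothetical and do not reflect the released file format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChartTriplet:
    """One (image, prompt, code) instance; all field names are hypothetical."""

    image_path: str  # I_i: rasterized chart image
    prompt: str      # T_i: one template from the P_rechart pool
    code: str        # Y_i: self-contained Python/Matplotlib script
    chart_type: str  # label assigned by the classifier
    prompt_id: int   # index of the template that was used
    width: int       # image width in pixels
    height: int      # image height in pixels

    @property
    def aspect_ratio(self) -> float:
        return self.width / self.height


# Illustrative values only; this is not a real dataset entry.
example = ChartTriplet(
    image_path="figures/example_bar.png",
    prompt="<Real-World Chart> Please generate Python matplotlib code "
           "to recreate the picture shown.",
    code=(
        "import matplotlib.pyplot as plt\n"
        "labels = ['A', 'B', 'C']\n"
        "values = [3, 5, 2]\n"
        "plt.bar(labels, values, color='#1f77b4', label='score')\n"
        "plt.legend()\n"
        "plt.show()\n"
    ),
    chart_type="bar",
    prompt_id=0,
    width=1200,
    height=800,
)
```

The `code` field follows the shape described above: imports, data definitions, a plotting call with style options, and a final `plt.show()`.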

4. Dataset Diversity, Complexity, and Attribute Distribution

Diversity is a defining feature of ReChartPrompt-240K. Charts are sourced from 30,071 arXiv papers spanning machine learning, computer vision, statistics, signal processing, and related fields. The dataset encompasses broad visual, semantic, and stylistic variation:

  • 12 chart categories,
  • a wide vocabulary of text annotations and legend entries,
  • large variety in color palettes (named and hex codes),
  • substantial heterogeneity in layouts, aspect ratios, marker types, and grids.

Unique-attribute counts for text, numeric content, color, and layout tokens are substantially higher in ReChartPrompt-160K (a representative 160K subset) than in earlier datasets such as Chart2Code-160K. While no single diversity index is reported, these raw counts illustrate the compositional complexity, with marked increases in textual and structural attributes.
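The source does not say how the unique-attribute counts are computed. As a rough sketch, and purely as an assumption, one could collect the literals in each code snippet with Python's `ast` module and count distinct values across the dataset. The function name `unique_attributes` and the three buckets are hypothetical, and layout tokens are omitted.

```python
import ast
import re

# Matches "#rgb" and "#rrggbb" strings; named colors are not detected here
# and fall into the text bucket.
HEX_COLOR = re.compile(r"^#(?:[0-9a-fA-F]{3}|[0-9a-fA-F]{6})$")


def unique_attributes(snippets):
    """Count distinct text, numeric, and hex-color literals across snippets."""
    text, numbers, colors = set(), set(), set()
    for code in snippets:
        try:
            tree = ast.parse(code)
        except SyntaxError:
            continue  # skip snippets that do not parse
        for node in ast.walk(tree):
            if not isinstance(node, ast.Constant):
                continue
            value = node.value
            if isinstance(value, bool):
                continue  # True/False are flags, not numeric chart content
            if isinstance(value, str):
                (colors if HEX_COLOR.match(value) else text).add(value)
            elif isinstance(value, (int, float)):
                numbers.add(value)
    return {"text": len(text), "numeric": len(numbers), "color": len(colors)}


counts = unique_attributes([
    'plt.plot([1, 2, 3], color="#ff0000", label="loss")',
    'plt.bar(["a"], [2], color="#00f")',
])
```

Comparing such counts between two datasets of equal size gives a crude view of which one covers more distinct labels, values, and colors.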

5. Downstream Utilization and Evaluation Protocols

ReChartPrompt-240K is used for supervised fine-tuning (SFT) and reinforcement learning (RL) in the training of chart-to-code models:

  • Stage 1: SFT is performed on the entire dataset (242,479 triplets) using Qwen2.5-VL-7B, with a learning rate of 2e-5, batch size 128, and one epoch.
  • Stage 2: A random subset of roughly 10% of the data (24,000 samples) is used for RL fine-tuning with ChartSimRL, a GRPO-based RL algorithm that uses a chart similarity reward (learning rate 5e-6, with M=4 code candidates per sample).
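GRPO-style methods typically score each of the M candidates for a sample relative to the other candidates in its group. The sketch below shows that group-relative normalization for M=4 rewards. Whether ChartSimRL uses exactly this form is not stated here. The function name, the population standard deviation, and the `eps` term are assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one group of candidates.

    Assumed form: (r - group mean) / (group std + eps), with the
    population standard deviation.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical similarity rewards for M=4 code candidates of one chart.
advantages = group_relative_advantages([0.2, 0.4, 0.6, 0.8])
```

Candidates that reproduce the chart better than their group average receive a positive advantage and the rest a negative one, so no separate value model is needed to provide a baseline.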

Downstream evaluation is conducted on held-out, manually curated benchmarks (ChartMimic, Plot2Code, ChartX), whose constituent papers were explicitly excluded from ReChartPrompt. There is no internal train/val/test split within the dataset itself; all triplets are used for model fitting, and generalization is assessed externally.

6. Attribute Extraction and Similarity Metrics

ChartSimRL, the RL pipeline applied post-SFT, incorporates metrics derived from attribute and visual similarity between generated and reference chart/code pairs. Attribute similarity is measured using a Jaccard index over sets of semantic attributes:

$$R^{\mathrm{attr}}_i = \frac{|\mathcal{A}_i \cap \mathcal{A}^*|}{|\mathcal{A}_i \cup \mathcal{A}^*|}$$

where $\mathcal{A}_i = G(\hat I_i, O_i)$ and $\mathcal{A}^* = G(I, Y)$. Visual similarity is calculated as the cosine similarity over CNN-derived feature vectors for four selected layers, averaged per sample:

$$R^{\mathrm{vis}}_i = \frac{1}{4}\sum_{k=1}^{4} \cos(\mathbf{f}_k, \hat{\mathbf{f}}^{(i)}_k)$$

with features $\mathbf{f}_k$ from $\mathcal{F}_k(I)$ and $\hat{\mathbf{f}}^{(i)}_k$ from $\mathcal{F}_k(\hat I_i)$ (where $\mathcal{F}_k$ extracts CNN features and $\hat I_i$ is the image generated by executing $O_i$). These metrics not only serve as reinforcement signals during RL, but also quantify the dataset's multimodal richness and the gap between code-based and perceptual chart reproduction.
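Both rewards are simple to compute once the attribute sets and feature vectors are available. The sketch below takes them as plain Python inputs. The attribute extractor and the CNN are not reproduced, and the function names, the empty-set convention, and the zero-vector convention are assumptions.

```python
import math


def attribute_reward(generated_attrs, reference_attrs):
    """Jaccard index between two attribute sets."""
    a, b = set(generated_attrs), set(reference_attrs)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty sets count as identical
    return len(a & b) / len(union)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumed convention for zero vectors
    return dot / (norm_u * norm_v)


def visual_reward(reference_feats, generated_feats):
    """Mean cosine similarity over per-layer feature vectors."""
    if len(reference_feats) != len(generated_feats):
        raise ValueError("both inputs need one feature vector per layer")
    sims = [cosine(f, g) for f, g in zip(reference_feats, generated_feats)]
    return sum(sims) / len(sims)


# Hypothetical attribute sets: two of four distinct attributes are shared.
r_attr = attribute_reward(
    {"type:bar", "color:red", "title:Loss"},
    {"type:bar", "color:blue", "title:Loss"},
)

# Hypothetical features for four layers; two layers match, two are orthogonal.
reference = [[1.0, 0.0], [0.0, 2.0], [1.0, 0.0], [0.0, 1.0]]
generated = [[2.0, 0.0], [0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
r_vis = visual_reward(reference, generated)
```

In this toy case the attribute reward is 2/4 and the visual reward is the mean of two perfect and two zero layer similarities, which shows how the two signals can differ for the same candidate.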

7. Dataset Impact and Intended Use

ReChartPrompt-240K addresses two fundamental bottlenecks in chart-to-code research: insufficient data diversity and lack of multimodal consistency supervision. By leveraging real, human-designed charts instead of synthetic seeds, it exposes models to the full variability encountered in practical scientific communication. Its scale and heterogeneity facilitate both general-purpose SFT and targeted RL for improving visual-semantic fidelity. The open protocol—comprising arXiv harvesting, automated classification, prompt-driven code generation, and rigorous filtering—provides a reproducible, extensible blueprint for future large-scale, multimodal dataset creation. All resources, including code, are publicly available at https://github.com/WentaoTan/ChartMaster (Tan et al., 25 Aug 2025).
