
VisCode-200K: Robust Python Visualization Code Dataset

Updated 30 June 2025
  • VisCode-200K is a large-scale dataset that provides instruction-tuned Python visualization code with execution validation and self-debug capabilities.
  • It integrates both single-turn and multi-turn correction dialogues to improve initial code synthesis and iterative error recovery.
  • Benchmark evaluations show significant performance gains over previous open-source models for generating and self-correcting visualization code.

VisCode-200K is a large-scale instruction tuning dataset and benchmark for robust, executable Python visualization code generation and iterative code correction. Developed within the VisCoder framework, it addresses both the need for grounded, execution-validated training data and effective self-debugging supervision, substantially advancing model performance for code-to-visualization tasks in comparison to previous open-source approaches.

1. Dataset Design and Composition

VisCode-200K comprises over 200,000 examples, structured explicitly for Python-based plotting and code repair. The dataset draws on two primary sources:

  1. Validated Visualization Code from Open-Source Repositories
    • Source datasets include the Python subset of stack-edu and the chart/table partitions of CoSyn-400K.
    • Code samples are filtered by the presence of relevant visualization library imports (matplotlib, seaborn, plotly, etc.).
    • Code blocks are extracted via automated tools (e.g., with GPT-4o-mini as an extractor for stack-edu), modified with mock or preview data, and tested for runtime executability by running in isolated Jupyter environments using nbconvert. Scripts that fail to execute or do not produce a plot image are removed.
    • This process yields 105K validated scripts from stack-edu and 50K from CoSyn-400K.
  2. Multi-turn Correction Dialogues from Code-Feedback
    • Derived from the Code-Feedback dataset, 45,000 multi-turn dialogues are included, featuring iterative correction: each dialogue consists of an initial user instruction, an initial code response, runtime feedback (such as error messages), and one or more rounds of revised code based on new instructions or observed errors.

Examples are paired with natural language instructions generated by GPT-4o, which reference not only the code but also the rendered plot. The instructions follow a structured template comprising setup (language/library requirements), data description, a mock or preview data block, a high-level plot specification, and style details.
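The filtering pipeline above can be sketched in two stages: an import check, then a runtime check. The snippet below is a simplified stand-in, assuming the paper's Jupyter/nbconvert execution harness can be approximated with a plain subprocess; the regular expression, function names, and the appended `savefig` call are illustrative, not the authors' exact implementation.

```python
import re
import subprocess
import sys
import tempfile
from pathlib import Path

# Stage 1: keep only scripts that import a supported plotting library.
VIS_IMPORT = re.compile(r"^\s*(?:import|from)\s+(matplotlib|seaborn|plotly)\b", re.M)

def has_vis_import(code: str) -> bool:
    """True if the script imports matplotlib, seaborn, or plotly."""
    return bool(VIS_IMPORT.search(code))

def executes_and_plots(code: str, timeout: int = 60) -> bool:
    """Stage 2: the script must run cleanly and produce a plot image.

    Simplified stand-in for the isolated-Jupyter/nbconvert setup described
    above: run in a subprocess and append a savefig call so success is
    observable as a file on disk.
    """
    with tempfile.TemporaryDirectory() as tmp:
        out_png = Path(tmp) / "plot.png"
        wrapped = (code
                   + "\nimport matplotlib.pyplot as plt"
                   + f"\nplt.savefig({str(out_png)!r})\n")
        proc = subprocess.run([sys.executable, "-c", wrapped],
                              capture_output=True, timeout=timeout, cwd=tmp)
        return proc.returncode == 0 and out_png.exists()
```

Scripts failing either stage are discarded, mirroring the removal of non-executing or plot-less samples described above.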

2. Instruction Structuring and Correction Supervision

Each VisCode-200K instance includes a rich, context-aware instruction joined with the corresponding code and, when applicable, rendered plot output. The instruction template uses discrete fields to maximize information content and explicitness:

Write Python code using matplotlib to generate a scatter plot of Age vs Salary.
The first two rows of the data are:
Age,Salary
25,50000
30,60000
Style: Use green markers with a dashed grid.

Multi-turn correction dialogues from Code-Feedback are integrated alongside single-turn instances, exposing models to realistic feedback-driven debugging. The dialogue flow typically involves a user prompt, initial code, an execution error, followed by a correction request and a code fix:

User: Plot the average score by department.
Model: (initial, buggy code)
Result: KeyError: 'department'
User: Please address the KeyError and ensure the department column is used.
Model: (bug-fixed code)

This dual structure of single-turn synthesis and incremental correction ensures models are competent in both first-attempt generation and iterative error recovery.
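The two sample types above can be represented in one unified chat schema. The sketch below assumes a standard messages-list format; the exact field names in VisCode-200K are not specified in this summary, so they are illustrative.

```python
# Hypothetical unified chat schema for single-turn and multi-turn samples;
# field names ("messages", "role", "content") follow common chat-format
# conventions and are assumptions, not the dataset's confirmed schema.

single_turn = {
    "messages": [
        {"role": "user", "content": "Write Python code using matplotlib to "
                                    "generate a scatter plot of Age vs Salary. ..."},
        {"role": "assistant", "content": "import matplotlib.pyplot as plt\n# ..."},
    ]
}

multi_turn = {
    "messages": [
        {"role": "user", "content": "Plot the average score by department."},
        {"role": "assistant", "content": "# initial, buggy code\n# ..."},
        {"role": "user", "content": "Execution failed with KeyError: "
                                    "'department'. Please fix the code."},
        {"role": "assistant", "content": "# corrected code\n# ..."},
    ]
}

def n_correction_rounds(sample: dict) -> int:
    """Count correction rounds: every user turn after the first one."""
    users = [m for m in sample["messages"] if m["role"] == "user"]
    return len(users) - 1
```

Under this schema, single-turn samples have zero correction rounds and Code-Feedback dialogues have one or more.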

3. Fine-Tuning Protocol

VisCode-200K is used to fine-tune Qwen2.5-Coder-Instruct (3B and 7B variants) with full-parameter tuning. The protocol features:

  • Three epochs
  • Learning rate of 5 × 10⁻⁶
  • Warmup ratio of 0.05 and cosine decay scheduler
  • Precision: bfloat16
  • Hardware: 8 × NVIDIA A100 GPUs, batch size 128

Combined single-turn and multi-turn dialogue samples are jointly shuffled, ensuring that model exposure to library, task, and error diversity is preserved.
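The hyperparameters above can be collected into a single configuration. The dict below is a hedged sketch: the field names follow common fine-tuning conventions rather than the authors' exact training scripts, and the per-device batch arithmetic simply makes the global batch size concrete.

```python
# Illustrative fine-tuning configuration mirroring the protocol listed above;
# key names are conventional, not the authors' exact setup.
config = {
    "model": "Qwen/Qwen2.5-Coder-7B-Instruct",
    "epochs": 3,
    "learning_rate": 5e-6,
    "warmup_ratio": 0.05,
    "lr_scheduler": "cosine",
    "precision": "bfloat16",
    "global_batch_size": 128,
    "num_gpus": 8,
}

def per_device_batch(cfg: dict, grad_accum_steps: int = 1) -> int:
    """Samples per GPU per optimizer step: global batch split across
    GPUs and any gradient-accumulation steps."""
    return cfg["global_batch_size"] // (cfg["num_gpus"] * grad_accum_steps)
```

With a global batch of 128 on 8 GPUs, each device processes 16 samples per step (or 8 with two gradient-accumulation steps, an assumption not stated in the source).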

4. Runtime Validation and Self-Debug Evaluation Protocol

Each code sample is validated for runtime executability and valid plot output. For benchmark evaluation, the PandasPlotBench suite is adopted, with the following protocol and metrics:

  • Incorrect Code Rate: Percentage of samples not yielding an executable plot.
  • Task Score: Degree of semantic alignment between generated plot and instruction, as judged by GPT-4o.
  • Visual Score: Visual similarity (generated vs. reference plot), also GPT-4o-scored.
  • Execution Pass Rate: The proportion of outputs that execute successfully without errors (introduced in VisCoder).

To evaluate ability to recover from failures, the self-debug evaluation protocol iteratively prompts the model for up to three retries upon failure, each incorporating the original instruction, failed code, an error message, and a correction request:

\begin{algorithm}[!h]
\caption{Self-Debug Evaluation Protocol}
\label{alg:self-debug}
\begin{algorithmic}[1]
\STATE Let $F$ be the set of failed tasks from the initial evaluation
\FOR{$i = 1$ to $3$}
  \FOR{each task $t$ in $F$ not yet fixed}
    \STATE Fix $t$ via feedback-driven prompting
    \STATE Evaluate the result of the revised code
    \IF{successful}
      \STATE Mark $t$ as fixed
    \ELSE
      \STATE Record $t$'s latest failed output
    \ENDIF
  \ENDFOR
\ENDFOR
\STATE Evaluate all tasks with final recorded outputs
\end{algorithmic}
\end{algorithm}

This protocol simulates the iterative developer debugging loop and is used to measure both initial and eventual successful generation rates.
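The protocol above translates directly into a retry loop. In the sketch below, `generate_fix` and `run_and_check` are hypothetical stand-ins for the model's feedback-driven prompt and the sandboxed executor; the function and field names are assumptions for illustration.

```python
from typing import Callable, Dict

def self_debug(failed: Dict[str, dict],
               generate_fix: Callable[[dict], str],
               run_and_check: Callable[[str], bool],
               max_retries: int = 3) -> Dict[str, str]:
    """Retry each failed task up to `max_retries` times with error feedback.

    `failed` maps task id -> state dict (instruction, failed code, error);
    returns the final code recorded per task, fixed or last attempt.
    """
    final: Dict[str, str] = {}
    fixed: set = set()
    for _ in range(max_retries):
        for task_id, state in failed.items():
            if task_id in fixed:
                continue
            # The prompt incorporates the original instruction, the failed
            # code, the error message, and a correction request.
            revised = generate_fix(state)
            final[task_id] = revised
            if run_and_check(revised):
                fixed.add(task_id)           # mark as fixed; stop retrying
            else:
                state["code"] = revised      # record latest failed output
    return final
```

Tasks fixed in an earlier round are skipped in later rounds, and the final recorded outputs are what the benchmark scores, matching the loop structure of Algorithm 1.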

5. Comparative Model Performance

On the PandasPlotBench suite, VisCoder (a Qwen2.5-Coder-Instruct model fine-tuned on VisCode-200K) achieves:

  • Execution pass rates that rival or surpass proprietary models such as GPT-4o-mini on seaborn and plotly, and approach parity on matplotlib (e.g., 87.4% for VisCoder-7B on matplotlib, rising to 91.4% in self-debug mode).
  • Higher rates of semantically and visually "Good" outputs (score ≥ 75 on the GPT-4o metric) than open-source baselines, especially for complex or compound library usage.
  • Ablation experiments show that multi-turn correction dialogues and library-diverse instruction both contribute substantially to robustness and recovery from errors.

6. Applications and Implications

VisCode-200K sets a new standard for executable, feedback-grounded visualization instruction tuning, enabling several applications:

  • Conversational data analysis agents with robust visualization generation and code repair capabilities.
  • Educational assistants that can stepwise repair plotting code, demonstrating both synthesis and debugging.
  • Automated reporting and analytics tools that require generating, validating, and, if needed, self-correcting code for diverse visualization tasks across matplotlib, seaborn, and plotly libraries.

The explicit inclusion of multi-turn debugging, aggressive runtime validation, and domain-specific instruction templating yields models that generalize well, are robust to runtime errors, and approach the practicality of proprietary solutions.

7. Outlook

The release of VisCode-200K introduces a broader paradigm for executable code instruction tuning datasets:

  • Emphasis on runtime validation and visually-grounded instruction, moving beyond synthetic or purely static code datasets.
  • Scalable protocols for feedback-driven error correction, facilitating high-value code intelligence for both human-in-the-loop and autonomous systems.
  • Benchmarking and evaluation methods (such as the self-debug protocol) that tightly couple model development to real-world deployment conditions for code synthesis.

For further detail, see the VisCoder project and PandasPlotBench benchmark documentation.