VisCoder: Visual Code Generation & Debugging

Updated 30 June 2025
  • VisCoder is a comprehensive framework that fuses visual data processing, code generation, and self-correction to create executable Python visualization code.
  • It utilizes the large-scale VisCode-200K dataset and multi-turn dialogue feedback to iteratively refine plotting scripts from natural language inputs.
  • Fine-tuned from Qwen2.5-Coder-Instruct models, VisCoder achieves high execution pass rates, enhancing reliability in real-world visualization tasks.

VisCoder refers to a suite of technologies and research advances centered on the intersection of visual data, coding, and code generation. Recent literature applies the term to diverse domains, including visualization information embedding, visualization code synthesis, and vision-integrated coding frameworks. The following provides a detailed overview of VisCoder as the name for a family of approaches, with a focus on fine-tuning LLMs for executable Python visualization code generation (arXiv:2506.03930).

1. Instruction-Tuning Dataset: VisCode-200K

VisCode-200K is a large-scale dataset specifically constructed for high-fidelity visualization code generation and iterative self-correction. It contains over 200,000 examples, sourced from open repositories and code-feedback dialogues:

  • Executable Visualization Code from Repositories:
    • Sourced from datasets such as Stack-Edu and CoSyn-400K.
    • Programmatic filtering targets plotting code using matplotlib, seaborn, and plotly.
    • Each example is rendered and validated in a Jupyter environment; only samples passing execution and visual-output checks are retained (roughly 105k from Stack-Edu and 50k from CoSyn-400K). A sketch of such a check appears at the end of this section.
    • Mock data replacement ensures that code blocks are independently runnable; data previews are included where feasible.
    • For each plot, an LLM (GPT-4o-mini) generates a structured natural-language instruction, partitioned into setup, data description, plot description, and style details.
  • Multi-turn Correction Dialogues (Code-Feedback):
    • 45k dialogue chains extracted from the Code-Feedback dataset capture the iterative process of debugging Python code—critical for learning error-driven code repair (although not exclusive to visualization code).
    • Each dialogue includes runtime errors and subsequent code edits, allowing models to learn how to revise faulty code based on actual feedback.

This combined resource supports models in learning both executable code generation—grounded in real visual outputs—and robust self-correction through feedback.
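
The execution-based filtering described above can be pictured with a minimal sketch. The helper below is hypothetical (it is not the authors' released validation code): it executes a candidate script under a headless matplotlib backend and keeps it only if the script runs without error and renders at least one figure; a corresponding check for plotly would inspect the returned figure object instead.

import io
import contextlib
import matplotlib
matplotlib.use("Agg")  # headless backend for automated validation
import matplotlib.pyplot as plt

def passes_execution_check(code: str) -> bool:
    """Hypothetical filter: keep a sample only if its code runs and renders a figure."""
    plt.close("all")
    try:
        with contextlib.redirect_stdout(io.StringIO()):
            exec(code, {"__name__": "__main__"})  # run the candidate plotting script
    except Exception:
        return False                              # runtime error: discard the sample
    return len(plt.get_fignums()) > 0             # require at least one rendered figure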

2. Model Development and Training Paradigm

The VisCoder model is obtained by full-parameter fine-tuning of Qwen2.5-Coder-Instruct at both the 3B and 7B parameter scales:

  • Training Configuration:
    • Three epochs, learning rate 5×10⁻⁶, cosine decay, warm-up ratio 0.05 (see the configuration sketch after this list).
    • Full-parameter update (not using adapters or LoRA), batch size 128, executed on 8×A100 GPUs.
    • The training data mixes single-turn instruction completion (from the validated instruction/code pairs) with multi-turn correction (from the Code-Feedback dialogues), requiring the model to handle both initial code synthesis and iterative code repair.
  • Prompt Engineering:
    • Inputs are formatted to unify problem description, code, data fragment, and error context.
    • For code correction rounds, previous code, error messages, and conversational context are included.
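
As a rough illustration, the reported hyperparameters map directly onto a Hugging Face-style trainer configuration; the snippet below is an assumption about tooling, not the authors' released training script.

from transformers import TrainingArguments

# Hypothetical mapping of the reported hyperparameters onto TrainingArguments;
# the actual training stack used by the authors may differ.
training_args = TrainingArguments(
    output_dir="viscoder-7b",
    num_train_epochs=3,
    learning_rate=5e-6,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    per_device_train_batch_size=4,   # 4 per GPU x 8 GPUs x 4 accumulation steps = effective batch 128
    gradient_accumulation_steps=4,
    bf16=True,
)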

This results in a model attuned to both synthesizing correct plotting code from natural language requirements and debugging its own outputs based on error traces.
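
Concretely, the two data types described above can be pictured as chat-style samples of the following shape; field contents and formatting are illustrative assumptions, not the released data schema.

# Illustrative single-turn sample: structured instruction -> plotting code.
single_turn = [
    {"role": "user", "content": (
        "Setup: use matplotlib.\n"
        "Data: DataFrame with columns month, sales (preview: Jan 120, Feb 95, ...).\n"
        "Plot: bar chart of sales per month.\n"
        "Style: default theme, labeled axes.")},
    {"role": "assistant", "content": "import matplotlib.pyplot as plt\n# ... plotting code ..."},
]

# Illustrative multi-turn correction sample: error trace -> revised code.
multi_turn = [
    {"role": "user", "content": "Plot a seaborn heatmap of the correlation matrix of df."},
    {"role": "assistant", "content": "import seaborn as sns\nsns.heatmap(corr, annot=True)"},
    {"role": "user", "content": "Execution failed:\nNameError: name 'corr' is not defined\nPlease fix the code."},
    {"role": "assistant", "content": "import seaborn as sns\ncorr = df.corr()\nsns.heatmap(corr, annot=True)"},
]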

3. Evaluation Protocols and Benchmarks

Evaluation of VisCoder focuses on both execution reliability and the semantic/visual correctness of the generated plots. The main benchmark adopted is PandasPlotBench:

  • Metrics Used (a minimal computation sketch follows this list):
    • Execution Pass Rate: Percentage of plots rendered without error.
    • Incorrect Code Rate: Fraction of results failing to return a plot.
    • GPT-4o-judged Task and Visual Scores: Semantic alignment with the original task and visual output similarity, respectively (on a 0–100 scale).
    • Good-at-75: Proportion of outputs scoring at least 75 in semantic or visual alignment.
  • Self-Debug Evaluation Protocol:
    • For failed tasks, up to three iterative self-debug rounds are initiated.
    • Each round involves re-prompting the model with NL instruction, prior code, and error trace.
    • Execution is checked after each iteration; successful corrections are retained.
    • This protocol explicitly measures the model’s feedback-driven repair capability, key for real-world data workflows.
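
These metrics reduce to simple aggregates over per-task results; a minimal sketch is shown below (the field names are assumptions for illustration, not the PandasPlotBench API).

def summarize(results):
    # results: list of per-task dicts with keys "executed", "returned_plot",
    # "task_score", and "visual_score" (judge scores on a 0-100 scale).
    n = len(results)
    return {
        "execution_pass_rate": 100 * sum(r["executed"] for r in results) / n,
        "incorrect_code_rate": 100 * sum(not r["returned_plot"] for r in results) / n,
        "mean_task_score":     sum(r["task_score"] for r in results) / n,
        "mean_visual_score":   sum(r["visual_score"] for r in results) / n,
        "good_at_75_task":     100 * sum(r["task_score"] >= 75 for r in results) / n,
        "good_at_75_visual":   100 * sum(r["visual_score"] >= 75 for r in results) / n,
    }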

4. Results and Comparative Performance

  • VisCoder-3B and VisCoder-7B outperform untuned and baseline open-source models by 11–36 percentage points in pass rates.
  • VisCoder-7B achieves benchmark pass rates of 87.4% (matplotlib), 76.6% (seaborn), and 74.3% (plotly), surpassing GPT-4o-mini on seaborn and plotly, and approaching proprietary GPT-4o’s performance when iterative self-debugging is enabled.
  • Under the self-debug protocol, VisCoder-7B reaches >90% pass rate across matplotlib and seaborn after iterative correction, significantly reducing error rates through multi-turn feedback.

Execution pass rate by plotting library (PandasPlotBench):

Model          Matplotlib   Seaborn   Plotly
VisCoder-7B    87.4%        76.6%     74.3%
GPT-4o         94.9%        83.4%     87.4%
GPT-4o-mini    81.7%        62.3%     69.1%
Qwen2.5-7B     78.3%        58.3%     48.0%

Improvement in code robustness and semantic consistency is attributed to the mixture of validated executable code and dialogue-driven repair supervision.

5. Methodological Contributions and Design

  • Execution-Grounded Supervision: All examples in the training set are validated in a real runtime, ensuring that the models not only produce plausible code but code that executes correctly and yields the intended plot.
  • Feedback-Driven Iteration: Incorporation of multi-turn code feedback dialogue enables the model to parse runtime errors and repair faulty code, closely emulating the iterative nature of human coding.
  • Prompt Structuralization: The use of distinct prompt sections for setup, data, plot, and style ensures that the model is exposed to both semantic and syntactic requirements of visualization scripting.

The following pseudocode (adapted from the paper) summarizes the self-debug loop:

# F0: the set of tasks that failed the initial evaluation
failed = set(initial_failures)
for i in range(1, K + 1):                       # up to K self-debug rounds (K = 3 in the protocol)
    still_failed = set()
    for x in failed:                            # only tasks not yet fixed are re-attempted
        fixed_code = VisCoder.fix(x, error_trace=x.error, previous_code=x.code)
        if executes_successfully(fixed_code):   # re-execute the corrected script
            mark_fixed(x, fixed_code)           # successful corrections are retained
        else:
            still_failed.add(x)                 # record the failure for the next round
    failed = still_failed

6. Future Implications and Research Directions

  • Extensibility: The paradigm of execution-and-feedback-grounded code synthesis can be broadened to support additional languages (R, JavaScript) and visualization libraries (ggplot, D3.js).
  • Human-in-the-Loop Debugging: VisCoder’s feedback-driven correction suggests potential for integration into collaborative or agentic code environments.
  • Enhanced Evaluation: Further development of metrics and benchmarks could involve more direct semantic and visual plot alignment and user-in-the-loop assessments.
  • Agentic Data Analytics: The underlying mixture of code generation and iterative self-correction forms the foundation for future autonomous data analyst agents.

7. Summary and References

VisCoder, as described in the 2025 paper, represents a practical realization of LLM-based visualization code generation that is robust to execution errors and semantically grounded. Through the introduction of VisCode-200K and a multi-turn, feedback-driven fine-tuning approach, it establishes new state-of-the-art results for open-source models on PandasPlotBench, approaching or matching proprietary models in practical performance. All dataset details, evaluation setups, and code structures are released by the authors for community benchmarking and extension.

For specifics on training details, dataset structure, and evaluation procedures, refer to the VisCoder project page: https://tiger-ai-lab.github.io/VisCoder/.

References

1. VisCoder: Fine-Tuning LLMs for Executable Python Visualization Code Generation (arXiv:2506.03930).