PandasPlotBench: LLM Visualization Benchmark

Updated 30 June 2025
  • PandasPlotBench is a human-curated benchmark that assesses LLMs' capability to translate natural language instructions into accurate plotting code for data visualizations.
  • It comprises 175 real-world tasks with diverse challenges including data aggregation, facet plotting, and multi-series charting using curated CSV datasets and reference Matplotlib scripts.
  • The evaluation protocol leverages visual and task scoring along with code correctness metrics to benchmark performance across different plotting libraries like Matplotlib, Seaborn, and Plotly.

PandasPlotBench is a human-curated benchmark explicitly constructed to assess the ability of LLMs to generate correct and visually faithful code for data visualization tasks involving Pandas DataFrames. The benchmark is structured to systematically evaluate LLMs’ capacity to translate natural language instructions into plotting code using major Python libraries, and is designed with practical, extensible evaluation and detailed ground-truth comparisons at its core.

1. Definition and Purpose

PandasPlotBench is designed to evaluate LLMs as visual data exploration assistants, focusing on the generation of plotting code from natural language prompts and tabular data. Distinct from code completion or functional correctness benchmarks, it tests whether models can (a) interpret user instructions of varying specificity, (b) correctly manipulate Pandas DataFrames, and (c) use plotting APIs to create visualizations matching task descriptions both semantically and visually. The benchmark’s primary goal is to provide a rigorous, reproducible, and extensible evaluation protocol for current and future generations of LLMs tasked with scientific and exploratory analytics.
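Concretely, each evaluation episode hands the model a natural language instruction together with the loaded DataFrame and a target plotting library. The template below is purely illustrative of how those ingredients combine; it is not the benchmark's actual prompt, and the wording and field layout are assumptions.

```python
import pandas as pd

def build_prompt(instruction: str, df: pd.DataFrame, library: str = "matplotlib") -> str:
    """Assemble a plotting request from a task instruction, a preview of the
    DataFrame, and the target plotting library (illustrative template only)."""
    return (
        "You are given a Pandas DataFrame `df` with the following head:\n"
        f"{df.head().to_string()}\n\n"
        f"Task: {instruction}\n"
        f"Write Python code using {library} that produces the requested plot."
    )

# Toy example of a prompt an LLM might receive for a single task.
df = pd.DataFrame({"year": [2020, 2021, 2022], "sales": [14, 18, 25]})
print(build_prompt("Plot sales over time as a line chart.", df))
```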

2. Dataset Composition and Structure

The dataset underlying PandasPlotBench comprises 175 unique tasks, each derived from real plotting scenarios. Data generation leverages Matplotlib gallery scripts, which are manually split into a data preparation segment (with data serialized as CSV and loaded into a DataFrame) and a plotting segment (reference Matplotlib code and corresponding plot). Each benchmark task consists of:

  • A data file (CSV) containing tabular data relevant to the plotting task.
  • Minimal loader code for constructing a DataFrame.
  • The reference plotting code (typically Matplotlib-based), verified by human curation.
  • A ground-truth plot image.
  • Three natural language instructions (“detailed”, “short”, “single-sentence”) capturing varying prompt verbosity and completeness.

Tasks are selected and pruned to span a diverse range of real-world challenges: summarization, aggregation, facet plotting, multi-series and categorical charting, and various stylization requirements.
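For orientation, the hosted dataset can be pulled directly from the Hugging Face Hub. The snippet below is a minimal sketch: only the dataset ID is taken from the project page, while the split name and per-task column names are assumptions that may differ from the published schema.

```python
from datasets import load_dataset

# Dataset ID from the project page; split name is an assumption.
tasks = load_dataset("JetBrains-Research/plot_bench", split="test")

task = tasks[0]
# Each task is expected to carry the fields described above (assumed names):
# the CSV data, minimal loader code, reference Matplotlib plotting code,
# the ground-truth image, and the three instruction variants.
print(list(task.keys()))
```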

3. Evaluation Methodology

Evaluation consists of executing code generated by LLMs in response to task prompts and data, rendering the resulting visualizations in an automated environment. The assessment involves:

  • Visual Score: Using the GPT-4o multimodal model, the generated plot is compared against the reference image; the score (0–100) measures primary visual and semantic correspondence, not exact pixel similarity.
  • Task Score: The same model is prompted with the task to judge whether the output plot fulfills the written instructions (0–100), examining successful task completion beyond superficial similarity.
  • Code Correctness: Measured as the incorrect code rate, i.e., the proportion of tasks for which the generated code fails to execute at runtime.
  • Library Compliance: Whether the generated code uses the specified plotting library (Matplotlib, Seaborn, Plotly).
  • Good Score Ratio: Fraction of samples with scores ≥75, for both Visual and Task criteria.

A typical benchmark run involves submitting all task variants (each with its three prompt versions) to LLMs, collecting the outputs, and conducting automated evaluation corroborated by human review; a strong Pearson correlation (0.85) is reported between model-based and human judgments.
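Given per-task outcomes (execution status plus the two judge scores), the aggregate metrics listed above reduce to simple ratios. The sketch below is a minimal illustration with assumed record fields; in particular, whether non-executing tasks are excluded from score averages or counted as zeros is an assumption here, not a statement about the benchmark's exact bookkeeping.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    executed: bool        # did the generated code run without a runtime error?
    visual_score: float   # 0-100 judge score against the reference image
    task_score: float     # 0-100 judge score against the written instruction

def aggregate(results: list[TaskResult], threshold: float = 75.0) -> dict:
    """Compute incorrect code rate, mean scores, and Good Score Ratios."""
    n = len(results)
    ok = [r for r in results if r.executed]          # assumption: failures excluded from means
    return {
        "incorrect_code_pct": 100 * (n - len(ok)) / n,
        "mean_visual_score": sum(r.visual_score for r in ok) / len(ok),
        "mean_task_score": sum(r.task_score for r in ok) / len(ok),
        # Good Score Ratio: fraction of samples scoring at or above the threshold.
        "good_visual_ratio": sum(r.visual_score >= threshold for r in ok) / n,
        "good_task_ratio": sum(r.task_score >= threshold for r in ok) / n,
    }
```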

4. Benchmark Findings and Model Leaderboard

Results reveal significant variation in LLM plotting performance across libraries, models, and prompt verbosity:

  • GPT-4o and Claude 3.5 Sonnet demonstrate the highest task and visual scores, with Matplotlib and Seaborn being the most robustly supported libraries and Plotly presenting substantial challenges (22% incorrect code rate).
  • Prompt Length Effects: LLMs perform nearly as well with concise single-sentence prompts as with detailed instructions, provided prompts remain well-formed. Complete omission of the task description results in a collapse of model performance.
  • Open-Source LLMs: Large variants (e.g., Llama 3.1 405B) achieve competitive accuracy, although small models lag considerably.
  • Visualization Library: The leaderboard shows substantially poorer performance on Plotly across all models, attributable to less comprehensive training coverage and more complex API semantics compared to Matplotlib and Seaborn.

| Model | Incorrect Code (%) | Visual Score | Task Score |
|---|---|---|---|
| GPT-4o | 1.8 | 75 | 89 |
| Claude 3.5 Sonnet | 2.5 | 73 | 88 |
| Gemini 1.5 Pro | 1.7 | 71 | 81 |
| Llama 3.1 405B | 2.9 | 73 | 86 |

| Library | Incorrect Code (%) | Visual Score | Task Score |
|---|---|---|---|
| Matplotlib | 1.8 | 75 | 89 |
| Seaborn | 5.2 | 67 | 84 |
| Plotly | 22.0 | 59 | 68 |

5. Methodological Innovations and Evaluation Protocols

PandasPlotBench introduces several methodological features critical for robust LLM evaluation:

  • Instructional Modularity: By providing each plotting task in multiple linguistic forms (detailed, short, single-sentence), the dataset supports controlled studies in prompt engineering and the assessment of model robustness to varying user input.
  • Library Generalization: Each prompt can be instantiated for alternative Python plotting libraries. While Matplotlib is the primary ground truth for output images, models are instructed to use the experiment-specified library, thereby measuring both code generation fidelity and library competence.
  • Visual Comparison: Automated visual similarity scoring by state-of-the-art multimodal models mirrors user acceptance criteria, focusing on main data story and semantic mapping, rather than strict graphical reproduction.
  • Execution-Driven Assessment: The benchmark distinguishes between code correctness (runnable outputs) and visualization appropriateness (matching task semantics), providing a nuanced account of model reliability in practical toolchains.

Notably, the benchmark and its codebase are designed for modular extensions to new plotting frameworks and programming languages, as well as for augmentation with additional task types.
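The execution-driven side of this protocol can be approximated with a small harness that runs a generated snippet against the task's DataFrame in a scratch namespace, saves whatever figure it renders for the multimodal judge, and records runtime failures toward the incorrect-code metric. This is an illustrative sketch, not the benchmark's actual harness; untrusted code should be sandboxed in practice.

```python
import traceback
import pandas as pd
import matplotlib
matplotlib.use("Agg")           # headless rendering for automated runs
import matplotlib.pyplot as plt

def run_task(data_csv: str, generated_code: str, out_png: str) -> dict:
    """Execute LLM-generated plotting code against the task's DataFrame and
    save whatever figure it produces. Returns execution status and any error."""
    df = pd.read_csv(data_csv)
    namespace = {"pd": pd, "plt": plt, "df": df}
    try:
        exec(generated_code, namespace)   # untrusted code: sandbox in practice
        plt.gcf().savefig(out_png)
        return {"executed": True, "error": None, "image": out_png}
    except Exception:
        return {"executed": False, "error": traceback.format_exc(), "image": None}
    finally:
        plt.close("all")
```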

6. Comparative Context and Research Extensions

PandasPlotBench is positioned among a growing ecosystem of benchmarking tools measuring LLM and dataframe library utility. It complements, but is distinct from, PandasBench, which focuses on general Pandas API usage for data wrangling, eschewing visualization. PandasPlotBench emphasizes:

  • Grounded end-to-end evaluation from prompt receipt to rendered plot.
  • Measurement of both syntactic and semantic errors specific to plotting, such as invalid API usage, incomplete line series, malformed facets, or incorrect color mappings.
  • Facilitation of prompt engineering and user experience research through its split-instruction design.

Recent work (e.g., VisCoder (2506.03930)) demonstrates that training on datasets containing multi-turn feedback and correction dialogues enables LLMs to outperform untuned baselines on PandasPlotBench tasks, particularly through self-debug protocols that simulate realistic code repair cycles.
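As a rough illustration of such a self-debug protocol, the sketch below loops generation, execution, and traceback feedback. Here `generate` stands in for any chat-style LLM call and `execute` for an execution harness such as the run_task sketch above; both are hypothetical placeholders rather than VisCoder's actual interface.

```python
from typing import Callable

def self_debug(generate: Callable[[list], str],
               execute: Callable[[str], dict],
               prompt: str,
               max_rounds: int = 3) -> dict:
    """Ask the model for plotting code, execute it, and on failure feed the
    traceback back for repair -- a generic sketch of a self-debug loop."""
    messages = [{"role": "user", "content": prompt}]
    result = {"executed": False, "error": "no attempt made"}
    for _ in range(max_rounds):
        code = generate(messages)            # hypothetical LLM call
        result = execute(code)               # returns {"executed": ..., "error": ...}
        if result["executed"]:
            break
        messages.append({"role": "assistant", "content": code})
        messages.append({
            "role": "user",
            "content": f"The code failed with:\n{result['error']}\nPlease return corrected code.",
        })
    return result
```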

7. Current Limitations and Prospective Directions

While PandasPlotBench constitutes a rigorous evaluation resource, certain limitations remain:

  • Plotly Underrepresentation: Higher failure rates reveal a bottleneck in LLMs’ training data and their mastery over less canonical plotting libraries.
  • Reference Output: Only Matplotlib results are exhaustively ground-truthed, necessitating semantic rather than pixel-level evaluation for other libraries.
  • Dataframe Realism: Benchmark DataFrames are tailored to task requirements and remain relatively concise; broader incorporation of noisy or heterogeneous data is warranted.
  • Scalability of Human Judgment: While early manual validation supports automated metrics, more extensive expert review could increase result robustness.
  • Coverage: The benchmark presently focuses on Python’s main plotting libraries; systematic porting to R, Julia, or other ecosystems remains an open area.

The modular structure of PandasPlotBench supports ongoing augmentation, both in terms of new plotting techniques and the investigation of LLM prompt-design best practices for data visualization.


PandasPlotBench, available at https://huggingface.co/datasets/JetBrains-Research/plot_bench, represents a foundational contribution to empirical LLM evaluation in code-driven visual analytics and is an evolving toolset driving measurable progress in the fidelity and utility of AI-assisted data exploration.
