Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach (2506.06175v1)

Published 6 Jun 2025 in cs.CL

Abstract: LLMs can translate natural-language chart descriptions into runnable code, yet approximately 15% of the generated scripts still fail to execute, even after supervised fine-tuning and reinforcement learning. We investigate whether this persistent error rate stems from model limitations or from reliance on a single-prompt design. To explore this, we propose a lightweight multi-agent pipeline that separates drafting, execution, repair, and judgment, using only an off-the-shelf GPT-4o-mini model. On the Text2Chart31 benchmark, our system reduces execution errors to 4.5% within three repair iterations, outperforming the strongest fine-tuned baseline by nearly 5 percentage points while requiring significantly less compute. Similar performance is observed on the ChartX benchmark, with an error rate of 4.6%, demonstrating strong generalization. Under current benchmarks, execution success appears largely solved. However, manual review reveals that 6 out of 100 sampled charts contain hallucinations, and an LLM-based accessibility audit shows that only 33.3% (Text2Chart31) and 7.2% (ChartX) of generated charts satisfy basic colorblindness guidelines. These findings suggest that future work should shift focus from execution reliability toward improving chart aesthetics, semantic fidelity, and accessibility.

Summary

  • The paper introduces a multi-agent framework that cuts execution errors from roughly 15% to 4.5% on Text2Chart31 (4.6% on ChartX).
  • It employs dedicated agents for drafting, repairing, and judging code, streamlining the text-to-chart generation process without extra training.
  • The study highlights remaining challenges such as semantic hallucinations and limited accessibility adherence, urging a holistic evaluation approach.

Analyzing Multi-Agent Approaches in Text-to-Chart Generation

The paper "Does It Run and Is That Enough? Revisiting Text-to-Chart Generation with a Multi-Agent Approach" offers a critical evaluation of the efficacy of LLMs in the domain of text-to-chart generation by examining execution error rates and proposing novel methodologies to address persistent challenges. This paper explicitly challenges the reliance on a singular LLM prompt by introducing a multi-agent interrogation of this generation process, which bifurcates responsibilities such as drafting, execution, repair, and judgment.

Contribution and Methodology

The researchers introduce a streamlined multi-agent framework built on GPT-4o-mini, requiring no additional model training. The framework divides chart generation into designated phases: a Draft agent produces code from the natural-language prompt, an execution step runs it, a Repair agent iteratively debugs the code whenever execution fails, and a Judge agent assesses the result. This design reduces execution errors from the previously recorded rates of approximately 15% to 4.5% on the Text2Chart31 dataset, within three repair iterations and with significantly less compute than the strongest fine-tuned baselines, validating the agentic methodology. A minimal sketch of the loop appears below.
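The paper's orchestration code is not reproduced here, so the following is a minimal sketch of how such a draft-execute-repair-judge loop might look, assuming the standard `openai` Python client for GPT-4o-mini; the agent prompts and the `run_script`/`generate_chart` helpers are illustrative, not the authors' actual implementation.

```python
import subprocess
import tempfile

from openai import OpenAI

client = OpenAI()
MAX_REPAIRS = 3  # the paper reports errors falling to 4.5% within three repair iterations

def call_llm(system: str, user: str) -> str:
    """One GPT-4o-mini chat completion; prompts here are placeholders."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": system},
                  {"role": "user", "content": user}],
    )
    return resp.choices[0].message.content

def run_script(code: str) -> tuple[bool, str]:
    """Execute candidate plotting code in a subprocess; return (ok, stderr)."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    try:
        proc = subprocess.run(["python", path], capture_output=True,
                              text=True, timeout=60)
    except subprocess.TimeoutExpired:
        return False, "timed out after 60s"
    return proc.returncode == 0, proc.stderr

def generate_chart(description: str) -> str | None:
    # Draft agent: produce an initial matplotlib script from the description.
    code = call_llm("You write complete, runnable matplotlib scripts.", description)
    ok, stderr = run_script(code)
    # Repair agent: on failure, feed the traceback back and retry, up to a budget.
    for _ in range(MAX_REPAIRS):
        if ok:
            break
        code = call_llm(
            "Fix the script so it executes; return only the corrected code.",
            f"Task:\n{description}\n\nScript:\n{code}\n\nTraceback:\n{stderr}",
        )
        ok, stderr = run_script(code)
    if not ok:
        return None  # exhausted the repair budget
    # Judge agent: accept or reject the successfully executed script.
    verdict = call_llm(
        "Reply ACCEPT if the script plausibly satisfies the task, else REJECT.",
        f"Task:\n{description}\n\nScript:\n{code}",
    )
    return code if verdict.strip().startswith("ACCEPT") else None
```

Running the generated code in a subprocess keeps the orchestrator alive when a draft crashes, which is what makes repair-on-traceback cheap relative to fine-tuning.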

Results and Analysis

The paper reports a substantial reduction in failure rates: the agentic approach lowers execution errors to 4.5% on Text2Chart31, outperforming the strongest fine-tuned baseline by nearly 5 percentage points, and achieves a comparable 4.6% error rate on ChartX. By this measure, execution success on current benchmarks appears largely solved. However, the authors show that semantic errors persist: 6 out of 100 manually reviewed charts contained hallucinations. Accessibility is weaker still, as only a small fraction of generated charts (33.3% for Text2Chart31 and 7.2% for ChartX) aligned with colorblindness guidelines. A simple, non-LLM way to approximate such a colorblindness check is sketched below.
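The paper performs its audit with an LLM judge; as an illustrative alternative, not the authors' method, the sketch below flags palette pairs that remain hard to distinguish under simulated deuteranopia, using the colorspacious library. The `min_delta` threshold is an assumed tunable, not a published guideline value.

```python
from itertools import combinations

from colorspacious import cspace_convert, deltaE

# Severity 100 approximates full deuteranopia in the Machado et al. CVD model.
DEUTERANOPIA = {"name": "sRGB1+CVD", "cvd_type": "deuteranomaly", "severity": 100}

def confusable_pairs(palette, min_delta=15.0):
    """Return index pairs of palette colors whose simulated-deuteranopia
    appearances differ by less than `min_delta` in CAM02-UCS."""
    simulated = [cspace_convert(c, DEUTERANOPIA, "sRGB1") for c in palette]
    return [
        (i, j)
        for i, j in combinations(range(len(palette)), 2)
        if deltaE(simulated[i], simulated[j], input_space="sRGB1") < min_delta
    ]

# Saturated red vs. green is the classic pair that collapses for deuteranopes.
palette = [(0.85, 0.10, 0.10), (0.10, 0.60, 0.10), (0.10, 0.30, 0.85)]
print(confusable_pairs(palette))
```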

Theoretical and Practical Implications

The research argues for revisiting the evaluation criteria in text-to-chart generation. By exposing the inadequacy of benchmarks that emphasize purely execution-related metrics, it calls for a shift toward assessing the semantic and stylistic fidelity, overall visual quality, and accessibility of generated charts. The agentic system's structured reasoning offers a natural place to incorporate such qualitative checks. Moreover, the paper emphasizes the need for future research to consider real-world application scenarios, focusing on usability, readability, and inclusivity.

Future Directions

Looking forward, the authors urge the community to move beyond the binary question of whether code runs and to also assess the semantic and perceptual quality of the output, for example via structural similarity (SSIM) scores and evaluations by multimodal LLMs. Agentic systems, having demonstrated robustness and improved performance, present a promising avenue for further exploration, especially for complex multi-step analytical sessions or datasets containing noise. A brief illustration of an SSIM comparison between rendered charts follows.
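As an illustration of the SSIM idea (the library and rendering setup are my choices, not specified by the paper), this sketch renders two matplotlib charts headlessly and scores their perceptual similarity with scikit-image:

```python
import matplotlib
matplotlib.use("Agg")  # headless rendering, no display needed
import matplotlib.pyplot as plt
import numpy as np
from skimage.metrics import structural_similarity

def render(plot_fn, size=(4, 3), dpi=100):
    """Render a plotting callable to a grayscale float image in [0, 255]."""
    fig, ax = plt.subplots(figsize=size, dpi=dpi)
    plot_fn(ax)
    fig.canvas.draw()
    rgba = np.asarray(fig.canvas.buffer_rgba())
    plt.close(fig)
    return rgba[..., :3].mean(axis=-1)  # average RGB channels to grayscale

x = np.linspace(0, 10, 100)
reference = render(lambda ax: ax.plot(x, np.sin(x)))
candidate = render(lambda ax: ax.plot(x, np.sin(x) + 0.05))

# SSIM lies in [-1, 1]; 1.0 means perceptually identical renderings.
score = structural_similarity(reference, candidate, data_range=255.0)
print(f"SSIM = {score:.3f}")
```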

In closing, the paper's findings underscore a necessary transition toward holistic evaluation encompassing both the functional and qualitative aspects of automated chart generation. The multi-agent strategy offers an effective, resource-efficient alternative to fine-tuning, and the work contributes substantially to the roadmap for AI-driven data visualization, particularly in accommodating professional and accessibility needs.