- The paper presents PrepBench, a benchmark exposing key gaps in NL-driven data preparation: interactive disambiguation F1 scores do not exceed 51.4, and the best end-to-end accuracy is 54.9%.
- It systematically converts 306 real-world data transformation tasks from 32 domains into benchmark evaluations, featuring solutions with up to 300 lines of code.
- The study highlights that enhancing disambiguation and integrating LLM code synthesis with compiler-based workflow extraction are critical for robust data preparation automation.
Systematic Benchmarking of Natural-Language-Driven Data Preparation: An Essay on "PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?"
Introduction
The automation of data preparation remains a central challenge in data analysis workflows, often consuming the majority of analysts' effort prior to modeling or analytics. Commercial tools (e.g., Tableau Prep, SAS Data Preparation) employ graphical user interfaces (GUIs) to lower the barrier, but translating natural language (NL) user intents into GUI workflows remains indirect, ambiguous, and error-prone. Recent advances in large language models (LLMs) suggest a seemingly imminent paradigm shift: users would specify data preparation needs in NL, and LLM-based agents would directly produce executable solutions. However, the practical efficacy and maturity of this paradigm remain underexplored because realistic, systematic evaluation infrastructure is lacking.
This work introduces "PrepBench," a comprehensive benchmark explicitly designed to evaluate NL-driven data preparation. PrepBench scrutinizes three core system capabilities critical to realistic deployment: interactive disambiguation, robust prep-code generation, and code-to-workflow translation. The resulting analysis provides rigorous quantitative insights into current LLM performance, sources of failure, and practical implications for the next generation of AI-driven data infrastructure.
The Benchmark: Scope and Design
PrepBench is built by systematically converting weekly Preppin' Data Challenges—complex, multi-step data transformation tasks used in Tableau Prep training—into benchmark tasks. The resulting corpus consists of 306 tasks across 32 domains, covering 829 input tables, with task complexity ranging from 3 to 18 transformation steps and solutions reaching up to 300 lines of code. Each task is enriched with:
- Original NL Request: As encountered in practical scenarios, often fraught with ambiguity and under-specification.
- Disambiguated Request & Knowledge Base: Unambiguous task specification derived through iterative rewriting and validation. Additionally, a structured knowledge base exhaustively documents all ambiguity cases and their resolutions.
- Ground-Truth Code and Workflow: Reference Python (Pandas) code and operator-based workflow graphs, ensuring both executable correctness and human interpretability.
The benchmark supports three execution modes for evaluating systems in isolation and end-to-end: (1) interactive clarification for ambiguity resolution, (2) NL-to-code generation (optionally with profiling of irregular raw data), and (3) translation from code to workflow for GUI-based verification.
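To make the task structure concrete, the record below is a minimal sketch of how one benchmark task and its three evaluation modes could be represented. All class, field, and enum names are hypothetical and are not taken from the PrepBench release. The choice of which request each mode receives is also an assumption made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class Mode(Enum):
    """The three evaluation modes described above (names are illustrative)."""
    INTERACTIVE_CLARIFICATION = "clarify"
    NL_TO_CODE = "nl2code"
    CODE_TO_WORKFLOW = "code2workflow"


@dataclass
class AmbiguityCase:
    topic: str        # what is ambiguous, e.g. "join key"
    resolution: str   # the validated answer recorded in the knowledge base


@dataclass
class PrepTask:
    task_id: str
    original_request: str        # ambiguous request as the user wrote it
    disambiguated_request: str   # rewritten, unambiguous specification
    knowledge_base: list[AmbiguityCase] = field(default_factory=list)
    reference_code: str = ""     # ground-truth Python (Pandas) solution
    # Workflow graph as operator node -> list of upstream nodes.
    reference_workflow: dict[str, list[str]] = field(default_factory=dict)

    def prompt_for(self, mode: Mode) -> str:
        """Return the input a system would see in each mode (assumed mapping)."""
        if mode is Mode.INTERACTIVE_CLARIFICATION:
            return self.original_request
        if mode is Mode.NL_TO_CODE:
            return self.disambiguated_request
        return self.reference_code
```

Keeping the original and disambiguated requests side by side is what allows the same task to be scored with and without ambiguity.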
Three Core Capabilities for NL-Driven Data Preparation
Interactive Disambiguation
Realistic data preparation requests are frequently ambiguous regarding intent, semantics, and edge-case handling. PrepBench evaluates this capability with a user simulator that automatically returns clarification responses based on the disambiguation knowledge base. The taxonomy of ambiguities includes data interpretation (e.g., join fields, column mappings), concept interpretation (e.g., group/row-level concept underspecification), and operational ambiguities (e.g., incomplete/inconsistent rules, threshold edge cases). Current models demonstrate significant limitations in both detecting and resolving these ambiguities. Disambiguation F1 scores do not exceed 51.4 for state-of-the-art models, and critical ambiguities, especially in multi-table alignment, remain challenging.
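The sketch below illustrates one way a knowledge-base-backed user simulator and an F1 score over detected ambiguities could fit together. It assumes that each clarification question has already been mapped to an ambiguity identifier and that F1 is the usual harmonic mean of precision and recall over those identifiers. The paper's exact matching procedure and metric definition may differ, and every name here is hypothetical.

```python
class UserSimulator:
    """Answers clarification questions from a fixed knowledge base (illustrative)."""

    def __init__(self, knowledge_base: dict[str, str]):
        # Maps an ambiguity identifier to its validated resolution.
        self.knowledge_base = knowledge_base

    def answer(self, ambiguity_id: str) -> str:
        # Questions that target no real ambiguity get a neutral reply.
        return self.knowledge_base.get(
            ambiguity_id, "No clarification is needed for that."
        )


def disambiguation_f1(asked: set[str], gold: set[str]) -> float:
    """F1 between ambiguities the model asked about and those in the knowledge base."""
    hits = len(asked & gold)
    if hits == 0:
        return 0.0
    precision = hits / len(asked)   # share of questions that hit a real ambiguity
    recall = hits / len(gold)       # share of real ambiguities that were asked about
    return 2 * precision * recall / (precision + recall)


kb = {"join_key": "Join on customer_id", "threshold_edge": "Include values equal to 100"}
simulator = UserSimulator(kb)
asked = {"join_key", "date_format", "rounding"}
score = disambiguation_f1(asked, set(kb))
```

Here one of three questions targets a real ambiguity (precision 1/3) and one of two real ambiguities is covered (recall 1/2), so `score` is 0.4, or 40 on the 0 to 100 scale used in the text. Asking more questions raises recall only if they target real ambiguities, which matches the observation that question quality matters more than volume.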
Prep-Code Generation
NL requests, even when disambiguated, must be translated into robust, executable code that handles real-world data irregularities (format variants, header/row glitches, misspellings, inconsistent missing-value representations). PrepBench requires systems to profile the full input tables and integrate this signal into code generation. Major findings show substantial accuracy improvements when ambiguity is removed (e.g., for GPT-5.1-Codex, accuracy rises from 54.9% to 85.3%), but robustness to data irregularity is still not reliably achieved except for the most advanced models.
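As an illustration of what profiling irregular data might involve, the following sketch summarizes the missing-value spellings and format variants in a single column so that the summary could be passed to a code generator. It uses only the standard library. The token list, the pattern labels, and the function name are assumptions chosen for the example, not part of PrepBench.

```python
import re
from collections import Counter

# Spellings treated as "missing" (an assumed, non-exhaustive list).
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "-", "?"}

# A few date layouts to detect; real data would need many more.
DATE_PATTERNS = {
    "YYYY-MM-DD": re.compile(r"^\d{4}-\d{2}-\d{2}$"),
    "D/M/YYYY or M/D/YYYY": re.compile(r"^\d{1,2}/\d{1,2}/\d{4}$"),
}


def profile_column(values):
    """Count missing-value spellings and format variants in one column."""
    missing = Counter()
    formats = Counter()
    for raw in values:
        text = "" if raw is None else str(raw).strip()
        if text.lower() in MISSING_TOKENS:
            missing[text.lower()] += 1
            continue
        # First matching pattern wins; anything unmatched is labeled "other".
        label = next(
            (name for name, pattern in DATE_PATTERNS.items() if pattern.match(text)),
            "other",
        )
        formats[label] += 1
    return {"missing_tokens": dict(missing), "formats": dict(formats)}


report = profile_column(["2024-01-05", "05/01/2024", "N/A", "", None, "n/a", "Jan 5"])
```

A report showing two distinct missing-value spellings and three date layouts in one column signals that generated code cannot assume a single `NaN` representation or a single date parser.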
Code-to-Workflow Translation
For broader accessibility and verification, generated code must map to GUI-based operator DAG workflows compatible with visual data preparation platforms. The translation requires accurate mapping to a schema-constrained DSL, which is outside typical LLM pretraining regimes. The best-performing model reaches 67.7% accuracy on executable workflow generation, and most failures are workflows that cannot be executed at all rather than workflows that execute but produce incorrect results. The problem compounds with workflow length and procedural complexity.
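The distinction between a non-executable and an incorrect workflow can be sketched with a structural check that runs before any data is touched. The example below validates a workflow against a toy operator schema and derives an execution order with the standard library's `graphlib`. The operator names, their input counts, and the node layout are hypothetical and do not reproduce the benchmark's DSL.

```python
from graphlib import CycleError, TopologicalSorter

# Toy schema: operator name -> number of upstream inputs it requires.
OPERATOR_ARITY = {"input": 0, "filter": 1, "aggregate": 1, "join": 2, "output": 1}


def check_workflow(nodes):
    """Return (errors, execution_order) for a workflow given as
    node_id -> {"op": str, "inputs": [node_id, ...]}."""
    errors = []
    for node_id, node in nodes.items():
        op = node.get("op")
        inputs = node.get("inputs", [])
        if op not in OPERATOR_ARITY:
            errors.append(f"{node_id}: unknown operator {op!r}")
            continue
        if len(inputs) != OPERATOR_ARITY[op]:
            errors.append(f"{node_id}: {op} expects {OPERATOR_ARITY[op]} input(s)")
        for source in inputs:
            if source not in nodes:
                errors.append(f"{node_id}: input {source!r} is not defined")
    if errors:
        return errors, []
    # TopologicalSorter takes node -> predecessors and yields predecessors first.
    graph = {node_id: node.get("inputs", []) for node_id, node in nodes.items()}
    try:
        order = list(TopologicalSorter(graph).static_order())
    except CycleError:
        return ["workflow contains a cycle"], []
    return [], order


workflow = {
    "orders": {"op": "input", "inputs": []},
    "customers": {"op": "input", "inputs": []},
    "joined": {"op": "join", "inputs": ["orders", "customers"]},
    "out": {"op": "output", "inputs": ["joined"]},
}
errors, order = check_workflow(workflow)
```

A workflow that passes this check may still compute the wrong answer, but one that fails it cannot run at all, which is the failure class the text identifies as dominant.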
Experimental Findings
The extensive empirical evaluation of ten recent LLMs (both proprietary and open-weight) on PrepBench yields several key results:
- End-to-End Accuracy is Limited: The best current agent (GPT-5.1-Codex) achieves only 54.9% accuracy on complex, end-to-end NL-driven data preparation tasks.
- Ambiguity is the Dominant Source of Error: Once requests are disambiguated, accuracy rises sharply (to 85.3% for GPT-5.1-Codex), underscoring the need for models that seek clarification rather than guessing.
- Interactive Disambiguation Has Potential but Remains Ineffective: Most models’ clarification strategies are incomplete and often target the wrong ambiguities, with interaction gains strongly dependent on question quality rather than volume.
- Code-to-Workflow Translation Remains a Major Bottleneck: Transitioning from free-form code to rigid, operator-based workflow graphs presents a systematic challenge, with many translated workflows syntactically invalid.
- Model Cost/Performance Mismatch: Higher cost does not universally translate to better results; in some cases, lighter models yield comparable or even superior performance-cost tradeoffs.
Implications, Limitations, and Future Directions
The results establish a quantitative, task-grounded gap between current LLM capabilities and practical, robust NL-driven data preparation. While code synthesis has advanced, full-stack autonomy in realistic settings is unattained. Notably:
- Disambiguation as a First-Class Problem: Explicit agent architectures that prioritize clarification (with reward modeling/RL for clarification behavior) and intermediate structured representations (explicit intent specifications) are positioned to close the ambiguity gap.
- Profiling and Robustness: Better integration of data profiling and code generation—possibly via iterative refinement or synthesized runtime assertions—may improve resilience to the spectrum of real-world irregularities.
- Workflow Generation Research: Hybrid systems are needed that pair LLM code synthesis with compiler-based workflow extraction, human-readable explanations, and interactive error localization.
- User Profiles and Preference Modeling: Reusable user profiles encoding preference for ambiguous operations offer a scalable way to reduce repeated clarification without requiring hardcoded global defaults, potentially reducing friction in workflow authoring.
- Extension Beyond Tabular Data: The paradigm and evaluation methodology generalize to other data types (semi/unstructured) but require new operators and representations for extraction, annotation, and nested structure manipulation.
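The hybrid direction mentioned above, pairing LLM code synthesis with compiler-based workflow extraction, can be sketched with Python's `ast` module: rather than asking a model to emit a workflow, the generated code is parsed and its Pandas method calls are mapped to workflow operators deterministically. The mapping table and function name are hypothetical, and this toy version ignores data flow between variables, which a real extractor would need to track.

```python
import ast

# Assumed mapping from Pandas method names to workflow operator names.
PANDAS_TO_OPERATOR = {
    "merge": "join",
    "groupby": "aggregate",
    "query": "filter",
    "dropna": "clean",
    "sort_values": "sort",
}


def extract_operators(source: str) -> list[str]:
    """List workflow operators in approximate evaluation order."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            operator = PANDAS_TO_OPERATOR.get(node.func.attr)
            if operator is not None:
                # A call's end position orders chained and nested calls
                # the way they are evaluated: inner calls end first.
                found.append((node.end_lineno, node.end_col_offset, operator))
    return [operator for _, _, operator in sorted(found)]


SAMPLE = '''
import pandas as pd
orders = pd.read_csv("orders.csv")
customers = pd.read_csv("customers.csv")
joined = orders.merge(customers.dropna(), on="customer_id")
result = joined.query("amount > 0").groupby("region").sum()
'''
operators = extract_operators(SAMPLE)
# operators == ["clean", "join", "filter", "aggregate"]
```

Because the code is only parsed and never executed, the extraction needs neither Pandas nor the input files, and any workflow it emits is constrained to known operators by construction.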
Conclusion
"PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?" introduces a principled, systematic framework for benchmarking NL-driven data preparation and clarifies the multi-faceted limitations posed by both ambiguous natural language and the diversity of real-world data complexities. The released benchmark and analysis will inform research on NL-to-action agents, robust code synthesis, agent-based disambiguation, and workflow generation. The field must address disambiguation, robustness, and workflow generation with novel architectural, training, and system design contributions to realize the promise of NL-driven data analytics automation.
References
All cited references can be found in the paper's appendix and at (2605.08687).