PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?

Published 9 May 2026 in cs.DB and cs.AI | (2605.08687v1)

Abstract: Data preparation is a central and time-consuming stage in data analysis workflows. Traditionally, commercial tools have relied on graphical user interfaces (GUIs) to simplify data preparation, allowing users to define transformations through visual operators and workflows. Recent advances in LLMs raise the possibility of a paradigm shift toward natural language (NL)-driven data preparation, in which users can specify preparation intents in NL directly. However, it remains unclear how far current LLM-based agents are from this paradigm shift in practice. Existing code generation benchmarks do not capture key characteristics of data preparation, including ambiguous user intents, imperfect real-world data, and the need to translate code into interpretable workflows for validation. To bridge this gap, we present PrepBench, a benchmark designed to evaluate NL-driven data preparation along three core capabilities: interactive disambiguation, prep-code generation, and code-to-workflow translation. We crawl data from the Preppin' Data Challenges, and then extend it into a systematically designed benchmark. The benchmark covers diverse domains, and each task involves 3 to 18 data preparation steps. Nearly half of the tasks require over 100 lines of Python code, and the longest solutions approach 300 lines. Our evaluation shows that, despite recent progress, realizing this paradigm shift remains challenging for state-of-the-art LLMs. PrepBench provides a principled benchmark for measuring this gap and helps identify key challenges toward realizing NL-driven data preparation.


Authors (4)

Summary

The paper presents PrepBench, a benchmark exposing key gaps in NL-driven data preparation: interactive disambiguation F1 scores do not exceed 51.4, and the best end-to-end accuracy is only 54.9%.
It systematically converts 306 real-world data transformation tasks from 32 domains into benchmark evaluations, featuring solutions with up to 300 lines of code.
The study highlights that enhancing disambiguation and integrating LLM code synthesis with compiler-based workflow extraction are critical for robust data preparation automation.

Systematic Benchmarking of Natural-Language-Driven Data Preparation: An Essay on "PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?"

Introduction

The automation of data preparation remains a central challenge in data analysis workflows, often consuming the majority of analysts' effort prior to modeling or analytics. Commercial tools (e.g., Tableau Prep, SAS Data Preparation) employ graphical user interfaces (GUIs) to lower the barrier, but transitioning from natural language (NL) user intents to GUI workflows remains indirect, ambiguous, and error-prone. Recent advances in LLMs have suggested a seemingly imminent paradigm shift: users would specify data preparation needs in NL and LLM-based agents would directly produce executable solutions. However, the practical efficacy and maturity of this paradigm are fundamentally underexplored due to a lack of realistic, systematic evaluation infrastructures.

This work introduces "PrepBench," a comprehensive benchmark explicitly designed to evaluate NL-driven data preparation. PrepBench scrutinizes three core system capabilities critical to realistic deployment: interactive disambiguation, robust prep-code generation, and code-to-workflow translation. The resulting analysis provides rigorous quantitative insights into current LLM performance, sources of failure, and practical implications for the next generation of AI-driven data infrastructure.

The Benchmark: Scope and Design

PrepBench is built by systematically converting weekly Preppin' Data Challenges—complex, multi-step data transformation tasks used in Tableau Prep training—into benchmark tasks. The resulting corpus consists of 306 tasks across 32 domains, covering 829 input tables, with task complexity ranging from 3 to 18 transformation steps and solutions reaching up to 300 lines of code. Each task is enriched with:

Original NL Request: As encountered in practical scenarios, often fraught with ambiguity and under-specification.
Disambiguated Request & Knowledge Base: Unambiguous task specification derived through iterative rewriting and validation. Additionally, a structured knowledge base exhaustively documents all ambiguity cases and their resolutions.
Ground-Truth Code and Workflow: Reference Python (Pandas) code and operator-based workflow graphs, ensuring both executable correctness and human interpretability.

The benchmark supports three execution modes for evaluating systems in isolation and end-to-end: (1) interactive clarification for ambiguity resolution, (2) NL-to-code generation (optionally with profiling of irregular raw data), and (3) translation from code to workflow for GUI-based verification.

Three Core Capabilities for NL-Driven Data Preparation

Interactive Disambiguation

Realistic data preparation requests are frequently ambiguous regarding intent, semantics, and edge-case handling. PrepBench operationalizes this evaluation by providing a user simulator that automatically returns clarification responses based on the disambiguation knowledge base. The taxonomy of ambiguities includes data interpretation (e.g., join fields, column mappings), concept interpretation (e.g., group/row-level concept underspecification), and operational ambiguities (e.g., incomplete/inconsistent rules, threshold edge cases). Current models demonstrate significant limitations in both detecting and resolving these ambiguities. Disambiguation $F_1$ scores do not exceed 51.4 for state-of-the-art models, and critical ambiguities—especially in multi-table alignment—remain challenging.
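The interaction loop and its scoring can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the summary does not state how PrepBench matches questions to ambiguities or how it computes $F_1$, so the sketch assumes each clarification question is tagged with a hypothetical ambiguity key, that the simulator is a plain lookup in the knowledge base, and that $F_1$ is set-based and reported on a 0-100 scale. The names `simulate_user_reply`, `disambiguation_f1`, and the example keys are invented.

```python
def simulate_user_reply(ambiguity_key: str, knowledge_base: dict[str, str]):
    """Return the stored resolution, or None when the question targets
    something that is not a documented ambiguity (assumed behavior)."""
    return knowledge_base.get(ambiguity_key)


def disambiguation_f1(detected: set[str], reference: set[str]) -> float:
    """Set-based F1 on a 0-100 scale (assumed scoring, not from the paper)."""
    true_positives = len(detected & reference)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(detected)   # share of questions that hit a real ambiguity
    recall = true_positives / len(reference)     # share of real ambiguities that were asked about
    return 100.0 * 2 * precision * recall / (precision + recall)


# Hypothetical knowledge base for one task: ambiguity key -> resolution.
knowledge_base = {
    "join_key": "Join the two tables on Customer ID.",
    "null_handling": "Treat blank cells as missing values.",
    "threshold_edge": "Include rows equal to the threshold.",
}

# The agent asks two questions; only one targets a documented ambiguity.
asked = ["join_key", "date_format"]
replies = {key: simulate_user_reply(key, knowledge_base) for key in asked}

# Precision 1/2 and recall 1/3 give F1 = 0.4, i.e. 40 on the 0-100 scale.
score = disambiguation_f1(set(asked), set(knowledge_base))
print(round(score, 1))
```

Under this assumed scoring, asking more questions does not help by itself: a question that misses every documented ambiguity lowers precision without raising recall.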

Prep-Code Generation

NL requests, even when disambiguated, must be translated into robust, executable code that handles real-world data irregularities (format variants, header/row glitches, misspellings, inconsistent missing-value representations). PrepBench requires systems to profile the full input tables and integrate this signal into code generation. Major findings show substantial accuracy improvements when ambiguity is removed (e.g., for GPT-5.1-Codex, accuracy rises from 54.9% to 85.3%), but robustness to data irregularity is still not reliably achieved except for the most advanced models.
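To make the profiling step concrete, here is a minimal sketch of the kind of signal a profiler might hand to the code generator. It is not PrepBench's profiler: the missing-value token list, the `shape_of` pattern abstraction, and the function names are all assumptions, and only the Python standard library is used so the example stays self-contained.

```python
import re
from collections import Counter

# Assumed spellings of "missing"; a real profiler would likely infer these.
MISSING_TOKENS = {"", "na", "n/a", "null", "none", "-"}


def shape_of(value: str) -> str:
    """Abstract a value into a pattern: digits become 9, letters become A."""
    return re.sub(r"[A-Za-z]", "A", re.sub(r"\d", "9", value))


def profile_column(values: list[str]) -> dict:
    """Count missing-value tokens and the distinct value shapes in a raw column."""
    missing = 0
    shapes = Counter()
    for raw in values:
        text = raw.strip()
        if text.lower() in MISSING_TOKENS:
            missing += 1
        else:
            shapes[shape_of(text)] += 1
    return {"rows": len(values), "missing": missing, "shapes": dict(shapes)}


# A date column with two format variants and two spellings of "missing".
order_dates = ["2024-01-05", "05/01/2024", "N/A", "", "2024-02-11"]
report = profile_column(order_dates)
print(report)
```

A report with more than one shape for a column that should hold dates is the cue that the generated code needs explicit handling for each format variant instead of a single parse call.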

Code-to-Workflow Translation

For broader accessibility and verification, generated code must map to GUI-based operator DAG workflows compatible with visual data preparation platforms. The translation requires accurate mapping to a schema-constrained DSL, which lies outside typical LLM pretraining regimes. The best-performing model achieves at most 67.7% accuracy on executable workflow generation, and most failures arise because the generated workflow cannot be executed at all, not because it executes and produces a wrong result. The problem compounds with workflow length and procedural complexity.
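The distinction between a non-executable and an incorrect workflow can be illustrated with a structural check that runs before any data is touched. The sketch below is hypothetical: the operator vocabulary, the arity table, and the node layout (`op`, `inputs`) are invented for illustration and are not PrepBench's DSL. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9+) to detect cycles.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical operator vocabulary: operator name -> required number of inputs.
OPERATOR_ARITY = {"input": 0, "filter": 1, "aggregate": 1, "join": 2, "output": 1}


def validate_workflow(nodes: dict[str, dict]) -> list[str]:
    """Return structural errors that would make the workflow non-executable."""
    errors = []
    for node_id, node in nodes.items():
        op = node.get("op")
        inputs = node.get("inputs", [])
        if op not in OPERATOR_ARITY:
            errors.append(f"{node_id}: unknown operator {op!r}")
        elif len(inputs) != OPERATOR_ARITY[op]:
            errors.append(
                f"{node_id}: {op} expects {OPERATOR_ARITY[op]} input(s), got {len(inputs)}"
            )
        for source in inputs:
            if source not in nodes:
                errors.append(f"{node_id}: missing input node {source!r}")
    # Map each node to its known predecessors and check that the graph is acyclic.
    graph = {
        node_id: [s for s in node.get("inputs", []) if s in nodes]
        for node_id, node in nodes.items()
    }
    try:
        tuple(TopologicalSorter(graph).static_order())
    except CycleError:
        errors.append("workflow contains a cycle")
    return errors


valid_workflow = {
    "orders": {"op": "input", "inputs": []},
    "customers": {"op": "input", "inputs": []},
    "joined": {"op": "join", "inputs": ["orders", "customers"]},
    "result": {"op": "output", "inputs": ["joined"]},
}
# Same workflow, but the join lost one of its two inputs.
broken_workflow = {**valid_workflow, "joined": {"op": "join", "inputs": ["orders"]}}

print(validate_workflow(valid_workflow))
print(validate_workflow(broken_workflow))
```

A workflow that passes such a check can still compute the wrong answer; the check only separates the "cannot run" failures from the "runs but is wrong" failures.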

Experimental Findings

The extensive empirical evaluation of ten recent LLMs (both proprietary and open-weight) on PrepBench yields several key results:

End-to-End Accuracy is Limited: The best current agent (GPT-5.1-Codex) achieves only 54.9% accuracy on complex, end-to-end NL-driven data preparation tasks.
Ambiguity is the Dominant Source of Error: Once requests are disambiguated, accuracy for GPT-5.1-Codex rises from 54.9% to 85.3%, a gain of 30.4 percentage points, underscoring the need for models that seek clarification rather than guess.
Interactive Disambiguation Has Potential but Remains Ineffective: Most models’ clarification strategies are incomplete and often target the wrong ambiguities, with interaction gains strongly dependent on question quality rather than volume.
Code-to-Workflow Translation Remains a Major Bottleneck: Transitioning from free-form code to rigid, operator-based workflow graphs presents a systematic challenge, with many translated workflows syntactically invalid.
Model Cost/Performance Mismatch: Higher cost does not universally translate to better results; in some cases, lighter models yield comparable or even superior performance-cost tradeoffs.

Implications, Limitations, and Future Directions

The results establish a quantitative, task-grounded gap between current LLM capabilities and practical, robust NL-driven data preparation. While code synthesis has advanced, full-stack autonomy in realistic settings is unattained. Notably:

Disambiguation as a First-Class Problem: Explicit agent architectures that prioritize clarification (with reward modeling/RL for clarification behavior) and intermediate structured representations (explicit intent specifications) are positioned to close the ambiguity gap.
Profiling and Robustness: Better integration of data profiling and code generation—possibly via iterative refinement or synthesized runtime assertions—may improve resilience to the spectrum of real-world irregularities.
Workflow Generation Research: Hybrid systems that pair LLM code synthesis with compiler-based workflow extraction, human-readable explanations, and interactive error localization are needed.
User Profiles and Preference Modeling: Reusable user profiles encoding preference for ambiguous operations offer a scalable way to reduce repeated clarification without requiring hardcoded global defaults, potentially reducing friction in workflow authoring.
Extension Beyond Tabular Data: The paradigm and evaluation methodology generalize to other data types (semi/unstructured) but require new operators and representations for extraction, annotation, and nested structure manipulation.
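The user-profile idea in the list above can be sketched as a small cache of clarification answers. This is an illustrative assumption, not a design from the paper: the class name `PreferenceProfile`, the ambiguity key `week_start_day`, and the callback signature are all hypothetical.

```python
from typing import Callable


class PreferenceProfile:
    """Remembers how a user resolved an ambiguity so it is asked only once."""

    def __init__(self) -> None:
        self._answers: dict[str, str] = {}
        self.questions_asked = 0

    def resolve(self, ambiguity_key: str, ask_user: Callable[[str], str]) -> str:
        """Return the stored preference, asking the user only on first sight."""
        if ambiguity_key not in self._answers:
            self.questions_asked += 1
            self._answers[ambiguity_key] = ask_user(ambiguity_key)
        return self._answers[ambiguity_key]


profile = PreferenceProfile()

# Stand-in for a real clarification dialog (hypothetical).
def ask(key: str) -> str:
    return "Monday"

first = profile.resolve("week_start_day", ask)   # asks the user
second = profile.resolve("week_start_day", ask)  # answered from the profile
print(first, second, profile.questions_asked)
```

Because the preference is stored per user, two users can resolve the same ambiguity differently, which is what separates this approach from a hardcoded global default.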

Conclusion

"PrepBench: How Far Are We from Natural-Language-Driven Data Preparation?" introduces a principled, systematic framework for benchmarking NL-driven data preparation and clarifies the multi-faceted limitations posed by both ambiguous natural language and the diversity of real-world data complexities. The released benchmark and analysis will inform research on NL-to-action agents, robust code synthesis, agent-based disambiguation, and workflow generation. The field must address disambiguation, robustness, and workflow generation with novel architectural, training, and system design contributions to realize the promise of NL-driven data analytics automation.