- The paper introduces a novel task that translates natural language instructions into DSL programs, enabling automated data preparation pipelines.
- It constructs the PARROT benchmark through a multi-stage synthesis framework covering table curation, operator chain mining, instruction generation, rule-based code compilation, and multi-phase validation.
- Evaluation shows that the iterative Pipeline-Agent, leveraging execution feedback, significantly outperforms zero-shot and direct code generation methods in accuracy.
Data preparation (DP) is a crucial step in data management, transforming raw data into a format suitable for downstream tasks like business intelligence and machine learning. This process typically involves composing various operations into executable pipelines, which requires significant programming expertise and is often time-consuming. To lower this technical barrier, the paper "Text-to-Pipeline: Bridging Natural Language and Data Preparation Pipelines" (2505.15874) introduces a novel task: translating natural language (NL) instructions directly into executable data preparation pipelines.
The paper formalizes the Text-to-Pipeline task as generating symbolic programs in a domain-specific language (DSL) that can be compiled into backend code like Pandas or SQL. This DSL approach offers advantages over direct code generation, including stable structure, backend flexibility, and better support for planning and verification. An instance of this task involves an input table, an NL instruction, a target output table, and the ground truth DSL program. The objective is to generate the DSL program which, when compiled and executed on the input table, produces the target output table. Evaluation metrics include Execution Accuracy (EA), Program Validity (PV), and Operator Accuracy (OA).
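To make the task setup concrete, the following is a minimal Python sketch of how Execution Accuracy might be checked for a single instance. The `compile_to_pandas` helper and the list-of-operators DSL representation are illustrative assumptions, not the paper's actual implementation, and the exact table-equivalence rules of the benchmark may differ.

```python
import pandas as pd

def execution_accuracy(pred_dsl: list[dict], input_table: pd.DataFrame,
                       target_table: pd.DataFrame, compile_to_pandas) -> bool:
    """Return True if the predicted DSL program reproduces the target table.

    `compile_to_pandas` is a hypothetical rule-based compiler that turns a DSL
    program (here, a list of operator dicts) into a function over DataFrames.
    """
    try:
        pipeline_fn = compile_to_pandas(pred_dsl)   # DSL -> executable Pandas function
        result = pipeline_fn(input_table.copy())    # run the pipeline on the input table
    except Exception:
        return False                                # an invalid program counts as a miss
    # Compare ignoring row order (a common convention for table equivalence).
    result = result.sort_values(list(result.columns)).reset_index(drop=True)
    target = target_table.sort_values(list(target_table.columns)).reset_index(drop=True)
    return result.equals(target)
```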
To support this new task, the authors develop PARROT, a large-scale benchmark comprising approximately 18,000 multi-step DP tasks. PARROT is built upon 23,009 real-world tables sourced from six public datasets, spanning diverse domains and table structures. The benchmark construction follows a five-stage synthesis framework:
- Table Curation: Gathering diverse real-world tables.
- Operator Chain Construction: Mining transformation patterns from production pipelines to build an empirical operator transition matrix. This matrix guides the sampling of realistic multi-step operator chains in the defined DSL via a Markov process, ensuring structural diversity and executable sequences (a sampling sketch follows this list). Task difficulty is categorized (Easy, Medium, Hard) based on chain length and operator types.
- Instruction Generation: Using LLMs like GPT-4o to generate initial, schema-aware NL descriptions from the input-output pairs and DSL chains, followed by style-controlled refinement for fluency and user-centric expression.
- Rule-Based Code Compilation: Deterministically compiling each DSL program into executable Pandas code using a rule engine that maintains and propagates schema state across steps to ensure correctness and validation.
- Multi-phase Validation: Rigorously validating the generated data through automated execution verification (checking if compiled code produces the target output) and human expert review to ensure semantic alignment between NL instructions and DSL programs.
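As referenced in the operator chain construction step above, the sketch below shows what Markov-style sampling from an empirical transition matrix could look like. The operator set matches the core operators named in the paper, but the transition probabilities are placeholder values for illustration, not the mined statistics.

```python
import random

# Illustrative transition probabilities between DP operators (placeholder values;
# the paper mines these from production pipelines).
TRANSITIONS = {
    "<start>": {"select": 0.4, "filter": 0.4, "join": 0.2},
    "select":  {"filter": 0.5, "groupby": 0.3, "rename": 0.2},
    "filter":  {"groupby": 0.4, "pivot": 0.3, "rename": 0.3},
    "join":    {"filter": 0.6, "groupby": 0.4},
    "groupby": {"rename": 0.5, "pivot": 0.5},
    "pivot":   {"rename": 1.0},
}

def sample_operator_chain(length: int, rng: random.Random | None = None) -> list[str]:
    """Sample a multi-step operator chain via a first-order Markov walk."""
    rng = rng or random.Random()
    chain, state = [], "<start>"
    for _ in range(length):
        successors = TRANSITIONS.get(state)
        if not successors:          # reached an operator with no outgoing transitions
            break
        ops, weights = zip(*successors.items())
        state = rng.choices(ops, weights=weights, k=1)[0]
        chain.append(state)
    return chain

print(sample_operator_chain(4))  # e.g. a chain like ['select', 'filter', 'groupby', 'rename']
```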
PARROT is shown to be significantly larger and more complex than prior benchmarks, featuring an average chain length of 4.24 operations and covering 16 core DP operators (e.g., filter, groupby, join, pivot, rename). The instructions exhibit high lexical richness and structural variation compared to other datasets.
The paper evaluates cutting-edge LLMs on PARROT across three settings: zero-shot prompting, fine-tuned models, and agent-based planning. Key findings include:
- Zero-shot LLMs perform reasonably well on Easy tasks but struggle with the compositional complexity of Medium and Hard tasks, revealing limitations in understanding multi-step instructions and schema evolution.
- Fine-tuned open-source models like Qwen2.5-Coder-7B achieve better performance (74.15% EA) than zero-shot models, demonstrating the value of high-quality supervised data.
- Generating programs in the proposed DSL significantly outperforms direct generation of Pandas code or SQL statements (62.88% EA for DSL vs. 33.80% for Pandas and 3.05% for SQL), highlighting the DSL's effectiveness for structured generation and execution.
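To illustrate the contrast between DSL-level and direct code generation, the snippet below pairs a hypothetical three-step DSL program with the Pandas a rule-based compiler might emit for it. The DSL syntax and column names are illustrative guesses, not the paper's exact grammar.

```python
# Hypothetical DSL program (illustrative syntax):
#   select(columns=["region", "sales"])
#   filter(condition="sales > 1000")
#   groupby(keys=["region"], agg={"sales": "sum"})
#
# Pandas that a rule-based compiler might emit for the chain above:
import pandas as pd

def pipeline(df: pd.DataFrame) -> pd.DataFrame:
    df = df[["region", "sales"]]                                # select
    df = df[df["sales"] > 1000]                                 # filter
    df = df.groupby("region", as_index=False)["sales"].sum()    # groupby + aggregate
    return df
```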
To address the challenges in multi-step reasoning, especially concerning schema evolution and dynamic interaction, the authors propose Pipeline-Agent. This agent employs a ReAct-style iterative loop, predicting an operation, executing it on the current table state, and using the intermediate results for context-aware planning. Pipeline-Agent achieves the best overall performance (76.17% EA with GPT-4o), substantially outperforming other agent-based methods like Tool Calling API (60.48% EA) and Plan-and-Solve (47.40% EA). This demonstrates that leveraging intermediate table feedback is crucial for accurately generating complex, multi-step pipelines.
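A minimal sketch of such a ReAct-style loop is shown below. The helpers `propose_next_step` (an LLM call that sees the instruction, a preview of the current table, and the steps so far) and `execute_step` (a DSL executor) are hypothetical stand-ins; this is not the paper's implementation.

```python
import pandas as pd

def run_pipeline_agent(instruction: str, table: pd.DataFrame,
                       propose_next_step, execute_step, max_steps: int = 8):
    """Iteratively build a pipeline, feeding intermediate table state back to the model.

    `propose_next_step(instruction, table_preview, history)` is a hypothetical LLM call
    returning the next DSL operation, or None when the pipeline is complete;
    `execute_step(op, table)` is a hypothetical executor returning the transformed table.
    """
    history = []                                   # DSL operations chosen so far
    for _ in range(max_steps):
        preview = table.head(5).to_string()        # compact view of the evolving schema/rows
        op = propose_next_step(instruction, preview, history)
        if op is None:                             # the model signals the pipeline is done
            break
        table = execute_step(op, table)            # execute the step and observe the result
        history.append(op)
    return history, table
```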
Error analysis on zero-shot GPT-4o reveals common failure modes, with Type and Column errors accounting for the majority (63.6%), often due to incorrect schema references or incompatible operations after transformations. Semantic failures (27.2%), such as missing or hallucinated steps, also indicate challenges in compositional reasoning. A case study illustrates a failure where the model omits an initial 'select' operation, leading to cascading errors and an incorrect output table, highlighting the need for robust instruction grounding and schema awareness across pipeline steps.
In conclusion, the paper introduces Text-to-Pipeline and the PARROT benchmark, providing a standardized evaluation ground for NL-driven DP pipeline generation. The empirical results demonstrate that while current LLMs can generate valid programs, achieving high execution accuracy on complex, multi-step tasks remains challenging, particularly regarding instruction grounding, compositional reasoning, and handling dynamic schema evolution. The proposed Pipeline-Agent offers a promising direction by integrating iterative planning and execution feedback, paving the way for more capable and autonomous data preparation systems.