Agentic Text-to-SQL Systems

Updated 3 January 2026

Agentic Text-to-SQL systems are modular frameworks that use multiple LLM-based agents to transform natural language into executable SQL.
They decompose the task into subtasks such as schema linking, query planning, SQL generation, and execution-driven correction, so that errors can be caught and repaired at each stage.
These systems offer enhanced interpretability, dynamic error correction, and scalability, outperforming monolithic models on benchmarks like Spider and BIRD.

Agentic Text-to-SQL systems are a class of automated frameworks that employ multiple interacting agents, each typically implemented as an LLM or a specialized module, to convert natural language queries into syntactically valid, executable SQL statements. In contrast to monolithic, single-shot models, agentic systems decompose the Text-to-SQL task into sequential or parallel subtasks (e.g., schema linking, query planning, clause decomposition, verification, correction), enabling more robust, verifiable, and interpretable query generation under complex schema, conversational, or multi-turn settings. This architecture supports dynamic reasoning, explicit feedback loops, interactive tool usage, test-time scaling, and production adaptation across a wide variety of Text-to-SQL challenges, including conversational NL2SQL, segmented/partitioned databases, cost-aware large-schema inference, and cross-dialect generalization.

1. Architectural Principles and Taxonomy of Agentic Text-to-SQL Systems

Agentic Text-to-SQL systems are structured around the modular delegation of subtasks to agent modules, each handling a discrete aspect of the pipeline such as schema pruning, query planning, SQL synthesis, execution-driven validation, or interactive correction. Foundational systems including SQL-of-Thought (Chaturvedi et al., 30 Aug 2025), AGENTIQL (Heidari et al., 12 Oct 2025), MATS (Hoang et al., 21 Dec 2025), and Squrve (Wang et al., 28 Oct 2025) instantiate agents as distinct LLM-based routines orchestrated via a central controller, often following an explicit pipeline (sequential/parallel) or dynamic agent collaboration protocol.

Representative agentic pipeline breakdowns include:

Schema Pruning/Linking Agent: Identifies and selects the relevant subset of schema entities for a given question (Hoang et al., 21 Dec 2025, Yang et al., 2 Nov 2025, Wang et al., 28 Oct 2025).
Planning/Decomposition Agent: Performs clause-level decomposition or chain-of-thought planning, mapping NL intent onto explicit SQL construction steps (Chaturvedi et al., 30 Aug 2025, Heidari et al., 12 Oct 2025).
Generation/Coding Agent: Synthesizes clause-level or full SQL, with optional iterative self-correction (Heidari et al., 12 Oct 2025, Ahmed et al., 6 Nov 2025).
Execution/Validation Agent: Executes SQL candidates, checks result set accuracy, and initiates guided correction if necessary (Hoang et al., 21 Dec 2025, Chaturvedi et al., 30 Aug 2025).
Refinement/KV-Fixer Agent: Refines the SQL via in-context execution error diagnosis and structured edit plans (Chaturvedi et al., 30 Aug 2025, Hoang et al., 21 Dec 2025).
Selection/Aggregation/Judge Agent: Reranks or selects the final SQL from a candidate pool, optionally using generative or tournament-style scoring (Ahmed et al., 6 Nov 2025, Wang et al., 29 Sep 2025).

This task decomposition aligns with hierarchical or ReAct-style (Think–Act–Observe) control flows, as exemplified by MARS-SQL (Yang et al., 2 Nov 2025) and spatio-temporal pipelines (Redd et al., 29 Oct 2025).
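The sequential variant of this decomposition can be illustrated with a minimal sketch. It is not taken from any of the cited systems: the `PipelineState` structure, the agent names, and the keyword-matching logic are all hypothetical stand-ins for what would be LLM calls in a real system. The point is only the control flow, in which a controller passes shared state through a fixed list of agents and records each intermediate output.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PipelineState:
    """Shared state handed from agent to agent (hypothetical structure)."""
    question: str
    schema: Dict[str, List[str]]  # table name -> column names
    linked_schema: Dict[str, List[str]] = field(default_factory=dict)
    plan: List[str] = field(default_factory=list)
    sql: str = ""
    trace: List[str] = field(default_factory=list)  # intermediate outputs for auditing


Agent = Callable[[PipelineState], PipelineState]


def _tokens(text: str) -> set:
    return set(text.lower().replace("?", " ").split())


def schema_linking_agent(state: PipelineState) -> PipelineState:
    # Stand-in for an LLM call: keep tables whose name or columns appear in the question.
    words = _tokens(state.question)
    state.linked_schema = {
        table: cols
        for table, cols in state.schema.items()
        if table.lower() in words or any(c.lower() in words for c in cols)
    }
    state.trace.append(f"schema_linking: {sorted(state.linked_schema)}")
    return state


def planning_agent(state: PipelineState) -> PipelineState:
    # Stand-in planner: one SELECT step and one FROM step for the first linked table.
    words = _tokens(state.question)
    table, cols = next(iter(state.linked_schema.items()))
    wanted = [c for c in cols if c.lower() in words] or ["*"]
    state.plan = [f"SELECT {', '.join(wanted)}", f"FROM {table}"]
    state.trace.append(f"planning: {state.plan}")
    return state


def generation_agent(state: PipelineState) -> PipelineState:
    # Stand-in generator: concatenate the planned clauses into one statement.
    state.sql = " ".join(state.plan)
    state.trace.append(f"generation: {state.sql}")
    return state


def run_pipeline(state: PipelineState, agents: List[Agent]) -> PipelineState:
    """Central controller: apply each agent in order to the shared state."""
    for agent in agents:
        state = agent(state)
    return state


result = run_pipeline(
    PipelineState(
        question="What is the name of each singer?",
        schema={"singer": ["name", "age"], "concert": ["venue", "year"]},
    ),
    [schema_linking_agent, planning_agent, generation_agent],
)
print(result.sql)
print(result.trace)
```

Keeping every agent behind the same `state -> state` signature is one simple way to make agents swappable and to expose the intermediate trace; parallel or dynamically routed topologies would need a richer controller than this loop.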

2. Interaction Protocols, Feedback Loops, and Orchestration Strategies

Agentic Text-to-SQL frameworks employ structured interaction protocols to facilitate agent collaboration, knowledge sharing, and dynamic error correction:

Iterative Proposal–Execution–Verification–Refinement: Systems like MTSQL-R1 (Guo et al., 12 Oct 2025) formalize Text-to-SQL as an MDP traversed via propose–execute–verify–refine cycles, with database feedback and dialogue memory enforced at each step to ensure coherence and executability.
Consensus and Discussion Protocols: BAPPA (Ahmed et al., 6 Nov 2025) and R³ (Review–Rebuttal–Revision) (Xia et al., 2024) realize multi-agent “debate” or consensus voting, where candidate SQLs are iteratively critiqued, revised, and synthesized by peer agents and a judge.
Tournament and Selection: Agentar-Scale-SQL (Wang et al., 29 Sep 2025) leverages multiple SQL generators in parallel, feeding candidates into a round-robin RL-trained “tournament selector” that identifies the highest-quality candidate.
Decentralized Collaboration with Privacy Segmentation: CSMA (Wu et al., 2024) enables multi-agent query composition over partitioned databases; each agent shares only necessary schema fragments for robust SQL generation while preserving data privacy constraints.
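As a rough illustration of the tournament idea, the sketch below runs a round-robin over a candidate pool and returns the candidate with the most pairwise wins. It is an assumption-laden toy: where Agentar-Scale-SQL is described as using an RL-trained selector, the `toy_judge` here simply prefers the candidate whose execution result is shared by more candidates in the pool. The function names, the example table, and the candidate queries are all made up for the example.

```python
import sqlite3
from collections import Counter
from itertools import combinations
from typing import Callable, List, Optional, Tuple


def execute(conn: sqlite3.Connection, sql: str) -> Optional[Tuple]:
    """Return an order-insensitive, hashable result, or None if the SQL fails."""
    try:
        return tuple(sorted(conn.execute(sql).fetchall()))
    except sqlite3.Error:
        return None


def round_robin_select(candidates: List[str], judge: Callable[[str, str], bool]) -> str:
    """Compare every pair once; judge(a, b) is True when a is preferred (or tied)."""
    wins = [0] * len(candidates)
    for i, j in combinations(range(len(candidates)), 2):
        winner = i if judge(candidates[i], candidates[j]) else j
        wins[winner] += 1
    best = max(range(len(candidates)), key=lambda k: wins[k])
    return candidates[best]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bob", 45), ("Cy", 52)])

candidates = [
    "SELECT COUNT(*) FROM singer WHERE age > 40",
    "SELECT COUNT(name) FROM singer WHERE age > 40",
    "SELECT COUNT(*) FROM singers WHERE age > 40",  # fails: no such table
    "SELECT COUNT(*) FROM singer WHERE age >= 30",
]
results = {sql: execute(conn, sql) for sql in candidates}
support = Counter(r for r in results.values() if r is not None)


def toy_judge(a: str, b: str) -> bool:
    # Hypothetical stand-in for a learned pairwise selector:
    # prefer the candidate whose result more candidates agree on.
    return support.get(results[a], 0) >= support.get(results[b], 0)


winner = round_robin_select(candidates, toy_judge)
print(winner)
```

A round-robin needs a number of comparisons quadratic in the pool size, which is one reason selection cost matters when many generators run in parallel.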

Execution feedback drives self-correction and preference optimization across approaches. For instance, SQL-of-Thought (Chaturvedi et al., 30 Aug 2025) and ExeSQL (Zhang et al., 22 May 2025) advocate taxonomy-driven correction plans and DPO-based preference learning guided strictly by real execution environments, not manual annotations.
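A minimal sketch of such an execution-driven correction loop is given below, using Python's built-in `sqlite3` module as the execution environment. The `toy_fixer` is a rule-based placeholder for what would be an LLM correction agent, and the names `try_execute` and `refine_loop` are illustrative rather than drawn from any cited system.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def try_execute(conn: sqlite3.Connection, sql: str) -> Tuple[Optional[list], Optional[str]]:
    """Run the SQL and return (rows, None) on success or (None, error message) on failure."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


def refine_loop(
    conn: sqlite3.Connection,
    sql: str,
    fixer: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, Optional[list], List[Tuple[str, Optional[str]]]]:
    """Execute, and on failure hand the error to the fixer, for up to max_rounds repairs."""
    history: List[Tuple[str, Optional[str]]] = []
    for round_index in range(max_rounds + 1):
        rows, error = try_execute(conn, sql)
        history.append((sql, error))
        if error is None:
            return sql, rows, history
        if round_index < max_rounds:
            sql = fixer(sql, error)
    return sql, None, history


def toy_fixer(sql: str, error: str) -> str:
    # Placeholder for an LLM-based correction agent: it only handles a
    # pluralised table name reported by SQLite as "no such table: <name>".
    prefix = "no such table: "
    if error.startswith(prefix):
        bad_name = error[len(prefix):]
        return sql.replace(bad_name, bad_name.rstrip("s"))
    return sql


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bob", 45), ("Cy", 52)])

final_sql, rows, history = refine_loop(
    conn, "SELECT name FROM singers WHERE age > 40 ORDER BY name", toy_fixer
)
print(final_sql, rows)
```

This loop only catches errors the database itself reports. Queries that execute but return the wrong rows need a separate verification signal, such as the result checks described above.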

3. Formalization and Learning Frameworks

Several systems cast agentic Text-to-SQL as sequential decision-making under uncertainty, leveraging RL or preference-based learning to close the gap between LLM text generation and executable semantics:

MDP/POMDP Modeling: MTSQL-R1 (Guo et al., 12 Oct 2025) and AGRO-SQL (Yang et al., 29 Dec 2025) define the system state as a tuple comprising dialogue history, schema, candidate SQL, memory, and execution feedback, with actions corresponding to agent “moves” (propose, execute, verify, correct). The optimization objective is the expected cumulative reward over SQL–NL trajectories, with terminal rewards based on exact execution match.
Group-Relative Policy Optimization (GRPO): Both MARS-SQL (Yang et al., 2 Nov 2025) and AGRO-SQL (Yang et al., 29 Dec 2025) utilize GRPO to stabilize RL under sparse rewards, using the group-average trajectory return as the baseline in policy gradients:

$J_{GRPO}(\theta) = \mathbb{E}_{\tau_{1:N}\sim\pi_\theta}\left[\sum_{i=1}^N (R(\tau_i)-\bar{R})\,\log\pi_\theta(\tau_i)\right]$

Reinforcement Learning with Execution Feedback (RLEF/GRPO): MATS (Hoang et al., 21 Dec 2025), Agentar-Scale-SQL (Wang et al., 29 Sep 2025), and ExeSQL (Zhang et al., 22 May 2025) all ground their policy improvement loops directly in execution-based reward signals, eschewing human labels in favor of binary or continuous metrics derived from query result sets.
Preference Optimization/DPO: ExeSQL (Zhang et al., 22 May 2025) and MATS (Hoang et al., 21 Dec 2025) employ direct preference optimization, updating policies to increase the probability of producing executable, correct SQL over invalid or near-miss alternatives.
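The group-relative baseline in the GRPO expression above can be worked through numerically. The sketch below follows only the simplified form written in this section, with advantage equal to return minus the group mean. Published GRPO variants typically add further terms, such as normalisation by the group standard deviation, clipped probability ratios, and a KL penalty, which are omitted here. The rewards and log-probabilities are made-up values, with a reward of 1.0 standing for an execution match.

```python
from statistics import mean
from typing import List, Sequence


def group_relative_advantages(rewards: Sequence[float]) -> List[float]:
    """Subtract the group-average return from each trajectory's return."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def grpo_surrogate(rewards: Sequence[float], logprobs: Sequence[float]) -> float:
    """Simplified objective: sum over the group of (R_i - mean R) * log pi(tau_i)."""
    advantages = group_relative_advantages(rewards)
    return sum(a * lp for a, lp in zip(advantages, logprobs))


# Four sampled SQL trajectories for one question (illustrative numbers only).
rewards = [1.0, 0.0, 0.0, 1.0]       # 1.0 = executed and matched the reference result
logprobs = [-2.0, -1.0, -3.0, -4.0]  # log-probability of each trajectory under the policy

advantages = group_relative_advantages(rewards)
objective = grpo_surrogate(rewards, logprobs)
print(advantages, objective)
```

Because the baseline is the group mean, the advantages sum to zero. A group in which every sample succeeds, or every sample fails, therefore contributes no gradient, which is why sampling several diverse candidates per question matters under sparse execution rewards.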

4. Empirical Performance, Robustness, and Scalability

Agentic Text-to-SQL systems consistently surpass monolithic baselines and, in several cases, approach or match SOTA execution accuracy on standard benchmarks (Spider, BIRD, DataBench):

System	Benchmark	Model Scale	Execution Accuracy (EX%)
SQL-of-Thought	Spider-dev	GPT-4/Claude 3 Opus	91.6
AGENTIQL	Spider-test	Qwen2.5-14B	86.07
Agentar-Scale-SQL	BIRD-test	Intrinsic+ICL+RL	81.67
MARS-SQL	BIRD-dev	7B, RL+Verifier	77.84
MATS	Spider-dev	9B (SLMs)	87.1
Datalake Agent	RelBench (319)	GPT-4o Mini	60 (vs. 40 for baseline)

These results are achieved while providing additional benefits:

Interpretability: Transparent intermediate outputs (planning steps, clause decomposition, justification traces) enable auditability (Chaturvedi et al., 30 Aug 2025, Heidari et al., 12 Oct 2025).
Robustness: Dynamic error correction loops, guided error taxonomies, and in-context verifiers deliver material reductions in catastrophic errors on syntactic/semantic variants, out-of-domain evaluation, and multi-turn dialogue consistency (Chaturvedi et al., 30 Aug 2025, Guo et al., 12 Oct 2025, Hoang et al., 21 Dec 2025).
Scalability: Agentic paradigms facilitate parallel subtask execution, test-time scaling, cost-aware inference, and hardware-specific scheduling (see HEXGEN-TEXT2SQL (Peng et al., 8 May 2025), Datalake Agent (Jehle et al., 16 Oct 2025)).
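For readers unfamiliar with the execution accuracy (EX) figures reported in the table above, the following sketch shows one common way such a metric can be computed: a prediction counts as correct when it executes and returns the same rows as the reference query. The comparison used here (multisets of rows, ignoring order) is an assumption. Individual benchmarks define their own comparison rules, and the example data is invented.

```python
import sqlite3
from collections import Counter
from typing import List, Optional, Tuple


def run(conn: sqlite3.Connection, sql: str) -> Optional[Counter]:
    """Return the result rows as a multiset, or None if execution fails."""
    try:
        return Counter(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def execution_accuracy(conn: sqlite3.Connection, pairs: List[Tuple[str, str]]) -> float:
    """Percentage of (predicted, gold) pairs whose executed results agree."""
    hits = 0
    for predicted_sql, gold_sql in pairs:
        predicted_rows = run(conn, predicted_sql)
        gold_rows = run(conn, gold_sql)
        if predicted_rows is not None and predicted_rows == gold_rows:
            hits += 1
    return 100.0 * hits / len(pairs)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 30), ("Bob", 45), ("Cy", 52)])

pairs = [
    # Different text, same result: counted as correct.
    ("SELECT name FROM singer WHERE age > 40", "SELECT name FROM singer WHERE age >= 41"),
    ("SELECT COUNT(*) FROM singer", "SELECT COUNT(name) FROM singer"),
    # Fails to execute: counted as incorrect.
    ("SELECT name FROM singers", "SELECT name FROM singer"),
    # Executes but returns different rows: counted as incorrect.
    ("SELECT name FROM singer WHERE age > 30", "SELECT name FROM singer WHERE age >= 30"),
]
ex = execution_accuracy(conn, pairs)
print(f"EX = {ex:.1f}%")
```

Because EX compares results rather than query text, it credits semantically equivalent rewrites. It can also credit a wrong query that happens to return the right rows on a particular database instance.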

5. Practical Considerations: Scheduling, Cost, Privacy, and Production Adaptation

Agentic workflows introduce unique opportunities and challenges for deployment:

Scheduling and Latency Control: HEXGEN-TEXT2SQL (Peng et al., 8 May 2025) designs a two-level scheduler to map interdependent LLM inference tasks to heterogeneous GPU clusters, maximizing SLO compliance and throughput (reducing latency deadlines by up to 1.67× and improving throughput by up to 1.75× relative to vLLM).
Prompt/Token Cost Reduction: Datalake Agent (Jehle et al., 16 Oct 2025) reduces prompt size by up to 87% via interactive schema fetch loops, lowering LLM API cost while maintaining or improving accuracy as database scale grows.
Data Privacy: Segmented agent frameworks (CSMA (Wu et al., 2024)) partition schema knowledge and only share minimal relevant fragments during query generation and verification, ensuring no agent ever needs to expose its entire schema.
Small-LLM Deployment: MATS (Hoang et al., 21 Dec 2025) achieves large-LLM-level performance using agentic specialization and execution feedback alignment on SLMs, enabling in-house, privacy-aware, and resource-constrained Text-to-SQL deployments.
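The prompt-size argument behind interactive schema fetching can be sketched as follows. This is not the Datalake Agent implementation. It is a toy under stated assumptions: the agent first sees only table names, a hypothetical `pick_tables` callback stands in for the LLM's decision about which tables to inspect, and column details are then fetched only for those tables. SQLite's `sqlite_master` table and `PRAGMA table_info` provide the metadata.

```python
import sqlite3
from typing import Callable, List


def list_tables(conn: sqlite3.Connection) -> List[str]:
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    return [row[0] for row in rows]


def describe_table(conn: sqlite3.Connection, table: str) -> str:
    # PRAGMA arguments cannot be bound as parameters, so validate the name first.
    if table not in list_tables(conn):
        raise ValueError(f"unknown table: {table}")
    columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
    # Each row is (cid, name, type, notnull, dflt_value, pk).
    return f"{table}(" + ", ".join(f"{c[1]} {c[2]}" for c in columns) + ")"


def full_schema_prompt(conn: sqlite3.Connection) -> str:
    """Baseline: put every table description into the prompt."""
    return "\n".join(describe_table(conn, t) for t in list_tables(conn))


def fetched_schema_prompt(
    conn: sqlite3.Connection, pick_tables: Callable[[List[str]], List[str]]
) -> str:
    """Interactive variant: show table names first, describe only the chosen ones."""
    chosen = pick_tables(list_tables(conn))
    return "\n".join(describe_table(conn, t) for t in chosen)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.execute("CREATE TABLE concert (venue TEXT, year INTEGER)")
conn.execute("CREATE TABLE ticket (concert_id INTEGER, price REAL)")

full = full_schema_prompt(conn)
# Hypothetical selection for a question about singers only.
fetched = fetched_schema_prompt(conn, lambda names: [n for n in names if n == "singer"])
print(len(full), len(fetched))
print(fetched)
```

The saving grows with the number of irrelevant tables, at the price of extra round trips between the agent and the database catalogue.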

6. Limitations and Future Directions

Despite performance gains, agentic Text-to-SQL systems face several open challenges:

Latency and Complexity: Multi-stage agentic pipelines entail higher computational costs and increased inference time relative to single-pass models. Strategies such as adaptive routing (AGENTIQL (Heidari et al., 12 Oct 2025)) and orchestrated scaling (Agentar-Scale-SQL (Wang et al., 29 Sep 2025)) offer partial mitigation.
Generalization to Massive/Non-Relational Schemas: While current benchmarks remain manageable (≤ 300 tables), true industrial-scale DBs may pose retrieval and reasoning bottlenecks; modularity and tree-structured agent topologies (Squrve (Wang et al., 28 Oct 2025)) are promising but require further validation.
Complex/Nested Query Patterns: Complex subqueries, deeply nested joins, and advanced SQL dialect features remain nontrivial, necessitating either additional agent classes (e.g., Subquery Agent) or recursive ReAct-style planning.
Dynamic, Autonomous Agent Control: Research is progressing toward test-time scaling, autonomous agent pipelines, and meta-learning for prompt/controller optimization (Wang et al., 29 Sep 2025, Yang et al., 29 Dec 2025).

Agentic Text-to-SQL frameworks thus constitute a modular, extensible paradigm that leverages LLM strengths in reasoning, interaction, and tool use under explicit control structures, yielding transparent and robust systems for NL2SQL generation across diverse, scalable, and operational settings.