Natural Language to SQL
- Natural Language to SQL is a set of computational methods that map natural language queries to executable SQL, enabling both experts and non-experts to access relational databases.
- Modern NL2SQL systems leverage encoder-decoder architectures, pre-trained language models, and retrieval-augmented techniques to enhance query accuracy and handle varied schema complexities.
- Advanced approaches employ task decomposition, reinforcement learning, and agentic modular systems to improve execution reliability, efficiency, and interpretability in real-world applications.
Natural Language to SQL (NL2SQL) denotes computational techniques and systems that translate natural language (NL) questions into executable SQL queries. Modern NL2SQL empowers both technical and non-technical users to access structured data in relational databases, removing the necessity for explicit knowledge of SQL or database internals. The field has evolved rapidly, enabled by advances in deep learning, pre-trained language models (PLMs), and large language models (LLMs), and now encompasses a diverse ecosystem of open-source toolkits, evaluation benchmarks, agentic systems, and hybrid solutions that address a variety of practical, performance, and interpretability requirements.
1. Formal Problem Definition and Task Landscape
The canonical NL2SQL task is to learn a function f that maps a natural language question Q and a database schema S to an SQL query Y = f(Q, S), such that executing Y on the underlying database D yields the correct answer.
Given gold-standard triples (Q, S, Y*), the objective can be formulated as maximizing exact-set match accuracy, logical-form (component-wise) accuracy, or execution accuracy (a correct result set). Modern formulations require generalization across unseen schemas, schemas with complex relations, and multilingual databases, as well as robustness to ambiguous or paraphrased questions (Li et al., 2024).
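The execution-accuracy objective can be made concrete with a minimal check against SQLite. The `emp` table and the queries below are toy illustrations, not drawn from any benchmark:

```python
import sqlite3

def execution_accuracy(pred_sql: str, gold_sql: str, conn: sqlite3.Connection) -> bool:
    """EX: a prediction is correct iff its result set matches the gold
    query's result set on the same database; invalid SQL never matches."""
    try:
        pred_rows = set(conn.execute(pred_sql).fetchall())
    except sqlite3.Error:
        return False
    gold_rows = set(conn.execute(gold_sql).fetchall())
    return pred_rows == gold_rows

# Toy database: two employees, one over 30.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, age INT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", [("Ada", 36), ("Ben", 28)])

# Structurally different SQL can still be execution-equivalent,
# which is exactly what EX (unlike exact-set match) rewards.
assert execution_accuracy(
    "SELECT name FROM emp WHERE age >= 30",
    "SELECT name FROM emp WHERE age > 29",
    conn,
)
```

Note that EX is strictly weaker than exact-set match: two queries with different structure can coincide on one database instance yet diverge on another.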
NL2SQL research now routinely distinguishes between:
- Zero-shot and few-shot NL2SQL: Models must generalize to schemas and question types unseen during training (Gu et al., 2023).
- Schema-aware and table-aware NL2SQL: Systems that model not just column names, but also data types, cell values, and table-level relationships (Sun et al., 2020, Huang et al., 2021, Onyango et al., 25 Feb 2026).
- Workload-adaptive NL2SQL: Pipelines tuned to the recurrent query patterns and join-paths occurring in the actual query workload of a deployed database (Vaidya et al., 29 May 2025).
2. Core Methodologies and Pipeline Architectures
The dominant NL2SQL methodologies have progressed through several paradigms:
Encoder–Decoder and Sketch-Based Systems
Early neural approaches encode NL and schema representations to produce a SQL “sketch” (high-level structure), then predict components (SELECT, FROM, WHERE, ...) via slot-filling or classification (Pal et al., 2020, Sun et al., 2020). Hierarchical multi-task frameworks, such as MLEG-SQL, employ dedicated expert networks for each structural subtask, mitigating negative transfer among conflicting classification subtasks (Hao et al., 2023).
Table-aware systems introduce additional pointer or matching mechanisms for effective value normalization—a necessity for real-world queries where synonyms and paraphrases dominate condition values (Sun et al., 2020).
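The sketch-then-fill idea can be sketched as a data structure whose slots a real system would populate with per-component classifiers; the `SQLSketch` class, its fields, and the `singer` example below are illustrative:

```python
from dataclasses import dataclass, field

@dataclass
class SQLSketch:
    """High-level SQL skeleton. In a sketch-based NL2SQL system, separate
    predictors fill each slot (aggregate/column classifiers for SELECT,
    column/operator/value predictors for WHERE) rather than decoding
    free-form SQL token by token."""
    select_cols: list = field(default_factory=list)   # (agg, column) pairs; agg may be ""
    where_conds: list = field(default_factory=list)   # (column, op, value) triples
    table: str = ""

    def to_sql(self) -> str:
        sel = ", ".join(f"{a}({c})" if a else c for a, c in self.select_cols)
        sql = f"SELECT {sel} FROM {self.table}"
        if self.where_conds:
            sql += " WHERE " + " AND ".join(
                f"{c} {op} {v!r}" for c, op, v in self.where_conds
            )
        return sql

sketch = SQLSketch(
    select_cols=[("COUNT", "*")],
    where_conds=[("country", "=", "France")],
    table="singer",
)
assert sketch.to_sql() == "SELECT COUNT(*) FROM singer WHERE country = 'France'"
```

Because each slot is predicted independently, errors localize to components, which is what makes logical-form (component-wise) evaluation natural for these systems.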
Pre-trained LM and Retrieval-Augmented Generation (RAG)
PLM-driven models (e.g., RoBERTa, ELECTRA, T5) encode questions and schema, but are often augmented with retrieval mechanisms (BM25, SBERT, embedding stores) that supply context from the schema, documentation, or historical queries. RAG systems (Blar-SQL, DeKeySQL, TailorSQL) explicitly retrieve relevant schema chunks, query hints, and domain-specific join-paths within the LLM generation loop (Domínguez et al., 2024, Chen et al., 18 Sep 2025, Vaidya et al., 29 May 2025).
Task decomposition—splitting the overall mapping into subtasks for schema-linking, entity/keyword extraction, or AST-guided subproblem solving—is widely adopted to improve robustness and reduce the semantic gap between NL and SQL (Domínguez et al., 2024, Liao et al., 3 Apr 2025).
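The retrieval step common to these RAG pipelines can be sketched with a simple token-overlap scorer standing in for BM25 or SBERT embeddings; the schema chunks and the scoring function are illustrative assumptions:

```python
import re

def _tokens(text: str) -> set[str]:
    """Lowercased word tokens, punctuation stripped."""
    return set(re.findall(r"\w+", text.lower()))

def retrieve_schema_chunks(question: str, chunks: list[str], k: int = 2) -> list[str]:
    """Rank schema/documentation chunks by token overlap with the question
    and keep the top k. A real pipeline would use BM25 or an embedding
    store, but the control flow — score, rank, truncate, then prompt —
    is the same."""
    q = _tokens(question)
    return sorted(chunks, key=lambda c: -len(q & _tokens(c)))[:k]

schema_chunks = [
    "table orders: id, customer_id, total, created_at",
    "table customers: id, name, country",
    "table products: id, sku, price",
]
top = retrieve_schema_chunks("total orders per customer country", schema_chunks)
# Only the retrieved chunks enter the LLM prompt, pruning irrelevant schema.
```

Only `top` is placed in the generation context, which is how RAG systems keep prompts small on databases whose full schema would not fit (or would distract) the model.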
LLM and Prompt-Based Methods
With the advent of LLMs, prompting and demonstration-based conditioning (PURPLE, DAIL-SQL, DIN-SQL, SuperSQL) have supplanted, or coexist with, fine-tuning. Strategically retrieved demonstration sets guide the LLM in complex logical operator composition, while specialized prompt construction handles schema pruning and paraphrasing (Ren et al., 2024, Li et al., 2024).
Execution-guided selection or self-consistency voting over multiple LLM generations is common, as is in-context correction (cycle-based or agentic loops) to handle failure or error-prone predictions (Fan et al., 2024, Jehle et al., 16 Oct 2025).
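Execution-guided self-consistency can be sketched in a few lines: execute every candidate, discard those that fail, and vote by result set. The toy table `t` and the candidate list are illustrative:

```python
import sqlite3
from collections import Counter

def self_consistency_pick(candidates: list[str], conn: sqlite3.Connection) -> str:
    """Execute each candidate and vote: the result set produced by the most
    candidates wins; unexecutable SQL is filtered out. Falls back to the
    top-ranked candidate if nothing executes."""
    results = {}
    for sql in candidates:
        try:
            results[sql] = frozenset(conn.execute(sql).fetchall())
        except sqlite3.Error:
            continue  # execution-guided filtering of invalid SQL
    if not results:
        return candidates[0]
    winner = Counter(results.values()).most_common(1)[0][0]
    return next(sql for sql, r in results.items() if r == winner)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INT)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

best = self_consistency_pick(
    [
        "SELECT COUNT(*) FROM t",   # result (3,)
        "SELECT COUNT(x) FROM t",   # also (3,) — votes with the first
        "SELECT SUM(x) FROM t",     # minority result (6,)
        "SELEC broken",             # discarded: fails to execute
    ],
    conn,
)
```

Here the two COUNT variants agree on the result set and outvote the SUM candidate, so `best` is the first COUNT query.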
Agentic and Modular Systems
In enterprise and large-schema scenarios, agentic NL2SQL solutions reduce token usage and cost by selectively fetching schema fragments through an interactive LLM loop instead of monolithic context windows (Jehle et al., 16 Oct 2025, Onyango et al., 25 Feb 2026). Modularization and fallback routing between small (on-prem) LMs and LLMs improve privacy and cost-efficiency at some cost to execution accuracy (Onyango et al., 25 Feb 2026).
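The interactive fetch loop behind these agentic systems can be sketched as follows; `toy_agent`, the `catalog`, and the action format are hypothetical stand-ins for an LLM policy:

```python
def agentic_schema_loop(question, catalog, agent_step, max_turns=5):
    """Interactive loop: instead of placing the whole schema in the prompt,
    the agent requests table definitions one at a time, so token usage grows
    with the tables it actually inspects rather than with schema size."""
    context = []  # only fetched fragments ever enter the prompt
    for _ in range(max_turns):
        action = agent_step(question, context)  # stubbed LLM policy
        if action["type"] == "fetch":
            context.append(catalog[action["table"]])
        else:
            return action["sql"], context
    return None, context  # turn budget exhausted

catalog = {
    "emp": "emp(name TEXT, dept TEXT)",
    "dept": "dept(id INT, name TEXT)",
}

def toy_agent(question, context):
    """Hypothetical policy: fetch the one relevant table, then answer."""
    if not context:
        return {"type": "fetch", "table": "emp"}
    return {"type": "answer", "sql": "SELECT name FROM emp WHERE dept = 'sales'"}

sql, fetched = agentic_schema_loop("who works in sales?", catalog, toy_agent)
```

On a schema with hundreds of tables, the agent here would still fetch only the fragments it needs, which is where the reported token and cost savings come from.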
3. Advances in System Design: Decomposition, Task Specialization, and Data Engineering
Task Decomposition and AST Guidance
Sophisticated NL2SQL models (e.g., LearNAT) leverage task decomposition and AST-guided margin-based reinforcement learning. Instead of direct sequence mapping, these systems decompose input into subtasks aligned to Abstract Syntax Tree (AST) components of the target SQL, guide component-level choices by tree similarity, and reinforce correct behavior via preference optimization using the AST reward margin (Liao et al., 3 Apr 2025). This design enables small open-source LLMs (e.g., Qwen2.5-7B) to approach GPT-4 performance on both Spider and BIRD benchmarks.
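The AST-margin idea can be illustrated with a crude structural similarity; real systems compare parsed syntax trees, whereas the keyword/identifier overlap below is only a stand-in, and the queries are toy examples:

```python
import re

def sql_components(sql: str) -> set[str]:
    """Crude stand-in for AST nodes: the query's keywords and identifiers."""
    return set(re.findall(r"\w+", sql.upper()))

def tree_similarity(pred: str, gold: str) -> float:
    """Jaccard similarity over components, approximating tree similarity."""
    a, b = sql_components(pred), sql_components(gold)
    return len(a & b) / len(a | b)

def ast_margin_reward(chosen: str, rejected: str, gold: str) -> float:
    """Margin-style preference signal: positive when the chosen candidate is
    structurally closer to the gold query than the rejected one, which is the
    quantity a preference-optimization step would push upward."""
    return tree_similarity(chosen, gold) - tree_similarity(rejected, gold)

gold = "SELECT name FROM emp WHERE age > 30"
margin = ast_margin_reward(
    "SELECT name FROM emp WHERE age > 25",  # right shape, wrong literal
    "SELECT COUNT(*) FROM emp",             # wrong shape entirely
    gold,
)
```

The margin is positive even though neither candidate is exactly correct, which is the point: structural closeness gives a denser training signal than a binary execution check.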
Reinforcement Learning and Execution-Based Reward
Supervised fine-tuning (SFT) does not directly optimize for execution correctness. Enhanced methods (SQL-R1) recast NL2SQL as a reinforcement learning problem, where the objective is to maximize expected execution-based rewards, including syntactic validity, execution correctness, semantic result accuracy, and reasoning trace completeness (Ma et al., 11 Apr 2025). GRPO (Group Relative Policy Optimization) and custom reward functions tailored for SQL further raise execution reliability.
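A composite execution-based reward in this spirit can be sketched as below; the weights (−1.0 / 0.1 / 1.0) and the toy `emp` table are illustrative choices, not the values used by any cited paper:

```python
import sqlite3

def sql_reward(pred_sql: str, gold_sql: str, conn: sqlite3.Connection) -> float:
    """Composite reward: penalize invalid SQL, give small credit for a query
    that executes, and full credit only when the predicted result set matches
    the gold result set. An RL loop (e.g., GRPO) would maximize the expected
    value of this signal over sampled generations."""
    try:
        pred = set(conn.execute(pred_sql).fetchall())
    except sqlite3.Error:
        return -1.0  # syntactic/execution failure
    gold = set(conn.execute(gold_sql).fetchall())
    if pred == gold:
        return 1.0   # execution correctness
    return 0.1       # valid but wrong: partial credit for executability

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emp (name TEXT, age INT)")
conn.executemany("INSERT INTO emp VALUES (?, ?)", [("Ada", 36), ("Ben", 28)])
```

Unlike a cross-entropy loss over gold tokens, this reward is nonzero for any executable query, so the policy receives graded feedback even when its output differs textually from the reference.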
Synthetic Data, Workload-Aware and Reverse-Generation
Synthetic data (e.g., SynSQL-2.5M) are used both for SFT and RL fine-tuning, enabling models to learn robustly across varying degrees of SQL complexity without costly annotation (Ma et al., 11 Apr 2025). In enterprise, SQLord employs reverse-generation from real workload SQL to natural language, automatically producing large high-fidelity <NL, SQL> pairs for domain-specific fine-tuning (Cheng et al., 14 Jul 2025). TailorSQL leverages both schema and historical workload data, combining prompt specialization with query-hint documents for improved accuracy and latency (Vaidya et al., 29 May 2025).
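Reverse generation can be sketched with templates; a production system would use an LLM to verbalize workload SQL, so the single regex template and question phrasing below are purely illustrative:

```python
import re

def reverse_generate(sql_log: list[str]) -> list[tuple[str, str]]:
    """Derive NL questions from workload SQL to build <NL, SQL> pairs for
    domain-specific fine-tuning. Only one COUNT template is handled here;
    a real system covers the workload's full query shapes."""
    pairs = []
    for sql in sql_log:
        m = re.match(
            r"SELECT COUNT\(\*\) FROM (\w+) WHERE (\w+) = '(\w+)'", sql, re.I
        )
        if m:
            table, col, val = m.groups()
            pairs.append((f"How many {table} have {col} {val}?", sql))
    return pairs

pairs = reverse_generate(["SELECT COUNT(*) FROM orders WHERE status = 'open'"])
```

The resulting pairs inherit the deployment's actual tables, join-paths, and value distributions, which is exactly what generic academic training data lacks.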
4. Robustness, Ambiguity, and Interpretability
Ambiguity Management and Multicandidate Recommendation
NL2SQL systems deployed in enterprise settings face ambiguous schemas with overlapping table and column names. ODIN generates diverse sets of SQL candidates through controlled schema masking and prioritizes the most relevant via conformal prediction, dynamically adjusting the number of presented candidates to the measured ambiguity. Explicit user feedback closes the personalization loop, biasing future recommendations and substantially improving correct-query-in-set rates (Vaidya et al., 25 May 2025).
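The core adaptivity of conformal-style candidate selection can be shown with a thresholded set; the scores and threshold below are illustrative (in practice the threshold is calibrated on held-out feedback):

```python
def candidate_set(scored_candidates: list[tuple[str, float]],
                  threshold: float) -> list[str]:
    """Conformal-style selection: present every candidate whose score clears
    a calibrated threshold. The set grows automatically when the question is
    ambiguous (many near-ties) and shrinks to one query when it is not."""
    return [sql for sql, score in scored_candidates if score >= threshold]

# Ambiguous question: two schema interpretations score almost equally.
ambiguous = [
    ("SELECT a FROM t1", 0.48),
    ("SELECT a FROM t2", 0.45),
    ("SELECT b FROM t1", 0.07),
]
# Clear question: one interpretation dominates.
clear = [
    ("SELECT a FROM t1", 0.92),
    ("SELECT a FROM t2", 0.05),
]
```

With a threshold of 0.2, the ambiguous question yields a two-candidate set while the clear one yields a single query, so the user is consulted only when the system is genuinely unsure.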
Self-Evaluation and Query Rewriting
CycleSQL introduces an iterative framework for NL2SQL self-correction: each candidate SQL is augmented with data-grounded NL explanations, and an NLI verifier checks entailment between explanation and original NL. This self-feedback loop improves both execution accuracy and user interpretability, with safe fallback to the top candidate in the absence of entailment (Fan et al., 2024). REWRITER is a plug-and-play module that diagnoses and rephrases problematic NL inputs before SQL generation, improving accuracy across black-box NL2SQL models (Ma et al., 2024).
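The explain-then-verify loop can be sketched with stubbed components; the lambda explanation generator and keyword-based "NLI verifier" below are toy stand-ins (a real system grounds explanations in the query's actual result data and uses a trained entailment model):

```python
def cycle_select(question, candidates, explain, entails):
    """Iterate over ranked candidates: produce an NL explanation for each
    SQL's result and accept the first whose explanation the verifier judges
    to entail the question; otherwise fall back to the top candidate."""
    for sql in candidates:
        explanation = explain(sql)            # data-grounded NL explanation
        if entails(explanation, question):    # NLI entailment check
            return sql
    return candidates[0]                      # safe fallback

question = "how many singers are from France?"
candidates = [
    "SELECT name FROM singer",                                # wrong intent
    "SELECT COUNT(*) FROM singer WHERE country = 'France'",   # matches
]
explain = lambda sql: ("counts French singers" if "COUNT" in sql
                       else "lists all singer names")
entails = lambda explanation, q: "counts" in explanation and "how many" in q

chosen = cycle_select(question, candidates, explain, entails)
```

The top-ranked candidate is rejected because its explanation ("lists all singer names") does not entail a counting question, so the verified second candidate is returned instead.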
Efficiency, Cost, and Privacy
Agentic paradigms (Datalake Agent) reduce the token footprint and LLM call cost by up to 87% via interactive reasoning and metadata selective fetches (Jehle et al., 16 Oct 2025). Hybrid SLM/LLM architectures resolve the majority of queries locally, only incurring LLM API cost for the most difficult or ambiguous queries, resulting in over 90% cost savings at a moderate loss in maximum attainable accuracy (Onyango et al., 25 Feb 2026). Data-agnostic models deliberately avoid using table cell values to guarantee privacy and federated deployability (Pal et al., 2020).
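The SLM/LLM routing pattern reduces to a confidence gate; the stub models, the word-count heuristic, and the 0.8 threshold below are illustrative assumptions:

```python
def route_query(question, slm, llm, confidence_threshold=0.8):
    """Hybrid routing: answer with the local small model when its confidence
    clears the threshold; escalate to the remote LLM (and pay its API cost)
    only for hard or ambiguous questions."""
    sql, confidence = slm(question)
    if confidence >= confidence_threshold:
        return sql, "slm"
    return llm(question), "llm"

# Stub models: the SLM is confident only on short, simple questions.
slm = lambda q: ("SELECT * FROM t", 0.9 if len(q.split()) < 6 else 0.3)
llm = lambda q: "SELECT t.a FROM t JOIN u ON t.id = u.id"

sql, tier = route_query("show all rows", slm, llm)  # handled locally
```

Because most production questions are simple, the local tier absorbs the bulk of the traffic, which is where the reported >90% cost savings come from; the threshold trades those savings against the accuracy ceiling.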
5. Benchmarks, Evaluation, and Comparative Assessments
NL2SQL systems are evaluated on benchmarks drawn from both English (Spider, BIRD, WikiSQL, ScienceBenchmark, KaggleDBQA) and non-English corpora (TableQA Chinese corpus). Key metrics are:
- Execution Accuracy (EX): Main metric—the fraction of outputs whose SQL results match the gold reference on execution.
- Exact-Set and Logical-Form Match (EM): Stronger form requiring set-level structural and semantic equivalence.
- Valid Efficiency Score (VES): Weights correct predictions by relative runtime efficiency, so correct queries that run faster than the gold query score higher (Li et al., 2024).
- Query Variance Testing (QVT): Robustness to paraphrase in input.
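The VES metric can be sketched as follows; this is an illustrative form of the runtime-weighted score (the exact weighting in the BIRD benchmark may differ), and the timing records are made up:

```python
import math

def ves(records: list[tuple[bool, float, float]]) -> float:
    """Valid Efficiency Score, illustrative form. Each record is
    (correct, gold_runtime_ms, pred_runtime_ms). Incorrect predictions
    contribute 0; correct ones contribute sqrt(gold/pred), rewarding
    queries that are both right and fast."""
    total = 0.0
    for correct, gold_ms, pred_ms in records:
        if correct:
            total += math.sqrt(gold_ms / pred_ms)
    return total / len(records)

score = ves([
    (True, 10.0, 10.0),   # correct, same speed as gold -> weight 1.0
    (True, 20.0, 5.0),    # correct and 4x faster       -> weight 2.0
    (False, 10.0, 1.0),   # fast but wrong              -> weight 0.0
])
```

Here `score` is 1.0: the efficiency bonus of the second query exactly offsets the incorrect third one, illustrating why VES can exceed plain execution accuracy on the same predictions.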
The NL2SQL360 evaluation suite enables fine-grained comparisons across complexity, SQL structural phenomena, domain shifts, and QVT (Li et al., 2024). Modular hybrid pipelines like SuperSQL, which combine schema-linking, prompt selection, greedy LLM decoding, and post-processing voting, achieve leading EX rates (87% on Spider, 62.66% on BIRD).
Performance for recent state-of-the-art methods (Spider/BIRD, EX%) is summarized below:
| Method | Model | Spider-Dev | BIRD-Dev |
|---|---|---|---|
| SQL-R1 | Qwen2.5-7B | 87.6 | 66.6 |
| LearNAT | Qwen2.5-7B | 86.4 | 58.1 |
| PURPLE | ChatGPT/GPT-4 | 84.8/87.8 | — |
| Blar-SQL | Llama2+CodeLlama | 46.7 | 46.7 |
| TailorSQL | GPT-4 | 57.5 | 61.0 |
| SuperSQL | GPT-4 (hybrid) | 87.0 | 62.66 |
Reported numbers are from the cited works and may use different evaluation splits or data versions; see each paper for precise experimental setup.
6. Dataset Design, Multilingual, and Domain Adaptation
New datasets such as TableQA (Chinese) introduce value-paraphrase normalization and answerability subtasks (e.g., distinguishing truly unanswerable queries), raising the bar for entity linking and robustness to expression diversity such as abbreviations and transliterations; condition-value prediction accuracy drops from 95.1% on WikiSQL to 46.8% on TableQA for state-of-the-art baselines (Sun et al., 2020).
Workload-adaptive and reverse-generation approaches (TailorSQL, SQLord) leverage organization-specific usage patterns for domain adaptation, enabling significant performance lifts in practical, non-academic deployment scenarios (Vaidya et al., 29 May 2025, Cheng et al., 14 Jul 2025). NL2SQL is now routinely adopted in low-code platforms via continuous data collection and A/B-tested deployment loops (Aparicio et al., 2023).
7. Limitations, Open Challenges, and Future Directions
Despite substantial progress, critical challenges persist:
- Interpretability and Trust: Systems must associate each SQL fragment to its NL source span and provide natural-language justifications, with some proposals generating human-readable explanations or interactive debugging feedback (Fan et al., 2024, Li et al., 2024).
- Ambiguity and User Preference: Dynamic clarification, multicandidate recommendation, and feedback integration remain open areas for improving user trust and translation success (Vaidya et al., 25 May 2025).
- Robustness to Distributional Shifts and Schema Drift: Heavy reliance on benchmark data fails to capture the drift in real-world database usage and schema evolution, motivating workload-adaptive embedding and bandit-based abstention (Vaidya et al., 29 May 2025).
- Efficient, Privacy-Preserving, and Low-Latency Inference: Agent-routing, SLM/LLM hybridization, and federated model design are active research directions for accommodating cost, data privacy, and resource constraints (Onyango et al., 25 Feb 2026).
- Scalability: Handling of large and complex schemas (hundreds of tables/columns) without exceeding model context or truncating critical schema context is being addressed by chunked or interactive reasoning (Domínguez et al., 2024, Jehle et al., 16 Oct 2025).
Prospective advances include active learning with human validation, tighter integration of execution feedback, dynamic decomposition/classification policies, and domain-aware synthetic data augmentation to close the remaining performance and usability gaps (Ma et al., 11 Apr 2025, Liao et al., 3 Apr 2025, Vaidya et al., 29 May 2025).
References:
- (Li et al., 2024) The Dawn of Natural Language to SQL: Are We Fully Ready?
- (Domínguez et al., 2024) Blar-SQL: Faster, Stronger, Smaller NL2SQL
- (Ma et al., 11 Apr 2025) SQL-R1: Training Natural Language to SQL Reasoning Model By Reinforcement Learning
- (Fan et al., 2024) Grounding Natural Language to SQL Translation with Data-Based Self-Explanations
- (Vaidya et al., 25 May 2025) ODIN: A NL2SQL Recommender to Handle Schema Ambiguity
- (Chen et al., 18 Sep 2025) DeKeyNLU: Enhancing Natural Language to SQL Generation through Task Decomposition and Keyword Extraction
- (Vaidya et al., 29 May 2025) TailorSQL: An NL2SQL System Tailored to Your Query Workload
- (Liao et al., 3 Apr 2025) LearNAT: Learning NL2SQL with AST-guided Task Decomposition for LLMs
- (Onyango et al., 25 Feb 2026) An Agentic System for Schema Aware NL2SQL Generation
- (Jehle et al., 16 Oct 2025) Agentic NL2SQL to Reduce Computational Costs
- (Ma et al., 2024) A Plug-and-Play Natural Language Rewriter for Natural Language to SQL
- (Sun et al., 2020) TableQA: a Large-Scale Chinese Text-to-SQL Dataset for Table-Aware SQL Generation
- (Pal et al., 2020) Data Agnostic RoBERTa-based Natural Language to SQL Query Generation
- (Gu et al., 2023) Interleaving Pre-Trained LLMs and LLMs for Zero-Shot NL2SQL Generation
- (Ren et al., 2024) PURPLE: Making a LLM a Better SQL Writer
See individual papers for implementation specifics, ablation studies, and further experimental analyses.