Taxonomy-Guided Text-to-SQL Synthesis & Evaluation
- The paper presents a taxonomy-guided framework that leverages formally defined taxonomies for systematic dataset synthesis and targeted error correction in Text-to-SQL systems.
- It employs dual-path diversity expansion via LLM-driven slot filling and schema generation, ensuring comprehensive coverage of SQL constructs and complexity levels.
- The methodology enables precise error analysis with a multi-agent correction loop, achieving state-of-the-art execution accuracy on benchmark datasets.
Taxonomy-guided Text-to-SQL represents a principled approach to semantic parsing that leverages formally defined taxonomies for both dataset construction and model error correction. Two major lines of research exemplify this trend: SQL-of-Thought, which integrates taxonomy-awareness into model inference and iterative error correction (Chaturvedi et al., 30 Aug 2025), and the SQL-Synth/TaxoBench initiative, whose taxonomy-guided synthesis pipeline produces benchmarks reflecting the full logical and structural spectrum of real-world text-to-SQL usage (Wang et al., 17 Nov 2025). These frameworks serve as cornerstones for diagnosing, training, and evaluating the capabilities of LLM-powered text-to-SQL systems, with explicit coverage for errors, intents, SQL constructs, and domain adaptations.
1. Taxonomy Definitions in Text-to-SQL
Formally, the taxonomy used to guide Text-to-SQL evaluation and synthesis is defined as the four-component tuple:
$$(CT,\ ST,\ SS,\ KA)$$
where:
- CT (Core Intents): high-level semantic purposes (basic_query, condition_filtering, sorting_pagination, basic_aggregation, time_operation, format_transformation, set_operation, data_change, structure_change, distribution_analysis, advanced_statistics, trend_analysis, business_calculation, business_rule)
- ST (Statement Types): SQL statement categories (SELECT, INSERT, UPDATE, DELETE, ALTER)
- SS (Syntax Structures): SQL syntactic constructs (Where, OrderBy, LimitOffset, InnerJoin, CrossJoin, OuterJoin, GroupBy, Having, Union, Intersect, Except, ScalarSubquery, CorrelatedSubquery, CommonTableExpression)
- KA (Key Actions): functional capabilities or operator requirements (SpecificTime, WildcardFiltering, TimeFunction, JsonFunction, WindowFunction, StringFunction, Cast, ConditionJudgement, AggregateFunction)
Every text-to-SQL example receives an annotation that labels it along these four dimensions. Complexity is assigned by a function that maps each tag to a score; the scores are summed and the total is bucketed into simple, medium, or hard levels (Wang et al., 17 Nov 2025).
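The scoring step can be illustrated with a minimal sketch. The per-tag scores, the bucket thresholds, the choice to score only SS and KA tags, and the names `Annotation`, `complexity_score`, and `complexity_level` are all assumptions made for illustration; the source does not give the actual values.

```python
from dataclasses import dataclass

# Hypothetical per-tag scores; the source does not list the real ones.
TAG_SCORES = {
    "Where": 1, "OrderBy": 1, "InnerJoin": 2, "GroupBy": 2,
    "CorrelatedSubquery": 3, "WindowFunction": 3, "AggregateFunction": 1,
}
DEFAULT_SCORE = 1  # assumed score for any tag not listed above


@dataclass(frozen=True)
class Annotation:
    """One example's labels along the four taxonomy dimensions."""
    core_intent: str
    statement_type: str
    syntax_structures: frozenset = frozenset()
    key_actions: frozenset = frozenset()


def complexity_score(a: Annotation) -> int:
    # Assumption: only SS and KA tags contribute to the score.
    tags = a.syntax_structures | a.key_actions
    return sum(TAG_SCORES.get(tag, DEFAULT_SCORE) for tag in tags)


def complexity_level(score: int, medium_at: int = 3, hard_at: int = 6) -> str:
    # Hypothetical thresholds for the three buckets.
    if score >= hard_at:
        return "hard"
    if score >= medium_at:
        return "medium"
    return "simple"


example = Annotation(
    core_intent="basic_aggregation",
    statement_type="SELECT",
    syntax_structures=frozenset({"InnerJoin", "GroupBy"}),
    key_actions=frozenset({"AggregateFunction"}),
)
print(complexity_level(complexity_score(example)))  # medium (score 2 + 2 + 1 = 5)
```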
In error analysis, an orthogonal taxonomy captures failure modes spanning syntax, schema mapping, join logic, predicate construction, aggregation, value representation, subqueries, set operations, and structural oversights. Each is assigned a code E1–E9, with up to 31 fine-grained sub-categories (Chaturvedi et al., 30 Aug 2025).
2. Taxonomy-Guided Dataset Synthesis: SQL-Synth
SQL-Synth exemplifies taxonomy-driven dataset construction. The synthesis follows a complexity- and coverage-aware algorithmic process:
- Cartesian product of taxonomy dimensions defines all possible combinations, subject to semantic/syntactic validity constraints.
- Enhanced database schemas are generated from WikiSQL seed tables via LLM prompting and topological ordering to ensure relational consistency.
- Each valid taxonomy tuple receives up to five Spider-derived NL/SQL seed pairs, adapted by LLM “slot-filling” into newly generated schemas.
- Diversity expansion occurs along two axes:
  - SQL-oriented: For each seed SQL, 50 schemas are sampled and new SQL/NLQ pairs are generated.
  - NLQ-oriented: For each seed NLQ, 50 schemas are sampled and new NLQ/SQL pairs are generated.
- Execution- and semantic-based filters, as well as value-mapping augmentation, ensure diversity and correctness.
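The first step of the process above, enumerating taxonomy combinations and discarding invalid ones, might look roughly like the sketch below. It uses a small subset of the labels, takes one SS tag and one KA tag per combination, and applies three hand-written validity rules; those simplifications and the `is_valid` rules themselves are assumptions, not the constraints used in SQL-Synth.

```python
from itertools import product

# Small subsets of each taxonomy dimension, for illustration only.
CORE_INTENTS = ["basic_query", "basic_aggregation", "data_change", "structure_change"]
STATEMENT_TYPES = ["SELECT", "INSERT", "UPDATE", "DELETE", "ALTER"]
SYNTAX_STRUCTURES = ["Where", "GroupBy", "Having"]
KEY_ACTIONS = ["AggregateFunction", "StringFunction"]


def is_valid(ct: str, st: str, ss: str, ka: str) -> bool:
    """Hypothetical semantic/syntactic validity rules."""
    # structure_change pairs with ALTER, and only with ALTER.
    if (ct == "structure_change") != (st == "ALTER"):
        return False
    # data_change pairs with the data-modifying statements, and only those.
    if (ct == "data_change") != (st in {"INSERT", "UPDATE", "DELETE"}):
        return False
    # Grouping clauses are only meaningful in SELECT statements here.
    if ss in {"GroupBy", "Having"} and st != "SELECT":
        return False
    return True


all_combos = list(product(CORE_INTENTS, STATEMENT_TYPES, SYNTAX_STRUCTURES, KEY_ACTIONS))
valid_combos = [combo for combo in all_combos if is_valid(*combo)]
print(len(all_combos), len(valid_combos))
```

Each surviving tuple would then be the unit that receives seed pairs and schemas in the later steps.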
Explicitly, all taxonomy facets are guaranteed full coverage (1.00 on every dimension of the constructed SQL-Synth dataset, as the table below shows), with three complexity levels and extensively measured diversity and semantic clustering (Wang et al., 17 Nov 2025).
| Dataset | #DB | CoreIntent | StatementType | SyntaxStruct | KeyAction |
|---|---|---|---|---|---|
| Spider | 200 | 0.79 | 0.20 | 0.71 | 0.33 |
| Bird | 95 | 0.86 | 0.20 | 0.57 | 0.89 |
| SQL-Synth | 1250 | 1.00 | 1.00 | 1.00 | 1.00 |
This table illustrates the comprehensive coverage of SQL-Synth versus prior benchmarks.
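One plausible reading of the coverage figures is the fraction of a dimension's labels that appear at least once in a dataset. This definition is an assumption, but it is consistent with the table: a benchmark containing only SELECT statements would cover one of the five statement types, giving 0.20. A minimal sketch, with `coverage` as a hypothetical helper name:

```python
STATEMENT_TYPES = {"SELECT", "INSERT", "UPDATE", "DELETE", "ALTER"}


def coverage(observed_tags, taxonomy_tags) -> float:
    """Fraction of taxonomy labels that occur at least once in the data."""
    taxonomy = set(taxonomy_tags)
    return len(set(observed_tags) & taxonomy) / len(taxonomy)


# A dataset whose examples are all SELECT statements.
select_only = ["SELECT", "SELECT", "SELECT"]
print(coverage(select_only, STATEMENT_TYPES))  # 0.2
```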
3. Multi-Agent Taxonomy-Guided Text-to-SQL Models
The SQL-of-Thought architecture operationalizes taxonomy-guided reasoning in Text-to-SQL parsing and error correction. The system decomposes the process into these agents, each realized as an LLM prompt:
- Schema Linking Agent: Extracts relevant schema components.
- Subproblem Agent: Maps query logic to structured clause intents.
- Query Plan Agent: Produces a chain-of-thought execution plan (stepwise reasoning over the schema).
- SQL Agent: Renders SQL code from plan.
- Correction Plan Agent (taxonomy-guided): On any query-execution failure, uses the error taxonomy to classify the failure, then sketches a chain-of-thought correction plan.
- Correction SQL Agent: Generates a revised SQL statement based on the plan, explicitly directed to avoid previous error categories.
All intermediate artifacts, including taxonomy labels and error codes, are serialized and explicitly referenced in LLM prompts, enforcing a “factored” learning protocol and transparent correction loop (Chaturvedi et al., 30 Aug 2025).
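The generation half of this agent chain can be sketched as a sequence of prompt templates whose outputs are stored and passed on to later prompts. The prompt wording, the `run_pipeline` function, and the `llm` callable (any function from prompt text to response text) are assumptions for illustration, not the prompts or interfaces used in SQL-of-Thought.

```python
from typing import Callable, Dict

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out

AGENT_ORDER = ["schema_linking", "subproblem", "query_plan", "sql"]

# Hypothetical prompt templates; each may reference earlier artifacts by name.
PROMPTS: Dict[str, str] = {
    "schema_linking": "Question: {question}\nSchema: {schema}\n"
                      "List the relevant tables and columns.",
    "subproblem": "Question: {question}\nRelevant schema: {schema_linking}\n"
                  "Split the query into clause-level subproblems.",
    "query_plan": "Question: {question}\nSubproblems: {subproblem}\n"
                  "Write a step-by-step query plan.",
    "sql": "Relevant schema: {schema_linking}\nPlan: {query_plan}\n"
           "Write the SQL query.",
}


def run_pipeline(question: str, schema: str, llm: LLM) -> Dict[str, str]:
    """Run the agents in order, keeping every intermediate artifact."""
    artifacts = {"question": question, "schema": schema}
    for name in AGENT_ORDER:
        prompt = PROMPTS[name].format(**artifacts)
        artifacts[name] = llm(prompt)
    return artifacts


def placeholder_llm(prompt: str) -> str:
    # Stand-in for a real model call, so the sketch runs offline.
    return f"[response to a {len(prompt)}-character prompt]"


result = run_pipeline("How many singers are there?", "singer(name, age)", placeholder_llm)
print(sorted(result))
```

Keeping all artifacts in one dictionary makes it straightforward to log them or to hand them to the correction agents later.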
4. Taxonomy-Guided Correction and Error Analysis
The error taxonomy (E1–E9) informs both the targeting of LLM interventions and the interpretability of model outputs. When a generated query fails, the pipeline passes the failed SQL, error trace, and taxonomy codes to the Correction Plan Agent, which:
- Identifies the error class (e.g., “E5.1: Missing GROUP BY for aggregated column”).
- Produces an explicit repair strategy via CoT.
- Feeds this to the Correction SQL Agent, which is prompted to correct the SQL while avoiding recurrence of the tagged error.
This explicit multi-turn correction loop sharply increases accuracy. On Spider dev mini-batch with Claude 3 Opus, the taxonomy-guided loop achieves 95% Execution Accuracy (EA), a state-of-the-art result (Chaturvedi et al., 30 Aug 2025).
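The control flow of such a loop can be sketched with SQLite as the execution backend. The `classify` and `repair` callables stand in for the two correction agents; here they are simple rule-based stubs, and the category names they return are placeholders rather than the paper's E1–E9 codes. All function names are hypothetical.

```python
import sqlite3
from typing import Callable, List, Optional, Tuple


def try_execute(conn: sqlite3.Connection, sql: str) -> Tuple[Optional[list], Optional[str]]:
    """Return (rows, None) on success or (None, error message) on failure."""
    try:
        return conn.execute(sql).fetchall(), None
    except sqlite3.Error as exc:
        return None, str(exc)


def correction_loop(
    conn: sqlite3.Connection,
    sql: str,
    classify: Callable[[str, str], str],
    repair: Callable[[str, str, List[str]], str],
    max_rounds: int = 3,
) -> Tuple[str, List[str]]:
    """Retry up to max_rounds times; the returned SQL may still fail."""
    seen: List[str] = []
    for _ in range(max_rounds):
        _, error = try_execute(conn, sql)
        if error is None:
            break
        seen.append(classify(sql, error))   # tag the failure
        sql = repair(sql, error, seen)      # rewrite, given all tags so far
    return sql, seen


# Stub agents for the demo; a real system would prompt an LLM here.
def classify_stub(sql: str, error: str) -> str:
    return "schema_mapping" if "no such column" in error else "syntax"


def repair_stub(sql: str, error: str, seen: List[str]) -> str:
    return sql.replace("nme", "name")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
fixed_sql, tags = correction_loop(conn, "SELECT nme FROM singer", classify_stub, repair_stub)
print(fixed_sql, tags)
```

Passing the accumulated tags to `repair` mirrors the idea of instructing the correction step to avoid previously identified error categories.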
5. Evaluation of Model and Dataset Performance
SQL-Synth supports fine-grained benchmarks for both generalization and rare/complex query robustness. Example execution accuracy (EX) results on the SQL-Synth test split:
| Model | EX (%) |
|---|---|
| GPT-4o-mini | 79.57 |
| GPT-4-turbo | 81.24 |
| GPT-4o | 85.05 |
| Qwen3-Coder | 82.63 |
| Granite-3.1-8B-Instruct | 68.18 |
| Qwen2.5-Coder-7B-Instruct | 74.52 |
| Synth-Coder (ft) | 85.12 |
Fine-tuned models realize substantial gains, especially on structurally complex, low-frequency taxonomy bins (e.g., +30–40% absolute EX improvement for business_rule, json_function categories). Execution accuracy scales monotonically with dataset size (Wang et al., 17 Nov 2025).
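For readers unfamiliar with the metric, execution accuracy is commonly computed by running the predicted and reference queries against the same database and comparing their results. The sketch below assumes an order-insensitive comparison of result rows and treats any execution error as a miss; the exact matching rules used in the cited work are not stated here, and the function names are hypothetical.

```python
import sqlite3
from collections import Counter
from typing import Iterable, Tuple


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """True if both queries return the same multiset of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        return False  # a query that fails to run counts as incorrect
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


def execution_accuracy(conn: sqlite3.Connection, pairs: Iterable[Tuple[str, str]]) -> float:
    """Percentage of (predicted, gold) pairs whose results match."""
    pairs = list(pairs)
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return 100.0 * hits / len(pairs)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(1,), (2,), (3,)])

pairs = [
    ("SELECT x FROM t ORDER BY x DESC", "SELECT x FROM t"),  # same rows, other order
    ("SELECT x FROM t WHERE x > 1", "SELECT x FROM t"),      # missing a row
    ("SELECT y FROM t", "SELECT x FROM t"),                  # execution error
    ("SELECT COUNT(*) FROM t", "SELECT 3"),                  # same single value
]
print(execution_accuracy(conn, pairs))  # 50.0
```

Statements that modify data or structure (INSERT, UPDATE, DELETE, ALTER) would need a different check, such as comparing database state afterwards; that is outside this sketch.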
Ablation on generation paths indicates that omitting the NLQ→SQL path drops type-token ratio by 25% and semantic clusters by 27%, underscoring the critical role of dual-path diversity in dataset quality.
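Type-token ratio, one of the diversity measures mentioned above, is the number of distinct tokens divided by the total number of tokens. A minimal sketch, assuming lowercased whitespace tokenization (the tokenizer used in the cited work is not specified here):

```python
from typing import Iterable


def type_token_ratio(texts: Iterable[str]) -> float:
    """Distinct tokens divided by total tokens across all texts."""
    tokens = [token for text in texts for token in text.lower().split()]
    return len(set(tokens)) / len(tokens) if tokens else 0.0


questions = ["List all singers", "List all songs"]
# 6 tokens in total, 4 distinct: list, all, singers, songs
print(round(type_token_ratio(questions), 3))  # 0.667
```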
6. Broader Implications and Applications
The taxonomy-guided paradigm enables:
- Systematic diagnosis of model errors, dataset gaps, and coverage across CT, ST, SS, KA axes (“TaxoBench” as a coverage auditing tool).
- Guided construction of training corpora with programmable guarantees for logical/syntactic diversity.
- Direct adaptation to new domains or business scenarios by modularly swapping domain-specific seed data during synthesis.
- Improved real-world robustness: models trained on SQL-Synth generalize better to analytic workloads requiring multi-table joins, nested logic, and complex business calculations.
- Cost-effective dataset construction at massive scale via LLM-driven augmentation, a template applicable to other domains of semantic parsing.
A plausible implication is that taxonomy-guided synthesis and error correction represent generalizable frameworks for evolving semantic parsing toward real-world readiness, by facilitating transparent diagnosis, targeted correction, and full-spectrum logical coverage (Chaturvedi et al., 30 Aug 2025, Wang et al., 17 Nov 2025).