XiYan-SQL: Multi-Generator NL2SQL Framework
- The paper introduces XiYan-SQL, an end-to-end Text-to-SQL framework that uses a multi-generator ensemble to enhance SQL generation.
- It employs schema filtering, diverse supervised fine-tuned and in-context SQL generators, and a self-refinement module to optimize query construction.
- Empirical results demonstrate state-of-the-art performance on benchmarks like BIRD and Spider, showcasing improved execution accuracy and robustness.
XiYan-SQL is an end-to-end framework for Text-to-SQL (NL2SQL) translation that establishes a new paradigm via a multi-generator ensemble methodology. It pioneers the integrated use of schema filtering, multiple supervised-fine-tuned and in-context SQL generators, self-refinement modules, and a learned selection model. XiYan-SQL achieves state-of-the-art execution accuracy across competitive NL2SQL benchmarks, including 75.63% on BIRD and 89.65% on Spider, surpassing prior methods and demonstrating strong generalization (Gao et al., 2024, Liu et al., 7 Jul 2025).
1. System Architecture and Key Components
XiYan-SQL decomposes the NL2SQL process into distinct, specialized components, each responsible for addressing tractable subproblems in robust SQL semantic parsing.
- Schema Filter: Filters the input database schema to derive multiple relevant schema subsets for each user query, reducing noise and computational overhead. Filtering uses LLM-based keyword extraction followed by embedding similarity to match columns/tables and iterative selection to balance precision and recall, resulting in a set of filtered schemas.
- Multi-Generator Ensemble: For each filtered schema, an ensemble of SQL generators (Supervised Fine-Tuned and ICL-based) produces a diverse pool of candidate queries. Each generator is fine-tuned or prompted on different auxiliary tasks and SQL formats to encourage diversity as well as precision.
- Self-Refiner Module: Executes each candidate SQL. In case of failure, the generator is re-prompted (or the candidate is re-generated) based on the exception feedback, correcting logical or syntactic errors such as missing joins or mismatched types.
- Selection Model with Candidate Reorganization: SQL candidates are grouped by identical execution outputs. A candidate reorganization strategy (cluster by majority, then order by generator reliability and brevity) feeds into a lightweight fine-tuned LLM that ranks and selects the optimal candidate for final execution.
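The embedding-matching step of the Schema Filter can be illustrated with a small sketch. It assumes the keywords have already been extracted by an LLM, and it uses character-trigram cosine similarity purely as a stand-in for a real embedding model; `trigram_vector`, `cosine`, and `filter_schema` are hypothetical names, not part of XiYan-SQL.

```python
import math
from collections import Counter


def trigram_vector(text):
    # Toy "embedding": counts of character trigrams (stand-in for a real encoder).
    padded = f"  {text.lower()} "
    return Counter(padded[i:i + 3] for i in range(len(padded) - 2))


def cosine(a, b):
    dot = sum(a[g] * b[g] for g in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def filter_schema(keywords, schema, top_k=2):
    """Keep the top_k columns most similar to any keyword.

    schema maps table name -> list of column names; returns a sub-schema
    in the same shape.
    """
    scored = []
    for table, columns in schema.items():
        for column in columns:
            target = trigram_vector(f"{table} {column}")
            best = max(cosine(trigram_vector(kw), target) for kw in keywords)
            scored.append((best, table, column))
    scored.sort(key=lambda item: item[0], reverse=True)
    kept = {}
    for _, table, column in scored[:top_k]:
        kept.setdefault(table, []).append(column)
    return kept
```

Calling `filter_schema` with several `top_k` values is one simple way to obtain multiple schema subsets that trade precision against recall.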
The pipeline components are summarized in the table below.
| Component | Function | Techniques |
|---|---|---|
| Schema Filter | Relevant sub-schema extraction | LLM keyword extraction, embedding similarity |
| SQL Generators | Diverse candidate SQL generation | Multi-task SFT, ICL prompting, format variation |
| Refiner | Error correction via execution feedback | Self-refinement using exception traces |
| Selector | Final candidate ranking | Fine-tuned LLM, candidate reorganization |
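How the four components compose can be sketched as a single function. This is a minimal outline under stated assumptions: every callable passed in (`schema_filter`, `generators`, `refiner`, `execute`, `selector`) is a hypothetical stand-in for the corresponding stage, and `execute` is assumed to return a hashable result.

```python
from collections import defaultdict


def run_pipeline(question, full_schema, schema_filter, generators, refiner, execute, selector):
    # 1. Schema Filter: derive several filtered schemas for the question.
    candidates = []
    for schema in schema_filter(question, full_schema):
        # 2. SQL Generators: every generator proposes a candidate per schema.
        for generate in generators:
            sql = generate(question, schema)
            # 3. Refiner: repair the candidate if needed (stubbed by the caller).
            candidates.append(refiner(question, schema, sql))
    # 4. Selector input: group candidates by identical execution output,
    #    largest group first.
    groups = defaultdict(list)
    for sql in candidates:
        groups[execute(sql)].append(sql)
    ordered = sorted(groups.values(), key=len, reverse=True)
    return selector(question, ordered)
```

The selector receives grouped candidates rather than a flat list, mirroring the grouping by execution output described above.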
2. M-Schema: Semi-Structured Schema Representation
To enhance model awareness of intricate database structures, XiYan-SQL introduces the M-Schema representation.
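- Definition: For a database $D$ with tables $T = \{t_1, \dots, t_n\}$, columns $C_i$ for each table $t_i$, and foreign key relations $F$, each column is a 5-tuple $c = (\text{name}, \text{type}, \text{description}, \text{primary-key flag}, \text{example values})$. M-Schema represents $D$ as a flat sequence enumerating tables, their columns, and FK relations:
- Formalization: $\text{M-Schema}(D) = \big[(t_1, C_1), \dots, (t_n, C_n);\ F\big]$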
- Motivation: M-Schema enables both SFT and ICL-based models to access fine-grained schema context and relationships, empirically improving execution accuracy by up to +2.03% absolute over DDL or MAC-SQL schema representations (Gao et al., 2024).
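A flat serialization in the spirit of M-Schema might look like the sketch below. The five column fields (name, type, description, primary-key flag, example values) and the exact textual layout are assumptions made for illustration; `Column`, `Table`, and `to_m_schema` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class Column:
    name: str
    dtype: str
    description: str = ""
    primary_key: bool = False
    examples: list = field(default_factory=list)


@dataclass
class Table:
    name: str
    columns: list


def to_m_schema(db_id, tables, foreign_keys):
    """Serialize tables, columns, and FK pairs into one flat text block."""
    lines = [f"[DB_ID] {db_id}", "[Schema]"]
    for table in tables:
        lines.append(f"# Table: {table.name}")
        lines.append("[")
        for col in table.columns:
            parts = [f"{col.name}:{col.dtype.upper()}"]
            if col.description:
                parts.append(col.description)
            if col.primary_key:
                parts.append("Primary Key")
            if col.examples:
                parts.append(f"Examples: {col.examples}")
            lines.append("(" + ", ".join(parts) + "),")
        lines.append("]")
    lines.append("[Foreign keys]")
    lines.extend(f"{src}={dst}" for src, dst in foreign_keys)
    return "\n".join(lines)
```

The same string can be placed in an SFT training prompt or an ICL prompt, which is what lets both generator families share one schema view.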
3. Multi-Generator Candidate Construction
XiYan-SQL leverages an ensemble of SQL generators to optimize both quality and diversity in candidate space.
3.1 Supervised Fine-Tuned (SFT) Generators
Each SFT model undergoes a two-stage training pipeline:
- Basic-Syntax Stage: Induces SQL syntax fluency using large dialect-agnostic corpora.
- Generation-Enhance Stage: Multi-task training on:
  - Question → SQL,
  - SQL → Question reconstruction,
  - Evidence selection,
  - SQL discrimination/regeneration with execution feedback.
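The objective for each task $k$ ($k = 1, \dots, K$) is the next-token negative log-likelihood over its training set $\mathcal{D}_k$:
$$\mathcal{L}_k(\theta) = -\sum_{(x, y) \in \mathcal{D}_k} \sum_{t=1}^{|y|} \log p_\theta(y_t \mid x, y_{<t})$$
Overall model loss is $\mathcal{L}(\theta) = \sum_{k=1}^{K} \mathcal{L}_k(\theta)$. Multi-format enhancement is achieved by training on diversified SQL rewrites (chunked, standardized, mixed) to create models with distinct output styles.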
3.2 In-Context Learning (ICL) Generator
The ICL-based generator (e.g., GPT-4o) selects exemplars using Named-Entity Masked Skeleton Similarity:
- Named entities in the query are masked, embeddings are computed, and K-nearest exemplars are selected via cosine similarity.
- If schema linking yields more than one table, only exemplars with multi-table joins are considered.
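The exemplar selection steps above can be sketched as follows. The sketch makes several assumptions: named entities are supplied by the caller (quoted strings and numbers are also masked), a bag-of-words vector stands in for a real embedding model, and each exemplar is a dict with hypothetical keys `question`, `entities`, and `n_tables`.

```python
import math
import re
from collections import Counter


def mask_entities(question, entities=()):
    # Replace known entities, quoted literals, and numbers with a placeholder
    # so that only the question "skeleton" remains.
    masked = question
    for ent in sorted(entities, key=len, reverse=True):
        masked = re.sub(re.escape(ent), "<mask>", masked, flags=re.IGNORECASE)
    masked = re.sub(r"'[^']*'|\"[^\"]*\"", "<mask>", masked)
    masked = re.sub(r"\b\d+(?:\.\d+)?\b", "<mask>", masked)
    return masked.lower()


def embed(text):
    # Toy embedding: token counts (stand-in for a sentence encoder).
    return Counter(re.findall(r"<mask>|\w+", text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def select_exemplars(question, pool, k=2, entities=(), linked_tables=1):
    # Multi-table questions only draw from multi-table exemplars.
    if linked_tables > 1:
        pool = [ex for ex in pool if ex["n_tables"] > 1]
    query_vec = embed(mask_entities(question, entities))

    def similarity(ex):
        skeleton = mask_entities(ex["question"], ex.get("entities", ()))
        return cosine(query_vec, embed(skeleton))

    return sorted(pool, key=similarity, reverse=True)[:k]
```

Masking before embedding means "How many players are from Brazil?" and "How many singers are from France?" match on structure rather than on the specific entity.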
3.3 Refiner Module
For each candidate SQL, if initial execution yields an error, the refiner LLM is prompted with the schema context, the failing query, and the execution error message to repair the SQL. This self-refinement loop typically corrects execution-critical syntactic/logic errors.
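A minimal version of this execute-and-repair loop, using SQLite as the execution engine, might look like the sketch below. The `repair` callable is a hypothetical stand-in for the refiner LLM, and the retry limit `max_rounds` is an assumption rather than a documented setting.

```python
import sqlite3


def refine(conn, sql, repair, max_rounds=2):
    """Execute sql; on failure, ask repair(sql, error_message) for a new query.

    Returns (final_sql, rows), with rows set to None if every attempt failed.
    """
    for attempt in range(max_rounds + 1):
        try:
            return sql, conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            if attempt == max_rounds:
                return sql, None
            # Feed the exception text back to the (stubbed) refiner.
            sql = repair(sql, str(exc))
```

Candidates that still fail after the last round come back with `rows` set to `None`, so a downstream selector can discard or deprioritize them.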
4. Candidate Reorganization and Selection
Majority voting is suboptimal when correct solutions are "minority" in the candidate set. XiYan-SQL's selection process includes:
- Candidate Reorganization: Candidates are clustered by execution result; clusters are sorted by size (consensus) and generator reliability, with within-cluster ordering favoring brevity or reliability. If there is no consensus majority, the shortest candidate from each cluster is prioritized.
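- Learned Selection Model: A lightweight LLM is fine-tuned to select among the reorganized candidate list $\hat{S} = [s_1, \dots, s_n]$ given the question $q$:
$$i^* = \arg\max_{i \in \{1, \dots, n\}} p_\phi(i \mid q, \hat{S})$$
The model predicts the index $i^*$, yielding the final SQL $s_{i^*}$. The training objective is the standard cross-entropy on gold target selection.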
Ablations confirm this module contributes 3–4% EX over naive majority voting (Liu et al., 7 Jul 2025).
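One possible reading of the reorganization step is sketched below. It assumes that "consensus majority" means a cluster holding more than half of all candidates, that reliability is a per-generator score supplied by the caller, and that execution results are hashable; `reorganize` is a hypothetical name and the learned selector itself is not modeled.

```python
from collections import defaultdict


def reorganize(candidates, reliability):
    """candidates: list of (generator_name, sql, execution_result) triples."""
    if not candidates:
        return []
    clusters = defaultdict(list)
    for generator, sql, result in candidates:
        clusters[result].append((generator, sql))

    def score(generator):
        return reliability.get(generator, 0.0)

    # Larger clusters first; ties broken by the most reliable member.
    groups = sorted(
        clusters.values(),
        key=lambda g: (-len(g), -max(score(gen) for gen, _ in g)),
    )
    has_majority = len(groups[0]) > len(candidates) / 2
    if has_majority:
        ordered = []
        for group in groups:
            # Within a cluster: reliable generators first, then shorter SQL.
            ranked = sorted(group, key=lambda m: (-score(m[0]), len(m[1])))
            ordered.extend(sql for _, sql in ranked)
        return ordered
    # No consensus: keep only the shortest candidate of each cluster.
    return [min(group, key=lambda m: len(m[1]))[1] for group in groups]
```

The returned list is what would be shown, in order, to the fine-tuned selection model, so the ordering itself acts as a prior over candidates.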
5. Empirical Performance and Benchmarking
Extensive benchmarking on BIRD, Spider, SQL-Eval, and NL2GQL demonstrates the efficacy and generalizability of XiYan-SQL.
| Benchmark | XiYan-SQL EX (%) | Prior SOTA | Improvement |
|---|---|---|---|
| BIRD | 75.63 | 74.79 | +0.84 (vs. CHASE-SQL+Gemini) |
| Spider | 89.65 | 89.60 | +0.05 (vs. MCS-SQL+GPT-4) |
| SQL-Eval | 69.86 | 67–68 | +1–2 points |
| NL2GQL | 41.20 | <18 | >23 points |
- XiYan-SQL further improves over closed-source LLMs by substantial margins in generalization/robustness tests on PostgreSQL and MySQL datasets (Liu et al., 7 Jul 2025).
- Ablations demonstrate each module's necessity: removing multi-generator diversity, schema filter, or selection model drops EX by up to 4.04%, 1.24%, and 3.13%, respectively. The oracle EX of the multi-generator pool (the accuracy attainable if a correct candidate were always selected whenever one exists) is significantly above the achieved EX, indicating selection remains a bottleneck.
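The gap between achieved and oracle EX mentioned above can be made concrete with a short calculation. The correctness flags below are invented toy data, not results from the paper, and both function names are hypothetical.

```python
def execution_accuracy(selected_ok):
    # Share of questions whose selected SQL executes to the gold result.
    return 100.0 * sum(selected_ok) / len(selected_ok)


def oracle_accuracy(candidates_ok):
    # Share of questions where at least one candidate is correct,
    # i.e. the ceiling for any selection strategy over this pool.
    return 100.0 * sum(any(flags) for flags in candidates_ok) / len(candidates_ok)


# Toy data: one row per question, one correctness flag per candidate.
candidates_ok = [
    [True, False],
    [False, True],
    [False, False],
    [True, True],
]
# A naive selector that always picks the first candidate.
selected_ok = [flags[0] for flags in candidates_ok]
```

On this toy data the selector that always takes the first candidate reaches 50% EX while the oracle reaches 75%; the difference is the headroom available to a better selection model.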
6. Error Analysis and Observed Strengths
- Schema Filter reduces "no-table-found" errors by up to 30%.
- Diverse Generators enable handling complex SQL constructs such as window functions and multi-way joins, capturing distributional tails.
- Selection Model mitigates errors where high-confidence single model outputs are semantically incorrect (e.g., improper GROUP BY or filter conditions).
- Refiner repairs minor but execution-critical SQL errors that would otherwise cause output elimination.
Collectively, these factors yield state-of-the-art and robust performance across a wide spectrum of database schema and query complexities.
7. Open Problems and Prospective Directions
XiYan-SQL demonstrates strong generalization and efficiency, yet limitations remain:
- Scaling filter/generation stages to more schema subsets and generator models may further increase oracle upper bounds.
- Incorporating richer execution feedback, such as query plan costs, could enhance the refiner's correction capabilities.
- Integrating the multi-stage pipeline into a unified, multi-task, multi-format curriculum-trained "all-in-one" model is an avenue for future research (Gao et al., 2024, Liu et al., 7 Jul 2025).
XiYan-SQL exemplifies a modular, interpretable, and extensible architecture for complex semantic parsing, establishing a new reference framework for industrial NL2SQL deployments and further academic exploration.