SynSQL-Think-916K: A Synthetic Text-to-SQL Dataset

Updated 3 July 2026

The paper presents SynSQL-Think-916K, a synthetic dataset that helps small language models (SLMs) improve Text-to-SQL generation through chain-of-thought reasoning.
It details a multi-stage data curation pipeline with strict filtering and annotation cleanup that targets SQL validity and schema diversity.
Empirical results on the BIRD development set show execution accuracy rising from 22.14% to 56.87% for a 0.5B model and from 52.26% to 67.08% for a 1.5B model.

SynSQL-Think-916K is a large-scale synthetic dataset constructed for the supervised fine-tuning of small language models (SLMs) on the Text-to-SQL generation task, with a particular focus on chain-of-thought (CoT) reasoning. Derived from the broader SynSQL-2.5M corpus, it applies stricter data curation and a uniform annotation scheme, and serves as the main training resource for improving the SQL generation of SLMs in the 0.5B to 1.5B parameter range. It was introduced and empirically validated within the SLM-SQL framework (Sheng et al., 30 Jul 2025).

1. Dataset Genesis and Construction Pipeline

SynSQL-Think-916K originates from the SynSQL-2.5M corpus, a synthetic resource produced by the OmniSQL system and characterized by CoT annotations. The derivation of the 916K subset adopts a multi-stage filtration and preprocessing strategy:

Initial Filtering: Examples without the SQL SELECT keyword are eliminated, as are those containing comment artifacts (--) and those in which the same SQL query is repeated redundantly within the CoT.

Annotation Cleanup: Following SQL cleanup, all CoT reasoning steps are wrapped in <think>...</think> tokens, and the SQL output is encapsulated in <answer>...</answer>. Post-SQL reflections or annotations are strictly removed.

Token Length Restriction: Samples exceeding a total token limit of 7,000 (covering prompt, CoT, and SQL) are pruned to maintain dataset tractability and prevent input truncation during model training.

Final Aggregate: After filtration and deduplication, the dataset contains 916,156 examples. No held-out split is reserved; the whole set is used for supervised fine-tuning (SFT), and downstream validation relies on external benchmarks (BIRD, Spider) (Sheng et al., 30 Jul 2025).
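The filtering and formatting steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the whitespace-based `count_tokens` stands in for a real tokenizer, the field names `prompt`, `cot`, and `sql` are hypothetical, the comment check is applied to the SQL string only, and "redundant SQL" is interpreted as the final query appearing more than once in the CoT.

```python
import re

MAX_TOKENS = 7000  # total budget for prompt + CoT + SQL, as stated in the text

TARGET_TEMPLATE = "<think>\n{cot}\n</think>\n<answer>\n{sql}\n</answer>"


def count_tokens(text):
    # Assumption: whitespace splitting as a crude stand-in for the model tokenizer.
    return len(text.split())


def keep_example(prompt, cot, sql, max_tokens=MAX_TOKENS):
    """Apply the initial filters and the token-length restriction."""
    if not re.search(r"\bSELECT\b", sql, flags=re.IGNORECASE):
        return False  # no SELECT keyword
    if "--" in sql:
        return False  # SQL comment artifact
    if cot.count(sql.strip()) > 1:
        return False  # assumed reading of "redundant SQL" in the CoT
    total = count_tokens(prompt) + count_tokens(cot) + count_tokens(sql)
    return total <= max_tokens


def format_target(cot, sql):
    """Wrap the reasoning and the SQL in the think/answer tags."""
    return TARGET_TEMPLATE.format(cot=cot.strip(), sql=sql.strip())


def curate(examples):
    """Filter, deduplicate on (prompt, SQL), and format the training targets."""
    seen = set()
    curated = []
    for ex in examples:
        if not keep_example(ex["prompt"], ex["cot"], ex["sql"]):
            continue
        key = (ex["prompt"], ex["sql"].strip())
        if key in seen:
            continue
        seen.add(key)
        curated.append({"prompt": ex["prompt"],
                        "target": format_target(ex["cot"], ex["sql"])})
    return curated
```

Deduplicating on the (prompt, SQL) pair is one possible choice; the source does not specify the deduplication key.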

| Dataset | Model | Training Method | Size |
|---|---|---|---|
| SynSQL-2.5M | SQL Gen Model | SFT | 2,190,988 |
| SynSQL-Think-916K | SQL Gen Model | SFT | 916,156 |
| BIRD Train | SQL Gen Model | RL (GRPO) | 9,428 |
| SynSQL-Merge-Think-310K | SQL Merge Model | SFT | 310,764 |
| BIRD Merge Train | SQL Merge Model | RL (GRPO) | 7,159 |

2. Data Characteristics and Schema Coverage

SynSQL-Think-916K enforces strict data regularity and logical diversity across its examples:

SQL Query Content: Every instance is a SELECT query. Secondary SQL constructs (e.g., JOIN, aggregation functions like SUM, COUNT) appear in ~35% of cases, reflecting a range from simple retrievals to moderate compositional complexity.

Schema Diversity: The dataset amalgamates schemas from SynSQL-2.5M, subsuming both Spider and BIRD references alongside additional synthetic schemas. Data types encompass numerics, strings, dates, and booleans.

Complexity Distribution: Approximately 60% of samples correspond to single-table queries; 40% address multi-table logic. Aggregations and ranking constructs, notably ORDER BY ... DESC LIMIT, occur in roughly 25% of cases.

Quality Control: The pipeline guarantees uniqueness and format validity for each SQL/CoT example through automated filtering; however, no further manual annotation alters the content. The quality of reasoning is directly inherited from upstream OmniSQL-generated CoT steps, which are known for their logical coherence (Sheng et al., 30 Jul 2025).

3. Model Training Regimen and Objective Functions

SLM training with SynSQL-Think-916K proceeds via multi-phase objectives and optimization strategies:

Supervised Fine-Tuning (SFT): The central goal is to minimize the cross-entropy loss across the concatenated CoT and SQL output:

$\mathcal{L}_\mathrm{SFT} = -\sum_{t=1}^T \log p_\theta(y_t \mid y_{1:t-1}, x)$
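As a small numerical illustration of this objective, the sketch below evaluates the summed negative log-likelihood from per-step logits. The function names and the list-based input layout are assumptions made for the example; a real training loop would use a tensor library.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary distribution."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum_exp for z in logits]


def sft_loss(step_logits, target_ids):
    """Sum over t of -log p(y_t | y_<t, x), one logit vector per target token."""
    return -sum(log_softmax(logits)[y]
                for logits, y in zip(step_logits, target_ids))


# Three target tokens over a 4-token vocabulary with uniform logits:
# each step contributes log(4) to the loss.
loss = sft_loss([[0.0, 0.0, 0.0, 0.0]] * 3, [2, 0, 1])
```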

SFT is performed with a learning rate of $2 \times 10^{-5}$, linear decay over 2 epochs, a warm-up for the initial 10% of steps, and an effective batch size of 1,024 tokens.
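The schedule described here can be sketched as a plain function of the step index. The linear shape of the warm-up and the decay to exactly zero at the final step are assumptions; the source states only the peak rate, the warm-up fraction, and that the decay is linear.

```python
def learning_rate(step, total_steps, peak_lr=2e-5, warmup_frac=0.10):
    """Linear warm-up to peak_lr, then linear decay (assumed to reach zero)."""
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


# With 1,000 optimizer steps, the peak is reached at the end of warm-up
# (step 99) and the rate is half the peak midway through the decay (step 550).
schedule = [learning_rate(s, 1000) for s in (0, 99, 550, 1000)]
```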

Reinforcement Learning (RL) Post-Training: Post-SFT, models are further refined using Group Relative Policy Optimization (GRPO), with training on the BIRD dataset. The reward structure comprises:

Execution Accuracy:

$R_\mathrm{EX} = \begin{cases} 1 & \text{if SQL executes correctly} \\ 0 & \text{otherwise} \end{cases}$

Format Reward:

$R_\mathrm{Format} = \begin{cases} 1 & \text{if output matches required format} \\ 0 & \text{otherwise} \end{cases}$

The total reward is $R = R_\mathrm{EX} + 0.1 R_\mathrm{Format}$, optimized via clipped importance ratios in the policy gradient objective:

$\max_\theta\,\mathbb{E}_{y\sim\pi_\theta}[R(y)]$
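A minimal sketch of this reward, using SQLite for execution, is given below. Several details are assumptions rather than facts from the source: "executes correctly" is read as "returns the same rows as a gold query", row order is ignored, an output that fails the format check receives no execution reward because its SQL cannot be extracted, and the group-normalized advantage shown is the commonly used GRPO formulation.

```python
import re
import sqlite3
import statistics

FORMAT_RE = re.compile(
    r"^\s*<think>(.*?)</think>\s*<answer>(.*?)</answer>\s*$", re.DOTALL
)


def run_query(conn, sql):
    """Return result rows in a canonical order, or None if execution fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except (sqlite3.Error, sqlite3.Warning):
        return None
    return sorted(rows, key=repr)


def total_reward(output, gold_sql, conn, format_weight=0.1):
    """R = R_EX + 0.1 * R_Format for one sampled output."""
    match = FORMAT_RE.match(output)
    r_format = 1.0 if match else 0.0
    r_ex = 0.0
    if match:
        predicted = run_query(conn, match.group(2).strip())
        if predicted is not None and predicted == run_query(conn, gold_sql):
            r_ex = 1.0
    return r_ex + format_weight * r_format


def group_relative_advantages(rewards):
    """Normalize rewards within one group of samples for the same question."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

When every sample in a group earns the same reward, the advantages are all zero, so that group contributes no gradient signal under this formulation.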

4. Inference Pipeline and Corrective Self-Consistency

SLM-SQL inference over SynSQL-Think-916K proceeds in two stages:

SQL Generation and Majority Selection: The model generates $N=64$ candidate CoT + SQL outputs and groups them by denotation, where a denotation group is the set of SQL outputs that produce the same execution result on the database. If one group forms a strict majority (more than $N/2$ candidates), a SQL candidate from that group is selected.

Merge Revision Model: Absent a majority, an auxiliary SQL Merge model is prompted with $M=8$ samples. The self-consistency correction cycle re-executes each and chooses the denotation group with maximal frequency.

Probabilistic Selection Formalism:

$\hat{y} = \arg\max_{y} \sum_{i} \mathbb{I}\{\mathrm{denotation}(y_i) = y\}$

This corrective self-consistency approach robustly resolves outlier generations and aligns final SQL predictions with denotational correctness (Sheng et al., 30 Jul 2025).
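The two-stage selection can be sketched as follows, again with SQLite as the execution engine. The `merge_fn` callable is a hypothetical stand-in for the SQL Merge model, and dropping candidates that fail to execute, as well as falling back to the original plurality group when no revision executes, are assumptions of this sketch.

```python
import sqlite3
from collections import defaultdict


def denotation(conn, sql):
    """Execution result as a hashable, order-insensitive value (None on error)."""
    try:
        rows = conn.execute(sql).fetchall()
    except (sqlite3.Error, sqlite3.Warning):
        return None
    return tuple(sorted(rows, key=repr))


def group_by_denotation(conn, candidates):
    """Map each distinct execution result to the SQL strings that produce it."""
    groups = defaultdict(list)
    for sql in candidates:
        result = denotation(conn, sql)
        if result is not None:
            groups[result].append(sql)
    return groups


def largest_group(groups):
    return max(groups.values(), key=len) if groups else []


def select_sql(conn, candidates, merge_fn):
    """Majority selection, falling back to merge-model revisions."""
    best = largest_group(group_by_denotation(conn, candidates))
    if len(best) > len(candidates) / 2:  # strict majority over all N samples
        return best[0]
    revised = merge_fn(candidates)  # hypothetical: returns M revised SQL strings
    best_revised = largest_group(group_by_denotation(conn, revised))
    if best_revised:
        return best_revised[0]
    return best[0] if best else None
```

In the setting described above, `candidates` would hold the 64 generated queries and `merge_fn` would return 8 revisions.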

5. Empirical Evaluation and Impact

Experimental results on the BIRD development set demonstrate the efficacy of SynSQL-Think-916K in SLM-SQL models:

Execution Accuracy (EX):

Qwen2.5-Coder-0.5B-Instruct: 56.87%

Qwen2.5-Coder-1.5B-Instruct: 67.08%

Ablation Results:

| Model | w/o SFT+RL | with SFT+RL+CSC |
|---|---|---|
| 0.5B | 22.14% | 56.87% |
| 1.5B | 52.26% | 67.08% |

SFT on SynSQL-Think-916K yields a +21.9 percentage point improvement for 0.5B models and +8.9 for 1.5B; RL post-training adds +5 and +4 points, respectively. Corrective self-consistency provides a further ≈5 point gain. Cost analysis at 64 samples per inference yields EX of 67.08% at a cost of ~$0.00046 per question.
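The end-to-end gains implied by the ablation table can be checked with a few lines of arithmetic. The per-stage figures quoted in the text are approximate, so this sketch uses only the two table columns. The per-sample cost assumes that the quoted per-question cost of about 0.00046 USD is spread evenly over the 64 samples, which the source does not state.

```python
ex_before = {"0.5B": 22.14, "1.5B": 52.26}  # EX (%) without SFT+RL
ex_after = {"0.5B": 56.87, "1.5B": 67.08}   # EX (%) with SFT+RL+CSC

# Net gain in percentage points for each model size.
net_gain = {m: round(ex_after[m] - ex_before[m], 2) for m in ex_before}

# Assumed even split of the per-question cost over the sampled candidates.
samples_per_question = 64
cost_per_question_usd = 0.00046
cost_per_sample_usd = cost_per_question_usd / samples_per_question

print(net_gain)
print(f"{cost_per_sample_usd:.2e} USD per sample")
```

The net gains come to 34.73 points for the 0.5B model and 14.82 points for the 1.5B model.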

Comparative Significance: These improvements demonstrate that high-quality synthetic CoT data enable SLMs to substantially close the gap, in both logical reasoning and execution accuracy, relative to far larger LLMs on BIRD and Spider (Sheng et al., 30 Jul 2025).

6. Role in the Broader Text-to-SQL and SynSQL Ecosystem

While SynSQL-Think-916K is leveraged for SLM generation and logical training, the broader SynSQL framework (Habibollah et al., 29 Apr 2026) introduces a distinct avenue: robust, schema-consistent database synthesis for stress-testing Text-to-SQL systems. The design of SynSQL-Think-916K, with diversity in SQL logic and schema coverage, provides a backbone for discriminative model development, complementing efforts to evaluate true semantic generalizability in Text-to-SQL tasks.

Notably, the curation principles in SynSQL-Think-916K—emphasizing unique valid SQL, coverage of compositional complexity, and inheritance of high-quality CoT—directly address the core challenge identified by SynSQL: ensuring that model performance reflects true semantic competence rather than overfitting to single-instance database artifacts.

7. Limitations and Directions for Future Research

Key limitations of SynSQL-Think-916K include its exclusive focus on SELECT queries, lack of coverage for data-modification statements such as INSERT or UPDATE, and limited schema diversity compared to enterprise-scale databases. Future directions include extending the corpus to cover broader SQL operations, curating harder, more diverse schemas, and integrating semi-supervised or active learning paradigms to minimize reliance on synthetic CoT.

In summary, SynSQL-Think-916K is pivotal in advancing the capabilities of compact LLMs in Text-to-SQL reasoning and generation. Its role as the foundation for SLM-SQL supervised fine-tuning enables SLMs to approximate, and in some metrics rival, much larger models under rigorous benchmark conditions (Sheng et al., 30 Jul 2025).


References

Sheng et al. SLM-SQL: An Exploration of Small Language Models for Text-to-SQL (2025)

Habibollah et al. SynSQL: Synthesizing Relational Databases for Robust Evaluation of Text-to-SQL Systems (2026)
