SynSQL-Merge-Think-310K Dataset
- SynSQL-Merge-Think-310K is a synthetic SQL merge revision dataset designed to improve logical accuracy in small language models for Text-to-SQL tasks.
- It employs an automated multi-stage pipeline with candidate generation, execution grouping, and filtering; manual inspection indicated that the two top-voted groups contain a correct SQL in at least 99% of cases.
- The dataset underpins a two-stage post-training process that significantly enhances execution accuracy in SLM-SQL systems using supervised fine-tuning and reinforcement learning.
SynSQL-Merge-Think-310K is a synthetic, large-scale dataset constructed for the task of SQL merge revision, a key component in boosting logical accuracy for small language models (SLMs) in Text-to-SQL applications. Comprising 310,764 curated merge-revision examples, the corpus forms the backbone of the second-stage post-training pipeline in SLM-SQL systems, where a model is tasked with selecting the correct SQL query from two candidates for a given natural-language question and database schema. The dataset is derived from the SynSQL-2.5M corpus and relies on execution-grounded filtering and annotation to maintain supervisory quality suitable for supervised fine-tuning (SFT) and reinforcement learning (RL) regimes (Sheng et al., 30 Jul 2025).
1. Purpose and Functional Role
SynSQL-Merge-Think-310K serves as a synthetic “SQL merge revision” training corpus, aimed at teaching models to identify the correct SQL query among two plausible drafts for the same question and schema. This structure aligns with a corrective self-consistency (CSC) pipeline: diverse SQL hypotheses are generated for a question, grouped by execution results, and a merge model then adjudicates between the two most frequent (i.e., most agreed-upon) candidate groups. The goal is to enable SLMs, specifically those in the 0.5B to 1.5B parameter range, to perform robust SQL correctness checking despite lacking the logical reasoning capabilities of larger LLMs. The dataset underpins a significant increase in downstream Text-to-SQL execution accuracy (EX) when integrated into the SLM-SQL post-training framework (Sheng et al., 30 Jul 2025).
2. Construction Methodology
The dataset construction is an automatic, multi-stage pipeline extending from SynSQL-2.5M, which contains over 2.5 million NL↔SQL pairs annotated with chain-of-thought (CoT) reasoning. The derivation proceeds as follows:
1. Preprocessing: Examples not containing “SELECT”, with repeated SQL fragments in their CoT, or with SQL comments (“--”) are filtered; chain-of-thought texts are truncated to a maximum of 7,000 tokens and SQL statements are enclosed in <answer>…</answer>, with reasoning in <think>…</think>.
2. Candidate Generation: For each preprocessed example (SynSQL-Think-916K), eight SQL candidates are generated using the Qwen2.5-Coder-7B-Instruct model.
3. Execution-Based Grouping: All eight SQLs are executed and grouped by identical execution results.
4. Selection of Merge Pairs: The two groups with the highest vote counts per question are selected, and each pair forms one merge-revision example containing the corresponding SQL drafts and their execution results.
5. Quality Filtering: Only executable SQL queries are retained, leveraging group-voting to exclude non-functional or inconsistent drafts. Manual inspection indicated that the top-2 voted groups contain a correct SQL in at least 99% of cases.

This process yields a dataset composed of both simple and complex schema instances, sampled from over 200 distinct databases spanning Spider and BIRD schema domains. No explicit manual schema stratification is applied (Sheng et al., 30 Jul 2025).

3. Dataset Composition and Format

Dataset Structure and Splits

The main dataset, SynSQL-Merge-Think-310K, comprises 310,764 examples for SFT. An additional, held-out set BIRD-Merge-Train (7,159 examples) supports RL-based post-training.

| Dataset Name                  | Examples |
|-------------------------------|----------|
| SynSQL-Merge-Think-310K (SFT) | 310,764  |
| BIRD-Merge-Train (RL)         | 7,159    |

Example Structure

Each record is formatted in JSON-style, encapsulating the following fields:

- schema: Serialized representation of a database schema.
- question: The natural language query.
- candidates: A list of two objects, each with ("sql": SQL string, "exec_result": list of resulting tuples).
- gold: The correct (target) SQL among the candidates.
- reasoning: The chain-of-thought, enclosed in <think>…</think>.

Representative Example:
{
  "schema": { ... },
  "question": "List all schools ...",
  "candidates": [
    { "sql": "SELECT ...", "exec_result": [[...], ...] },
    { "sql": "SELECT ...", "exec_result": [[...], ...] }
  ],
  "gold": "SELECT ...",
  "reasoning": "<think> ... </think>"
}
Prompting for the merge-revision task includes presenting both candidate SQLs and execution results. The model then analyzes, compares, and selects the correct draft (Sheng et al., 30 Jul 2025).
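As an illustration of how a record with the fields listed above might be rendered into a merge-revision prompt, the following is a minimal sketch. The template wording, the function name `build_merge_prompt`, and the toy record are all hypothetical; the source does not give the actual prompt text used in SLM-SQL.

```python
import json


def build_merge_prompt(record: dict) -> str:
    """Render one merge-revision record as a text prompt.

    The template is an assumption made for illustration; only the field
    names (schema, question, candidates, sql, exec_result) follow the
    record structure described in the text.
    """
    parts = [
        "Database schema:",
        json.dumps(record["schema"], indent=2),
        "",
        f"Question: {record['question']}",
        "",
    ]
    # Show each candidate SQL next to its execution result.
    for i, cand in enumerate(record["candidates"], start=1):
        parts.append(f"Candidate SQL {i}:")
        parts.append(cand["sql"])
        parts.append(f"Execution result {i}: {json.dumps(cand['exec_result'])}")
        parts.append("")
    parts.append(
        "Compare the candidates, reason inside <think>...</think>, "
        "and give the correct SQL inside <answer>...</answer>."
    )
    return "\n".join(parts)


# Toy record invented for this sketch (not taken from the dataset).
record = {
    "schema": {"schools": ["name", "city"]},
    "question": "List all schools in city X.",
    "candidates": [
        {"sql": "SELECT name FROM schools WHERE city = 'X'", "exec_result": [["A"]]},
        {"sql": "SELECT name FROM schools", "exec_result": [["A"], ["B"]]},
    ],
    "gold": "SELECT name FROM schools WHERE city = 'X'",
    "reasoning": "<think> ... </think>",
}

prompt = build_merge_prompt(record)
print(prompt)
```

Keeping the execution results beside each draft mirrors the description above: the model sees what each candidate actually returns, not only its SQL text.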
4. Evaluation Metrics and Training Regimes
Execution Accuracy (EX):
EX is the sole metric for measuring merge-revision performance; there is no distinct merge-selection metric. The merge model is optimized during SFT via cross-entropy loss over the candidate labels. In the RL stage, training employs Group Relative Policy Optimization (GRPO) with a reward that combines an execution-accuracy term with a smaller format term.
Training Recommendations:
- Supervised Fine-Tuning: 1–2 epochs over SynSQL-Merge-Think-310K with cross-entropy objective.
- Reinforcement Learning: Further tuning on BIRD-Merge-Train with GRPO; reward driven primarily by execution accuracy plus a modest format reward.
- Best practice is to ensure all candidate SQLs are executable to minimize annotation noise.
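The RL recommendation above can be made concrete with a small sketch of an execution-plus-format reward and the group-normalized advantage that GRPO computes over a group of sampled completions. The weights (1.0 and 0.1), the regular expression, the order-insensitive row comparison, and all function names are assumptions for illustration; the source does not state the exact reward values.

```python
import re
import statistics

# Assumed output format: reasoning in <think>...</think>, SQL in <answer>...</answer>.
FORMAT_RE = re.compile(
    r"^<think>.*?</think>\s*<answer>.*?</answer>\s*$", re.DOTALL
)


def format_reward(completion: str, weight: float = 0.1) -> float:
    """Small bonus when the completion follows the assumed tag format."""
    return weight if FORMAT_RE.match(completion.strip()) else 0.0


def execution_reward(pred_rows, gold_rows, weight: float = 1.0) -> float:
    """Main reward: predicted and gold result sets match (order-insensitive)."""
    same = set(map(tuple, pred_rows)) == set(map(tuple, gold_rows))
    return weight if same else 0.0


def group_relative_advantages(rewards, eps: float = 1e-8):
    """GRPO-style advantage: each reward normalized by its group's mean and std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four completions sampled for one prompt.
rewards = [1.1, 0.1, 0.0, 1.1]
advantages = group_relative_advantages(rewards)
print(advantages)
```

Because advantages are computed relative to the group, a prompt whose sampled completions all receive the same reward contributes no learning signal, which is one reason a format term alone is kept small relative to the execution term.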
Limitations include a strict two-candidate selection structure (not addressing multi-candidate merging or arbitrary SQL rewriting) (Sheng et al., 30 Jul 2025).
5. Integration into SLM-SQL Pipelines
Within SLM-SQL, the merge-revision corpus operationalizes corrective self-consistency. The workflow:
- The SQL Generation model, trained on SynSQL-Think-916K, samples multiple (e.g., 16) candidate SQLs for a question.
- Candidates are grouped by execution result; if a unanimous top-vote group arises, its representative SQL is output directly.
- If group agreement is ambiguous, the merge model—trained on SynSQL-Merge-Think-310K and BIRD-Merge-Train—selects the correct SQL between the two top-voted groups.
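The workflow above can be sketched with Python's standard `sqlite3` module. This is a minimal illustration, not the SLM-SQL implementation: the helper names, the toy table, and the stub `merge_model` callable are hypothetical, and "unanimous" is interpreted here as all executable candidates falling into a single execution group.

```python
import sqlite3
from collections import defaultdict


def result_key(conn, sql):
    """Execute sql; return a hashable, order-insensitive key, or None on failure."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # non-executable candidates are dropped
    return tuple(sorted(rows, key=repr))


def group_by_execution(conn, candidates):
    """Group candidate SQL strings by identical results, largest group first."""
    groups = defaultdict(list)
    for sql in candidates:
        key = result_key(conn, sql)
        if key is not None:
            groups[key].append(sql)
    return sorted(groups.values(), key=len, reverse=True)


def select_sql(conn, candidates, merge_model):
    """Return the agreed SQL, or let merge_model pick between the top two groups."""
    groups = group_by_execution(conn, candidates)
    if not groups:
        return None
    if len(groups) == 1:  # all executable candidates agree
        return groups[0][0]
    return merge_model(groups[0][0], groups[1][0])


# Toy database and candidates invented for this sketch.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schools (name TEXT, city TEXT)")
conn.executemany("INSERT INTO schools VALUES (?, ?)", [("A", "X"), ("B", "Y")])

candidates = [
    "SELECT name FROM schools WHERE city = 'X'",
    "SELECT name FROM schools WHERE city='X'",
    "SELECT name FROM schools",
    "SELECT nme FROM schools",  # fails: no such column
]

groups = group_by_execution(conn, candidates)
chosen = select_sql(conn, candidates, merge_model=lambda a, b: a)  # stub model
print(len(groups), chosen)
```

In a real system the stub would be replaced by a call to the trained merge model, given both drafts and their execution results.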
This decoupled, two-stage approach delivers substantial gains in execution accuracy on benchmark development sets (e.g., +31.4 average points across five SLMs; with the 0.5B SLM-SQL model yielding 56.87% EX and 1.5B model 67.08% EX on BIRD dev) (Sheng et al., 30 Jul 2025).
6. Significance and Impact
SynSQL-Merge-Think-310K represents an effective strategy for leveraging SLMs in Text-to-SQL systems by mitigating their logical reasoning deficits with execution-based filtering and supervised merge revision. The dataset's automated construction pipeline supports scalability and high annotation fidelity, and it inherits schema diversity from the breadth of Spider/BIRD database domains. The corrective self-consistency pipeline enabled by SynSQL-Merge-Think-310K allows SLMs to approach the logical accuracy levels previously only attainable by larger LLMs, making inference tractable for edge devices and resource-constrained settings (Sheng et al., 30 Jul 2025).
The corpus and supporting codebase are publicly released at https://github.com/CycloneBoy/slm_sql, facilitating broader adoption and reproducibility for research in execution-grounded Text-to-SQL modeling.