Six-Gym: LLM SQL Debugging Framework

Updated 8 July 2025

Six-Gym is an open-source environment designed to train and benchmark large language models for debugging complex SQL queries.
It employs SQL-Rewind to automatically generate realistic SQL bug datasets by perturbing correct queries into plausible error cases.
The framework uses f-Plan Boosting and the Bird-Fixer agent to enrich debugging trajectories, increasing the number of successful training trajectories by over 70%.

Six-Gym (Sql-fIX-Gym) is an open-source training environment and agentic framework developed to advance the capabilities of LLMs in the field of SQL issue debugging. It provides methods for systematic and reproducible evaluation, training, and benchmarking of LLM-based SQL debugging agents through automated, realistic, and scalable pipelines. The design and methodology address critical challenges in the debugging of complex SQL queries in practical, multi-dialect scenarios, supporting not only post-hoc bug repair but also improving overall reliability and privacy in database-centric workflows (Li et al., 23 Jun 2025).

1. Concept and Objectives

Six-Gym (Sql-fIX-Gym) is positioned as an LLM training and evaluation environment tailored for SQL debugging, targeting both the complexity of real-world SQL issues and the need for open-source, privacy-preserving solutions. It synthesizes authentic user-reported issues and constructs executable issue-solution datasets via reverse engineering, enabling robust model fine-tuning, trajectory learning, and automatic agentic evaluation (Li et al., 23 Jun 2025). The overarching objectives include:

Scaling SQL debugging coverage across dialects (e.g., PostgreSQL, MySQL, SQLite).
Automating and enriching supervision signals for LLM training.
Providing a reproducible leaderboard and benchmarking platform.
Enabling local, open-source model development to preserve sensitive data on-premises.

2. Data Generation via SQL-Rewind

A central contribution of Six-Gym is the SQL-Rewind strategy, which allows automatic synthesis of realistic SQL bug datasets by reversing correct queries into plausible erroneous versions (Li et al., 23 Jun 2025). The process involves:

Starting from a verified and executable SQL solution and generating issue SQL queries by perturbing structural or logical elements.
Utilizing advanced LLMs (such as Gemini-2.0-Flash) to introduce errors and generate corresponding evaluation scripts.
Ensuring that every generated issue-SQL is paired with its canonical solution, enabling precise mapping for supervised learning.

This method avoids the prohibitive cost of manual annotation and maintains high fidelity to the kinds of bugs actually encountered in production environments, forming the backbone of the training and evaluation data in Six-Gym.
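The reversal step can be illustrated with a minimal sketch. It assumes a rule-based `perturb` callable standing in for the LLM that Six-Gym uses, and it compares result rows instead of running a generated evaluation script. The function names (`run`, `rewind`) and the toy `orders` table are hypothetical and do not come from the Six-Gym codebase.

```python
import sqlite3

def run(conn, sql):
    """Execute a query and return its rows, or an error string if it fails."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return f"error: {exc}"

def rewind(solution_sql, perturb, conn):
    """Turn a verified solution into an (issue, solution) pair.

    `perturb` is a hypothetical stand-in for the bug-injecting LLM.
    A pair is kept only if the solution executes and the issue behaves differently.
    """
    expected = run(conn, solution_sql)
    if isinstance(expected, str):          # the solution itself must be executable
        return None
    issue_sql = perturb(solution_sql)
    if run(conn, issue_sql) == expected:   # perturbation did not change behaviour
        return None
    return {"issue_sql": issue_sql, "solution_sql": solution_sql, "expected": expected}

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER, customer TEXT, amount REAL);
    INSERT INTO orders VALUES (1, 'ann', 10.0), (2, 'ann', 5.0), (3, 'bob', 7.0);
""")

solution = "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
# Toy perturbation: swap the aggregate function to create a logical bug.
pair = rewind(solution, lambda q: q.replace("SUM(", "MAX("), conn)
print(pair["issue_sql"])
```

Rejecting perturbations that leave the result unchanged is one simple way to make sure each generated issue is observably wrong on the given data; a real pipeline would likely need stronger checks.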

3. Trajectory Enrichment with f-Plan Boosting

To maximize supervisory signal for agentic LLM training, Six-Gym employs the f-Plan Boosting mechanism (Li et al., 23 Jun 2025). This approach goes beyond simple trajectory collection by augmenting debugging episodes as follows:

In the backward inference phase, a teacher LLM annotates each solution with a symbolic “functional plan” (f-plan) that outlines the abstract sequence of corrective operations needed to resolve the issue (e.g., “adjust aggregation,” “fix join predicate”).
In the forward validation phase, these f-plans are used to guide an agent scaffold (SQL-Act), ensuring that replayed or regenerated debugging sequences consistently produce the correct solution.
The feedback loop of backward annotation and forward validation increases the number and diversity of successful debugging trajectories by 73.7% compared to standard methods, supplying richer data for fine-tuning and evaluation.

This self-distillation pipeline supports not only outcome-level supervision (was the bug fixed?) but also stepwise, interpretable guidance at the plan and agent action levels.
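The two phases can be sketched as a small loop. This is an illustration under strong assumptions: the teacher LLM and the plan-guided agent are replaced by the hypothetical rule-based functions `backward_infer` and `apply_step`, and a trajectory is accepted when its final result matches the rows returned by the known solution.

```python
import sqlite3

def backward_infer(issue_sql, solution_sql):
    """Hypothetical teacher step: compare issue and solution, emit abstract plan steps."""
    plan = []
    if "MAX(" in issue_sql and "SUM(" in solution_sql:
        plan.append("adjust aggregation")
    return plan

def apply_step(sql, step):
    """Hypothetical agent action that realises one plan step."""
    if step == "adjust aggregation":
        return sql.replace("MAX(", "SUM(")
    return sql

def forward_validate(issue_sql, plan, conn, expected):
    """Replay the plan; keep the trajectory only if it reaches the expected result."""
    sql, trajectory = issue_sql, []
    for step in plan:
        sql = apply_step(sql, step)
        try:
            observation = conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            observation = f"error: {exc}"
        trajectory.append((step, sql, observation))
    solved = bool(trajectory) and trajectory[-1][2] == expected
    return trajectory if solved else None

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (customer TEXT, amount REAL);
    INSERT INTO orders VALUES ('ann', 10.0), ('ann', 5.0), ('bob', 7.0);
""")
solution_sql = "SELECT customer, SUM(amount) FROM orders GROUP BY customer ORDER BY customer"
issue_sql = solution_sql.replace("SUM(", "MAX(")
expected = conn.execute(solution_sql).fetchall()

plan = backward_infer(issue_sql, solution_sql)
trajectory = forward_validate(issue_sql, plan, conn, expected)
print(plan, "kept" if trajectory else "discarded")
```

Trajectories that fail forward validation return `None` and would be dropped from the training set, so only plan-consistent, verified episodes remain.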

4. Bird-Fixer: Agentic Debugging Framework

Six-Gym integrates its methodologies into Bird-Fixer, an open-source agent developed for structured SQL issue debugging (Li et al., 23 Jun 2025). Key components include:

The SQL-Act scaffold, which treats SQL statements as discrete “actions” within an iterative reasoning and repair process. Each agent step outputs a triple comprising internal reasoning, a candidate SQL action, and the resulting observation (typically query output or error message).
Generative Thought Mode (GTM), a mechanism that explicitly separates agent “thoughts” (reasoning and planning) from SQL code generation steps. This decoupling reduces overfitting to SQL syntax and leverages broader LLM code knowledge during training.
Reproducible fine-tuning pipelines, which allow open-source models like Qwen-2.5-Coder-14B to be trained and evaluated on Six-Gym’s synthetic and real-world debugging tasks, achieving state-of-the-art results among both proprietary and open-source systems.
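The thought, action, observation loop of SQL-Act can be sketched roughly as follows. The `policy` callable is a hypothetical stand-in for the fine-tuned LLM: it returns a thought and a SQL action separately, loosely mirroring the separation that GTM describes, and returns `None` as the action when it considers the issue resolved. None of these names are taken from the Bird-Fixer implementation.

```python
import sqlite3

def observe(conn, sql):
    """Run a SQL action and describe what the environment returned."""
    try:
        return f"rows: {conn.execute(sql).fetchall()}"
    except sqlite3.Error as exc:
        return f"error: {exc}"

def sql_act(policy, conn, issue_sql, max_steps=5):
    """Iterate thought -> SQL action -> observation until the policy stops."""
    trajectory = []
    last_observation = observe(conn, issue_sql)
    for _ in range(max_steps):
        thought, action = policy(issue_sql, last_observation, trajectory)
        if action is None:                 # policy signals that it is done
            break
        last_observation = observe(conn, action)
        trajectory.append((thought, action, last_observation))
    return trajectory

def toy_policy(issue_sql, last_observation, trajectory):
    """Hypothetical policy: repair a known typo if the last run failed."""
    if last_observation.startswith("error"):
        return ("The column name looks misspelled; try 'name'.",
                issue_sql.replace("nmae", "name"))
    return ("The query now runs, so stop.", None)

conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE users (id INTEGER, name TEXT); INSERT INTO users VALUES (1, 'ann');"
)
trajectory = sql_act(toy_policy, conn, "SELECT nmae FROM users")
for thought, action, observation in trajectory:
    print(thought, "|", action, "|", observation)
```

Capping the loop with `max_steps` keeps a misbehaving policy from running indefinitely; the actual step budget used by Bird-Fixer is not stated here.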

On the BIRD-CRITIC benchmark, the Bird-Fixer agent achieves a success rate of 38.11% on PostgreSQL issues and 29.65% on multi-dialect debugging scenarios, exceeding leading proprietary LLMs such as Claude-3.7-Sonnet and GPT-4.1, which supports the effectiveness of the pipeline.

5. Technical Formulations and Evaluation

Six-Gym formalizes the SQL issue debugging process as a mapping from a problem triplet, consisting of a natural language description ($\mathcal{P}$), a database schema ($\mathcal{S}$), and a buggy SQL query ($\sigma_\text{issue}$), to a corrected SQL query ($\sigma_\text{pred}$) via an LLM agent $f_\theta$:

$\sigma_\text{pred} = f_\theta(\mathcal{P}, \mathcal{S}, \sigma_\text{issue})$

Each debugging task $i$ includes custom test cases ($T_i$), used to programmatically verify whether the predicted SQL passes all solution criteria. Aggregate performance is measured by the overall success rate (SR):

$\text{SR} = \frac{1}{N} \sum_{i=1}^N \mathbf{I} \left[ T_i(\sigma_{\text{pred},i}) = \text{True} \right]$

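where $\mathbf{I}$ is the indicator function, $N$ is the number of debugging tasks, and $\sigma_{\text{pred},i}$ is the predicted query for task $i$.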
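As a worked illustration of the SR formula, the sketch below treats each test suite as a callable that returns `True` or `False` for a predicted query. The `make_test` helper, which compares result rows against an expected answer in SQLite, is an assumption made for the example; the benchmark's actual test cases may check more than row equality.

```python
import sqlite3
from typing import Callable, Sequence

def success_rate(predictions: Sequence[str],
                 tests: Sequence[Callable[[str], bool]]) -> float:
    """SR = (1/N) * sum over i of I[T_i(sigma_pred_i) = True]."""
    if len(predictions) != len(tests):
        raise ValueError("each prediction needs exactly one test suite")
    if not predictions:
        raise ValueError("N must be positive")
    passed = sum(1 for sql, test in zip(predictions, tests) if test(sql) is True)
    return passed / len(predictions)

conn = sqlite3.connect(":memory:")
conn.executescript("CREATE TABLE t (x INTEGER); INSERT INTO t VALUES (1), (2), (3);")

def make_test(expected):
    """Hypothetical test case: the query must run and return the expected rows."""
    def test(sql: str) -> bool:
        try:
            return conn.execute(sql).fetchall() == expected
        except sqlite3.Error:
            return False               # an execution error counts as a failure
    return test

predictions = [
    "SELECT SUM(x) FROM t",            # correct
    "SELECT MIN(x) FROM t",            # runs, but returns the wrong value
    "SELECT x FROM missing",           # fails to execute
]
tests = [make_test([(6,)]), make_test([(3,)]), make_test([(1,)])]
print(success_rate(predictions, tests))
```

Here one of the three predictions passes, so the success rate is 1/3.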

Agentic evaluation is enhanced by tracking explicit trajectories:

$\tau = ((t_1, \sigma_1, o_1),\, \ldots,\, (t_n, \sigma_n, o_n))$

with $t_j$ representing reasoning, $\sigma_j$ SQL actions, and $o_j$ environment observations. The f-Plan Boosting protocol supports the annotation and filtering of trajectories to improve the quality and interpretability of the fine-tuning signal.
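One possible in-memory representation of $\tau$ is sketched below. The `Step` dataclass and the filtering rule (keep a trajectory when its final SQL action passes the task's test) are assumptions for illustration; the exact filtering criteria used by f-Plan Boosting are not specified in this section.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass(frozen=True)
class Step:
    thought: str       # t_j: reasoning text
    action: str        # sigma_j: SQL statement issued
    observation: str   # o_j: query output or error message

Trajectory = List[Step]

def final_sql(trajectory: Trajectory) -> str:
    """The last SQL action is taken as the trajectory's proposed fix."""
    return trajectory[-1].action

def filter_trajectories(trajectories: List[Trajectory],
                        test: Callable[[str], bool]) -> List[Trajectory]:
    """Keep non-empty trajectories whose final action passes the test."""
    return [t for t in trajectories if t and test(final_sql(t))]

good = [
    Step("Inspect the failing query.", "SELECT MAX(x) FROM t", "rows: [(3,)]"),
    Step("A total is required, not a maximum.", "SELECT SUM(x) FROM t", "rows: [(6,)]"),
]
bad = [Step("Try counting instead.", "SELECT COUNT(x) FROM t", "rows: [(3,)]")]

# Hypothetical test: a stand-in check for the task's real test cases.
kept = filter_trajectories([good, bad, []], lambda sql: "SUM(" in sql)
print(len(kept))
```

Keeping the three fields separate makes it straightforward to supervise reasoning and SQL actions independently during fine-tuning.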

6. Implications for Open-Source and Practical Development

Six-Gym addresses several pressing needs in SQL debugging and database development:

It facilitates local, privacy-sensitive fine-tuning of LLMs for SQL workloads, circumventing cloud-based or proprietary toolchains.
Automated reverse engineering of debugging data supports faster coverage of rare and emergent SQL issue patterns.
The modular and adaptable training environment supports benchmarking across multiple SQL dialects and real-world user issue distributions, as exemplified by the BIRD-CRITIC-PG and BIRD-CRITIC-Multi benchmarks (Li et al., 23 Jun 2025).

A plausible implication is that as model, dataset, and trajectory-building methodologies mature within such environments, the community will be able to construct powerful, portable debugging agents deployable across enterprise, scientific, and operational data stacks.

7. Connections to Broader SQL Evaluation and Bug-Fixing Ecosystems

Six-Gym sits at the intersection of automated SQL evaluation, bug fixing, and large-scale system testing. Its approach complements recent advances in:

Adaptive SQL test generation frameworks for DBMS logic bug discovery, such as SQLancer++ (Zhong et al., 27 Mar 2025);
Data construction and fine-tuning methodology for repairing SQL via LLMs (see PDC and DM-SFT) (Duan et al., 11 Nov 2024);
Graph-based SQL semantics evaluation, including functional correctness metrics that go beyond surface-level syntactic matching (Zhan et al., 9 Jul 2024).

The pipeline in Six-Gym extends these traditions by focusing explicitly on reproducible, agentic debugging, with open benchmarks and agents accessible to the research and practitioner communities.

Six-Gym (Sql-fIX-Gym) represents a comprehensive environment for benchmarking, training, and deploying SQL debugging agents grounded in state-of-the-art data synthesis, plan-based supervision, and open-source accessibility. Its modular methodologies, evaluation protocols, and public leaderboards offer a foundation for ongoing research and application in robust SQL debugging using LLMs.