SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications (2506.18951v1)

Published 23 Jun 2025 in cs.DB and cs.AI

Abstract: Resolution of complex SQL issues persists as a significant bottleneck in real-world database applications. Current LLMs, while adept at text-to-SQL translation, have not been rigorously evaluated on the more challenging task of debugging SQL issues. To address this gap, we introduce BIRD-CRITIC, a new SQL issue debugging benchmark comprising 530 PostgreSQL tasks (BIRD-CRITIC-PG) and 570 multi-dialect tasks (BIRD-CRITIC-Multi), distilled from authentic user issues and replayed within new environments to facilitate rigorous evaluation. Baseline evaluations underscore the task's complexity, with the leading reasoning model O3-Mini achieving only 38.87% success rate on BIRD-CRITIC-PG and 33.33% on BIRD-CRITIC-Multi. Meanwhile, advancing open-source models for database tasks is crucial for empowering local development while safeguarding data privacy. Therefore, we present Six-Gym (Sql-fIX-Gym), a training environment for elevating open-source model capabilities for SQL issue debugging. This environment leverages SQL-Rewind strategy, which automatically generates executable issue-solution datasets by reverse-engineering issues from verified SQLs. However, popular trajectory-based fine-tuning methods do not explore substantial supervisory signals. We further propose f-Plan Boosting, which extracts high-level debugging plans from SQL solutions, enabling teacher LLMs to produce 73.7% more successful trajectories for training. We integrate these components into an open-source agent, Bird-Fixer. Based on Qwen-2.5-Coder-14B, Bird-Fixer achieves 38.11% success rate on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, surpassing leading proprietary models such as Claude-3.7-Sonnet and GPT-4.1, marking a significant step toward democratizing sophisticated SQL-debugging capabilities. The leaderboard and source code are available: https://bird-critic.github.io/

SWE-SQL: Advancing LLM-Based SQL Issue Debugging in Real-World Applications

The paper "SWE-SQL: Illuminating LLM Pathways to Solve User SQL Issues in Real-World Applications" (Li et al., 23 Jun 2025) addresses a critical gap in the evaluation and development of LLMs for SQL debugging, moving beyond the well-studied text-to-SQL generation task. The authors introduce a comprehensive benchmark, BIRD-CRITIC, and a novel training environment, Six-Gym, to systematically assess and improve LLMs' ability to resolve authentic SQL issues encountered in production environments. The work culminates in Bird-Fixer, an open-source agent that demonstrates competitive performance with proprietary models, highlighting the feasibility of democratizing advanced SQL debugging capabilities.

Benchmarking SQL Debugging: BIRD-CRITIC

BIRD-CRITIC is constructed from real-world SQL issues, primarily sourced from StackOverflow, and is divided into two subsets: BIRD-CRITIC-PG (530 PostgreSQL tasks) and BIRD-CRITIC-Multi (570 tasks across PostgreSQL, MySQL, SQL Server, and Oracle). Each task comprises a user issue description, a faulty SQL query, and the relevant database schema. The benchmark is distinguished by:

  • Authenticity and Diversity: Tasks are distilled from genuine user reports, encompassing query-like, data management, and personalization issues.
  • Rigorous Evaluation: Each task is paired with custom evaluation scripts and test cases, enabling functional correctness assessment beyond simple execution or syntactic matching.
  • Dialect Coverage: Multi-dialect support ensures that models are evaluated on the heterogeneity present in real-world database systems.
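A task of this shape can be evaluated functionally with a small replay harness. The sketch below is illustrative only, assuming invented field names and using an in-memory SQLite database rather than the benchmark's actual multi-dialect environments and evaluation scripts:

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class DebugTask:
    """Illustrative stand-in for a BIRD-CRITIC-style task record."""
    issue: str       # user's issue description
    faulty_sql: str  # the query that misbehaves
    schema_ddl: str  # DDL + data to rebuild the relevant database

def passes_test(task: DebugTask, candidate_sql: str, expected_rows: list) -> bool:
    """Functional-correctness check: replay the candidate query in a
    fresh database and compare result rows, rather than matching SQL
    text or merely checking that the query executes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(task.schema_ddl)
    try:
        rows = conn.execute(candidate_sql).fetchall()
    except sqlite3.Error:
        return False  # a crashing candidate fails the test case
    finally:
        conn.close()
    return rows == expected_rows
```

For example, a fix that swaps a wrong aggregate for the right one passes, while the original faulty query does not.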

Baseline evaluations reveal the inherent difficulty of the task: the best-performing reasoning model (O3-Mini) achieves only 38.87% success on BIRD-CRITIC-PG and 33.33% on BIRD-CRITIC-Multi, underscoring the complexity of SQL debugging compared to text-to-SQL translation.

Automated Training Environment: Six-Gym and SQL-Rewind

To address the scarcity of high-quality training data for SQL debugging, the authors propose Six-Gym, an automated environment leveraging the SQL-Rewind strategy. This approach systematically generates issue-solution pairs by:

  1. Reverse Engineering: Starting from verified correct SQL queries, plausible errors are introduced to synthesize realistic debugging scenarios.
  2. Automated Validation: LLMs (e.g., Gemini-2.0-Flash) are used to generate and validate issue descriptions, faulty SQL, and evaluation scripts, ensuring coherence and correctness.
  3. Data Scale: The pipeline produces over 3,300 high-quality synthetic debugging instances, facilitating scalable training without manual annotation.

This environment enables the development of LLM agents that can learn from diverse, executable debugging trajectories, a prerequisite for robust SQL issue resolution.
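The reverse-engineering step can be sketched as follows. This is a toy approximation under assumptions of my own (a string-level corruption function and SQLite execution standing in for the paper's LLM-driven pipeline); its one faithful idea is that a synthesized issue is kept only if the corrupted query is genuinely divergent from the verified solution:

```python
import sqlite3

def rewind(correct_sql: str, schema_ddl: str, corruption):
    """Toy SQL-Rewind step: start from a verified query, inject a
    plausible error, and keep the pair only if the faulty query
    actually diverges (so the synthesized 'issue' is real)."""
    faulty_sql = corruption(correct_sql)
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_ddl)
    try:
        good = conn.execute(correct_sql).fetchall()
        try:
            bad = conn.execute(faulty_sql).fetchall()
        except sqlite3.Error:
            bad = None  # faulty query errors out: still a valid issue
    finally:
        conn.close()
    if bad == good:
        return None  # corruption was a no-op; discard the pair
    return {"faulty_sql": faulty_sql, "solution_sql": correct_sql}
```

Dropping a `DISTINCT` from a query over duplicated rows, for instance, yields a usable issue-solution pair, while the same corruption over duplicate-free data would be filtered out.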

Agentic Debugging: SQL-Act and f-Plan Boosting

The paper introduces SQL-Act, an agent scaffold inspired by ReAct, where the LLM iteratively emits (thought, SQL action, observation) tuples. Unlike tool-based agents with limited action spaces, SQL-Act allows arbitrary SQL commands, providing the flexibility required for complex debugging.
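The loop structure can be sketched minimally. In this sketch, `policy` is a stand-in for the LLM and the environment is an in-memory SQLite database; the real scaffold's prompting, stopping criteria, and dialect handling are not modeled:

```python
import sqlite3

def sql_act_loop(policy, schema_ddl: str, max_steps: int = 5):
    """Minimal ReAct-style loop in the spirit of SQL-Act: the policy
    emits a (thought, sql_action) pair, the environment executes the
    SQL and returns the result as the observation."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(schema_ddl)
    history = []  # list of (thought, action, observation) tuples
    try:
        for _ in range(max_steps):
            thought, action = policy(history)
            if action == "SUBMIT":  # agent decides it is done
                break
            try:
                obs = conn.execute(action).fetchall()
            except sqlite3.Error as exc:
                obs = f"error: {exc}"  # failures are observations too
            history.append((thought, action, obs))
    finally:
        conn.close()
    return history
```

Because the action is arbitrary SQL rather than a fixed tool call, the same loop covers inspection queries, schema probes, and candidate fixes.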

A key innovation is f-Plan Boosting, a two-phase self-distillation process:

  • Backward Inference: The teacher LLM infers a high-level, stepwise functional plan mapping the faulty SQL to the correct solution.
  • Forward Validation: Guided by this plan, the agent regenerates the solution, retaining only those trajectories that pass all test cases.

This method increases the number of successful training trajectories by 73.7%, significantly enriching the supervision signal for fine-tuning open-source models.
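The two phases can be expressed as a filter over issue-solution pairs. In the sketch below, `infer_plan` and `regenerate` are hypothetical stand-ins for teacher-LLM calls and `passes` for the task's test cases; only the control flow (backward plan inference, forward plan-guided regeneration, keep-if-passing) mirrors the method:

```python
def f_plan_boost(pairs, infer_plan, regenerate, passes):
    """Sketch of the two-phase f-Plan loop: infer a high-level plan
    from each (faulty, solution) pair, regenerate a fix under that
    plan, and retain only trajectories that pass the tests."""
    kept = []
    for faulty, solution in pairs:
        plan = infer_plan(faulty, solution)    # backward inference
        candidate = regenerate(faulty, plan)   # forward, plan-guided
        if passes(candidate, solution):        # retain successes only
            kept.append({"plan": plan, "trajectory": candidate})
    return kept
```

The retained (plan, trajectory) records then serve as fine-tuning data, which is how the plan supervision reaches the student model.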

Additionally, the Generative Thought Mode (GTM) decouples the generation of reasoning steps and SQL actions, mitigating overfitting and leveraging the base model's broad SQL knowledge for action generation.

Empirical Results and Analysis

Extensive experiments demonstrate:

  • Reasoning Models Outperform General-Purpose LLMs: Models with explicit reasoning capabilities achieve higher success rates, particularly on complex query-like issues.
  • Bird-Fixer Surpasses Proprietary Models: Fine-tuned on Qwen-2.5-Coder-14B, Bird-Fixer attains 38.11% SR on BIRD-CRITIC-PG and 29.65% on BIRD-CRITIC-Multi, outperforming Claude-3.7-Sonnet and GPT-4.1.
  • Cross-Dialect Generalization: Despite being trained only on PostgreSQL, Bird-Fixer generalizes effectively to other SQL dialects, attributed to the GTM strategy.
  • Ablation Studies: Both f-Plan Boosting and GTM are critical for performance; their removal leads to substantial drops in success rate.
  • Error Analysis: The most common failure modes are incorrect logic (44.5%), projection mismatches, chained errors, and syntax errors, highlighting the need for deeper semantic understanding and robust reasoning.

Practical Implications

The methodologies and resources introduced have several practical implications:

  • Benchmarking: BIRD-CRITIC provides a rigorous, contamination-free benchmark for evaluating LLMs on realistic SQL debugging tasks, facilitating progress tracking and model comparison.
  • Training Open-Source Models: Six-Gym and f-Plan Boosting enable the development of competitive open-source SQL debugging agents, reducing reliance on proprietary solutions and enhancing data privacy.
  • Deployment: Bird-Fixer, with its moderate parameter count (7–14B), can be deployed locally, making it suitable for privacy-sensitive environments and organizations with limited computational resources.
  • Agentic Workflows: The SQL-Act scaffold and GTM approach can be adapted to other code debugging and program repair domains, promoting generalization and robustness.

Theoretical and Future Directions

The work highlights the limitations of current LLMs in handling the full spectrum of SQL debugging, especially in the presence of complex, diverse, and ambiguous user issues. The strong negative correlation between query diversity and model performance suggests that further advances in reasoning, context understanding, and interaction are required.

Future research directions include:

  • Workflow Integration: Extending agentic frameworks to handle external workflows (e.g., file operations) and multi-turn user interactions.
  • Interactive Debugging: Incorporating conversational and dynamic agent-user interactions to better simulate real-world debugging processes.
  • Broader Program Repair: Adapting the f-Plan and agentic strategies to other domains, such as multi-language program repair and data science pipelines.

Conclusion

This paper establishes a new standard for evaluating and developing LLMs for SQL debugging, providing both a challenging benchmark and practical methodologies for training effective agents. The demonstrated success of Bird-Fixer underscores the potential for open-source, privacy-preserving, and robust SQL debugging assistants, with broad applicability in both research and industry. The framework and insights presented are likely to inform future developments in agentic code reasoning and automated program repair.

Authors (20)
  1. Jinyang Li (67 papers)
  2. Xiaolong Li (107 papers)
  3. Ge Qu (7 papers)
  4. Per Jacobsson (1 paper)
  5. Bowen Qin (16 papers)
  6. Binyuan Hui (57 papers)
  7. Shuzheng Si (20 papers)
  8. Nan Huo (20 papers)
  9. Xiaohan Xu (9 papers)
  10. Yue Zhang (620 papers)
  11. Ziwei Tang (3 papers)
  12. Yuanshuai Li (2 papers)
  13. Florensia Widjaja (1 paper)
  14. Xintong Zhu (1 paper)
  15. Feige Zhou (1 paper)
  16. Yongfeng Huang (110 papers)
  17. Yannis Papakonstantinou (9 papers)
  18. Fatma Ozcan (74 papers)
  19. Chenhao Ma (21 papers)
  20. Reynold Cheng (31 papers)