OmniSQL: Synthesizing High-quality Text-to-SQL Data at Scale (2503.02240v1)

Published 4 Mar 2025 in cs.CL and cs.DB

Abstract: Text-to-SQL, the task of translating natural language questions into SQL queries, plays a crucial role in enabling non-experts to interact with databases. While recent advancements in LLMs have significantly enhanced text-to-SQL performance, existing approaches face notable limitations in real-world text-to-SQL applications. Prompting-based methods often depend on closed-source LLMs, which are expensive, raise privacy concerns, and lack customization. Fine-tuning-based methods, on the other hand, suffer from poor generalizability due to the limited coverage of publicly available training data. To overcome these challenges, we propose a novel and scalable text-to-SQL data synthesis framework for automatically synthesizing large-scale, high-quality, and diverse datasets without extensive human intervention. Using this framework, we introduce SynSQL-2.5M, the first million-scale text-to-SQL dataset, containing 2.5 million samples spanning over 16,000 synthetic databases. Each sample includes a database, SQL query, natural language question, and chain-of-thought (CoT) solution. Leveraging SynSQL-2.5M, we develop OmniSQL, a powerful open-source text-to-SQL model available in three sizes: 7B, 14B, and 32B. Extensive evaluations across nine datasets demonstrate that OmniSQL achieves state-of-the-art performance, matching or surpassing leading closed-source and open-source LLMs, including GPT-4o and DeepSeek-V3, despite its smaller size. We release all code, datasets, and models to support further research.
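The abstract states that each SynSQL-2.5M sample bundles a database, a SQL query, a natural language question, and a chain-of-thought solution. A minimal sketch of that record shape, with an executability sanity check of the kind a synthesis pipeline could apply (field names and the check itself are illustrative assumptions, not the dataset's actual schema or the paper's method):

```python
import sqlite3
from dataclasses import dataclass

# Hypothetical record shape for one SynSQL-2.5M sample (field names are
# assumptions; the abstract only says each sample contains a database,
# SQL query, natural language question, and CoT solution).
@dataclass
class SynSQLSample:
    schema_ddl: str   # DDL of the synthetic database
    question: str     # natural language question
    sql: str          # gold SQL query
    cot: str          # chain-of-thought solution text

sample = SynSQLSample(
    schema_ddl="CREATE TABLE employees (id INTEGER, name TEXT, salary REAL);",
    question="Which employee earns the most?",
    sql="SELECT name FROM employees ORDER BY salary DESC LIMIT 1;",
    cot="Sort employees by salary descending and take the top row.",
)

# Cheap validity filter: the gold SQL must execute against its schema.
conn = sqlite3.connect(":memory:")
conn.executescript(sample.schema_ddl)
conn.executemany("INSERT INTO employees VALUES (?, ?, ?)",
                 [(1, "Ada", 120.0), (2, "Bob", 95.0)])
rows = conn.execute(sample.sql).fetchall()
print(rows)  # [('Ada',)]
```

Executing each synthesized query against its own database is a natural low-cost filter for large-scale generation, since any sample whose SQL fails to run can be discarded automatically.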

Authors (12)
  1. Haoyang Li
  2. Shang Wu
  3. Xiaokang Zhang
  4. Xinmei Huang
  5. Jing Zhang
  6. Fuxin Jiang
  7. Shuai Wang
  8. Tieying Zhang
  9. Jianjun Chen
  10. Rui Shi
  11. Hong Chen
  12. Cuiping Li