Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation (2403.02951v2)
Abstract: Large language models (LLMs) have emerged as a powerful tool for advancing the Text-to-SQL task, significantly outperforming traditional methods. Nevertheless, because this is a nascent research field, there is still no consensus on optimal prompt templates and design frameworks. Additionally, existing benchmarks inadequately explore the performance of LLMs across the various sub-tasks of the Text-to-SQL process, which hinders the assessment of LLMs' cognitive capabilities and the optimization of LLM-based solutions. To address these issues, we first construct a new dataset designed to mitigate the risk of overfitting in LLMs. We then formulate five evaluation tasks to comprehensively assess the performance of diverse methods across various LLMs throughout the Text-to-SQL process. Our study highlights the performance disparities among LLMs and proposes optimal in-context learning solutions tailored to each task. These findings offer valuable insights for the development of LLM-based Text-to-SQL systems.
- Bin Zhang
- Yuxiao Ye
- Guoqing Du
- Xiaoru Hu
- Zhishuai Li
- Sun Yang
- Chi Harold Liu
- Rui Zhao
- Ziyue Li
- Hangyu Mao