
ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems (2306.04743v2)

Published 7 Jun 2023 in cs.DB, cs.AI, and cs.CL

Abstract: Natural Language to SQL systems (NL-to-SQL) have recently shown a significant increase in accuracy for natural language to SQL query translation. This improvement is due to the emergence of transformer-based LLMs and the popularity of the Spider benchmark, the de facto standard for evaluating NL-to-SQL systems. The top NL-to-SQL systems reach accuracies of up to 85%. However, Spider mainly contains simple databases with few tables, columns, and entries, which does not reflect a realistic setting. Moreover, complex real-world databases with domain-specific content have little to no training data available in the form of NL/SQL pairs, leading to poor performance of existing NL-to-SQL systems. In this paper, we introduce ScienceBenchmark, a new complex NL-to-SQL benchmark for three real-world, highly domain-specific databases. For this new benchmark, SQL experts and domain experts created high-quality NL/SQL pairs for each domain. To garner more data, we extended the small amount of human-generated data with synthetic data generated using GPT-3. We show that our benchmark is highly challenging, as the top-performing systems on Spider achieve very low performance on our benchmark. Thus, the challenge is manifold: creating NL-to-SQL systems for highly complex domains with a small amount of hand-made training data augmented with synthetic data. To our knowledge, ScienceBenchmark is the first NL-to-SQL benchmark designed with complex real-world scientific databases, containing challenging training and test data carefully validated by domain experts.

Authors (6)
  1. Yi Zhang
  2. Jan Deriu
  3. George Katsogiannis-Meimarakis
  4. Catherine Kosten
  5. Georgia Koutrika
  6. Kurt Stockinger
