ScienceBenchmark: A Complex Real-World Benchmark for Evaluating Natural Language to SQL Systems (2306.04743v2)
Abstract: Natural language to SQL (NL-to-SQL) systems have recently shown a significant increase in accuracy for natural language to SQL query translation. This improvement is due to the emergence of transformer-based LLMs and the popularity of the Spider benchmark, the de facto standard for evaluating NL-to-SQL systems. The top NL-to-SQL systems reach accuracies of up to 85%. However, Spider mainly contains simple databases with few tables, columns, and entries, which does not reflect a realistic setting. Moreover, complex real-world databases with domain-specific content have little to no training data available in the form of NL/SQL-pairs, leading to poor performance of existing NL-to-SQL systems. In this paper, we introduce ScienceBenchmark, a new complex NL-to-SQL benchmark for three real-world, highly domain-specific databases. For this new benchmark, SQL experts and domain experts created high-quality NL/SQL-pairs for each domain. To garner more data, we extended the small amount of human-generated data with synthetic data generated using GPT-3. We show that our benchmark is highly challenging, as the top-performing systems on Spider achieve very low performance on our benchmark. Thus, the challenge is manifold: creating NL-to-SQL systems for highly complex domains with a small amount of hand-made training data augmented with synthetic data. To our knowledge, ScienceBenchmark is the first NL-to-SQL benchmark designed with complex real-world scientific databases, containing challenging training and test data carefully validated by domain experts.
- Yi Zhang
- Jan Deriu
- George Katsogiannis-Meimarakis
- Catherine Kosten
- Georgia Koutrika
- Kurt Stockinger