DBCopilot: Natural Language Querying over Massive Databases via Schema Routing (2312.03463v3)
Abstract: The development of Natural Language Interfaces to Databases (NLIDBs) has been greatly advanced by the advent of LLMs, which provide an intuitive way to translate natural language (NL) questions into Structured Query Language (SQL) queries. While significant progress has been made in LLM-based NL2SQL, existing approaches face several challenges in real-world scenarios of natural language querying over massive databases. In this paper, we present DBCopilot, a framework that addresses these challenges by employing a compact and flexible copilot model for routing over massive databases. Specifically, DBCopilot decouples schema-agnostic NL2SQL into schema routing and SQL generation. This framework utilizes a single lightweight differentiable search index to construct semantic mappings for massive database schemata, and navigates natural language questions to their target databases and tables in a relation-aware joint retrieval manner. The routed schemata and questions are then fed into LLMs for effective SQL generation. Furthermore, DBCopilot introduces a reverse schema-to-question generation paradigm that can automatically learn and adapt the router over massive databases without manual intervention. Experimental results verify that DBCopilot is a scalable and effective solution for schema-agnostic NL2SQL, providing a significant advance in handling natural language querying over massive databases for NLIDBs.
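The decoupling described in the abstract (schema routing first, SQL generation second) can be illustrated with a minimal sketch. This is not the paper's router: DBCopilot uses a differentiable search index, whereas the `route` function below substitutes plain word overlap so the example stays self-contained. The names `Table`, `tokenize`, `route`, `build_prompt`, and the toy catalog are all hypothetical, and the final LLM call is left out.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Table:
    """One table in a (hypothetical) multi-database catalog."""
    database: str
    name: str
    columns: tuple[str, ...]


def tokenize(text: str) -> set[str]:
    # Lowercase words; split snake_case identifiers on underscores.
    return {tok.strip("?,.").lower() for tok in text.replace("_", " ").split()}


def route(question: str, catalog: list[Table], top_k: int = 2) -> list[Table]:
    """Stage 1 (stand-in): rank tables by word overlap with the question.

    Assumption: lexical overlap replaces the learned router purely for
    illustration; it is not the method used in the paper.
    """
    q_tokens = tokenize(question)
    scored = []
    for table in catalog:
        vocab = tokenize(" ".join((table.database, table.name, *table.columns)))
        scored.append((len(q_tokens & vocab), table))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [table for score, table in scored[:top_k] if score > 0]


def build_prompt(question: str, tables: list[Table]) -> str:
    """Stage 2 input: only the routed schema plus the question go to the LLM."""
    schema_lines = [
        f"{t.database}.{t.name}({', '.join(t.columns)})" for t in tables
    ]
    return "Schema:\n" + "\n".join(schema_lines) + f"\nQuestion: {question}\nSQL:"


catalog = [
    Table("concerts", "singer", ("singer_id", "name", "age")),
    Table("concerts", "concert", ("concert_id", "singer_id", "year")),
    Table("retail", "orders", ("order_id", "customer_id", "total")),
]

question = "What is the name and age of each singer?"
routed = route(question, catalog)
prompt = build_prompt(question, routed)
print(prompt)
```

The point of the split is that the SQL generator only sees the routed tables, so the prompt stays small however large the catalog grows.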