DBCopilot: Natural Language Querying over Massive Databases via Schema Routing (2312.03463v3)

Published 6 Dec 2023 in cs.CL, cs.DB, and cs.IR

Abstract: The development of Natural Language Interfaces to Databases (NLIDBs) has been greatly advanced by the advent of LLMs, which provide an intuitive way to translate natural language (NL) questions into Structured Query Language (SQL) queries. While significant progress has been made in LLM-based NL2SQL, existing approaches face several challenges in real-world scenarios of natural language querying over massive databases. In this paper, we present DBCopilot, a framework that addresses these challenges by employing a compact and flexible copilot model for routing over massive databases. Specifically, DBCopilot decouples schema-agnostic NL2SQL into schema routing and SQL generation. This framework utilizes a single lightweight differentiable search index to construct semantic mappings for massive database schemata, and navigates natural language questions to their target databases and tables in a relation-aware joint retrieval manner. The routed schemata and questions are then fed into LLMs for effective SQL generation. Furthermore, DBCopilot introduces a reverse schema-to-question generation paradigm that can automatically learn and adapt the router over massive databases without manual intervention. Experimental results verify that DBCopilot is a scalable and effective solution for schema-agnostic NL2SQL, providing a significant advance in handling natural language querying over massive databases for NLIDBs.
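The decoupling described in the abstract (route a question to a database and tables first, then hand only the routed schema to an LLM) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the toy `CATALOG`, the function names `route_schema` and `build_prompt`, and above all the router itself, which uses plain token overlap as a stand-in for the paper's differentiable search index. The LLM call is left out; the sketch stops at building the prompt.

```python
import re

# Hypothetical toy catalog: database -> table -> columns (not from the paper).
CATALOG = {
    "concert_db": {
        "singer": ["name", "age", "country"],
        "concert": ["venue", "year", "singer_id"],
    },
    "school_db": {
        "student": ["name", "grade"],
        "course": ["title", "teacher"],
    },
}


def tokens(text):
    """Lowercase alphabetic tokens; 'singer_id' yields {'singer', 'id'}."""
    return set(re.findall(r"[a-z]+", text.lower()))


def route_schema(question, catalog, max_tables=2):
    """Stage 1, schema routing: pick one database and its best-matching tables.

    Token overlap is a simple stand-in here, not the learned router
    that the abstract describes.
    """
    q = tokens(question)
    best_db, best_tables, best_score = None, [], -1
    for db, tables in catalog.items():
        scored = []
        for table, cols in tables.items():
            vocab = tokens(table + " " + " ".join(cols))
            scored.append((len(q & vocab), table))
        # Highest overlap first; table name breaks ties deterministically.
        scored.sort(key=lambda s: (-s[0], s[1]))
        top = [(s, t) for s, t in scored[:max_tables] if s > 0]
        score = sum(s for s, _ in top)
        if score > best_score:
            best_db, best_tables, best_score = db, [t for _, t in top], score
    return best_db, best_tables


def build_prompt(question, db, tables, catalog):
    """Stage 2 input: only the routed schema plus the question go to the LLM."""
    lines = [f"Database: {db}"]
    for t in tables:
        lines.append(f"Table {t}({', '.join(catalog[db][t])})")
    lines.append(f"Question: {question}")
    lines.append("SQL:")
    return "\n".join(lines)


question = "What is the name and age of each singer?"
db, tables = route_schema(question, CATALOG)
prompt = build_prompt(question, db, tables, CATALOG)
print(prompt)
```

The point of the split is that the prompt carries only the routed tables rather than every schema in the catalog, so the SQL-generation step never sees unrelated databases.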
