NL2KQL: From Natural Language to Kusto Query (2404.02933v4)
Abstract: Data is growing rapidly in volume and complexity. Proficiency in database query languages is pivotal for crafting effective queries. As coding assistants become more prevalent, there is a significant opportunity to enhance database query languages. The Kusto Query Language (KQL) is a widely used query language for large semi-structured data such as logs, telemetry, and time series on big data analytics platforms. This paper introduces NL2KQL, an innovative framework that uses large language models (LLMs) to convert natural language queries (NLQs) to KQL queries. The proposed NL2KQL framework includes several key components: the Schema Refiner, which narrows down the schema to its most pertinent elements; the Few-shot Selector, which dynamically selects relevant examples from a few-shot dataset; and the Query Refiner, which repairs syntactic and semantic errors in KQL queries. Additionally, this study outlines a method for generating large datasets of synthetic NLQ-KQL pairs that are valid within a specific database context. To validate NL2KQL's performance, we use an array of online (based on query execution) and offline (based on query parsing) metrics. Through ablation studies, the significance of each framework component is examined, and the datasets used for benchmarking are made publicly available. This work is the first of its kind and is compared with available baselines to demonstrate its effectiveness.
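The component flow described in the abstract can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the abstract does not say how the Schema Refiner or the Few-shot Selector score relevance, so plain token-overlap (Jaccard) similarity stands in for whatever the framework actually uses. The function names, the `AppLogs` table, and the example pairs are hypothetical. The Query Refiner and the LLM call are left out.

```python
import re

def tokens(text):
    """Split on word boundaries and camelCase, then lowercase (returns a set)."""
    parts = re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+", text)
    return {p.lower() for p in parts}

def jaccard(a, b):
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def refine_schema(nlq, schema, top_k=1):
    """Stand-in for a schema refiner: keep the top_k tables whose
    table and column names overlap most with the question."""
    q = tokens(nlq)
    ranked = sorted(
        schema.items(),
        key=lambda kv: jaccard(q, tokens(kv[0] + " " + " ".join(kv[1]))),
        reverse=True,
    )
    return dict(ranked[:top_k])

def select_few_shots(nlq, examples, top_k=2):
    """Stand-in for a few-shot selector: keep the top_k (NLQ, KQL)
    pairs whose NLQ is most similar to the question."""
    q = tokens(nlq)
    return sorted(examples, key=lambda ex: jaccard(q, tokens(ex[0])), reverse=True)[:top_k]

def build_prompt(nlq, schema, examples):
    """Assemble a prompt from the refined schema and the selected examples."""
    lines = ["Schema:"]
    lines += [f"  {table}({', '.join(cols)})" for table, cols in schema.items()]
    for ex_nlq, ex_kql in examples:
        lines += [f"NLQ: {ex_nlq}", f"KQL: {ex_kql}"]
    lines += [f"NLQ: {nlq}", "KQL:"]
    return "\n".join(lines)

# Illustrative data only; AppLogs is a made-up table.
SCHEMA = {
    "StormEvents": ["StartTime", "State", "EventType", "DamageProperty"],
    "AppLogs": ["Timestamp", "Level", "Message"],
}
EXAMPLES = [
    ("count storm events by event type", "StormEvents | summarize count() by EventType"),
    ("show error messages from the last hour",
     "AppLogs | where Level == 'error' and Timestamp > ago(1h) | project Message"),
    ("total property damage per state", "StormEvents | summarize sum(DamageProperty) by State"),
]

question = "count storm events by state"
prompt = build_prompt(
    question,
    refine_schema(question, SCHEMA),
    select_few_shots(question, EXAMPLES),
)
print(prompt)
```

In this sketch the prompt ends with an open `KQL:` line for the model to complete. A refinement step in the spirit of the Query Refiner would then parse the generated query and repair any errors before execution.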