NL2KQL: From Natural Language to Kusto Query (2404.02933v4)

Published 3 Apr 2024 in cs.DB, cs.AI, and cs.CL

Abstract: Data is growing rapidly in volume and complexity, and proficiency in database query languages is pivotal for crafting effective queries. As coding assistants become more prevalent, there is a significant opportunity to make database query languages more accessible. The Kusto Query Language (KQL) is a widely used query language for large semi-structured data such as logs, telemetry, and time-series data in big data analytics platforms. This paper introduces NL2KQL, an innovative framework that uses LLMs to convert natural language queries (NLQs) to KQL queries. The proposed NL2KQL framework includes several key components: the Schema Refiner, which narrows down the schema to its most pertinent elements; the Few-shot Selector, which dynamically selects relevant examples from a few-shot dataset; and the Query Refiner, which repairs syntactic and semantic errors in generated KQL queries. Additionally, this study outlines a method for generating large datasets of synthetic NLQ-KQL pairs that are valid within a specific database context. To validate NL2KQL's performance, we utilize an array of online (based on query execution) and offline (based on query parsing) metrics. Through ablation studies, the significance of each framework component is examined, and the datasets used for benchmarking are made publicly available. This work is the first of its kind and is compared with available baselines to demonstrate its effectiveness.
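The pipeline the abstract describes — refine the schema to relevant tables, retrieve similar NLQ-KQL examples, then assemble a prompt for the LLM — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the table names, example pairs, and the simple word-overlap scoring used here are all hypothetical stand-ins (the actual framework uses LLM-based components and richer retrieval).

```python
# Hypothetical sketch of an NL2KQL-style pipeline: schema refinement,
# few-shot selection, and prompt assembly. Word-overlap scoring is a
# toy stand-in for the framework's real relevance models.
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    description: str


@dataclass
class Table:
    name: str
    columns: list


def _overlap(nlq: str, text: str) -> int:
    """Toy relevance score: count of shared whitespace-separated words."""
    return len(set(nlq.lower().split()) & set(text.lower().split()))


def refine_schema(nlq: str, tables: list, top_k: int = 2) -> list:
    """Schema Refiner: keep only the tables most relevant to the NLQ."""
    def table_text(t: Table) -> str:
        return t.name + " " + " ".join(f"{c.name} {c.description}" for c in t.columns)
    return sorted(tables, key=lambda t: _overlap(nlq, table_text(t)), reverse=True)[:top_k]


def select_few_shots(nlq: str, examples: list, top_k: int = 2) -> list:
    """Few-shot Selector: pick (NLQ, KQL) pairs whose NLQ resembles the input."""
    return sorted(examples, key=lambda ex: _overlap(nlq, ex[0]), reverse=True)[:top_k]


def build_prompt(nlq: str, tables: list, shots: list) -> str:
    """Assemble the refined schema and selected examples into an LLM prompt."""
    schema_txt = "\n".join(
        f"table {t.name}({', '.join(c.name for c in t.columns)})" for t in tables
    )
    shots_txt = "\n".join(f"NLQ: {q}\nKQL: {k}" for q, k in shots)
    return f"Schema:\n{schema_txt}\n\nExamples:\n{shots_txt}\n\nNLQ: {nlq}\nKQL:"


# Demo with invented tables and example pairs (not from the paper's datasets).
tables = [
    Table("SigninLogs", [Column("UserPrincipalName", "signing-in user"),
                         Column("ResultType", "sign-in result code")]),
    Table("Heartbeat", [Column("Computer", "machine name"),
                        Column("OSType", "operating system")]),
]
shots = [
    ("count failed sign-ins per user",
     "SigninLogs | where ResultType != 0 | summarize count() by UserPrincipalName"),
    ("list machines running Linux",
     "Heartbeat | where OSType == 'Linux' | distinct Computer"),
]
nlq = "show failed sign-ins for each user"
prompt = build_prompt(nlq,
                      refine_schema(nlq, tables, top_k=1),
                      select_few_shots(nlq, shots, top_k=1))
```

The prompt produced this way contains only the sign-in table and the most similar example pair, mirroring how the Schema Refiner and Few-shot Selector shrink the LLM's context to the pertinent schema elements before generation; a Query Refiner stage would then parse and repair the model's output.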
