
From Text to CQL: Bridging Natural Language and Corpus Search Engine (2402.13740v1)

Published 21 Feb 2024 in cs.CL

Abstract: NLP technologies have revolutionized the way we interact with information systems, with a significant focus on converting natural language queries into formal query languages such as SQL. However, less emphasis has been placed on the Corpus Query Language (CQL), a critical tool for linguistic research and detailed analysis within text corpora. Manually constructing CQL queries is a complex, time-intensive task that requires considerable expertise, which presents a notable challenge for researchers and practitioners alike. This paper introduces the text-to-CQL task, which aims to automate the translation of natural language into CQL. We present a comprehensive framework for this task, including a specifically curated large-scale dataset and methodologies that leverage LLMs for effective text-to-CQL conversion. In addition, we establish advanced evaluation metrics to assess the syntactic and semantic accuracy of the generated queries. We develop innovative LLM-based conversion approaches and conduct detailed experiments. The results demonstrate the efficacy of our methods and provide insights into the complexities of the text-to-CQL task.
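To make the task concrete, the sketch below shows the kind of CQP-style CQL output the paper targets: each word position is a bracketed expression of attribute–value constraints (e.g. `[pos="ADJ"]`), a syntax used by corpus tools such as CQPweb and Sketch Engine. The `to_cql` helper is hypothetical, introduced here only to illustrate how a structured intermediate representation maps onto CQL token syntax; it is not part of the paper's method.

```python
def to_cql(tokens):
    """Render a list of {attribute: value} constraint dicts as a
    CQP-style CQL query: one bracketed token expression per word
    position, with multiple constraints joined by '&'."""
    parts = []
    for constraints in tokens:
        attrs = " & ".join(f'{k}="{v}"' for k, v in constraints.items())
        parts.append(f"[{attrs}]")
    return " ".join(parts)

# Natural language: "an adjective immediately followed by any form
# of the lemma 'run'" becomes the two-token CQL query below.
query = to_cql([{"pos": "ADJ"}, {"lemma": "run"}])
# query == '[pos="ADJ"] [lemma="run"]'
```

A text-to-CQL system must produce such queries directly from free-form requests, handling richer constructs (repetition operators, within-structure restrictions) that make manual authoring error-prone.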
