From Text to CQL: Bridging Natural Language and Corpus Search Engine (2402.13740v1)
Abstract: NLP technologies have revolutionized the way we interact with information systems, with a significant focus on converting natural language queries into formal query languages such as SQL. However, less emphasis has been placed on the Corpus Query Language (CQL), a critical tool for linguistic research and detailed analysis within text corpora. Constructing CQL queries manually is complex, time-consuming, and demands considerable expertise, which poses a notable challenge for researchers and practitioners alike. This paper introduces the first text-to-CQL task, which aims to automate the translation of natural language into CQL. We present a comprehensive framework for this task, including a specifically curated large-scale dataset and methodologies that leverage large language models (LLMs) for effective text-to-CQL conversion. In addition, we establish evaluation metrics to assess both the syntactic and semantic accuracy of the generated queries. We develop LLM-based conversion approaches and conduct detailed experiments. The results demonstrate the efficacy of our methods and provide insights into the complexities of the text-to-CQL task.
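To illustrate what the task involves, a natural language request such as "find an adjective immediately followed by the noun 'corpus'" could correspond to a CQL query along the lines of [tag="JJ"] [word="corpus"]. This example is illustrative only and is not drawn from the paper's dataset; attribute names such as tag, word, and lemma follow common Sketch Engine/CQP conventions and may differ across corpus search engines.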