
Ar-Spider: Text-to-SQL in Arabic (2402.15012v1)

Published 22 Feb 2024 in cs.CL and cs.AI

Abstract: In NLP, text-to-SQL semantic parsing is one of the most important tasks, as it enables users to interact with databases in a more natural manner. Text-to-SQL has made significant progress in recent years, but most work has been English-centric. In this paper, we introduce Ar-Spider, the first Arabic cross-domain text-to-SQL dataset. Due to the unique nature of the language, two major challenges arise: schema-linguistic challenges and SQL structural challenges. To handle these issues and conduct our experiments, we adopt two baseline models, LGESQL [4] and S2SQL [12], each tested with two cross-lingual models to alleviate the effects of the schema-linguistic and SQL structure linking challenges. The baselines achieve decent single-language performance on our Arabic text-to-SQL dataset, Ar-Spider: 62.48% for S2SQL and 65.57% for LGESQL, only 8.79% below the best results the same baselines achieve when trained on the English dataset. To further improve Arabic text-to-SQL, we propose the context similarity relationship (CSR) approach, which yields an overall performance gain of about 1.52% for S2SQL and 1.06% for LGESQL and narrows the gap between Arabic and English to 7.73%.
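
To make the task concrete: a text-to-SQL parser maps a natural-language question such as the Arabic "كم عدد المغنين لدينا؟" ("How many singers do we have?") to a query like SELECT count(*) FROM singer. The abstract names the context similarity relationship (CSR) approach but does not spell out its formulation, so the sketch below is only a minimal illustration of similarity-based question-schema linking of the kind LGESQL/S2SQL-style graph encoders consume; the encoder choice, threshold, and example schema are illustrative assumptions, not the paper's exact method.

    # Minimal sketch of similarity-based question-schema linking
    # (illustrative assumptions throughout; not the paper's exact CSR).
    # Requires: pip install sentence-transformers scikit-learn
    from sentence_transformers import SentenceTransformer
    from sklearn.metrics.pairwise import cosine_similarity

    # Multilingual encoder that covers Arabic; the model choice is an assumption.
    model = SentenceTransformer("distiluse-base-multilingual-cased-v2")

    question_tokens = ["كم", "عدد", "المغنين", "لدينا"]  # "How many singers do we have?"
    schema_items = ["singer", "singer.name", "singer.country", "concert.year"]

    q_emb = model.encode(question_tokens)  # (num_tokens, dim)
    s_emb = model.encode(schema_items)     # (num_items, dim)
    sim = cosine_similarity(q_emb, s_emb)  # token-vs-schema similarity matrix

    THRESHOLD = 0.5  # illustrative cut-off for adding a link
    for i, tok in enumerate(question_tokens):
        for j, item in enumerate(schema_items):
            if sim[i, j] >= THRESHOLD:
                # In an LGESQL/S2SQL-style encoder, such a pair would receive
                # an extra question-schema relation (edge) in the graph.
                print(f"link: {tok} <-> {item} (sim={sim[i, j]:.2f})")

Since Spider-style schemas keep English table and column names, a multilingual encoder of this kind lets Arabic question tokens link to English schema items, which is plausibly the cross-lingual gap a context-similarity signal would target.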

References (29)
  1. Mikel Artetxe and Holger Schwenk. 2019. Massively multilingual sentence embeddings for zero-shot cross-lingual transfer and beyond. In Transactions of the Association for Computational Linguistics, Vol. 7. 597–610.
  2. PAUQ: Text-to-SQL in Russian. In Findings of the Association for Computational Linguistics: EMNLP 2022. 2355–2376.
  3. Language Models are Few-Shot Learners. Advances in Neural Information Processing Systems 33 (2020).
  4. LGESQL: line graph enhanced text-to-SQL model with mixed local and non-local relations. arXiv preprint arXiv:2106.01093 (2021).
  5. ELECTRA: Pre-training text encoders as discriminators rather than generators. arXiv preprint arXiv:2003.10555 (2020).
  6. Unsupervised Cross-lingual Representation Learning at Scale. CoRR abs/1911.02116 (2019). arXiv:1911.02116 http://arxiv.org/abs/1911.02116
  7. Code generation using machine learning: A systematic review. IEEE Access (2022).
  8. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. CoRR abs/1810.04805 (2018). arXiv:1810.04805 http://arxiv.org/abs/1810.04805
  9. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
  10. MultiSpider: Towards benchmarking multilingual text-to-SQL semantic parsing. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 37. 12745–12753.
  11. AdaSL: an unsupervised domain adaptation framework for arabic multi-dialectal sequence labeling. Information Processing & Management 59, 4 (2022), 102964.
  12. S2SQL: Injecting Syntax to Question-Schema Interaction Graph Encoder for Text-to-SQL Parsers. arXiv preprint arXiv:2203.06958 (2022).
  13. Marcelo Archanjo José and Fabio Gagliardi Cozman. 2021. mRAT-SQL+ GAP: a Portuguese text-to-SQL transformer. In Intelligent Systems: 10th Brazilian Conference, BRACIS 2021, Virtual Event, November 29–December 3, 2021, Proceedings, Part II 10. Springer, 511–525.
  14. Self-training pre-trained language models for zero-and few-shot multi-dialectal Arabic sequence labeling. arXiv preprint arXiv:2101.04758 (2021).
  15. Guillaume Lample and Alexis Conneau. 2019. Cross-lingual Language Model Pretraining. In Advances in Neural Information Processing Systems (NeurIPS).
  16. An empirical study of pre-trained transformers for Arabic information extraction. arXiv preprint arXiv:2004.14519 (2020).
  17. A pilot study for Chinese SQL semantic parsing. arXiv preprint arXiv:1909.13293 (2019).
  18. A pilot study of text-to-SQL semantic parsing for Vietnamese. arXiv preprint arXiv:2010.01891 (2020).
  19. Dat Quoc Nguyen and Anh Tuan Nguyen. 2020. PhoBERT: Pre-trained language models for Vietnamese. arXiv preprint arXiv:2003.00744 (2020).
  20. GloVe: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP). 1532–1543.
  21. A survey on text-to-SQL parsing: Concepts, methods, and future directions. arXiv preprint arXiv:2208.13629 (2022).
  22. Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. 3982–3992.
  23. Multilingual translation with extensible multilingual pretraining and finetuning. arXiv preprint arXiv:2008.00401 (2020).
  24. Google’s neural machine translation system: Bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144 (2016).
  25. CoSQL: A conversational text-to-SQL challenge towards cross-domain natural language interfaces to databases. arXiv preprint arXiv:1909.05378 (2019).
  26. Spider: A large-scale human-labeled dataset for complex and cross-domain semantic parsing and text-to-SQL task. arXiv preprint arXiv:1809.08887 (2018).
  27. SParC: Cross-domain semantic parsing in context. arXiv preprint arXiv:1906.02285 (2019).
  28. Recent advances and challenges in task-oriented dialog systems. Science China Technological Sciences 63, 10 (2020), 2011–2027.
  29. Seq2SQL: Generating structured queries from natural language using reinforcement learning. arXiv preprint arXiv:1709.00103 (2017).
Authors (4)
  1. Saleh Almohaimeed (6 papers)
  2. Saad Almohaimeed (4 papers)
  3. Mansour Al Ghanim (3 papers)
  4. Liqiang Wang (51 papers)
Citations (1)