MultiSpider: Towards Benchmarking Multilingual Text-to-SQL Semantic Parsing (2212.13492v1)

Published 27 Dec 2022 in cs.CL

Abstract: Text-to-SQL semantic parsing is an important NLP task that greatly facilitates interaction between users and databases and has become a key component of many human-computer interaction systems. Much recent progress in text-to-SQL has been driven by large-scale datasets, but most of them are centered on English. In this work, we present MultiSpider, the largest multilingual text-to-SQL dataset, which covers seven languages (English, German, French, Spanish, Japanese, Chinese, and Vietnamese). Building on MultiSpider, we further identify the lexical and structural challenges of text-to-SQL (caused by language-specific properties and dialectal expressions) and their intensity across different languages. Experimental results under three typical settings (zero-shot, monolingual, and multilingual) reveal a 6.1% absolute drop in accuracy in non-English languages. Qualitative and quantitative analyses are conducted to understand the reasons for the performance drop in each language. Besides the dataset, we also propose a simple schema augmentation framework, SAVe (Schema-Augmentation-with-Verification), which significantly boosts the overall performance by about 1.8% and closes the 29.5% performance gap across languages.

Authors (7)
  1. Longxu Dou (28 papers)
  2. Yan Gao (157 papers)
  3. Mingyang Pan (3 papers)
  4. Dingzirui Wang (18 papers)
  5. Wanxiang Che (152 papers)
  6. Dechen Zhan (5 papers)
  7. Jian-Guang Lou (69 papers)
Citations (13)
