MultiSpider 2.0: Multilingual SQL Benchmark
- The paper introduces MultiSpider 2.0 as a multilingual benchmark that challenges LLMs with complex enterprise schemas and eight diverse languages.
- The paper demonstrates that standalone LLMs suffer a drastic performance drop to 4–6% execution accuracy across all eight languages, English included, highlighting current reasoning limitations.
- The paper presents the COLA framework, a collaborative agent system that incrementally improves SQL generation accuracy to 12–16% by decomposing the problem into modular tasks.
MultiSpider 2.0 is a multilingual extension of the Spider 2.0 Text-to-SQL benchmark, designed to assess the capabilities of LLMs and collaborative language agents in generating SQL queries from natural language (NL) prompts across eight typologically diverse languages. The benchmark preserves the structural difficulty and complex schema distributions of its predecessor while introducing linguistic and dialectal variability. MultiSpider 2.0 explicitly targets enterprise-scale, highly compositional SQL generation scenarios, exposing the limitations of current LLM reasoning, especially outside English and across disparate SQL dialects (Pham et al., 29 Sep 2025).
1. Dataset Construction and Linguistic Scope
MultiSpider 2.0 inherits enterprise-scale schemas (≥200 columns and nested structures) from Spider 2.0 and localizes them into the following eight languages: English (en), German (de), French (fr), Spanish (es), Portuguese (pt), Japanese (ja), Chinese (zh), and Vietnamese (vi). The dataset selection ensures coverage of both major European languages with divergent syntax/morphology and East Asian languages with non-Latin scripts, complex tokenization patterns, and code-switching phenomena.
The corpus comprises 5,056 NL–SQL pairs, uniformly split with 632 examples per language, constructed using 200 real enterprise databases originating from BigQuery public datasets, Snowflake Marketplace, and SQLite snapshots. The translation/localization pipeline incorporates:
- Initial translations and schema-value dictionaries, produced by professional translators for the non-English languages and by NLP researchers for English.
- Bilingual alignment and schema-link verification for accurate mapping between question phrases and database schema elements.
- Per-language database snapshots for authentic localization of schema elements while preserving canonical schema IDs.
- Four iterative rounds of NLP review (smoke testing, faithfulness checks, cross-lingual equivalence verification).
Dialectal variation is incorporated by targeting one of three SQL dialects (BigQuery, Snowflake, or SQLite) per instance, distributed as shown:
| SQL Dialect | Percentage (%) |
|---|---|
| BigQuery | 33.86 |
| Snowflake | 31.33 |
| SQLite | 34.81 |
The dataset also enforces compositional and structural diversity: 22.15% of examples involve multiple schemas, 18.51% include nested schemas, 8.54% use partition tables, and 75% invoke SQL functions. Query length is stratified into easy (<80 tokens, 25.32%), medium (80–160 tokens, 44.15%), and hard (>160 tokens, 30.54%).
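As a minimal sketch of the length stratification (assuming simple whitespace tokenization; the benchmark's exact tokenizer is not specified here):

```python
def length_bucket(sql: str) -> str:
    """Bucket a query by token count: easy (<80), medium (80-160), hard (>160).
    Whitespace splitting stands in for the benchmark's actual tokenizer."""
    n = len(sql.split())
    if n < 80:
        return "easy"
    return "medium" if n <= 160 else "hard"
```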
2. Schema Complexity and Difficulty Metrics
MultiSpider 2.0 maintains Spider 2.0's compositional hardness through equivalent schema complexity and query-structure distributions. The structural challenge of each SQL query is quantified by three parameters:
- $T$: number of distinct tables joined
- $d$: maximum depth of nested subqueries
- $c$: number of SQL clauses (e.g., GROUP BY, HAVING)

The composite difficulty score for a query $q$ is

$$\mathrm{diff}(q) = \alpha\, T(q) + \beta\, d(q) + \gamma\, c(q),$$

where $\alpha = \beta = \gamma = 1$ in diagnostic settings. By design, the $T$, $d$, and $c$ histograms match Spider 2.0, while the linguistic challenge is increased via non-English NL complexity.
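A minimal sketch of this score, assuming the parameters combine as a plain weighted sum (function and parameter names below are illustrative):

```python
def difficulty(num_tables: int, nesting_depth: int, num_clauses: int,
               alpha: float = 1.0, beta: float = 1.0, gamma: float = 1.0) -> float:
    """Composite difficulty as a weighted sum of join width, nesting depth,
    and clause count; alpha = beta = gamma = 1 in the diagnostic setting."""
    return alpha * num_tables + beta * nesting_depth + gamma * num_clauses

# A query joining 4 tables, nested 2 levels deep, with 5 clauses scores 4 + 2 + 5 = 11.
assert difficulty(4, 2, 5) == 11.0
```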
3. Evaluation Protocols and Primary Metrics
Benchmark evaluation employs standard Text-to-SQL metrics:
- Exact Matching (EM): Measures string-level canonical equivalence, $\mathrm{EM} = \mathbb{1}\big[c(\hat{S}) = c(S^{*})\big]$, where $\hat{S}$ and $S^{*}$ are the generated and reference SQL and $c(\cdot)$ denotes SQL canonicalization.
- Execution Accuracy (EX): Assesses semantic correctness by comparing query execution results.
- Pass@N: Reports the fraction of examples for which at least one of the top-$N$ candidates executes correctly.
A salient observation is that EX greatly exceeds EM (EX ≫ EM), indicating that models frequently generate semantically correct yet syntactically divergent SQL.
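A minimal sketch of the three metrics for the SQLite slice of the benchmark (BigQuery and Snowflake require their own clients; `canon` is a placeholder canonicalizer, not part of the benchmark's tooling):

```python
import sqlite3
from typing import Callable

def exact_match(pred: str, gold: str, canon: Callable[[str], str]) -> bool:
    """EM: string equality after canonicalization (keyword casing,
    whitespace, alias normalization, etc.)."""
    return canon(pred) == canon(gold)

def execution_accuracy(pred: str, gold: str, db_path: str) -> bool:
    """EX: order-insensitive comparison of predicted vs. reference result sets."""
    with sqlite3.connect(db_path) as conn:
        try:
            pred_rows = conn.execute(pred).fetchall()
        except sqlite3.Error:
            return False  # non-executable predictions score 0
        gold_rows = conn.execute(gold).fetchall()
    return sorted(map(repr, pred_rows)) == sorted(map(repr, gold_rows))

def pass_at_n(candidates: list[str], gold: str, db_path: str) -> bool:
    """Pass@N: true if any of the top-N candidates is execution-correct."""
    return any(execution_accuracy(c, gold, db_path) for c in candidates)
```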
4. Baseline Model Performance
On the prior MultiSpider 1.0 benchmark, state-of-the-art LLMs (e.g., OpenAI-o1, DeepSeek-R1-Qwen-70B) achieve approximately 80% execution accuracy. However, MultiSpider 2.0 exposes a severe performance collapse for all reasoning-first models, with execution accuracy dropping to 4–6% across languages.
| Model | MultiSpider 1.0 EX (en) | MultiSpider 2.0 EX (en) |
|---|---|---|
| OpenAI-o1-1217 | 79.7% | 4.4% |
| DeepSeek-R1-Qwen-70B | 80.0% | 5.8% |
This drop reveals that intrinsic LLM reasoning is inadequate for the increased multilingual, dialectal, and compositional complexity presented by MultiSpider 2.0.
5. COLA: Collaborative Language Agent Baseline
The Collaborative Language Agent (COLA) method addresses the challenge by decomposing Text-to-SQL parsing into modular agent roles that operate in an iterative loop around a backbone LLM:
- Classifier: Selects relevant sub-databases.
- Analyzer: Decomposes the NL question into subquestions.
- Backbone LLM: Generates SQL for each subquestion.
- Assembler: Composes the partial SQL fragments into a single query.
- Corrector: Refines the composed query for syntactic and schema correctness.
The formal procedure is:
```
Input:  NL question Q, schema D
Output: executable SQL Ŝ

1. dbs          ← Classifier(Q, D)
2. {(q_i, D_i)} ← Analyzer(Q, dbs)
3. for each i:
       s_i ← BackboneLLM.generate(q_i, D_i)
4. Ŝ  ← Assembler(s_1, …, s_k)
5. Ŝ′ ← Corrector(Ŝ, D)
return Ŝ′
```
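The loop translates directly into a plain Python pipeline; the agent internals (prompting, retries, iteration criteria) are stubs here, not the authors' implementation:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class COLA:
    """Each agent is modeled as a plain callable for illustration."""
    classifier: Callable[[str, dict], list]   # (Q, D) -> relevant sub-databases
    analyzer: Callable[[str, list], list]     # (Q, dbs) -> [(q_i, D_i), ...]
    backbone: Callable[[str, dict], str]      # (q_i, D_i) -> partial SQL s_i
    assembler: Callable[[list], str]          # [s_1, ..., s_k] -> composed SQL
    corrector: Callable[[str, dict], str]     # (SQL, D) -> refined, executable SQL

    def run(self, question: str, schema: dict) -> str:
        dbs = self.classifier(question, schema)
        parts = [self.backbone(q, d) for q, d in self.analyzer(question, dbs)]
        return self.corrector(self.assembler(parts), schema)
```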
Plugging existing backbone LLMs into COLA raises execution accuracy on MultiSpider 2.0 to the 12–16% range. For instance, COLA + OpenAI-o1 achieves 15.9% execution accuracy on English, with similar improvements observed across other target languages.
Ablation analysis demonstrates additive gains for each agent:
- Backbone alone: 5.6% EX
- + Classifier: +2.4 percentage points
- + Analyzer: +3.6 percentage points
- + Corrector: +3.8 percentage points (final: 15.4% EX)
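The per-agent gains reconcile exactly with the final score: $5.6 + 2.4 + 3.6 + 3.8 = 15.4$.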
6. Analytical Findings and Error Taxonomy
MultiSpider 2.0 exposes several critical failure modes in current LLMs and agentic solutions:
- Non-English languages incur an average 6 percentage point execution accuracy gap relative to English, driven by tokenization idiosyncrasies, code-switching, and pretraining bias.
- The predominant error type is incorrect schema linking (33%), followed by erroneous analysis (20%), misplanning (18%, e.g., join paths or aggregates), and SQL syntax errors (<10%).
- Typical failure cases include over-extended join paths, omitted intermediate tables, misconfigured aggregation/group-by hierarchies, and mismatches in date/number formats.
These findings indicate that linguistic and dialectal variability, rather than purely structural difficulty, constitutes the principal bottleneck for multilingual Text-to-SQL performance.
7. Implications and Future Research Directions
MultiSpider 2.0 is positioned as a more realistic and competitive benchmark for enterprise-oriented Text-to-SQL systems. The results motivate future research in:
- Schema-grounded planning using explicit join-path heuristics and constraint propagation.
- Dialect-aware normalization, including alias handling, transliteration, and regional synonym resolution.
- Execution-grounded learning paradigms such as verifier-guided search and reinforcement learning with budget-aware stopping criteria.
- Multilingual data augmentation, leveraging paraphrasing, dialectal variants, and contrastive example generation.
The persistent gap between intrinsic LLM performance and collaborative agent-driven accuracies, even with COLA, suggests the need for more robust and linguistically adaptive methods to achieve reliable, real-world deployment in multilingual enterprise settings (Pham et al., 29 Sep 2025).