
Spider 2.0: Enterprise Text-to-SQL Benchmark

Updated 19 November 2025
  • Spider 2.0 is a benchmark framework that assesses language models and agentic systems using real-world, enterprise-scale text-to-SQL tasks with complex, multi-schema databases.
  • It evaluates systems with metrics such as execution accuracy and success rate, spans diverse SQL dialects, and demands iterative refinement to mimic production database challenges.
  • The framework incorporates agentic interfaces and workflow realism to drive robust, multi-step query generation and debugging in dynamic, large-scale enterprise settings.

Spider 2.0 is a rigorous benchmark framework designed to evaluate the capabilities of large language models (LLMs) and agentic systems on real-world, enterprise-scale text-to-SQL and SQL-centric workflow tasks. It represents a substantive progression from prior datasets by introducing workflow realism, massive schema complexity, multi-dialect SQL, and agentic interfaces that more accurately reflect the demands of production database environments (Lei et al., 12 Nov 2024).

1. Dataset Characteristics and Motivations

Spider 2.0 addresses several critical limitations of earlier datasets—such as Spider 1.0, WikiSQL, and BIRD—which predominantly comprised toy schemas, single-dialect SQL, and one-shot question-SQL pairs. The main design goals are:

  1. Enterprise Databases: Inclusion of terabyte-scale, multi-schema databases with nested and partitioned tables, averaging 700–800 columns per schema and up to 3,000 columns in extreme cases.
  2. Dialect Diversity: Coverage of BigQuery, Snowflake, SQLite, DuckDB, Postgres, and ClickHouse, incorporating dialect-specific constructs such as JSON extraction, LATERAL FLATTEN, and varied date-arithmetic expressions (illustrated in the sketch after this list).
  3. Workflow Realism: Tasks reflect authentic project-level operations, requiring interaction with codebases (e.g., DBT models), auxiliary documentation (function references, Google Analytics groupings), and iterative debugging.
  4. Agentic Interfaces: Benchmarks go beyond static generation by requiring agents to interactively reason, act, and observe within real shell/DB environments, executing multi-step plans and adaptively refining outputs (Lei et al., 12 Nov 2024).
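To make the dialect fragmentation concrete, the snippet below collects semantically equivalent JSON queries in several of the covered dialects. This is illustrative only: the table name (events) and JSON column (payload) are hypothetical stand-ins, not drawn from the benchmark itself.

```python
# Equivalent "pull a scalar out of a JSON column" queries across dialects
# covered by Spider 2.0. Table/column names are illustrative.
JSON_SCALAR_BY_DIALECT = {
    "bigquery":  "SELECT JSON_EXTRACT_SCALAR(payload, '$.user.id') FROM events",
    "snowflake": "SELECT payload:user.id::STRING FROM events",
    "postgres":  "SELECT payload -> 'user' ->> 'id' FROM events",
    "duckdb":    "SELECT json_extract_string(payload, '$.user.id') FROM events",
    "sqlite":    "SELECT json_extract(payload, '$.user.id') FROM events",
}

# Flattening a JSON array diverges even more sharply between dialects:
ARRAY_FLATTEN_BY_DIALECT = {
    "bigquery":  ("SELECT item FROM events, "
                  "UNNEST(JSON_EXTRACT_ARRAY(payload, '$.items')) AS item"),
    "snowflake": ("SELECT f.value FROM events, "
                  "LATERAL FLATTEN(input => payload:items) f"),
}
```

A model that emits the Postgres arrow syntax against a Snowflake warehouse fails at execution time, so dialect awareness must be learned or retrieved rather than assumed.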

Table: Comparison of Dataset Statistics (Test Splits)

Dataset               #Examples  #Databases  Avg #Cols/DB  Dialects
Spider 1.0            2147       40          27.1          Standard SQL
Spider 2.0-lite       547        158         803.6         BigQuery, Snowflake, SQLite
Spider 2.0-snow       547        152         812.1         Snowflake
Spider 2.0 (agentic)  632        213         743.5         BigQuery, Snowflake, ...

Key task types span simple lookups, multi-join analytics, nested/grouped transforms, and advanced reasoning predicated on documentation and external metadata (Deng et al., 2 Feb 2025).

2. Evaluation Metrics and Scoring Protocols

Spider 2.0 introduces context-sensitive evaluation metrics:

  • Execution Accuracy (EX): An output SQL is considered correct if, when executed on the correct database, it returns a multiset of rows identical to the reference. The scoring tolerates superfluous columns in the SELECT clause provided all requested columns are present and the core answer matches; the official script, used for both Spider 2.0-lite and Spider 2.0-snow, runs directly against the cloud environments (Deng et al., 2 Feb 2025, Lei et al., 12 Nov 2024). A simplified scoring sketch follows this list.
  • Exact Match (EM): Measures the rate of exact, canonicalized SQL string matches (less central, since semantically equivalent but textually different SQL queries often produce identical results).
  • Success Rate (SR): For the agentic variant, an agent succeeds if it reaches a correct final output (table, CSV, or narrative) through an iterative multi-step workflow.
  • Pass@N: The fraction of tasks for which at least one of the top N candidates yields the correct answer.
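A minimal sketch of the EX comparison described above, assuming both the predicted and reference queries have already been executed and their rows fetched. The column-projection search is a simplification of the official evaluation script, not a reimplementation of it:

```python
from collections import Counter
from itertools import permutations

def execution_accuracy(pred_rows, gold_rows):
    """Multiset comparison of executed results (simplified EX sketch).

    The official Spider 2.0 script tolerates extra predicted columns
    when all gold columns are present; here that is approximated by
    searching for a column projection of the prediction that reproduces
    the gold multiset. Exponential in column count, so a sketch only.
    """
    if not gold_rows:
        return not pred_rows
    if not pred_rows:
        return False
    n_gold, n_pred = len(gold_rows[0]), len(pred_rows[0])
    gold_multiset = Counter(tuple(r) for r in gold_rows)
    # Try every ordered choice of gold-width columns from the prediction.
    for cols in permutations(range(n_pred), n_gold):
        projected = Counter(tuple(row[c] for c in cols) for row in pred_rows)
        if projected == gold_multiset:
            return True
    return False
```

Pass@N then reduces to `any(execution_accuracy(rows, gold_rows) for rows in candidate_results[:N])`.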

This protocol exposes a large performance gap under workflow realism: while models achieve >90% EX on Spider 1.0, rates on Spider 2.0 frequently remain below 30%, and often fall in the 5–25% range, across agent and generation frameworks (Deng et al., 2 Feb 2025, Lei et al., 12 Nov 2024).

3. Core Architectural Challenges

Spider 2.0 poses unique challenges derived from its data and task settings:

  • Long Context Windows: Schemas routinely exceed LLM input limits. The necessity to process supplementary documentation and sampled cell values further compounds this (Deng et al., 2 Feb 2025).
  • Schema Linking at Scale: With 700+ columns per database, naïve linking approaches fail frequently (schema-linking mistakes account for 27.6% of SQL errors), as correct column/table resolution is intrinsically difficult.
  • Dialect Fragmentation: Significant variation between dialects, notably across BigQuery and Snowflake, with differing syntax for JSON manipulation, windowing, and aggregation, substantially degrades model performance.
  • Reasoning over Workflows: Many tasks require not a single SQL query but coordinated multi-step execution, as seen in DBT pipeline completion, analytics pipeline debugging, and data wrangling (Lei et al., 12 Nov 2024).
  • Grounding in Code and Documentation: External metadata (e.g., function definitions, analytics channel groupings) is frequently essential for accurate query formation, increasing the need for robust retrieval-augmented reasoning.

4. Algorithmic Advances: The ReFoRCE Agent

ReFoRCE advances state-of-the-art Text-to-SQL performance on Spider 2.0 through a modular pipeline (Deng et al., 2 Feb 2025):

  1. Database Information Compression: Pattern-based table grouping reduces redundant schema representations, further processed via LLM-guided schema linking:

$$\mathrm{compress}(D) = \bigcup_{k=1}^{M} \left( \mathrm{DDL}(t_k^{*}) \cup \{\, \mathrm{name}(t) \mid t \in C_k \setminus \{t_k^{*}\} \,\} \right),$$

where clusters $C_k$ group tables by prefix/suffix patterns (e.g., date-stamped tables) and only one representative $t_k^{*}$ per cluster retains its full DDL in the context. A minimal sketch of this step appears after the list below.

  2. Iterative Column Exploration: The agent issues exploratory queries to dynamically probe nested/unknown column values, invoking correction routines when execution errors arise and using returned values to update its schema understanding.
  3. Self-Refinement Loop: The agent iteratively generates candidate SQL, executes it, and refines it, stopping upon output self-consistency (the same output is produced in repeated attempts).
  4. Format Restriction and CTE-Localization: Stringent output-format enforcement (e.g., column/row constraints) reduces malformed responses. For complex failures, the system rewrites the query using CTEs, debugging failing sub-queries in isolation.
  5. Parallelization and Majority-Vote Consensus: $R$ independent runs (compression → exploration → self-refinement) are launched, yielding candidates $\{A_1, \ldots, A_R\}$. The final output is selected via majority vote, with LLM arbitration for ambiguous or tied outcomes. The self-refinement loop and this consensus step are sketched in the second code example following the list.
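A minimal sketch of the compression step, directly instantiating the formula above. The trailing-digits heuristic stands in for the paper's richer prefix/suffix patterns, and get_ddl is a hypothetical callable returning the CREATE TABLE text for a table:

```python
import re
from collections import defaultdict

def compress_schema(table_names, get_ddl):
    """Sketch of compress(D): cluster tables by name pattern, keep one
    full DDL per cluster, and list remaining members by name only."""
    clusters = defaultdict(list)
    for name in table_names:
        # events_20240101 -> events; a stand-in for the paper's patterns.
        key = re.sub(r'[_\d]+$', '', name) or name
        clusters[key].append(name)

    lines = []
    for members in clusters.values():
        representative = sorted(members)[0]  # plays the role of t_k^*
        lines.append(get_ddl(representative))
        lines.extend(f"-- also in cluster: {m}" for m in sorted(members)[1:])
    return "\n".join(lines)
```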
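The self-refinement stopping rule and the consensus step can be sketched similarly. Here `llm` and `execute` are hypothetical callables (a prompted model and a database connection), and the LLM arbitration for ties is omitted:

```python
from collections import Counter

def self_refine(llm, execute, question, context, max_iters=5):
    """One refinement trajectory: generate SQL, execute it, feed errors
    or results back, and stop once two consecutive successful runs agree
    (the self-consistency criterion)."""
    feedback, last_result = "", None
    for _ in range(max_iters):
        sql = llm(f"{context}\n\nQuestion: {question}\n{feedback}")
        ok, result = execute(sql)  # (success flag, rows or error message)
        if ok:
            if result == last_result:  # same output twice -> stop
                return result
            last_result = result
            feedback = f"Your query returned {result!r}; verify and refine if needed."
        else:
            feedback = f"Your query failed with: {result}. Fix the SQL."
    return last_result

def majority_vote(candidates):
    """Consensus over the R parallel trajectories."""
    counts, by_key = Counter(), {}
    for result in candidates:
        if result is None:
            continue
        key = repr(result)
        by_key[key] = result
        counts[key] += 1
    return by_key[counts.most_common(1)[0][0]] if counts else None
```

A full run launches R such trajectories in parallel and passes their outputs to majority_vote.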

Ablation studies demonstrate the necessity of each module: disabling schema compression or column exploration causes a 3–4% EX degradation; disabling format restriction reduces performance by ≈3%.

Performance summary (Spider 2.0-snow, o1-preview backbone):

Method                     Execution Accuracy (EX)
ReFoRCE                    26.69%
Spider-Agent (o1-preview)  20.29%
DAIL-SQL (GPT-4o)           2.20%
CHESS (GPT-4o)              1.28%
DIN-SQL (GPT-4o)            0.00%

This marks a large gap relative to "classic" benchmarks (e.g., DIN-SQL+GPT-4o achieves 85.3% on Spider 1.0) and underscores Spider 2.0’s rigor.

5. Multilingual Extension: MultiSpider 2.0

MultiSpider 2.0 extends Spider 2.0 to eight languages (en, de, fr, es, pt, ja, zh, vi), preserving schema and query complexity while introducing linguistic and dialectal diversity (Pham et al., 29 Sep 2025). Human-in-the-loop processes ensure faithful question, schema, and value localization as well as SQL executability.

Key findings:

  • State-of-the-art LLMs that score ≈80% EX on the earlier Spider 1.0-based multilingual benchmark reach only ≈4–6% EX on MultiSpider 2.0, across English and non-English languages alike, illustrating a substantial gap on enterprise-scale multilingual tasks.
  • Collaborative modular language agent pipelines (COLA), which chain classifier, analyzer, and corrector agents with human-in-the-loop feedback and execution-based debugging, triple performance (≈15% EX), but significant deficits remain relative to the English-only setting.
  • Sources of performance loss include pretraining imbalance, script/encoding differences (e.g., logogram vs. alphabetic languages), and code-switching induced by English-origin column names embedded in non-English queries.

Table: MultiSpider 2.0 Execution Accuracy (EX) by Language

Model            en     fr     ja     zh
Gemini 1.5 Pro   4.87%  3.04%  3.91%  4.50%
OpenAI-o1-1217   4.37%  5.80%  4.58%  5.41%
DeepSeek-R1-70B  5.83%  5.46%  5.64%  5.89%

A pronounced cliff of roughly 75–80 percentage points relative to the earlier Spider 1.0-based generation is evident across all tested languages. This demonstrates that LLMs and agentic frameworks require fundamental advances for robust deployment in multilingual, dialect-rich, and real enterprise settings.

6. Future Directions and Open Challenges

Spider 2.0 highlights several unsolved problems and promising research directions:

  • Long-Context Retrieval: Efficient compression, retrieval, and schema pruning for thousands of columns and accompanying documentation are necessary to avoid context overflow.
  • Symbolic–Neural Hybrids: Integration of symbolic components (for intermediate SQL validation) within agentic LLM pipelines may mitigate hallucinations and accelerate debugging.
  • Execution-Grounded and RL-Based Training: Feedback from query execution and reinforcement learning from human feedback (RLHF) to directly optimize for semantic correctness, beyond surface-level SQL matches.
  • Dialect Robustness and Schema Adaptation: Dynamically normalizing SQL syntax, supporting code-switch normalization, and adapting models to evolving schemas and external knowledge sources.
  • Multilingual Generalization: Targeted fine-tuning (e.g., on contrastive paraphrase/counterfactual samples) and transliteration for multilingual schema linking are critical for closing the multilingual gap (Pham et al., 29 Sep 2025).

This suggests that robust, adaptive, and execution-aware agent architectures—potentially leveraging policy-gradient or MCTS refinement and enhanced schema linking—are necessary to surpass the current 26–27% EX frontier on Spider 2.0 and achieve viable deployment in actual production contexts (Deng et al., 2 Feb 2025).
