
Spider Dataset for Semantic Parsing

Updated 1 April 2026
  • The Spider dataset is a large-scale, human-annotated benchmark for semantic parsing, mapping natural-language questions to executable SQL across diverse database schemas.
  • It uses a rigorous multi-stage annotation protocol, ensuring each natural-language question is paired with a single, execution-validated gold SQL query.
  • Extensions include multilingual and multi-formalism variants, promoting evaluation of model generalization, compositionality, and cross-linguistic robustness.

The Spider dataset is a large-scale, human-annotated resource designed to advance research in complex, cross-domain semantic parsing and text-to-SQL modeling. Iteratively extended with multilingual and multi-formalism variants, Spider and its derivatives serve as key benchmarks for evaluating generalization, compositionality, and cross-linguistic robustness in semantic parsing. The dataset’s structure and evaluation protocol incentivize models to generate executable queries for novel database schemas, challenging both natural language understanding and database reasoning components.

1. Dataset Composition and Annotation Protocol

Spider comprises 10,181 questions paired with 5,693 unique SQL queries spanning 200 relational databases across 138 domains. Database schemas are complex, with an average of 5.1 tables, 27.6 columns, and 8.8 foreign keys per database. Question complexity ranges from simple selections to deeply nested and compositional SQL queries; 14.8% include nested subqueries, 26.2% contain GROUP BY clauses, and 37.4% require aggregation.
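As a rough illustration, clause-level statistics of this kind can be recomputed from the released JSON files. The sketch below assumes the public `train_spider.json` format with `question`, `query`, and `db_id` fields, and uses heuristic string matches rather than the parser-based counts reported in the paper.

```python
# Sketch: heuristic clause statistics over the released Spider training file.
# Assumes the public train_spider.json layout; counts are string-match approximations.
import json

AGG = ("count(", "sum(", "avg(", "min(", "max(")

with open("train_spider.json") as f:
    examples = json.load(f)

n = len(examples)
queries = [ex["query"].lower() for ex in examples]
nested = sum(q.count("select") > 1 for q in queries)     # more than one SELECT => subquery
group_by = sum("group by" in q for q in queries)
agg = sum(any(fn in q for fn in AGG) for q in queries)

print(f"{n} examples")
print(f"nested subqueries: {nested / n:.1%}")
print(f"GROUP BY clauses:  {group_by / n:.1%}")
print(f"aggregations:      {agg / n:.1%}")
```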

Annotation was performed via a multi-stage process involving SQL-proficient annotators: collection of diverse schemas, question and SQL authoring, iterative review, paraphrasing for naturalness, and a final audit with execution validation. Each question is aligned to a single gold SQL parse. No overlap in database schemas or SQL programs is permitted between the training and test sets, enforcing cross-domain generalization (Yu et al., 2018).

2. Task Formulation and Evaluation Metrics

Spider formalizes text-to-SQL as predicting, for a natural-language question $x$ and database schema $\mathcal{S}$, an executable SQL query $y$ such that $y$ executed on $\mathcal{S}$ yields the correct answer, independent of the data instance. The model is a function

$$f: (x, \mathcal{S}) \mapsto \hat{y}, \quad \hat{y} \in \mathcal{Y}(\mathcal{S})$$

with $\mathcal{Y}(\mathcal{S})$ the set of valid SQL queries over $\mathcal{S}$. Evaluation comprises:

  • Exact Match (EM): $\mathrm{EM} = \frac{\#\{\hat{y} = y\}}{\#\{\text{examples}\}}$ — syntactic skeleton match, ignoring literal values but requiring structural identity.
  • Execution Accuracy (EA): $\mathrm{EA} = \frac{\#\{\mathrm{exec}(\hat{y}) = \mathrm{exec}(y)\}}{\#\{\text{examples}\}}$ — outputs must yield identical DB results.
  • Component-level F1: Matching at clause granularity (SELECT, WHERE, GROUP BY, etc.) as unordered sets.

Difficulty is stratified via compositional patterns (“easy”, “medium”, “hard”, “extra-hard”) (Yu et al., 2018).
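A minimal sketch of the two headline metrics is given below. It assumes gold and predicted SQL strings plus the released SQLite files laid out as `database/<db_id>/<db_id>.sqlite`; the official evaluation script performs a far more detailed, component-aware comparison, so the exact-match normalization here is only an approximation.

```python
# Sketch: simplified Spider-style evaluation (exact match + execution accuracy).
# The official scorer compares clause components as unordered sets; this version
# compares normalized query skeletons, which is stricter.
import re
import sqlite3
from collections import Counter

def normalize(sql: str) -> str:
    """Crude skeleton normalization: lowercase, strip literals, collapse whitespace."""
    sql = sql.lower().strip().rstrip(";")
    sql = re.sub(r"'[^']*'", "'value'", sql)      # ignore string literal values
    sql = re.sub(r"\b\d+(\.\d+)?\b", "0", sql)    # ignore numeric literal values
    return re.sub(r"\s+", " ", sql)

def exact_match(pred: str, gold: str) -> bool:
    """Approximate exact match: identical normalized query skeletons."""
    return normalize(pred) == normalize(gold)

def execution_match(pred: str, gold: str, db_path: str) -> bool:
    """Execution accuracy: both queries return the same multiset of rows."""
    with sqlite3.connect(db_path) as conn:
        try:
            pred_rows = conn.execute(pred).fetchall()
            gold_rows = conn.execute(gold).fetchall()
        except sqlite3.Error:
            return False
    return Counter(pred_rows) == Counter(gold_rows)

def evaluate(examples, db_root="database"):
    """examples: list of dicts with 'pred', 'gold', and 'db_id' keys (assumed format)."""
    em = ea = 0
    for ex in examples:
        db_path = f"{db_root}/{ex['db_id']}/{ex['db_id']}.sqlite"
        em += exact_match(ex["pred"], ex["gold"])
        ea += execution_match(ex["pred"], ex["gold"], db_path)
    n = max(len(examples), 1)
    return {"exact_match": em / n, "execution_accuracy": ea / n}
```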

3. Cross-Domain and Zero-Shot Generalization

Spider is explicitly designed for zero-shot transfer: test databases and queries are unseen during training, precluding memorization and necessitating schema reasoning and lexical-semantic mapping. This property has driven architectural advances such as:

  • Schema encoding with Graph Neural Networks: Representing the schema as a typed graph and propagating question-conditioned embeddings boosts performance, especially for complex join queries and multi-table reasoning (Bogin et al., 2019); a toy construction of such a graph is sketched after this list.
  • Global Reasoning Mechanisms: Methods incorporating global gating and discriminative re-ranking using GCNs improve EM by up to 8 points, particularly by enforcing coverage and consistency in schema item selection (Bogin et al., 2019).
  • Neural Data Synthesis Approaches: Purely neural hierarchical architectures factorize data generation as schema→entity→question→SQL, facilitating data augmentation—especially in zero-shot and domain-extension scenarios (Yang et al., 2021).
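The sketch below illustrates the kind of typed schema graph that GNN-based encoders operate over: tables and columns as nodes, with edge types for table–column membership and foreign keys. It follows the general idea rather than any specific implementation, and the schema shown is a toy example, not taken from Spider.

```python
# Sketch: build a typed schema graph (nodes = tables and columns; typed edges
# for table-column membership and foreign keys). GNN encoders propagate
# question-conditioned embeddings over graphs of this shape.
from dataclasses import dataclass, field

@dataclass
class SchemaGraph:
    nodes: list = field(default_factory=list)   # (node_id, kind, name)
    edges: list = field(default_factory=list)   # (src, dst, edge_type)

    def add_node(self, kind: str, name: str) -> int:
        self.nodes.append((len(self.nodes), kind, name))
        return len(self.nodes) - 1

def build_schema_graph(tables: dict, foreign_keys: list) -> SchemaGraph:
    """tables: {table_name: [column, ...]}; foreign_keys: [((t, c), (t, c)), ...]."""
    g = SchemaGraph()
    col_ids = {}
    for table, columns in tables.items():
        t_id = g.add_node("table", table)
        for col in columns:
            c_id = g.add_node("column", f"{table}.{col}")
            col_ids[(table, col)] = c_id
            g.edges.append((t_id, c_id, "has_column"))
            g.edges.append((c_id, t_id, "belongs_to"))
    for src, dst in foreign_keys:
        g.edges.append((col_ids[src], col_ids[dst], "foreign_key"))
        g.edges.append((col_ids[dst], col_ids[src], "foreign_key_rev"))
    return g

# Toy schema: singers and the concerts they perform in.
graph = build_schema_graph(
    {"singer": ["singer_id", "name"], "concert": ["concert_id", "singer_id", "venue"]},
    [(("concert", "singer_id"), ("singer", "singer_id"))],
)
```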

4. Data Augmentation and Synthetic Data for Semantic Parsing

Because large labeled datasets are expensive to create, recent research leverages automatic data synthesis. Key methods include:

  • Hierarchical Neural Synthesis: Decomposes synthesis over a schema $\mathcal{S}$ into entity sampling, question generation (T5), and self-labeling with a strong parser. Zero-shot synthesis generalizes to unseen schemas, producing (question, SQL) pairs for efficient augmentation. On Spider, this method achieves state-of-the-art dev EM of 77.2% (Yang et al., 2021).
  • PCFG+BART Synthesis: Employs a non-neural PCFG to sample structurally novel SQL queries, then translates them to natural language using a finetuned BART model. This increases both set-match and execution accuracy over strong baselines when combined with fine-tuning (Wang et al., 2021).
  • Cycle-Consistent Data Selection (GAZP): Engages bidirectional modeling (parser/generator) and enforces cycle consistency via execution, selecting only those synthetic pairs where the cycle is closed. This method yields empirical improvements of 3–4 percentage points in both EM and EX in zero-shot adaptation to new schemas (Zhong et al., 2020).

Augmentation increases the diversity of training pairs and decorrelates schema elements from sketch-level programs, improving generalization and reducing spurious correlations (Yang et al., 2021).
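As a concrete illustration of the GAZP-style selection step, the sketch below keeps a synthetic (question, SQL) pair only if re-parsing the generated question yields a query whose execution result matches the original query's. The `generate_question`, `parse_to_sql`, and `execute` callables are hypothetical stand-ins for a trained question generator, a trained parser, and a database executor, not the actual GAZP implementation.

```python
# Sketch: execution-based cycle-consistency filter for synthetic data selection.
from typing import Callable, Iterable

def cycle_consistent_pairs(
    sampled_sql: Iterable[str],
    db_id: str,
    generate_question: Callable[[str, str], str],  # (sql, db_id) -> question
    parse_to_sql: Callable[[str, str], str],       # (question, db_id) -> sql
    execute: Callable[[str, str], list],           # (sql, db_id) -> result rows
):
    """Yield (question, sql) pairs that survive the execution-based cycle check."""
    for sql in sampled_sql:
        question = generate_question(sql, db_id)
        reparsed = parse_to_sql(question, db_id)
        try:
            if execute(reparsed, db_id) == execute(sql, db_id):
                yield question, sql
        except Exception:
            # Discard pairs whose queries fail to execute.
            continue
```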

5. Multilingual Diversification: MultiSpider and Cross-Lingual Variants

Spider has inspired several multilingual and cross-lingual extensions:

  • MultiSpider extends Spider to seven languages (English, German, French, Spanish, Japanese, Chinese, Vietnamese), each comprising 9,691 question–SQL pairs over 166 databases. Annotation included human post-editing and schema translation with high inter-annotator agreement. Non-English exact-match accuracy lags English by 6.1% on average due to amplified lexical and structural mapping difficulties (Dou et al., 2022).
  • Chinese Spider (CSpider): Human-verified, stylistically diverse Chinese translations of the English Spider dataset. Due to the need for segmentation and cross-lingual embedding alignment between Chinese questions and English schema/keywords, character-level encoding with cross-lingual embeddings outperforms word-based methods. Maximum exact-match accuracy achieved is 12.1%, ~2 points below the English baseline (Min et al., 2019).
  • Ar-Spider: Arabic–English parallel corpus covering 9,691 questions and 166 databases. Cross-lingual models (LGESQL, S2SQL with XLM-R encoders) achieve up to 66.63% EM (Arabic), reducing the gap from English (74.36% EM) to 7.73%. Context Similarity Relationship edges in the schema graph, derived from LASER embeddings, yield further gains (Almohaimeed et al., 2024).

Schema linking remains the primary bottleneck in cross-lingual semantic parsing, with augmentation (e.g., SAVe for verified schema paraphrases) recovering a substantial fraction of the cross-lingual performance gap (Dou et al., 2022).
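In the spirit of the embedding-similarity edges described above, the sketch below adds question-token-to-schema-item edges whenever cosine similarity exceeds a threshold. The `embed` function and the 0.6 threshold are assumptions for illustration (e.g., a multilingual encoder such as LASER could supply the vectors), not values taken from the cited papers.

```python
# Sketch: similarity-based schema-linking edges between question tokens and schema items.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def similarity_edges(question_tokens, schema_items, embed, threshold=0.6):
    """Return (token_index, schema_index, score) triples above the threshold."""
    tok_vecs = [embed(t) for t in question_tokens]
    item_vecs = [embed(s) for s in schema_items]
    edges = []
    for i, tv in enumerate(tok_vecs):
        for j, sv in enumerate(item_vecs):
            score = cosine(tv, sv)
            if score >= threshold:
                edges.append((i, j, score))
    return edges
```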

6. Multi-Formalism and Unified Query Benchmarks

Extensions such as Spider4SSC provide triply aligned text-to-query mappings (SQL, SPARQL, Cypher) over 4,525 questions from 159 databases, using automatic rule-based translation (S2CLite parser) to convert SPARQL to Cypher. S2CLite, a purely rule-based, ontology-agnostic pipeline, increases parsing accuracy from 44.2% to 77.8% and achieves 96.6% execution accuracy on intersected parsed queries, enabling evaluation and model pretraining across multiple query languages (Vejvar et al., 12 Nov 2025).
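For intuition, a triply aligned record of the kind Spider4SSC provides might look like the hypothetical example below; the question, schema, and queries are invented for illustration and are not drawn from the dataset.

```python
# Hypothetical example of a triply aligned text-to-query record:
# the same question expressed as SQL, SPARQL, and Cypher over a toy singer schema.
aligned_example = {
    "question": "List the names of singers older than 30.",
    "sql": "SELECT name FROM singer WHERE age > 30",
    "sparql": (
        "SELECT ?name WHERE { "
        "?s a :Singer ; :name ?name ; :age ?age . "
        "FILTER(?age > 30) }"
    ),
    "cypher": "MATCH (s:Singer) WHERE s.age > 30 RETURN s.name",
}
```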

These resources advance the paradigm from text-to-SQL to more general text-to-query or neural symbolic reasoning tasks, supporting evaluation and pretraining for cross-formalism, multimodal parsers.

7. Impact, Limitations, and Open Challenges

Spider and its extensions have defined the state-of-the-art in semantic parsing evaluation, revealing sharp drops in EM for methods unable to generalize compositionally, or to align question tokens to diverse schemas. Key limitations and future challenges include:

  • Incomplete coverage of long-tail SQL constructs even with synthetic augmentation (Yang et al., 2021).
  • Persistent cross-lingual schema-lexicon mismatch, mitigated but not eliminated by embedding-space similarity edges and paraphrase augmentation (Dou et al., 2022, Almohaimeed et al., 2024).
  • Remaining performance gaps on “extra-hard” queries involving complex nesting, set operations, and multi-table joins.
  • The need for models that integrate more sophisticated type reasoning, error detection, and joint reasoning over schema, query, and context (Bogin et al., 2019, Rubin et al., 2020).
  • Efficient, accurate data synthesis pipelines that ensure both coverage and noise robustness across languages and domains.

Rapid domain bootstrapping, more adaptive schema linking, and multi-formalism models are active research frontiers, with the Spider benchmark serving as the primary driver for progress in neural semantic parsing and beyond.
