Text-to-SQL Benchmark: Evaluation & Insights

Updated 22 May 2026

Text-to-SQL benchmarks are systematic evaluation resources that assess systems' capability to translate natural language into executable SQL queries.
They measure schema complexity, query intent diversity, and robustness across various domains using standardized metrics.
Advanced synthesis methods and human-in-the-loop annotation improve reliability, coverage, and granular evaluation.

Text-to-SQL benchmarks are systematic evaluation resources for measuring and advancing the ability of natural language systems to generate semantically correct and executable SQL queries given free-form natural language questions and database schemas. These benchmarks underlie the empirical progress of semantic parsing, providing standardized metrics, schema complexity, robustness diagnostics, and coverage of real-world query intent, syntactic constructs, and domain phenomena.

1. Benchmark Structures and Coverage

Text-to-SQL benchmarks vary along several critical axes: the schema complexity (single-table vs. multi-table, flat vs. normalized schemas), cross-domain vs. domain-specific focus, language coverage, and the diversity of SQL constructs and intents present.

Cross-domain, multi-table: Benchmarks such as Spider, BIRD, and UNITE span hundreds of schemas from diverse domains (e.g., music, finance, education). Schemas average 4–10 tables and 30+ columns, with SQL queries involving multi-way JOINs, subqueries, GROUP BY/HAVING, and set operations (Pourreza et al., 2023, Lan et al., 2023).
Single-table, template-based: WikiSQL restricts all queries to one table, without JOINs or heavy aggregation (Pourreza et al., 2023).
Domain-specific: Recent work develops focused benchmarks for process mining (text-2-SQL-4-PM (Yamate et al., 18 Aug 2025)), financial analysis (BULL in FinSQL (Zhang et al., 2024)), geospatial reasoning (FloodSQL-Bench (Liu et al., 12 Dec 2025)), and enterprise data warehouses (BEAVER (Chen et al., 2024)).
Multilingual and cross-lingual: MultiSpider 2.0 extends Spider to 8 languages with schema-localization, and IndicDB targets complex Indian administrative schemas with 7 language variants (Pham et al., 29 Sep 2025, Dawar et al., 15 Apr 2026).
Reasoning complexity: LogicCat introduces multi-domain, chain-of-thought-annotated benchmarks for physics, mathematics, and hypothetical reasoning, with explicit multi-step SQL synthesis and reasoning traces (Liu et al., 24 May 2025).
Ambiguity and robustness: AmbiQT supplies parallel gold SQLs to each ambiguous question, while Dr.Spider introduces 17 systematic perturbation types to test robustness to schema, NLQ, and SQL-level shifts (Bhaskar et al., 2023, Chang et al., 2023).

2. Taxonomies and Dataset Synthesis Methodologies

Advanced benchmarks utilize multi-dimensional taxonomies to ensure comprehensive coverage:

Dimensions: Core user intent (14 types: e.g., aggregation, formatting, statistical analysis), statement type (SELECT, INSERT, UPDATE, DELETE, ALTER), syntactic structures (14 clause types: WHERE, JOIN, GROUP BY, subqueries, set ops), and key actions (e.g., time functions, JSON, window functions) (Wang et al., 17 Nov 2025).
Taxonomy-guided synthesis: SQL-Synth systematically combines all valid Cartesian products of the taxonomy to guide LLM-based NLQ–SQL pair generation, post-processing, and semantic validation. This approach achieves 100% category coverage, unlike Spider (CoreIntent: 79%, StmtType: 20%, Syntax: 71%, KeyAction: 33%) (Wang et al., 17 Nov 2025).
Value-aware, join-enforced pipelines: IndicDB structures denormalized government datasets into schema graphs with up to 6 FK-hops, employing an Architect–Auditor–Refiner LLM trio to guarantee high relational density and realistic multi-table coverage (Dawar et al., 15 Apr 2026).
Annotation protocols: Some resources combine expert curation (e.g., text-2-SQL-4-PM), professional translation, manual SQL verification, and multilayered qualifiers (e.g., query perspective, select/where clause coverage, aggregation usage, question word, vocabulary linkage) (Yamate et al., 18 Aug 2025).

3. Evaluation Metrics and Task Decomposition

Evaluation protocols have evolved to address the limitations of pure string-based metrics and to support granular diagnostics:

Metric	Definition/Role
Exact-Set-Match (EM)	Set-equivalence of SQL components (SELECT, WHERE, etc.); ignores order/literals (Pourreza et al., 2023)
Execution Accuracy (EX)	Equality of result sets from model and gold SQL on test DBs (Pourreza et al., 2023, Wang et al., 17 Nov 2025)
Structural Equivalence	Canonicalized SQL parse-tree match (structure accuracy) (Yamate et al., 18 Aug 2025)
Component-Level Accuracy	Clause-wise exact match averaged over SQL clauses (Lan et al., 2023)
Taxonomy-wise Performance	Coverage and performance per taxonomy axis/subcategory (Wang et al., 17 Nov 2025)
Robustness Drop / RRA	Absolute and relative robustness drops after perturbation (Chang et al., 2023)
Query Similarity (SQAM, TSED)	Clause- and AST-based no-execution similarity metrics; avoid running on databases (Song et al., 2023)
Reliability / Penalty-based	Assign +1, 0, or –c per correct, abstained, or incorrect answer under mixed answerable/unanswerable (Lee et al., 2024)
Workload Alignment/Complexity	Distributional reweighting, complexity consistency, execution time constraints (Abedini et al., 23 Apr 2026)

Composite and workload-aligned platforms (e.g., SQLyzr) capture not only correctness but also structural efficiency, complexity inflation, execution speed, and token consumption—with per-category reporting (Abedini et al., 23 Apr 2026).

Fine-grained error analysis dissects model failures into schema linking, aggregation, filter, and join errors, augmented by normalization and fix-suggestion modules.

4. Empirical Coverage, Robustness, and Benchmarking Insights

Recent comparative studies reveal substantial performance variation by dataset, language, domain, and query type:

Cross-domain and compositional: On UNITE (12+ domains, 3.9K SQL patterns), no model exceeds 50% execution accuracy; in-domain performance (Spider/BIRD) can reach 85–86% EX (Lan et al., 2023).
Diversity gaps: Public benchmarks have skewed coverage: SELECT+WHERE accounts for >70% of Spider, while complex constructs (UPDATE/DELETE/ALTER, JSON, window functions, correlated subqueries) are nearly absent (Wang et al., 17 Nov 2025).
Domain-specific: BEAVER’s queries average over 4 joins, nearly 10 columns/table, and employ business-centric logic; even top LLMs drop from ~68% (Spider) to 2% execution accuracy on BEAVER (Chen et al., 2024). BULL (finance) features 26 tables/DB and multi-lingual NLQ–SQL pairs, with FinSQL achieving up to 82.2% EX (en), 76.6% (cn) via LoRA tuning (Zhang et al., 2024).
Robustness: Dr.Spider shows SOTA models (e.g., Picard) drop by 14% EX under perturbation, and up to 50.7% for content-equivalence types (e.g., age→birthyear) (Chang et al., 2023).
Ambiguity: On AmbiQT, standard beam search recalls both gold SQLs for only 12–28% of ambiguous questions; new logical decoding algorithms can double this (Bhaskar et al., 2023).
Multilingual: MultiSpider 2.0 and IndicDB show sharp performance drops: English-to-non-English (–9% EX, “Indic Gap”); East Asian scripts fare worse due to schema-token mismatch and limited training data (Pham et al., 29 Sep 2025, Dawar et al., 15 Apr 2026).
Chain-of-thought: LogicCat demonstrates that conventional SOTA (<15% EX) cannot handle multi-step mathematical/physical reasoning, but CoT annotation boosts top model EX to ~34% (Liu et al., 24 May 2025).

5. Annotation Systems and Benchmark Lifecycle

Increasing dataset complexity magnifies the challenges of annotation. Human-in-the-loop platforms (e.g., BenchPress) leverage retrieval-augmented LLMs to generate NL descriptions of SQL logs, which human experts validate, yielding annotation accuracy of 93.0% (vs. 73.9% manual), and 5–6× throughput gains (Wenz et al., 11 Oct 2025). Automated synthesis pipelines now employ dual-path diversity expansions, cross-lingual paraphrasing, value- and schema-aware filtering, and rigorous audit/review to ensure both breadth and semantic fidelity (Wang et al., 17 Nov 2025, Dawar et al., 15 Apr 2026).

6. Impact and Future Directions

Text-to-SQL benchmarks have advanced from constrained, single-table, English-only settings to extensive, cross-domain, multi-lingual, and domain-specialized testbeds with explicit taxonomic design and robustness diagnostics. Current results consistently expose:

A persistent gap in generalization and robustness across domains, languages, and query complexities.
Substantial limitations in handling ambiguous, adversarial, or enterprise-scaled schemas.
The need for benchmarks coupling fine-grained, taxonomy-driven coverage, dynamic workload alignment, and multi-dimensional evaluation (structural, semantic, efficiency, reliability).
The importance of human-in-the-loop architectures for scalable, high-precision data curation and future multi-agent evaluation settings.

Ongoing research emphasizes multi-agent and reasoning-augmented architectures, retrieval-enhanced pipelines, schema-normalization for non-Western data, and end-to-end reliability under penalty-based scoring (Zhang et al., 2024, Ahmed et al., 6 Nov 2025, Lee et al., 2024).

These benchmarks continue to inform the development of increasingly robust, interpretable, and production-ready Text-to-SQL systems across scientific, business, and multilingual applications.