Archer NL2SQL Eval Challenge 2025
- The Archer NL2SQL Evaluation Challenge 2025 is a bilingual benchmark that rigorously tests NL2SQL systems on complex reasoning tasks across multi-domain databases.
- The evaluation uses advanced metrics like ROSE and strict execution accuracy to ensure semantic correctness over simple string matching.
- Innovative designs, including planning-centric architectures and plan diversification, significantly enhance query execution reliability and performance.
The Archer NL2SQL Evaluation Challenge 2025 is a bilingual, large-scale benchmark and competition designed to rigorously evaluate text-to-SQL (NL2SQL) systems on both English and Chinese inputs. The challenge targets complex reasoning scenarios—arithmetic, commonsense, and hypothetical/counterfactual—in a multi-domain, schema-diverse setting. It has become the reference venue for empirical comparison of methods targeting practical semantic understanding in NL2SQL tasks, with significant implications for evaluation methodology and dataset design (Zheng et al., 2024, Liu et al., 27 Oct 2025, Zeng et al., 11 Jun 2025, Pei et al., 14 Apr 2026).
1. Dataset Structure and Complexity
Archer consists of 1,042 English and 1,042 Chinese NL questions mapped to 521 unique SQL templates, spanning 20 databases each representing a distinct domain. The data exhibits high schema complexity:
- Average tables per DB: 7.55
- Average columns per DB: 45.25
- Average number of JOINed tables per query: 2.17
- SQL query length: ≈80 tokens
Every NL question requires arithmetic reasoning; 51.4% require commonsense, and 44.0% involve hypothetical scenarios. Arithmetic expressions include addition (34%), multiplication (47.8%), and division (62.0%) among others. Example reasoning tasks include calculating ratios, per-entity aggregations, and conditional reasoning (e.g., counterfactuals about grouped subsets) (Zheng et al., 2024).
Representative SQL examples include complex HAVING clauses, arithmetic in SELECT, pervasive aggregation, and multi-table schema navigation. Splits are database-disjoint: all question–SQL pairs for a given database fall in exactly one of train (16 DBs), dev (2 DBs), or test (2 DBs).
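The database-level split described above can be checked mechanically. The following is a minimal sketch, assuming each example is a dict with hypothetical `db_id` and `split` fields (the field names are illustrative, not taken from the Archer release):

```python
from collections import defaultdict

def leaking_databases(examples):
    """Return the databases whose examples appear in more than one split.

    `examples` is an iterable of dicts with 'db_id' and 'split' keys
    (hypothetical field names chosen for this sketch).
    """
    splits_by_db = defaultdict(set)
    for example in examples:
        splits_by_db[example["db_id"]].add(example["split"])
    # A database-disjoint split yields an empty dict here.
    return {db: splits for db, splits in splits_by_db.items() if len(splits) > 1}
```

An empty return value means that no database contributes questions to more than one split, so dev and test schemas are unseen at training time.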
2. Evaluation Metrics and Protocol
The challenge protocol mandates strict, reproducible metrics and public resources:
- SQL Validity (VA): Fraction of syntactically valid SQL
- Execution Accuracy (EX): Fraction producing gold-matching query results
Normalization is applied at the result-comparison stage: row order and duplicate rows are ignored. Systems may train only on the Archer, Spider, and CSpider public training splits, and no domain- or schema-specific heuristics are allowed. Model inference must complete within 60 seconds per query on a single GPU, and model weights must be released if the parameter count exceeds 500M (Zheng et al., 2024).
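The two metrics and the normalization rule can be illustrated with a short sketch. It assumes SQLite as the execution engine and treats "executes without error" as a proxy for validity. The function names are illustrative and this is not the official scorer.

```python
import sqlite3

def normalized(rows):
    """Drop ordering and duplicate rows so only the set of rows matters."""
    return frozenset(tuple(row) for row in rows)

def execution_match(conn, predicted_sql, gold_sql):
    """Return (is_valid, is_match) for a single prediction."""
    gold_rows = conn.execute(gold_sql).fetchall()
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # Assumption: a query that fails to execute counts against VA and EX.
        return False, False
    return True, normalized(predicted_rows) == normalized(gold_rows)

def score(conn, pairs):
    """pairs: non-empty list of (predicted_sql, gold_sql). Returns (VA, EX)."""
    results = [execution_match(conn, pred, gold) for pred, gold in pairs]
    total = len(results)
    va = sum(valid for valid, _ in results) / total
    ex = sum(match for _, match in results) / total
    return va, ex
```

Because the comparison uses a set of rows, a prediction that differs from the gold query only by an extra ORDER BY or by duplicated rows still counts as a match.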
3. Advances in Evaluation Methodology
Recent work has exposed limitations of EX, including its sensitivity to non-canonical SQL form, ambiguity in NL intent, and failure to penalize errors in ground-truth or dataset design (Pei et al., 14 Apr 2026). In response, the ROSE metric was introduced as an intent-centered, reference-agnostic approach deploying an adversarial Prover–Refuter cascade:
- Prover: Judges whether the predicted SQL (S_p) is semantically correct given only the NL input and schema, independently of the reference SQL (S_g). It uses LLM-based prompting to check query structure and result support.
- Refuter: Uses S_g and its execution result to challenge the Prover’s approval or to flag ambiguities and gold errors (diagnostic labels AmbQ and GoldX).
- Final ROSE score: 1 (acceptable) iff S_p is syntactically valid, passes the Prover’s independent semantic check, and cannot be overturned by the adversarial Refuter.
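The decision rule in the last bullet can be written down as a small sketch. The Prover and Refuter are passed in as callables because their prompts are not specified here. The `RoseVerdict` type, the argument names, the Refuter's `(overturned, label)` return shape, and the use of 0 for rejection are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RoseVerdict:
    score: int            # 1 = acceptable; 0 = rejected (0 is an assumption)
    label: Optional[str]  # diagnostic label such as "AmbQ" or "GoldX", if any

def rose_score(question, schema, pred_sql, gold_sql, *, is_valid, prover, refuter):
    """Combine validity, Prover, and Refuter into a single verdict.

    is_valid(pred_sql) -> bool
    prover(question, schema, pred_sql) -> bool        (never sees gold_sql)
    refuter(question, schema, pred_sql, gold_sql) -> (overturned, label)
    """
    if not is_valid(pred_sql):
        return RoseVerdict(0, None)
    if not prover(question, schema, pred_sql):
        return RoseVerdict(0, None)
    overturned, label = refuter(question, schema, pred_sql, gold_sql)
    return RoseVerdict(0 if overturned else 1, label)
```

Keeping the reference SQL out of the Prover's signature mirrors the reference-agnostic design: only the Refuter stage can see S_g.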
ROSE demonstrates substantially higher human alignment than EX (Cohen’s κ=80.43% vs. 25.56% on ROSE-VEC). Recommendations for Archer 2025 include ROSE as a leaderboard metric, multi-ground-truth support, and explicit error labeling (Pei et al., 14 Apr 2026).
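The agreement statistic quoted above, Cohen's κ, corrects raw agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). A minimal sketch of the computation for two label sequences (for example, human judgments and a metric's verdicts) is shown here with illustrative names:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # p_o: fraction of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # p_e: agreement expected if both raters labeled independently
    # according to their own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        return float("nan")  # undefined when both raters use one shared label
    return (observed - expected) / (1 - expected)
```

A metric that agrees with human judges no more often than chance scores near 0, which is why κ is more informative here than raw agreement.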
4. System Designs and Methodological Innovations
4.1 Planning-Centric Architectures
Top-performing systems, exemplified by OraPlan-SQL, employ a two-agent pipeline (Liu et al., 27 Oct 2025):
- Planner Agent: Produces explicit, step-by-step natural language plans that systematically decompose the NL question’s reasoning trajectory.
- SQL Agent: Converts plans into executable SQL, augmenting with in-context learning and entity-linking rules for schema alignment.
The meta-prompting strategy clusters planner failures, distills LLM-guided corrective guidelines, and injects them directly into the Planner’s system prompt. Explicit entity linking (with multi-form surface matching for entities) addresses transliteration and name-variation issues in the Chinese portion.
4.2 Plan Diversification and Majority Voting
Reliability gains are achieved by generating multiple (M) candidate plans and queries, then selecting the final query by majority vote over execution outcomes. Diversification improves EX (e.g., from 71.15% to 72.12% using M=5), with validity maintained above 99% (Liu et al., 27 Oct 2025).
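Majority voting over execution outcomes can be sketched as follows. This is an illustration under stated assumptions (SQLite execution, order-insensitive result sets, ties broken in favor of the earliest candidate), not the OraPlan-SQL implementation.

```python
import sqlite3
from collections import Counter

def result_key(conn, sql):
    """Execute a candidate and return a hashable result, or None on failure."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return frozenset(rows)

def vote(conn, candidates):
    """Pick the candidate whose execution result is shared by the most candidates."""
    keyed = [(sql, result_key(conn, sql)) for sql in candidates]
    counts = Counter(key for _, key in keyed if key is not None)
    if not counts:
        return None  # every candidate failed to execute
    winning_result, _ = counts.most_common(1)[0]
    # Return the first candidate that produced the winning result.
    return next(sql for sql, key in keyed if key == winning_result)
```

Candidates that fail to execute cast no vote, which is one plausible reason validity stays high when several candidates are sampled.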
5. SQL Equivalence, Patterns, and LLM-Based Assessment
Evaluation of NL2SQL models increasingly relies on LLM-based equivalence evaluation rather than pure result matching (Zeng et al., 11 Jun 2025). Key formal distinctions:
- Exact Semantic Equivalence: Two queries return identical results on every valid database instance (undecidable in the general case).
- Weak Equivalence: Accepts practical equivalence on production data, or with trivial rewrites (e.g., alias/ordering, commutativity).
LLM-based pipelines involve normalization, string-matching shortcuts, and CoT-based multi-run LLM assessment. Prompt templates include explicit equivalence rubrics and "Miniature & Mull" protocols, which simulate counterexample generation on small DB instances. On Dataverse, GPT-4-0314 is reported to reach F1=0.9231 (Equivalent) and F1=0.7200 (Not Equivalent), significantly exceeding string-based matching.
Equivalence patterns relevant to Archer include:
- Join ↔ Subquery transformations
- DISTINCT ↔ GROUP BY
- Function, alias, and case normalization
- Aggregation and filtering variants
Common sources of inequivalence are wrong JOIN keys, incorrect WHERE-clause logic, misplaced groupings, and function misuse.
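Some of these patterns can be made concrete on a toy database. The sketch below uses an invented two-table schema (`customers`, `orders`) and compares query results as multisets. Agreement on one small instance does not prove equivalence, but a disagreement is a genuine counterexample.

```python
import sqlite3
from collections import Counter

def same_result(conn, sql_a, sql_b):
    """Compare two queries on one database instance, ignoring row order."""
    return Counter(conn.execute(sql_a).fetchall()) == Counter(conn.execute(sql_b).fetchall())

# Toy instance; schema and data are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers(id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders(id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bo'), (3, 'Cy');
    INSERT INTO orders VALUES (1, 1, 150), (2, 1, 200), (3, 2, 50), (4, 3, 120);
""")

# DISTINCT <-> GROUP BY
distinct_sql = "SELECT DISTINCT customer_id FROM orders"
group_by_sql = "SELECT customer_id FROM orders GROUP BY customer_id"

# Join <-> subquery
join_sql = ("SELECT DISTINCT c.name FROM customers c "
            "JOIN orders o ON o.customer_id = c.id WHERE o.amount > 100")
subquery_sql = ("SELECT name FROM customers WHERE id IN "
                "(SELECT customer_id FROM orders WHERE amount > 100)")

# Wrong JOIN key: joins on the order id instead of the customer id.
wrong_key_sql = ("SELECT DISTINCT c.name FROM customers c "
                 "JOIN orders o ON o.id = c.id WHERE o.amount > 100")

print(same_result(conn, distinct_sql, group_by_sql))
print(same_result(conn, join_sql, subquery_sql))
print(same_result(conn, join_sql, wrong_key_sql))
```

The first two comparisons agree on this instance, while the wrong-key variant returns a different set of names, the kind of error that string matching cannot distinguish from a harmless rewrite.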
6. Experimental Results and Analysis
Challenge baselines indicate substantial difficulty:
- GPT-3.5 zero-shot: English EX = 3.85%, Chinese EX = 1.92% (test)
- GPT-4 + DIN-SQL: English EX = 6.73%
- Fine-tuned T5-3B: <5% EX
Performance degrades sharply on hypothetical or nested queries (EX: 6.33% or below). The 2025 leaderboard is headed by OraPlan-SQL (GPT-5 backend): EX = 54.96% (EN), 56.67% (ZH), outpacing the next best by up to 12.6 points (Liu et al., 27 Oct 2025).
Ablations show that:
- Feedback-guided meta-prompting yields the largest EX improvement (≈28 points).
- Explicit planning (Planner Agent) confers a ≈7.7-point EX advantage over direct-to-SQL generation.
- Plan diversification and in-context learning bring smaller but consistent gains.
7. Recommendations for Benchmark and Pipeline Design
The literature prescribes several best-practice guidelines for future Archer iterations:
- Dataset Construction: Integrate human-labeled and synthetic query pairs targeting full equivalence/inequivalence pattern coverage; maintain small development sets for pipeline tuning.
- Prompting: Provide rich task, schema, and equivalence criteria; inject schema metadata and plan annotations; accommodate minor alias/order differences for weak equivalence.
- Evaluation: Employ a multistage pipeline (string match, then LLM-based CoT reasoning, then optional execution fallback) and report stability, error categories, and split P/R/F₁ metrics.
- Benchmarking: Use fixed synthetic pattern suites and real-world queries; distinguish strict from weak equivalence in rubric.
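The multistage evaluation recommended above might be organized as in the following sketch. The LLM judge and the executor are injected as callables, and the function names, verdict strings, and the default of three judge runs are assumptions made for illustration.

```python
import re
from collections import Counter

def normalize_sql(sql):
    """Crude normalization: collapse whitespace, drop a trailing ';', lowercase.

    Lowercasing also alters string literals, so this is only a cheap shortcut.
    """
    collapsed = re.sub(r"\s+", " ", sql).strip()
    return collapsed.rstrip(";").strip().lower()

def judge_equivalence(pred_sql, gold_sql, *, llm_judge=None, runs=3, execute=None):
    """Return (verdict, stage) using string match, then LLM votes, then execution.

    llm_judge(pred_sql, gold_sql) -> "equivalent" | "not_equivalent"
    execute(sql) -> comparable query result
    """
    # Stage 1: string-match shortcut.
    if normalize_sql(pred_sql) == normalize_sql(gold_sql):
        return "equivalent", "string_match"
    # Stage 2: multi-run LLM assessment, accepted only on a strict majority.
    if llm_judge is not None:
        votes = Counter(llm_judge(pred_sql, gold_sql) for _ in range(runs))
        verdict, count = votes.most_common(1)[0]
        if count > runs / 2:
            return verdict, "llm"
    # Stage 3: optional execution fallback.
    if execute is not None:
        same = execute(pred_sql) == execute(gold_sql)
        return ("equivalent" if same else "not_equivalent"), "execution"
    return "undecided", "none"
```

Returning the stage alongside the verdict makes it straightforward to report how often each stage decided the outcome, which supports the stability and error-category reporting suggested above.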
Adoption of ROSE or comparable intent-centered metrics is advised to ensure semantic correctness supersedes stylistic conformance, accommodate ambiguity, surface gold errors, and deliver reliable leaderboard outcomes (Zeng et al., 11 Jun 2025, Pei et al., 14 Apr 2026).
References:
- (Zheng et al., 2024) Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning
- (Liu et al., 27 Oct 2025) OraPlan-SQL: A Planning-Centric Framework for Complex Bilingual NL2SQL Reasoning
- (Zeng et al., 11 Jun 2025) Taming SQL Complexity: LLM-Based Equivalence Evaluation for Text-to-SQL
- (Pei et al., 14 Apr 2026) ROSE: An Intent-Centered Evaluation Metric for NL2SQL