FINCH Benchmark: Evaluating Financial Text-to-SQL
- FINCH Benchmark is a domain-specific evaluation framework for Text-to-SQL in finance, integrating complex schemas and financial terminology.
- It compiles 33 databases with 292 tables and over 75K natural language–SQL pairs across diverse financial domains like banking, loans, and accounting.
- The FINCH Score uses clause-weighted metrics to assess model performance, highlighting challenges in schema linking and complex SQL reasoning.
FINCH (Financial Intelligence using Natural language for Contextualized SQL Handling) is a domain-specific benchmark for evaluating Text-to-SQL systems in finance. Addressing the absence of specialized, large-scale datasets in this domain, FINCH introduces a resource with complex schemas, domain-specific terminology, and rigorous evaluation methods. It incorporates a new clause-weighted metric, the FINCH Score, designed to capture nuanced failures and partial successes in financial SQL generation.
1. Dataset Composition and Scope
FINCH encompasses 33 databases sourced from Spider (22), BIRD (7), BULL (3), and BookSQL (1), systematically filtered to represent financial verticals such as retail sales, banking transactions, loans, insurance, e-commerce, funds & portfolios, stocks, accounting, marketing, and credit cards. The dataset contains 292 tables with 2,233 columns and 177 foreign-key relations. The total number of validated natural language–SQL pairs is 75,725.
Schema complexity is reflected in the per-table statistics: tables average approximately 7.65 columns (2,233/292) and about 0.61 foreign-key links (177/292). Data types mirror standard SQL types (INTEGER, FLOAT, VARCHAR, DATE), with a considerable emphasis on numeric types due to the prevalence of accounting and risk metrics.
FINCH’s annotation protocol produces a heterogeneous mix of SQL constructs:
- Simple SELECT (no aggregation, no joins): 9,358 (≈12%, “easy”)
- Aggregations (GROUP BY, HAVING): 33,780 (≈45%, “medium”)
- Nested subqueries: 32,587 (≈43%, “hard”)
- Joins: prevalent across medium and hard difficulty bins
Dataset partitions follow a strict cross-database protocol—no database appears in more than one split—and ensure representation from each of the ten financial domains per split. Cross-domain evaluation is performed by training on 80% of databases and holding out the remaining 20% for testing.
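As a concrete illustration, the sketch below shows one way such a cross-database, domain-stratified 80/20 split could be constructed. The function name and the `db_to_domain` mapping are illustrative assumptions, not code from the FINCH release.

```python
import random
from collections import defaultdict

def cross_database_split(db_to_domain, test_fraction=0.2, seed=0):
    """Split whole databases (never individual queries) into train/test,
    keeping every financial domain represented in each split where possible."""
    rng = random.Random(seed)
    by_domain = defaultdict(list)
    for db, domain in db_to_domain.items():
        by_domain[domain].append(db)

    train, test = [], []
    for domain, dbs in sorted(by_domain.items()):
        rng.shuffle(dbs)
        if len(dbs) == 1:
            # A single-database domain cannot appear in both splits;
            # keep it in train by default (illustrative choice).
            train.extend(dbs)
            continue
        # Hold out roughly test_fraction of this domain's databases,
        # leaving at least one database on each side of the split.
        n_test = max(1, min(len(dbs) - 1, round(len(dbs) * test_fraction)))
        test.extend(dbs[:n_test])
        train.extend(dbs[n_test:])
    return train, test
```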
2. Construction Procedures and Quality Control
Schema selection was guided by domain experts, focusing on canonical financial structure and excluding non-financial domains. Inclusion criteria required explicit mapping of schema tables or columns to banking, loans, sales, or portfolio analytics.
Annotation steps entailed executing all SQL queries in SQLite, followed by automated error flagging and manual correction. The original sources varied in error rates:
| Source | Total Queries | Failures |
|---|---|---|
| BIRD | 1,139 | 327 |
| BULL | 4,966 | 60 |
| BookSQL | 78,433 | 9,526 |
| Spider | — | 0 |
Failures stemmed from syntactic issues, bad column/table references, and other errors, all of which were subsequently resolved.
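A minimal sketch of this execution-based validation step is shown below, assuming each annotated pair records the path to its SQLite database and its gold SQL string; the field names are illustrative.

```python
import sqlite3

def flag_execution_failures(pairs):
    """Run each gold SQL query against its SQLite database and collect the
    ones that raise errors, with the error message, for manual correction."""
    failures = []
    for pair in pairs:  # e.g., {"db_path": ..., "question": ..., "sql": ...}
        conn = sqlite3.connect(pair["db_path"])
        try:
            conn.execute(pair["sql"]).fetchall()
        except sqlite3.Error as exc:  # syntax errors, unknown columns/tables, ...
            failures.append({**pair, "error": str(exc)})
        finally:
            conn.close()
    return failures
```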
An additional 7,035 previously unlabeled NL questions were annotated by two SQL experts following BookSQL guidelines; disagreement resolution was achieved through consensus. No inter-annotator agreement statistics such as Cohen’s κ were reported.
3. Benchmarking Protocol and Model Evaluation
Evaluated model categories comprised:
- Large LLMs: Qwen3-235B-A22B (≈235B parameters), GPT-OSS-120B (≈120B)
- Mid/Small Open-Source Models: GPT-OSS-20B (≈20B), Qwen3-8B (≈8B)
- Reasoning-Centric Models: Phi-4-mini-reasoning (≈1B), Arctic-Text2SQL-R1-7B (≈7B)
All experiments employed one-shot prompting (no finetuning). Inference hyperparameters were standardized: max_new_tokens=512, temperature=0.0, stop_token="\n\n".
Prompts concatenate the following components (a construction sketch follows the list):
- An expert SQL assistant system message
- The NL question
- Serialized schema (table-column lists)
- Fixed constraints restricting outputs to provided schema elements
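The sketch below illustrates this setup under the assumption of a Hugging Face `transformers` causal-LM interface; the system message, constraint wording, and one-shot exemplar are paraphrased assumptions rather than the exact FINCH prompt, and greedy decoding stands in for temperature 0.0.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

SYSTEM = "You are an expert SQL assistant. Answer with a single SQL query only."
CONSTRAINTS = "Use only the tables and columns listed in the schema above."

def build_prompt(question, schema_text, exemplar):
    # Concatenate system message, one-shot exemplar, serialized schema,
    # the NL question, and the fixed output constraints.
    return (f"{SYSTEM}\n\nExample:\n{exemplar}\n\n"
            f"Schema:\n{schema_text}\n\n"
            f"Question: {question}\n{CONSTRAINTS}\nSQL:")

def generate_sql(model, tokenizer, prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    # Standardized decoding: greedy (temperature 0.0), at most 512 new tokens.
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)
    completion = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                                  skip_special_tokens=True)
    return completion.split("\n\n")[0].strip()  # emulate the "\n\n" stop token
```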
Schema serialization format (linear versus tree-based) produced negligible performance differences (<1 point in FINCH Score), indicating that model scale matters far more than serialization choice.
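For concreteness, the toy example below contrasts the two serialization styles on a hypothetical two-table banking schema; the exact templates used by FINCH are not reproduced here.

```python
schema = {"accounts": ["account_id", "customer_id", "balance"],
          "loans": ["loan_id", "account_id", "amount", "status"]}

def linear_serialization(schema):
    # One flat string: table(col, col, ...) | table(col, ...)
    return " | ".join(f"{t}({', '.join(cols)})" for t, cols in schema.items())

def tree_serialization(schema):
    # One table per line with indented columns, preserving the hierarchy.
    lines = []
    for t, cols in schema.items():
        lines.append(t)
        lines.extend(f"  - {c}" for c in cols)
    return "\n".join(lines)
```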
4. The FINCH Score: Metric Design and Rationale
FINCH introduces a composite metric sensitive to the structure and semantics of SQL queries within financial contexts. Given a gold query $Q_g$ and a model prediction $Q_p$, the structural component is

$$S(Q_g, Q_p) = \sum_{c} w_c \cdot \mathrm{F1}_c(Q_g, Q_p),$$

where $c$ indexes SQL clauses (SELECT, WHERE, JOIN, GROUP BY, etc.), $\mathrm{F1}_c$ denotes set-based F1 overlap on clause $c$, and the weights $w_c$ are empirically chosen to sum to 1.

Execution accuracy is defined with a numeric tolerance $\epsilon$:

$$\mathrm{EX}_{\epsilon}(Q_g, Q_p) = \mathbb{1}\big[\, r(Q_p) \approx_{\epsilon} r(Q_g) \,\big],$$

where $r(\cdot)$ is the executed result set and $\approx_{\epsilon}$ denotes equality with floating-point values compared up to tolerance $\epsilon$.

The final composite FINCH Score combines the two components,

$$\mathrm{FINCH}(Q_g, Q_p) = \alpha \cdot S(Q_g, Q_p) + (1 - \alpha) \cdot \mathrm{EX}_{\epsilon}(Q_g, Q_p),$$

with fixed values of $\alpha$ and $\epsilon$ in the reported configuration.
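The following is a minimal sketch of how such a clause-weighted, execution-tolerant score could be computed. The clause splitter, the placeholder weights, and the default values of `alpha` and `eps` are illustrative assumptions, not the implementation or parameter values released with FINCH.

```python
import re
import sqlite3

CLAUSES = ["SELECT", "FROM", "WHERE", "GROUP BY", "HAVING", "ORDER BY"]

def clause_tokens(sql, clause):
    """Very rough extraction of the token set belonging to one clause."""
    boundary = "|".join(CLAUSES)
    m = re.search(rf"{clause}\s+(.*?)(?=\b(?:{boundary})\b|$)",
                  sql, flags=re.IGNORECASE | re.DOTALL)
    return set(re.split(r"[\s,]+", m.group(1).strip())) if m else set()

def clause_f1(gold, pred, clause):
    g, p = clause_tokens(gold, clause), clause_tokens(pred, clause)
    if not g and not p:
        return 1.0            # clause absent from both queries
    if not g or not p:
        return 0.0            # clause omitted or hallucinated
    tp = len(g & p)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(p), tp / len(g)
    return 2 * precision * recall / (precision + recall)

def structural_score(gold, pred, weights):
    # Clause-weighted partial credit; `weights` should sum to 1.
    return sum(w * clause_f1(gold, pred, c) for c, w in weights.items())

def execution_match(db_path, gold, pred, eps=1e-6):
    """1.0 if both queries run and return matching rows, comparing floats
    up to tolerance `eps`; 0.0 otherwise."""
    def run(query):
        conn = sqlite3.connect(db_path)
        try:
            return conn.execute(query).fetchall()
        finally:
            conn.close()
    try:
        rows_g, rows_p = run(gold), run(pred)
    except sqlite3.Error:
        return 0.0
    if len(rows_g) != len(rows_p):
        return 0.0
    for row_g, row_p in zip(rows_g, rows_p):
        if len(row_g) != len(row_p):
            return 0.0
        for a, b in zip(row_g, row_p):
            if isinstance(a, float) or isinstance(b, float):
                if a is None or b is None or abs(a - b) > eps:
                    return 0.0
            elif a != b:
                return 0.0
    return 1.0

def finch_style_score(db_path, gold, pred, weights, alpha=0.5, eps=1e-6):
    return (alpha * structural_score(gold, pred, weights)
            + (1 - alpha) * execution_match(db_path, gold, pred, eps))
```

A call such as `finch_style_score(db, gold_sql, pred_sql, {"SELECT": 0.3, "WHERE": 0.3, "FROM": 0.2, "GROUP BY": 0.1, "HAVING": 0.05, "ORDER BY": 0.05})` illustrates the intended usage; the weight vector shown is a placeholder, since the paper's reported weights are not reproduced in this summary.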
Differences from prior metrics include clause-level partial credit, tolerance to immaterial floating-point discrepancies, and clause weighting to penalize omission of “business-critical” elements—addressing key deficiencies in Exact Match and Execution Accuracy.
5. Model Performance and Analytical Results
FINCH benchmarking reveals substantial variation in model competence (all metrics reported in %):
| Model | EM | EX | CM | EM+EX | FINCH |
|---|---|---|---|---|---|
| Qwen3-8B | 0.50 | 0.80 | 3.50 | 0.10 | 1.20 |
| Arctic-Text2SQL-R1-7B | 0.60 | 2.30 | 3.70 | 0.20 | 1.50 |
| Phi-4-mini-reasoning | 0.00 | 0.20 | 1.00 | 0.00 | 0.40 |
| GPT-OSS-20B | 0.30 | 7.50 | 5.20 | 0.30 | 3.00 |
| GPT-OSS-120B | 1.80 | 27.80 | 16.60 | 1.70 | 11.60 |
| Qwen3-235B-A22B | 0.70 | 2.50 | 2.80 | 0.20 | 1.20 |
GPT-OSS-120B (120B parameters) achieves the highest FINCH Score by a wide margin, while Arctic-Text2SQL-R1-7B (7B parameters, domain-finetuned) outperforms far larger generalist models such as Qwen3-235B-A22B, indicating the efficacy of domain adaptation.
Clause-wise accuracy is low across all models, with FROM clauses performing best (7.37%) and WHERE, SELECT, HAVING, and ORDER BY under 3% on average. Major error types include mis-naming columns/tables, incorrect JOIN key references, and confusion among aggregates (e.g., SUM vs. AVG).
Difficulty stratification using GPT-OSS-120B yields FINCH Scores of 26.5% (easy), 10.6% (medium), and 4.5% (hard), illustrating a pronounced drop as compositional and nested query complexity increases.
Ablation reveals that Arctic achieves a ~15% reduction in schema-linking errors in WHERE/JOIN clauses through financial domain adaptation. Switching from one-shot to zero-shot prompts reduces FINCH Score by ~4 points.
6. Impact, Limitations, and Prospective Extensions
FINCH establishes the first large-scale, finance-specialized, cross-database Text-to-SQL benchmark, supporting fine-tuning and robust evaluation in high-stakes, terminology-rich settings.
The FINCH Score offers more fine-grained assessment than legacy metrics, prioritizing structure-aware and tolerance-based evaluation—a crucial feature given the materiality of errors in financial analytics.
Persistent limitations include model difficulties with schema grounding (SELECT/FROM/WHERE clauses) and multi-table JOIN logic. The performance drop-off on hard queries highlights ongoing challenges with compositional and nested SQL reasoning.
Planned extensions for FINCH include:
- Multimodal alignment of PDF financial documents and SQL schemas
- Interactive, conversational Text-to-SQL for iterative analyst workflows
- Real-time execution feedback and human-in-the-loop correction frameworks for improved reliability
These directions aim to close persistent gaps in financial reasoning, adaptability, and deployment-readiness for Text-to-SQL models in applied finance contexts (Singh et al., 2 Oct 2025).