FINCH Benchmark: Evaluating Financial Text-to-SQL

Updated 19 November 2025

FINCH Benchmark is a domain-specific evaluation framework for Text-to-SQL in finance, integrating complex schema and financial terminology.
It compiles 33 databases with 292 tables and over 75K natural language–SQL pairs across diverse financial domains like banking, loans, and accounting.
The FINCH Score uses clause-weighted metrics to assess model performance, highlighting challenges in schema linking and complex SQL reasoning.

FINCH (Financial Intelligence using Natural language for Contextualized SQL Handling) is a domain-specific benchmark for evaluating Text-to-SQL systems in finance. Addressing the absence of specialized, large-scale datasets in this domain, FINCH introduces a resource with complex schema, domain-specific terminology, and rigorous evaluation methods. It incorporates a new clause-weighted metric—the FINCH Score—designed to effectively capture nuanced failings and partial successes in financial SQL generation.

1. Dataset Composition and Scope

FINCH encompasses 33 databases sourced from Spider (22), BIRD (7), BULL (3), and BookSQL (1), systematically filtered to represent financial verticals such as retail sales, banking transactions, loans, insurance, e-commerce, funds & portfolios, stocks, accounting, marketing, and credit cards. The dataset contains 292 tables with 2,233 columns and 177 foreign-key relations. The total number of validated natural language–SQL pairs is 75,725.

Schema complexity includes a mean of approximately 7.65 columns per table and about 0.61 foreign-key links per table. Data types mirror standard SQL types (INTEGER, FLOAT, VARCHAR, DATE), with a considerable emphasis on numeric types due to the prevalence of accounting and risk metrics.

FINCH’s annotation protocol produces a heterogeneous mix of SQL constructs:

Simple SELECT (no aggregation, no joins): 9,358 (≈12%, “easy”)
Aggregations (GROUP BY, HAVING): 33,780 (≈45%, “medium”)
Nested subqueries: included in 32,587 (≈43%, “hard”)
Joins: prevalent across medium and hard difficulty bins

Dataset partitions follow a strict cross-database protocol—no database appears in more than one split—and ensure representation from each of the ten financial domains per split. Cross-domain evaluation is performed by training on 80% of databases and holding out the remaining 20% for testing.

2. Construction Procedures and Quality Control

Schema selection was guided by domain experts, focusing on canonical financial structure and excluding non-financial domains. Inclusion criteria required explicit mapping of schema tables or columns to banking, loans, sales, or portfolio analytics.

Annotation steps entailed executing all SQL queries in SQLite, followed by automated error flagging and manual correction. The original sources varied in error rates:

Source	Total Queries	Failures
BIRD	1,139	327
BULL	4,966	60
BookSQL	78,433	9,526
Spider	—	0

Failures stemmed from syntactic issues, bad column/table references, and other errors, all of which were subsequently resolved.

An additional 7,035 previously unlabeled NL questions were annotated by two SQL experts following BookSQL guidelines; disagreement resolution was achieved through consensus. No inter-annotator agreement statistics such as Cohen’s κ were reported.

3. Benchmarking Protocol and Model Evaluation

Evaluated model categories comprised:

Large LLMs: Qwen3-235B-A22B (≈235B parameters), GPT-OSS-120B (≈120B)
Mid/Small Open-Source Models: GPT-OSS-20B (≈20B), Qwen3-8B (≈8B)
Reasoning-Centric Models: Phi-4-mini-reasoning (≈1B), Arctic-Text2SQL-R1-7B (≈7B)

All experiments employed one-shot prompting (no finetuning). Inference hyperparameters were standardized: max_new_tokens=512, temperature=0.0, stop_token="\n\n".

Prompts concatenate:

An expert SQL assistant system message
The NL question
Serialized schema (table-column lists)
Fixed constraints restricting outputs to provided schema elements

Schema serialization (linear versus tree-based) produced negligible performance differences (<1 point in FINCH Score), indicating that scaling model size shows greater impact.

4. The FINCH Score: Metric Design and Rationale

FINCH introduces a composite metric sensitive to the structure and semantics of SQL queries within financial contexts. Given a gold query $q^*$ and model prediction $\hat q$ , the structural component is:

$S(\hat q, q^*) = \sum_{k \in K} w_k s_k(\hat q, q^*)$

where $K$ indexes SQL clauses (SELECT, WHERE, JOIN, GROUP BY, etc.), $s_k$ denotes set-based F1 overlap on clause $k$ , and $w_k$ are empirically chosen weights summing to 1 (e.g., $w_\mathrm{WHERE} = 0.20$ , $w_\mathrm{JOIN} = 0.20$ ).

Execution accuracy $e(\hat q, q^*)$ is defined with tolerance $\tau=10^{-4}$ : $e(\hat q, q^*) = \begin{cases} 1, & \frac{|r_{\hat q} - r_{q^*}|}{\max(1,|r_{q^*}|)} \leq \tau \ 0, & \text{otherwise} \end{cases}$

The final composite FINCH Score is: $\mathrm{FINCH}(\hat q, q^*) = S(\hat q, q^*)^\beta \cdot [\delta + (1-\delta) e(\hat q, q^*)]$ In the reported configuration: $\beta=1$ , $\delta=0.3$ .

Differences from prior metrics include clause-level partial credit, tolerance to immaterial floating-point discrepancies, and clause weighting to penalize omission of “business-critical” elements—addressing key deficiencies in Exact Match and Execution Accuracy.

5. Model Performance and Analytical Results

FINCH benchmarking reveals substantial variation in model competence:

Model	EM	EX	CM	EM+EX	FINCH
Qwen3-8B	0.50	0.80	3.50	0.10	1.20
Arctic-Text2SQL-R1-7B	0.60	2.30	3.70	0.20	1.50
Phi-4-mini-reasoning	0.00	0.20	1.00	0.00	0.40
GPT-OSS-20B	0.30	7.50	5.20	0.30	3.00
GPT-OSS-120B	1.80	27.80	16.60	1.70	11.60
Qwen3-235B-A22B	0.70	2.50	2.80	0.20	1.20

GPT-OSS-120B (120B parameters) achieves the highest FINCH Score by a wide margin, but Arctic-Text2SQL-R1-7B (7B parameters, domain-finetuned) outperforms larger generalist models in certain settings, indicating the efficacy of domain adaptation.

Clause-wise accuracy is low across all models, with FROM clauses performing best (7.37%) and WHERE, SELECT, HAVING, and ORDER BY under 3% on average. Major error types include mis-naming columns/tables, incorrect JOIN key references, and confusion among aggregates (e.g., SUM vs. AVG).

Difficulty stratification using GPT-OSS-120B yields FINCH Scores of 26.5% (easy), 10.6% (medium), and 4.5% (hard), illustrating a pronounced drop as compositional and nested query complexity increases.

Ablation reveals that Arctic achieves a ~15% reduction in schema-linking errors in WHERE/JOIN clauses through financial domain adaptation. Switching from one-shot to zero-shot prompts reduces FINCH Score by ~4 points.

6. Impact, Limitations, and Prospective Extensions

FINCH establishes the first large-scale, finance-specialized, cross-database Text-to-SQL benchmark, supporting fine-tuning and robust evaluation in high-stakes, terminology-rich settings.

The FINCH Score offers more fine-grained assessment than legacy metrics, prioritizing structure-aware and tolerance-based evaluation—a crucial feature given the materiality of errors in financial analytics.

Persistent limitations include model difficulties with schema grounding (SELECT/FROM/WHERE clauses) and multi-table JOIN logic. The performance drop-off on hard queries highlights ongoing challenges with compositional and nested SQL reasoning.

Planned extensions for FINCH include:

Multimodal alignment of PDF financial documents and SQL schemas
Interactive, conversational Text-to-SQL for iterative analyst workflows
Real-time execution feedback and human-in-the-loop correction frameworks for improved reliability

These directions aim to close persistent gaps in financial reasoning, adaptability, and deployment-readiness for Text-to-SQL models in applied finance contexts (Singh et al., 2 Oct 2025).

PDF Markdown Chat (Pro)

References (1)

FINCH: Financial Intelligence using Natural language for Contextualized SQL Handling (2025)

Whiteboard

Generate a whiteboard explanation of this topic.

Topic to Video (Beta)

Generate a video overview of this topic.

Follow Topic

Get notified by email when new papers are published related to FINCH Benchmark.

FINCH Benchmark: Evaluating Financial Text-to-SQL

1. Dataset Composition and Scope

2. Construction Procedures and Quality Control

3. Benchmarking Protocol and Model Evaluation

4. The FINCH Score: Metric Design and Rationale

5. Model Performance and Analytical Results

6. Impact, Limitations, and Prospective Extensions

Whiteboard

Topic to Video (Beta)

Follow Topic

Continue Learning

Don't miss out on important new AI/ML research

FINCH Benchmark: Evaluating Financial Text-to-SQL

1. Dataset Composition and Scope

2. Construction Procedures and Quality Control

3. Benchmarking Protocol and Model Evaluation

4. The FINCH Score: Metric Design and Rationale

5. Model Performance and Analytical Results

6. Impact, Limitations, and Prospective Extensions

Sponsor

Whiteboard

Topic to Video (Beta)

Follow Topic

Continue Learning

Related Topics

Don't miss out on important new AI/ML research