
FINCH: Financial Intelligence using Natural language for Contextualized SQL Handling (2510.01887v1)

Published 2 Oct 2025 in q-fin.CP and cs.AI

Abstract: Text-to-SQL, the task of translating natural language questions into SQL queries, has long been a central challenge in NLP. While progress has been significant, applying it to the financial domain remains especially difficult due to complex schema, domain-specific terminology, and high stakes of error. Despite this, there is no dedicated large-scale financial dataset to advance research, creating a critical gap. To address this, we introduce a curated financial dataset (FINCH) comprising 292 tables and 75,725 natural language-SQL pairs, enabling both fine-tuning and rigorous evaluation. Building on this resource, we benchmark reasoning models and LLMs of varying scales, providing a systematic analysis of their strengths and limitations in financial Text-to-SQL tasks. Finally, we propose a finance-oriented evaluation metric (FINCH Score) that captures nuances overlooked by existing measures, offering a more faithful assessment of model performance.

Summary

  • The paper introduces FINCH, a benchmark that integrates diverse financial databases to deliver 75,725 natural language–SQL pairs with rigorous validation.
  • It proposes the FINCH Score, a tailored metric combining clause-level structural scoring and tolerance-based execution accuracy for financial applications.
  • Experimental results highlight that domain-specific finetuning enhances performance on complex queries, emphasizing the need for advanced schema linking and compositional reasoning.

FINCH: A Domain-Specific Benchmark and Metric for Financial Text-to-SQL

Introduction

The FINCH benchmark addresses a critical gap in Text-to-SQL research: the lack of a large-scale, finance-specific dataset and evaluation methodology tailored to the unique demands of financial databases. While general-purpose datasets such as Spider and BIRD have driven advances in schema generalization and compositional reasoning, their schemas and query distributions are not representative of the complexity, terminology, and regulatory constraints inherent in financial systems. FINCH consolidates and refines data from multiple sources, resulting in a benchmark that is both broad in coverage and deep in schema complexity, with a focus on real-world financial operations (Figure 1).

Figure 1: Representation of the FINCH dataset showing the integration of different databases and tables across financial domains.

FINCH Dataset Construction

FINCH is constructed through a rigorous curation process, selecting only those databases and queries from Spider, BIRD, BULL, and BookSQL that are directly relevant to financial contexts. The resulting dataset comprises 33 databases, 292 tables, 2,233 columns, and 177 relations, totaling 75,725 natural language–SQL pairs. The dataset spans domains such as retail, banking, loans, insurance, e-commerce, funds, stocks, and accounting, ensuring coverage of the operational diversity encountered in financial analytics.

A key aspect of the curation process is the systematic validation and correction of SQL queries. Each query is executed against its corresponding SQLite database, and anomalies—ranging from incorrect column names to invalid table references and syntax errors—are identified and rectified. This ensures that FINCH maintains a high standard of data integrity, which is essential for both model training and evaluation in high-stakes financial applications.
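The validation step described above can be sketched as follows. This is a minimal illustration, not the paper's actual curation pipeline: the function names and the toy schema are assumptions, but the core idea is exactly what the text describes, executing each gold query against its SQLite database and flagging any that raise an error.

```python
import sqlite3

def find_anomalies(conn, pairs):
    """Run each gold SQL query against the database; collect pairs whose
    query fails to execute (wrong column names, invalid table references,
    syntax errors)."""
    bad = []
    for question, sql in pairs:
        try:
            conn.execute(sql).fetchall()
        except sqlite3.Error as exc:
            bad.append((question, sql, str(exc)))
    return bad

# Toy in-memory database standing in for one of FINCH's 33 databases.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE loans (loan_id INTEGER, amount REAL, year INTEGER)")

pairs = [
    ("Total loan amount", "SELECT SUM(amount) FROM loans"),    # executes cleanly
    ("Average balance", "SELECT AVG(balance) FROM loans"),     # bad column name
]
anomalies = find_anomalies(conn, pairs)
print(anomalies)  # only the second pair is flagged for correction
```

In the actual curation process, flagged queries would then be manually or semi-automatically corrected rather than discarded, preserving the pair while restoring its executability.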

The dataset is stratified by difficulty, with explicit annotation of easy, medium, and hard queries, and includes a significant proportion of complex SQL constructs such as nested queries, GROUP BY, and ORDER BY clauses. This design supports the evaluation of both basic and advanced reasoning capabilities in Text-to-SQL models.

Evaluation Methodology and FINCH Score

Traditional evaluation metrics for Text-to-SQL—Exact Match (EM), Execution Accuracy (EX), and Component Matching (CM)—are insufficient for the financial domain. They fail to account for the materiality of errors, clause-level semantic importance, and tolerances relevant to financial reporting and compliance. FINCH introduces a finance-specific metric, the FINCH Score, which integrates:

  • Component-wise structural scoring: Weighted similarity across SQL components (SELECT, WHERE, GROUP BY, HAVING, JOIN, etc.), with higher weights assigned to clauses critical for financial correctness.
  • Execution accuracy with tolerance: Binary correctness is relaxed to allow for domain-appropriate tolerances (e.g., 10^-4, i.e., 0.01%) in numerical outputs, reflecting the principle of materiality in financial reporting.
  • Combined penalized envelope: Structural and execution scores are combined multiplicatively, with tunable parameters to reflect the relative importance of structure versus execution in financial workflows.

This metric provides a more faithful assessment of model performance, capturing partial correctness and prioritizing errors that have substantive financial impact.
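The ingredients above can be sketched in a few lines. To be clear, this is an illustrative reconstruction under stated assumptions, not the paper's exact formula: the clause weights, the relative-tolerance rule, and the exponent-based multiplicative combination (`alpha`) are all placeholders chosen to show the shape of the metric.

```python
def structural_score(pred_clauses, gold_clauses, weights):
    """Weighted clause-level agreement. Weights emphasize clauses critical
    for financial correctness; the values used below are illustrative."""
    total = sum(weights.values())
    matched = sum(w for clause, w in weights.items()
                  if pred_clauses.get(clause) == gold_clauses.get(clause))
    return matched / total

def execution_score(pred_rows, gold_rows, tol=1e-4):
    """Execution accuracy relaxed by a relative numeric tolerance,
    reflecting materiality in financial reporting."""
    if len(pred_rows) != len(gold_rows):
        return 0.0
    for p, g in zip(pred_rows, gold_rows):
        if isinstance(g, float):
            if abs(p - g) > tol * max(1.0, abs(g)):
                return 0.0
        elif p != g:
            return 0.0
    return 1.0

def finch_score(pred_clauses, gold_clauses, pred_rows, gold_rows,
                weights, alpha=0.5):
    """Multiplicative combination of structure and execution; alpha is a
    tunable trade-off parameter (assumed form of the 'penalized envelope')."""
    s = structural_score(pred_clauses, gold_clauses, weights)
    e = execution_score(pred_rows, gold_rows)
    return (s ** alpha) * (e ** (1 - alpha))

# Illustrative weights and a prediction with one wrong WHERE clause.
weights = {"SELECT": 3, "WHERE": 3, "GROUP BY": 2, "HAVING": 2, "ORDER BY": 1}
gold = {"SELECT": "SUM(amount)", "WHERE": "year = 2024"}
pred = {"SELECT": "SUM(amount)", "WHERE": "year = 2023"}
score = finch_score(pred, gold, [1000.00005], [1000.0], weights)
print(score)  # partial credit: within numeric tolerance, WHERE mismatch penalized
```

Note how a result off by 5e-5 on a value of 1000 still counts as executionally correct under the 10^-4 relative tolerance, while a binary EM/EX metric would score the same prediction as a total failure.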

Experimental Results and Analysis

A comprehensive benchmarking study was conducted using FINCH, evaluating a spectrum of models:

  • Large-scale LLMs: Qwen3-235B-A22B, GPT-OSS-120B
  • Medium/small-scale LLMs: Qwen3-8B, GPT-OSS-20B
  • Reasoning-centric models: Phi-4-mini-reasoning, Arctic-Text2SQL-R1-7B

All models were evaluated under a uniform one-shot prompting protocol, with strict constraints on schema fidelity and SQL syntax. The results demonstrate several key findings:
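A one-shot protocol of this kind can be sketched as a simple prompt builder. The wording and structure below are assumptions for illustration, not the paper's actual template; what it shows is the shape of the setup: the schema, a single worked example, and instructions constraining the model to the given schema and to SQL-only output.

```python
def build_prompt(schema, example_question, example_sql, question):
    """One-shot Text-to-SQL prompt: schema, one worked example, then the
    target question. Template wording is illustrative only."""
    return (
        "You are given the following database schema:\n"
        f"{schema}\n\n"
        "Answer with a single SQL query only, using exactly the tables "
        "and columns shown.\n\n"
        f"Question: {example_question}\nSQL: {example_sql}\n\n"
        f"Question: {question}\nSQL:"
    )

prompt = build_prompt(
    "loans(loan_id, customer_id, amount, year)",
    "How many loans were issued in 2023?",
    "SELECT COUNT(*) FROM loans WHERE year = 2023",
    "What is the total loan amount in 2024?",
)
print(prompt)
```

Keeping the protocol identical across all six models is what makes the cross-scale comparison in the results meaningful.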

  • GPT-OSS-120B achieves the highest overall performance across all metrics, including the FINCH Score, indicating that scale remains advantageous when models are sufficiently optimized for the task.
  • Arctic-Text2SQL-R1-7B, despite its smaller parameter count, outperforms much larger models when domain-specific finetuning is applied, highlighting the importance of adaptation to financial schemas and terminology.
  • Clause-level analysis reveals that the majority of errors are concentrated in SELECT, FROM, and WHERE clauses, which are critical for schema grounding and semantic correctness. Performance on GROUP BY, HAVING, and ORDER BY is marginally better but still suboptimal.
  • Model accuracy degrades sharply with increasing query difficulty. For example, GPT-OSS-120B's FINCH Score drops from 26.5% on easy queries to 4.5% on hard queries, underscoring persistent challenges in compositional and multi-table reasoning.

The FINCH Score provides a more nuanced view of model performance, recognizing partial successes and penalizing only those errors that materially affect financial outcomes. This stands in contrast to the binary nature of EM and EX, which can over-penalize immaterial deviations.

Implications and Future Directions

The introduction of FINCH and its tailored evaluation metric has several implications for both research and practice:

  • Domain-specific finetuning is essential: Models adapted to financial SQL schemas and terminology can rival or surpass much larger general-purpose LLMs, suggesting that future work should prioritize domain adaptation over indiscriminate scaling.
  • Evaluation metrics must reflect domain priorities: The FINCH Score's clause weighting and tolerance-based execution accuracy align evaluation with the operational realities of financial analytics, providing a more actionable measure of model utility.
  • Persistent challenges in schema linking and compositional reasoning: The concentration of errors in schema-sensitive clauses and the steep drop in accuracy on complex queries indicate that further advances in schema linking, context modeling, and multi-table reasoning are required for reliable deployment in financial settings.

Future research directions include the integration of multi-modal data (text, tables, SQL), robust schema linking methods, and conversational Text-to-SQL systems capable of supporting iterative analyst workflows. The FINCH benchmark and metric provide a foundation for these developments, enabling rigorous evaluation and targeted improvement of financial Text-to-SQL systems.

Conclusion

FINCH establishes a new standard for financial Text-to-SQL research, offering a large-scale, domain-specific dataset and a finance-aware evaluation metric that together address the limitations of prior benchmarks. Experimental results demonstrate that domain adaptation and clause-sensitive evaluation are critical for progress in this area. The persistent challenges identified by FINCH—particularly in schema grounding and compositional reasoning—highlight the need for continued innovation in model architectures and training methodologies. FINCH is poised to serve as a cornerstone resource for advancing reliable, high-utility Text-to-SQL systems in the financial domain.
