FINCH Score Evaluation

Updated 4 October 2025

FINCH Score is a finance-focused metric that decomposes SQL queries into weighted clauses to assess structural fidelity and monetary relevance.
It introduces tolerance thresholds for execution accuracy, ensuring that minor numeric imprecisions do not penalize materially correct outputs.
The score integrates structural and execution components using tunable parameters, aligning query evaluation with real-world financial risk and compliance needs.

The FINCH Score is a finance-oriented evaluation metric devised for the assessment of Text-to-SQL systems, specifically tailored to address the nuanced requirements of financial data and queries. Unlike general-purpose metrics such as Exact Matching or Execution Accuracy, the FINCH Score introduces clause-sensitive weighting, tolerance-aware execution comparison, and a composite scoring scheme that together reflect both the structural fidelity and the material correctness demanded in professional financial settings.

1. Motivation and Problem Statement

Traditional Text-to-SQL evaluation metrics—Exact Matching (EM), Execution Accuracy (EX), and structural Component Matching (CM)—prove inadequate for finance because they are either excessively rigid (penalizing trivial numeric or syntactic deviations) or assign equal weight to all SQL components regardless of their criticality. Financial tasks often involve complex queries where, for example, an error in a WHERE or JOIN clause can greatly impact risk computation or compliance, whereas superficial orderings or minor numeric differences may not be material.

The FINCH Score addresses these shortcomings by integrating clause-level importance, tolerance to practical numerical imprecision, and a partial credit mechanism that aligns evaluation with the operational realities of financial analysis (Singh et al., 2 Oct 2025).

2. Component-wise Structural Scoring

The metric begins by decomposing both the gold reference query ( $q^*$ ) and the system-predicted query ( $\hat{q}$ ) into a set of key SQL components $K$ , with typical elements including SELECT, WHERE, GROUP BY, HAVING, ORDER BY, JOIN, AGGREGATE, LIMIT, and SUBQUERY. For each clause $k \in K$ , a similarity function $s_k(\hat{q}, q^*) \in [0, 1]$ quantifies the match between gold and predicted queries, using approaches such as exact string match, set-F1, or token-F1.

Each clause $k$ is assigned an application-informed weight $w_k$ , satisfying $w_k \geq 0$ and $\sum_k w_k = 1$ , representing its relative importance in financial contexts. The weighted sum

$S(\hat{q}, q^*) = \sum_k w_k \cdot s_k(\hat{q}, q^*)$

captures the overall structural fidelity of the generated query, ensuring that clauses with high financial significance (e.g., those affecting compliance or aggregation) have greater impact on the score.
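The weighted sum above can be illustrated with a minimal Python sketch. It assumes each clause has already been parsed into a set of normalized items and uses set-F1 as the similarity $s_k$; the weights, the clause contents, and the convention that a clause absent from both queries scores 1 are illustrative assumptions, not values taken from the FINCH paper.

```python
from typing import Dict, Set


def set_f1(pred: Set[str], gold: Set[str]) -> float:
    """Set-F1 between the predicted and gold items of one clause."""
    if not pred and not gold:
        return 1.0  # assumption: a clause missing from both queries counts as a match
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def structural_score(
    pred: Dict[str, Set[str]],
    gold: Dict[str, Set[str]],
    weights: Dict[str, float],
) -> float:
    """S(q_hat, q*) = sum_k w_k * s_k(q_hat, q*), with weights summing to 1."""
    if any(w < 0 for w in weights.values()):
        raise ValueError("weights must be non-negative")
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(
        w * set_f1(pred.get(k, set()), gold.get(k, set()))
        for k, w in weights.items()
    )


# Hypothetical clause weights and pre-parsed clause contents.
weights = {"SELECT": 0.2, "WHERE": 0.4, "JOIN": 0.3, "ORDER BY": 0.1}
gold = {
    "SELECT": {"sum(amount)"},
    "WHERE": {"year = 2024", "region = 'EU'"},
    "JOIN": {"accounts.id = ledger.account_id"},
}
# The prediction drops one WHERE predicate; every other clause matches.
pred = {
    "SELECT": {"sum(amount)"},
    "WHERE": {"year = 2024"},
    "JOIN": {"accounts.id = ledger.account_id"},
}

score = structural_score(pred, gold, weights)
```

In this example the WHERE clause has set-F1 of 2/3, so the structural score is 0.6 + 0.4 · 2/3 ≈ 0.867; the missing predicate costs more than the same error in ORDER BY would, because WHERE carries the largest weight.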

3. Execution Accuracy with Tolerance

Financial query execution places importance on material outcomes rather than insignificant numerical variations. The FINCH Score models this by introducing a tolerance-driven execution metric. Let $r_{\hat{q}}$ and $r_{q^*}$ denote the executed results of the predicted and reference queries, respectively. The execution similarity is defined by

$e(\hat{q}, q^*) = \begin{cases} 1 & \text{if } \frac{|r_{\hat{q}} - r_{q^*}|}{\max\{1, |r_{q^*}|\}} \leq \tau \\ 0 & \text{otherwise} \end{cases}$

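where $\tau$ is a materiality threshold (for instance, $10^{-4}$, i.e. $0.01\%$), so that small, operationally irrelevant discrepancies in numeric output are not penalized. The $\max\{1, |r_{q^*}|\}$ denominator makes the comparison relative for large reference values and absolute for reference values of magnitude below 1, which avoids dividing by a near-zero result.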
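A minimal sketch of the tolerance check for a single scalar result follows. The function name `execution_match` and the default `tau` are illustrative choices; how multi-row or non-numeric result sets are compared is not specified here and is left out.

```python
def execution_match(pred_result: float, gold_result: float, tau: float = 1e-4) -> int:
    """Return 1 if the predicted result is within tolerance tau of the gold result.

    The error is divided by max(1, |gold|): relative for large gold values,
    absolute when the gold value has magnitude below 1.
    """
    if tau < 0:
        raise ValueError("tau must be non-negative")
    error = abs(pred_result - gold_result) / max(1.0, abs(gold_result))
    return 1 if error <= tau else 0


# A 50-unit gap on a 1,000,000 total is 0.005% and passes at tau = 0.01%.
within = execution_match(1_000_050.0, 1_000_000.0)
# A 200-unit gap is 0.02% and fails.
outside = execution_match(1_000_200.0, 1_000_000.0)
```

Here `within` is 1 and `outside` is 0, whereas a strict equality check would reject both predictions.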

4. Combined FINCH Score Formulation

The structural and execution components are integrated multiplicatively in the final FINCH Score through the formula

$\text{Score}(\hat{q}, q^*) = \left[ S(\hat{q}, q^*) \right]^\beta \cdot \left[\delta + (1 - \delta) \cdot e(\hat{q}, q^*) \right]$

where:

$\beta \geq 1$ controls the harshness against incomplete structural matches,
$\delta \in [0, 1)$ specifies the minimum credit for structural correctness even when result execution fails (e.g., $\delta = 0.3$ ).

This formulation allows for partial credit when a query is structurally sound but slightly misses in execution—a reflection of how financial analysts might accept minor, correctable issues if the critical business logic is present.
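The combined formula can be sketched in a few lines of Python. The function takes an already computed structural score $S$ and execution indicator $e$; the default `beta` is an assumption, while `delta = 0.3` follows the example value given above.

```python
def finch_score(structural: float, execution: int, beta: float = 1.0, delta: float = 0.3) -> float:
    """Score = S^beta * (delta + (1 - delta) * e)."""
    if not 0.0 <= structural <= 1.0:
        raise ValueError("structural score must lie in [0, 1]")
    if execution not in (0, 1):
        raise ValueError("execution indicator must be 0 or 1")
    if beta < 1:
        raise ValueError("beta must be at least 1")
    if not 0.0 <= delta < 1.0:
        raise ValueError("delta must lie in [0, 1)")
    return (structural ** beta) * (delta + (1.0 - delta) * execution)


# Same structural score, with and without a matching execution result.
passed = finch_score(0.9, 1, beta=2.0, delta=0.3)  # 0.81 * 1.0
failed = finch_score(0.9, 0, beta=2.0, delta=0.3)  # 0.81 * 0.3
```

With $S = 0.9$ and $\beta = 2$, the structural term is $0.81$. A matching execution keeps the full $0.81$, while a failed execution scales it by $\delta$ to $0.243$, so a structurally close query still earns nonzero credit. Raising $\beta$ shrinks the structural term for any $S < 1$, which is how it penalizes incomplete matches more heavily.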

5. Distinction from Existing Metrics

The FINCH Score diverges from traditional metrics in the following respects:

Clause weighting: Not all SQL components contribute equally; finance-specific weights penalize mistakes in high-impact clauses (such as WHERE, AGGREGATE, and JOIN) more heavily.
Tolerance-adjusted execution: Recognizes small floating-point or rounding errors as immaterial, aligning evaluation with practices such as financial materiality thresholds.
Multiplicative and tunable aggregation: Combines structure and execution via parameters $\beta$ and $\delta$ , with partial credit ensuring nuanced reflection of real-world acceptability.

This approach produces a more representative assessment of system performance for financial query generation than EM (all-or-nothing), EX (too coarse), or unweighted CM metrics.

6. Suitability and Methodological Design for Financial Text-to-SQL

The FINCH Score is specifically engineered for the finance domain, where idiomatic schemas, business logic, and regulatory requirements dominate. The metric assigns high weight to matching the critical predicates, groupings, and joins that are essential in financial audit or risk scenarios, while tolerating minor non-impactful deviations (e.g., nonmaterial numeric variations or ORDER BY differences).

In practice, weights $w_k$ and threshold $\tau$ are set using empirical analysis of historical financial misqueries and consultation with domain experts. This aligns penalty severity with monetary or regulatory risk, allowing institutions to tune the metric for their own operational contexts. The methodology included benchmarking on a curated financial dataset of 75,725 instances from 292 tables (Singh et al., 2 Oct 2025).
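One simple way to turn expert judgments into weights that satisfy $w_k \geq 0$ and $\sum_k w_k = 1$ is to normalize raw importance ratings. The sketch below assumes this approach; the ratings are hypothetical placeholders, not values from the paper, and the source does not state that weights are derived this way.

```python
from typing import Dict


def normalize_weights(ratings: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative importance ratings so the resulting weights sum to 1."""
    if any(r < 0 for r in ratings.values()):
        raise ValueError("ratings must be non-negative")
    total = sum(ratings.values())
    if total <= 0:
        raise ValueError("at least one rating must be positive")
    return {clause: r / total for clause, r in ratings.items()}


# Hypothetical 1-5 importance ratings per clause (illustrative only).
ratings = {
    "WHERE": 5,
    "JOIN": 4,
    "AGGREGATE": 4,
    "GROUP BY": 3,
    "SELECT": 3,
    "ORDER BY": 1,
}
clause_weights = normalize_weights(ratings)
```

With these ratings the total is 20, so WHERE receives weight 0.25 and ORDER BY receives 0.05. An institution could revise the ratings and renormalize without changing any other part of the metric.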

Experimental evidence in the paper demonstrates that FINCH Score grants partial credit to queries that preserve core semantics even if superficial errors exist, whereas general metrics often assign zero in such cases.

7. Practical Implications and Contextual Significance

In financial data environments, the FINCH Score offers an evaluation well-aligned with operational quality assurance, regulatory compliance, and business-critical query correctness:

It supports the prioritization of error correction efforts toward clauses and discrepancies of highest financial consequence.
It recognizes near-correct outputs as valuable, thereby providing a more realistic reflection of deployable model utility.
Its mechanism for partial credit is suitable for iterative human-in-the-loop workflows common in financial analytics.

This metric constitutes a significant step in financial NLP and Text-to-SQL system evaluation, providing more actionable and context-aware feedback for both academic research and practical deployment in finance (Singh et al., 2 Oct 2025).

References

Singh et al. (2 Oct 2025). FINCH: Financial Intelligence using Natural language for Contextualized SQL Handling.
