
FINCH Score Evaluation

Updated 4 October 2025
  • FINCH Score is a finance-focused metric that decomposes SQL queries into weighted clauses to assess structural fidelity and monetary relevance.
  • It introduces tolerance thresholds for execution accuracy, ensuring that minor numeric imprecisions do not penalize materially correct outputs.
  • The score integrates structural and execution components using tunable parameters, aligning query evaluation with real-world financial risk and compliance needs.

The FINCH Score is a finance-oriented evaluation metric devised for the assessment of Text-to-SQL systems, specifically tailored to address the nuanced requirements of financial data and queries. Unlike general-purpose metrics such as Exact Matching or Execution Accuracy, the FINCH Score introduces clause-sensitive weighting, tolerance-aware execution comparison, and a composite scoring scheme that together reflect both the structural fidelity and the material correctness demanded in professional financial settings.

1. Motivation and Problem Statement

Traditional Text-to-SQL evaluation metrics—Exact Matching (EM), Execution Accuracy (EX), and structural Component Matching (CM)—prove inadequate for finance because they are either excessively rigid (penalizing trivial numeric or syntactic deviations) or assign equal weight to all SQL components regardless of their criticality. Financial tasks often involve complex queries where, for example, an error in a WHERE or JOIN clause can greatly impact risk computation or compliance, whereas superficial orderings or minor numeric differences may not be material.

The FINCH Score addresses these shortcomings by integrating clause-level importance, tolerance to practical numerical imprecision, and a partial credit mechanism that aligns evaluation with the operational realities of financial analysis (Singh et al., 2 Oct 2025).

2. Component-wise Structural Scoring

The metric begins by decomposing both the gold reference query ($q^*$) and the system-predicted query ($\hat{q}$) into a set of key SQL components $K$, with typical elements including SELECT, WHERE, GROUP BY, HAVING, ORDER BY, JOIN, AGGREGATE, LIMIT, and SUBQUERY. For each clause $k \in K$, a similarity function $s_k(\hat{q}, q^*) \in [0, 1]$ quantifies the match between gold and predicted queries, using approaches such as exact string match, set-F1, or token-F1.

Each clause $k$ is assigned an application-informed weight $w_k$, satisfying $w_k \geq 0$ and $\sum_k w_k = 1$, representing its relative importance in financial contexts. The weighted sum

$$S(\hat{q}, q^*) = \sum_k w_k \cdot s_k(\hat{q}, q^*)$$

captures the overall structural fidelity of the generated query, ensuring that clauses with high financial significance (e.g., those affecting compliance or aggregation) have greater impact on the score.
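The weighted sum can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the clause weights below are hypothetical values chosen only so that they sum to 1, set-F1 is used as the similarity function for every clause, and each query is assumed to be already parsed into a mapping from clause name to a set of clause items.

```python
from typing import Dict, Set

# Hypothetical clause weights for illustration only (non-negative, summing to 1).
# The source does not state specific values.
WEIGHTS: Dict[str, float] = {
    "SELECT": 0.20,
    "WHERE": 0.30,
    "JOIN": 0.25,
    "GROUP BY": 0.15,
    "ORDER BY": 0.10,
}


def set_f1(pred: Set[str], gold: Set[str]) -> float:
    """Set-F1 between predicted and gold clause items (1.0 if both are empty)."""
    if not pred and not gold:
        return 1.0
    overlap = len(pred & gold)
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def structural_score(
    pred: Dict[str, Set[str]],
    gold: Dict[str, Set[str]],
    weights: Dict[str, float] = WEIGHTS,
) -> float:
    """Weighted sum of per-clause similarities, S = sum_k w_k * s_k."""
    return sum(
        w * set_f1(pred.get(k, set()), gold.get(k, set()))
        for k, w in weights.items()
    )


# Assumed pre-parsed queries: the prediction drops one WHERE predicate.
gold = {
    "SELECT": {"SUM(amount)"},
    "WHERE": {"year = 2024", "region = 'EU'"},
    "JOIN": {"accounts"},
    "GROUP BY": {"account_id"},
}
pred = {
    "SELECT": {"SUM(amount)"},
    "WHERE": {"year = 2024"},
    "JOIN": {"accounts"},
    "GROUP BY": {"account_id"},
}
score = structural_score(pred, gold)  # approximately 0.9
```

Here the missing WHERE predicate gives a clause similarity of 2/3, which costs 0.1 of the total because WHERE carries a weight of 0.3 in this hypothetical configuration; the same omission in ORDER BY would cost far less.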

3. Execution Accuracy with Tolerance

Financial query execution places importance on material outcomes rather than insignificant numerical variations. The FINCH Score models this by introducing a tolerance-driven execution metric. Let $r_{\hat{q}}$ and $r_{q^*}$ denote the executed results of the predicted and reference queries, respectively. The execution similarity is defined by

$$e(\hat{q}, q^*) = \begin{cases} 1 & \text{if } \dfrac{|r_{\hat{q}} - r_{q^*}|}{\max\{1, |r_{q^*}|\}} \leq \tau \\ 0 & \text{otherwise} \end{cases}$$

where $\tau$ is a materiality threshold (for instance, $10^{-4}$, i.e. $0.01\%$), so that small, operationally irrelevant discrepancies in numeric output are not harshly penalized.
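As a minimal sketch of this check, assuming each query returns a single numeric value (how multi-row or non-numeric results are compared is not covered here), the function name `execution_match` is hypothetical:

```python
def execution_match(pred_result: float, gold_result: float, tau: float = 1e-4) -> int:
    """Return 1 if the predicted result is within the materiality threshold, else 0.

    The error is scaled by max(1, |gold|), so it is relative for large
    values and effectively absolute when |gold| is below 1.
    """
    error = abs(pred_result - gold_result) / max(1.0, abs(gold_result))
    return 1 if error <= tau else 0


# A 50-unit discrepancy on a 1,000,000 total is a 0.005% error: accepted.
within = execution_match(1_000_050.0, 1_000_000.0)
# A 500-unit discrepancy is a 0.05% error: rejected at tau = 1e-4.
outside = execution_match(1_000_500.0, 1_000_000.0)
```

The `max(1, ...)` term in the denominator also avoids division by zero when the reference result is 0.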

4. Combined FINCH Score Formulation

The structural and execution components are integrated multiplicatively in the final FINCH Score through the formula

$$\text{Score}(\hat{q}, q^*) = \left[ S(\hat{q}, q^*) \right]^{\beta} \cdot \left[ \delta + (1 - \delta) \cdot e(\hat{q}, q^*) \right]$$

where:

  • $\beta \geq 1$ controls the harshness against incomplete structural matches,
  • $\delta \in [0, 1)$ specifies the minimum credit for structural correctness even when result execution fails (e.g., $\delta = 0.3$).

This formulation allows for partial credit when a query is structurally sound but slightly misses in execution—a reflection of how financial analysts might accept minor, correctable issues if the critical business logic is present.
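The combined formula is simple enough to sketch directly. In the example below, the function name `finch_score` is hypothetical, the value 0.3 for the minimum-credit parameter is the example given above, and the exponent value 2 is an assumption chosen only to make the effect of the structural penalty visible.

```python
def finch_score(structural: float, execution: int, beta: float = 2.0, delta: float = 0.3) -> float:
    """Combine structural fidelity S and execution match e into one score.

    Score = S**beta * (delta + (1 - delta) * e)
    beta = 2.0 is an illustrative assumption; delta = 0.3 follows the example in the text.
    """
    if not 0.0 <= structural <= 1.0:
        raise ValueError("structural score must lie in [0, 1]")
    if execution not in (0, 1):
        raise ValueError("execution match must be 0 or 1")
    if beta < 1.0 or not 0.0 <= delta < 1.0:
        raise ValueError("require beta >= 1 and 0 <= delta < 1")
    return structural ** beta * (delta + (1.0 - delta) * execution)


full_match = finch_score(1.0, 1)        # 1.0: perfect structure, result within tolerance
near_structure = finch_score(0.9, 1)    # about 0.81: structural gap penalized by beta
failed_execution = finch_score(0.9, 0)  # about 0.243: only the delta floor of credit remains
```

Because the two factors are multiplied, a query with no structural overlap scores 0 regardless of its execution result, while a structurally perfect query whose result falls outside the tolerance still keeps the floor of credit set by the minimum-credit parameter.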

5. Distinction from Existing Metrics

The FINCH Score diverges from traditional metrics in the following respects:

  • Clause weighting: Not all SQL components contribute equally; finance-informed weights penalize mistakes in high-impact clauses (such as WHERE, AGGREGATE, and JOIN) more heavily.
  • Tolerance-adjusted execution: Treats small floating-point or rounding errors as immaterial, in line with practices such as financial materiality thresholds.
  • Multiplicative and tunable aggregation: Combines structure and execution via parameters $\beta$ and $\delta$, with partial credit for queries that are structurally sound but fail the execution check.

This approach produces a more representative assessment of system performance for financial query generation than EM (all-or-nothing), EX (too coarse), or unweighted CM metrics.

6. Suitability and Methodological Design for Financial Text-to-SQL

The FINCH Score is engineered specifically for the finance domain, where idiomatic schemas, business logic, and regulatory requirements dominate. The metric gives high weight to matching the critical predicates, groupings, and joins that are essential in financial audit or risk scenarios, while tolerating minor deviations that have no material impact (e.g., immaterial numeric variations or differences in ORDER BY).

In practice, weights $w_k$ and threshold $\tau$ are set using empirical analysis of historical financial misqueries and consultation with domain experts. This aligns penalty severity with monetary or regulatory risk, allowing institutions to tune the metric for their own operational contexts. The methodology included benchmarking on a curated financial dataset of 75,725 instances from 292 tables (Singh et al., 2 Oct 2025).
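One simple way an institution might turn such inputs into valid weights is to normalize raw importance ratings so that they are non-negative and sum to 1. The sketch below assumes this normalization step; the source does not describe the exact procedure, and the ratings shown are hypothetical placeholders, not values from the paper.

```python
from typing import Dict


def normalize_weights(raw: Dict[str, float]) -> Dict[str, float]:
    """Scale non-negative importance ratings so that the weights sum to 1."""
    if any(v < 0 for v in raw.values()):
        raise ValueError("ratings must be non-negative")
    total = sum(raw.values())
    if total <= 0:
        raise ValueError("at least one rating must be positive")
    return {clause: v / total for clause, v in raw.items()}


# Hypothetical expert ratings on a 1-5 scale (illustrative only).
raw_ratings = {
    "WHERE": 5,
    "JOIN": 4,
    "AGGREGATE": 4,
    "SELECT": 3,
    "GROUP BY": 2,
    "ORDER BY": 1,
    "LIMIT": 1,
}
weights = normalize_weights(raw_ratings)  # WHERE gets 5/20 = 0.25, ORDER BY gets 1/20 = 0.05
```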

Experimental evidence in the paper demonstrates that FINCH Score grants partial credit to queries that preserve core semantics even if superficial errors exist, whereas general metrics often assign zero in such cases.

7. Practical Implications and Contextual Significance

In financial data environments, the FINCH Score offers an evaluation well-aligned with operational quality assurance, regulatory compliance, and business-critical query correctness:

  • It supports the prioritization of error correction efforts toward clauses and discrepancies of highest financial consequence.
  • It recognizes near-correct outputs as valuable, thereby providing a more realistic reflection of deployable model utility.
  • Its mechanism for partial credit is suitable for iterative human-in-the-loop workflows common in financial analytics.

This metric constitutes a significant step in financial NLP and Text-to-SQL system evaluation, providing more actionable and context-aware feedback for both academic research and practical deployment in finance (Singh et al., 2 Oct 2025).

References (1)