Finch Finance & Accounting Benchmark
- Finch is a benchmark for finance and accounting AI, integrating authentic, multi-modal enterprise workflows to assess end-to-end automation.
- It employs a three-stage hybrid pipeline that couples LLM assistance with expert curation, spanning email-grounded discovery, version-history differencing, and artifact decomposition.
- The evaluation reveals significant gaps in LLMs on multi-step financial workflows, highlighting challenges in formula interpretation and context retention.
Finch is a finance and accounting benchmark designed for rigorous evaluation of AI agents on authentic, spreadsheet-centric enterprise workflows. Distinct from earlier table or synthetic QA datasets, Finch is purpose-built to reflect the complexities of professional financial operations: messiness, multimodality, long horizons, and collaboration. It draws heavily on in-the-wild enterprise artifacts such as email threads and versioned spreadsheets. Finch establishes a new standard for end-to-end assessment of AI agents in office automation and finance, surfacing the significant gap that persists between state-of-the-art LLM agents and the demands of real-world enterprise workflows (Dong et al., 15 Dec 2025).
1. Benchmark Composition and Scale
Finch curates a large suite of workflows, tasks, and artifacts from real-world enterprise environments:
- Source Material:
- Enron Email Corpus: ~500,000 emails from 150 employees and 15,000 original XLS files
- EUSES Financial Spreadsheets: ~450 spreadsheets
- Institutional reports from the World Bank, Canadian and British governments
- Proprietary financial artifacts from investment banks
- Benchmark Contents:
- 172 composite workflows and 384 fine-grained annotated tasks
- 1,710 distinct spreadsheet files (average 8 sheets/workbook, 157,000 cells/workflow, totaling ≈27 million cells)
- 13 PDFs, 7 images, 3 Word documents, and JSON, CSV, Markdown code/metadata artifacts
- Domain Distribution (number of workflows per vertical):
- Reporting: 48
- Trading & Risk Management: 35
- Predictive Modeling: 33
- Operational Management: 36
- Planning & Budgeting: 26
- Pricing & Valuation: 15
- Accounts Payable/Receivable: 10
- Procurement & Sales: 7
- Asset Management: 3
- Over 78% of workflows are multi-step (2–5 tasks), imposing stringent requirements on long-horizon context retention and error control.
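The composition suggests a simple nested organization: composite workflows bundle fine-grained tasks, each with its own instruction, multi-modal inputs, and reference output. A minimal, hypothetical Python schema is sketched below; the field names and structure are illustrative assumptions, not Finch's actual release format.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    """One fine-grained annotated task within a workflow (illustrative schema)."""
    task_id: str
    task_types: List[str]        # e.g. ["Calculation", "Cross-sheet Retrieval"]
    instruction: str             # expert-normalized natural-language instruction
    input_files: List[str]       # spreadsheets, PDFs, images, etc. supplied as context
    reference_output: str        # path to the ground-truth artifact

@dataclass
class Workflow:
    """A composite, possibly multi-step workflow (illustrative schema)."""
    workflow_id: str
    domain: str                  # e.g. "Reporting", "Trading & Risk Management"
    tasks: List[Task] = field(default_factory=list)

    @property
    def is_multi_step(self) -> bool:
        # Over 78% of Finch workflows contain 2-5 tasks.
        return len(self.tasks) >= 2
```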
2. Workflow Construction Methodology
Finch deploys a three-stage hybrid pipeline tightly coupling LLM assistance and domain-expert curation:
- Email-Grounded Discovery:
- An LLM (GPT-5) identifies business-goal-centric email threads that reference spreadsheets and drafts high-level workflow descriptions; expert annotators then normalize these into precise task instructions. This process ensures strong input/output grounding where possible, or reconstructs partial ground truth via version history.
- Version-History Differencing:
- Paired versioned spreadsheets are compared: LLMs generate “diff workflows” summarizing semantic changes (e.g., formula updates, new scenarios, chart additions), and experts confirm user-intent mapping and finalize instructions, with explicit pre-/post-state XLS files (see the differencing sketch after this list).
- Artifact Decomposition:
- Finalized spreadsheet models, valuation templates, QA cases, and multi-language reports are expertly decomposed such that the original artifact is the target and intermediate snapshots become context-rich inputs with tailored instructions. Annotations undergo inter-annotator review and LLM-based (ChatGPT-5.1 Pro, Claude 4.5) sanity checks for alignment, with over 700 expert hours committed.
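Concretely, the version-history differencing stage presupposes a cell-level change set between paired workbook versions before any LLM summarization. The sketch below shows one way to collect such changes with openpyxl; it is an assumed illustration (Finch's actual tooling, file paths, and handling of deleted cells are not specified).

```python
# Illustrative sketch of cell-level differencing that could feed a "diff
# workflow" draft; paths and the downstream summarization step are assumptions.
from openpyxl import load_workbook

def diff_workbooks(pre_path: str, post_path: str) -> list[dict]:
    """Collect cells whose value or formula changed between two versions."""
    pre = load_workbook(pre_path)
    post = load_workbook(post_path)
    changes = []
    for name in post.sheetnames:
        post_ws = post[name]
        pre_ws = pre[name] if name in pre.sheetnames else None
        for row in post_ws.iter_rows():
            for cell in row:
                old = pre_ws[cell.coordinate].value if pre_ws else None
                if old != cell.value:
                    changes.append({
                        "sheet": name,
                        "cell": cell.coordinate,
                        "before": old,
                        "after": cell.value,   # formulas appear as "=..." strings
                    })
    return changes

# The resulting change list can then be summarized by an LLM into a candidate
# "diff workflow" description for expert review.
```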
3. Task Taxonomy and Multi-Modality
Finch tasks are explicitly tagged for one or more of ten distinct task types, reflecting actual enterprise workflows:
| Task Type | Description/Example |
|---|---|
| Data Entry / Import | Extract from PDF/web to Excel; e.g., World Bank PDF table parsing |
| Structuring / Formatting | Hierarchy re-org, styles, pivotization |
| Web Search | Fetching live market metrics to spreadsheets |
| Cross-sheet Retrieval | Aggregation across multiple sheets/files |
| Calculation | Formula synthesis, e.g., NPV calculation under scenario splits |
| Financial Modeling | Valuation/sensitivity/scenario model building |
| Validation / Review | Consistency, error checking across files/sheets |
| Translation | Table+format-preserving cross-language transformation |
| Summary / Visualization | Charting, tabular/narrative summaries |
| Reporting | Final deliverable assembly from sheet outputs |
A majority of workflows demand simultaneous handling of textual, tabular, visual (charts, images), and code artifacts, often requiring cross-file or cross-modality reasoning. Multi-step task composition is dominant, especially in enterprise contexts where outputs from one step become structured inputs for downstream validation, modeling, and reporting.
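As a concrete instance of the Calculation and Financial Modeling categories, the sketch below synthesizes scenario-split NPV formulas in a workbook via openpyxl. The sheet layout, cash flows, and 8% discount rate are illustrative assumptions, not an actual Finch task.

```python
# Illustrative only: a Calculation-style task of synthesizing NPV formulas
# under scenario splits. Sheet layout, discount rate, and ranges are assumed.
from openpyxl import Workbook

wb = Workbook()
ws = wb.active
ws.title = "Scenarios"

ws.append(["Scenario", "Y1", "Y2", "Y3", "Y4", "Y5", "NPV @ 8%"])
cash_flows = {
    "Base":     [120, 135, 150, 160, 170],
    "Downside": [90, 100, 105, 110, 115],
    "Upside":   [150, 170, 190, 205, 220],
}
for i, (scenario, flows) in enumerate(cash_flows.items(), start=2):
    ws.cell(row=i, column=1, value=scenario)
    for j, cf in enumerate(flows, start=2):
        ws.cell(row=i, column=j, value=cf)
    # Formula strings are stored as-is; Excel evaluates them on open.
    ws.cell(row=i, column=7, value=f"=NPV(0.08, B{i}:F{i})")

wb.save("scenario_npv_example.xlsx")
```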
4. Evaluation Protocol and Metrics
Finch employs a dual-mode evaluation strategy:
- Human Evaluation:
- Expert annotators review the instruction, input, reference, and agent output side by side. A Pass is awarded only if the agent's output (a) exactly satisfies the instruction, (b) is free from critical errors or non-deterministic edits, and (c) contains no extraneous changes.
- Label granularity is binary (Pass/Fail), scored at both workflow and task level.
- Automated LLM-Judge Evaluation:
- Each task is classified as “Modify,” “Generate,” or “QA.” Diff-based snapshotting, with retention of layout/screenshot fidelity, enables LLM (e.g., GPT-5.1 Pro) to judge completeness, numerical/logical accuracy, formatting, and over-edits.
- The key metric is PassRate, the fraction of workflows (or tasks) judged Pass.
Table: Finch Workflow Pass Rates (human evaluation)
| Model | Pass / Total | Pass Rate (%) |
|---|---|---|
| GPT-5.1 Pro (web agent) | 66 / 172 | 38.4 |
| Claude Sonnet 4.5 | 43 / 172 | 25.0 |
High agreement between human and automated judging (for GPT-5.1: 82.1% agreement with the auto-judge, 83.3% recall) supports the scalability and reliability of the LLM judge as a proxy. API-based automation gives agents openpyxl/pandas/matplotlib tooling with a single shot per instruction, prohibiting iterative refinement.
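PassRate aggregation itself is simple counting over binary labels. The sketch below assumes a per-task Pass/Fail label format and treats a workflow as passing only if every constituent task passes, which is one plausible aggregation rather than Finch's documented rule.

```python
# Sketch of PassRate aggregation from binary judge labels. The label format is
# assumed, and "workflow passes only if all its tasks pass" is one plausible
# aggregation, not necessarily Finch's documented rule.
from collections import defaultdict
from typing import Dict, Tuple

def pass_rates(task_labels: Dict[Tuple[str, str], bool]) -> Tuple[float, float]:
    """task_labels maps (workflow_id, task_id) -> True for Pass, False for Fail.

    Returns (task-level pass rate, workflow-level pass rate).
    """
    by_workflow = defaultdict(list)
    for (workflow_id, _task_id), passed in task_labels.items():
        by_workflow[workflow_id].append(passed)

    task_rate = sum(task_labels.values()) / len(task_labels)
    workflow_rate = sum(all(v) for v in by_workflow.values()) / len(by_workflow)
    return task_rate, workflow_rate
```

For reference, 66 passing workflows out of 172 reproduces the reported 38.4%.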
5. Detailed Results, Challenges, and Error Modes
Performance on Finch reveals major challenges for current frontier models:
- Workflow Complexity:
- GPT-5.1 Pro succeeds on 48.6% of single-task workflows, but only 23.5% on multi-task (2+) workflows. Claude 4.5 falls from 30.3% to 11.8%.
- Task Breakdown:
- Data Entry/Import and Structuring/Formatting register the lowest completion rates, reflecting difficulties with irregular table layouts, PDF parsing, and merging disparate schema.
- Translation workflows, which demand simultaneous preservation of semantic content and cell-level structure, exhibit <20% success across all agents.
- Failure Analysis:
- Error accumulation in multi-step workflows is prevalent: retrieval/aggregation mistakes, formula misinterpretation or overwrites, loss of business logic, and incomplete propagation of changes or context.
- Models notably ignore or overwrite embedded formulas and misread business logic in complex calculation flows (e.g., deferred payments, scenario splits); a detection sketch for the formula-overwrite case follows this list.
- Multimodal reasoning over PDFs, dense charts, and images imposes extraction and alignment costs not captured in syntactic table QA.
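The formula-overwrite failure mode noted above is amenable to programmatic validation. The hedged sketch below, which is not part of Finch's released tooling, flags cells where the reference workbook contains a formula but an agent's output contains a literal value.

```python
# Hedged sketch (not Finch's released tooling): flag cells where the reference
# workbook holds a formula but the agent's output holds a literal value,
# i.e. the "formula overwrite" failure mode described above.
from openpyxl import load_workbook

def find_overwritten_formulas(reference_path: str, output_path: str) -> list[str]:
    ref = load_workbook(reference_path)        # formulas kept as "=..." strings
    out = load_workbook(output_path)
    flagged = []
    for name in ref.sheetnames:
        if name not in out.sheetnames:
            continue
        ref_ws, out_ws = ref[name], out[name]
        for row in ref_ws.iter_rows():
            for cell in row:
                if isinstance(cell.value, str) and cell.value.startswith("="):
                    out_val = out_ws[cell.coordinate].value
                    if not (isinstance(out_val, str) and out_val.startswith("=")):
                        flagged.append(f"{name}!{cell.coordinate}")
    return flagged
```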
A plausible implication is that the “last mile” problem of robust office automation is not addressable by surface-level chain-of-thought prompting, but instead requires agentic feedback loops, state-tracking across modalities, and advanced schema induction.
6. Significance, Comparison, and Research Directions
Finch’s design diverges fundamentally from previous finance and accounting benchmarks in multiple respects:
- Realistic Messiness and Scale:
- It uniquely combines long-horizon, multi-sheet, multi-artifact, and cross-modal structures grounded in actual enterprise workflows, unlike task-specific QA datasets (e.g., FINCH Text-to-SQL (Singh et al., 2 Oct 2025), FinanceReasoning (Tang et al., 6 Jun 2025), FinDABench (Liu et al., 1 Jan 2024)) or statement-centric pipelines (FinAR-Bench (Wu et al., 22 May 2025)).
- Ground Truth and Instruction Alignment:
- All instructions, inputs, and outputs are meticulously validated by multiple layers (expert + LLM) for fidelity, avoiding hallucinated ground truths.
- Evaluation Framework:
- The combined human- and auto-judge protocol establishes high-recall, scalable benchmarking, permitting task-, workflow-, and model-family-level comparison across cutting-edge agents.
- Limitations and Opportunities:
- Finch exposes persistent class-level errors in formula interpretation, context retention, and multimodal artifact comprehension. Few-shot prompting and long-context models have not bridged the foundational gap to reliable end-to-end automation.
- There is immediate opportunity for research in agentic error correction, programmatic validation, dynamic context management, and enhanced formula understanding for enterprise-scale LLM agents.
In summary, Finch is a rigorously constructed, large-scale benchmark that operationalizes the longstanding vision of spreadsheet-era finance and accounting task evaluation for AI. It offers the field a transparent, high-fidelity reference for iterative progress toward practical, automated financial workflow execution (Dong et al., 15 Dec 2025).