
Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows (2512.13168v1)

Published 15 Dec 2025 in cs.AI, cs.CE, cs.IR, and cs.MA

Abstract: We introduce a finance & accounting benchmark (Finch) for evaluating AI agents on real-world, enterprise-grade professional workflows -- interleaving data entry, structuring, formatting, web search, cross-file retrieval, calculation, modeling, validation, translation, visualization, and reporting. Finch is sourced from authentic enterprise workspaces at Enron (15,000 spreadsheets and 500,000 emails from 150 employees) and other financial institutions, preserving in-the-wild messiness across multimodal artifacts (text, tables, formulas, charts, code, and images) and spanning diverse domains such as budgeting, trading, and asset management. We propose a workflow construction process that combines LLM-assisted discovery with expert annotation: (1) LLM-assisted, expert-verified derivation of workflows from real-world email threads and version histories of spreadsheet files, and (2) meticulous expert annotation for workflows, requiring over 700 hours of domain-expert effort. This yields 172 composite workflows with 384 tasks, involving 1,710 spreadsheets with 27 million cells, along with PDFs and other artifacts, capturing the intrinsically messy, long-horizon, knowledge-intensive, and collaborative nature of real-world enterprise work. We conduct both human and automated evaluations of frontier AI systems including GPT 5.1, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, and Qwen 3 Max, and GPT 5.1 Pro spends 48 hours in total yet passes only 38.4% of workflows, while Claude Sonnet 4.5 passes just 25.0%. Comprehensive case studies further surface the challenges that real-world enterprise workflows pose for AI agents.

Summary

  • The paper introduces Finch, a benchmark that evaluates AI agents on end-to-end, spreadsheet-centric finance and accounting workflows using authentic enterprise data.
  • The paper details a dual evaluation methodology with expert annotations and LLM-based judging, highlighting failure points in formula reasoning, data retrieval, and task comprehension.
  • The paper underscores the need for enhanced LLM capabilities in robust formula analysis, context management, and multimodal integration to meet real-world enterprise demands.

Comprehensive Evaluation of AI Agents in Real-World Finance and Accounting Workflows: The Finch Benchmark

Introduction

"Finch: Benchmarking Finance & Accounting across Spreadsheet-Centric Enterprise Workflows" (2512.13168) addresses one of the most acute limitations in evaluating LLM-based agents for enterprise applications: the absence of robust, realistic benchmarks that capture the full compositional, messy, and multimodal nature of finance and accounting (F&A) workflows at scale. The authors present Finch, a new benchmark suite derived from authentic enterprise data—spanning 1,710 spreadsheets (27 million cells) and cross-linked with heterogeneous artifacts (PDFs, images, emails, and additional documents)—which systematically probes the ability of advanced agentic systems to perform end-to-end professional F&A tasks.

Benchmark Construction and Characteristics

Finch distinguishes itself in several core aspects of benchmark construction:

  • Authentic Sourcing and High Complexity: Finch leverages the Enron corpus (15,000 spreadsheets and 500,000 emails) and integrates additional data from the EUSES corpus, financial institutions, and governmental organizations. The workflows retain real-world messiness, including nonuniform table schemas, deeply nested formulas, multimodal references, irregular layouts, and business-domain idiosyncrasies.
  • Workflow Derivation via LLM Assistance and Expert Annotation: Workflow synthesis follows two complementary routes. First, the team mines enterprise email threads for explicit task context, inferring business intent and grounding workflows in input-output artifact pairs where possible. Second, they extract and annotate workflows captured implicitly in spreadsheet version histories, using LLM-based differencing to surface incremental analytic and curation steps (a minimal differencing sketch follows this list). Over 700 hours of expert annotation ensure that task instructions and references accurately reflect authentic analyst workflows.
  • Composite, Multitask, Long-Horizon Workflows: 78.5% of Finch workflows comprise multiple interleaved tasks—combining data entry/import, structuring, calculation, modeling, validation, reporting, and format transformation. The median workflow involves 15,000 cells across eight sheets (with a long tail up to 91 sheets and 3.7 million cells), with dense use of formulas and cross-reference logic.
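
To make the version-history derivation above concrete, the following is a minimal sketch of cell-level differencing between two saved versions of a workbook, the kind of signal an LLM can then summarize into candidate workflow steps. It uses openpyxl; the file names, the output format, and the idea of feeding the diff to an LLM for intent inference are illustrative assumptions, not the paper's actual pipeline.

```python
"""Sketch: surface incremental edits between two versions of a workbook.

Illustrative only; the diff summary format and the downstream LLM step
are assumptions, not the paper's published pipeline.
"""
from openpyxl import load_workbook

def diff_workbooks(old_path: str, new_path: str) -> list[str]:
    """Return human-readable cell-level changes (value or formula edits).

    Note: iterates the newer workbook only, so cells deleted in the new
    version are not reported; good enough for a sketch.
    """
    old_wb = load_workbook(old_path, data_only=False)  # keep formulas as text
    new_wb = load_workbook(new_path, data_only=False)
    changes = []
    for sheet in new_wb.sheetnames:
        new_ws = new_wb[sheet]
        old_ws = old_wb[sheet] if sheet in old_wb.sheetnames else None
        for row in new_ws.iter_rows():
            for cell in row:
                old_val = old_ws[cell.coordinate].value if old_ws else None
                if cell.value != old_val:
                    changes.append(
                        f"{sheet}!{cell.coordinate}: {old_val!r} -> {cell.value!r}"
                    )
    return changes

# The diff summary can then be handed to an LLM to propose the analyst intent
# behind the edit batch, which domain experts verify and refine into a task.
if __name__ == "__main__":
    for line in diff_workbooks("budget_v1.xlsx", "budget_v2.xlsx")[:20]:
        print(line)
```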

Evaluation Protocols and Systems Analysis

Human and Automated Evaluation

Finch employs dual evaluation protocols:

  • Human Annotation: Domain experts assess agent outputs for faithfulness to instructions, correctness, and formatting, with binary pass/fail labels.
  • LLM-as-Judge: A multimodal, rubric-guided LLM (GPT-5-mini) cross-validates outputs via structured diffs, contextual cell extraction, and screenshot analysis to capture layout- and formula-sensitive errors. Agreement between the two is high: 82–90% across models, with the automated judge exhibiting high recall but slight overestimation of accuracy.
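
The following sketch illustrates the general shape of such a rubric-guided judge: a structured diff and the task instruction go into a grading prompt, and the judge returns per-criterion pass/fail plus an overall verdict. The rubric wording, the JSON response format, and the `call_llm` helper are assumptions for illustration, not the paper's exact protocol.

```python
"""Sketch: rubric-guided LLM-as-judge over a structured spreadsheet diff.

`call_llm` is a hypothetical stand-in for whatever judge model API is used;
the rubric items and verdict schema below are assumptions.
"""
import json
from typing import Callable

RUBRIC = [
    "Follows the task instructions faithfully",
    "Numeric results and formulas are correct",
    "Layout and formatting match the reference",
]

def judge_output(structured_diff: str,
                 task_instruction: str,
                 call_llm: Callable[[str], str]) -> dict:
    """Ask the judge model for a pass/fail per rubric item plus an overall verdict."""
    prompt = (
        "You are grading an AI agent's spreadsheet edit against a reference.\n"
        f"Task instruction:\n{task_instruction}\n\n"
        f"Cell-level diff (agent vs. reference):\n{structured_diff}\n\n"
        "For each rubric item, answer pass or fail, then give an overall "
        "verdict. Respond as JSON with keys 'items' and 'overall'.\n"
        f"Rubric: {json.dumps(RUBRIC)}"
    )
    return json.loads(call_llm(prompt))

# An overall pass requires every rubric item to pass, mirroring the binary
# human pass/fail labels described above.
```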

Systematic Agent Performance

Evaluations encompass both web/product agents (GPT 5.1 Pro, Claude Sonnet 4.5) and API-based models (Gemini 3 Pro, Grok 4, Qwen 3 Max), with rigorous standardization of input representations and multimodal handling. Key findings include:

  • Frontier Model Limitations: The best product-side agent, GPT 5.1 Pro, passes only 38.4% of workflows despite spending 48 hours in total (a mean of 16.8 minutes per workflow), while Claude Sonnet 4.5 passes 25.0%. API-based single-call protocols reach a 32% pass rate (GPT 5.1) under the best-performing spreadsheet encoding (a sketch of one such text encoding follows this list).
  • Task and Workflow Complexity Sensitivity: Pass rates decline markedly as the number of tasks grows: on workflows with ≥3 tasks, GPT 5.1 Pro drops to 23.5% and Claude Sonnet 4.5 to 11.8%. Data entry/import, structuring/formatting, and translation tasks are the principal failure modes, aggravated by cross-artifact dependencies (e.g., PDF import, web retrieval).
  • Error Typology: Failure categories are dominated by formula reasoning (35%), data retrieval (25%), code generation (25%), task misunderstanding (10%), and data rendering issues (5%). Notably, the primary point of degradation is not base ability but compositional, long-horizon integration, especially where workflow complexity, structural irregularity, and multimodality coincide.
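
As context for the single-call results above, here is a minimal sketch of one way a worksheet can be serialized into compact text for a prompt, pairing cached cell values with their formulas. The Markdown-style format is an assumption; the paper's best-performing encoding is not reproduced here.

```python
"""Sketch: compact text encoding of a worksheet for a single-call prompt.

The serialization format (coordinates, values, formulas in brackets) is an
illustrative assumption, not the encoding evaluated in the paper.
"""
from openpyxl import load_workbook

def encode_sheet(path: str, sheet: str, max_rows: int = 50) -> str:
    wb_formulas = load_workbook(path, data_only=False)  # formula strings
    wb_values = load_workbook(path, data_only=True)     # cached values
    ws_f, ws_v = wb_formulas[sheet], wb_values[sheet]
    lines = [f"## Sheet: {sheet} ({ws_f.max_row} rows x {ws_f.max_column} cols)"]
    for row_f, row_v in zip(ws_f.iter_rows(max_row=max_rows),
                            ws_v.iter_rows(max_row=max_rows)):
        cells = []
        for cf, cv in zip(row_f, row_v):
            if cf.value is None:
                continue  # skip empty cells to keep the prompt compact
            if isinstance(cf.value, str) and cf.value.startswith("="):
                cells.append(f"{cf.coordinate}: {cv.value!r} [{cf.value}]")
            else:
                cells.append(f"{cf.coordinate}: {cf.value!r}")
        if cells:
            lines.append(" | ".join(cells))
    return "\n".join(lines)
```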

Technical Implications

Finch challenges assumptions about LLM progress on "structured reasoning" tasks, exposing a clear gap between current leaderboards (which mostly emphasize stylized or small-scale QA and semantic parsing) and the demands of high-stakes F&A practice:

  • Formula and Schema Grounding: Tasks require robust formula analysis, latent business-logic extraction, and precise transformation over nonrectangular, multilevel, and semantically ambiguous tables; even minor errors in formula references or semantic mapping propagate into major global errors (a minimal dependency-tracking sketch follows this list).
  • Contextual Dependency Across Heterogeneous Artifacts: Success depends on precise navigation among interrelated files (spreadsheets, PDFs, emails), with frequent cross-referencing and content alignment. Models must maintain schema consistency, avoid spurious edits, and reconstruct subtle context-specific semantics.
  • Multimodal and Multilingual Alignment: Around 10% of workflows involve multimodal reasoning, and translation tasks expose brittleness in LLMs (mishandled structure, omitted layout details) that is not predicted by their strong performance on general-domain translation benchmarks.
  • Evaluation Methodology: Finch’s calibrated LLM-as-judge framework demonstrates practical advances for scalable assessment of spreadsheet and workflow reasoning, but also surfaces the necessity for robust human-in-the-loop validation for subtle or high-impact error classes.
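
The sketch below illustrates the formula-grounding point: extracting cell references from formulas into a dependency graph makes explicit how a single wrong reference propagates to every downstream cell. The reference regex is a deliberate simplification (it ignores ranges, external workbooks, and structured references) and is an assumption rather than a published method.

```python
"""Sketch: build a cell-level dependency graph from formula references.

Shows why one bad reference becomes a global error: every transitive
dependent inherits it. The regex is a simplification (no ranges,
external links, or structured references).
"""
import re
from collections import defaultdict
from openpyxl import load_workbook

CELL_REF = re.compile(r"(?:'?([A-Za-z0-9 _]+)'?!)?\$?([A-Z]{1,3})\$?([0-9]+)")

def dependency_graph(path: str) -> dict[str, set[str]]:
    """Map each formula cell to the cells its formula references."""
    wb = load_workbook(path, data_only=False)
    deps: dict[str, set[str]] = defaultdict(set)
    for sheet in wb.sheetnames:
        for row in wb[sheet].iter_rows():
            for cell in row:
                if isinstance(cell.value, str) and cell.value.startswith("="):
                    node = f"{sheet}!{cell.coordinate}"
                    for ref_sheet, col, rownum in CELL_REF.findall(cell.value):
                        deps[node].add(f"{ref_sheet or sheet}!{col}{rownum}")
    return deps

def downstream(deps: dict[str, set[str]], source: str) -> set[str]:
    """All cells whose value (transitively) depends on `source`."""
    reverse = defaultdict(set)
    for node, targets in deps.items():
        for t in targets:
            reverse[t].add(node)
    seen, stack = set(), [source]
    while stack:
        for nxt in reverse[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen
```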

Relation to Prior Work

Prior F&A and spreadsheet reasoning benchmarks (e.g., FinQA [chen2021finqa], FinanceBench [islam2023financebench], SpreadsheetBench [ma2024spreadsheetbench], SpreadsheetLLM [spreadsheetllm2024], FinMaster [finmaster2025], XFINBENCH [xfinbench2025]) typically emphasize smaller-scale, single-table, or single-task regimes, often with homogeneous or synthetic data and little coverage of long-horizon, messy workflows. Finch moves beyond these by directly targeting collaborative, versioned artifact collections and evaluating systems' compositional generalization in scenario-driven enterprise contexts.

Broader Implications and Future Directions

The clear performance ceiling exposed by Finch has both practical and research consequences:

  • For Applied AI: Adoption of LLM-based automation in enterprise F&A remains bounded by composite workflow performance, especially where accuracy, traceability, and compliance are required (e.g., regulatory reporting, large-scale financial modeling). Finch reveals actionable gaps—formula understanding, cross-artifact reasoning, robust code synthesis, and long-horizon planning—that must be addressed for safe deployment.
  • For Models and Agents: Progress likely demands (i) explicit inductive biases for formula and table logic, (ii) improved context management and schema tracking across large artifacts, (iii) robust multimodal co-reference and alignment, and (iv) agentic affordances supporting multi-step, iterative refinement with intermediate validation and correction (a minimal control-flow sketch follows this list).
  • For Benchmarks: Finch establishes a new realistic bar for agent evaluation, serving as a suite that pushes beyond “toy” problems toward messy, high-fidelity enterprise environments. Future benchmarks may generalize further to additional business domains, richer artifact types, deeper collaborative histories, or concurrent multi-user workflow modeling.
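
Point (iv) above, iterative refinement with intermediate validation, amounts to a control loop like the one sketched below. All helpers (`propose_edit`, `apply_edit`, `validate`) are hypothetical placeholders; the sketch shows the retry-on-validation-failure structure, not any specific agent framework.

```python
"""Sketch: an agent step loop with intermediate validation and retry.

The helpers are hypothetical; the point is the control flow, not a
particular agent framework or API.
"""
from dataclasses import dataclass
from typing import Callable

@dataclass
class StepResult:
    ok: bool
    feedback: str

def run_workflow(tasks: list[str],
                 propose_edit: Callable[[str, str], str],
                 apply_edit: Callable[[str], None],
                 validate: Callable[[str], StepResult],
                 max_retries: int = 2) -> bool:
    """Execute tasks one at a time; re-plan a step when validation fails."""
    for task in tasks:
        feedback = ""
        for _ in range(max_retries + 1):
            edit = propose_edit(task, feedback)   # LLM plans the next edit
            apply_edit(edit)                      # e.g. write cells / formulas
            result = validate(edit)               # recompute, check invariants
            if result.ok:
                break
            feedback = result.feedback            # feed errors back for retry
        else:
            return False                          # task failed after retries
    return True
```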

Conclusion

Finch provides the most comprehensive and challenging benchmark to date for evaluating AI agents on spreadsheet-centric workflows in real-world F&A enterprise settings. Its scale, artifact heterogeneity, task compositionality, and tightly controlled expert annotation surface a wide and previously underestimated gap between current frontier models and actual professional requirements. Systematic advances, as measured against Finch, will be essential for credible progress toward robust, deployable agentic AI for enterprise use cases.

