Finch: Enterprise Financial Workflows
- Enterprise financial workflows are complex, multi-component processes integrating multimodal data such as spreadsheets, PDFs, and emails.
- The Finch benchmark assesses tasks such as data entry, financial modeling, and reporting through large-scale, realistic enterprise datasets.
- It highlights challenges like error propagation and multi-step reasoning, driving advances in AI for robust enterprise financial operations.
Enterprise financial workflows refer to the complex, multi-component processes that underlie finance and accounting operations in modern organizations. These workflows frequently span data entry, structuring, integration of unstructured and structured data, financial modeling, auditing, reporting, and regulatory compliance—all coordinated across spreadsheets, databases, documents, and communication artifacts. The “Finch” benchmark provides a rigorous, large-scale evaluation environment reflecting the authentic challenges inherent in these workflows, incorporating real-world spreadsheet files, multimodal artifacts, and natural language instructions. The following sections synthesize the landscape of enterprise financial workflows as operationalized and assessed in Finch, contextualized alongside contemporary advances in workflow automation, financial data processing, and agentic AI frameworks.
1. Scope and Motivation
Enterprise financial workflows, particularly as benchmarked in Finch, are characterized by their compositional, spreadsheet-centric, and highly multimodal structure. Real-world finance and accounting (F&A) work traverses heterogeneous artifacts—raw and structured data, custom formulas, code, charts, and unstructured text—while frequently relying on knowledge encoded in spreadsheet cells, historical versioning, and cross-modal retrieval (Dong et al., 15 Dec 2025). The primary motivation in developing Finch is to enable systematic evaluation of AI agents on end-to-end, professional-grade workflows that capture the intrinsic messiness, knowledge intensity, and collaborative nature of actual enterprise work.
Key attributes include:
- Compositionality: Tasks interleave data entry/import, table structuring, formula calculations, modeling, validation, visualization, translation, and document/report generation.
- Messiness: Artifacts are rife with cryptic terms, inconsistent formatting, version churn, hidden or complex formulas, cross-file dependencies, and collaborative edit histories from multiple users.
- Multi-modality: Workflows span text, spreadsheets (cellular and formulaic), PDFs, charts, code (e.g., Python scripts for data transformation), and images.
- Long-horizon reasoning: Tasks frequently require the coordinated execution of multiple subtasks, with frequent error propagation and accumulation.
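The compounding cost of long-horizon composition can be illustrated with a toy independence model (an illustration only, not Finch's methodology): if each subtask succeeds independently with probability p, a workflow of n subtasks succeeds with probability p^n, which falls off quickly as n grows.

```python
def workflow_success_probability(p_subtask: float, n_subtasks: int) -> float:
    """Toy model: subtasks succeed independently with probability p_subtask,
    so the whole workflow succeeds only if every subtask does."""
    return p_subtask ** n_subtasks

# Even highly reliable subtasks compound into unreliable workflows.
for n in (1, 5, 10, 20):
    print(f"{n:2d} subtasks @ 95% each -> {workflow_success_probability(0.95, n):.1%}")
```

Under this simplistic model, 95%-reliable subtasks yield roughly 60% workflow reliability at ten steps, which is consistent in spirit with the benchmark's observation that pass rates drop sharply as task count grows.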
2. Data Sources, Scale, and Workflow Taxonomy
Finch is constructed from authentic enterprise datasets including 15,000 Enron spreadsheets, 500,000 Enron emails representing workflows of 150 finance professionals, and contributions from the EUSES corpus, government filings, and high-value industry reports. This foundation yields:
- 172 workflows spanning 384 fine-grained tasks
- 1,710 unique spreadsheets (27 million cells; median workflow: 15k cells, 212 formulas)
- Supporting artifacts: 13 PDFs, 7 images, 3 Word documents, CSV/JSON/Markdown attachments
The covered domains reflect a broad spectrum of F&A operations: planning and budgeting, financial and narrative reporting, trading and risk management, predictive and valuation modeling, operations and procurement, accounts payable/receivable, pricing, and asset management. Many workflows are inherently cross-domain, with nearly 80% interleaving two or more of these areas (Dong et al., 15 Dec 2025).
The task taxonomy is as follows:
| Category | Definition/Example |
|---|---|
| Data Entry/Import | Transcribe/import data from PDFs/images/web: e.g., copy PDF table into spreadsheet, preserving formats |
| Structuring/Formatting | Modify tables: merging cells, formatting, hierarchical reorganization |
| Web Search | Fetch external data (e.g., exchange rates) and ingest into sheets |
| Cross-sheet Retrieval | Reference/pull values and formulas across sheets/files |
| Calculation | Populate formulas (NPV, ratios, etc.) |
| Financial Modeling | Extend valuation models, DCF, scenario analysis |
| Validation/Review | Reconcile subtotals, check for balance consistency |
| Translation | Translate spreadsheets or narrative reports, preserving layout |
| Visualization/Summary | Generate charts, pivot tables, and written summaries |
| Reporting | Assemble multi-modal deliverables, e.g., slide decks with embedded charts and tables |
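As an illustration of the Calculation category, the NPV formula that an agent would populate into a cell can be sketched as plain Python. This follows the spreadsheet convention of discounting the first cash flow by one full period; it is an illustrative analogue, not code from the benchmark.

```python
def npv(rate: float, cashflows: list[float]) -> float:
    """Net present value in the style of the spreadsheet NPV function:
    cash flow i (1-indexed) is discounted by (1 + rate) ** i, so the
    first cash flow is assumed to occur one full period from now."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows, start=1))

# A single cash flow of 110 one period out, discounted at 10%, is worth 100 today.
print(npv(0.10, [110.0]))
```

The one-period offset is a frequent source of off-by-one errors when models reconstruct financial formulas, which is one reason Calculation tasks are graded on exact cell-level outputs.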
3. Workflow Construction and Annotation Methodology
Finch workflows are derived through a two-stage pipeline: LLM-assisted discovery followed by meticulous expert annotation and normalization.
- LLM-Assisted Discovery: GPT-5 is tasked with parsing email threads to identify business goals and associated spreadsheet attachments, generating draft workflow instructions and identifying grounded input/output file pairs. Spreadsheet version histories are analyzed for meaningful diffs, from which probable task boundaries and instructions are inferred.
- Expert Annotation: Over 700 hours of expert effort is invested in normalizing instructions, aligning input and output files, curating reference outputs, and cross-validating all instruction–input–output triples. High-quality final reports from professional contexts are decomposed into realistic multi-step workflows. LLM-as-judge (GPT-5.1 Pro, Claude Sonnet 4.5) is used for secondary consistency checks.
Workflows are released under the CC BY 3.0 US license. All sensitive data and PII are scrubbed (Dong et al., 15 Dec 2025).
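The version-history analysis in the discovery stage rests on computing meaningful diffs between spreadsheet snapshots. A minimal sketch of such a cell-level diff, modeling each version as a `{cell_address: value}` mapping (a hypothetical simplification; the actual pipeline is not specified at this granularity):

```python
def cell_diff(old: dict, new: dict) -> dict:
    """Compare two spreadsheet versions and report added, removed,
    and changed cells - the raw material for inferring task boundaries."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return {"added": added, "removed": removed, "changed": changed}

v1 = {"A1": "Revenue", "B1": 1000, "B2": "=B1*0.2"}
v2 = {"A1": "Revenue", "B1": 1200, "B2": "=B1*0.25", "B3": "=B1+B2"}
print(cell_diff(v1, v2))
```

Clusters of such diffs between consecutive versions can then be handed to an LLM to propose a draft task instruction explaining the edit.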
4. Evaluation Protocols and Metrics
Finch introduces robust human and LLM-based evaluation protocols tailored to the intricacies of financial workflows:
- Human Evaluation: Domain experts read instructions, compare input, reference, and model outputs, and assign pass/fail grades. Passing requires full completion with no critical omission or spurious edits.
- LLM-as-Judge Automated Assessment: For modify/generate/QA workflows, structured diffs of cell content, formulas, formatting, and charts are computed between input, reference, and model output. For QA, answers are evaluated for correctness and completeness.
- Metrics:
  - Core: workflow pass rate, defined as the fraction of evaluated workflows judged fully passed (number of workflows passed divided by number of workflows evaluated).
  - Task-type breakdowns: e.g., Data Entry (20–25%), Visualization (≈50%), Translation (near 0%).
  - Pass rate by task count: 44.3% for workflows with ≤2 tasks, dropping to 23.5% for workflows with >2 tasks.
  - Average agent time per workflow (e.g., GPT-5.1 Pro: 16.8 min/workflow).
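The core metric and the task-count breakdown can be sketched in a few lines; the `(num_tasks, passed)` representation of a judged workflow is a hypothetical simplification for illustration.

```python
def pass_rate(results: list[tuple[int, bool]]) -> float:
    """Overall workflow pass rate: fraction of workflows judged passed.
    Each result is a (num_tasks, passed) pair for one workflow."""
    return sum(passed for _, passed in results) / len(results)

def pass_rate_by_task_count(results: list[tuple[int, bool]], threshold: int = 2):
    """Split workflows into short (<= threshold tasks) and long (> threshold)
    groups and report each group's pass rate (None if the group is empty)."""
    short = [passed for n, passed in results if n <= threshold]
    long_ = [passed for n, passed in results if n > threshold]
    return (sum(short) / len(short) if short else None,
            sum(long_) / len(long_) if long_ else None)

# Hypothetical judged results: short workflows fare better than long ones.
results = [(1, True), (2, True), (3, False), (5, False)]
print(pass_rate(results), pass_rate_by_task_count(results))
```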
5. Key Results and Failure Modes
Empirical studies conducted using leading LLM-based agents (GPT-5.1 Pro, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, Qwen 3 Max) demonstrate that, even after 48 hours of cumulative agent processing:
- GPT-5.1 Pro: 38.4% human-judged pass rate (41.9% LLM-judged)
- Claude Sonnet 4.5: 25.0% human-judged
- Other API-based agents: <30% pass rate in single-call automated settings
Detailed error analysis reveals characteristic failure modes:
- Task Misunderstanding (~10%): AI agents misinterpret implicit business objectives or the contextual meaning embedded within spreadsheet formulas.
- Data Retrieval Errors (~25%): Incorrect extraction of values due to referencing wrong sheets or cell ranges.
- Formula Reasoning Failures (~35%): Misconstruction of business logic, e.g., ignoring cell dependency chains or incorrect financial function usage (e.g., XNPV with missed timing adjustments).
- Code Generation Failures (~25%): Errors in Python or macro scripts, including syntax and runtime exceptions.
- Data Rendering Errors (~5%): Loss of formatting, layout, or chart integrity, especially in translation and reporting tasks.
A prominent example is LLMs failing to preserve merged table headers during translation, rendering the output unusable. Another is mishandling of hidden formulas in scenario modeling, producing materially incorrect results (Dong et al., 15 Dec 2025).
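The XNPV timing failure is concrete enough to illustrate: unlike NPV, XNPV discounts each cash flow by the exact number of days elapsed since the first date (over a 365-day year, following the spreadsheet convention). A plain-Python sketch, assuming that Actual/365 convention:

```python
from datetime import date

def xnpv(rate: float, cashflows: list[tuple[date, float]]) -> float:
    """XNPV-style discounting: each cash flow is discounted by the exact
    number of days elapsed since the earliest date, over a 365-day year."""
    t0 = min(d for d, _ in cashflows)
    return sum(cf / (1 + rate) ** ((d - t0).days / 365.0) for d, cf in cashflows)

# Dropping the date information (the "missed timing adjustment" failure)
# silently treats every flow as occurring at integer-period boundaries,
# which gives materially different results for irregularly spaced flows.
flows = [(date(2023, 1, 1), -1000.0), (date(2023, 7, 1), 600.0),
         (date(2024, 1, 1), 600.0)]
print(xnpv(0.10, flows))
```

An agent that substitutes NPV's fixed-period discounting for XNPV's day-count discounting reproduces exactly the formula-reasoning failure described above.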
6. Technical and Organizational Challenges
The deployment of AI agents for enterprise financial workflows faces multiple entrenched challenges:
- Accumulation of Error in Multi-step Composition: Error rates grow with task complexity and compositional depth, especially as outputs of one subtask become inputs for subsequent ones.
- Spreadsheet Ecosystem Fragmentation: Hidden formulas, merged cells, and nonstandard layouts in large spreadsheet collections impede end-to-end automation.
- Multimodal Reasoning Requirements: Agents must handle context spanning spreadsheets, PDF filings, web-sourced data, and mixed media in a single workflow.
- Tacit Domain Knowledge: Domain-specific operations and intent often reside within opaque spreadsheet structures, requiring inductive reasoning beyond literal inputs.
- Collaborative Workflow Dynamics: Version churn and multi-user editing introduce contextual dependencies not easily captured by static models.
7. Implications, Limitations, and Future Directions
The Finch benchmark sharply delineates the current limits of LLM-based and agentic AI in real-world financial workflows. Despite successes in interactive tasks (modeling, visualization), state-of-the-art agents remain brittle on long-horizon, composite, or structurally messy workflows. Persistent limitations involve reasoning over hidden or implicit business rules, cross-file dependencies, and multimodal data.
Substantial progress may require:
- Pre-training or fine-tuning on large, complex, multi-sheet spreadsheets to enhance structural and semantic understanding.
- Augmenting sequence models with symbolic spreadsheet engines and advanced planners for robust formula reasoning.
- Enabling iterative agentic tool use, mid-flow verification, and fallback strategies.
- Developing stronger multimodal backbones to jointly process spreadsheet data, cell- and sheet-level screenshots, and complementary modalities such as PDFs and images.
- Evolving evaluation metrics to reward partial solution progress and provide skill-level diagnostics, enabling more granular identification of systemic weaknesses.
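Mid-flow verification with fallback could take the shape of a check-and-retry loop around each subtask. In this hypothetical sketch, `run_subtask` stands in for the agent's executor and `validate` for a checker such as a balance-consistency or subtotal-reconciliation test; neither is part of the benchmark.

```python
def run_with_verification(subtasks, run_subtask, validate, max_retries=2):
    """Execute subtasks in order. After each attempt, run a validator on the
    resulting state; retry a failing subtask up to max_retries times, and
    stop the workflow rather than propagate an unverified result."""
    state: dict = {}
    for task in subtasks:
        for _attempt in range(max_retries + 1):
            state = run_subtask(task, state)
            if validate(task, state):
                break  # subtask verified; move to the next one
        else:
            raise RuntimeError(f"subtask {task!r} failed verification "
                               f"after {max_retries + 1} attempts")
    return state
```

Halting on an unverifiable subtask trades completion rate for correctness, which matters in workflows where a single bad intermediate value silently corrupts every downstream figure.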
By reflecting authentic enterprise complexity, Finch acts as a catalyst and yardstick for next-generation LLM-powered enterprise solutions, defining both the state of the art and the gap yet to be closed (Dong et al., 15 Dec 2025).