Finch: Enterprise Financial Workflows
- Enterprise financial workflows are complex, multi-component processes integrating multimodal data such as spreadsheets, PDFs, and emails.
- The Finch benchmark assesses tasks such as data entry, financial modeling, and reporting through large-scale, realistic enterprise datasets.
- It highlights challenges like error propagation and multi-step reasoning, driving advances in AI for robust enterprise financial operations.
Enterprise financial workflows refer to the complex, multi-component processes that underlie finance and accounting operations in modern organizations. These workflows frequently span data entry, structuring, integration of unstructured and structured data, financial modeling, auditing, reporting, and regulatory compliance—all coordinated across spreadsheets, databases, documents, and communication artifacts. The “Finch” benchmark provides a rigorous, large-scale evaluation environment reflecting the authentic challenges inherent in these workflows, incorporating real-world spreadsheet files, multimodal artifacts, and natural language instructions. The following sections synthesize the landscape of enterprise financial workflows as operationalized and assessed in Finch, contextualized alongside contemporary advances in workflow automation, financial data processing, and agentic AI frameworks.
1. Scope and Motivation
Enterprise financial workflows, particularly as benchmarked in Finch, are characterized by their compositional, spreadsheet-centric, and highly multimodal structure. Real-world finance and accounting (F&A) work traverses heterogeneous artifacts—raw and structured data, custom formulas, code, charts, and unstructured text—while frequently relying on knowledge encoded in spreadsheet cells, historical versioning, and cross-modal retrieval (Dong et al., 15 Dec 2025). The primary motivation in developing Finch is to enable systematic evaluation of AI agents on end-to-end, professional-grade workflows that capture the intrinsic messiness, knowledge intensity, and collaborative nature of actual enterprise work.
Key attributes include:
- Compositionality: Tasks interleave data entry/import, table structuring, formula calculations, modeling, validation, visualization, translation, and document/report generation.
- Messiness: Artifacts are rife with cryptic terms, inconsistent formatting, version churn, hidden or complex formulas, cross-file dependencies, and collaborative edit histories from multiple users.
- Multi-modality: Workflows span text, spreadsheets (cellular and formulaic), PDFs, charts, code (e.g., Python scripts for data transformation), and images.
- Long-horizon reasoning: Tasks frequently require the coordinated execution of multiple subtasks, with frequent error propagation and accumulation.
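The compounding cost of long-horizon composition can be illustrated with a toy independence model (an illustration only, not Finch's methodology): if each subtask succeeds independently with probability p, a workflow of n subtasks succeeds with probability p^n, which falls off quickly as n grows.

```python
def workflow_success_probability(p_subtask: float, n_subtasks: int) -> float:
    """Toy model: subtasks succeed independently with probability p_subtask,
    so the whole workflow succeeds only if every subtask does."""
    return p_subtask ** n_subtasks

# Even highly reliable subtasks compound into unreliable workflows.
for n in (1, 5, 10, 20):
    print(f"{n:2d} subtasks @ 95% each -> {workflow_success_probability(0.95, n):.1%}")
```

Under this simplistic model, 95%-reliable subtasks yield roughly 60% workflow reliability at ten steps, which is consistent in spirit with the benchmark's observation that pass rates drop sharply as task count grows.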
2. Data Sources, Scale, and Workflow Taxonomy
Finch is constructed from authentic enterprise datasets including 15,000 Enron spreadsheets, 500,000 Enron emails representing workflows of 150 finance professionals, and contributions from the EUSES corpus, government filings, and high-value industry reports. This foundation yields:
- 172 workflows spanning 384 fine-grained tasks
- 1,710 unique spreadsheets (27 million cells; median workflow: 15k cells, 212 formulas)
- Supporting artifacts: 13 PDFs, 7 images, 3 Word documents, CSV/JSON/Markdown attachments
The covered domains reflect a broad spectrum of F&A operations: planning and budgeting, financial and narrative reporting, trading and risk management, predictive and valuation modeling, operations and procurement, accounts payable/receivable, pricing, and asset management. Many workflows are inherently cross-domain, with nearly 80% interleaving two or more of these areas (Dong et al., 15 Dec 2025).
The task taxonomy is as follows:
| Category | Definition/Example |
|---|---|
| Data Entry/Import | Transcribe/import data from PDFs/images/web: e.g., copy PDF table into spreadsheet, preserving formats |
| Structuring/Formatting | Modify tables: merging cells, formatting, hierarchical reorganization |
| Web Search | Fetch external data (e.g., exchange rates) and ingest into sheets |
| Cross-sheet Retrieval | Reference/pull values and formulas across sheets/files |
| Calculation | Populate formulas (NPV, ratios, etc.) |
| Financial Modeling | Extend valuation models, DCF, scenario analysis |
| Validation/Review | Reconcile subtotals, check for balance consistency |
| Translation | Translate spreadsheets or narrative reports, preserving layout |
| Visualization/Summary | Generate charts, pivot tables, and written summaries |
| Reporting | Assemble multi-modal deliverables, e.g., slide decks with embedded charts and tables |
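As an illustration of the Calculation category, the NPV formula that an agent would populate into a cell can be sketched as plain Python. This follows the spreadsheet convention of discounting the first cash flow by one full period; it is an illustrative analogue, not code from the benchmark.

```python
def npv(rate: float, cashflows: list[float]) -> float:
    """Net present value in the style of the spreadsheet NPV function:
    cash flow i (1-indexed) is discounted by (1 + rate) ** i, so the
    first cash flow is assumed to occur one full period from now."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cashflows, start=1))

# A single cash flow of 110 one period out, discounted at 10%, is worth 100 today.
print(npv(0.10, [110.0]))
```

The one-period offset is a frequent source of off-by-one errors when models reconstruct financial formulas, which is one reason Calculation tasks are graded on exact cell-level outputs.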
3. Workflow Construction and Annotation Methodology
Finch workflows are derived through a two-stage pipeline: LLM-assisted discovery followed by meticulous expert annotation and normalization.
- LLM-Assisted Discovery: GPT-5 is tasked with parsing email threads to identify business goals and associated spreadsheet attachments, generating draft workflow instructions and identifying grounded input/output file pairs. Spreadsheet version histories are analyzed for meaningful diffs, from which probable task boundaries and instructions are inferred.
- Expert Annotation: Over 700 hours of expert effort is invested in normalizing instructions, aligning input and output files, curating reference outputs, and cross-validating all instruction–input–output triples. High-quality final reports from professional contexts are decomposed into realistic multi-step workflows. LLM-as-judge (GPT-5.1 Pro, Claude Sonnet 4.5) is used for secondary consistency checks.
Workflows are released under the CC BY 3.0 US license. All sensitive data and PII are scrubbed (Dong et al., 15 Dec 2025).
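The version-history analysis in the discovery stage rests on computing meaningful diffs between spreadsheet snapshots. A minimal sketch of such a cell-level diff, modeling each version as a `{cell_address: value}` mapping (a hypothetical simplification; the actual pipeline is not specified at this granularity):

```python
def cell_diff(old: dict, new: dict) -> dict:
    """Compare two spreadsheet versions and report added, removed,
    and changed cells - the raw material for inferring task boundaries."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {k: (old[k], new[k])
               for k in old.keys() & new.keys() if old[k] != new[k]}
    return {"added": added, "removed": removed, "changed": changed}

v1 = {"A1": "Revenue", "B1": 1000, "B2": "=B1*0.2"}
v2 = {"A1": "Revenue", "B1": 1200, "B2": "=B1*0.25", "B3": "=B1+B2"}
print(cell_diff(v1, v2))
```

Clusters of such diffs between consecutive versions can then be handed to an LLM to propose a draft task instruction explaining the edit.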
4. Evaluation Protocols and Metrics
Finch introduces robust human and LLM-based evaluation protocols tailored to the intricacies of financial workflows:
- Human Evaluation: Domain experts read instructions, compare input, reference, and model outputs, and assign pass/fail grades. Passing requires full completion with no critical omission or spurious edits.
- LLM-as-Judge Automated Assessment: For modify/generate/QA workflows, structured diffs of cell content, formulas, formatting, and charts are computed between input, reference, and model output. For QA, answers are evaluated for correctness and completeness.
- Metrics:
  - Core: workflow pass rate, defined as the fraction of evaluated workflows judged fully passed (number of workflows passed divided by number of workflows evaluated).
  - Task-type breakdowns: e.g., Data Entry (20–25%), Visualization (≈50%), Translation (near 0%).
  - Pass rate by task count: 44.3% for workflows with ≤2 tasks, dropping to 23.5% for workflows with >2 tasks.
  - Average agent time per workflow (e.g., GPT-5.1 Pro: 16.8 min/workflow).
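The core metric and the task-count breakdown can be sketched in a few lines; the `(num_tasks, passed)` representation of a judged workflow is a hypothetical simplification for illustration.

```python
def pass_rate(results: list[tuple[int, bool]]) -> float:
    """Overall workflow pass rate: fraction of workflows judged passed.
    Each result is a (num_tasks, passed) pair for one workflow."""
    return sum(passed for _, passed in results) / len(results)

def pass_rate_by_task_count(results: list[tuple[int, bool]], threshold: int = 2):
    """Split workflows into short (<= threshold tasks) and long (> threshold)
    groups and report each group's pass rate (None if the group is empty)."""
    short = [passed for n, passed in results if n <= threshold]
    long_ = [passed for n, passed in results if n > threshold]
    return (sum(short) / len(short) if short else None,
            sum(long_) / len(long_) if long_ else None)

# Hypothetical judged results: short workflows fare better than long ones.
results = [(1, True), (2, True), (3, False), (5, False)]
print(pass_rate(results), pass_rate_by_task_count(results))
```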
5. Key Results and Failure Modes
Empirical studies conducted using leading LLM-based agents (GPT-5.1 Pro, Claude Sonnet 4.5, Gemini 3 Pro, Grok 4, Qwen 3 Max) demonstrate that, even after 48 hours of cumulative agent processing:
- GPT-5.1 Pro: 38.4% human-judged pass rate (41.9% LLM-judged)
- Claude Sonnet 4.5: 25.0% human-judged
- Other API-based agents: <30% pass rate in single-call automated settings
Detailed error analysis reveals characteristic failure modes:
- Task Misunderstanding (~10%): AI agents misinterpret implicit business objectives or the contextual meaning embedded within spreadsheet formulas.
- Data Retrieval Errors (~25%): Incorrect extraction of values due to referencing wrong sheets or cell ranges.
- Formula Reasoning Failures (~35%): Misconstruction of business logic, e.g., ignoring cell dependency chains or incorrect financial function usage (e.g., XNPV with missed timing adjustments).
- Code Generation Failures (~25%): Errors in Python or macro scripts, including syntax and runtime exceptions.
- Data Rendering Errors (~5%): Loss of formatting, layout, or chart integrity, especially in translation and reporting tasks.
A prominent example is LLMs failing to preserve merged table headers during translation, rendering the output unusable. Another is mishandling of hidden formulas in scenario modeling, producing materially incorrect results (Dong et al., 15 Dec 2025).
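The XNPV timing failure is concrete enough to illustrate: unlike NPV, XNPV discounts each cash flow by the exact number of days elapsed since the first date (over a 365-day year, following the spreadsheet convention). A plain-Python sketch, assuming that Actual/365 convention:

```python
from datetime import date

def xnpv(rate: float, cashflows: list[tuple[date, float]]) -> float:
    """XNPV-style discounting: each cash flow is discounted by the exact
    number of days elapsed since the earliest date, over a 365-day year."""
    t0 = min(d for d, _ in cashflows)
    return sum(cf / (1 + rate) ** ((d - t0).days / 365.0) for d, cf in cashflows)

# Dropping the date information (the "missed timing adjustment" failure)
# silently treats every flow as occurring at integer-period boundaries,
# which gives materially different results for irregularly spaced flows.
flows = [(date(2023, 1, 1), -1000.0), (date(2023, 7, 1), 600.0),
         (date(2024, 1, 1), 600.0)]
print(xnpv(0.10, flows))
```

An agent that substitutes NPV's fixed-period discounting for XNPV's day-count discounting reproduces exactly the formula-reasoning failure described above.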
6. Technical and Organizational Challenges
The deployment of AI agents for enterprise financial workflows faces multiple entrenched challenges:
- Accumulation of Error in Multi-step Composition: Error rates grow with task complexity and compositional depth, especially as outputs of one subtask become inputs for subsequent ones.
- Spreadsheet Ecosystem Fragmentation: Hidden formulas, merged cells, and nonstandard layouts in large spreadsheet collections impede end-to-end automation.
- Multimodal Reasoning Requirements: Agents must handle context spanning spreadsheets, PDF filings, web-sourced data, and mixed media in a single workflow.
- Tacit Domain Knowledge: Domain-specific operations and intent often reside within opaque spreadsheet structures, requiring inductive reasoning beyond literal inputs.
- Collaborative Workflow Dynamics: Version churn and multi-user editing introduce contextual dependencies not easily captured by static models.
7. Implications, Limitations, and Future Directions
The Finch benchmark sharply delineates the current limits of LLM-based and agentic AI in real-world financial workflows. Despite successes in interactive tasks (modeling, visualization), state-of-the-art agents remain brittle on long-horizon, composite, or structurally messy workflows. Persistent limitations involve reasoning over hidden or implicit business rules, cross-file dependencies, and multimodal data.
Substantial progress may require:
- Pre-training or fine-tuning on large, complex, multi-sheet spreadsheets to enhance structural and semantic understanding.
- Augmenting sequence models with symbolic spreadsheet engines and advanced planners for robust formula reasoning.
- Enabling iterative agentic tool use, mid-flow verification, and fallback strategies.
- Developing stronger multimodal backbones to jointly process spreadsheet data, cell- and sheet-level screenshots, and complementary modalities such as PDFs and images.
- Evolving evaluation metrics to reward partial solution progress and provide skill-level diagnostics, enabling more granular identification of systemic weaknesses.
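Mid-flow verification with fallback could take the shape of a check-and-retry loop around each subtask. In this hypothetical sketch, `run_subtask` stands in for the agent's executor and `validate` for a checker such as a balance-consistency or subtotal-reconciliation test; neither is part of the benchmark.

```python
def run_with_verification(subtasks, run_subtask, validate, max_retries=2):
    """Execute subtasks in order. After each attempt, run a validator on the
    resulting state; retry a failing subtask up to max_retries times, and
    stop the workflow rather than propagate an unverified result."""
    state: dict = {}
    for task in subtasks:
        for _attempt in range(max_retries + 1):
            state = run_subtask(task, state)
            if validate(task, state):
                break  # subtask verified; move to the next one
        else:
            raise RuntimeError(f"subtask {task!r} failed verification "
                               f"after {max_retries + 1} attempts")
    return state
```

Halting on an unverifiable subtask trades completion rate for correctness, which matters in workflows where a single bad intermediate value silently corrupts every downstream figure.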
By reflecting authentic enterprise complexity, Finch acts as a catalyst and yardstick for next-generation LLM-powered enterprise solutions, defining both the state of the art and the gap yet to be closed (Dong et al., 15 Dec 2025).