FinAgentBench: Agentic Retrieval in Finance
- FinAgentBench is a benchmarking suite for evaluating LLM retrieval and agentic reasoning in specialized financial workflows.
- It employs a modular, two-step agentic retrieval process that mimics analyst workflows by first selecting document types and then pinpointing relevant content chunks.
- The suite uses expert-annotated query-document pairs and tailored metrics such as nDCG and MRR to assess LLM performance and reveal domain-specific challenges.
FinAgentBench is a benchmarking suite designed to rigorously evaluate information retrieval and agentic reasoning capabilities of LLMs and LLM-based agents in specialized financial workflows. It introduces a modular, interpretable evaluation protocol centered on multi-step retrieval over high-value financial disclosures, with an annotation framework and metric suite tailored to the complexity and domain-specificity of professional financial research. As the first large-scale benchmark for agentic evidence retrieval in finance, FinAgentBench has fostered extensions across enterprise accounting, wealth management, and execution-grounded safety assessment.
1. Problem Motivation and Agentic Retrieval Paradigm
Financial research demands extraction of precise, context-dependent information from extensive, heterogeneous filings (10-K, 10-Q, 8-K, earnings transcripts, DEF-14A). Traditional retrieval pipelines—both sparse (e.g., BM25, TF–IDF) and dense neural embeddings—are frequently inadequate for two central reasons:
- Semantic fine-grainedness: Financial concepts (e.g., “proxy statement” vs. “DEF-14A”; GAAP vs. non-GAAP metrics) require domain alignment and structural understanding, not simple keyword overlap or sentence embeddings.
- Document scale and latency: Indexing and querying thousands of pages or tables per firm leads to prohibitively long context windows, or to information loss from truncation.
FinAgentBench formalizes agentic retrieval as a modular, two-step process:
- Document Type Selection: Given a query q, select the most relevant filing type d* = argmax_{d ∈ D} s_doc(q, d) over the candidate filing types D.
- Chunk Pinpointing: Within d*, score and rank paragraph- or table-level “chunks” c_i by s_chunk(q, c_i).
This decomposition mirrors analyst workflows and yields interpretable stages for error attribution and model analysis (Choi et al., 7 Aug 2025, Ng et al., 18 Nov 2025).
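The two-stage decomposition above can be sketched as a simple retrieval loop. In this sketch, `score_doc_type` and `score_chunk` are hypothetical placeholders for any LLM-based relevance scorer; the function names and signatures are assumptions for illustration, not part of the benchmark's released code:

```python
from typing import Callable, Dict, List, Tuple

def agentic_retrieve(
    query: str,
    corpus: Dict[str, List[str]],                   # filing type -> list of chunks
    score_doc_type: Callable[[str, str], float],    # (query, filing type) -> relevance
    score_chunk: Callable[[str, str], float],       # (query, chunk) -> relevance
    k: int = 5,
) -> Tuple[str, List[str]]:
    """Two-step agentic retrieval: pick a filing type, then rank its chunks."""
    # Step 1: document type selection over the candidate filing types.
    doc_type = max(corpus, key=lambda d: score_doc_type(query, d))
    # Step 2: chunk pinpointing within the selected document type.
    ranked = sorted(corpus[doc_type], key=lambda c: score_chunk(query, c), reverse=True)
    return doc_type, ranked[:k]
```

Because the two stages are separate function calls, errors can be attributed to document scoping or chunk localization independently, mirroring the benchmark's stage-wise evaluation.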
2. Dataset Construction and Annotation Protocol
FinAgentBench provides 3,429 expert-annotated query-document pairs for S&P-100 firms, spanning five SEC filing types. The construction protocol comprises:
- Query Generation: Two finance professionals independently draft ten queries for each company, covering ten categories (e.g., Analyst Q&A, Risks, Operating Metrics), with cross-validation to reach final consensus. Each query is designed to invoke nuanced, context-driven searching.
- Document and Chunk Preprocessing: >15,000 filings are split into paragraph-level chunks, with each table treated as a single unit to preserve semantic coherence.
- Dual-Phase Annotation: (1) Filing types are ranked per query using a drag-and-drop interface; (2) every chunk within the top-ranked document is graded on a TREC-scale (0=irrelevant, 1=partially relevant, 2=directly relevant), with adjudicated dual annotation for label consistency.
Key dataset statistics:
- Average relevant document types per query: 2.92
- Average directly relevant chunks: 6.41
- Average partially relevant chunks: 9.24 (Choi et al., 7 Aug 2025).
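The preprocessing and annotation conventions above can be sketched as follows. The `Chunk` structure, field names, and the `<TABLE>…</TABLE>` delimiters are illustrative assumptions, not the published schema; the point is that paragraphs are split apart while each table stays a single atomic unit carrying a TREC-style relevance grade:

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Chunk:
    text: str
    is_table: bool
    relevance: int = 0  # TREC-style grade: 0=irrelevant, 1=partially, 2=directly relevant

def split_into_chunks(filing_text: str) -> List[Chunk]:
    """Split a filing into paragraph-level chunks; keep tables atomic.

    For this sketch, tables are assumed to be delimited by
    <TABLE>...</TABLE> markers in the preprocessed text.
    """
    chunks: List[Chunk] = []
    for block in filing_text.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("<TABLE>") and block.endswith("</TABLE>"):
            chunks.append(Chunk(text=block, is_table=True))   # one unit per table
        else:
            chunks.append(Chunk(text=block, is_table=False))  # one unit per paragraph
    return chunks
```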
3. Evaluation Protocol and Metrics
FinAgentBench adheres to a rigorous information retrieval metric regime at both document and chunk levels over the top-5 predictions per stage:
- Precision@k (P@k): the fraction of the top-k retrieved items that are relevant, P@k = |relevant ∩ top-k| / k.
- Recall@k (R@k): the fraction of all relevant items recovered in the top k, R@k = |relevant ∩ top-k| / |relevant|.
- Mean Reciprocal Rank (MRR@k): the mean over queries of 1/rank of the first relevant item within the top k (0 if none appears).
- Mean Average Precision (MAP@k): the mean over queries of AP@k = (1 / min(R, k)) Σ_{i=1}^{k} P@i · rel_i, where rel_i ∈ {0, 1} marks relevance at rank i and R is the number of relevant items.
- nDCG@k: DCG@k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log₂(i + 1), normalized by the ideal DCG (Choi et al., 7 Aug 2025, Ng et al., 18 Nov 2025).
This layered metric suite enables direct attribution of bottlenecks—whether in document scoping or intra-document localization—with all relevance labels and permutation ground truths derived from expert annotation.
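The ranking metrics above can be computed directly from graded relevance labels. A minimal sketch, treating any grade > 0 as relevant for MRR and AP and using graded gains 2^rel − 1 for nDCG (standard definitions, not the benchmark's exact scoring code):

```python
import math
from typing import List

def mrr_at_k(rels: List[int], k: int) -> float:
    """Reciprocal rank of the first relevant item in the top k (0 if none)."""
    for i, r in enumerate(rels[:k], start=1):
        if r > 0:
            return 1.0 / i
    return 0.0

def ap_at_k(rels: List[int], k: int, num_relevant: int) -> float:
    """Average precision over the top k, normalized by min(num_relevant, k)."""
    hits, score = 0, 0.0
    for i, r in enumerate(rels[:k], start=1):
        if r > 0:
            hits += 1
            score += hits / i  # precision@i accumulated at each relevant rank
    return score / max(1, min(num_relevant, k))

def ndcg_at_k(rels: List[int], k: int) -> float:
    """nDCG with graded gains 2^rel - 1 and a log2 position discount."""
    dcg = sum((2 ** r - 1) / math.log2(i + 1) for i, r in enumerate(rels[:k], start=1))
    ideal = sorted(rels, reverse=True)
    idcg = sum((2 ** r - 1) / math.log2(i + 1) for i, r in enumerate(ideal[:k], start=1))
    return dcg / idcg if idcg > 0 else 0.0
```

Each function takes the relevance grades of a ranked prediction list, so document-level and chunk-level scores use the same code applied to the two stages' outputs.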
4. Agent Baselines, Fine-Tuning, and Performance Analysis
FinAgentBench benchmarks the retrieval-centric performance of leading LLMs in strict zero-shot, retrieval-only settings. Representative results include (Choi et al., 7 Aug 2025, Ng et al., 18 Nov 2025):
| Model | Doc nDCG@5 | Doc MAP@5 | Doc MRR@5 | Chunk nDCG@5 | Chunk MRR@5 |
|---|---|---|---|---|---|
| Claude-Sonnet-4 | 0.783 | 0.849 | 0.892 | 0.419 | 0.567 |
| GPT-o3 | 0.770 | 0.829 | 0.875 | — | — |
Chunk-level retrieval is consistently less accurate, highlighting the intrinsic challenge of pinpointing sparse, heterogeneously distributed relevant evidence under tight window constraints. Targeted reinforcement fine-tuning on GPT-o4-mini (90% of training split) yields robust stage-wise improvements: Doc nDCG@5 rises from 0.758 to 0.808, and MRR@5 from 0.872 to 0.933; chunk nDCG@5 from 0.345 to 0.371, and MRR@5 from 0.526 to 0.587. These results empirically establish that even strong LLMs benefit from domain-specific, task-aligned supervision.
5. Extensions Across Financial Workflows and Agentic Reasoning
The agentic retrieval paradigm of FinAgentBench has catalyzed a spectrum of dataset extensions covering enterprise finance and accounting (Dong et al., 15 Dec 2025), wealth management (Milsom, 1 Dec 2025), and execution-grounded agent security (Yang et al., 9 Jan 2026).
- Enterprise Financial Workflows (Finch): Tasks span multi-modal artifacts in Enron-style real-world spreadsheet and email corpora, interleaving structuring, cross-file retrieval, calculation, validation, and translation. Workflow Pass Rate and Task Success Rate are primary metrics, with GPT-5.1 Pro achieving a 38.4% pass rate. Steep performance decrements are observed on workflows exceeding two compositional subtasks (Dong et al., 15 Dec 2025).
- Wealth Management (Closed-World Evaluation): Benchmarks apply paired high- and low-autonomy variants across retrieval, analytic, and communication workflows, with deterministic grading via multi-checkpoint validators. Metrics include overall accuracy, reliability, and API cost per task, facilitating Pareto analysis of autonomy vs. economic efficiency (Milsom, 1 Dec 2025).
- Security in Execution-Grounded Environments: The FinAgentBench extension within FinVault formalizes 31 auditable scenarios with mutable state, cross-domain tool-call privileges, 107 regulatory vulnerabilities, and attack taxonomies including prompt injection, jailbreaking, and authority impersonation. Metrics such as Attack Success Rate (ASR) and Vulnerability Compromise Rate (VulnRate) highlight persistent agentic safety gaps: ASR reaches up to 50%, and even the most robust models exhibit a VulnRate above 20% (Yang et al., 9 Jan 2026).
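The autonomy-vs-cost Pareto analysis mentioned for the wealth-management extension amounts to keeping only configurations not dominated on (accuracy, cost). A minimal sketch, where the `(accuracy, cost)` tuple representation is an assumption for illustration:

```python
from typing import List, Tuple

def pareto_frontier(points: List[Tuple[float, float]]) -> List[Tuple[float, float]]:
    """Return the (accuracy, cost) points not dominated by any other point.

    A point is dominated if some other point has accuracy >= and cost <=,
    with at least one of the two comparisons strict.
    """
    frontier = []
    for acc, cost in points:
        dominated = any(
            (a >= acc and c <= cost) and (a > acc or c < cost)
            for a, c in points
        )
        if not dominated:
            frontier.append((acc, cost))
    return frontier
```

Plotting the frontier makes the accuracy gained per unit of API cost visible when comparing high- and low-autonomy agent variants.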
6. Principal Data-Driven Insights and Open Problems
Evaluations on FinAgentBench and its extensions reveal:
- LLMs exhibit strong reporting priors at the document-type selection stage but degrade sharply at fine-grained chunking due to context fragmentation and evidence sparsity (Choi et al., 7 Aug 2025, Ng et al., 18 Nov 2025).
- Domain-specific tuning and structured prompts with explicit reasoning scaffolds yield significant returns, while uninformed extension of in-context learning (ICL) to all retrieval stages can be detrimental (Ng et al., 18 Nov 2025).
- Real-world, multi-modal enterprise workflows expose persistent limitations in layout-awareness, formula interpretation, cross-file reasoning, and multimodal fusion, which current agent architectures struggle to surmount (Dong et al., 15 Dec 2025).
- For security and compliance, traditional safety alignments fail to transfer, with role-playing and authority-impersonation attacks dominating even in tightly sandboxed, auditable environments (Yang et al., 9 Jan 2026).
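The "structured prompts with explicit reasoning scaffolds" finding can be illustrated with a document-type-selection prompt template. The wording below is a hypothetical illustration of a reasoning scaffold, not the benchmark's actual prompt:

```python
from typing import List

# Illustrative scaffold: force the model to reason about the disclosure
# type before committing to a ranking, rather than answering in one shot.
DOC_TYPE_PROMPT = """\
You are a financial research analyst. Rank the filing types below by how
likely each is to contain the answer to the query.

Query: {query}
Candidate filing types: {filing_types}

Reason step by step:
1. Identify what kind of disclosure the query asks about.
2. Map that disclosure to the filing type(s) that report it.
3. Output the filing types ranked from most to least relevant.

Ranked filing types:"""

def build_doc_type_prompt(query: str, filing_types: List[str]) -> str:
    """Fill the scaffolded template for one query."""
    return DOC_TYPE_PROMPT.format(query=query, filing_types=", ".join(filing_types))
```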
Open challenges include scaling to full S&P 500 coverage, enriching for temporal and comparative queries, simulating collaborative and adversarial workflows, quantifying scenario severity, and jointly integrating retrieval with answer-generation in “truly agentic” assistants.
7. Implications and Future Directions
FinAgentBench establishes a foundation for research into compositional reasoning, interpretable error analysis, and rigorous, domain-grounded evaluation of LLM agents in financial contexts. Forthcoming releases will extend to the S&P 500, introduce longitudinal analyses across reporting periods, and investigate unified agentic protocols combining retrieval, answer selection, and compliance validation (Choi et al., 7 Aug 2025).
Key recommendations for future systems include explicit modeling of domain policies, introduction of tool-augmented actions (e.g., for formula computation and chart parsing), robust prompt isolation for safety, and granular measurement of cost and fidelity in high-stakes financial workflows. These directions aim to bridge persistent capability gaps in evidence-centered reasoning, agentic orchestration, and operational security in applied financial AI.