FinAgentBench Dataset Overview
- FinAgentBench is a comprehensive benchmark suite providing datasets and evaluation frameworks to measure AI agents' performance in authentic financial scenarios.
- It integrates real-world financial tasks, multi-modal inputs, and tool-based workflows to simulate complex analysis and decision-making processes.
- Rigorous evaluation protocols and performance metrics, including accuracy and ranking scores, highlight current AI limitations and inform future research directions.
FinAgentBench refers to a set of large-scale, rigorously constructed benchmark datasets and evaluation frameworks for measuring the capabilities of AI agents—especially LLM agents—in the financial domain. The designation "FinAgentBench" appears in multiple independently developed works that provide complementary resources for evaluating multi-step reasoning, tool-based workflows, information retrieval, and financial analysis by AI systems. The benchmarks are designed to simulate real-world finance practice, emphasizing authentic and high-complexity scenarios that require extensive domain knowledge, access to structured and unstructured financial documents, and intricate reasoning or workflow orchestration (Zeng et al., 23 Jul 2025, Bigeard et al., 20 May 2025, Choi et al., 7 Aug 2025).
1. Origins, Motivations, and Design Principles
FinAgentBench encompasses datasets produced independently by several research groups to address the pronounced shortage of domain-specific evaluation resources in finance. While prior static question-answering and general agent benchmarks concentrated on open-web or non-financial domains, FinAgentBench initiatives prioritize:
- Task authenticity, relying on expert-authored questions and real, time-sensitive documents (regulatory filings, market data, news, and product indicators).
- Breadth and depth, covering the principal sub-domains of finance: securities, funds, banking, insurance, futures, trusts, asset management, and public market research.
- Tool and workflow integration, requiring agents to autonomously utilize external tools (web scraping, PDF/Excel/audio parsers, EDGAR retrieval, code execution) as part of multi-step reasoning.
- Complex scenario construction, modeled after actual financial analyst, investor, and compliance workflows, demanding information synthesis, logical inference, and structured output.
- Hierarchical complexity, reflecting increasing task depth from basic factual queries to strategic reasoning (e.g., asset allocation, risk management) (Zeng et al., 23 Jul 2025, Bigeard et al., 20 May 2025).
2. Dataset Compositions and Taxonomies
Three principal datasets have been published under the "FinAgentBench" name or closely related labels:
| Dataset | Tasks/Instances | Domains/Subdomains | Task Types |
|---|---|---|---|
| FinGAIA (aka FinAgentBench) (Zeng et al., 23 Jul 2025) | 407 | Seven (securities, funds, banking, insurance, futures, trusts, asset management) | Multi-modal, multi-step agent tasks across three scenario depths |
| Finance Agent Benchmark (Bigeard et al., 20 May 2025) | 537 | Nine (expert-defined research tasks on US-listed firms) | Retrieval, numerical/qualitative reasoning, modeling, analysis |
| FinAgentBench: Agentic Retrieval (Choi et al., 7 Aug 2025) | 3,429 | S&P 100 (initial release) | Multi-stage retrieval: document-type ranking, chunk-level passage selection |
Each dataset employs a taxonomy capturing key financial analysis practices. For example, Finance Agent Benchmark (Bigeard et al., 20 May 2025) distributes tasks across nine categories from simple quantitative retrieval to complex modeling and cross-company market analysis. FinGAIA (Zeng et al., 23 Jul 2025) structures its tasks into seven sub-domains and three complexity levels ("Basic Business Analysis," "Asset Decision Support," "Strategic Risk Management"), and agentic workflow elements (e.g., customer data analytics, portfolio allocation) are distributed systematically across these axes.
3. Task Formats, Tools, and Scenario Structures
FinAgentBench datasets share core characteristics regarding task format and agent interaction:
- Input structure: Each instance specifies a question prompt (which may include scenario depth, subdomain, and a case identifier), relevant attachments (PDFs, Excel files, images, audio files), and the expected output format (structured text, numerics, code results, or extracted passages); see the instance sketch at the end of this list.
- Tool ecosystem:
- FinGAIA requires agents to invoke browsers, file parsers (PDF, Excel, audio), and Python computational environments as part of the solution process (Zeng et al., 23 Jul 2025).
- Finance Agent Benchmark equips agents with GoogleSearch (SerpAPI), EdgarSearch (EDGAR filings), ParseHTML, and RetrieveInformation (for prompt augmentation and chunk retrieval) (Bigeard et al., 20 May 2025).
- Workflow design: Many tasks are multi-step, requiring agents to plan an action sequence that traverses multiple sources and tool outputs, sometimes chaining over a dozen logic steps (notably in "Strategic Risk Management" and "Financial Modeling" tasks).
- Hierarchical scenario depth: Tasks are stratified to reflect operational, decision, and strategic layers, with increasing requirements for integrating disparate modalities and tools.
- Expert curation: All datasets rely on domain specialists (e.g., ex-Goldman Sachs/J.P. Morgan associates) for question and answer validation, solution outline development, and multi-stage review. This ensures domain fidelity and scenario plausibility (Zeng et al., 23 Jul 2025, Bigeard et al., 20 May 2025).
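To make the input structure above concrete, a minimal sketch of how one such instance might be represented is given below. The `FinAgentTask` class, its field names, and the example values are illustrative assumptions that mirror the description above, not the released datasets' actual schemas.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FinAgentTask:
    """Hypothetical representation of a single benchmark instance.

    Field names are illustrative only; they mirror the input structure
    described above (prompt, attachments, expected output format), not
    the released datasets' actual schemas.
    """
    task_id: str                     # case identifier
    subdomain: str                   # e.g., "funds", "insurance"
    scenario_depth: str              # e.g., "Basic Business Analysis"
    prompt: str                      # expert-authored question
    attachments: List[str] = field(default_factory=list)  # paths to PDFs, Excel files, images, audio
    expected_format: str = "text"    # "text", "numeric", "code_result", or "passages"
    reference_answer: Optional[str] = None  # gold answer used for exact-match scoring


# Example: a hypothetical multi-modal task from the funds subdomain.
example = FinAgentTask(
    task_id="funds-017",
    subdomain="funds",
    scenario_depth="Asset Decision Support",
    prompt="Using the attached fact sheet, compute the fund's trailing 12-month expense ratio.",
    attachments=["factsheet_q2.pdf"],
    expected_format="numeric",
)
```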
4. Evaluation Protocols and Performance Metrics
FinAgentBench benchmarks employ rigorous, task-specific evaluation metrics to capture agent performance:
- FinGAIA (Zeng et al., 23 Jul 2025): Accuracy, computed as the share of exactly correct answers over all assessed tasks, is used with strict criteria: an answer must exactly match the standard answer in value, format, and semantics. Outputs that cannot be assessed (e.g., due to file-handling failures) are excluded from aggregate accuracy.
- Finance Agent Benchmark (Bigeard et al., 20 May 2025): Reports both Naive Accuracy (the proportion of exact matches over all tasks) and Class-Balanced Accuracy (the unweighted mean of per-category accuracies), avoiding skew from the uneven task distribution. Average time and cost per query are also tracked.
- FinAgentBench: Agentic Retrieval (Choi et al., 7 Aug 2025): Multi-stage ranking metrics for the retrieval subtasks (a computation sketch follows this list):
- nDCG@5: normalized Discounted Cumulative Gain at cutoff 5, reflecting graded relevance.
- MAP@5: Mean Average Precision at cutoff 5.
- MRR@5: Mean Reciprocal Rank at cutoff 5 for the highest-ranked relevant passage.
- For fine-grained chunk-level judgment, TREC-style labels (irrelevant, partially relevant, directly relevant) are assigned by dual annotators with adjudication.
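The ranking metrics above are standard; a minimal per-query sketch of how they can be computed from TREC-style graded labels follows. The function names and the exponential-gain DCG variant are implementation choices made here for illustration, not taken from the official evaluation code, which may differ in gain function, tie handling, and aggregation across queries.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[int], k: int) -> float:
    """Discounted cumulative gain with exponential gain (2^rel - 1)."""
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[int], k: int = 5) -> float:
    """nDCG@k for one query: DCG of the system ranking over the ideal DCG."""
    idcg = dcg_at_k(sorted(ranked_relevances, reverse=True), k)
    return dcg_at_k(ranked_relevances, k) / idcg if idcg > 0 else 0.0


def ap_at_k(ranked_relevances: Sequence[int], k: int = 5) -> float:
    """Average precision at k, binarizing graded labels (rel > 0 counts as relevant)."""
    hits, precision_sum = 0, 0.0
    for i, rel in enumerate(ranked_relevances[:k]):
        if rel > 0:
            hits += 1
            precision_sum += hits / (i + 1)
    total_relevant = sum(1 for rel in ranked_relevances if rel > 0)
    denom = min(total_relevant, k)
    return precision_sum / denom if denom > 0 else 0.0


def rr_at_k(ranked_relevances: Sequence[int], k: int = 5) -> float:
    """Reciprocal rank of the first relevant item within the top k (0 if none)."""
    for i, rel in enumerate(ranked_relevances[:k]):
        if rel > 0:
            return 1.0 / (i + 1)
    return 0.0


# Toy example: graded labels for the ranked chunks of one query
# (0 = irrelevant, 1 = partially relevant, 2 = directly relevant).
labels = [2, 0, 1, 0, 0, 1]
print(ndcg_at_k(labels), ap_at_k(labels), rr_at_k(labels))
```

MAP@5 and MRR@5 are then the means of `ap_at_k` and `rr_at_k` over all queries in the evaluation set.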
Performance results illustrate the current limitations of leading commercial LLMs, even with tool augmentation. For instance, in FinGAIA, the best-performing agent, ChatGPT (DeepResearch), achieved 48.9% overall accuracy, over 35 percentage points behind human experts (84.7%). In the retrieval-centric FinAgentBench, state-of-the-art LLMs (GPT-o3, Claude-Sonnet-4) scored nDCG@5 of up to 0.783 for document selection and 0.419 for passage selection; fine-tuning models on 10% of the training set yielded measurable but not decisive gains (Zeng et al., 23 Jul 2025, Choi et al., 7 Aug 2025).
5. Error Typology and Failure Patterns
Comprehensive error analyses have been performed on agent outputs:
- Key failure modes (FinGAIA) (Zeng et al., 23 Jul 2025):
- Cross-modal Alignment Deficiency: Inaccurate integration of multimodal inputs (image, web, PDF).
- Financial Terminological Bias: Systematic confusion of specialized terms (e.g., "Size factor" vs. "market capitalization factor").
- Operational Process Awareness Barrier: Misapplication or misunderstanding of financial workflow conventions or regulations.
- Hallucinatory Financial Reasoning: Generating unsupported or spurious financial content.
- Entity-Causation Misidentification: Faulty attribution of outcomes to entity features.
- Data Type Handling Error: Inability to process certain file types, leading to unscorable attempts.
- General observations: Even with advanced tool use, LLM-based agents routinely struggle with sequential inference, ambiguous reporting structures, and the dense, heavily regulated nature of financial disclosures. This suggests the need for architectural or prompting modifications that strengthen cross-modal reasoning and business-process modeling.
6. Data Accessibility and Best Practices
FinAgentBench datasets are distributed under open academic licenses, with partial example data and execution harnesses available via GitHub or platforms such as HuggingFace and Zenodo (Zeng et al., 23 Jul 2025, Bigeard et al., 20 May 2025). Typical practices for robust use and fair comparison include:
- Explicit separation of training, validation, and test splits to avoid contamination.
- Adoption or careful extension of rubric-based or grounded evaluation schemes.
- Logging of all agent tool actions and reasoning steps for analysis and replication (a minimal logging sketch follows this list).
- Disclosure of cost, time, and failure rates alongside raw accuracy to facilitate cost–benefit tradeoff analysis, especially given that Pareto improvements diminish at high cost per query (Bigeard et al., 20 May 2025).
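As an illustration of the logging practice above, the following sketch wraps agent tools so that every invocation is appended to a JSONL trace. The `ToolCallLogger` class, the file format, and the record fields are assumptions made for illustration and are not part of any released FinAgentBench harness.

```python
import json
import time
from typing import Any, Callable


class ToolCallLogger:
    """Records every tool invocation (name, arguments, result, latency) to a JSONL file.

    Illustrative pattern for the logging practice described above; not part of
    any released FinAgentBench harness.
    """

    def __init__(self, log_path: str):
        self.log_path = log_path

    def wrap(self, tool_name: str, tool_fn: Callable[..., Any]) -> Callable[..., Any]:
        def logged(*args: Any, **kwargs: Any) -> Any:
            start = time.time()
            result = tool_fn(*args, **kwargs)
            record = {
                "tool": tool_name,
                "args": [repr(a) for a in args],
                "kwargs": {k: repr(v) for k, v in kwargs.items()},
                "latency_s": round(time.time() - start, 3),
                "result_preview": repr(result)[:200],  # truncate large outputs
            }
            with open(self.log_path, "a", encoding="utf-8") as f:
                f.write(json.dumps(record) + "\n")
            return result
        return logged


# Usage: wrap whatever search/parse tools the agent exposes.
logger = ToolCallLogger("agent_trace.jsonl")
search = logger.wrap("web_search", lambda query: f"results for {query}")  # placeholder tool
search("10-K risk factors for S&P 100 constituents")
```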
7. Research Applications and Future Directions
FinAgentBench benchmarks are positioned as objective, extensible platforms for:
- Pre-training and fine-tuning LLMs on tool-augmented, finance-specific reasoning and retrieval.
- Advancing research in agentic workflows, multi-step tool reasoning, and cross-modal alignment.
- Multi-agent collaboration paradigms, where specialized agents coordinate on financial task decompositions.
- Expanding to new financial domains (decentralized finance, ESG analytics), document types (credit agreements, prospectuses), or geographies.
- Coupling retrieval and answer-generation into end-to-end agentic pipelines; integrating dynamic indexing, adaptive chunking, or retrieval-augmented generation.
- Evaluating online learning and self-improving agent architectures under realistic market or regulatory conditions (Zeng et al., 23 Jul 2025, Choi et al., 7 Aug 2025, Bigeard et al., 20 May 2025).
These directions reflect a consensus that the current generation of LLM agents—despite substantial progress—remains less reliable than human experts for professional finance applications, with overall accuracy below 50% and clear algorithmic and workflow limitations. Continuous benchmarking using FinAgentBench datasets is expected to play a central role in closing this performance gap.