BankerToolBench (BTB) Benchmark

Updated 19 April 2026

BankerToolBench (BTB) is a comprehensive benchmark that evaluates AI agents across multi-step, multi-artifact investment banking workflows using authentic junior banker tasks.
It employs automated evaluations with over 100 expert-crafted rubric criteria, ensuring technical accuracy, client-readiness, and regulatory alignment.
Empirical results highlight a significant performance gap in current models, guiding future research toward robust agentic AI implementation in professional banking.

BankerToolBench (BTB) defines a comprehensive, open-source benchmark for the evaluation of AI agents in end-to-end investment banking workflows. Developed with input from 502 investment banking professionals, BTB addresses the absence of rigorous, ecologically valid benchmarks for high-stakes, multi-artifact professional tasks. It operationalizes authentic junior banker workflows, requiring agents to execute senior banker requests through data gathering, analytical modeling, tool use, and the preparation of multi-file deliverables (Excel, PowerPoint, PDF, and Word). Automated agent assessment leverages over 100 rubric criteria per task, capturing technical accuracy, presentation quality, consistency, regulatory alignment, and overall client-readiness. Empirical results demonstrate the fundamental performance gap of current frontier models, identifying critical obstacles and guiding future research in agentic AI for professional services (Lau et al., 13 Apr 2026).

1. Motivation and Benchmark Design Principles

Existing LLM benchmarks, such as FinQA, TAT-QA, GDPVal, and APEX-Agents, typically restrict their evaluation scope to isolated capabilities—question-answering, retrieval, basic tool use, or single-artifact outputs. In contrast, junior investment bankers undertake multi-step, multi-artifact workflows (e.g., valuation modeling, pitchbook generation, memo drafting) that frequently require up to 21 labor hours per assignment and can entail economic risk on the order of hundreds of millions of dollars. No prior benchmark captures this end-to-end workflow fidelity, including professional conventions and requirements for client-readiness evaluation.

BTB adopts four core design goals:

Ecological validity through task, data, and tool environments that reflect authentic junior banker responsibilities.
End-to-end scope from interpreting senior banker instructions to generating and assembling multi-file deliverables.
Fine-grained, expert-crafted rubrics comprising no fewer than 100 criteria per task, enabling nuanced, partial-credit scoring aligned with genuine client-ready standards.
Automated, reproducible evaluation utilizing an agentic verifier framework.

Benchmark development involved extensive collaboration: surveys with 193 investment banking practitioners across all product groups (M&A, LevFin, ECM, DCM), deep-dive interviews with eight senior bankers to refine the taxonomy, and 172 bankers (mean 3.4 years’ experience) devoting 5,700+ hours to authoring, reviewing, and grading 100 benchmark tasks.

The simulated “data room” environment provides static files (PDFs, PowerPoint, Excel, images) mirroring real deal documentation and supports Model Context Protocol (MCP) tools, such as a FactSet/CapIQ-like market data platform, an SEC EDGAR API, and a Company Profile API. Historical cutoff dates enforce reproducibility and prohibit internet access.
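The effect of a historical cutoff date can be illustrated with a minimal sketch. It assumes a hypothetical `Filing` record and a per-task cutoff; neither the record layout nor the function name comes from BTB itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Filing:
    """Hypothetical record for a document returned by a retrieval tool."""
    ticker: str
    form_type: str
    filed_on: date


def visible_filings(filings, cutoff):
    """Keep only filings dated on or before the task's cutoff date."""
    return [f for f in filings if f.filed_on <= cutoff]


# Illustrative data only; not taken from the benchmark.
filings = [
    Filing("ACME", "10-K", date(2024, 2, 15)),
    Filing("ACME", "10-Q", date(2024, 5, 3)),
    Filing("ACME", "8-K", date(2024, 9, 20)),
]
cutoff = date(2024, 6, 30)  # assumed per-task cutoff
visible = visible_filings(filings, cutoff)
print([f.form_type for f in visible])  # ['10-K', '10-Q']
```

Filtering at the tool boundary means every run of a task sees the same information set, regardless of when the run happens.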

2. Task Suite Structure and Workflow Specification

BTB features 100 tasks, stratified by investment banking product group and workflow type. Task categories reflect the market distribution of authentic junior banker activities:

Product Group	% of Tasks
Mergers & Acquisitions (M&A)	62%
Leveraged Finance (LevFin)	19%
Equity Capital Markets (ECM)	10%
Debt Capital Markets (DCM)	6%
Hybrid (M&A+LevFin)	3%

Workflows encompass financial modeling and scenario analysis (37%), valuation and pricing (30%), client/marketing material preparation (27%), with additional coverage of market analysis, process management, and aftermarket performance.

Representative subcategories include Discounted Cash Flow models (18 tasks), Leveraged Buyout/Credit models (15), Trading Comparables (10), Pitchbooks (8), and Merger Models (6). Inputs span data room files (e.g., PDFs for CIMs, regulatory filings, legacy decks), and live-style MCP retrieval calls (e.g., mcp.get_filing(…)).

The canonical workflow for a BTB task includes:

Identification and acquisition of relevant files or API-based retrievals.
Data extraction and transformation (pulling the relevant figures out of tabular or unstructured sources such as PDF and Excel files).
Creation or adaptation of financial model templates using libraries such as openpyxl, python-pptx, and python-docx.
Precise population of formulas, code blocks, forward projections, ratios, and sensitivities.
Automated spreadsheet recalculation (typically via headless LibreOffice).
Assembly of deliverables: multi-tab Excel models, PowerPoint pitch decks, and PDF/Word reports.
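The recalculation step in the workflow above might look roughly like the following sketch. It uses only LibreOffice's `--headless`, `--convert-to`, and `--outdir` command-line options. The function names are illustrative, and whether formulas are actually recomputed on load depends on the LibreOffice recalculation settings, so treat this as an assumption to verify rather than as BTB's own harness code.

```python
import shutil
import subprocess
from pathlib import Path


def recalc_command(workbook, outdir, binary="soffice"):
    """Build a headless LibreOffice command that re-saves a workbook as xlsx."""
    return [
        binary, "--headless",
        "--convert-to", "xlsx",
        "--outdir", str(outdir),
        str(workbook),
    ]


def recalculate(workbook, outdir, timeout_s=120):
    """Run the conversion if LibreOffice is installed; return the output path.

    Returns None when no LibreOffice binary is found on PATH.
    """
    binary = shutil.which("soffice") or shutil.which("libreoffice")
    if binary is None:
        return None
    Path(outdir).mkdir(parents=True, exist_ok=True)
    subprocess.run(recalc_command(workbook, outdir, binary),
                   check=True, timeout=timeout_s)
    return Path(outdir) / (Path(workbook).stem + ".xlsx")


# Show the command without executing it.
print(" ".join(recalc_command("model.xlsx", "recalculated")))
```

Writing the output to a separate directory keeps the agent's original file intact, which helps when comparing stored and recomputed values.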

Deliverables must satisfy professional formatting, layout, and branding conventions, and must expose their code, formulas, and key assumptions transparently so that they can be audited.

3. Rubric and Scoring Methodology

BTB employs an automated, task-specific rubric for each benchmark item, consisting of 100–200 binary criteria indexed by $i \in \{1, \dots, N\}$, classified into:

Technical Correctness (methodology, model integrity)
Client Readiness & Presentation
Instruction Following
Transparency & Auditability
Internal Consistency
Risk & Compliance

Rubric items are weighted $w_i \in \{1,\,3,\,5,\,10\}$, reflecting domain experts’ view of their criticality (with “critical” items such as correct EPS calculations or balance sheet integrity having the highest weight).

Score for task $t$:

$$\mathrm{Score}_t = \frac{\sum_{i=1}^N w_i\,p_i}{\sum_{i=1}^N w_i} \times 100,\quad p_i\in\{0,1\}$$

Here $p_i = 1$ if criterion $i$ is satisfied and $p_i = 0$ otherwise. The overall benchmark score for an agent is the mean $\mathrm{Score}_t$ across all 100 tasks. Passing a task is defined as achieving $\mathrm{Score}_t \ge 80$ (that is, 80%), consistent with human “needs light edits” or “sendable” grading.
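The weighted score can be sketched in a few lines of Python. The `Criterion` class, the function names, and the sample rubric are illustrative assumptions; only the weight set, the formula, and the 80-point pass threshold come from the description above.

```python
from dataclasses import dataclass

ALLOWED_WEIGHTS = {1, 3, 5, 10}
PASS_THRESHOLD = 80.0


@dataclass(frozen=True)
class Criterion:
    """One binary rubric item (hypothetical structure)."""
    name: str
    weight: int
    passed: bool


def task_score(criteria):
    """Weighted share of satisfied criteria, on a 0-100 scale."""
    for c in criteria:
        if c.weight not in ALLOWED_WEIGHTS:
            raise ValueError(f"unexpected weight {c.weight} for {c.name!r}")
    total = sum(c.weight for c in criteria)
    if total == 0:
        raise ValueError("rubric has no criteria")
    earned = sum(c.weight for c in criteria if c.passed)
    return 100.0 * earned / total


def benchmark_score(task_scores):
    """Mean task score across all tasks."""
    return sum(task_scores) / len(task_scores)


# Toy rubric with four items; real rubrics have 100-200.
rubric = [
    Criterion("EPS accretion computed correctly", 10, False),
    Criterion("Balance sheet balances", 10, True),
    Criterion("Sources cited on each slide", 3, True),
    Criterion("Consistent number formatting", 1, True),
]
score = task_score(rubric)
passed = score >= PASS_THRESHOLD
print(f"{score:.1f} passed={passed}")  # 58.3 passed=False
```

In this toy rubric, missing one weight-10 item leaves the task at 14 of 24 weighted points, well below the pass threshold even though three of four criteria are met.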

4. Empirical Performance Evaluation

Nine advanced models were evaluated under uniform infrastructure via the OpenCode agent harness. Model performance, averaged over three runs, is summarized as:

Model	Mean Score (± std)	Pass@1 (%)	Pass@3 (%)
GPT-5.4	52% ± 2%	16	22
GPT-5.2	49% ± 1.5%	-	-
Gemini 3.1	46% ± 2%	-	-
Claude 4.6	43% ± 1%	-	-
Others	30–40%	-	-

Bankers rated none of GPT-5.4’s outputs as “sendable as-is.” In pairwise comparisons, GPT-5.4 outperformed GPT-5.2 on 70% of tasks, and GPT-5.2 outperformed Grok 4 on 98%. These findings indicate that even frontier models are not yet adequate for full end-to-end workflow delegation in professional investment banking contexts.
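The summary statistics in the table could be derived from per-run task scores along the following lines. The definitions used here are assumptions, since the text does not spell them out: Pass@1 is taken as the average per-run pass rate, Pass@3 as the share of tasks passed in at least one of three runs, and the spread as the sample standard deviation of the per-run mean scores. The task names and scores are made up for illustration.

```python
from statistics import mean, stdev

PASS_THRESHOLD = 80.0


def summarize(scores_by_task):
    """Aggregate {task: [score per run]} into mean, std, Pass@1, and Pass@k."""
    runs = list(scores_by_task.values())
    n_runs = len(runs[0])
    if any(len(r) != n_runs for r in runs):
        raise ValueError("every task needs the same number of runs")

    # Mean benchmark score per run, then mean and spread across runs.
    run_means = [mean(r[k] for r in runs) for k in range(n_runs)]

    # Assumed Pass@1: fraction of passing runs, averaged over tasks.
    pass_at_1 = sum(
        sum(s >= PASS_THRESHOLD for s in r) / n_runs for r in runs
    ) / len(runs)

    # Assumed Pass@k: task counts as passed if any of its k runs passes.
    pass_at_k = sum(any(s >= PASS_THRESHOLD for s in r) for r in runs) / len(runs)

    return {
        "mean": mean(run_means),
        "std": stdev(run_means) if n_runs > 1 else 0.0,
        "pass_at_1": pass_at_1,
        "pass_at_k": pass_at_k,
    }


# Illustrative scores for four hypothetical tasks over three runs.
scores = {
    "dcf_01": [85, 78, 82],
    "lbo_02": [60, 55, 70],
    "comps_03": [79, 81, 77],
    "pitch_04": [40, 45, 50],
}
summary = summarize(scores)
print(summary["pass_at_1"], summary["pass_at_k"])  # 0.25 0.5
```

Under these definitions Pass@3 can never be lower than Pass@1, which matches the ordering of the two reported columns.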

5. Failure Analysis and Pathways for Improvement

A rubric-level breakdown shows GPT-5.4’s approximate per-category performance (rounded):

Technical Correctness: ~57%
Internal Consistency: ~66%
Client Readiness: ~63%
Risk & Compliance: ~46%
Instruction Following / Auditability: ~40–50%

Major failure modes include:

Code & Formula Generation (41%): hallucinated or unsupported APIs, superficial error recovery, omission of real formulas, hard-coded numbers in place of dynamic calculations.
Reasoning & Logic (27%): application of valuation techniques to inappropriate scenarios (e.g., P/E multiples for loss-making firms), misclassified line items.
Retrieval & Persistence (18%): repeated use of non-functional API calls, ignoring available hints for alternative retrieval strategies.
Grounding & Fabrication (13%): generation of hypothetical results without basis in provided documentation.

Case studies highlight balance sheet reconciliation errors, inconsistent reporting of key figures across deliverables, and failure to adhere to client branding specifications (e.g., incorrect color palettes).
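Inconsistent key figures across deliverables are the kind of error a simple automated check can surface. The sketch below assumes the figures have already been extracted from each file into plain dictionaries. The file names, metric names, and tolerance are hypothetical, and this is not the BTB verifier.

```python
def find_inconsistencies(figures_by_artifact, rel_tol=0.005):
    """Return metrics whose values differ across artifacts by more than rel_tol.

    figures_by_artifact maps artifact name -> {metric name: numeric value}.
    """
    by_metric = {}
    for artifact, figures in figures_by_artifact.items():
        for metric, value in figures.items():
            by_metric.setdefault(metric, {})[artifact] = value

    issues = {}
    for metric, values in by_metric.items():
        lo, hi = min(values.values()), max(values.values())
        scale = max(abs(lo), abs(hi))
        # Flag the metric when the spread exceeds the relative tolerance.
        if scale > 0 and (hi - lo) / scale > rel_tol:
            issues[metric] = values
    return issues


# Illustrative extracted figures (USD millions), not benchmark data.
extracted = {
    "model.xlsx": {"enterprise_value": 1250.0, "ebitda": 142.3},
    "pitch.pptx": {"enterprise_value": 1250.0, "ebitda": 148.0},
    "memo.docx": {"enterprise_value": 1251.0},
}
issues = find_inconsistencies(extracted)
print(issues)  # {'ebitda': {'model.xlsx': 142.3, 'pitch.pptx': 148.0}}
```

A relative tolerance lets small rounding differences pass (1250 versus 1251) while still flagging material disagreements such as the EBITDA mismatch.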

Improvement strategies validated in BTB include:

Incorporating domain knowledge: prompt engineering with banker-specific conventions yields score improvements of 10–20 points.
Enhanced code generation: more robust API utilization and better test coverage for scripting.
Cross-artifact consistency checks: real-time audits across Excel, PowerPoint, and textual components.
Persistent retrieval strategies: orchestration logic that systematically backtracks and retries after data acquisition failures.
Human-in-the-loop hybrids: flagging non-robust outputs for human verification.
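The persistent retrieval idea from the list above can be sketched as an ordered list of strategies with bounded retries. Everything here is an illustrative assumption: the function name, the retry policy, and the two stand-in fetchers are hypothetical and do not correspond to real BTB tools.

```python
import time


def retrieve_with_fallback(strategies, max_attempts=2, delay_s=0.0):
    """Try each (name, fetch) strategy in order, retrying before moving on.

    Returns (strategy name, payload, error log). Raises LookupError if every
    strategy fails on every attempt.
    """
    errors = []
    for name, fetch in strategies:
        for attempt in range(1, max_attempts + 1):
            try:
                return name, fetch(), errors
            except Exception as exc:  # record the failure and keep going
                errors.append((name, attempt, repr(exc)))
                if delay_s:
                    time.sleep(delay_s)
    raise LookupError(f"all retrieval strategies failed: {errors}")


# Hypothetical stand-ins for a failing tool call and a data-room fallback.
def market_data_tool():
    raise RuntimeError("tool returned an error")


def data_room_pdf():
    return {"revenue_fy2023": 412.0}


source, payload, errors = retrieve_with_fallback(
    [("market_data_tool", market_data_tool), ("data_room_pdf", data_room_pdf)]
)
print(source, payload, len(errors))  # data_room_pdf {'revenue_fy2023': 412.0} 2
```

Capping attempts per strategy addresses the failure mode of repeating a non-functional call indefinitely, and the error log keeps a trace that a human reviewer can audit.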

6. Economic and Practical Significance

BTB tasks require a median of 5 hours (maximum 21 hours) of experienced junior banker labor per assignment. Effective AI delegation thus promises substantial time and labor savings, with potential for hundreds of cumulative hours recovered across typical weekly task loads.

Market surveys indicate bankers’ willingness to pay $101–$500+/month for highly reliable automation, with managing directors at elite institutions assigning valuations of $50,000–$96,000/year to such assistants.

Given annual global investment banking fees exceeding $140 billion, even incremental productivity gains achieved through reliable AI delegation have substantial economic leverage.

BTB exposes the persistent reliability deficit preventing full delegation of investment banking workflows to contemporary agentic LLMs. Future production-grade assistants must integrate advanced orchestration, rigorous code validation, persistent cross-artifact auditing, and domain-specialized training. BTB thereby defines a research target that advances the intersection of agentic AI, high-stakes knowledge work, and reproducible, automatable professional assessment (Lau et al., 13 Apr 2026).

References

Lau et al. (13 Apr 2026). BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows.
