BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
Abstract: Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables--including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI. BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility. Testing 9 frontier models, we find that even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria and bankers rate 0% of its outputs as client-ready. Our failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.
Top Community Prompts
Explain it Like I'm 14
A simple guide to “BankerToolBench: Evaluating AI Agents in End‑to‑End Investment Banking Workflows”
What is this paper about?
This paper introduces BankerToolBench (BTB), a big, realistic “test” for AI systems to see if they can do the kinds of tasks junior investment bankers do from start to finish. Instead of just answering questions, AIs have to gather real information, build spreadsheets, make slides, and write reports—just like a person would on the job.
Think of it like a full school project: not just answering a quiz, but researching, doing calculations, making a slideshow, and turning in everything neatly and correctly.
What questions were the researchers trying to answer?
- Can today’s top AIs actually do complex, professional work—not just short answers—on their own?
- If we ask an AI to complete real banking tasks (like building a financial model and a pitch deck), how good are the results?
- Where do AIs mess up most, and how can we improve them?
How did they study it?
The team built a realistic testing setup with the help of 502 real investment bankers from top firms. They designed 100 end-to-end tasks that junior bankers regularly do.
Here’s how the test works, in everyday terms:
- The AI gets a realistic “boss request” (the prompt), like “build a company valuation and make a client-ready slide deck.”
- The AI works inside a locked-down “sandbox computer” with:
- A “data room” (like a digital folder of real files)
- Tools similar to what bankers use (for example, getting stock prices or reading SEC filings)
- Common software to make Excel spreadsheets, PowerPoint slides, and Word/PDF reports
- The AI must produce real files—often multiple files that must agree with each other.
- The results are graded with detailed checklists (rubrics) written by veteran bankers.
To grade fairly and consistently, they built an automated “referee” agent called Gandalf:
- It opens the files, runs code when needed, checks formulas in Excel, looks for specific numbers, and verifies formatting and consistency across files.
- It uses a pass/fail checklist with weights for importance (e.g., “critical” issues matter most); a small scoring sketch follows this list.
- Its grades match human banker judgments closely, so it’s a reliable grader.
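If you like code, here is a tiny illustration of how a weighted pass/fail checklist turns into a score (the criteria and weights below are made up; the real Gandalf is a full agent that opens files and runs code, not a single function):

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    name: str
    weight: int      # hypothetical scale: 1 = minor ... 10 = critical
    passed: bool

def rubric_score(criteria: list[Criterion]) -> float:
    """Weighted fraction of rubric criteria passed, from 0.0 to 1.0."""
    total = sum(c.weight for c in criteria)
    earned = sum(c.weight for c in criteria if c.passed)
    return earned / total if total else 0.0

# Made-up grading of one deliverable
criteria = [
    Criterion("Valuation uses live formulas, not typed-in numbers", 10, False),
    Criterion("Share counts match between the model and the deck", 10, True),
    Criterion("Tabs follow the requested naming convention", 3, True),
    Criterion("Data sources are cited in footnotes", 5, True),
]
print(f"Score: {rubric_score(criteria):.0%}")  # prints "Score: 64%"
```

Notice how one failed “critical” item drags the score down far more than a formatting slip would; that mirrors how a senior banker weighs mistakes.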
The researchers then ran 9 top AI models through the same setup to compare performance.
What did they find?
- Even the best AI struggled. The top model (called GPT‑5.4 in the paper) still failed nearly half of the grading criteria on average.
- Bankers rated zero of the AI-produced deliverables as truly client‑ready.
- Only a small fraction of tasks met a reasonable quality bar (the best model passed about 16% of tasks by a more lenient standard).
- AIs often tripped over “human” details that matter in real work:
- Keeping numbers, names, and facts consistent across many files and tabs
- Using correct, auditable formulas (not just typing in numbers)
- Following professional formatting and naming conventions
- Clearly showing where data came from and what assumptions were made
- Including proper disclaimers and risk notes
Why this is important: A single banking task can take a human up to 21 hours. If AI could reliably do this, it would save huge amounts of time and money. But today’s AIs aren’t consistently reliable enough for high‑stakes client work.
Why does this matter, and what’s the impact?
- For businesses: Don’t expect current AIs to replace junior bankers yet. They can help with parts (like drafting or data lookup), but full handoff is risky.
- For AI developers: BTB shows exactly where AIs fail in real workflows—especially cross‑file consistency, correct formulas, auditability, and “client polish.” This gives a clear roadmap for improvements.
- For researchers: BTB is open-source, realistic, and repeatable. It moves beyond toy tests to measure whether an AI can deliver complete, usable work products.
- For the economy: As AIs get better at these complex tasks, the potential productivity gains are large. BTB helps track real progress toward that goal, not just scores on simple benchmarks.
In short: The paper builds a tough, realistic test for AI in investment banking. Today’s AIs can’t consistently meet professional standards for end-to-end tasks, but this benchmark shows where to focus so future AIs can become truly useful for high‑skill jobs.
Knowledge Gaps
Knowledge gaps, limitations, and open questions
Below is a consolidated list of what remains missing, uncertain, or unexplored in the paper, stated concretely so future work can address them.
- External validity beyond U.S. public-company contexts is untested:
- Tasks, tools, and rubrics rely on SEC/EDGAR and U.S.-centric conventions. How well does performance transfer to IFRS regimes, non-U.S. regulatory filings (e.g., SEDAR, Companies House), and non-English deliverables?
- Private-company workflows (limited disclosures, sparse data) are minimally represented.
- Task distribution representativeness and coverage:
- Benchmark skews to M&A (62%) with small-n subcategories (1–3% each); statistical reliability for rare subcategories is unclear. How sensitive are model rankings to this mix?
- Many high-value IB tasks (e.g., live deal iteration, diligence coordination, client Q&A prep, negotiation strategy) are not modeled.
- Realism of software/tooling environment:
- Environment uses LibreOffice and Python libraries rather than Microsoft Office; no VBA/macro support, add-ins, or bank-specific templates. How much does this misalign with real banker workflows (e.g., Excel Data Tables, circular references, firm-standard macros)?
- Only three MCP tools are provided (market data, EDGAR, company profile). How would results change with richer ecosystems (e.g., Bloomberg, Capital IQ, PitchBook, SNL, Reorg, Refinitiv), internal knowledge bases, and asset libraries?
- Visual fidelity of PowerPoint outputs (fonts, templates, color palettes, brand lockups) is not fully evaluated under enterprise Office rendering.
- Evaluation and verifier robustness:
- A single agent-as-judge (Gandalf) with one underlying LLM (Gemini 3 Flash Preview) is used. Cross-judge robustness, cross-provider bias, and ensemble-judge agreement are not reported.
- Susceptibility to reward hacking/Goodhart effects is untested: can agents game rubrics (formatting, naming) without improving true client readiness?
- Adversarial inputs and file pathologies (linked workbooks, hidden sheets, external references, protected cells, corrupted files) are not stress-tested.
- No analysis of false positives/negatives by criterion category (e.g., formula integrity vs presentation polish) or failure modes specific to spreadsheets vs slides vs PDFs.
- Rubric design and validity:
- Criterion weights (1/3/5/10) and category allocations are based on expert judgment; no sensitivity analysis shows whether model rankings are stable to different weights (a sketch of such an analysis follows this list).
- Cross-firm convention diversity is not modeled (banks differ on tab naming, color coding, disclaimer standards, KPI definitions). How portable are rubrics across firms and sectors?
- Alternative valid methodologies (e.g., valuation choices, normalizations, treatment of non-GAAP metrics) may be penalized. How tolerant are rubrics to justified deviations?
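One concrete way to probe this gap, sketched below with hypothetical data (not from the paper): randomly perturb the criterion weights and measure how often the induced model ranking changes.

```python
import random

def weighted_score(passes: list[bool], weights: list[float]) -> float:
    """Weighted fraction of criteria passed."""
    return sum(w for p, w in zip(passes, weights) if p) / sum(weights)

def ranking(models: dict[str, list[bool]], weights: list[float]) -> list[str]:
    return sorted(models, key=lambda m: weighted_score(models[m], weights), reverse=True)

# Hypothetical per-criterion pass/fail outcomes for three models on one rubric
weights = [10, 10, 5, 3, 3, 1]
models = {
    "model_a": [True, True, False, True, False, True],
    "model_b": [True, False, True, True, True, True],
    "model_c": [False, True, True, False, True, False],
}

random.seed(0)
baseline = ranking(models, weights)
flips = sum(
    ranking(models, [w * random.uniform(0.5, 2.0) for w in weights]) != baseline
    for _ in range(1000)
)
print(f"Ranking changed in {flips / 1000:.1%} of weight perturbations")
```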
- Human baselines and ceilings:
- Human-authored “expected deliverables” are not graded to establish human Score distributions. What is the human performance ceiling, interquartile range, and inter-rater spread by task type?
- Time/error trade-offs for humans vs agents and mixed-initiative setups (co-pilot workflows) are not quantified.
- Economic metrics and feasibility:
- Cost, latency, and resource usage per task are not systematically reported (a single run can involve hundreds of LLM calls, e.g., 539); ROI relative to human hours and compute costs remains unknown.
- Pass/Score thresholds mapping to “client-ready” are derived from an internal study; external validation across banks and MD-level standards is not presented.
- Generalization and transfer:
- Robustness to market regime shifts (e.g., zero-rate era vs tightening cycles), differing sector structures (regulated industries, financials/insurance), and cross-cycle assumptions is not studied.
- OOD resilience to missing, lagged, noisy, or conflicting source data is unmeasured.
- Task process realism:
- Multi-actor workflows (analyst–associate–VP–MD feedback cycles), redlines, comment resolution, and rapid “turns” under time pressure are not simulated.
- Communication artifacts (emails, Slack threads), meeting pre-reads, and verbal ambiguity negotiation are absent.
- Internet and knowledge access:
- Internet is disabled; many real tasks require web research, press releases, primary sources, and vendor portals. How does enabling safe web access change outcomes and failure modes?
- Dataset governance and leakage:
- Open rubrics and sample expected deliverables introduce risks of overfitting/benchmark gaming; no hidden test sets or periodic refresh strategy is described.
- Judge–agent provider coupling is partly mitigated (Gemini as judge), but training-time exposure to rubric patterns is not controlled; contamination risks are not audited.
- Reliability and reproducibility:
- MCP data fidelity and versioning are not documented (e.g., if the API schema or backfilled data change). How reproducible are scores across time and environment versions?
- Library/version sensitivity (pandas/openpyxl/libreoffice) and sandbox determinism are not benchmarked.
- Failure mode taxonomy depth:
- The paper notes cross-artifact consistency breakdowns but does not release a fine-grained, labeled taxonomy of error types or per-category error rates to guide targeted model improvements.
- Agent harness dependence:
- Only a single baseline harness is evaluated; systematic ablations on planning, memory, tool-use strategies, and self-verification loops are not reported. How large is the harness effect relative to the base model?
- Safety and compliance evaluation:
- Although “Risk & Compliance” criteria exist, there is no dedicated evaluation of hallucinated legal claims, inappropriate comparative statements, or misguiding financial advice risks and mitigations.
- Internationalization and accessibility:
- Non-English tasks, region-specific formats (e.g., decimal/thousand separators, date conventions), and localization of disclosures and legal language are not included.
- Benchmark scope and maintenance:
- At 100 tasks, coverage of the long tail of workflows is limited; update cadence, task refresh processes, and sunset policies for stale tasks are unspecified.
- Ethical and workforce impacts:
- The broader implications of automating junior IB workflows, skill erosion, and oversight frameworks are not addressed.
- Practical deployment questions:
- How to integrate agent outputs into bank IT (DLP controls, model risk management, audit trails) is unexamined.
- How well do agents interoperate with proprietary data entitlements and compliance constraints (e.g., restricted lists, MNPI handling)?
- Post-training to the verifier:
- While the verifier is “good enough for RL,” no empirical results show that optimizing to Gandalf improves human-rated client readiness without overfitting to rubric artifacts.
- Multimodal evaluation depth:
- Grading of charts/tables/visuals emphasizes structural checks; calibration to perceived clarity, information density, and executive readability is not validated with blinded human panels.
- Statistical power and reporting:
- Confidence intervals for overall model rankings, per-category scores, and task-subset analyses (e.g., LBO vs DCF) are not reported; small-sample variance for rare subcategories is unclear.
These gaps suggest concrete avenues for future work: add non-U.S./private-company tasks; integrate Microsoft Office and VBA; expand/validate rubrics across firms and jurisdictions; ensemble multiple verifiers and conduct sensitivity analyses; establish hidden test sets; report human baselines, cost/latency; simulate multi-actor workflows and feedback cycles; enable safe web tools; and publish a detailed error taxonomy with per-category metrics.
Practical Applications
Immediate Applications
The following applications can be deployed today using the released benchmark, verifier, and environment. They translate BTB’s tasks, rubrics, and agentic verifier into concrete workflows for industry, academia, policy, and education.
- Benchmark-driven vendor selection and procurement (Finance, Software/AI)
- Use BTB as an internal “fitness test” to compare LLMs/agents for investment banking tasks before purchase or deployment, using pass thresholds and rubric-weighted scores aligned to client-readiness.
- Tools/workflows: Harbor environment for reproducible runs; Gandalf agent-as-a-judge for automated grading; CI dashboards for model score tracking.
- Assumptions/dependencies: Compute/ops capacity to run BTB; adaptation of rubrics to firm conventions; sandboxed environment (no internet) consistent with BTB’s historical data assumption.
- Human-in-the-loop QA gates for client deliverables (Finance)
- Integrate Gandalf to auto-check Excel/PPT/Word artifacts (formulas, cross-file consistency, rubric criteria) prior to sending to VPs/MDs/clients; flag critical failures (e.g., missing disclosures, broken formulas).
- Tools/workflows: Gandalf integrated with document management systems; Office automation via openpyxl/python-pptx; rubric-driven “deliverable linter” (sketch below).
- Assumptions/dependencies: Legal/compliance approval for automated checks; adaptation to firm formatting/naming standards; accurate rubric mapping to teams’ workflows.
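A minimal sketch of such a “deliverable linter” using openpyxl; the tab-naming convention and checks below are hypothetical stand-ins for firm-specific rubric criteria:

```python
import openpyxl

def lint_workbook(path: str) -> list[str]:
    """Flag hard-coded numbers on calculation tabs and missing disclosure tabs."""
    issues = []
    wb = openpyxl.load_workbook(path)  # default mode keeps formulas as "=..." strings
    for ws in wb.worksheets:
        if not ws.title.lower().startswith("dcf"):  # hypothetical tab convention
            continue
        for row in ws.iter_rows():
            for cell in row:
                # Heuristic: a numeric value here is a typed-in number, not a
                # formula. A real gate would whitelist designated input cells.
                if isinstance(cell.value, (int, float)):
                    issues.append(f"{ws.title}!{cell.coordinate}: hard-coded "
                                  f"{cell.value!r} where a formula may be expected")
    if "Assumptions" not in wb.sheetnames:
        issues.append("Missing 'Assumptions' tab (assumption disclosure)")
    return issues

for issue in lint_workbook("lbo_model.xlsx"):  # hypothetical deliverable
    print("FLAG:", issue)
```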
- Analyst onboarding and upskilling with automated feedback (Finance, Education)
- Use BTB tasks and expected deliverable exemplars as training labs; Gandalf provides granular, rubric-aligned feedback on Excel models, pitchbooks, and memos.
- Tools/workflows: Internal training portals with BTB tasks; automated scoring and feedback loops; progress dashboards by rubric category (e.g., Technical Correctness vs. Client Readiness).
- Assumptions/dependencies: Customization to firm-specific conventions; time allocation for training; governance of scored assessments.
- Continuous evaluation/regression testing for internal copilots (Finance, Software/AI)
- Treat BTB as a standardized test suite to catch performance regressions when updating model versions, tools, or prompts for internal AI assistants.
- Tools/workflows: Harbor-based CI pipelines; pass@k thresholds (estimator sketch below); alerts tied to high-importance rubric criteria (critical/major).
- Assumptions/dependencies: Stable compute budget; version control around prompts/agents; disciplined MLOps.
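A sketch of such a regression gate using the standard unbiased pass@k estimator; the per-task results and threshold below are hypothetical:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: chance that at least one of k sampled attempts
    passes, given c observed passes across n total attempts."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Hypothetical per-task results from a nightly run: (attempts, passes)
results = [(5, 2), (5, 0), (5, 4), (5, 1)]
score = sum(pass_at_k(n, c, k=1) for n, c in results) / len(results)

THRESHOLD = 0.30  # hypothetical gate for this model/prompt version
print(f"pass@1 = {score:.2f}")
assert score >= THRESHOLD, "Regression: pass@1 fell below the CI gate"
```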
- Targeted model improvement via BTB-guided post-training (Software/AI, Research)
- Use Gandalf’s pass/fail signals as a reward for RL post-training on investment banking workflows; prioritize failure modes (e.g., cross-artifact consistency) revealed by BTB.
- Tools/workflows: RL pipelines; reward shaping by rubric category (sketch below); ablations on memory, tool-use reliability, formula generation.
- Assumptions/dependencies: Verifier accuracy ~88% (sufficient but imperfect); guard against reward hacking; compute and data budgets.
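A minimal sketch of reward shaping by rubric category; the category names and multipliers below are hypothetical, and the paper does not prescribe a particular shaping scheme:

```python
# Hypothetical category emphasis: upweight known failure modes such as
# cross-artifact consistency while keeping the verifier's pass/fail signal.
CATEGORY_BONUS = {
    "technical_correctness": 1.0,
    "cross_artifact_consistency": 2.0,  # prioritized failure mode
    "client_readiness": 1.0,
    "risk_and_compliance": 1.5,
}

def shaped_reward(verdicts: list[dict]) -> float:
    """verdicts: [{'category': str, 'weight': int, 'passed': bool}, ...]
    Returns a weighted, category-shaped score in [0, 1]."""
    total = sum(v["weight"] * CATEGORY_BONUS[v["category"]] for v in verdicts)
    earned = sum(v["weight"] * CATEGORY_BONUS[v["category"]]
                 for v in verdicts if v["passed"])
    return earned / total if total else 0.0

reward = shaped_reward([
    {"category": "cross_artifact_consistency", "weight": 10, "passed": False},
    {"category": "technical_correctness", "weight": 5, "passed": True},
    {"category": "client_readiness", "weight": 3, "passed": True},
])
print(f"{reward:.2f}")  # 8 / 28, i.e. 0.29
```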
- Pre-deployment risk assessment and policy setting (Finance, Policy/Governance)
- Translate BTB score distributions and pass rates into internal risk policies (e.g., “AI outputs must pass all critical criteria and be human-reviewed”); define deployment scope (draft-only vs. client-ready).
- Tools/workflows: Risk dashboards; approval workflows gated by BTB-equivalent checks; exception handling.
- Assumptions/dependencies: Executive buy-in; harmonization with existing model risk management frameworks.
- Failure-mode diagnostics and tool-chain hardening (Finance, Software/AI)
- Use BTB failure analysis to prioritize engineering investments (e.g., spreadsheet formula integrity, cross-file name/number reconciliation, compliance disclaimers).
- Tools/workflows: Consistency checkers for Excel/PPT (sketch below); structured “artifact graphs” tying values/formulas/labels across files; error surfacing in IDE-like UIs.
- Assumptions/dependencies: Instrumentation to track cross-artifact links; Office suite automation; data provenance capture.
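A sketch of one such consistency check, reconciling a model output against deck text with openpyxl and python-pptx; file names, cell addresses, and number formatting are hypothetical:

```python
import re
import openpyxl
from pptx import Presentation

def excel_value(path: str, sheet: str, cell: str) -> float:
    """Read the cached computed value of a cell (requires a saved calculation)."""
    wb = openpyxl.load_workbook(path, data_only=True)
    return float(wb[sheet][cell].value)

def deck_numbers(path: str) -> set[str]:
    """Collect numeric tokens from every text frame in the deck."""
    nums: set[str] = set()
    for slide in Presentation(path).slides:
        for shape in slide.shapes:
            if shape.has_text_frame:
                nums.update(re.findall(r"\d[\d,]*\.?\d*", shape.text_frame.text))
    return nums

# Hypothetical check: the model's implied share price must appear on the deck
model_price = excel_value("dcf_model.xlsx", "DCF", "C42")
if f"{model_price:,.2f}" not in deck_numbers("pitch_deck.pptx"):
    print(f"INCONSISTENT: model price {model_price:,.2f} not found in deck")
```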
- Academic research platform for agentic AI (Academia/Research)
- Study long-horizon reasoning, tool-use reliability, and multi-artifact generation; compare harnesses (e.g., OpenHands/OpenCode) under controlled, reproducible conditions.
- Tools/workflows: Harbor experiments; open-source Gandalf; rubric taxonomy for research on evaluation validity and psychometrics.
- Assumptions/dependencies: Access to model APIs or open models; compliance with BTB’s historical-date constraint.
- Recruiting and performance assessment (Finance, Education)
- Deploy BTB tasks as standardized, auto-graded exercises for candidate screening or periodic skill checks for analysts/associates.
- Tools/workflows: Proctored sandboxes; Gandalf scoring with category breakdowns; benchmark-based thresholds by role level.
- Assumptions/dependencies: Fairness/legal review; anti-cheating controls; localization to region-specific accounting practices.
- Productization of “grading-as-a-service” for enterprise documents (Software/AI)
- Offer Gandalf-like agent-as-a-judge for non-IB artifacts (e.g., board decks, FP&A models, sales proposals) to ensure structure, formulas, and disclosures meet internal standards.
- Tools/workflows: API for document upload; customizable rubrics; integrations with SharePoint/Box/Drive.
- Assumptions/dependencies: Domain-specific rubrics; verifier accuracy tuning; liability management.
Long-Term Applications
These applications require further model reliability, scaling, or organizational changes. They extend BTB’s methods and signals into broader automation, governance, and cross-industry standards.
- Partial automation of junior banker workflows with human oversight (Finance)
- Agents prepare first-draft LBO/DCFs, trading comps, and pitchbooks that satisfy critical rubric criteria; bankers focus on judgment and storytelling.
- Tools/workflows: Generator+checker (Gandalf) pipelines; structured assumption templates; automated sensitivity tables and data tables (sketch below).
- Assumptions/dependencies: Significant gains in cross-artifact consistency, formula correctness, and instruction following; robust audit trails.
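As an illustration of the sensitivity-table piece, a sketch with openpyxl that writes a grid of live formulas rather than pasted numbers; the sheet layout and Gordon-growth formula are hypothetical stand-ins for a real model:

```python
import openpyxl

wb = openpyxl.Workbook()
model = wb.active
model.title = "Model"
model["B1"] = 100.0  # hypothetical base cash flow referenced by the grid

ws = wb.create_sheet("Sensitivity")
growth_rates = [0.01, 0.02, 0.03, 0.04]
discount_rates = [0.08, 0.09, 0.10, 0.11]

for j, g in enumerate(growth_rates):
    ws.cell(row=1, column=2 + j, value=g)      # growth rates across the top
for i, d in enumerate(discount_rates):
    ws.cell(row=2 + i, column=1, value=d)      # discount rates down the side
    for j, g in enumerate(growth_rates):
        # Each grid cell is a live formula, so the table stays auditable
        # and recomputes whenever Model!$B$1 changes.
        ws.cell(row=2 + i, column=2 + j,
                value=f"=Model!$B$1*(1+{g})/({d}-{g})")

wb.save("sensitivity.xlsx")
```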
- Autonomous multi-file deliverable copilot integrated with Office and data rooms (Finance, Software/AI)
- A “deal workspace” that ingests market/SEC data, updates models, refreshes slides, and maintains consistent numbers and labels across artifacts.
- Tools/workflows: Bi-directional connectors to market data/EDGAR; artifact-graph to enforce coherence; versioned update logs.
- Assumptions/dependencies: Model upgrades; strong IT integration; permissions and data governance.
- Standardized regulatory benchmarks for high-stakes AI in finance (Policy/Regulation)
- Regulators adopt BTB-like benchmarks and pass thresholds as preconditions for autonomous AI use in client-facing materials or transaction execution.
- Tools/workflows: Certification workflows; public scorecards; audit packages including rubric coverage and verifier reliability.
- Assumptions/dependencies: Industry consensus; regulator capacity to maintain evolving standards; vendor cooperation.
- Generalized agent-as-a-judge platforms for complex enterprise artifacts (Software/AI, Cross-industry)
- “Gandalf for X” services for legal contracts, clinical documentation, energy project finance models, or safety-critical SOPs.
- Tools/workflows: Domain rubrics co-developed with practitioners; file-type specific instrumentation (e.g., CAD, EHR, BIM).
- Assumptions/dependencies: High-quality, domain-specific rubrics; demonstrable inter-rater agreement vs. experts.
- Cross-artifact consistency engines and integrity checkers (Software/AI)
- Systematically enforce consistency across numbers, formulas, and labels spanning Excel, PPT, and Word, with provenance and dependency tracing.
- Tools/workflows: Artifact graphs; constraint solvers for formula structures; reconciliation reports with suggested fixes.
- Assumptions/dependencies: Standardized file interfaces; agent reasoning about dependencies; enterprise adoption.
- RL ecosystems and curriculum learning for professional workflows (Software/AI, Academia)
- Use BTB tasks and Gandalf rewards to build curriculum schedules (from simpler valuation tasks to full sell-side pitchbooks), advancing agent reliability.
- Tools/workflows: Self-play and iterative improvement; task difficulty scaffolding; reward decomposition by rubric category.
- Assumptions/dependencies: Stable, high-quality reward signals; mitigation for reward hacking; compute.
- Auditability, transparency, and provenance toolchains (Finance, Policy)
- End-to-end capture of assumptions, data sources, and modeling choices with change logs linking cells, slides, and footnotes.
- Tools/workflows: Provenance metadata embedded in files; auto-generated appendix pages; audit reports aligned to risk/compliance rubrics.
- Assumptions/dependencies: Organizational requirements for explainability; standardized metadata schemas.
- Expansion of BTB methodology to adjacent domains (Consulting, Law, Healthcare, Energy)
- Create deep, domain-specific benchmarks and verifiers for client deliverables (e.g., legal closing binders, clinical care plans, project finance models).
- Tools/workflows: Job Task Analysis per domain; practitioner-authored rubrics with 100+ criteria; environment/tooling adapted to domain software.
- Assumptions/dependencies: Expert time investment; data-sharing constraints; validation against practitioner standards.
- Procurement standards and benchmarking consortia (Finance, Policy)
- Banks and vendors align on shared scoring frameworks (rubric coverage, category weighting) to reduce vendor lock-in and improve comparability.
- Tools/workflows: Common evaluation profiles; third-party auditors; shared leaderboards with interpretability (pass@k, critical-failure rates).
- Assumptions/dependencies: Industry collaboration; legal frameworks for score sharing.
- Live-deal assistants and continuous update pipelines (Finance)
- Agents monitor market/filing changes, auto-update models and slides, and surface material changes for banker sign-off.
- Tools/workflows: Event-driven triggers; change highlighting; approval queues and checklists.
- Assumptions/dependencies: Reliable detection of materiality; robust incremental updates without drift.
- Multi-agent “generator–verifier–editor” workflows (Software/AI)
- Coordinated agents produce drafts, verify against rubrics, and propose remediations; escalate unresolved issues to humans (loop sketch below).
- Tools/workflows: Orchestration frameworks; error taxonomy routing by rubric category; learning from human edits.
- Assumptions/dependencies: Stable interfaces and memory; clear escalation policies; alignment of incentives.
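A minimal sketch of the loop; `generate`, `verify`, and `edit` are hypothetical callables wrapping the coordinated agents:

```python
MAX_ROUNDS = 3

def produce(task, generate, verify, edit):
    """Generator-verifier-editor loop: draft, grade against the rubric,
    remediate, and escalate to a human if issues remain unresolved."""
    draft = generate(task)
    issues = []
    for _ in range(MAX_ROUNDS):
        passed, issues = verify(draft)   # e.g., a Gandalf-style rubric check
        if passed:
            return draft, []             # clean: ship without escalation
        draft = edit(draft, issues)      # targeted remediation of failures
    return draft, issues                 # unresolved: route to human review
```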
- Education and credentialing based on deliverable competence (Education, Professional Certification)
- Certifications awarded via standardized, auto-graded tasks that reflect real analyst work quality and client readiness.
- Tools/workflows: Secure task delivery; proctoring; rubric-aligned grading reports and remediations.
- Assumptions/dependencies: Acceptance by employers; robust anti-cheating measures; periodic rubric updates.
These applications leverage BTB’s core innovations—profession-grade tasks, multi-file deliverables, expert-authored rubrics with 100+ criteria, and a file-aware agentic verifier—to enable practical deployment today and chart a path toward reliable automation of high-stakes professional workflows.
Glossary
- Acquisition Matrix: A grid used to evaluate potential acquisitions across defined criteria. "Acquisition Matrix"
- Adjusted EPS: Earnings per share adjusted to exclude non-recurring or non-cash items. "EPS vs. Adjusted EPS"
- agent-as-a-judge: An autonomous grading agent that evaluates deliverables against a rubric. "we developed a novel agent-as-a-judge verifier called Gandalf"
- APEX-Agents: A benchmark assessing agentic systems on multi-step professional tasks. "APEX-Agents covers investment banking, corporate law, and management consulting."
- Archipelago: A grading workflow used in APEX-Agents that serializes artifacts before scoring. "serialize-then-grade workflow implemented in Archipelago for APEX-Agents"
- bid-ask spreads: The difference between the best prices to buy and sell, indicative of market liquidity. "tighter bid-ask spreads that improved market liquidity and price discovery"
- Capitalization Table: A table detailing a company’s ownership, securities, and their dilution pre/post-transaction. "Capitalization Table (Pre/Post)"
- client-ready deliverables: Outputs sufficiently accurate and polished to be sent to clients. "client-ready deliverables—including Excel financial models, PowerPoint pitch decks, and PDF/Word reports."
- Confidential Information Memorandum: A detailed marketing document used in private company sales to inform prospective buyers. "Confidential Information Memorandum"
- construct underrepresentation: A psychometric flaw where an assessment fails to capture the full scope of the intended construct. "a form of construct underrepresentation"
- Covenant headroom: The cushion between current performance and debt covenant limits. "Covenant Headroom Prompts"
- data provenance: Documentation of the sources and transformations behind reported figures or conclusions. "data provenance, assumption disclosure"
- data room: A secure repository of deal documents and data for due diligence access. "navigating data rooms"
- Debt Capacity Analysis: An assessment of how much debt a company can sustainably carry. "Debt Capacity Analysis"
- Debt Capital Markets (DCM): The sector focused on raising capital via debt instruments. "Debt Capital Markets (DCM)"
- Discounted Cash Flow (DCF): A valuation method based on discounting projected future cash flows. "Discounted Cash Flow Analysis (DCF)"
- EDGAR: The SEC’s public database for company filings. "an SEC EDGAR API"
- Equity Capital Markets (ECM): The sector focused on raising capital via equity issuance. "Equity Capital Markets (ECM)"
- Excel Data Table: An Excel feature for creating sensitivity tables by varying inputs. "Sensitivity table is properly constructed using Excel Data Table functionality or equivalent formula array"
- Gandalf: The agent-as-a-judge verifier used to grade complex files in BTB. "Gandalf is an agent, built on top of the production-quality OpenHands agent harness"
- GDPVal: A cross-occupation benchmark evaluating agent economic utility. "GDPVal covers 44 occupations"
- Harbor (environment): The framework that packages BTB tasks and sandboxes agent runs. "the Harbor environment for running the benchmark."
- High-frequency trading: Algorithmic trading at very high speeds that exploits small price discrepancies. "A well-known example is high-frequency trading"
- inter-rater agreement: The degree of consistency among different human graders. "human inter-rater agreement = 84.6%"
- Job Task Analysis (JTA): A structured method to identify and prioritize critical real-world job tasks. "a Job Task Analysis (JTA) survey"
- Leveraged buyout (LBO): An acquisition funded primarily with debt secured by the target’s cash flows. "a leveraged buyout (LBO) task in BTB."
- Leveraged Finance (LevFin): Financing focused on highly leveraged transactions like LBOs. "Leveraged Finance (Levfin)"
- market data platform: An API providing structured financial data (prices, estimates, financials). "a market data platform API"
- market liquidity: The ease with which assets can be traded with minimal price impact. "improved market liquidity"
- MCP tools: Model Context Protocol tools for standardized agent tool-calling. "Agents can call three MCP tools"
- Merger Model: A financial model evaluating the combined performance of two merging firms. "Merger Model"
- Mergers & Acquisitions (M&A): Corporate transactions involving the buying, selling, or combining of companies. "Mergers & Acquisitions (M&A)"
- model integrity: Correctness and coherence of formulas and logic within a financial model. "Model integrity graded"
- OpenHands agent harness: A production-grade agent framework used to implement Gandalf. "the production-quality OpenHands agent harness"
- Pass@k: The proportion of tasks passed within k sampled attempts. "Pass@1"
- Pitchbook: A client-facing presentation used to pitch ideas or transactions. "a pitchbook PowerPoint presentation (preparing client materials)."
- Precedent transactions: Valuation using multiples from comparable historical deals. "Precedent Transactions"
- price discovery: The market process of determining asset prices from supply and demand. "price discovery"
- Pro forma EPS: Earnings per share adjusted to reflect a proposed transaction’s effects. "Pro forma EPS calculations are correct: Pro Forma NI / Pro Forma Shares"
- psychometric principles: Measurement standards guiding validity and reliability in assessment design. "BTB is grounded in psychometric principles"
- reinforcement learning environment (RLE): An interactive setup where agents take actions and receive feedback. "BTB is designed as a reinforcement learning environment (RLE)"
- SEC filings: Regulatory documents companies submit to the U.S. Securities and Exchange Commission. "retrieve SEC filings"
- sensitivity table: A table showing how outputs change with variation in key inputs. "Sensitivity table is properly constructed using Excel Data Table functionality or equivalent formula array"
- Sources & Uses (S&U): A schedule showing how a transaction is funded and what the funds are used for. "Sources & Uses (S&U)"
- stratified expert survey: A sampling approach ensuring representation across predefined subgroups. "Stratified expert survey"
- Teaser: A brief, anonymized marketing document sent to gauge buyer interest. "Teaser"
- Trading comparables: Valuation based on trading multiples of peer companies. "Trading Comparables"
- Use of Proceeds Analysis: A breakdown of where raised capital will be allocated. "Use of Proceeds Analysis"
- Valuation ranges: Intervals of estimated company value based on applied methodologies. "Valuation Ranges"
- Verifier: An automated judge that scores deliverables against rubrics. "an LLM-powered agentic verifier"