BankerToolBench: Can AI Agents Really Do Investment Banking?

This presentation examines BankerToolBench, the first rigorous benchmark for evaluating AI agents on end-to-end investment banking workflows. Built with input from 502 bankers, BTB reveals that even frontier models like GPT-5.4 pass only 16% of realistic tasks at acceptable quality—and produce zero client-ready work. We explore why current agents fail on complex, multi-artifact deliverables requiring financial judgment, and what this means for AI deployment in high-stakes professional domains.

Script

Frontier language models can write essays, generate code, and pass the bar exam. But can they actually perform the complex, judgment-heavy work of an investment banker? A new benchmark reveals a sobering answer.

BankerToolBench is the first benchmark requiring agents to complete authentic investment banking tasks from end to end. These aren't simple question-answer pairs—they demand synthesizing unstructured requests, navigating real data rooms, using industry tools, and producing multiple interdependent deliverables like discounted cash flow models and pitch decks.

Each task is evaluated against roughly 150 pass-fail criteria, weighted by criticality and curated by veteran bankers. An automated agent-as-judge verifier inspects the deliverables using code execution, which is necessary because Excel workbooks and PowerPoint presentations can't be scored reliably through simple text comparison.
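To make the idea of criticality weighting concrete, here is a minimal sketch of how weighted pass-fail criteria might roll up into a single score. The aggregation rule (earned weight divided by total weight), the criterion names, and the weights are all illustrative assumptions, not the benchmark's published scoring formula.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str      # hypothetical rubric item, not taken from the benchmark
    weight: float  # hypothetical criticality weight
    passed: bool   # pass-fail verdict from the verifier


def weighted_pass_rate(criteria: list[Criterion]) -> float:
    """Share of total criticality weight earned by passed criteria (assumed rule)."""
    total = sum(c.weight for c in criteria)
    if total == 0:
        raise ValueError("rubric has no weight")
    earned = sum(c.weight for c in criteria if c.passed)
    return earned / total


rubric = [
    Criterion("DCF uses live formulas", weight=3.0, passed=False),
    Criterion("Deck figures match the model", weight=2.0, passed=True),
    Criterion("Consistent slide formatting", weight=1.0, passed=True),
]

score = weighted_pass_rate(rubric)
print(f"{score:.2f}")  # 0.50
```

Two of the three criteria pass, but the weighted score is only 0.50 because the failed criterion carries the most weight. That is the effect criticality weighting is meant to capture.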

Nine frontier models were tested, and the results are striking. GPT-5.4 achieves the highest score, passing roughly 16 percent of tasks at an acceptable standard, yet it still fails nearly half of all rubric criteria. Not a single model produces output that bankers would consider client-ready.

The failure taxonomy reveals systemic brittleness. Models generate code with non-existent APIs, hard-code values instead of using formulas, misapply financial concepts, fail to retrieve information consistently, and fabricate data when retrieval fails. In long-horizon workflows, local errors cascade into global deliverable failure.
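One of these failure modes, hard-coding values where a formula belongs, is simple to illustrate. The sketch below is a hypothetical check, not the benchmark's verifier: it assumes cell contents are already loaded into a plain dictionary and relies only on the spreadsheet convention that formulas begin with "=". The function name, the cell addresses, and the list of cells expected to be computed are all made up for illustration.

```python
def hard_coded_cells(cells: dict[str, object], computed: list[str]) -> list[str]:
    """Return addresses that should hold formulas but hold literal values.

    A missing cell is flagged too, since it holds no formula either.
    """
    flagged = []
    for address in computed:
        value = cells.get(address)
        is_formula = isinstance(value, str) and value.startswith("=")
        if not is_formula:
            flagged.append(address)
    return flagged


# Hypothetical sheet: B4 should be derived from B2 and B3 but was typed in.
sheet = {"B2": 1200.0, "B3": 0.25, "B4": 300.0, "B5": "=B2-B4"}
flagged = hard_coded_cells(sheet, computed=["B4", "B5"])
print(flagged)  # ['B4']
```

The typed-in 300.0 is numerically correct today, but it will not update if the inputs in B2 or B3 change. That is how a local shortcut like this can turn into a wrong figure elsewhere in the deliverable.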

BankerToolBench makes one thing clear: progress on isolated capabilities doesn't guarantee performance on integrated, high-stakes workflows. Current agents aren't ready for full delegation in investment banking—or likely in any domain where judgment, consistency across artifacts, and genuine economic value matter. Explore the full benchmark and create your own analysis at EmergentMind.com.