This is an overview page with metadata for this scientific paper. The full article is available from the publisher.
BankerToolBench: Evaluating AI Agents in End-to-End Investment Banking Workflows
Citations: 0
Authors: 27
Year: 2026
Abstract
Existing AI benchmarks lack the fidelity to assess economically meaningful progress on professional workflows. To evaluate frontier AI agents in a high-value, labor-intensive profession, we introduce BankerToolBench (BTB): an open-source benchmark of end-to-end analytical workflows routinely performed by junior investment bankers. To develop an ecologically valid benchmark grounded in representative work environments, we collaborated with 502 investment bankers from leading firms. BTB requires agents to execute senior banker requests by navigating data rooms, using industry tools (market data platform, SEC filings database), and generating multi-file deliverables, including Excel financial models, PowerPoint pitch decks, and PDF/Word reports. Completing a BTB task takes bankers up to 21 hours, underscoring the economic stakes of successfully delegating this work to AI. BTB enables automated evaluation of any LLM or agent, scoring deliverables against 100+ rubric criteria defined by veteran investment bankers to capture stakeholder utility. Testing 9 frontier models, we find that even the best-performing model (GPT-5.4) fails nearly half of the rubric criteria and bankers rate 0% of its outputs as client-ready. Our failure analysis reveals key obstacles (such as breakdowns in cross-artifact consistency) and improvement directions for agentic AI in high-stakes professional workflows.
Authors
- Elaine Lau
- Markus Dücker
- Ronak Chaudhary
- Hui Wen Goh
- Rosemary Wei
- Vaibhav Kumar
- Saed Qunbar
- Guram Gogia
- Yi Liu
- Scott Millslagle
- Nasim Borazjanizadeh
- Ulyana Tkachenko
- Samuel Eshun Danquah
- Collin Schweiker
- Vijay Karumathil
- Asrith Devalaraju
- Varsha Sandadi
- Haemi Nam
- Punit Arani
- Ray Epps
- Abdullah Arif
- Sahil Bhaiwala
- Curtis Northcutt
- Skyler Wang
- Anish Athalye
- Jonas Mueller
- Francisco Guzmán