FinToolBench: Evaluating LLM Agents for Real-World Financial Tool Use
Jiaxuan Lu, Kong Wang, Yemin Wang, Qingmei Tang, Hongwei Zeng, Xiang Chen, Jiahao Pi, Shujian Deng, Lingzhi Chen, Yi Fu, Kehua Yang, Xiao Sun
2026-03-18
Summary
This paper introduces a new way to test how well AI models, specifically Large Language Models, can actually *do* things in the world of finance, not just talk about them. It's about moving beyond simply finding information to having AI agents that can actively use financial tools to solve problems.
What's the problem?
Current methods for testing AI in finance fall short. Existing tests either just analyze text or ask questions about documents, which doesn't reflect real-world financial tasks that require actually *using* tools like trading platforms or data analysis software. General AI tests aren't specific enough for the strict rules and fast-changing data in finance. Until now, there wasn't a realistic way to check whether an AI could handle financial tasks reliably and safely.
What's the solution?
The researchers created FinToolBench, a comprehensive testing environment with 760 real financial tools and 295 challenging questions that require using those tools. They didn't just check whether the AI got the right answer; they also evaluated whether it used sufficiently up-to-date data, whether it understood what kind of financial task it was performing, and whether it stayed within financial regulations. They also built a baseline AI model, FATR, to show how these tools could be used effectively and safely. Everything they created is being made publicly available for others to use and build upon.
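The idea of scoring agents along several finance-critical dimensions at once, rather than a single pass/fail, can be sketched in Python. The record fields, names, and equal-weight aggregation below are illustrative assumptions, not the paper's actual metric.

```python
from dataclasses import dataclass

# Hypothetical evaluation record for one agent run on one query.
# Field names are assumptions for illustration, not FinToolBench's schema.
@dataclass
class ToolCallResult:
    executed_ok: bool     # did the tool call run without error?
    answer_correct: bool  # did the final answer match the reference?
    data_timely: bool     # did the agent use sufficiently fresh data?
    intent_matched: bool  # did the chosen tool match the query's intent?
    compliant: bool       # did the call stay within its regulatory domain?

def score(result: ToolCallResult) -> float:
    """Aggregate the binary dimensions into one score in [0, 1].

    Equal weighting is an assumption; a real framework might weight
    compliance failures more heavily or report dimensions separately.
    """
    checks = (
        result.executed_ok,
        result.answer_correct,
        result.data_timely,
        result.intent_matched,
        result.compliant,
    )
    return sum(checks) / len(checks)

# Example: a run that executes and answers correctly but uses stale data.
run = ToolCallResult(True, True, False, True, True)
print(score(run))  # 0.8
```

Reporting the dimensions separately, as the paper's framework does, is what makes the evaluation auditable: a regulator or developer can see *which* requirement failed, not just that something did.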
Why it matters?
This work is important because it sets a new standard for building trustworthy AI in finance. By providing a realistic and rigorous testing ground, it helps ensure that AI systems are not only accurate but also reliable, compliant with regulations, and able to handle the complexities of the financial world. This is crucial for safely integrating AI into important financial processes.
Abstract
The integration of Large Language Models (LLMs) into the financial domain is driving a paradigm shift from passive information retrieval to dynamic, agentic interaction. While general-purpose tool learning has witnessed a surge in benchmarks, the financial sector, characterized by high stakes, strict compliance, and rapid data volatility, remains critically underserved. Existing financial evaluations predominantly focus on static textual analysis or document-based QA, ignoring the complex reality of tool execution. Conversely, general tool benchmarks lack the domain-specific rigor required for finance, often relying on toy environments or a negligible number of financial APIs. To bridge this gap, we introduce FinToolBench, the first real-world, runnable benchmark dedicated to evaluating financial tool learning agents. Unlike prior works limited to a handful of mock tools, FinToolBench establishes a realistic ecosystem coupling 760 executable financial tools with 295 rigorous, tool-required queries. We propose a novel evaluation framework that goes beyond binary execution success, assessing agents on finance-critical dimensions: timeliness, intent type, and regulatory domain alignment. Furthermore, we present FATR, a finance-aware tool retrieval and reasoning baseline that enhances stability and compliance. By providing the first testbed for auditable, agentic financial execution, FinToolBench sets a new standard for trustworthy AI in finance. The tool manifest, execution environment, and evaluation code will be open-sourced to facilitate future research.