The Tool Decathlon: Benchmarking Language Agents for Diverse, Realistic, and Long-Horizon Task Execution
Junlong Li, Wenshuo Zhao, Jian Zhao, Weihao Zeng, Haoze Wu, Xiaochen Wang, Rui Ge, Yuxuan Cao, Yuzhen Huang, Wei Liu, Junteng Liu, Zhaochen Su, Yiyang Guo, Fan Zhou, Lueyang Zhang, Juan Michelini, Xingyao Wang, Xiang Yue, Shuyan Zhou, Graham Neubig, Junxian He
2025-10-30
Summary
This paper introduces a new way to test how well computer programs, called 'language agents,' can handle complicated tasks that require using many different software applications, like email, calendars, and databases.
What's the problem?
Current tests for these agents are too simple and don't accurately reflect the challenges of real-world situations. They often focus on just one application or a very short series of steps, making it hard to tell if an agent can truly manage complex, multi-step workflows that require interacting with a variety of tools and dealing with realistic starting conditions.
What's the solution?
The researchers created a benchmark called 'Toolathlon,' which includes 32 different applications and 604 tools, ranging from everyday apps like Google Calendar to professional tools like Kubernetes. Importantly, Toolathlon doesn't just check whether the agent *can* use the tools; it also tests how well the agent performs under realistic, varied starting conditions, such as a Canvas course already filled with students or a pre-existing financial spreadsheet. The benchmark includes 108 tasks that take around 20 steps to complete, and each is automatically checked for correctness by a dedicated evaluation script.
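To make "automatically checked for correctness" concrete, here is a minimal sketch of what an execution-based check could look like, assuming a hypothetical task in which the agent must export final grades to a spreadsheet. The file name, column names, and expected values are illustrative assumptions, not taken from Toolathlon itself.

```python
# Hypothetical execution-based check: it inspects the final environment state
# (an exported file), not the agent's conversation. Paths, column names, and
# expected grades below are assumptions for illustration only.
import csv
import sys

EXPECTED = {"Alice": "92", "Bob": "85", "Carol": "78"}  # assumed ground truth

def check_gradebook(path: str) -> bool:
    """Return True iff the exported spreadsheet matches the expected grades."""
    try:
        with open(path, newline="") as f:
            rows = {r["student"]: r["grade"] for r in csv.DictReader(f)}
    except (OSError, KeyError):
        return False  # missing file or wrong columns -> task failed
    return rows == EXPECTED

if __name__ == "__main__":
    ok = check_gradebook(sys.argv[1] if len(sys.argv) > 1 else "final_grades.csv")
    print("PASS" if ok else "FAIL")
    sys.exit(0 if ok else 1)
```

The important design point is that the script only looks at the end state of the environment, so any sequence of tool calls that produces the correct result counts as a success.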
Why it matters?
This new benchmark is important because it shows that even the best current language agents still struggle with these complex tasks, achieving a success rate of only 38.6%. By providing a more challenging and realistic test, Toolathlon will encourage developers to create more capable agents that can actually be useful in real-world applications.
Abstract
Real-world language agents must handle complex, multi-step workflows across diverse Apps. For instance, an agent may manage emails by coordinating with calendars and file systems, or monitor a production database to detect anomalies and generate reports following an operating manual. However, existing language agent benchmarks often focus on narrow domains or simplified tasks that lack the diversity, realism, and long-horizon complexity required to evaluate agents' real-world performance. To address this gap, we introduce the Tool Decathlon (dubbed Toolathlon), a benchmark for language agents offering diverse Apps and tools, realistic environment setup, and reliable execution-based evaluation. Toolathlon spans 32 software applications and 604 tools, ranging from everyday platforms such as Google Calendar and Notion to professional ones like WooCommerce, Kubernetes, and BigQuery. Most of the tools are based on a high-quality set of Model Context Protocol (MCP) servers that we revised or implemented ourselves. Unlike prior works, which primarily ensure functional realism but offer limited environment state diversity, we provide realistic initial environment states from real software, such as Canvas courses with dozens of students or real financial spreadsheets. This benchmark includes 108 manually sourced or crafted tasks in total, requiring interaction with multiple Apps over around 20 turns on average to complete. Each task is strictly verifiable through dedicated evaluation scripts. Comprehensive evaluation of SOTA models highlights their significant shortcomings: the best-performing model, Claude-4.5-Sonnet, achieves only a 38.6% success rate with 20.2 tool-calling turns on average, while the top open-weights model DeepSeek-V3.2-Exp reaches 20.1%. We expect Toolathlon to drive the development of more capable language agents for real-world, long-horizon task execution.
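For readers unfamiliar with this style of evaluation, the interaction pattern the abstract describes, an agent issuing tool calls turn by turn until the task is done, can be sketched as a simple loop. This is an illustrative harness only; the model interface, tool registry, message format, and stopping condition are stand-ins, not Toolathlon's actual implementation.

```python
# Schematic agent loop over MCP-style tools. Everything here is a stand-in:
# the real benchmark exposes 604 tools from 32 applications via MCP servers.
from typing import Callable, Dict, List

Tool = Callable[[dict], str]  # a tool takes structured args, returns a text result

def run_episode(model: Callable[[List[dict]], dict],
                tools: Dict[str, Tool],
                task: str,
                max_turns: int = 40) -> List[dict]:
    """Alternate model tool calls and tool results until the model declares it is done."""
    history: List[dict] = [{"role": "user", "content": task}]
    for _ in range(max_turns):
        action = model(history)  # e.g. {"tool": name, "args": {...}} or {"done": True}
        if action.get("done"):
            break
        result = tools[action["tool"]](action["args"])
        history.append({"role": "assistant", "tool_call": action})
        history.append({"role": "tool", "content": result})
    return history
```

The key property is that success is not judged from this dialogue at all: after roughly 20 such turns on average, a separate evaluation script inspects the resulting environment state, as described above.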