Retrieval Models Aren't Tool-Savvy: Benchmarking Tool Retrieval for Large Language Models
Zhengliang Shi, Yuhan Wang, Lingyong Yan, Pengjie Ren, Shuaiqiang Wang, Dawei Yin, Zhaochun Ren
2025-03-06
Summary
This paper introduces ToolRet, a new benchmark for testing how well AI systems can find the right tools to solve problems. It shows that current information-retrieval methods are not very good at finding the right tools.
What's the problem?
AI language models are getting better at using tools to solve tasks, but they can only work with a limited amount of information at once. To use tools effectively, they need to quickly find the right ones from a large collection. Current ways of testing this step don't reflect real-world situations, and it's unclear how well existing search methods work for finding tools.
What's the solution?
The researchers created ToolRet, a large collection of 7,600 tool-retrieval tasks paired with a database of 43,000 tools. They tested six different types of search models on ToolRet and found that even the best ones weren't very good at finding the right tools. To help improve this, they also built a training dataset with over 200,000 examples, which made the search models substantially better at tool retrieval.
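To make the retrieval step concrete, here is a minimal sketch of first-stage tool retrieval: given a task query, rank every tool in a corpus by textual similarity to its description and return the top matches. The tool names, descriptions, and the bag-of-words cosine scoring below are illustrative assumptions for exposition, not the paper's actual ToolRet corpus or the IR models it benchmarks.

```python
import math
from collections import Counter

def bow(text):
    """Lowercased bag-of-words term counts."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_tools(query, tools, k=2):
    """Return the names of the top-k tools ranked by similarity to the query."""
    q = bow(query)
    ranked = sorted(tools, key=lambda t: cosine(q, bow(t["description"])),
                    reverse=True)
    return [t["name"] for t in ranked[:k]]

# Toy tool corpus (hypothetical names and descriptions).
tools = [
    {"name": "weather_api", "description": "get current weather forecast for a city"},
    {"name": "calculator", "description": "evaluate arithmetic math expressions"},
    {"name": "flight_search", "description": "search flights between two cities"},
]

print(retrieve_tools("what is the weather in Paris", tools, k=1))  # ['weather_api']
```

Real systems replace the bag-of-words scorer with sparse (e.g. BM25) or dense embedding retrievers; the paper's finding is that even strong retrievers of this kind struggle when the corpus holds tens of thousands of tools.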
Why it matters?
This matters because as AI systems become more capable, they need to use tools effectively to solve real-world problems. By showing that current search methods aren't good enough at finding the right tools, and by providing data to improve them, this research could lead to AI systems that are much better at solving complex tasks with the right tool at the right time.
Abstract
Tool learning aims to augment large language models (LLMs) with diverse tools, enabling them to act as agents for solving practical tasks. Due to the limited context length of tool-using LLMs, adopting information retrieval (IR) models to select useful tools from large toolsets is a critical initial step. However, the performance of IR models in tool retrieval tasks remains underexplored and unclear. Most tool-use benchmarks simplify this step by manually pre-annotating a small set of relevant tools for each task, which is far from real-world scenarios. In this paper, we propose ToolRet, a heterogeneous tool retrieval benchmark comprising 7.6k diverse retrieval tasks and a corpus of 43k tools, collected from existing datasets. We benchmark six types of models on ToolRet. Surprisingly, even models with strong performance on conventional IR benchmarks exhibit poor performance on ToolRet. This low retrieval quality degrades the task pass rate of tool-use LLMs. As a further step, we contribute a large-scale training dataset with over 200k instances, which substantially improves the tool retrieval ability of IR models.