TheMCPCompany: Creating General-purpose Agents with Task-specific Tools
Reza Esfandiarpoor, Vishwas Suryanarayanan, Stephen H. Bach, Vishal Chowdhary, Anthony Aue
2025-10-23
Summary
This paper investigates how well large language models (LLMs) can use specialized tools, instead of just relying on web browsers, to complete tasks that involve interacting with real-world services.
What's the problem?
LLMs are getting better at using tools, but most current agent systems still depend heavily on web browsing to get things done. Although many specialized tools are available, it's difficult for LLMs to choose the right ones and combine them effectively, especially when there are thousands of options. It's also unclear whether smaller LLMs can take advantage of these tools at all, and how close current models come to the performance they could achieve with perfect tool selection.
What's the solution?
The researchers created a benchmark called TheMCPCompany, which includes over 18,000 tools built from the REST APIs of real-world services. They also manually annotated 'ground truth' tools, the exact tools needed for each task, to measure how well LLMs could perform if they always had the right tools at hand. They then tested different LLMs, including GPT-5, on how well they could *find* the right tools themselves, and compared the performance of models using tool retrieval against models using web browsers.
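To make the retrieval step concrete, here is a minimal sketch of how an agent might select a handful of candidate tools from a large catalog by scoring tool descriptions against the task. Real systems typically use dense embedding models rather than bag-of-words matching, and the tool names and descriptions below are hypothetical illustrations, not tools from TheMCPCompany.

```python
import math
import re
from collections import Counter

# Hypothetical tool catalog: name -> natural-language description.
# In a benchmark like TheMCPCompany this would hold thousands of entries.
TOOLS = {
    "create_work_item": "Create a new work item such as a bug or task in a project tracker.",
    "list_merge_requests": "List open merge requests for a source code repository.",
    "send_channel_message": "Post a message to a team chat channel.",
    "get_invoice": "Retrieve an invoice by id from the billing service.",
}

def tokenize(text):
    # Lowercase word tokens; a crude stand-in for a real embedding model.
    return re.findall(r"[a-z]+", text.lower())

def cosine(a, b):
    # Cosine similarity between two term-frequency Counters.
    num = sum(a[t] * b[t] for t in set(a) & set(b))
    denom = (math.sqrt(sum(v * v for v in a.values()))
             * math.sqrt(sum(v * v for v in b.values())))
    return num / denom if denom else 0.0

def retrieve(query, k=2):
    # Score every tool description against the task query and
    # return the names of the top-k matches.
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(desc))), name)
              for name, desc in TOOLS.items()]
    return [name for score, name in sorted(scored, reverse=True)[:k] if score > 0]

print(retrieve("open a bug work item in the project"))
```

Only the retrieved tools would then be exposed to the LLM as callable functions, which is why retrieval quality directly bounds the agent's task performance.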
Why it matters?
This work shows that while the most advanced LLMs like GPT-5 are getting good at finding the right tools in simpler situations, they still struggle with complex environments that have many tools. It highlights that building systems that can effectively use a large number of specialized tools requires improvements in both the LLM’s reasoning abilities and its ability to search for and select the correct tools.
Abstract
Since the introduction of the Model Context Protocol (MCP), the number of available tools for Large Language Models (LLMs) has increased significantly. These task-specific tool sets offer an alternative to general-purpose tools such as web browsers, while being easier to develop and maintain than GUIs. However, current general-purpose agents predominantly rely on web browsers for interacting with the environment. Here, we introduce TheMCPCompany, a benchmark for evaluating tool-calling agents on tasks that involve interacting with various real-world services. We use the REST APIs of these services to create MCP servers, which include over 18,000 tools. We also provide manually annotated ground-truth tools for each task. In our experiments, we use the ground-truth tools to show the potential of tool-calling agents for both improving performance and reducing costs, assuming perfect tool retrieval. Next, we explore agent performance using tool retrieval to study the real-world practicality of tool-based agents. While all models with tool retrieval perform similarly to or better than browser-based agents, smaller models cannot take full advantage of the available tools through retrieval. On the other hand, GPT-5's performance with tool retrieval is very close to its performance with ground-truth tools. Overall, our work shows that the most advanced reasoning models are effective at discovering tools in simpler environments, but seriously struggle with navigating complex enterprise environments. TheMCPCompany reveals that navigating tens of thousands of tools and combining them in non-trivial ways to solve complex problems is still a challenging task for current models and requires both better reasoning and better retrieval models.