MCP-Bench: Benchmarking Tool-Using LLM Agents with Complex Real-World Tasks via MCP Servers
Zhenting Wang, Qi Chang, Hemani Patel, Shashank Biju, Cheng-En Wu, Quan Liu, Aolin Ding, Alireza Rezazadeh, Ankit Shah, Yujia Bao, Eugene Siow
2025-08-29
Summary
This paper introduces MCP-Bench, a benchmark for testing how well large language models (LLMs) can handle complex tasks that require choosing among many tools and reasoning through multiple steps to reach a solution.
What's the problem?
Existing benchmarks for LLMs usually tell the model exactly *which* tools to use and ask for only a few shallow steps. This doesn't reflect how LLMs will be used in the real world, where they need to identify the right tools themselves and combine them in clever ways to solve complicated problems. Current benchmarks also tend to focus on one domain at a time, instead of testing how well LLMs can coordinate work across areas like finance, travel, and research.
What's the solution?
The researchers created MCP-Bench, which connects LLMs to 28 live MCP servers that together expose 250 tools. These tools aren't isolated; within each server they're designed to work *together*. The LLMs are given tasks described in everyday language and must figure out which tools to use, in what order, and how to interpret each tool's output to ultimately complete the task. The researchers also developed an evaluation framework that scores how well the LLMs understand tool schemas, plan their actions, and finish the tasks.
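To make the loop concrete, here is a minimal, hypothetical sketch of the kind of agent behavior the benchmark evaluates: given a fuzzy task with no explicit tool names, select relevant tools from their schema descriptions, order them so each tool's output feeds the next, and execute. The tool names, descriptions, keyword-overlap "retrieval," and dependency-based "planner" below are toy stand-ins for the LLM's reasoning, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Tool:
    name: str
    description: str          # what the agent "reads" to pick tools
    needs: frozenset          # state keys required before this tool can run
    provides: frozenset       # state keys this tool adds
    fn: Callable[[dict], dict]

def plan(task: str, tools: list[Tool]) -> list[Tool]:
    """Toy planner: retrieve tools whose descriptions overlap the fuzzy
    task (>= 2 shared words), then order them so every tool runs only
    after the state keys it needs are available (multi-hop chaining)."""
    words = set(task.lower().split())
    chosen = [t for t in tools
              if len(words & set(t.description.lower().split())) >= 2]
    ordered, have = [], {"task"}
    while chosen:
        ready = [t for t in chosen if t.needs <= have]
        if not ready:
            raise RuntimeError("no executable tool; plan is stuck")
        tool = ready[0]
        ordered.append(tool)
        chosen.remove(tool)
        have |= tool.provides
    return ordered

def run(task: str, tools: list[Tool]) -> dict:
    """Execute the plan, threading each output into the shared state."""
    state: dict = {"task": task}
    for tool in plan(task, tools):
        state.update(tool.fn(state))
    return state

# Hypothetical tools from two domains (finance + travel), plus a distractor.
TOOLS = [
    Tool("fx_rate", "look up the current currency exchange rate",
         frozenset({"task"}), frozenset({"rate"}),
         lambda s: {"rate": 0.92}),
    Tool("trip_cost", "estimate travel trip cost in local currency",
         frozenset({"rate"}), frozenset({"cost_eur"}),
         lambda s: {"cost_eur": round(1000 * s["rate"], 2)}),
    Tool("weather", "get a local weather forecast",
         frozenset({"task"}), frozenset({"forecast"}),
         lambda s: {"forecast": "sunny"}),
]

result = run("estimate the travel cost of my trip using the current exchange rate", TOOLS)
```

Note that the task never names a tool: the agent must infer that `fx_rate` has to run before `trip_cost`, and that `weather` is irrelevant, which is exactly the fuzzy-retrieval and ordering ability MCP-Bench probes.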
Why it matters?
MCP-Bench provides a much more realistic and challenging test for LLMs. By showing that even advanced models struggle with these complex, multi-step tasks, the researchers highlight areas where LLMs still need to improve before they can be reliably used for real-world applications that require planning, reasoning, and coordinating multiple tools.
Abstract
We introduce MCP-Bench, a benchmark for evaluating large language models (LLMs) on realistic, multi-step tasks that demand tool use, cross-tool coordination, precise parameter control, and the planning and reasoning needed to solve them. Built on the Model Context Protocol (MCP), MCP-Bench connects LLMs to 28 representative live MCP servers spanning 250 tools across domains such as finance, travel, scientific computing, and academic search. Unlike prior API-based benchmarks, each MCP server provides a set of complementary tools designed to work together, enabling the construction of authentic, multi-step tasks with rich input-output coupling. Tasks in MCP-Bench test agents' ability to retrieve relevant tools from fuzzy instructions without explicit tool names, plan multi-hop execution trajectories for complex objectives, ground responses in intermediate tool outputs, and orchestrate cross-domain workflows, capabilities not adequately evaluated by existing benchmarks that rely on explicit tool specifications, shallow few-step workflows, and isolated domain operations. We propose a multi-faceted evaluation framework covering tool-level schema understanding and usage, trajectory-level planning, and task completion. Experiments on 20 advanced LLMs reveal persistent challenges in MCP-Bench. Code and data: https://github.com/Accenture/mcp-bench.
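The abstract's multi-faceted evaluation can be pictured as per-facet scores combined into one number. The sketch below is an assumption about the shape of such scoring, not the paper's actual metric: the facet names, score ranges, and equal weights are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class RunScores:
    """Hypothetical per-run scores in [0, 1] for the three facets the
    abstract names: tool-level schema use, trajectory-level planning,
    and end-to-end task completion."""
    schema_use: float   # tool-level: were calls well-formed, params valid?
    planning: float     # trajectory-level: was the tool ordering sound?
    completion: float   # task-level: was the objective actually met?

def aggregate(s: RunScores, weights: tuple = (1/3, 1/3, 1/3)) -> float:
    """Weighted mean over the three facets (equal weights assumed)."""
    facets = (s.schema_use, s.planning, s.completion)
    return round(sum(w * f for w, f in zip(weights, facets)), 3)

# A model might follow schemas well yet plan and finish poorly,
# which is the "persistent challenge" pattern the experiments report.
score = aggregate(RunScores(schema_use=0.9, planning=0.6, completion=0.5))
```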