
MCP-AgentBench: Evaluating Real-World Language Agent Performance with MCP-Mediated Tools

Zikang Guo, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, Zhendong Mao

2025-09-15


Summary

This paper introduces a new way to test how well AI agents can use tools, specifically focusing on a new standard called the Model Context Protocol (MCP). The goal is a better system for measuring whether these agents are actually good at getting things done in the real world.

What's the problem?

Currently, the tests used to evaluate AI agents don't accurately reflect how well they perform when interacting with tools using the MCP standard. These existing tests give a misleading idea of an agent’s abilities and make it hard to tell which agents are truly better at using tools to solve problems. Basically, we need a more realistic and challenging way to assess these agents.

What's the solution?

The researchers created MCP-AgentBench, a comprehensive testing environment with 33 servers running 188 different tools. They designed 600 questions, spread across six categories of varying interaction complexity, that require agents to use these tools in non-trivial ways. They also developed a new evaluation method, MCP-Eval, that focuses on whether the agent actually *succeeds* at the task, rather than on how it tries to do it. They then tested several leading AI agents using this new benchmark (a small sketch of the outcome-oriented idea follows below).
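
To make "outcome-oriented" concrete, here is a minimal sketch of the difference between scoring an agent on the tool-call sequence it follows and scoring it on whether the task actually succeeded, which is the spirit of MCP-Eval. The data structures, field names, and example task below are hypothetical illustrations, not the paper's actual implementation.

# Hypothetical sketch: process-oriented vs. outcome-oriented scoring.
# All names and fields are illustrative, not taken from MCP-AgentBench.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AgentRun:
    tool_calls: List[str]      # names of MCP tools the agent invoked, in order
    final_answer: str          # the agent's final response to the user query

@dataclass
class TaskSpec:
    reference_calls: List[str]            # one "gold" tool-call sequence
    success_check: Callable[[str], bool]  # predicate over the final outcome

def process_score(run: AgentRun, task: TaskSpec) -> float:
    """Process-oriented: reward matching a reference trajectory,
    even if the final outcome is wrong."""
    matches = sum(a == b for a, b in zip(run.tool_calls, task.reference_calls))
    return matches / max(len(task.reference_calls), 1)

def outcome_score(run: AgentRun, task: TaskSpec) -> float:
    """Outcome-oriented (the spirit of MCP-Eval): full credit only if the
    task actually succeeded, regardless of which tools were used."""
    return 1.0 if task.success_check(run.final_answer) else 0.0

# Example: any trajectory that produces a valid booking gets full outcome credit.
task = TaskSpec(
    reference_calls=["search_flights", "book_flight"],
    success_check=lambda answer: "confirmation" in answer.lower(),
)
run = AgentRun(
    tool_calls=["list_airlines", "book_flight"],
    final_answer="Booking confirmation: ABC123",
)
print(process_score(run, task))  # 0.5 -- trajectory differs from the reference
print(outcome_score(run, task))  # 1.0 -- the task itself succeeded

Under an outcome-oriented check like the one sketched above, two agents that solve the same task through different tool sequences are scored equally, which is what lets the benchmark compare agents on real-world task success rather than on imitating a fixed procedure.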

Why it matters?

This work is important because it provides a standardized and reliable way to measure the performance of AI agents using the MCP standard. This will help researchers build better agents, validate improvements, and ultimately accelerate the development of AI systems that can truly interact with the world and be useful in practical applications. It’s a step towards making AI more capable and interoperable.

Abstract

The Model Context Protocol (MCP) is rapidly emerging as a pivotal open standard, designed to enhance agent-tool integration and interoperability, and is positioned to unlock a new era of powerful, interconnected, and genuinely utilitarian agentic AI. However, despite MCP's growing adoption, existing benchmarks often fail to capture real-world agent performance within this new paradigm, leading to a distorted perception of their true operational value and an inability to reliably differentiate proficiencies. To bridge this critical evaluation gap, we introduce MCP-AgentBench -- a comprehensive benchmark specifically engineered to rigorously assess language agent capabilities in MCP-mediated tool interactions. Core contributions of MCP-AgentBench include: the establishment of a robust MCP testbed comprising 33 operational servers with 188 distinct tools; the development of a benchmark featuring 600 systematically designed queries distributed across 6 distinct categories of varying interaction complexity; and the introduction of MCP-Eval, a novel outcome-oriented evaluation methodology prioritizing real-world task success. Through extensive empirical evaluation of leading language agents, we provide foundational insights. MCP-AgentBench aims to equip the research community with a standardized and reliable framework to build, validate, and advance agents capable of fully leveraging MCP's transformative benefits, thereby accelerating progress toward truly capable and interoperable AI systems.