
LiveMCP-101: Stress Testing and Diagnosing MCP-enabled Agents on Challenging Queries

Ming Yin, Dinghan Shen, Silei Xu, Jianbing Han, Sixun Dong, Mian Zhang, Yebowen Hu, Shujian Liu, Simin Ma, Song Wang, Sathish Reddy Indurthi, Xun Wang, Yiran Chen, Kaiqiang Song

2025-08-22


Summary

This paper introduces a new way to test how well AI agents can use different tools to complete complicated tasks in realistic settings, focusing on agents built on the Model Context Protocol (MCP), a standard that lets AI models and external tools work together.
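For a concrete sense of what MCP standardizes, here is a minimal sketch of the kind of JSON-RPC 2.0 `tools/call` request an MCP client sends to a tool server. The `web_search` tool name and its arguments are invented for illustration; a real client would typically go through an MCP SDK rather than building raw messages like this.

```python
import json

# Minimal sketch of an MCP-style tool invocation: MCP clients talk to tool
# servers over JSON-RPC 2.0, asking a server to run a named tool with arguments.
# The "web_search" tool and its arguments below are hypothetical examples.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "web_search",                       # hypothetical tool exposed by a server
        "arguments": {"query": "latest MCP spec"},  # tool-specific input
    },
}

print(json.dumps(request, indent=2))
```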

What's the problem?

Currently, there isn't a good benchmark for measuring how well AI agents can handle tasks that require using multiple tools in sequence, especially when those tasks are complex and the environment keeps changing. Existing tests often just check whether the AI gets the final answer right, not *how* it used the tools to get there, and they assume a static world, which breaks down when tool outputs (like live web search results) differ from one run to the next.

What's the solution?

The researchers created a benchmark called LiveMCP-101, made up of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require AI agents to coordinate several MCP tools such as web search, file operations, mathematical reasoning, and data analysis. Importantly, instead of only checking a fixed final answer, they evaluate each agent against a ground-truth execution plan, which holds up better when the underlying environment (like live web results) keeps changing. They then tested a range of frontier AI models on this benchmark.
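As a rough illustration of what plan-based evaluation can look like (not the paper's actual scoring code), the sketch below compares the sequence of tool calls an agent actually made against a ground-truth execution plan and reports how much of the plan was completed in order. The `ToolCall` structure and the `plan_coverage` metric are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str     # e.g. "web_search", "read_file"
    summary: str  # short description of what the call accomplished


def plan_coverage(agent_calls: list[ToolCall], ground_truth: list[ToolCall]) -> float:
    """Fraction of ground-truth plan steps the agent completed, in order.

    Hypothetical metric: walk through the ground-truth plan and advance a
    pointer each time the agent's trajectory contains a matching tool call.
    """
    idx = 0
    for call in agent_calls:
        if idx < len(ground_truth) and call.tool == ground_truth[idx].tool:
            idx += 1
    return idx / len(ground_truth) if ground_truth else 1.0


# Example: the agent skipped the data-analysis step of a three-step plan.
plan = [
    ToolCall("web_search", "find dataset"),
    ToolCall("read_file", "load dataset"),
    ToolCall("analyze_data", "compute statistics"),
]
trajectory = [
    ToolCall("web_search", "find dataset"),
    ToolCall("read_file", "load dataset"),
]
print(plan_coverage(trajectory, plan))  # 0.666...
```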

Why it matters?

This work is important because it provides a more challenging and realistic test for AI agents. The results show that even the best AI models still struggle to coordinate tools effectively, succeeding on fewer than 60% of the tasks, which highlights where further progress is needed to build truly autonomous AI systems that can reliably solve complex problems.

Abstract

Tool calling has emerged as a critical capability for AI agents to interact with the real world and solve complex tasks. While the Model Context Protocol (MCP) provides a powerful standardized framework for tool integration, there is a significant gap in benchmarking how well AI agents can effectively solve multi-step tasks using diverse MCP tools in realistic, dynamic scenarios. In this work, we present LiveMCP-101, a benchmark of 101 carefully curated real-world queries, refined through iterative LLM rewriting and manual review, that require coordinated use of multiple MCP tools including web search, file operations, mathematical reasoning, and data analysis. Moreover, we introduce a novel evaluation approach that leverages ground-truth execution plans rather than raw API outputs, better reflecting the evolving nature of real-world environments. Experiments show that even frontier LLMs achieve a success rate below 60%, highlighting major challenges in tool orchestration. Detailed ablations and error analysis further reveal distinct failure modes and inefficiencies in token usage, pointing to concrete directions for advancing current models. LiveMCP-101 sets a rigorous standard for evaluating real-world agent capabilities, advancing toward autonomous AI systems that reliably execute complex tasks through tool use.