MSC-Bench: A Rigorous Benchmark for Multi-Server Tool Orchestration
Jia-Kai Dong, I-Wei Huang, Chun-Tin Wu, Yi-Tien Tsai
2025-10-24
Summary
This paper introduces MSC-Bench, a new way to test how well large language models (LLMs) can use different tools together to solve complex problems, like a digital assistant planning a series of steps.
What's the problem?
Current methods for testing LLMs' ability to use tools aren't very realistic. They usually test each tool in isolation, which doesn't show how well the tools work *together*. This paints an overly optimistic picture of how capable these systems are, because it ignores issues like functionally overlapping tools and the need to coordinate across different servers. It's hard to tell whether an LLM is *actually* solving the problem or just getting lucky, and evaluation often relies on another LLM to judge the answer, which isn't ideal.
What's the solution?
The researchers created MSC-Bench, a benchmark whose problems are designed to be solved using multiple tools in a specific order. They built ground truth by defining "equal function sets": groups of tools that can all achieve the same function, so any of them counts as a correct answer. This lets them score an LLM's performance automatically with objective metrics such as F1 score, instead of relying on another LLM as a judge. The benchmark is organized as a curriculum that starts with simple single-tool tasks and gradually gets harder, testing the LLM's ability to plan complex cross-server actions and to recognize requests that fall outside its capabilities.
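To make the scoring idea concrete, here is a minimal sketch (not the paper's actual implementation) of how a predicted set of tools could be scored with F1 against several equal function sets, taking the best match; the function name and example tool names are hypothetical.

```python
def set_f1(predicted, equal_function_sets):
    """Best F1 of a predicted tool set against any equal function set.

    predicted: set of tool names chosen by the agent.
    equal_function_sets: list of sets, each a ground-truth group of
    tools that all achieve the required function.
    """
    best = 0.0
    for truth in equal_function_sets:
        tp = len(predicted & truth)  # tools that match this ground-truth set
        if tp == 0:
            continue  # no overlap, F1 is 0 for this set
        precision = tp / len(predicted)
        recall = tp / len(truth)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best

# Hypothetical example: the agent picks two tools, and either
# ground-truth set would have satisfied the request.
score = set_f1(
    {"geocode", "get_weather"},
    [{"geocode", "get_weather"}, {"maps_lookup", "get_weather"}],
)
print(score)  # → 1.0 (exact match with the first set)
```

Because the score is computed from set overlap alone, it is fully deterministic, which is what removes the need for an LLM judge.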
Why it matters?
MSC-Bench matters because it provides a more accurate and reliable way to evaluate LLM agents. By exposing weaknesses in current systems, especially in handling complex multi-step tasks and out-of-scope requests, it helps researchers build more robust and efficient AI assistants that can truly orchestrate tools to solve real-world problems.
Abstract
We introduce MSC-Bench, a large-scale benchmark for evaluating multi-hop, end-to-end tool orchestration by LLM agents in a hierarchical Model-Context Protocol (MCP) ecosystem. Existing benchmarks often evaluate tools in isolation, ignoring challenges such as functional overlap and cross-server orchestration, leading to overly optimistic assessments. MSC-Bench addresses these gaps by constructing ground truth through 'equal function sets', allowing objective metrics such as F1 score and reducing the dependency on LLM-as-a-judge evaluation. Organized as a five-level curriculum, it systematically tests agent capabilities from single-tool orchestration to complex cross-server planning, and robustness to out-of-scope requests. Experiments reveal that rigid hierarchies can hinder performance without co-designed strategies, and even state-of-the-art agents exhibit systemic weaknesses in robustness. MSC-Bench provides a diagnostic framework to expose these limitations and guide the development of more capable and efficient tool-using agents. The benchmark and resources are publicly available at https://github.com/snooow1029/MSC_Bench.