
M3-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark

Yang Zhou, Mingyu Zhao, Zhenting Wang, Difei Gu, Bangwei Guo, Ruosong Ye, Ligong Han, Can Jin, Dimitris N. Metaxas

2025-11-25

Summary

This paper introduces a new way to test how well AI models can use different tools, like searching the internet or manipulating images, based on both visual and textual information. It's called M^3-Bench and aims to see if these models can handle complex tasks that require multiple steps and remembering information from earlier steps.

What's the problem?

Current AI models, specifically those that can understand both images and text, aren't very good at using tools in a realistic way. They struggle with tasks that require them to understand what's in an image, use that understanding to choose the right tool, and then remember what they did with one tool to help them use another. Existing tests don't accurately reflect these kinds of complex, real-world scenarios where tools depend on each other and information needs to be saved between steps.

What's the solution?

The researchers created M^3-Bench, a benchmark with 231 different tools across 28 servers. They developed a system to carefully track each step an AI model takes when using these tools, making sure each action is clearly linked to the model's reasoning. They also created a way to evaluate the models not just on whether they complete the task, but also on *how* well they use the tools and if their reasoning makes sense. Human reviewers and other AI models were used to verify the results and assess the quality of the AI's work.
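To compare a model's tool use against a reference trajectory, the paper's abstract describes first serializing each tool call into a signature string that can be embedded and matched. A minimal sketch of what such serialization might look like (the function name and JSON-based format are illustrative assumptions, not the paper's exact scheme):

```python
import json

def serialize_tool_call(name, arguments):
    """Flatten a tool call into a canonical signature string.

    Sorting the argument keys makes the serialization deterministic,
    so two calls with the same name and arguments always yield the
    same signature before being fed to a sentence encoder.
    """
    args = json.dumps(arguments, sort_keys=True, ensure_ascii=False)
    return f"{name}({args})"

# Two calls that differ only in argument order serialize identically.
a = serialize_tool_call("image_search", {"query": "red panda", "top_k": 5})
b = serialize_tool_call("image_search", {"top_k": 5, "query": "red panda"})
```

A canonical string form like this lets the benchmark treat tool calls as text, so off-the-shelf sentence encoders can score how semantically close two calls are.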

Why it matters?

This work is important because it highlights the weaknesses of current AI models in a crucial area: using tools to solve complex problems. By providing a challenging and realistic benchmark, it pushes researchers to develop AI systems that can better understand the world around them, reason logically, and effectively utilize the tools available to them, ultimately leading to more capable and helpful AI assistants.

Abstract

We present M^3-Bench, the first benchmark for evaluating multimodal tool use under the Model Context Protocol. The benchmark targets realistic, multi-hop and multi-threaded workflows that require visual grounding and textual reasoning, cross-tool dependencies, and persistence of intermediate resources across steps. We introduce a similarity-driven alignment that serializes each tool call, embeds signatures with a sentence encoder, and performs similarity-bucketed Hungarian matching to obtain auditable one-to-one correspondences. On top of this alignment, we report interpretable metrics that decouple semantic fidelity from workflow consistency. The benchmark spans 28 servers with 231 tools, and provides standardized trajectories curated through an Executor & Judge pipeline with human verification; an auxiliary judge ensemble of four large language models (LLMs) reports end-task Task Completion and information grounding. Evaluations of representative state-of-the-art Multimodal LLMs (MLLMs) reveal persistent gaps in multimodal MCP tool use, particularly in argument fidelity and structure consistency, underscoring the need for methods that jointly reason over images, text, and tool graphs. Our benchmark's anonymous repository is at https://github.com/EtaYang10th/Open-M3-Bench
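The similarity-bucketed Hungarian matching mentioned in the abstract could be sketched as follows. The cosine-similarity cost, the threshold standing in for bucketing, and the SciPy-based solver are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_tool_calls(pred_emb, ref_emb, threshold=0.5):
    """One-to-one matching between predicted and reference tool-call
    embeddings via the Hungarian algorithm on cosine similarity.

    pred_emb, ref_emb: (n, d) and (m, d) arrays of signature embeddings.
    Returns a list of (pred_idx, ref_idx, similarity) tuples.
    """
    # Normalize rows so the dot product equals cosine similarity.
    p = pred_emb / np.linalg.norm(pred_emb, axis=1, keepdims=True)
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sim = p @ r.T
    # Hungarian matching maximizes total similarity (minimize the negative).
    rows, cols = linear_sum_assignment(-sim)
    # Drop low-similarity pairs; a crude stand-in for similarity bucketing.
    return [(i, j, float(sim[i, j])) for i, j in zip(rows, cols)
            if sim[i, j] >= threshold]

# Toy example: each predicted call clearly corresponds to one reference call.
pred = np.array([[1.0, 0.0], [0.0, 1.0]])
ref = np.array([[0.0, 1.0], [1.0, 0.0]])
matches = match_tool_calls(pred, ref)
```

Because the assignment is one-to-one, every matched pair is auditable: a reviewer can inspect exactly which predicted call was scored against which reference call.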