Agentic-MME: What Agentic Capability Really Brings to Multimodal Intelligence?
Qianshan Wei, Yishan Yang, Siyi Wang, Jinglin Chen, Binyu Wang, Jiaming Wang, Shuang Chen, Zechen Li, Yang Shi, Yuqi Tang, Weining Wang, Yi Yu, Chaoyou Fu, Qi Li, Yi-Fan Zhang
2026-04-06
Summary
This paper introduces a new way to test how well AI models that can 'see' and 'search' the internet can actually solve complex, real-world problems. These models, called Multimodal Large Language Models, are becoming more like agents that can take actions, but current tests aren't good at checking *how* they solve problems, only *whether* they get the right answer.
What's the problem?
Existing tests for these AI models don't really check if the models are using their tools—like image recognition or web search—correctly or efficiently. They just look at the final answer. This means we don't know if the AI is actually thinking through the problem, or just getting lucky. It's like giving someone a calculator and only checking if the final math problem is right, not if they used the calculator properly or even needed it at all.
What's the solution?
The researchers created a benchmark called Agentic-MME. It includes 418 tasks across six domains, spanning three difficulty levels. What makes it different is that it doesn't just check the final answer; it examines each step the AI takes, verifying whether tools were invoked correctly and efficiently. The researchers also compared the AI's steps to how a human would solve the same problem, measuring how much 'overthinking' the AI did. They built a unified evaluation framework that safely runs code in a sandbox and accesses APIs as part of the testing process.
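The idea of scoring each step against checkpoints, plus an overthinking measure against a human trajectory, can be sketched roughly as follows. Note this is a minimal illustration: the `Checkpoint` structure, field names, and the exact overthinking formula are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of process-level scoring in the spirit of Agentic-MME.
# Checkpoint format and the overthinking formula are assumptions, not from the paper.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    tool: str        # tool expected at this step, e.g. "web_search" or "crop_image"
    passed: bool     # did the model's intermediate state satisfy this checkpoint?

def stepwise_accuracy(checkpoints: list[Checkpoint]) -> float:
    """Fraction of stepwise checkpoints the model's trajectory satisfies."""
    if not checkpoints:
        return 0.0
    return sum(cp.passed for cp in checkpoints) / len(checkpoints)

def overthinking(model_steps: int, human_steps: int) -> float:
    """Extra tool calls relative to the human reference trajectory (0.0 = as efficient)."""
    return max(0.0, (model_steps - human_steps) / human_steps)

# Example: 3 of 4 checkpoints pass; the model used 7 tool calls vs. a 5-step human trace.
trace = [Checkpoint("web_search", True), Checkpoint("crop_image", True),
         Checkpoint("web_search", False), Checkpoint("ocr", True)]
print(stepwise_accuracy(trace))   # 0.75
print(overthinking(7, 5))         # 0.4
```

The key contrast with answer-only evaluation is that a trajectory can reach the right answer while still failing checkpoints (wrong tool, wrong order) or scoring high on overthinking (far more steps than a human needs).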
Why does it matter?
This work is important because it provides a more reliable way to evaluate these powerful AI models. Knowing *how* an AI solves a problem is crucial for building trust and ensuring they're used responsibly. The results show that even the best models, like Gemini3-pro, still struggle with complex tasks, highlighting areas where further improvement is needed to make these AI agents truly helpful in the real world.
Abstract
Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and evaluate primarily by final answers. Consequently, they cannot verify whether tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for multimodal agentic capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints that average 10+ person-hours of manual annotation per task. Each task includes a unified evaluation framework supporting sandboxed code and APIs, alongside a human reference trajectory annotated with stepwise checkpoints along dual axes: an S-axis and a V-axis. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and quantify efficiency via an overthinking metric relative to human trajectories. Experimental results show that the best model, Gemini3-pro, achieves 56.3% overall accuracy, which falls sharply to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.