
MM-BrowseComp: A Comprehensive Benchmark for Multimodal Browsing Agents

Shilong Li, Xingyuan Bu, Wenjie Wang, Jiaheng Liu, Jun Dong, Haoyang He, Hao Lu, Haozhe Zhang, Chenchen Jing, Zhen Li, Chuanhao Li, Jiayi Tian, Chenchen Zhang, Tianhao Peng, Yancheng He, Jihao Gu, Yuanxing Zhang, Jian Yang, Ge Zhang, Wenhao Huang, Wangchunshu Zhou, Zhaoxiang Zhang

2025-08-20


Summary

This paper introduces a new way to test AI agents that browse the web, focusing on how well they can understand information that isn't just text, such as pictures and videos. It shows that current AI models are still weak at this kind of multimodal understanding.

What's the problem?

Existing tests for AI web browsing agents mainly look at their ability to understand text. However, a lot of important information on the internet is presented in images and videos, which current AI agents often struggle to interpret and use effectively during their searches. This means we don't really know how well they can handle real-world, multimodal web content.

What's the solution?

The researchers created a new benchmark called MM-BrowseComp, which includes 224 complex, hand-crafted questions designed specifically to challenge AI agents' ability to find and reason over information embedded in images and videos on webpages. They also provide a verified checklist for each question, which breaks the answer down into intermediate steps and allows a detailed analysis of how agents used multimodal information along the way. When they evaluated state-of-the-art models, even the best one (OpenAI's o3 with tools) answered only about 29% of the questions correctly, confirming the limits of current multimodal reasoning abilities.
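To make the checklist idea concrete, here is a minimal, hypothetical sketch of how a benchmark question with a per-step checklist might be represented and scored. The class names, fields, and scoring functions are our own illustration of the general approach, not the authors' released evaluation code.

```python
# Illustrative sketch only: data structures and scoring are assumptions,
# not the official MM-BrowseComp evaluation pipeline.
from dataclasses import dataclass, field


@dataclass
class ChecklistItem:
    description: str          # e.g. "identified the correct frame in the video"
    requires_image: bool      # whether this step depends on visual content
    satisfied: bool = False   # filled in by a judge (human or LLM)


@dataclass
class BenchmarkQuestion:
    prompt: str
    image_paths: list[str] = field(default_factory=list)
    answer: str = ""
    checklist: list[ChecklistItem] = field(default_factory=list)


def final_accuracy(question: BenchmarkQuestion, predicted: str) -> bool:
    """Strict match on the final answer, in the spirit of BrowseComp-style scoring."""
    return predicted.strip().lower() == question.answer.strip().lower()


def checklist_coverage(question: BenchmarkQuestion) -> float:
    """Fraction of intermediate steps the agent satisfied, enabling
    fine-grained analysis of where multimodal retrieval or reasoning failed."""
    if not question.checklist:
        return 0.0
    return sum(item.satisfied for item in question.checklist) / len(question.checklist)
```

Separating the final-answer score from per-step checklist coverage is what lets an evaluation distinguish "got the answer wrong" from "failed specifically at a step that required reading an image or video."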

Why it matters?

This work is important because it highlights a major weakness in current AI agents that are supposed to interact with the real world through the internet. By showing that they struggle with visual and video information, it points out the need for AI to become much better at understanding content in all its forms, not just text, to be truly useful for deep web searches and other tasks.

Abstract

AI agents with advanced reasoning and tool use capabilities have demonstrated impressive performance in web browsing for deep search. While existing benchmarks such as BrowseComp evaluate these browsing abilities, they primarily focus on textual information, overlooking the prevalence of multimodal content. To bridge this gap, we introduce MM-BrowseComp, a novel benchmark comprising 224 challenging, hand-crafted questions specifically designed to assess agents' multimodal retrieval and reasoning capabilities. These questions often incorporate images in prompts, and crucial information encountered during the search and reasoning process may also be embedded within images or videos on webpages. Consequently, methods relying solely on text prove insufficient for our benchmark. Additionally, we provide a verified checklist for each question, enabling fine-grained analysis of multimodal dependencies and reasoning paths. Our comprehensive evaluation of state-of-the-art models on MM-BrowseComp reveals that even top models like OpenAI o3 with tools achieve only 29.02% accuracy, highlighting the suboptimal multimodal capabilities and lack of native multimodal reasoning in current models.