DeepMMSearch-R1: Empowering Multimodal LLMs in Multimodal Web Search
Kartik Narayan, Yang Xu, Tian Cao, Kavya Nerella, Vishal M. Patel, Navid Shiee, Peter Grasch, Chao Jia, Yinfei Yang, Zhe Gan
2025-10-15
Summary
This paper introduces DeepMMSearch-R1, an AI model that combines understanding of images and text with the ability to search the web to answer complex questions. It is designed to find information more effectively, and to stay current with changing information, better than existing models.
What's the problem?
Current AI models that try to use information from the internet, such as those based on retrieval-augmented generation (RAG), often struggle because they follow a fixed process, make too many web searches, or don't formulate their search requests well. This wastes time and produces inaccurate answers. They aren't very good at figuring out *when* to search, *what* to search for, or how to improve their searches based on what they find.
What's the solution?
The researchers created DeepMMSearch-R1, which decides on its own when to search the web and can use both text *and* cropped regions of images to guide its searches. It isn't limited to a single search: it can refine its search terms based on the results it gets back, almost like thinking through the problem step by step. The model was trained in two phases: first it learned from curated examples (supervised fine-tuning), then it improved through trial and error (reinforcement learning), using a new dataset called DeepMMSearchVQA, which includes questions requiring both visual and textual understanding.
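The decide-search-refine loop described above can be sketched as a small agent that, on each turn, either answers or calls one of two tools (text search, or image search on a crop) and accumulates the retrieved evidence. This is an illustrative sketch only, not the authors' implementation: the function and tool names (`answer_with_search`, `text_search`, `image_search`) and the action-dict protocol are assumptions, and the tools are stubs standing in for real web-search APIs.

```python
# Minimal sketch of an on-demand, multi-turn multimodal search loop.
# Each turn the model picks an action: answer directly, run a text
# search, or run an image search over a cropped region of the input
# image. Retrieved results are fed back so later queries can be refined.
from dataclasses import dataclass, field

@dataclass
class SearchState:
    question: str
    evidence: list = field(default_factory=list)  # results gathered so far

def text_search(query):
    # Stub: a real system would call a web-search API here.
    return f"web results for: {query}"

def image_search(image, crop_box):
    # Stub: a real system would search with the cropped image region.
    return f"visual matches for crop {crop_box}"

def answer_with_search(model, question, image, max_turns=3):
    """Run up to max_turns tool calls before answering.

    `model` is any callable mapping the current state to an action dict,
    e.g. {"action": "text_search", "query": "..."},
         {"action": "image_search", "crop_box": (x0, y0, x1, y1)}, or
         {"action": "answer", "text": "..."}.
    """
    state = SearchState(question)
    for _ in range(max_turns):
        step = model(state)
        if step["action"] == "answer":
            return step["text"], state.evidence
        if step["action"] == "text_search":
            state.evidence.append(text_search(step["query"]))
        elif step["action"] == "image_search":
            state.evidence.append(image_search(image, step["crop_box"]))
    # Search budget exhausted: force a final answer from the evidence.
    return model(state, force_answer=True)["text"], state.evidence
```

The key design point this sketch tries to capture is that searching is a *decision* made per turn, not a fixed pipeline stage: a scripted or learned policy can answer immediately for questions it already knows, and otherwise issue and iteratively rewrite queries conditioned on what earlier searches returned.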
Why it matters?
This work is important because it moves AI closer to being able to handle real-world questions that require up-to-date knowledge and understanding of both images and text. By making web searching more efficient and intelligent, this research could lead to AI assistants that are much more helpful and reliable when dealing with complex information needs.
Abstract
Multimodal Large Language Models (MLLMs) in real-world applications require access to external knowledge sources and must remain responsive to dynamic, ever-changing real-world information in order to address information-seeking and knowledge-intensive user queries. Existing approaches, such as retrieval-augmented generation (RAG) methods, search agents, and search-equipped MLLMs, often suffer from rigid pipelines, excessive search calls, and poorly constructed search queries, which result in inefficiencies and suboptimal outcomes. To address these limitations, we present DeepMMSearch-R1, the first multimodal LLM capable of performing on-demand, multi-turn web searches and dynamically crafting queries for both image and text search tools. Specifically, DeepMMSearch-R1 can initiate web searches based on relevant crops of the input image, making the image search more effective, and can iteratively adapt text search queries based on retrieved information, thereby enabling self-reflection and self-correction. Our approach relies on a two-stage training pipeline: a cold-start supervised fine-tuning phase followed by online reinforcement learning optimization. For training, we introduce DeepMMSearchVQA, a novel multimodal VQA dataset created through an automated pipeline and intermixed with real-world information from web search tools. This dataset contains diverse, multi-hop queries that integrate textual and visual information, teaching the model when to search, what to search for, which search tool to use, and how to reason over the retrieved information. We conduct extensive experiments across a range of knowledge-intensive benchmarks to demonstrate the superiority of our approach. Finally, we analyze the results and provide insights that are valuable for advancing multimodal web search.