Towards Long-horizon Agentic Multimodal Search

Yifan Du, Zikang Liu, Jinbiao Peng, Jie Wu, Junyi Li, Jinyang Li, Wayne Xin Zhao, Ji-Rong Wen

2026-04-15

Summary

This paper introduces a new system, LMM-Searcher, designed to help AI agents carry out long-horizon tasks that require both looking at images and reading text across many steps, like researching a topic online.

What's the problem?

Current AI agents struggle when they must combine images and text over extended tasks. Keeping every image in the agent's working context quickly blows up its size, so the agent either drops crucial visual details or slows down under the sheer volume of data. In short, agents have trouble keeping track of everything they have seen and read over many steps.

What's the solution?

The researchers tackled this by building a system that doesn't keep all the images in the agent's context. Instead, it stores each image in an external file system and refers to it by a short, unique identifier (UID), which keeps the context size manageable. The agent also gets a dedicated fetch-image tool so it can load a specific image only when it's actually needed, rather than carrying everything at once. Finally, the researchers built a data synthesis pipeline to generate challenging cross-modal tasks and used 12,000 resulting trajectories to fine-tune a powerful model, Qwen3-VL-Thinking-30A3B, into an agent that uses this system effectively.

Why it matters?

This work matters because it lets AI agents handle much more complex, real-world tasks that require understanding both visual and textual information over extended interactions. It improves performance on existing benchmarks and points toward more capable, adaptable AI systems for tasks like in-depth research and complex problem-solving.

Abstract

Multimodal deep search agents have shown great potential in solving complex tasks by iteratively collecting textual and visual evidence. However, managing the heterogeneous information and high token costs associated with multimodal inputs over long horizons remains a critical challenge, as existing methods often suffer from context explosion or the loss of crucial visual signals. To address this, we propose a novel Long-horizon MultiModal deep search framework, named LMM-Searcher, centered on a file-based visual representation mechanism. By offloading visual assets to an external file system and mapping them to lightweight textual identifiers (UIDs), our approach mitigates context overhead while preserving multimodal information for future access. We equip the agent with a tailored fetch-image tool, enabling a progressive, on-demand visual loading strategy for active perception. Furthermore, we introduce a data synthesis pipeline designed to generate queries requiring complex cross-modal multi-hop reasoning. Using this pipeline, we distill 12K high-quality trajectories to fine-tune Qwen3-VL-Thinking-30A3B into a specialized multimodal deep search agent. Extensive experiments across four benchmarks demonstrate that our method successfully scales to 100-turn search horizons, achieving state-of-the-art performance among open-source models on challenging long-horizon benchmarks like MM-BrowseComp and MMSearch-Plus, while also exhibiting strong generalizability across different base models. Our code will be released at https://github.com/RUCAIBox/LMM-Searcher.