DeepImageSearch: Benchmarking Multimodal Agents for Context-Aware Image Retrieval in Visual Histories
Chenlong Deng, Mengjie Deng, Junjie Wu, Dun Zeng, Teng Wang, Qingsong Xie, Jiadeng Huang, Shengjie Ma, Changwang Zhang, Zhaoxiang Wang, Jun Wang, Yutao Zhu, Zhicheng Dou
2026-02-17
Summary
This paper introduces a new way to search for images: instead of simply matching what's *in* a single image, it retrieves images by understanding the context that unfolds across a sequence of images over time.
What's the problem?
Current image search systems treat each image as an independent item, ignoring that real-world visual information often unfolds as a sequence. Imagine trying to find a specific moment in a video: you need to know what happened *before* and *after* it to pinpoint the right frame. Existing systems aren't built for this kind of 'visual storytelling' and can't follow how images relate to one another along a timeline.
What's the solution?
The researchers propose DeepImageSearch, a task that reframes image retrieval as an 'agent' exploring a visual history. Instead of returning a ranked list of images in one shot, the agent plans a series of steps to locate the target image, using clues from the surrounding images. They also build a new benchmark, DISBench, specifically designed to test this ability. Because writing these context-dependent queries by hand is slow, they combine vision-language models with human annotators: the models automatically mine connections between images, and humans then verify them. Finally, they provide a baseline agent built from fine-grained tools and a dual-memory system that helps it keep track of evidence over long searches. A rough sketch of the search loop is given below.
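To make the idea concrete, here is a minimal Python sketch (not the authors' code) of what such an agentic search loop could look like. The visual history is represented by plain captions, and `toy_policy` stands in for the vision-language model that decides which frame to inspect next; all function and field names are illustrative assumptions.

```python
# Minimal sketch of agentic image retrieval over a visual history.
# The policy is a stand-in for a vision-language model that, given the query
# and what has been seen so far, picks the next frame to inspect or stops.

from dataclasses import dataclass, field


@dataclass
class SearchState:
    query: str                                       # context-dependent query
    inspected: list = field(default_factory=list)    # frames already examined
    notes: list = field(default_factory=list)        # observations gathered so far


def toy_policy(state, history):
    """Stand-in planner: inspect the next unseen frame and stop when every
    query keyword appears in its caption (a real agent would reason over
    surrounding frames, not just keywords)."""
    keywords = state.query.lower().split()
    for idx, caption in enumerate(history):
        if idx in state.inspected:
            continue
        state.inspected.append(idx)
        state.notes.append(f"frame {idx}: {caption}")
        if all(word in caption for word in keywords):
            return ("stop", idx)
        return ("continue", idx)   # one frame inspected per step
    return ("stop", None)


def agentic_search(query, history, policy=toy_policy, max_steps=20):
    """Iteratively explore the history instead of ranking frames in isolation."""
    state = SearchState(query=query)
    answer = None
    for _ in range(max_steps):
        action, frame = policy(state, history)
        if action == "stop":
            answer = frame
            break
    return answer, state.notes


if __name__ == "__main__":
    # A toy "visual history": captions standing in for raw frames.
    history = [
        "guests arriving at the party",
        "candles being lit on the cake",
        "cake being cut at the table",
        "guests eating dessert",
    ]
    target, trace = agentic_search("cake being cut", history)
    print("retrieved frame:", target)
    print("reasoning trace:", trace)
```

The point of the sketch is the control flow: the agent accumulates a trace of what it has seen and decides its next step from that trace, rather than scoring every image against the query independently.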
Why it matters?
This work is important because it shows that future image search systems need to be more intelligent and understand context, not just recognize objects. It highlights the need for systems that can 'reason' through visual information the way humans do, which is crucial for applications such as video analysis, robotics, and other tasks that require following visual information over time.
Abstract
Existing multimodal retrieval systems excel at semantic matching but implicitly assume that query-image relevance can be measured in isolation. This paradigm overlooks the rich dependencies inherent in realistic visual streams, where information is distributed across temporal sequences rather than confined to single snapshots. To bridge this gap, we introduce DeepImageSearch, a novel agentic paradigm that reformulates image retrieval as an autonomous exploration task. Models must plan and perform multi-step reasoning over raw visual histories to locate targets based on implicit contextual cues. We construct DISBench, a challenging benchmark built on interconnected visual data. To address the scalability challenge of creating context-dependent queries, we propose a human-model collaborative pipeline that employs vision-language models to mine latent spatiotemporal associations, effectively offloading intensive context discovery before human verification. Furthermore, we build a robust baseline using a modular agent framework equipped with fine-grained tools and a dual-memory system for long-horizon navigation. Extensive experiments demonstrate that DISBench poses significant challenges to state-of-the-art models, highlighting the necessity of incorporating agentic reasoning into next-generation retrieval systems.
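The abstract mentions a "dual-memory system for long-horizon navigation" without detailing it here. The following is a speculative Python sketch of one plausible reading, in which a small working memory holds recent raw observations while a long-term store keeps compressed notes so evidence found early in a long search is not lost. Class and method names are assumptions, not the paper's API.

```python
# Illustrative sketch (not the paper's implementation) of a dual-memory agent state:
# a bounded working memory of recent observations plus a persistent long-term store
# of compressed summaries, combined into the context handed to the planner each step.

from collections import deque


class DualMemory:
    def __init__(self, working_capacity=5):
        self.working = deque(maxlen=working_capacity)  # recent, detailed observations
        self.long_term = []                             # compressed, persistent notes

    def observe(self, frame_id, observation):
        """Record a new observation; items about to be evicted from working
        memory are summarized into long-term memory instead of being lost."""
        if len(self.working) == self.working.maxlen:
            old_id, old_obs = self.working[0]
            self.long_term.append(f"frame {old_id}: {old_obs[:40]}")  # crude summary
        self.working.append((frame_id, observation))

    def context(self):
        """Assemble the context passed to the planner at each step."""
        recent = [f"frame {i}: {o}" for i, o in self.working]
        return {"recent": recent, "summary": self.long_term}


if __name__ == "__main__":
    mem = DualMemory(working_capacity=3)
    for i in range(6):
        mem.observe(i, f"caption of frame {i} with some visual details")
    print(mem.context())
```

Whatever the paper's exact design, the motivation for splitting memory this way is that long-horizon exploration quickly exceeds a model's context budget, so recent detail and older evidence have to be kept at different levels of granularity.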