InSight-o3: Empowering Multimodal Foundation Models with Generalized Visual Search

Kaican Li, Lewei Yao, Jiannan Wu, Tiezheng Yu, Jierun Chen, Haoli Bai, Lu Hou, Lanqing Hong, Wei Zhang, Nevin L. Zhang

2025-12-29

Summary

This paper focuses on improving how well AI can understand and reason about images, moving beyond simply 'seeing' a picture to actually 'thinking with images'. It introduces a new benchmark for testing these abilities and a new system that helps AI perform better on complex visual tasks.

What's the problem?

Current AI agents struggle with tasks that require detailed reasoning about images, especially when the relevant information is spread across different parts of a picture. Think about analyzing a complicated chart or following directions on a map – these require piecing together visual clues and understanding the relationships between them. Even the most advanced existing AI systems aren't very good at this kind of multi-step visual reasoning.

What's the solution?

The researchers created a challenging benchmark called O3-Bench specifically to test this kind of visual reasoning. They then developed a system called InSight-o3, which uses two AI 'agents' working together: one that focuses on reasoning (vReasoner) and another that specializes in locating things within an image from a written description (vSearcher). The vSearcher handles 'generalized visual search': it can find not just simple objects, but also fuzzy concepts or relationships described in free-form language. It was trained with reinforcement learning, and it can be plugged into existing AI models to make them better at visual tasks, as sketched below.
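
To make the two-agent setup concrete, here is a minimal sketch of how a reasoning agent might delegate region lookups to a search agent. This is not the authors' code: the class and method names (Region, VSearcher.locate, VReasoner.answer, and so on) are hypothetical placeholders, and in InSight-o3 both roles are played by multimodal LLMs, with the vSearcher purpose-trained via reinforcement learning.

```python
# A minimal sketch of the reasoner/searcher pattern, assuming a hypothetical
# interface. None of these names come from the InSight-o3 codebase.

from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Region:
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) in pixel coordinates
    label: str                      # the free-form query this region satisfies


class VSearcher:
    """Stand-in for the search agent: maps a free-form query to image regions."""

    def locate(self, image, query: str) -> List[Region]:
        # In the paper this role is played by a multimodal LLM trained with
        # reinforcement learning to handle relational, fuzzy, or conceptual
        # queries, not just object names. Left unimplemented here.
        raise NotImplementedError


class VReasoner:
    """Stand-in for the reasoning agent: plans queries and composes answers."""

    def __init__(self, searcher: VSearcher, max_rounds: int = 5):
        self.searcher = searcher
        self.max_rounds = max_rounds

    def plan_query(self, question: str, evidence: list) -> Optional[str]:
        # Decide what to look for next given the evidence so far, or return
        # None to stop searching. A frontier multimodal model fills this role.
        raise NotImplementedError

    def compose_answer(self, question: str, evidence: list) -> str:
        # Produce the final answer from the question and the collected regions.
        raise NotImplementedError

    def answer(self, image, question: str) -> str:
        # Multi-step loop: reason about what is missing, ask the searcher to
        # find it, fold the result back into the reasoning context, repeat.
        evidence = []
        for _ in range(self.max_rounds):
            query = self.plan_query(question, evidence)
            if query is None:
                break
            evidence.append((query, self.searcher.locate(image, query)))
        return self.compose_answer(question, evidence)
```

The point of the split is that the vReasoner can be any capable multimodal model, while the plug-and-play vSearcher supplies the fine-grained localization such models typically lack.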

Why it matters?

This work is important because it represents a step towards creating AI that can truly understand and interact with the real world. Being able to reason about images is crucial for many applications, like helping people with visual impairments, automating document analysis, or building robots that can navigate complex environments. By identifying the weaknesses in current AI and providing a new system to address them, this research paves the way for more powerful and versatile AI systems.

Abstract

The ability for AI agents to "think with images" requires a sophisticated blend of reasoning and perception. However, current open multimodal agents still largely fall short on the reasoning aspect crucial for real-world tasks like analyzing documents with dense charts/diagrams and navigating maps. To address this gap, we introduce O3-Bench, a new benchmark designed to evaluate multimodal reasoning with interleaved attention to visual details. O3-Bench features challenging problems that require agents to piece together subtle visual information from distinct image areas through multi-step reasoning. The problems are highly challenging even for frontier systems like OpenAI o3, which only obtains 40.8% accuracy on O3-Bench. To make progress, we propose InSight-o3, a multi-agent framework consisting of a visual reasoning agent (vReasoner) and a visual search agent (vSearcher) for which we introduce the task of generalized visual search -- locating relational, fuzzy, or conceptual regions described in free-form language, beyond just simple objects or figures in natural images. We then present a multimodal LLM purpose-trained for this task via reinforcement learning. As a plug-and-play agent, our vSearcher empowers frontier multimodal models (as vReasoners), significantly improving their performance on a wide range of benchmarks. This marks a concrete step towards powerful o3-like open systems. Our code and dataset can be found at https://github.com/m-Just/InSight-o3 .