DeepEyesV2: Toward Agentic Multimodal Model

Jack Hong, Chenxiao Zhao, ChengLin Zhu, Weiheng Lu, Guohai Xu, Xing Yu

2025-11-10

Summary

This research focuses on creating AI models that not only understand both images and text but also actively *do* things – like run code or search the internet – to help them solve problems. The paper introduces a new model called DeepEyesV2 and explores the best ways to build these 'agentic' multimodal models.

What's the problem?

Simply teaching these models through trial and error (reinforcement learning) wasn't enough to get them to use tools reliably. The models struggled to figure out *when* and *how* to use tools like a calculator or a web search to improve their answers. It's like trying to learn to cook just by tasting dishes and guessing what ingredients were used – you need some initial guidance.

What's the solution?

The researchers used a two-step approach. First, they 'cold-started' the model by giving it examples specifically designed to show it how to use tools. This established basic tool-use patterns. Then, they used reinforcement learning to refine those skills and allow the model to learn more complex combinations of tools and decide which tools to use based on the specific situation. They also created a new, challenging test called RealX-Bench to really push the model's abilities.

Why it matters?

This work is important because it provides a roadmap for building more capable AI systems. These systems won't just passively receive information; they'll actively seek it out and use tools to solve problems, making them much more useful in real-world scenarios. The research also highlights the importance of carefully designing training data and using a staged training process to get the best results.

Abstract

Agentic multimodal models should not only comprehend text and images, but also actively invoke external tools, such as code execution environments and web search, and integrate these operations into reasoning. In this work, we introduce DeepEyesV2 and explore how to build an agentic multimodal model from the perspectives of data construction, training methods, and model evaluation. We observe that direct reinforcement learning alone fails to induce robust tool-use behavior. This phenomenon motivates a two-stage training pipeline: a cold-start stage to establish tool-use patterns, and a reinforcement learning stage to further refine tool invocation. We curate a diverse, moderately challenging training dataset, specifically including examples where tool use is beneficial. We further introduce RealX-Bench, a comprehensive benchmark designed to evaluate real-world multimodal reasoning, which inherently requires the integration of multiple capabilities, including perception, search, and reasoning. We evaluate DeepEyesV2 on RealX-Bench and other representative benchmarks, demonstrating its effectiveness across real-world understanding, mathematical reasoning, and search-intensive tasks. Moreover, DeepEyesV2 exhibits task-adaptive tool invocation, tending to use image operations for perception tasks and numerical computations for reasoning tasks. Reinforcement learning further enables complex tool combinations and allows the model to selectively invoke tools based on context. We hope our study can provide guidance for the community in developing agentic multimodal models.
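To make the abstract's idea of "integrating tool operations into reasoning" concrete, here is a minimal sketch of an agentic loop: the model's output is checked for a tool call, the tool runs, and its result is fed back into the context for the next reasoning step. This is an illustrative assumption, not DeepEyesV2's actual interface – `generate`, `run_code`, `web_search`, and the `<tool:…>` tag format are all hypothetical stand-ins.

```python
import re

def run_code(snippet: str) -> str:
    # Stand-in for a sandboxed code execution environment.
    try:
        return str(eval(snippet, {"__builtins__": {}}))
    except Exception as e:
        return f"error: {e}"

def web_search(query: str) -> str:
    # Stand-in for a real web search backend.
    return f"[search results for: {query}]"

# Hypothetical tag format the model might emit to request a tool.
TOOL_PATTERN = re.compile(r"<tool:(code|search)>(.*?)</tool>", re.S)

def agentic_loop(generate, prompt: str, max_steps: int = 5) -> str:
    """Alternate between model generation and tool execution until the
    model produces a final answer with no tool call."""
    context = prompt
    for _ in range(max_steps):
        output = generate(context)
        match = TOOL_PATTERN.search(output)
        if match is None:  # no tool call -> treat output as final answer
            return output
        tool, arg = match.groups()
        result = run_code(arg) if tool == "code" else web_search(arg)
        # Append the tool result so the next reasoning step can use it.
        context += output + f"\n<result>{result}</result>\n"
    return context
```

A toy run with a scripted `generate` shows the flow: the model first asks for code execution, then answers once the result appears in its context. The paper's observation that tool use must first be taught via cold-start data is about making real models emit such calls reliably, which pure RL failed to induce.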