
V-Thinker: Interactive Thinking with Images

Runqi Qiao, Qiuna Tan, Minghan Yang, Guanting Dong, Peiqing Yang, Shiqiang Lang, Enhui Wan, Xiaowan Wang, Yida Xu, Lan Yang, Chong Sun, Chen Li, Honggang Zhang

2025-11-07


Summary

This paper introduces V-Thinker, a new system designed to help large AI models, called Large Multimodal Models (LMMs), think more effectively using images. It's about making these models not just *look* at images to help solve problems, but actually *interact* with them as part of their reasoning process.

What's the problem?

Current AI models are getting better at using images to help with tasks, but they are still limited. They typically have access to only a small set of visual tools for interacting with images, and their reasoning workflows are often designed for narrow, specific tasks. As a result, they struggle with more complex, open-ended reasoning that requires digging into the fine details of an image and exploring different possibilities.

What's the solution?

The researchers created V-Thinker, which combines two components. First, a "Data Evolution Flywheel" automatically generates large numbers of image-based practice problems and keeps evolving and verifying them along three dimensions: diversity, quality, and difficulty. Second, a progressive training curriculum first aligns the model's perception using point-level supervision (teaching it to locate exactly what a question refers to in the image), and then trains interactive reasoning with a two-stage reinforcement learning framework that rewards the model for making good decisions while it interacts with images. The researchers also built VTBench, an expert-verified benchmark of challenging vision-centric interactive reasoning tasks, to test how well such models perform.
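
To make the progressive curriculum concrete, here is a minimal, self-contained Python sketch of the idea: a perception-alignment stage followed by a reward-driven interactive stage. All names (`Task`, `ToyModel`, `perception_stage`, `interactive_rl_stage`) and the scalar "skill" dynamics are illustrative assumptions for this summary, not the paper's actual implementation.

```python
# Toy sketch (assumed names, not the authors' code): perception alignment first,
# then reinforcement of full interactive reasoning with an outcome reward.
import random
from dataclasses import dataclass

@dataclass
class Task:
    question: str
    target_point: tuple   # (x, y) point-level supervision signal
    answer: str           # final answer used to reward interactive reasoning

class ToyModel:
    """Stand-in for an LMM policy; tracks only a scalar 'skill' level."""
    def __init__(self):
        self.skill = 0.0

    def predict_point(self, task):
        # Noisy guess at the grounded point; gets tighter as skill grows.
        noise = max(0.0, 1.0 - self.skill)
        return (task.target_point[0] + random.uniform(-noise, noise),
                task.target_point[1] + random.uniform(-noise, noise))

    def interact_and_answer(self, task):
        # Pretend interaction (crop/zoom/annotate) that succeeds more often with skill.
        return task.answer if random.random() < min(0.95, 0.3 + self.skill) else "unknown"

    def update(self, signal):
        self.skill += 0.05 * signal

def perception_stage(model, tasks):
    """Stage 1: align perception with point-level supervision."""
    for t in tasks:
        px, py = model.predict_point(t)
        err = (px - t.target_point[0]) ** 2 + (py - t.target_point[1]) ** 2
        model.update(1.0 / (1.0 + err))   # closer point predictions -> stronger update

def interactive_rl_stage(model, tasks):
    """Stage 2: reinforce interactive reasoning with an outcome reward."""
    for t in tasks:
        reward = 1.0 if model.interact_and_answer(t) == t.answer else 0.0
        model.update(reward)

tasks = [Task("What number is on the small sign?", (0.4, 0.7), "42")] * 20
model = ToyModel()
perception_stage(model, tasks)
interactive_rl_stage(model, tasks)
print(f"toy skill after both stages: {model.skill:.2f}")
```

The point of the sketch is the ordering: the model is first rewarded for grounding itself correctly in the image, and only afterwards optimized on whether its interactive reasoning reaches the right answer.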

Why it matters?

This work is important because it pushes AI models closer to being able to truly reason with images, similar to how humans do. By allowing models to interact with images and learn through practice, V-Thinker improves their ability to solve complex problems that require visual understanding and opens the door for more advanced applications in areas like robotics, image analysis, and problem-solving.

Abstract

Empowering Large Multimodal Models (LMMs) to deeply integrate image interaction with long-horizon reasoning capabilities remains a long-standing challenge in this field. Recent advances in vision-centric reasoning explore a promising "Thinking with Images" paradigm for LMMs, marking a shift from image-assisted reasoning to image-interactive thinking. While this milestone enables models to focus on fine-grained image regions, progress remains constrained by limited visual tool spaces and task-specific workflow designs. To bridge this gap, we present V-Thinker, a general-purpose multimodal reasoning assistant that enables interactive, vision-centric thinking through end-to-end reinforcement learning. V-Thinker comprises two key components: (1) a Data Evolution Flywheel that automatically synthesizes, evolves, and verifies interactive reasoning datasets across three dimensions: diversity, quality, and difficulty; and (2) a Visual Progressive Training Curriculum that first aligns perception via point-level supervision, then integrates interactive reasoning through a two-stage reinforcement learning framework. Furthermore, we introduce VTBench, an expert-verified benchmark targeting vision-centric interactive reasoning tasks. Extensive experiments demonstrate that V-Thinker consistently outperforms strong LMM-based baselines in both general and interactive reasoning scenarios, providing valuable insights for advancing image-interactive reasoning applications.
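
For readers who want a mental model of the Data Evolution Flywheel described above, the following is a hypothetical sketch of a synthesize-evolve-verify loop organized around the three stated dimensions. The generator, evolution rule, and verification check are toy stand-ins (assumptions), not the authors' actual pipeline.

```python
# Hypothetical sketch (assumed functions): synthesize candidate problems, evolve them
# toward harder variants, and keep only those that pass a verification/quality filter.
import random

def synthesize(seed_id):
    """Produce a candidate interactive-reasoning problem (stand-in for an LMM generator)."""
    return {"id": seed_id, "difficulty": 1,
            "question": f"Count the objects in region {seed_id}."}

def evolve(problem):
    """Return a harder variant of an existing problem."""
    harder = dict(problem)
    harder["difficulty"] += 1
    harder["question"] += " Then compare the count with the region to its left."
    return harder

def verify(problem):
    """Toy verification: accept a problem with probability that decays with difficulty."""
    return random.random() < 1.0 / problem["difficulty"]

dataset, seen_questions = [], set()
for seed_id in range(50):                          # many distinct seeds for diversity
    problem = synthesize(seed_id)
    for _ in range(3):                             # evolve each seed toward higher difficulty
        candidate = evolve(problem)
        # keep only verified, non-duplicate candidates (quality filter)
        if verify(candidate) and candidate["question"] not in seen_questions:
            problem = candidate
            dataset.append(candidate)
            seen_questions.add(candidate["question"])

print(f"kept {len(dataset)} verified problems")
```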