Video-Thinker: Sparking "Thinking with Videos" via Reinforcement Learning

Shijian Wang, Jiarui Jin, Xingjian Wang, Linxin Song, Runhao Fu, Hecheng Wang, Zongyuan Ge, Yuan Lu, Xuelian Cheng

2025-10-30

Summary

This paper introduces a new method called Video-Thinker that allows AI models to reason about videos more effectively, similar to how they currently reason about images.

What's the problem?

Current AI models, specifically Multimodal Large Language Models (MLLMs), are really good at understanding images and using them to solve problems. However, they struggle to do the same with videos. They haven't figured out how to 'think through' a video and use the information within it to answer questions or make decisions. Essentially, they can 'see' the video but can't really 'understand' what's happening over time.

What's the solution?

The researchers created Video-Thinker, which lets these AI models use their existing abilities to describe what's happening in a video (captioning) and pinpoint specific parts of the video that are important (grounding). The model does this automatically, step-by-step, to build up a line of reasoning. They also created a new dataset, Video-Thinker-10K, to train the AI. The training process involves first teaching the model the correct reasoning format and then strengthening its ability to reason effectively using a technique called Group Relative Policy Optimization.
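The core idea behind Group Relative Policy Optimization is that instead of training a separate value model, the AI samples several answers to the same question and scores each one relative to the group. A minimal sketch of that group-relative advantage computation is below; the function name, the reward scheme, and all details are illustrative assumptions, not taken from the paper.

```python
# Hypothetical sketch of GRPO's group-relative advantage step.
# All names and the 0/1 reward scheme are illustrative assumptions.
import statistics

def grpo_advantages(rewards):
    """Given rewards for a group of sampled responses to the same
    prompt, normalize each reward by the group's mean and standard
    deviation. Responses that beat their siblings get positive
    advantages; no learned critic model is needed."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero
    return [(r - mean) / std for r in rewards]

# Example: four sampled reasoning traces for one video question,
# scored 1.0 if the final answer is correct, else 0.0.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)
```

In this toy example the two correct traces receive positive advantages and the two incorrect ones negative, so the model is nudged toward the reasoning paths that led to right answers.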

Why it matters?

This work is important because it significantly improves the ability of AI to understand and reason about videos. This opens up possibilities for more advanced video analysis, like automatically answering questions about movies, understanding complex events in surveillance footage, or even helping robots interpret their surroundings. The new model, Video-Thinker-7B, is currently the best-performing model of its size (7 billion parameters) on video reasoning tasks.

Abstract

Recent advances in image reasoning methods, particularly "Thinking with Images", have demonstrated remarkable success in Multimodal Large Language Models (MLLMs); however, this dynamic reasoning paradigm has not yet been extended to video reasoning tasks. In this paper, we propose Video-Thinker, which empowers MLLMs to think with videos by autonomously leveraging their intrinsic "grounding" and "captioning" capabilities to generate reasoning clues throughout the inference process. To spark this capability, we construct Video-Thinker-10K, a curated dataset featuring autonomous tool usage within chain-of-thought reasoning sequences. Our training strategy begins with Supervised Fine-Tuning (SFT) to learn the reasoning format, followed by Group Relative Policy Optimization (GRPO) to strengthen this reasoning capability. Through this approach, Video-Thinker enables MLLMs to autonomously navigate grounding and captioning tasks for video reasoning, eliminating the need for constructing and calling external tools. Extensive experiments demonstrate that Video-Thinker achieves significant performance gains on both in-domain tasks and challenging out-of-domain video reasoning benchmarks, including Video-Holmes, CG-Bench-Reasoning, and VRBench. Our Video-Thinker-7B substantially outperforms existing baselines such as Video-R1 and establishes state-of-the-art performance among 7B-sized MLLMs.