VR-Thinker: Boosting Video Reward Models through Thinking-with-Image Reasoning

Qunzhong Wang, Jie Liu, Jiajun Liang, Yilei Jiang, Yuanxing Zhang, Jinyuan Chen, Yaozhi Zheng, Xintao Wang, Pengfei Wan, Xiangyu Yue, Jiaheng Liu

2025-10-17

Summary

This paper introduces a new method, called VideoReward Thinker (VR-Thinker), that improves how AI reward models understand and evaluate videos, particularly videos produced by generative models. The goal is to make the AI's judgments more accurate and reliable.

What's the problem?

Current AI models that judge videos struggle with two main issues. First, video frames take up a large share of the model's 'memory' (its context window), so fewer frames can be included and fine-grained details are lost. Second, all of the visual information is packed into the prompt at the very beginning, so as the AI reasons step by step it tends to make things up (hallucinate) or forget key parts of the video.

What's the solution?

The researchers developed VideoReward Thinker, which lets the AI actively 'think' with images. Instead of receiving all of the video information at once, the AI can choose specific frames to inspect and keeps a limited, running visual 'memory' of what it has seen, updating its understanding as it 'watches' the video. The model is trained in three stages: a cold start on carefully curated visual chain-of-thought examples, fine-tuning on its own fully correct reasoning traces (rejection sampling), and reinforcement learning with Group Relative Policy Optimization (GRPO) to further strengthen its reasoning skills.
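To make the "choose frames and keep a running memory" idea concrete, here is a minimal sketch of such a thinking-with-image loop. This is an illustration under stated assumptions, not the paper's implementation: `propose_frame` is a hypothetical stand-in for the reward model's learned 'select frame' operation, and a bounded `deque` plays the role of the configurable visual memory window.

```python
from collections import deque

def propose_frame(step, n_frames):
    # Toy stand-in policy: a coarse sweep over the video.
    # In VR-Thinker this decision comes from the reward model itself.
    return min(step * max(1, n_frames // 4), n_frames - 1)

def evaluate_video(frames, memory_size=4, max_steps=8):
    """Sketch of a thinking-with-image loop: the model inspects frames
    one at a time, keeping only the last `memory_size` frames in its
    visual memory so the context budget stays bounded."""
    memory = deque(maxlen=memory_size)  # configurable visual memory window
    notes = []
    for step in range(max_steps):
        idx = propose_frame(step, len(frames))
        memory.append(frames[idx])  # old evidence is evicted automatically
        notes.append(f"step {step}: inspected frame {idx}")
        if idx >= len(frames) - 1:  # reached the end of the video
            break
    return notes, list(memory)
```

The key design point this sketch captures is that visual evidence is acquired incrementally and the memory window stays fixed in size, so longer videos do not blow up the context.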

Why it matters?

This research is important because it significantly improves how accurately AI models understand and evaluate videos, especially longer ones. A 7B-parameter VR-Thinker achieves state-of-the-art accuracy among open-source models on several video preference benchmarks (80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video), meaning AI-generated videos can be judged more accurately and reliably, which in turn supports better-quality generated content.

Abstract

Recent advancements in multimodal reward models (RMs) have substantially improved post-training for visual generative models. However, current RMs face inherent limitations: (1) visual inputs consume large context budgets, forcing fewer frames and causing loss of fine-grained details; and (2) all visual information is packed into the initial prompt, exacerbating hallucination and forgetting during chain-of-thought reasoning. To overcome these issues, we introduce VideoReward Thinker (VR-Thinker), a thinking-with-image framework that equips the RM with visual reasoning operations (e.g., select frame) and a configurable visual memory window. This allows the RM to actively acquire and update visual evidence within context limits, improving reasoning fidelity and reliability. We activate visual reasoning via a reinforcement fine-tuning pipeline: (i) Cold Start with curated visual chain-of-thought data to distill basic reasoning skills and operation formatting; (ii) select samples whose per-dimension and overall judgments are all correct, then conduct Rejection sampling Fine-Tuning on these high-quality traces to further enhance reasoning; and (iii) apply Group Relative Policy Optimization (GRPO) to strengthen reasoning. Our approach delivers state-of-the-art accuracy among open-source models on video preference benchmarks, especially for longer videos: a 7B VR-Thinker achieves 80.5% on VideoGen Reward, 82.3% on GenAI-Bench, and 75.6% on MJ-Bench-Video. These results validate the effectiveness and promise of thinking-with-image multimodal reward modeling.
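The final training stage uses GRPO, whose defining feature is that advantages are computed by normalizing each sampled response's reward against the statistics of its own group, with no learned value critic. A minimal sketch of that group-relative normalization (the function name and population-std choice are illustrative assumptions, not taken from the paper):

```python
def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages as used in GRPO: normalize each reward
    in a group of sampled responses by the group's mean and standard
    deviation. Responses better than their group average get positive
    advantage; worse ones get negative advantage."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```

For example, if four sampled judgments in a group score `[1, 0, 1, 0]` (correct vs. incorrect), the correct ones receive advantage close to +1 and the incorrect ones close to -1, and the policy is updated to make the high-advantage reasoning traces more likely.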