DeepVideo-R1: Video Reinforcement Fine-Tuning via Difficulty-aware Regressive GRPO
Jinyoung Park, Jeehye Na, Jinyoung Kim, Hyunwoo J. Kim
2025-06-16
Summary
This paper introduces DeepVideo-R1, a method for making large language models better at understanding and reasoning about videos. It trains the model with a regression-based technique called Reg-GRPO and pairs it with difficulty-aware data augmentation, a data preparation strategy that pushes the model to handle harder video reasoning tasks.
What's the problem?
Large language models often struggle to fully understand videos, especially when the task requires reasoning over a sequence of events or complex details that unfold over time. Training these models for video reasoning is hard because videos contain many frames and dense information, and standard training methods do not focus well on the challenging cases that require deep reasoning.
What's the solution?
The solution is Reg-GRPO, a regression-based variant of the GRPO training technique that gives the model more precise feedback on the quality of its video reasoning. The method also uses difficulty-aware data augmentation: training examples are selected or modified to be harder, encouraging the model to improve its reasoning skills over time.
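To make the "regression-based GRPO" idea concrete, here is a minimal sketch in Python. It assumes the standard GRPO recipe of normalizing rewards within a group of responses sampled for the same prompt, and then, instead of a clipped policy-ratio objective, regresses each response's score toward its advantage. The function names and the exact loss form are illustrative assumptions, not the paper's precise formulation.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style group-normalized advantages: each sampled response's
    reward minus the group mean, divided by the group's std deviation.
    (Standard GRPO recipe; epsilon avoids division by zero.)"""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) + 1e-8
    return [(r - mean) / std for r in rewards]

def reg_grpo_loss(scores, rewards):
    """Illustrative regression objective (assumed, not the paper's exact
    loss): mean squared error between per-response model scores and
    their group-relative advantages, replacing the usual clipped
    policy-ratio surrogate."""
    advantages = group_relative_advantages(rewards)
    return sum((s - a) ** 2 for s, a in zip(scores, advantages)) / len(rewards)
```

In a real training loop, `scores` would come from the policy model (e.g. sequence log-probabilities) and `rewards` from a verifier or rubric scoring each sampled video-reasoning response; the regression target gives the model a direct, per-response signal of how much better or worse it did than its own group average.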
Why it matters?
This matters because videos are everywhere, and the ability to understand and reason about them is important for many AI applications, such as video search, video description, and systems or robots that watch and interpret what is happening. DeepVideo-R1 makes AI systems smarter about videos by teaching them to think more deeply and handle tougher tasks, leading to better performance in real-world video understanding.
Abstract
DeepVideo-R1 enhances video reasoning performance using Reg-GRPO, a regression-based GRPO approach, and difficulty-aware data augmentation for video large language models.