VIDEOP2R: Video Understanding from Perception to Reasoning

Yifan Jiang, Yueying Wang, Rui Zhao, Toufiq Parag, Zhimin Chen, Zhenyu Liao, Jayakrishnan Unnikrishnan

2025-11-19

Summary

This paper focuses on improving how well large language models can understand and reason about videos. These models, called video language models, have become quite capable, but they still struggle with complex video reasoning tasks.

What's the problem?

Existing methods for improving language models, like a two-step process of initial training followed by reinforcement learning, don't easily translate to video. Videos are more complex than text because you first need to *see* what's happening (perception) and *then* think about it (reasoning). Current techniques treat these as one step, which isn't ideal for video understanding.

What's the solution?

The researchers created a new framework called VideoP2R. It breaks down the process into two distinct stages, mirroring how we understand videos. First, they created a large dataset of videos with detailed, step-by-step explanations of both what's happening in the video *and* the reasoning behind answers. Then, they used a special reinforcement learning technique that gives separate rewards for good perception and good reasoning. This helps the model learn to do both well.
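To make the separate-rewards idea concrete, here is a minimal sketch (hypothetical, not the authors' implementation) of how a group-relative policy optimization step could score perception and reasoning independently and then combine them. The function names and reward values are illustrative assumptions.

```python
# Hypothetical sketch of process-aware rewards: normalize rewards within a
# sampled group (GRPO-style), separately for perception and reasoning,
# then sum the two advantages per response.

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize a group of scalar rewards to zero mean, unit variance."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

def process_aware_advantages(perception_rewards, reasoning_rewards):
    """Score perception and reasoning separately, then combine per response."""
    adv_p = group_relative_advantages(perception_rewards)
    adv_r = group_relative_advantages(reasoning_rewards)
    return [p + r for p, r in zip(adv_p, adv_r)]

# Example: four sampled responses to one video question.
adv = process_aware_advantages(
    perception_rewards=[1.0, 0.0, 1.0, 0.0],  # e.g., was the scene described correctly?
    reasoning_rewards=[1.0, 1.0, 0.0, 0.0],   # e.g., was the final answer correct?
)
```

Under this sketch, a response that both perceives and reasons correctly receives the highest combined advantage, while one that fails at both is penalized on each process independently.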

Why it matters?

This work is important because it significantly improves the ability of video language models to perform complex reasoning tasks. The new framework achieves top results on several video understanding tests, showing that explicitly modeling perception and reasoning as separate steps is a powerful approach. This could lead to better AI systems that can truly understand and interact with the visual world.

Abstract

Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that the model's perception output is information-sufficient for downstream reasoning.