VideoSSR: Video Self-Supervised Reinforcement Learning

Zefeng He, Xiaoye Qu, Yafu Li, Siyuan Huang, Daizong Liu, Yu Cheng

2025-11-12

Summary

This paper focuses on improving how well AI models, specifically Multimodal Large Language Models (MLLMs), can understand videos. It explores a way to create better training data for these models without relying heavily on expensive human labeling.

What's the problem?

Currently, MLLMs are getting better quickly, but the existing video datasets they learn from aren't complex enough to keep up. Creating new, high-quality video datasets with human annotations is very costly and time-consuming. The core issue is a lack of sufficient, challenging training data for these advanced AI models to truly master video understanding.

What's the solution?

The researchers came up with a method that lets videos themselves generate training data. They designed three self-supervised pretext tasks – Anomaly Grounding (finding unusual events in a video), Object Counting, and Temporal Jigsaw (putting shuffled video clips back in the correct order) – that force the AI to analyze the video content closely. Because each task has a checkable answer derived from the video itself, it can supply a verifiable reward for reinforcement learning without human labels. Using these tasks, they built the VideoSSR-30K training dataset and a new learning framework called VideoSSR, and they also constructed a benchmark, VIUBench, which showed that even state-of-the-art models struggle on these tasks.
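To make the idea of "verifiable" self-supervised data concrete, here is a minimal sketch in the spirit of the Temporal Jigsaw task: shuffle a video's segments, record the permutation that restores the original order as the ground-truth answer, and score a model's prediction with an exact-match reward. The function names and reward design are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of Temporal-Jigsaw-style data generation with a
# verifiable reward. Names (make_jigsaw_sample, verifiable_reward) are
# illustrative, not taken from the VideoSSR codebase.
import random

def make_jigsaw_sample(num_segments, seed=None):
    """Shuffle segment indices 0..n-1 and return (shuffled, answer), where
    answer[i] is the position in the shuffled sequence of original clip i,
    so reading the shuffled clips at positions `answer` restores order."""
    rng = random.Random(seed)
    order = list(range(num_segments))     # original temporal order
    shuffled = order[:]
    while shuffled == order:              # ensure a non-trivial shuffle
        rng.shuffle(shuffled)
    answer = [shuffled.index(i) for i in order]
    return shuffled, answer

def verifiable_reward(predicted, answer):
    """Binary reward usable by RLVR: 1.0 only for the exact ordering."""
    return 1.0 if predicted == answer else 0.0

if __name__ == "__main__":
    shuffled, answer = make_jigsaw_sample(4, seed=0)
    print("shuffled:", shuffled, "answer:", answer)
    print("reward:", verifiable_reward(answer, answer))
```

The key property is that the reward requires no human judgment: the correct answer is computed mechanically from the shuffle, so training data can be generated at scale from unlabeled videos.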

Why it matters?

This work is important because it offers a way past the bottleneck of expensive data labeling. By letting videos 'teach' the AI models themselves, we can build more powerful video understanding systems without constant human intervention. The gains reported across 17 benchmarks in four video domains, averaging over 5%, suggest that this self-supervised reinforcement learning approach is a promising path toward advancing AI's ability to interpret the visual world.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has substantially advanced the video understanding capabilities of Multimodal Large Language Models (MLLMs). However, the rapid progress of MLLMs is outpacing the complexity of existing video datasets, while the manual annotation of new, high-quality data remains prohibitively expensive. This work investigates a pivotal question: Can the rich, intrinsic information within videos be harnessed to self-generate high-quality, verifiable training data? To investigate this, we introduce three self-supervised pretext tasks: Anomaly Grounding, Object Counting, and Temporal Jigsaw. We construct the Video Intrinsic Understanding Benchmark (VIUBench) to validate their difficulty, revealing that current state-of-the-art MLLMs struggle significantly on these tasks. Building upon these pretext tasks, we develop the VideoSSR-30K dataset and propose VideoSSR, a novel video self-supervised reinforcement learning framework for RLVR. Extensive experiments across 17 benchmarks, spanning four major video domains (General Video QA, Long Video QA, Temporal Grounding, and Complex Reasoning), demonstrate that VideoSSR consistently enhances model performance, yielding an average improvement of over 5%. These results establish VideoSSR as a potent foundational framework for developing more advanced video understanding in MLLMs. The code is available at https://github.com/lcqysl/VideoSSR.