TimeSearch-R: Adaptive Temporal Search for Long-Form Video Understanding via Self-Verification Reinforcement Learning
Junwen Pan, Qizhe Zhang, Rui Zhang, Ming Lu, Xin Wan, Yuan Zhang, Chang Liu, Qi She
2025-11-12
Summary
This paper introduces TimeSearch-R, a method for quickly finding the most relevant parts of long videos so that computers can understand what's happening. It's like giving a computer the ability to skim a video efficiently instead of watching the whole thing.
What's the problem?
Current methods for finding relevant video sections rely on search procedures hand-crafted by humans. These rules aren't necessarily optimal, and the system never gets to learn the *best* way to search on its own. Moreover, when reinforcement learning is used to guide the search, the intermediate search decisions go unsupervised: the system can commit to an answer too early without fully exploring the video, leading to incomplete understanding and inconsistent reasoning.
What's the solution?
TimeSearch-R uses reinforcement learning to teach the model how to search for important video sections, but it adds a clever check: after the model selects some clips, the *same* model double-checks whether those clips are actually enough to answer the question. This 'completeness self-verification' keeps the model from stopping its search too soon. The authors also built dedicated datasets for training, filtering out examples with weak temporal dependencies so that answering genuinely requires attending to events unfolding over time.
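One way to picture the completeness self-verification step is as reward shaping: the policy searches clips and answers, then the same policy is asked whether the frames it gathered actually suffice. Below is a minimal, self-contained Python sketch under that reading; `ToyPolicy`, `csv_reward`, and the 0.5 verification weight are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Trace:
    """Result of one interleaved search-and-reason rollout."""
    answer: str
    searched_frames: List[int]  # indices of frames the policy retrieved

class ToyPolicy:
    """Stand-in for the policy model; the real system uses a video LLM."""
    def search_and_answer(self, question: str, video) -> Trace:
        # Pretend the model searched three frames and produced an answer.
        return Trace(answer="red car", searched_frames=[12, 450, 980])

    def verify(self, question: str, frames: List[int]) -> bool:
        # Self-verification: the SAME policy judges whether the gathered
        # frames are sufficient to answer the question.
        return len(frames) >= 3

def csv_reward(policy, question, video, answer_key, w_verify=0.5):
    """Combine answer correctness with a completeness-verification bonus."""
    trace = policy.search_and_answer(question, video)
    correct = 1.0 if trace.answer == answer_key else 0.0
    complete = 1.0 if policy.verify(question, trace.searched_frames) else 0.0
    return correct + w_verify * complete
```

The intuition: a rollout that answers correctly but with an insufficient search earns less than one whose searched frames also pass the self-verification check, discouraging premature stopping.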
Why it matters?
This research matters because it significantly improves how well computers understand long videos. TimeSearch-R outperforms previous methods on several challenging video understanding benchmarks, meaning computers can more accurately answer questions about videos, summarize them, and generally 'understand' what's going on. This has implications for video editing, content analysis, and even robotics.
Abstract
Temporal search aims to identify a minimal set of relevant frames from tens of thousands based on a given query, serving as a foundation for accurate long-form video understanding. Existing works attempt to progressively narrow the search space. However, these approaches typically rely on a hand-crafted search process, lacking end-to-end optimization for learning optimal search strategies. In this paper, we propose TimeSearch-R, which reformulates temporal search as interleaved text-video thinking, seamlessly integrating searching video clips into the reasoning process through reinforcement learning (RL). However, applying RL training methods, such as Group Relative Policy Optimization (GRPO), to video reasoning can result in unsupervised intermediate search decisions. This leads to insufficient exploration of the video content and inconsistent logical reasoning. To address these issues, we introduce GRPO with Completeness Self-Verification (GRPO-CSV), which gathers searched video frames from the interleaved reasoning process and utilizes the same policy model to verify the adequacy of searched frames, thereby improving the completeness of video reasoning. Additionally, we construct datasets specifically designed for the SFT cold-start and RL training of GRPO-CSV, filtering out samples with weak temporal dependencies to enhance task difficulty and improve temporal search capabilities. Extensive experiments demonstrate that TimeSearch-R achieves significant improvements on temporal search benchmarks such as Haystack-LVBench and Haystack-Ego4D, as well as long-form video understanding benchmarks like VideoMME and MLVU. Notably, TimeSearch-R establishes a new state-of-the-art on LongVideoBench with 4.1% improvement over the base model Qwen2.5-VL and 2.0% over the advanced video reasoning model Video-R1. Our code is available at https://github.com/Time-Search/TimeSearch-R.
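For readers unfamiliar with GRPO, the "group relative" part means each rollout's advantage is computed by normalizing its reward against the other rollouts sampled for the same query, instead of against a learned value baseline. A minimal sketch of that normalization (the `eps` guard is an added assumption for numerical safety, not from the paper):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: standardize rewards within a sampled group."""
    mu = statistics.fmean(rewards)       # group mean reward
    sigma = statistics.pstdev(rewards)   # group (population) std deviation
    return [(r - mu) / (sigma + eps) for r in rewards]
```

GRPO-CSV plugs the completeness-verified reward into this same normalization, so rollouts whose searches pass self-verification are pushed up relative to their group.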