
SAGE: Training Smart Any-Horizon Agents for Long Video Reasoning with Reinforcement Learning

Jitesh Jain, Jialuo Li, Zixian Ma, Jieyu Zhang, Chris Dongjoo Kim, Sangho Lee, Rohun Tripathi, Tanmay Gupta, Christopher Clark, Humphrey Shi

2025-12-18


Summary

This paper explores how to make AI better at understanding and reasoning about long videos, mimicking how humans naturally watch videos – sometimes skimming, sometimes focusing intently.

What's the problem?

Current AI models designed to understand videos struggle as videos get longer. They typically try to process the entire video at once, which is computationally expensive and inefficient. It's like forcing a student to memorize an entire textbook before answering a question, instead of letting them refer back to specific sections as needed. These models also aren't flexible: they handle a short clip and an hour-long recording the same way, regardless of how simple or complex the question is.

What's the solution?

The researchers developed a system called SAGE, which acts like an agent that decides how to watch a video: it can answer simpler questions in a single pass, or break the task into multiple steps and revisit parts of the video as needed for more complex ones. They also built a pipeline that automatically generates training data using Gemini-2.5-Flash, plus a reinforcement learning recipe that teaches SAGE's core model, SAGE-MM, to reason effectively across different video lengths. Finally, they curated SAGE-Bench, a new benchmark of long videos (over 700 seconds on average), specifically to test these kinds of reasoning abilities.
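
To make the idea concrete, here is a minimal, hypothetical sketch of what such an any-horizon loop could look like. Every name in it (the Orchestrator class, the Action and State types, sample_frames) is an illustrative assumption for exposition, not the paper's actual interface: the orchestrator either commits to an answer immediately or keeps requesting and summarizing video segments until it has seen enough.

```python
# A minimal, hypothetical sketch of an "any-horizon" agent loop in the spirit of SAGE.
# Every name here (Orchestrator, Action, State, sample_frames) is an illustrative
# assumption for exposition, not the paper's actual interface.

from dataclasses import dataclass, field
from typing import List


@dataclass
class Action:
    kind: str              # "answer" or "inspect"
    text: str = ""         # final answer when kind == "answer"
    start_s: float = 0.0   # video segment to inspect when kind == "inspect"
    end_s: float = 0.0


@dataclass
class State:
    question: str
    video_path: str
    notes: List[str] = field(default_factory=list)  # observations gathered so far


class Orchestrator:
    """Placeholder for the multimodal model at the core of the agent
    (SAGE-MM in the paper); the real orchestrator is a trained model, not a stub."""

    def decide(self, state: State, force_answer: bool = False) -> Action:
        raise NotImplementedError

    def describe(self, frames, question: str) -> str:
        raise NotImplementedError


def sample_frames(video_path: str, start_s: float, end_s: float):
    """Placeholder frame sampler, e.g. decode a handful of frames in [start_s, end_s]."""
    raise NotImplementedError


def any_horizon_answer(orch: Orchestrator, state: State, max_turns: int = 5) -> str:
    """Answer easy questions in a single turn; skim iteratively for long or hard ones."""
    for _ in range(max_turns):
        action = orch.decide(state)
        if action.kind == "answer":
            return action.text  # simple case: one turn was enough
        # Otherwise inspect the requested segment and record what was seen,
        # so the next turn can reason over the accumulated notes.
        frames = sample_frames(state.video_path, action.start_s, action.end_s)
        state.notes.append(orch.describe(frames, state.question))
    # Turn budget exhausted: force the orchestrator to commit to an answer.
    return orch.decide(state, force_answer=True).text
```

A real usage example would wire decide and describe to an actual multimodal model; the loop itself is what lets simple questions finish in one turn while harder ones trigger iterative skimming.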

Why it matters?

This work is important because it moves AI closer to understanding videos the way humans do. This could lead to significant improvements in areas like video search, automated video summarization, and AI assistants that can truly understand and interact with video content. The ability to efficiently process long videos opens up possibilities for analyzing things like lectures, movies, and surveillance footage more effectively.

Abstract

As humans, we are natural any-horizon reasoners, i.e., we can decide whether to iteratively skim long videos or watch short ones in full when necessary for a given task. With this in mind, one would expect video reasoning models to reason flexibly across different durations. However, SOTA models are still trained to predict answers in a single turn while processing a large number of frames, akin to watching an entire long video, requiring significant resources. This raises the question: Is it possible to develop performant any-horizon video reasoning systems? Inspired by human behavior, we first propose SAGE, an agent system that performs multi-turn reasoning on long videos while handling simpler problems in a single turn. Secondly, we introduce an easy synthetic data generation pipeline using Gemini-2.5-Flash to train the orchestrator, SAGE-MM, which lies at the core of SAGE. We further propose an effective RL post-training recipe essential for instilling any-horizon reasoning ability in SAGE-MM. Thirdly, we curate SAGE-Bench with an average duration of greater than 700 seconds for evaluating video reasoning ability in real-world entertainment use cases. Lastly, we empirically validate the effectiveness of our system, data, and RL recipe, observing notable improvements of up to 6.1% on open-ended video reasoning tasks, as well as an impressive 8.2% improvement on videos longer than 10 minutes.
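
The abstract does not spell out the reward used in the RL post-training recipe, but the any-horizon goal implies balancing answer correctness against unnecessary turns. Purely as an assumption for illustration, not the paper's recipe, a reward of that general shape might look like the following:

```python
# Illustration only: the paper's RL reward is not specified here, so the terms
# below (correctness bonus, small per-turn cost) are assumptions, not SAGE's recipe.

def any_horizon_reward(is_correct: bool, turns_used: int,
                       turn_cost: float = 0.05, max_turns: int = 5) -> float:
    """Reward correct answers, with a small penalty for extra turns so the
    agent learns to answer simple questions in a single pass."""
    correctness = 1.0 if is_correct else 0.0
    extra_turns = max(0, min(turns_used, max_turns) - 1)
    return correctness - turn_cost * extra_turns


# A correct answer in one turn scores 1.0; the same answer after four turns scores 0.85.
assert any_horizon_reward(True, 1) == 1.0
assert abs(any_horizon_reward(True, 4) - 0.85) < 1e-9
```

Under a scheme like this, reaching the right answer in a single turn is worth more than reaching it after several rounds of skimming, which matches the behavior the abstract describes: handling simple problems in one turn and reserving multi-turn reasoning for long videos.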