LongVideoAgent: Multi-Agent Reasoning with Long Videos
Runtao Liu, Ziyi Liu, Jiaqi Tang, Yue Ma, Renjie Pi, Jipeng Zhang, Qifeng Chen
2025-12-24
Summary
This paper introduces a new way to help computers answer questions about long videos, like entire TV episodes, by having different 'agent' programs work together.
What's the problem?
Current methods for answering questions about long videos often struggle because they either shorten the video too much, losing important details, or lack the right tools to understand everything that is happening. This makes it hard for the computer to accurately pinpoint *when* and *what* in the video answers the question, and it causes these systems to miss subtle visual clues.
What's the solution?
The researchers created a system with three main parts: a 'master' program that plans the process, a 'grounding' program that finds the specific parts of the video relevant to the question, and a 'vision' program that describes what's happening in those video clips. The master program learns through trial and error (reinforcement learning) to efficiently use the other two agents, focusing on the most important video segments and combining what's said with what's shown. This allows the system to explain *how* it arrived at an answer.
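Concretely, the coordination described above can be pictured as a simple control loop: at each step, the master decides whether to call the grounding agent, call the vision agent, or commit to an answer. The sketch below is only an illustration under assumed interfaces; the names (`plan`, `ground_segments`, `describe_clip`, `answer`) and the step budget are placeholders, not the authors' actual code.

```python
# Illustrative sketch of the master / grounding / vision loop described above.
# All names (plan, ground_segments, describe_clip, answer) and the step budget
# are assumptions for exposition, not the paper's actual API.

def answer_question(question, subtitles, video, master, grounder, vision,
                    max_steps=8):
    observations = []  # textual evidence gathered so far
    for _ in range(max_steps):
        # The master plans the next action from the question, the subtitles,
        # and everything observed so far.
        action = master.plan(question, subtitles, observations)

        if action["kind"] == "ground":
            # Grounding agent: localize question-relevant video segments.
            segments = grounder.ground_segments(question, video)
            observations.append({"type": "segments", "value": segments})
        elif action["kind"] == "look":
            # Vision agent: turn a chosen clip into a textual observation.
            caption = vision.describe_clip(video, action["segment"])
            observations.append({"type": "caption",
                                 "segment": action["segment"],
                                 "value": caption})
        else:  # action["kind"] == "answer"
            return action["answer"], observations

    # Step budget exhausted: answer from whatever evidence was collected.
    return master.answer(question, subtitles, observations), observations
```

The returned `observations` list is what makes the trajectory interpretable: each entry records which agent was consulted and what it reported, so the final answer can be traced back to specific segments and descriptions.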
Why it matters?
This work is important because it shows a promising path towards building AI that can truly understand and reason about long-form video content, like movies and TV shows. It’s a step towards computers being able to answer complex questions about these videos more accurately and to explain their reasoning clearly, rather than simply summarizing the video or relying on limited information.
Abstract
Recent advances in multimodal LLMs and systems that use tools for long-video QA point to the promise of reasoning over hour-long episodes. However, many methods still compress content into lossy summaries or rely on limited toolsets, weakening temporal grounding and missing fine-grained cues. We propose a multi-agent framework in which a master LLM coordinates a grounding agent to localize question-relevant segments and a vision agent to extract targeted textual observations. The master agent plans under a step limit and is trained with reinforcement learning to encourage concise, correct, and efficient multi-agent cooperation. This design helps the master agent focus on relevant clips via grounding, complements subtitles with visual detail, and yields interpretable trajectories. On our proposed LongTVQA and LongTVQA+, episode-level datasets aggregated from TVQA/TVQA+, our multi-agent system significantly outperforms strong non-agent baselines. Experiments also show that reinforcement learning further strengthens reasoning and planning for the trained agent. Code and data will be shared at https://longvideoagent.github.io/.
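The "concise, correct, and efficient" objective in the abstract can, in its simplest form, be thought of as a scalar reward that combines answer correctness with a penalty for each agent call in a trajectory. The toy function below is only a hedged illustration of that idea; the coefficient and the function name are placeholder assumptions, not the paper's actual reward.

```python
# Toy reward in the spirit of "concise, correct, and efficient cooperation":
# reward correctness, lightly penalize long, tool-heavy trajectories.
# The 0.05 step penalty and the function name are placeholder assumptions.

def trajectory_reward(predicted_answer, gold_answer, num_steps,
                      step_penalty=0.05):
    correctness = 1.0 if predicted_answer == gold_answer else 0.0
    efficiency = -step_penalty * num_steps  # fewer grounding/vision calls is better
    return correctness + efficiency
```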