VideoMind: A Chain-of-LoRA Agent for Long Video Reasoning
Ye Liu, Kevin Qinghong Lin, Chang Wen Chen, Mike Zheng Shou
2025-03-18
Summary
This paper introduces VideoMind, a new AI system designed to better understand videos by focusing on how events unfold over time and directly linking answers to specific moments in the video.
What's the problem?
While AI has made progress in understanding language and images, understanding videos, especially long ones, remains a challenge. Current AI models struggle to connect information across time and pinpoint the exact moments in a video that support their conclusions.
What's the solution?
VideoMind combines a role-based agentic workflow with a Chain-of-LoRA strategy. The workflow comprises four roles: a planner that coordinates the other roles, a grounder that localizes the relevant moments in the video, a verifier that checks the accuracy of those temporal intervals, and an answerer that produces the final response. Rather than running a separate model per role, the Chain-of-LoRA strategy switches roles by swapping lightweight LoRA adaptors on a single shared base model, balancing efficiency and flexibility.
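To make the role-switching idea concrete, here is a minimal, purely illustrative sketch of a Chain-of-LoRA-style agent: one frozen base model, a small dictionary of per-role adaptors, and a planner-driven chain of grounding, verification, and answering. All class names, method names, and values below are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of the Chain-of-LoRA idea: a single frozen backbone
# with lightweight per-role LoRA adaptors swapped in sequence.
# Names and values are illustrative, not from the VideoMind codebase.

class ChainOfLoRAAgent:
    ROLES = ("planner", "grounder", "verifier", "answerer")

    def __init__(self, base_model):
        self.base_model = base_model   # shared frozen backbone (loaded once)
        self.adapters = {}             # role name -> LoRA adaptor weights
        self.active_role = None

    def load_adapter(self, role, adapter_weights):
        # Each role only adds a small set of LoRA weights on top of the base.
        assert role in self.ROLES, f"unknown role: {role}"
        self.adapters[role] = adapter_weights

    def switch_role(self, role):
        # Swapping adaptors is cheap compared with loading a whole new model,
        # which is the point of the Chain-of-LoRA design.
        self.active_role = role
        return self.adapters[role]

    def answer(self, video, question):
        # The planner decides which roles to invoke and in what order.
        self.switch_role("planner")
        plan = ["grounder", "verifier", "answerer"]

        moment = None
        for role in plan:
            self.switch_role(role)
            if role == "grounder":
                # Localize a supporting interval (start, end) in seconds.
                moment = (12.0, 34.5)  # placeholder localization
            elif role == "verifier":
                # Reject the chain if no interval was produced.
                assert moment is not None, "no interval to verify"
            elif role == "answerer":
                # Answer grounded in the verified interval.
                return {"answer": "placeholder answer", "evidence": moment}
```

The design choice the sketch highlights: because every role shares the same backbone, only the adaptor weights change between steps, so memory stays close to that of a single model even though four specialists run in sequence.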
Why it matters?
This work matters because it advances AI's ability to understand videos in a more detailed and context-aware way, which can be useful in applications like video search, video analysis, and creating more intelligent video-based assistants.
Abstract
Videos, with their unique temporal dimension, demand precise grounded understanding, where answers are directly linked to visual, interpretable evidence. Despite significant breakthroughs in the reasoning capabilities of Large Language Models, multi-modal reasoning, especially for videos, remains underexplored. In this work, we introduce VideoMind, a novel video-language agent designed for temporally grounded video understanding. VideoMind incorporates two key innovations: (i) we identify the essential capabilities for video temporal reasoning and develop a role-based agentic workflow, including a planner for coordinating the other roles, a grounder for temporal localization, a verifier for assessing the accuracy of temporal intervals, and an answerer for question-answering; (ii) to integrate these diverse roles efficiently, we propose a novel Chain-of-LoRA strategy that enables seamless role-switching via lightweight LoRA adaptors while avoiding the overhead of multiple models, thus balancing efficiency and flexibility. Extensive experiments on 14 public benchmarks, comprising 3 on grounded video question-answering, 6 on video temporal grounding, and 5 on general video question-answering, demonstrate that our agent achieves state-of-the-art performance on diverse video understanding tasks, underscoring its effectiveness in advancing video agents and long-form temporal reasoning.