Grounded-VideoLLM: Sharpening Fine-grained Temporal Grounding in Video Large Language Models

Haibo Wang, Zhiyang Xu, Yu Cheng, Shizhe Diao, Yufan Zhou, Yixin Cao, Qifan Wang, Weifeng Ge, Lifu Huang

2024-10-09

Summary

This paper introduces Grounded-VideoLLM, a new type of video large language model (Video-LLM) that improves how well AI can understand and reason about specific moments in videos.

What's the problem?

While existing Video-LLMs are good at understanding videos at a coarse level, they struggle with fine-grained tasks that require pinpointing exactly when specific moments or actions occur. This is mainly because they lack effective temporal modeling and an accurate way to represent timestamps, which makes it hard for them to perform well on tasks like temporal grounding.

What's the solution?

To solve this problem, the authors developed Grounded-VideoLLM, which includes two key innovations: an additional temporal stream that encodes the relationships between frames in a video, and discrete temporal tokens that give the model an explicit way to represent timestamps. They also used a multi-stage training approach, starting with simple video-captioning tasks and gradually introducing more complex temporal grounding tasks, which helps the model learn to ground its understanding of videos in time. In addition, they curated a grounded VideoQA dataset with an automatic annotation pipeline to further strengthen the model's temporal reasoning.
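
To make the idea of discrete temporal tokens more concrete, here is a minimal Python sketch of how timestamps can be quantized into special tokens that a language model can read and generate. The bin count, the <T_i> token naming, and both helper functions are illustrative assumptions for this example, not the paper's actual implementation.

```python
# Illustrative sketch of "discrete temporal tokens": timestamps are quantized
# into a fixed number of bins, each mapped to a special token that the LLM can
# handle alongside ordinary text tokens.
# NOTE: the bin count (100) and the <T_i> naming are assumptions for this
# example, not the paper's exact design.

NUM_TEMPORAL_BINS = 100  # assumed granularity of the temporal vocabulary


def timestamp_to_token(t_seconds: float, video_duration: float) -> str:
    """Map an absolute timestamp to a discrete temporal token."""
    # Normalize to [0, 1], then quantize into one of NUM_TEMPORAL_BINS bins.
    rel = min(max(t_seconds / video_duration, 0.0), 1.0)
    bin_idx = min(int(rel * NUM_TEMPORAL_BINS), NUM_TEMPORAL_BINS - 1)
    return f"<T_{bin_idx}>"


def token_to_timestamp(token: str, video_duration: float) -> float:
    """Recover an approximate timestamp (the bin center) from a temporal token."""
    bin_idx = int(token.strip("<>").split("_")[1])
    return (bin_idx + 0.5) / NUM_TEMPORAL_BINS * video_duration


if __name__ == "__main__":
    # In a 90-second clip, the moment at 27.3s falls into bin 30.
    tok = timestamp_to_token(27.3, 90.0)    # -> "<T_30>"
    approx = token_to_timestamp(tok, 90.0)  # -> 27.45 (seconds)
    print(tok, round(approx, 2))
```

In a design along these lines, the temporal tokens would be added to the model's vocabulary, so a grounding answer such as "the action happens between <T_30> and <T_45>" can be generated directly as text and then converted back into timestamps.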

Why it matters?

This research is important because it enhances the ability of AI to understand videos at a much finer level. By improving how models can identify and reason about specific moments in videos, Grounded-VideoLLM could be used in various applications like video summarization, content creation, and even interactive video assistants, making AI more useful in everyday scenarios.

Abstract

Video Large Language Models (Video-LLMs) have demonstrated remarkable capabilities in coarse-grained video understanding; however, they struggle with fine-grained temporal grounding. In this paper, we introduce Grounded-VideoLLM, a novel Video-LLM adept at perceiving and reasoning over specific video moments in a fine-grained manner. We identify that current Video-LLMs have limitations for fine-grained video understanding since they lack effective temporal modeling and timestamp representation. In light of this, we sharpen our model by incorporating (1) an additional temporal stream to encode the relationships between frames and (2) discrete temporal tokens enriched with specific time knowledge to represent timestamps. To optimize the training of Grounded-VideoLLM, we employ a multi-stage training scheme, beginning with simple video-captioning tasks and progressively introducing video temporal grounding tasks of increasing complexity. To further enhance Grounded-VideoLLM's temporal reasoning capability, we also curate a grounded VideoQA dataset by an automatic annotation pipeline. Extensive experiments demonstrate that Grounded-VideoLLM not only excels in fine-grained grounding tasks such as temporal sentence grounding, dense video captioning, and grounded VideoQA, but also shows great potential as a versatile video assistant for general video understanding.