When and What: Diffusion-Grounded VideoLLM with Entity Aware Segmentation for Long Video Understanding

Pengcheng Fang, Yuxia Chen, Rui Guo

2025-08-22

Summary

This paper introduces a new video understanding model called Grounded VideoDiT, which aims to improve how well computers can understand what's happening in videos, specifically focusing on *when* things happen and *how* objects interact over time.

What's the problem?

Current video understanding models, even the advanced ones using large language models, struggle with precise timing and keeping track of objects throughout a video. They often treat videos as a series of disconnected images, making it hard to understand events accurately or connect actions to specific objects. Essentially, they're good at generally understanding a video, but bad at pinpointing details about timing and object relationships.

What's the solution?

The researchers tackled this problem with three main ideas. First, they improved how the model processes time by using a diffusion-based encoder that is more sensitive to event boundaries between frames and keeps the video representation temporally consistent. Second, they made the model explicitly link the entities mentioned in a question (like a specific object) to the places where those objects actually appear in the video. Finally, they added a set of discrete 'time tokens' that let the model read and reason about exact timestamps within the video.
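As a loose illustration of the time-token idea: timestamps can be quantized into a small discrete vocabulary that the language model treats like ordinary words. The bin count, token format, and function names below are assumptions for the sketch, not details taken from the paper.

```python
# Hypothetical sketch of discrete time tokens in a mixed token scheme.
# NUM_TIME_BINS and the "<time_N>" format are illustrative assumptions.

NUM_TIME_BINS = 100  # assumed granularity: the video is split into 100 bins


def timestamp_to_token(t_sec: float, duration_sec: float) -> str:
    """Quantize an absolute timestamp into a discrete time token."""
    frac = min(max(t_sec / duration_sec, 0.0), 1.0)
    bin_idx = min(int(frac * NUM_TIME_BINS), NUM_TIME_BINS - 1)
    return f"<time_{bin_idx}>"


def token_to_timestamp(token: str, duration_sec: float) -> float:
    """Map a time token back to the center of its bin, in seconds."""
    bin_idx = int(token.strip("<>").split("_")[1])
    return (bin_idx + 0.5) / NUM_TIME_BINS * duration_sec


# Example: a grounded answer mixes text tokens with explicit time tokens,
# so timing is stated directly rather than encoded implicitly.
duration = 60.0
start = timestamp_to_token(12.3, duration)  # -> "<time_20>"
end = timestamp_to_token(18.9, duration)    # -> "<time_31>"
answer = f"The person opens the door between {start} and {end}."
```

Because the tokens are discrete, the model can emit them like any other vocabulary item, and a simple inverse mapping recovers approximate timestamps from its output.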

Why it matters?

This work is important because it pushes video understanding closer to human-level comprehension. By improving the model's ability to understand *when* and *how* things happen, it opens the door to more sophisticated applications like detailed video analysis, better video search, and more accurate answers to questions about video content.

Abstract

Understanding videos requires more than answering open-ended questions; it demands the ability to pinpoint when events occur and how entities interact across time. While recent Video LLMs have achieved remarkable progress in holistic reasoning, they remain coarse in temporal perception: timestamps are encoded only implicitly, frame-level features are weak at capturing continuity, and language-vision alignment often drifts from the entities of interest. In this paper, we present Grounded VideoDiT, a Video LLM designed to overcome these limitations by introducing three key innovations. First, a Diffusion Temporal Latent (DTL) encoder enhances boundary sensitivity and maintains temporal consistency. Second, object-grounded representations explicitly bind query entities to localized visual evidence, strengthening alignment. Third, a mixed token scheme with discrete temporal tokens provides explicit timestamp modeling, enabling fine-grained temporal reasoning. Together, these designs equip Grounded VideoDiT with robust grounding capabilities, as validated by state-of-the-art results on Charades-STA, NExT-GQA, and multiple VideoQA benchmarks.