Factorized Learning for Temporally Grounded Video-Language Models
Wenzheng Zeng, Difei Gao, Mike Zheng Shou, Hwee Tou Ng
2026-01-01
Summary
This paper focuses on improving how well video-language models connect what happens in a video with textual descriptions of those events, in particular on pinpointing *when* an event occurs within the video.
What's the problem?
Current video understanding systems often try to figure out both *when* something happens in a video and *what* is happening at the same time, treating them as one coupled task. This isn't ideal, because accurately knowing *when* something happens is the foundation for correctly describing *what* is happening. It's like trying to write a good story without knowing the order of events: it gets messy. Existing methods don't explicitly prioritize getting the timing right, which leads to less accurate overall understanding.
What's the solution?
The researchers developed a new system called D^2VLM that breaks the problem into two steps: first, precisely identify the moments in the video that correspond to an event (temporal grounding), and then, conditioned on those moments, generate a description. Special 'evidence tokens' help the system capture the important visual semantics during the grounding step, going beyond the timestamp-only representations used in prior work. They also designed a new training method, factorized preference optimization (FPO), which builds this timing-then-description structure directly into the training objective, so the model learns preferences over both the grounding and the textual response. Because no existing dataset supports this kind of two-step preference learning, they also constructed a synthetic dataset for it.
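The "grounding then answering with evidence referencing" flow described above can be sketched as two chained functions. This is a minimal illustrative sketch, not the paper's implementation: the function names (`ground_events`, `answer_with_evidence`), the keyword-overlap heuristic, and the frame-label inputs are all hypothetical stand-ins for the model's learned components.

```python
def ground_events(video_frames, question, min_overlap=2):
    """Step 1 (temporal grounding): select the frame indices that serve
    as temporal evidence. Stand-in heuristic: keep frames whose labels
    share at least `min_overlap` words with the question."""
    words = set(question.lower().split())
    return [i for i, label in enumerate(video_frames)
            if len(words & set(label.lower().split())) >= min_overlap]

def answer_with_evidence(video_frames, question, evidence_idx):
    """Step 2 (answering): produce a textual response conditioned on the
    grounded evidence rather than on the whole video."""
    evidence = [video_frames[i] for i in evidence_idx]
    return f"Relevant moments: {evidence_idx}; content: {', '.join(evidence)}"

frames = ["person enters room", "person opens fridge", "person eats apple"]
q = "when does the person open the fridge"
idx = ground_events(frames, q)          # grounding happens first
print(answer_with_evidence(frames, q, idx))
# → Relevant moments: [1]; content: person opens fridge
```

The point of the structure is that the answer never sees the full video directly; it only references the evidence selected in step 1, which is the dependency the paper's factorized training is built around.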
Why it matters?
This work is important because it shows that by tackling video understanding in a more structured way – focusing on timing first – we can significantly improve a computer’s ability to truly understand what’s happening in a video. This has implications for many applications, like video search, automated video editing, and robots that need to understand their surroundings.
Abstract
Recent video-language models have shown great potential for video understanding, but still struggle with accurate temporal grounding for event-level perception. We observe that two main factors in video understanding (i.e., temporal grounding and textual response) form a logical hierarchy: accurate temporal evidence grounding lays the foundation for reliable textual response. However, existing works typically handle these two tasks in a coupled manner without a clear logical structure, leading to sub-optimal objectives. We address this from a factorized learning perspective. We first propose D^2VLM, a framework that decouples the learning of these two tasks while also emphasizing their inherent dependency. We adopt a "grounding then answering with evidence referencing" paradigm and introduce evidence tokens for evidence grounding, which emphasize event-level visual semantic capture beyond the focus on timestamp representation in existing works. To further facilitate the learning of these two tasks, we introduce a novel factorized preference optimization (FPO) algorithm. Unlike standard preference optimization, FPO explicitly incorporates probabilistic temporal grounding modeling into the optimization objective, enabling preference learning for both temporal grounding and textual response. We also construct a synthetic dataset to address the lack of suitable datasets for factorized preference learning with explicit temporal grounding. Experiments on various tasks demonstrate the clear advantage of our approach. Our source code is available at https://github.com/nusnlp/d2vlm.
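To make the factorized objective concrete, here is a sketch of what a DPO-style preference loss over a factorized likelihood could look like. Assuming the joint log-likelihood decomposes as log p(response, grounding | video) = log p(grounding | video) + log p(response | grounding, video), a preference pair can be scored on both terms. Everything here is an illustrative assumption: the function name `fpo_loss`, the `alpha` weighting knob, and the exact combination of terms are not taken from the paper.

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fpo_loss(lp_g_w, lp_y_w, lp_g_l, lp_y_l,
             ref_g_w, ref_y_w, ref_g_l, ref_y_l,
             beta=0.1, alpha=1.0):
    """DPO-style loss over a factorized log-likelihood.

    lp_g_* / lp_y_*  : policy log-probs of grounding / response for the
                       chosen (_w) and rejected (_l) sample.
    ref_*            : the same quantities under a frozen reference model.
    alpha            : hypothetical weight on the temporal-grounding term.
    """
    # Log-ratio of policy vs. reference, with grounding and response
    # contributing as separate factorized terms.
    chosen = alpha * (lp_g_w - ref_g_w) + (lp_y_w - ref_y_w)
    rejected = alpha * (lp_g_l - ref_g_l) + (lp_y_l - ref_y_l)
    # Standard Bradley-Terry preference loss on the margin.
    return -math.log(_sigmoid(beta * (chosen - rejected)))
```

Because the grounding log-probability enters the margin explicitly, a pair whose only difference is better temporal grounding still produces a training signal, which is the behavior the abstract attributes to FPO (in contrast to standard preference optimization over the text alone).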