
Temporal Preference Optimization for Long-Form Video Understanding

Rui Li, Xiaohan Wang, Yuhui Zhang, Zeyu Wang, Serena Yeung-Levy

2025-01-24


Summary

This paper introduces Temporal Preference Optimization (TPO), a new way to make AI better at understanding long videos. It's like teaching a computer to pay attention to the right parts of a movie and understand how different scenes are connected over time.

What's the problem?

Current AI models are good at understanding short videos, but they struggle with longer ones, like full-length movies or TV shows. It's hard for them to keep track of what's happening over a long time and understand how earlier parts of the video relate to later parts. This is like trying to understand a whole book by only reading a few pages at a time.

What's the solution?

The researchers created TPO, which is like a special training program for AI. It teaches the AI to tell the difference between good and bad ways of understanding a video's timeline. TPO does this by showing the AI lots of examples of correct and incorrect interpretations of video segments and entire videos. This helps the AI learn to focus on the important parts and understand how different scenes are connected, even in really long videos.
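Training on pairs of "good" and "bad" interpretations is typically done with a DPO-style preference loss, which rewards the model for preferring the well-grounded response more strongly than a frozen reference model does. Below is a minimal sketch of that standard objective; the function name, `beta` value, and log-probabilities are illustrative, and the paper's exact formulation may differ.

```python
import math

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss on one preference pair.

    Each argument is the (summed) log-probability a model assigns to the
    well-grounded ("chosen") or less accurate ("rejected") response.
    The margin compares how much MORE the trained model prefers the
    chosen response than the reference model does.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)); small when the margin is large and positive
    return math.log1p(math.exp(-margin))

# With identical model and reference log-probs the margin is 0,
# giving the neutral loss log(2) ~= 0.693.
neutral = preference_loss(-1.5, -1.5, -1.5, -1.5)

# When the model favors the well-grounded response more than the
# reference does, the loss drops below log(2).
improved = preference_loss(-1.0, -2.0, -1.5, -1.5)
```

Minimizing this loss over many such pairs pushes the model toward temporally well-grounded answers without needing manually annotated timestamps for every example.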

Why it matters?

This matters because it could make AI much better at understanding long videos, which is important for things like automatically summarizing movies, creating better video recommendations, or even helping robots understand long sequences of events. It could lead to smarter AI assistants that can discuss entire movies or TV series, or help in fields like education where understanding long video content is crucial. By making AI better at grasping the big picture in videos, TPO could open up new possibilities for how we interact with and learn from video content.

Abstract

Despite significant advancements in video large multimodal models (video-LMMs), achieving effective temporal grounding in long-form videos remains a challenge for existing models. To address this limitation, we propose Temporal Preference Optimization (TPO), a novel post-training framework designed to enhance the temporal grounding capabilities of video-LMMs through preference learning. TPO adopts a self-training approach that enables models to differentiate between well-grounded and less accurate temporal responses by leveraging curated preference datasets at two granularities: localized temporal grounding, which focuses on specific video segments, and comprehensive temporal grounding, which captures extended temporal dependencies across entire video sequences. By optimizing on these preference datasets, TPO significantly enhances temporal understanding while reducing reliance on manually annotated data. Extensive experiments on three long-form video understanding benchmarks--LongVideoBench, MLVU, and Video-MME--demonstrate the effectiveness of TPO across two state-of-the-art video-LMMs. Notably, LLaVA-Video-TPO establishes itself as the leading 7B model on the Video-MME benchmark, underscoring the potential of TPO as a scalable and efficient solution for advancing temporal reasoning in long-form video understanding. Project page: https://ruili33.github.io/tpo_website.
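The abstract describes preference data curated at two granularities: localized (specific video segments) and comprehensive (entire videos). A record in such a dataset might look like the following sketch; all field names, IDs, and example strings are illustrative assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class PreferencePair:
    """One curated preference example for temporal grounding training."""
    video_id: str
    query: str
    chosen: str                 # well-grounded temporal response
    rejected: str               # less accurate temporal response
    granularity: str            # "localized" or "comprehensive"
    segment: Optional[Tuple[float, float]] = None  # (start_s, end_s) if localized

pairs = [
    # Localized: grounded in a specific segment of the video.
    PreferencePair(
        video_id="vid_001",
        query="When does the chase scene begin?",
        chosen="Around 12:30, right after the rooftop conversation.",
        rejected="At the very start of the video.",
        granularity="localized",
        segment=(720.0, 780.0),
    ),
    # Comprehensive: depends on dependencies across the whole video.
    PreferencePair(
        video_id="vid_001",
        query="How does the opening scene foreshadow the ending?",
        chosen="The broken watch shown in the first minute reappears at the finale.",
        rejected="The opening is unrelated to the ending.",
        granularity="comprehensive",
    ),
]
```

Mixing both granularities in one preference dataset is what lets the self-training loop teach segment-level precision and long-range temporal reasoning at the same time.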