VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference Optimization for Large Video Models

Haojian Huang, Haodong Chen, Shengqiong Wu, Meng Luo, Jinlan Fu, Xinya Du, Hanwang Zhang, Hao Fei

2025-04-18

VistaDPO: Video Hierarchical Spatial-Temporal Direct Preference
Optimization for Large Video Models

Summary

This paper talks about VistaDPO, a new system that helps large video AI models better understand and match what people want when they describe or ask about videos, by improving how the models handle both space and time in videos.

What's the problem?

The problem is that big video AI models often get confused or make mistakes when trying to connect text (like questions or descriptions) with the right parts of a video. They might misunderstand what’s happening in the video or even make up details that aren’t really there, which is called hallucination. This happens because the models struggle to keep track of both where things are in the video and how they change over time.

What's the solution?

The researchers created VistaDPO, a framework that organizes and optimizes how the model looks at both the locations and timing of things in videos, using a layered approach. This helps the model line up what people are asking or describing with the correct parts of the video, reducing mistakes and hallucinations.

Why it matters?

This matters because it makes video AI models more reliable and accurate, which is important for things like video search, content moderation, and creating helpful tools for people who need to understand or work with video information.

Abstract

VistaDPO, a new framework, enhances text-video preference alignment through hierarchical spatial-temporal optimization, addressing misalignment and hallucination in large video models.

View Paper