Reg-DPO: SFT-Regularized Direct Preference Optimization with GT-Pair for Improving Video Generation

Jie Du, Xinyu Gong, Qingshan Tan, Wen Li, Yangming Cheng, Weitao Wang, Chenlu Zhan, Suhui Wu, Hao Zhang, Jun Zhang

2025-11-05

Summary

This paper focuses on improving how well AI generates videos, building on a newer technique called Direct Preference Optimization (DPO).

What's the problem?

While DPO works well for making images, applying it to videos is much harder. Creating good training data for videos is expensive and time-consuming, training the AI can be unstable, and videos require a lot of computer memory. Existing methods also haven't been able to effectively scale up to larger, more powerful video generation models.

What's the solution?

The researchers contribute two main ideas. First, they created a way to automatically build good training examples without any human labeling: real videos serve as 'good' examples and the model's own generated videos serve as 'bad' examples – they call this GT-Pair. Second, they added a 'safety net' to the DPO training process, mixing in a standard supervised fine-tuning (SFT) loss as a regularizer to make training more stable and the generated videos more faithful. On top of this, they combined the FSDP training framework with several memory-saving techniques, achieving nearly three times the training capacity of FSDP alone and letting them handle much larger video models than previously possible.
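The GT-Pair idea can be sketched in a few lines. This is a minimal illustration, not the paper's code: the function and argument names (`build_gt_pairs`, `generate_fn`) are hypothetical, and the real pipeline operates on actual video tensors rather than placeholder objects.

```python
def build_gt_pairs(real_videos, generate_fn):
    """Construct GT-Pair preference data with no external annotation.

    real_videos : list of (prompt, real_video) tuples; the real video
                  is the 'positive' in each pair.
    generate_fn : the current video model (hypothetical interface);
                  takes a prompt and returns a generated video, which
                  serves as the 'negative'.
    """
    pairs = []
    for prompt, real in real_videos:
        fake = generate_fn(prompt)  # model output becomes the negative
        pairs.append({"prompt": prompt, "chosen": real, "rejected": fake})
    return pairs
```

Because the positives are ground-truth videos and the negatives come straight from the model being trained, no reward model or human preference labeling is needed to produce the pairs.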

Why it matters?

This work is important because it allows for the creation of higher-quality videos using AI, and it makes it possible to train these AI models more efficiently and on a larger scale. This could lead to significant advancements in areas like creating realistic special effects, generating content for entertainment, and even developing new tools for visual communication.

Abstract

Recent studies have identified Direct Preference Optimization (DPO) as an efficient and reward-free approach to improving video generation quality. However, existing methods largely follow image-domain paradigms and are mainly developed on small-scale models (approximately 2B parameters), limiting their ability to address the unique challenges of video tasks, such as costly data construction, unstable training, and heavy memory consumption. To overcome these limitations, we introduce GT-Pair, which automatically builds high-quality preference pairs by using real videos as positives and model-generated videos as negatives, eliminating the need for any external annotation. We further present Reg-DPO, which incorporates the SFT loss as a regularization term into the DPO objective to enhance training stability and generation fidelity. Additionally, by combining the FSDP framework with multiple memory optimization techniques, our approach achieves nearly three times higher training capacity than using FSDP alone. Extensive experiments on both I2V and T2V tasks across multiple datasets demonstrate that our method consistently outperforms existing approaches, delivering superior video generation quality.
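The regularized objective described above (DPO loss plus an SFT term) can be sketched as follows. This is a simplified scalar version under stated assumptions: the function name, the temperature `beta`, and the regularization weight `lam` are illustrative choices, not values from the paper, and a real implementation would compute these log-likelihoods per video with a diffusion model rather than take them as scalars.

```python
import math

def _sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def reg_dpo_loss(logp_pos, logp_neg, ref_logp_pos, ref_logp_neg,
                 beta=0.1, lam=1.0):
    """SFT-regularized DPO loss on one GT-Pair (illustrative sketch).

    logp_pos, logp_neg       : policy log-likelihoods of the real
                               (positive) and generated (negative) videos.
    ref_logp_pos, ref_logp_neg : frozen reference-model log-likelihoods.
    beta                     : DPO temperature (assumed value).
    lam                      : weight on the SFT regularizer (assumed value).
    """
    # Standard DPO term: prefer the positive over the negative,
    # measured relative to the reference model.
    margin = beta * ((logp_pos - ref_logp_pos) - (logp_neg - ref_logp_neg))
    dpo_term = -math.log(_sigmoid(margin))
    # SFT regularizer: additionally maximize the likelihood of the
    # real video, which stabilizes training and anchors fidelity.
    sft_term = -logp_pos
    return dpo_term + lam * sft_term
```

Setting `lam=0` recovers plain DPO; the SFT term keeps the policy from drifting away from the ground-truth data while it optimizes the preference margin.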