RealDPO: Real or Not Real, that is the Preference
Guo Cheng, Danni Yang, Ziqi Huang, Jianlou Si, Chenyang Si, Ziwei Liu
2025-10-17
Summary
This paper focuses on improving how realistically AI models can generate video, especially for complex movements like the actions humans perform every day.
What's the problem?
Current AI video generators are getting better at overall visual quality, but they still struggle to produce natural, smooth, and believable motion. Generated movements often look unnatural or don't fit the context, which limits how useful these videos are in real-world applications.
What's the solution?
The researchers introduce a new method called RealDPO. Instead of only telling the AI what *to* do, as standard fine-tuning does, they use real-world videos as 'good' examples and let the model learn by comparing its own flawed attempts to those real movements. This relies on a technique called Direct Preference Optimization (DPO), which in effect gives the model feedback on what it did wrong and lets it correct itself. They also curated a new dataset of high-quality videos, called RealAction-5K, to train and evaluate the method.
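To make this concrete, here is a minimal, hypothetical sketch of a DPO-style preference loss for a diffusion video model, in the spirit of what the summary describes; the paper's actual tailored loss may differ. The names `preference_loss` and the `*_err` arguments are illustrative: each argument is assumed to be a per-sample denoising error (e.g., noise-prediction MSE at a sampled timestep), with a real video as the 'winner' and an erroneous model output as the 'loser'.

```python
import torch
import torch.nn.functional as F

def preference_loss(model_err_win, model_err_lose,
                    ref_err_win, ref_err_lose, beta=5000.0):
    """Hypothetical Diffusion-DPO-style loss sketch (not the paper's exact loss).

    Winners are real videos (positive samples); losers are erroneous model
    outputs. Each argument is a per-sample denoising error, e.g. the MSE
    between predicted and true noise at a randomly sampled timestep.
    """
    # How much better (lower error) the trained model is than the frozen
    # reference model on each sample; lower error ~ higher implicit likelihood.
    win_diff = model_err_win - ref_err_win
    lose_diff = model_err_lose - ref_err_lose
    # Preference logit: push the model to fit the real video (winner)
    # better than its own flawed output (loser), relative to the reference.
    logits = -beta * (win_diff - lose_diff)
    return -F.logsigmoid(logits).mean()

# Hypothetical usage with placeholder per-sample errors for a batch of 4:
errs = [torch.rand(4) for _ in range(4)]
loss = preference_loss(*errs)
```

The key design point is that, unlike supervised fine-tuning, the loss is contrastive: the model is not just pulled toward real videos but explicitly pushed away from its own erroneous outputs, which is what enables the iterative self-correction the paper describes.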
Why it matters?
This work is important because more realistic video generation has many potential uses, from better special effects and training simulations to more convincing virtual reality experiences. By improving motion quality, this research brings AI-generated videos closer to being indistinguishable from real footage.
Abstract
Video generative models have recently achieved notable advancements in synthesis quality. However, generating complex motions remains a critical challenge, as existing models often struggle to produce natural, smooth, and contextually consistent movements. This gap between generated and real-world motions limits their practical applicability. To address this issue, we introduce RealDPO, a novel alignment paradigm that leverages real-world data as positive samples for preference learning, enabling more accurate motion synthesis. Unlike traditional supervised fine-tuning (SFT), which offers limited corrective feedback, RealDPO employs Direct Preference Optimization (DPO) with a tailored loss function to enhance motion realism. By contrasting real-world videos with erroneous model outputs, RealDPO enables iterative self-correction, progressively refining motion quality. To support post-training in complex motion synthesis, we propose RealAction-5K, a curated dataset of high-quality videos capturing human daily activities with rich and precise motion details. Extensive experiments demonstrate that RealDPO significantly improves video quality, text alignment, and motion realism compared to state-of-the-art models and existing preference optimization techniques.
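For reference, the standard DPO objective that RealDPO builds on (the paper's tailored loss may differ in its exact form) can be written with the real video $x^w$ as the preferred sample and the model's erroneous output $x^l$ as the dispreferred one:

$$
\mathcal{L}_{\mathrm{DPO}} = -\,\mathbb{E}_{(c,\,x^w,\,x^l)}\!\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(x^w \mid c)}{\pi_{\mathrm{ref}}(x^w \mid c)} - \beta \log \frac{\pi_\theta(x^l \mid c)}{\pi_{\mathrm{ref}}(x^l \mid c)}\right)\right]
$$

where $c$ is the text prompt, $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference model, $\beta$ is a temperature, and $\sigma$ is the sigmoid function.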