
PhyGDPO: Physics-Aware Groupwise Direct Preference Optimization for Physically Consistent Text-to-Video Generation

Yuanhao Cai, Kunpeng Li, Menglin Jia, Jialiang Wang, Junzhe Sun, Feng Liang, Weifeng Chen, Felix Juefei-Xu, Chu Wang, Ali Thabet, Xiaoliang Dai, Xuan Ju, Alan Yuille, Ji Hou

2026-01-01


Summary

This paper focuses on making AI-generated videos more realistic, specifically by ensuring they follow the rules of physics. Current text-to-video models can produce visually appealing clips from written descriptions, but these videos often show things happening in ways that don't make sense in the real world, like objects floating or moving unnaturally.

What's the problem?

The main issue is that it's hard to teach these models to understand and reproduce physics. Existing approaches either rely on simplified, graphics-based simulated environments or on prompt rewriting, and both struggle to instill implicit physical reasoning about how things *should* behave. A big part of the problem is a lack of training data: there are few videos that clearly demonstrate rich physical interactions and phenomena, making it hard for a model to learn correct behavior from examples.

What's the solution?

The researchers developed a two-part solution. First, they built a large training dataset, PhyVidGen-135K, using a data-construction pipeline called PhyAugPipe, in which a vision-language model 'thinks through' physical scenarios step by step (chain-of-thought reasoning) and produces matching video descriptions. Second, they designed a new training framework, PhyGDPO, which compares whole groups of generated videos at once rather than just pairs, and uses physics-based rewards from a vision-language model (the Physics-Guided Rewarding scheme) to steer the generator toward physically consistent outputs. They also introduced a LoRA-Switch Reference scheme that avoids keeping a duplicate reference model in memory, making training more efficient.
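To make the 'groupwise' idea concrete, the sketch below shows how a Plackett-Luce style preference loss over a group of candidate videos could look, in the spirit of what the paper describes. It is a minimal illustration under stated assumptions: the function name, tensor shapes, the beta temperature, and the source of the physics rewards are all hypothetical, not taken from the paper's code.

```python
# Minimal sketch of a groupwise (Plackett-Luce) preference loss with physics rewards.
# Names, shapes, and the reward source are illustrative assumptions, not the authors' code.
import torch

def groupwise_pl_loss(logp_policy, logp_ref, physics_rewards, beta=0.1):
    """
    logp_policy:     (K,) log-likelihoods of K candidate videos under the model being trained
    logp_ref:        (K,) log-likelihoods of the same videos under a frozen reference model
    physics_rewards: (K,) physics-consistency scores, e.g. assigned by a VLM judge
    """
    # DPO-style implicit reward: how much more the policy prefers each candidate than the reference does.
    scores = beta * (logp_policy - logp_ref)

    # Order candidates from most to least physically consistent.
    ranked = scores[torch.argsort(physics_rewards, descending=True)]

    # Negative Plackett-Luce log-likelihood of that ranking: at each position k,
    # the k-th ranked candidate must "win" a softmax over all remaining candidates.
    loss = 0.0
    for k in range(ranked.numel() - 1):
        loss = loss - (ranked[k] - torch.logsumexp(ranked[k:], dim=0))
    return loss
```

With a group size of two this reduces to the standard pairwise DPO logistic loss, which is why the groupwise form is described as capturing holistic preferences beyond pairwise comparisons.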

Why it matters?

This work is important because it brings us closer to creating AI that can generate truly realistic videos. This has implications for many fields, including entertainment, education, and robotics. If we can create videos that accurately simulate the physical world, it opens up possibilities for creating more immersive and believable virtual experiences, and for training robots in simulated environments before deploying them in the real world.

Abstract

Recent advances in text-to-video (T2V) generation have achieved good visual quality, yet synthesizing videos that faithfully follow physical laws remains an open challenge. Existing methods, mainly based on graphics or prompt extension, struggle to generalize beyond simple simulated environments or to learn implicit physical reasoning. The scarcity of training data with rich physics interactions and phenomena is a further problem. In this paper, we first introduce a Physics-Augmented video data construction Pipeline, PhyAugPipe, that leverages a vision-language model (VLM) with chain-of-thought reasoning to collect a large-scale training dataset, PhyVidGen-135K. Then we formulate a principled Physics-aware Groupwise Direct Preference Optimization (PhyGDPO) framework that builds upon the groupwise Plackett-Luce probabilistic model to capture holistic preferences beyond pairwise comparisons. In PhyGDPO, we design a Physics-Guided Rewarding (PGR) scheme that embeds VLM-based physics rewards to steer optimization toward physical consistency. We also propose a LoRA-Switch Reference (LoRA-SR) scheme that eliminates memory-heavy reference duplication for efficient training. Experiments show that our method significantly outperforms state-of-the-art open-source methods on PhyGenBench and VideoPhy2. Please check our project page at https://caiyuanhao1998.github.io/project/PhyGDPO for more video results. Our code, models, and data will be released at https://github.com/caiyuanhao1998/Open-PhyGDPO.
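The abstract describes LoRA-Switch Reference (LoRA-SR) only as eliminating memory-heavy reference duplication. One plausible reading, sketched below purely as an assumption, is that the policy is the base model plus LoRA adapters, so the frozen reference can be recovered by temporarily switching the adapters off rather than holding a second copy of the weights in memory. The module names and the `compute_logp` hook are hypothetical, not the authors' implementation.

```python
# Illustrative sketch of a "LoRA-switch" reference: the policy is base weights + LoRA,
# and disabling the adapters recovers the frozen reference model without duplicating it.
# This is an assumed reading of LoRA-SR, not the paper's code.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base                      # frozen pretrained layer
        for p in self.base.parameters():
            p.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank
        self.adapters_enabled = True          # the "switch"

    def forward(self, x):
        out = self.base(x)
        if self.adapters_enabled:
            out = out + self.scale * (x @ self.lora_a.T) @ self.lora_b.T
        return out

def set_adapters(model: nn.Module, enabled: bool):
    for module in model.modules():
        if isinstance(module, LoRALinear):
            module.adapters_enabled = enabled

def policy_and_reference_logp(model, compute_logp, batch):
    """compute_logp(model, batch) -> per-sample log-likelihoods (user-supplied hook)."""
    set_adapters(model, True)                 # policy = base + LoRA
    logp_policy = compute_logp(model, batch)
    set_adapters(model, False)                # reference = base weights only
    with torch.no_grad():
        logp_ref = compute_logp(model, batch)
    set_adapters(model, True)
    return logp_policy, logp_ref
```

A similar adapter-off trick appears in some open-source DPO implementations when the policy is LoRA-tuned, which is what makes this a natural guess at how reference duplication can be avoided here.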