ProPhy: Progressive Physical Alignment for Dynamic World Simulation
Zijun Wang, Panwen Hu, Jing Wang, Terry Jingchen Zhang, Yuhao Cheng, Long Chen, Yiqiang Yan, Zutao Jiang, Hanhui Li, Xiaodan Liang
2025-12-08
Summary
This paper introduces a new method, ProPhy, for generating more realistic videos, especially videos in which objects move and interact in physically believable ways.
What's the problem?
Current video generation models are pretty good at making videos *look* nice, but they often fail to follow the rules of physics. Things might float when they should fall, or move in ways that just don't make sense, especially in complicated scenes. This happens because the models don't really 'understand' physics: they respond to physics-related prompts uniformly across the whole scene, rather than letting forces act differently on different objects and regions the way they do in the real world.
What's the solution?
ProPhy tackles this with a two-stage system the authors call a Mixture of Physics Experts. First, 'semantic experts' infer the general physical principles at play from the text description of the video. Then, 'refinement experts' focus on the details of motion, learning to represent physical dynamics at a very fine level, token by token. ProPhy also borrows physical 'common sense' from models that understand both images and language, transferring it into the refinement experts to produce more accurate movements and interactions. Essentially, it learns to pay attention to *how* physics affects each part of a scene, not just *that* physics is involved.
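To make the two-stage idea concrete, here is a minimal sketch of expert routing in the spirit described above: a semantic stage picks a coarse physical prior from a prompt-level feature, then a refinement stage re-weights each token's features, so different regions of the video can obey different dynamics. This is not the paper's implementation; every name, dimension, and number below is an illustrative assumption.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def gate(feature, expert_keys):
    """Score each expert by dot product with the feature, then softmax."""
    scores = [sum(f * k for f, k in zip(feature, key)) for key in expert_keys]
    return softmax(scores)

def mixture(feature, expert_keys, expert_outputs):
    """Weighted sum of expert outputs under the gate's distribution."""
    w = gate(feature, expert_keys)
    dim = len(expert_outputs[0])
    return [sum(w[i] * expert_outputs[i][d] for i in range(len(w)))
            for d in range(dim)]

# Stage 1: semantic experts conditioned on one prompt-level feature.
prompt_feat = [0.9, 0.1]                 # e.g. a "heavy object falling" prompt
sem_keys = [[1.0, 0.0], [0.0, 1.0]]      # gravity-like vs. fluid-like prior
sem_outs = [[1.0, 0.0], [0.0, 1.0]]
prior = mixture(prompt_feat, sem_keys, sem_outs)

# Stage 2: refinement experts applied per token, conditioned on the
# token feature plus the stage-1 prior. Each token routes independently,
# so the response is anisotropic rather than one-size-fits-all.
tokens = [[0.8, 0.2], [0.1, 0.9]]
ref_keys = [[1.0, 0.0], [0.0, 1.0]]
ref_outs = [[0.5, 0.0], [0.0, 0.5]]
refined = [mixture([t + p for t, p in zip(tok, prior)], ref_keys, ref_outs)
           for tok in tokens]
```

The key point the sketch shows is that gating happens twice: once globally (which physical regime applies) and once per token (how that regime shapes each local patch of the video).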
Why it matters?
This research is important because more realistic video generation has a lot of potential. It could lead to better simulations for training robots, creating special effects in movies, or even designing new products. If we can create videos that accurately reflect the physical world, it opens up a lot of possibilities for using those videos in practical applications and for building more intelligent systems.
Abstract
Recent advances in video generation have shown remarkable potential for constructing world simulators. However, current models still struggle to produce physically consistent results, particularly when handling large-scale or complex dynamics. This limitation arises primarily because existing approaches respond isotropically to physical prompts and neglect the fine-grained alignment between generated content and localized physical cues. To address these challenges, we propose ProPhy, a Progressive Physical Alignment Framework that enables explicit physics-aware conditioning and anisotropic generation. ProPhy employs a two-stage Mixture-of-Physics-Experts (MoPE) mechanism for discriminative physical prior extraction, where Semantic Experts infer semantic-level physical principles from textual descriptions, and Refinement Experts capture token-level physical dynamics. This mechanism allows the model to learn fine-grained, physics-aware video representations that better reflect underlying physical laws. Furthermore, we introduce a physical alignment strategy that transfers the physical reasoning capabilities of vision-language models (VLMs) into the Refinement Experts, facilitating a more accurate representation of dynamic physical phenomena. Extensive experiments on physics-aware video generation benchmarks demonstrate that ProPhy produces more realistic, dynamic, and physically coherent results than existing state-of-the-art methods.
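The abstract also mentions transferring the physical reasoning of vision-language models into the Refinement Experts. A common way to realize such a transfer is a feature-alignment objective that pulls the experts' token features toward a frozen VLM's features; the cosine-distance loss below is one standard choice, sketched here as an assumption rather than the paper's exact objective.

```python
import math

def cosine(a, b):
    # Cosine similarity between two feature vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def alignment_loss(expert_feats, vlm_feats):
    """Mean (1 - cosine similarity) over paired token features:
    zero when the expert features point the same way as the VLM's."""
    pairs = list(zip(expert_feats, vlm_feats))
    return sum(1.0 - cosine(e, v) for e, v in pairs) / len(pairs)

# Illustrative features: expert tokens vs. frozen-VLM targets.
expert_feats = [[0.9, 0.1], [0.2, 0.8]]
vlm_feats = [[1.0, 0.0], [0.0, 1.0]]
loss = alignment_loss(expert_feats, vlm_feats)
```

Minimizing this loss during training nudges the refinement experts' representations toward the VLM's physics-aware feature space without requiring the VLM at generation time.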