PhysChoreo: Physics-Controllable Video Generation with Part-Aware Semantic Grounding

Haoze Zhang, Tianyu Huang, Zichen Wan, Xiaowei Jin, Hongzhi Zhang, Hui Li, Wangmeng Zuo

2025-11-26

Summary

This paper introduces a new system called PhysChoreo that creates realistic videos from a single image, focusing on making the movements and interactions within the video look physically believable.

What's the problem?

Current video generation models are really good at making videos *look* nice, but they often struggle to make those videos follow the rules of physics. Things might float when they should fall, or objects might pass through each other. Previous attempts to fix this using physics-based rendering struggled to accurately model complex physical properties and to keep the resulting behavior controllable over long stretches of video, making it hard to direct what happens in the scene.

What's the solution?

PhysChoreo solves this by working in two steps. First, it figures out what each object in the original image is made of and how it would naturally behave physically – like whether it’s heavy or light, bouncy or rigid. Then, it uses this information to simulate how the objects would move and interact based on instructions, creating a video that looks realistic and follows physical laws. Essentially, it’s like digitally choreographing a physically accurate scene.
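The two-stage idea can be illustrated with a toy sketch. This is not the authors' actual code or API; the property table and the 1-D bouncing-ball integrator below are stand-ins invented for illustration, showing how estimated per-part properties (stage 1) feed an instructed simulation (stage 2).

```python
from dataclasses import dataclass

# Stage 1 (toy stand-in): part-aware physical property estimation.
# The real system infers these properties from the input image;
# here we just look them up in a hypothetical table.
@dataclass
class PartProperties:
    name: str
    mass: float        # kg
    elasticity: float  # restitution: 0 = no bounce, 1 = perfect bounce

def estimate_properties(part_names):
    table = {"rubber_ball": (0.05, 0.8), "steel_cube": (2.0, 0.1)}
    return [PartProperties(n, *table.get(n, (1.0, 0.3))) for n in part_names]

# Stage 2 (toy stand-in): physically editable simulation.
# A 1-D drop-and-bounce integrator plays the role of the paper's
# full simulator driving the video generator.
def simulate_drop(props, height=1.0, dt=0.01, steps=200, g=9.81):
    y, v = height, 0.0
    trajectory = []
    for _ in range(steps):
        v -= g * dt
        y += v * dt
        if y < 0:                      # ground contact:
            y = 0.0
            v = -v * props.elasticity  # bounce, losing energy per material
        trajectory.append(y)
    return trajectory

ball, cube = estimate_properties(["rubber_ball", "steel_cube"])
ball_traj = simulate_drop(ball)
cube_traj = simulate_drop(cube)
# The bouncy ball keeps rebounding while the near-rigid cube settles,
# so identical instructions yield material-dependent motion.
```

The key point the sketch captures is the separation of concerns: once per-object properties are fixed in stage 1, the same instruction ("drop it") produces different, physically consistent motion for different materials in stage 2.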

Why it matters?

This work is important because it allows for the creation of more believable and controllable videos. This could be useful in many areas, like creating special effects for movies, designing realistic simulations for training, or even generating content for virtual reality experiences where physical accuracy is crucial.

Abstract

While recent video generation models have achieved significant visual fidelity, they often suffer from the lack of explicit physical controllability and plausibility. To address this, some recent studies attempted to guide the video generation with physics-based rendering. However, these methods face inherent challenges in accurately modeling complex physical properties and effectively controlling the resulting physical behavior over extended temporal sequences. In this work, we introduce PhysChoreo, a novel framework that can generate videos with diverse controllability and physical realism from a single image. Our method consists of two stages: first, it estimates the static initial physical properties of all objects in the image through part-aware physical property reconstruction. Then, through temporally instructed and physically editable simulation, it synthesizes high-quality videos with rich dynamic behaviors and physical realism. Experimental results show that PhysChoreo can generate videos with rich behaviors and physical realism, outperforming state-of-the-art methods on multiple evaluation metrics.