Robot Learning from a Physical World Model
Jiageng Mao, Sicheng He, Hao-Ning Wu, Yang You, Shuyang Sun, Zhicheng Wang, Yanan Bao, Huizhong Chen, Leonidas Guibas, Vitor Guizilini, Howard Zhou, Yue Wang
2025-11-11
Summary
This paper introduces PhysWorld, a new system that lets robots learn how to perform tasks just by watching videos, without needing to physically practice the tasks themselves.
What's the problem?
Current AI can create realistic videos of robots performing tasks, but simply copying those movements doesn't transfer well to the real world because the videos don't account for physics. A robot might try to move an object in a way that is impossible or unstable, because the video never 'understood' how objects actually behave physically.
What's the solution?
PhysWorld solves this by combining video generation with a 'physical world model'. In essence, it creates a video of a robot doing a task, *and* simultaneously reconstructs a digital model of the objects and physics involved. It then uses this model to refine the robot's movements, making them physically realistic and accurate. A technique called 'object-centric residual reinforcement learning' learns small corrections on top of the motions extracted from the video, so the final actions are grounded in the physical model rather than in pixels alone.
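The residual idea can be illustrated with a toy sketch (everything here is hypothetical: the dynamics, the goal, and the finite-difference update are stand-ins, not the paper's actual physical world model or RL algorithm). A nominal action comes from the retargeted video motion, and a learned residual correction is trained against feedback from a simulated world model:

```python
import numpy as np

def nominal_action(t):
    """Stand-in for the action retargeted from the generated video."""
    return np.array([np.sin(t), np.cos(t)])

def simulate_step(obj_pos, action):
    """Toy stand-in for the reconstructed physical world model."""
    return obj_pos + 0.1 * action  # simplistic object dynamics

goal = np.array([1.0, 1.0])   # hypothetical target object position
residual = np.zeros(2)        # learned per-step correction to the video motion
lr, eps = 0.5, 1e-3

def rollout(res):
    """Roll out video-derived actions plus a residual correction."""
    pos = np.zeros(2)
    for t in range(10):
        pos = simulate_step(pos, nominal_action(t) + res)
    return pos

for episode in range(200):
    pos = rollout(residual)
    base_err = np.sum((pos - goal) ** 2)
    # Finite-difference update: nudge the residual toward lower final error.
    grad = np.zeros(2)
    for i in range(2):
        perturbed = residual.copy()
        perturbed[i] += eps
        p = rollout(perturbed)
        grad[i] = (np.sum((p - goal) ** 2) - base_err) / eps
    residual -= lr * grad

final_err = float(np.sum((rollout(residual) - goal) ** 2))
```

The design point the sketch captures is that the policy only has to learn a *correction*, not the whole behavior: the video supplies a reasonable motion, and the physics simulation supplies the error signal that makes it executable.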
Why does it matter?
This is important because it means robots can learn complex tasks without needing tons of real-world training data, which is expensive and time-consuming to collect. It allows robots to perform tasks they've never seen before, making them much more adaptable and useful in various situations. It's a step towards robots that can truly understand and interact with the physical world.
Abstract
We introduce PhysWorld, a framework that enables robot learning from video generation through physical world modeling. Recent video generation models can synthesize photorealistic visual demonstrations from language commands and images, offering a powerful yet underexplored source of training signals for robotics. However, directly retargeting pixel motions from generated videos to robots neglects physics, often resulting in inaccurate manipulations. PhysWorld addresses this limitation by coupling video generation with physical world reconstruction. Given a single image and a task command, our method generates task-conditioned videos and reconstructs the underlying physical world from the videos, and the generated video motions are grounded into physically accurate actions through object-centric residual reinforcement learning with the physical world model. This synergy transforms implicit visual guidance into physically executable robotic trajectories, eliminating the need for real robot data collection and enabling zero-shot generalizable robotic manipulation. Experiments on diverse real-world tasks demonstrate that PhysWorld substantially improves manipulation accuracy compared to previous approaches. Visit the project webpage at https://pointscoder.github.io/PhysWorld_Web/ for details.