World Action Models are Zero-shot Policies
Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim
2026-02-19
Summary
This paper introduces DreamZero, a new type of AI model for robots that learns how to interact with the world by watching videos of how things change and predicting what will happen next, rather than just following specific instructions.
What's the problem?
Current AI models for robots, called Vision-Language-Action (VLA) models, are good at understanding what you *want* them to do, but they struggle in new physical situations and environments they haven't seen before. They have trouble figuring out how to actually *do* things in a changing world, especially when the movements require an understanding of physics.
What's the solution?
The researchers created DreamZero, which learns by predicting what will happen next in a video and then figuring out what actions the robot needs to take to make that prediction come true. It's like the robot imagining the future and then acting to create it. They built on a large, pretrained video diffusion model and optimized it to run in real time on a robot (about seven decisions per second), so it can react quickly to its surroundings. They also showed it can learn from videos of other robots or even people, and can quickly adapt to new robot bodies.
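To make the idea concrete, here is a minimal, purely illustrative sketch of a "world action model": one network that jointly predicts future video latents (the imagined future) and an action chunk consistent with that future. The class name, layer sizes, and interface below are assumptions for illustration only; the actual DreamZero is a 14B autoregressive video diffusion model, not this toy network.

```python
import torch
import torch.nn as nn

class ToyWorldActionModel(nn.Module):
    """Hypothetical sketch: jointly predict future frame latents and actions."""

    def __init__(self, latent_dim=256, action_dim=7, horizon=8):
        super().__init__()
        self.horizon, self.latent_dim, self.action_dim = horizon, latent_dim, action_dim
        # Shared trunk over the current observation latent and an instruction embedding.
        self.trunk = nn.Sequential(
            nn.Linear(latent_dim * 2, 512), nn.GELU(),
            nn.Linear(512, 512), nn.GELU(),
        )
        # Head 1: predict the next `horizon` frame latents (how the scene will evolve).
        self.video_head = nn.Linear(512, horizon * latent_dim)
        # Head 2: predict the action chunk that would produce that imagined future.
        self.action_head = nn.Linear(512, horizon * action_dim)

    def forward(self, obs_latent, text_latent):
        h = self.trunk(torch.cat([obs_latent, text_latent], dim=-1))
        future = self.video_head(h).view(-1, self.horizon, self.latent_dim)
        actions = self.action_head(h).view(-1, self.horizon, self.action_dim)
        return future, actions


model = ToyWorldActionModel()
obs = torch.randn(1, 256)   # encoded current camera frame (placeholder)
text = torch.randn(1, 256)  # encoded instruction (placeholder)
future_latents, action_chunk = model(obs, text)

# Training would supervise both heads: future_latents against encoded future frames
# from demonstration video, action_chunk against the recorded actions, so the
# action prediction is grounded in a prediction of how the world evolves.
print(future_latents.shape, action_chunk.shape)  # (1, 8, 256), (1, 8, 7)
```

Because the video target is just future frames, this kind of model can also be trained on video-only data (other robots, humans) where no action labels exist, which is the intuition behind the cross-embodiment results described above.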
Why it matters?
This is important because it allows robots to be much more flexible and adaptable. Instead of needing to be specifically programmed for every task and environment, they can learn from observation and experience, just like humans do. This makes robots more useful in real-world situations where things are constantly changing, and it opens the door to robots that can learn new skills with very little training data.
Abstract
State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This results in over 2x improvement in generalization to new tasks and environments compared to state-of-the-art VLAs in real robot experiments. Crucially, through model and system optimizations, we enable a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7Hz. Finally, we demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen task performance with just 10-20 minutes of data. More surprisingly, DreamZero enables few-shot embodiment adaptation, transferring to a new embodiment with only 30 minutes of play data while retaining zero-shot generalization.
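To give a feel for the real-time closed-loop control mentioned in the abstract, here is a hedged sketch of a ~7Hz control loop that re-queries the model from the latest observation on every cycle instead of executing a long open-loop plan. The `policy` and `robot` interfaces are hypothetical placeholders, not DreamZero's actual API.

```python
import time

CONTROL_HZ = 7              # the paper reports real-time closed-loop control at 7Hz
STEP_DT = 1.0 / CONTROL_HZ

def control_loop(policy, robot, instruction, max_steps=200):
    """Hypothetical closed-loop driver (interfaces are assumptions, not the paper's)."""
    for _ in range(max_steps):
        t0 = time.monotonic()
        obs = robot.get_observation()             # latest camera frame + robot state
        action_chunk = policy.predict(obs, instruction)
        robot.apply_action(action_chunk[0])       # execute only the first action, then re-plan
        # Sleep whatever remains of the cycle so the loop holds roughly 7Hz.
        elapsed = time.monotonic() - t0
        time.sleep(max(0.0, STEP_DT - elapsed))
```

Replanning every cycle is what lets the robot react to a changing scene, but it only works if one forward pass of the model fits inside the ~143 ms budget, which is why the model and system optimizations matter.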