GigaWorld-Policy: An Efficient Action-Centered World-Action Model
Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Hengtao Li, Jie Li, Jindi Lv, Jingyu Liu, Min Cao, Peng Li, Qiuping Deng, Wenjun Mei, Xiaofeng Wang, Xinze Chen, Xinyu Zhou, Yang Wang, Yifan Chang, Yifan Li, Yukun Zhou, Yun Ye
2026-03-19
Summary
This paper introduces a new approach called GigaWorld-Policy for teaching robots how to perform tasks by building on pre-trained video generation models. It keeps the strengths of existing world-action models while making training and deployment faster and more reliable.
What's the problem?
Current methods that use video to help robots learn have two main issues. First, predicting both what will happen in a video *and* what action the robot should take at the same time requires a lot of computing power, which slows everything down. Second, because the video and action predictions are so tightly coupled, any imperfection in the video forecast can throw off the robot's ability to choose the best action.
What's the solution?
GigaWorld-Policy tackles these problems by focusing on actions first. The system predicts what sequence of actions a robot should take based on what it sees, and *then* separately predicts what the video will look like if those actions are performed. This separation makes things faster because the robot doesn't need to constantly predict the video while deciding what to do. Also, the system is trained to make both accurate action predictions and realistic videos, which helps it learn physically plausible movements. Importantly, the video prediction isn't even needed when the robot is actually performing the task, making it even faster.
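One way to picture the causal design described above is as a block attention mask over a token sequence ordered as [observation | actions | future video]. The sketch below is an assumption about how such a mask could look, not the authors' actual implementation: action tokens attend only to the observation and to earlier actions, never to video tokens, which is why video generation can be skipped entirely at deployment time.

```python
import numpy as np

def build_attention_mask(n_obs: int, n_act: int, n_vid: int) -> np.ndarray:
    """Boolean mask where mask[i, j] == True means token i may attend to token j.

    Hypothetical token layout: [observation | action | future-video].
    """
    n = n_obs + n_act + n_vid
    mask = np.zeros((n, n), dtype=bool)
    # Observation tokens attend bidirectionally among themselves.
    mask[:n_obs, :n_obs] = True
    # Action tokens attend to the observation and causally to earlier actions,
    # but never to future-video tokens.
    act = slice(n_obs, n_obs + n_act)
    mask[act, :n_obs] = True
    mask[act, act] = np.tril(np.ones((n_act, n_act), dtype=bool))
    # Video tokens attend to the observation, all actions, and earlier video.
    vid = slice(n_obs + n_act, n)
    mask[vid, :n_obs + n_act] = True
    mask[vid, vid] = np.tril(np.ones((n_vid, n_vid), dtype=bool))
    return mask

mask = build_attention_mask(n_obs=4, n_act=3, n_vid=5)
# No action row can see any video column, so dropping video tokens
# at inference leaves the action predictions unchanged.
assert not mask[4:7, 7:].any()
```

Because the video tokens depend on the actions but not the other way around, removing the video block from the sequence at inference time does not change what the action tokens compute.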
Why it matters?
This research is important because it makes robot learning more practical. GigaWorld-Policy runs nine times faster than Motus, a leading world-action model, and also achieves higher success rates on real-world robotic tasks. This means robots can learn more efficiently and act more reliably, bringing us closer to robots that can help us in everyday life.
Abstract
World-Action Models (WAM) initialized from pre-trained video generation backbones have demonstrated remarkable potential for robot policy learning. However, existing approaches face two critical bottlenecks that hinder performance and deployment. First, jointly reasoning over future visual dynamics and corresponding actions incurs substantial inference overhead. Second, joint modeling often entangles visual and motion representations, making motion prediction accuracy heavily dependent on the quality of future video forecasts. To address these issues, we introduce GigaWorld-Policy, an action-centered WAM that learns 2D pixel-action dynamics while enabling efficient action decoding, with optional video generation. Specifically, we formulate policy training into two coupled components: the model predicts future action sequences conditioned on the current observation, and simultaneously generates future videos conditioned on the predicted actions and the same observation. The policy is supervised by both action prediction and video generation, providing richer learning signals and encouraging physically plausible actions through visual-dynamics constraints. With a causal design that prevents future-video tokens from influencing action tokens, explicit future-video generation is optional at inference time, allowing faster action prediction during deployment. To support this paradigm, we curate a diverse, large-scale robot dataset to pre-train an action-centered video generation model, which is then adapted as the backbone for robot policy learning. Experimental results on real-world robotic platforms show that GigaWorld-Policy runs 9x faster than the leading WAM baseline, Motus, while improving task success rates by 7%. Moreover, compared with pi-0.5, GigaWorld-Policy improves performance by 95% on RoboTwin 2.0.