GigaWorld-0: World Models as Data Engine to Empower Embodied AI
GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou
2025-11-26
Summary
This paper introduces GigaWorld-0, a new system for creating realistic simulated environments used to train artificial intelligence, specifically AI that interacts with the physical world. It's designed to generate lots of data for AI to learn from without needing as much real-world experience.
What's the problem?
Training AI to perform tasks in the real world, like robotics, requires a huge amount of data. Collecting this data is expensive and time-consuming because it involves physical interaction with real environments. Existing simulations often aren't realistic enough, so AI trained in them performs poorly when deployed in the real world (the sim-to-real gap). The challenge is to build a simulation that is both scalable, so it can generate large amounts of data, and realistic enough for AI to learn skills that actually transfer.
What's the solution?
The researchers built GigaWorld-0, which has two main parts. The first, GigaWorld-0-Video, uses large-scale video generation to create visually rich and diverse videos of simulated environments, with fine-grained control over what objects look like, camera viewpoints, and the actions being performed. The second, GigaWorld-0-3D, makes the simulation geometrically and physically accurate: it combines 3D generative modeling, 3D Gaussian Splatting reconstruction, differentiable system identification (fitting physical parameters so simulated objects behave like real ones), and executable motion planning. They also developed an efficient training framework, GigaTrain, that uses FP8 low-precision arithmetic and sparse attention to cut memory and compute requirements, which is what makes generating training data at this scale feasible. A rough sketch of the sparse-attention idea follows below.
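The summary doesn't spell out what "sparse attention" looks like in practice, so here is a minimal, hypothetical PyTorch sketch of one common variant: a sliding-window mask that limits each query to nearby keys, cutting the quadratic attention cost that dominates memory for long video token sequences. The function `local_attention` and the window pattern are illustrative assumptions, not GigaTrain's actual implementation.

```python
# Illustrative sketch only: a sliding-window (local) attention mask of the
# general kind "sparse attention" refers to. GigaTrain's actual sparsity
# pattern and FP8 recipe are not specified in this summary.
import torch
import torch.nn.functional as F

def local_attention(q, k, v, window: int):
    """Attention where each query attends only to keys within `window`
    positions, so cost scales with seq_len * window rather than seq_len^2
    (the full mask is materialized here for clarity, not efficiency)."""
    seq_len = q.shape[-2]
    idx = torch.arange(seq_len)
    # True where |i - j| <= window: the only positions a query may attend to.
    mask = (idx[None, :] - idx[:, None]).abs() <= window
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v

# Toy usage: batch of 2, 128 video tokens, 64-dim heads.
q = k = v = torch.randn(2, 128, 64)
out = local_attention(q, k, v, window=8)
print(out.shape)  # torch.Size([2, 128, 64])
```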
Why it matters?
This work is important because it lets AI learn complex physical tasks far more efficiently. By training on data generated by GigaWorld-0, AI models can perform well in the real world without any real-world interaction during training. This could significantly speed up the development of robots and other AI systems that interact with the physical environment, making them more capable and adaptable.
Abstract
World models are emerging as a foundational paradigm for scalable, data-efficient embodied AI. In this work, we present GigaWorld-0, a unified world model framework designed explicitly as a data engine for Vision-Language-Action (VLA) learning. GigaWorld-0 integrates two synergistic components: GigaWorld-0-Video, which leverages large-scale video generation to produce diverse, texture-rich, and temporally coherent embodied sequences under fine-grained control of appearance, camera viewpoint, and action semantics; and GigaWorld-0-3D, which combines 3D generative modeling, 3D Gaussian Splatting reconstruction, physically differentiable system identification, and executable motion planning to ensure geometric consistency and physical realism. Their joint optimization enables the scalable synthesis of embodied interaction data that is visually compelling, spatially coherent, physically plausible, and instruction-aligned. Training at scale is made feasible through our efficient GigaTrain framework, which exploits FP8 precision and sparse attention to drastically reduce memory and compute requirements. We conduct comprehensive evaluations showing that GigaWorld-0 generates high-quality, diverse, and controllable data across multiple dimensions. Critically, VLA models (e.g., GigaBrain-0) trained on GigaWorld-0-generated data achieve strong real-world performance, significantly improving generalization and task success on physical robots without any real-world interaction during training.
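As a concrete (if toy) illustration of the "physically differentiable system identification" the abstract mentions, the sketch below recovers a friction coefficient by backpropagating through a tiny hand-written simulator. The dynamics, the parameter, and the "observed" trajectory are hypothetical stand-ins; GigaWorld-0-3D's actual system identification is not described at this level of detail here.

```python
# Minimal sketch of differentiable system identification: fit a physical
# parameter by gradient descent through a toy differentiable simulator.
import torch

def simulate(mu, v0=5.0, dt=0.05, steps=40, g=9.81):
    """Roll out a block sliding with initial speed v0 under kinetic
    friction mu; returns the velocity trajectory (differentiable in mu)."""
    v, traj = torch.as_tensor(v0), []
    for _ in range(steps):
        v = torch.clamp(v - mu * g * dt, min=0.0)  # friction decelerates
        traj.append(v)
    return torch.stack(traj)

# "Observed" trajectory generated with a hidden ground-truth mu = 0.3.
observed = simulate(torch.tensor(0.3)).detach()

mu = torch.tensor(0.8, requires_grad=True)        # initial guess
opt = torch.optim.Adam([mu], lr=0.05)
for step in range(200):
    opt.zero_grad()
    loss = ((simulate(mu) - observed) ** 2).mean()
    loss.backward()                               # gradients flow through the physics
    opt.step()

print(f"estimated mu = {mu.item():.3f}")          # converges near 0.3
```

The same principle, at far greater scale and fidelity, lets a simulator's physical parameters be tuned so that simulated object behavior matches real observations.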