GigaBrain-0: A World Model-Powered Vision-Language-Action Model

GigaBrain Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jie Li, Jiagang Zhu, Lv Feng, Peng Li, Qiuping Deng, Runqi Ouyang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yilong Li, Yiran Ding, Yuan Xu, Yun Ye

2025-10-23

Summary

This paper introduces GigaBrain-0, a new artificial intelligence model designed to help robots understand and interact with the world using both vision (what they see) and language (what they're told to do). It's a big step towards creating robots that can perform a wider variety of tasks more reliably.

What's the problem?

Training general-purpose robots typically requires a huge amount of real-world data collected by the robots themselves. Gathering this data is slow, expensive, and limits how well robots can adapt to new situations. Basically, robots need a lot of practice, and giving them that practice is hard.

What's the solution?

The researchers tackled this problem with 'world models' – AI systems that can *generate* realistic training data, such as videos of robots performing actions. They trained GigaBrain-0 on these generated datasets, combined with two other techniques: feeding the model depth information alongside color images (RGBD inputs) and teaching it to reason through tasks step-by-step (embodied chain-of-thought). Together, these significantly reduced the need for real-world robot data. They also created a smaller, faster version called GigaBrain-0-Small for use on less powerful computers.
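To make the data-mixing idea concrete, here is a minimal sketch of how scarce real robot samples might be supplemented with cheap world-model-generated samples at a chosen ratio. This is a hypothetical illustration, not the paper's actual pipeline; the function name `mix_datasets` and the sample format are assumptions for the example.

```python
import random

def mix_datasets(real_samples, generated_samples, generated_fraction, seed=0):
    """Build a training pool where world-model-generated samples supplement
    scarce real robot data (hypothetical sketch, not the paper's pipeline).

    generated_fraction: desired share of generated samples in the final pool.
    """
    rng = random.Random(seed)
    # Number of generated samples needed so they make up generated_fraction
    # of the combined pool, given all real samples are kept.
    n_generated = int(len(real_samples) * generated_fraction / (1 - generated_fraction))
    picked = [rng.choice(generated_samples) for _ in range(n_generated)]
    pool = list(real_samples) + picked
    rng.shuffle(pool)
    return pool

# 100 expensive real trajectories vs. 1000 cheaply generated ones.
real = [("real_traj", i) for i in range(100)]
generated = [("generated_traj", i) for i in range(1000)]
pool = mix_datasets(real, generated, generated_fraction=0.8)
print(len(pool))  # 100 real + 400 generated = 500
```

The point of the sketch is the leverage: the robot only had to collect the 100 real trajectories, while the world model supplies the rest of the training diversity.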

Why it matters?

This work is important because it makes it much more practical to build robots that can handle complex tasks in the real world. By reducing the reliance on expensive real-world data, it opens the door to creating robots that are more adaptable, robust, and capable of generalizing to new environments and situations. The smaller version also means these advanced capabilities can be used on robots that aren't super high-tech.

Abstract

Training Vision-Language-Action (VLA) models for generalist robots typically requires large-scale real-world robot data, which is expensive and time-consuming to collect. The inefficiency of physical data collection severely limits the scalability and generalization capacity of current VLA systems. To address this challenge, we introduce GigaBrain-0, a novel VLA foundation model empowered by world model-generated data (e.g., video generation, real2real transfer, human transfer, view transfer, sim2real transfer data). By leveraging world models to generate diverse data at scale, GigaBrain-0 significantly reduces reliance on real robot data while improving cross-task generalization. Our approach further improves policy robustness through RGBD input modeling and embodied Chain-of-Thought (CoT) supervision, enabling the model to reason about spatial geometry, object states, and long-horizon dependencies during task execution. This leads to substantial gains in real-world performance on dexterous, long-horizon, and mobile manipulation tasks. Extensive experiments demonstrate that GigaBrain-0 achieves superior generalization across variations in appearances (e.g., textures, colors), object placements, and camera viewpoints. Additionally, we present GigaBrain-0-Small, an optimized lightweight variant designed to run efficiently on devices such as the NVIDIA Jetson AGX Orin.