
GigaBrain-0.5M*: a VLA That Learns From World Model-Based Reinforcement Learning

GigaBrain Team, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, Lv Feng, Mingming Yu, Peng Li, Qiuping Deng, Tianze Liu, Xinyu Zhou, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yifei Nie, Yilong Li, Yukun Zhou

2026-02-13


Summary

This paper introduces GigaBrain-0.5M*, a new vision-language-action (VLA) model designed to better understand and interact with the physical world, specifically focusing on robotic manipulation tasks.

What's the problem?

Existing VLA models struggle with complex, multi-step tasks because they predict their next chunk of actions directly from the current observation, with only a limited understanding of the scene and little ability to anticipate what will happen next. In other words, they react to what they see *right now* instead of planning ahead for the later steps of a task.
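To make this limitation concrete, here is a minimal, hypothetical sketch (in PyTorch) of the kind of reactive policy described above: a single network maps the current observation and an instruction embedding straight to a fixed-length action chunk, with no explicit model of how the scene will evolve. The class name and dimensions are illustrative assumptions, not the paper's architecture.

```python
# Illustrative only: a conventional VLA maps the current observation (plus a
# language instruction) directly to a chunk of future actions. All names and
# sizes here are hypothetical stand-ins.
import torch
import torch.nn as nn

class ReactiveVLA(nn.Module):
    def __init__(self, obs_dim=64, text_dim=32, act_dim=8, horizon=4):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.head = nn.Sequential(
            nn.Linear(obs_dim + text_dim, 256), nn.ReLU(),
            nn.Linear(256, horizon * act_dim),
        )

    def forward(self, obs, instruction_emb):
        # One forward pass: current frame + instruction -> next `horizon` actions.
        # Nothing in this computation anticipates how the scene will change.
        chunk = self.head(torch.cat([obs, instruction_emb], dim=-1))
        return chunk.view(-1, self.horizon, self.act_dim)
```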

What's the solution?

The researchers built GigaBrain-0.5M* by starting from a strong foundation model, GigaBrain-0.5, which was pre-trained on more than 10,000 hours of robot manipulation data and is already good at understanding videos of robots working. On top of that, they added world model-based reinforcement learning through a method called RAMP (Reinforcement leArning via world Model-conditioned Policy). This lets the model learn from simulated experiences generated by the world model and adapt to new tasks more easily, improving its ability to plan and execute long sequences of actions (a rough sketch follows below).
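The summary above does not spell out RAMP's training loop, so the sketch below is only an assumed illustration of the general idea of a world-model-conditioned policy: a frozen, pretrained world model imagines future observations, the policy conditions on those imagined futures when proposing an action chunk, and a simple REINFORCE-style update scores the chunk by rolling it out in the world model. All names (`WorldModel`, `WorldModelConditionedPolicy`, `ramp_style_update`, `reward_fn`) and dimensions are hypothetical, not the authors' implementation.

```python
# Minimal, illustrative sketch of world-model-conditioned policy RL.
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Stand-in for a pretrained video world model: predicts the next latent
    observation from the current latent observation and a candidate action."""
    def __init__(self, obs_dim=64, act_dim=8):
        super().__init__()
        self.dynamics = nn.Sequential(
            nn.Linear(obs_dim + act_dim, 128), nn.ReLU(), nn.Linear(128, obs_dim)
        )

    def rollout(self, obs, actions):
        # Imagine future observations for a whole chunk of actions.
        preds = []
        for a in actions.unbind(dim=1):            # actions: (B, T, act_dim)
            obs = self.dynamics(torch.cat([obs, a], dim=-1))
            preds.append(obs)
        return torch.stack(preds, dim=1)           # (B, T, obs_dim)

class WorldModelConditionedPolicy(nn.Module):
    """Policy conditioned on the current observation *and* the world model's
    imagined future; emits a Gaussian over a fixed-length action chunk."""
    def __init__(self, obs_dim=64, act_dim=8, horizon=4):
        super().__init__()
        self.horizon, self.act_dim = horizon, act_dim
        self.net = nn.Sequential(
            nn.Linear(obs_dim * (1 + horizon), 256), nn.ReLU(),
            nn.Linear(256, horizon * act_dim),
        )
        self.log_std = nn.Parameter(torch.zeros(horizon * act_dim))

    def forward(self, obs, imagined_future):
        x = torch.cat([obs, imagined_future.flatten(1)], dim=-1)
        return torch.distributions.Normal(self.net(x), self.log_std.exp())

def ramp_style_update(policy, world_model, obs, reward_fn, optimizer, horizon=4):
    """One REINFORCE-style update on world-model rollouts (illustrative only)."""
    B = obs.shape[0]
    with torch.no_grad():                          # the world model stays frozen
        # Bootstrap: imagine the future under zero actions to condition the policy.
        imagined = world_model.rollout(obs, torch.zeros(B, horizon, policy.act_dim))
    dist = policy(obs, imagined)
    action_chunk = dist.sample()
    with torch.no_grad():
        # Score the sampled chunk by rolling it out in the world model.
        future = world_model.rollout(obs, action_chunk.view(B, horizon, -1))
        reward = reward_fn(future)                 # (B,) task-specific score
    loss = -(dist.log_prob(action_chunk).sum(-1) * reward).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example wiring (shapes only, random data; the reward function is a placeholder):
obs = torch.randn(16, 64)
wm, pi = WorldModel(), WorldModelConditionedPolicy()
opt = torch.optim.Adam(pi.parameters(), lr=1e-4)
ramp_style_update(pi, wm, obs, reward_fn=lambda f: -f.pow(2).mean(dim=(1, 2)), optimizer=opt)
```

The point of the structure, rather than the specific networks, is that the world model plays two roles: it conditions the policy's decision on an imagined future, and it supplies the simulated rollouts that the reinforcement-learning update is computed on.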

Why it matters?

This work is important because it substantially improves a robot's ability to perform complex tasks like folding laundry, packing boxes, and making espresso, with gains of roughly 30% over the prior RECAP baseline on these tasks. The model is reliable enough to consistently complete them in the real world, which is a big step toward robots that can truly assist people with everyday activities. It shows that combining video understanding with planning and reinforcement learning is a powerful approach for robotics.

Abstract

Vision-language-action (VLA) models that directly predict multi-step action chunks from current observations face inherent limitations due to constrained scene understanding and weak future anticipation capabilities. In contrast, video world models pre-trained on web-scale video corpora exhibit robust spatiotemporal reasoning and accurate future prediction, making them a natural foundation for enhancing VLA learning. We therefore propose GigaBrain-0.5M*, a VLA model trained via world model-based reinforcement learning. It is built upon GigaBrain-0.5, which is pre-trained on over 10,000 hours of robotic manipulation data and whose intermediate version currently ranks first on the international RoboChallenge benchmark. GigaBrain-0.5M* further integrates world model-based reinforcement learning via RAMP (Reinforcement leArning via world Model-conditioned Policy) to enable robust cross-task adaptation. Empirical results demonstrate that RAMP achieves substantial performance gains over the RECAP baseline, yielding improvements of approximately 30% on challenging tasks including Laundry Folding, Box Packing, and Espresso Preparation. Critically, GigaBrain-0.5M* exhibits reliable long-horizon execution, consistently accomplishing complex manipulation tasks without failure, as validated by real-world deployment videos on our project page (https://gigabrain05m.github.io).