RynnVLA-002: A Unified Vision-Language-Action and World Model
Jun Cen, Siteng Huang, Yuqian Yuan, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Kehan Li, Hao Luo, Fan Wang, Xin Li, Deli Zhao, Hao Chen
2025-11-24
Summary
This paper introduces RynnVLA-002, a new artificial intelligence system that combines understanding of vision (what things look like), language (instructions), and action (what to do). It also builds a 'world model,' which is like an internal simulation of the environment.
What's the problem?
Traditionally, AI systems that interact with the world have either been good at understanding what they *see* and what they're *told* to do, or they've been good at predicting how the world will change based on actions. The problem is that these abilities are often separate, and neither works as well as it could on its own. Building an AI that can both understand its surroundings and plan effectively is a challenge.
What's the solution?
The researchers created RynnVLA-002, which tackles this problem by building a single system that learns both how the world works *and* how to act within it. The system uses visual information and actions to predict what will happen next, and it uses its understanding of the world to help it decide what actions to take. Essentially, the two parts – understanding and planning – constantly improve each other.
Why does it matter?
This research is important because it shows that combining different AI capabilities can lead to much better performance. RynnVLA-002 significantly outperformed existing systems in both simulated environments and with a real-world robot, reaching a 97.4% success rate in simulation and a 50% higher success rate on real hardware when the world model is used. This suggests a promising path towards creating more intelligent and capable robots that can operate effectively in complex, real-world situations.
Abstract
We introduce RynnVLA-002, a unified Vision-Language-Action (VLA) and world model. The world model leverages action and visual inputs to predict future image states, learning the underlying physics of the environment to refine action generation. Conversely, the VLA model produces subsequent actions from image observations, enhancing visual understanding and supporting the world model's image generation. The unified framework of RynnVLA-002 enables joint learning of environmental dynamics and action planning. Our experiments show that RynnVLA-002 surpasses individual VLA and world models, demonstrating their mutual enhancement. We evaluate RynnVLA-002 in both simulation and real-world robot tasks. RynnVLA-002 achieves a 97.4% success rate on the LIBERO simulation benchmark without pretraining, while in real-world LeRobot experiments, its integrated world model boosts the overall success rate by 50%.
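To make the idea of joint learning concrete, the sketch below shows one way the two objectives described in the abstract could be combined: a shared backbone feeds an action head (the VLA role, predicting the next action from an observation) and a dynamics head (the world-model role, predicting the next observation from the current observation and action), trained with a single joint loss. This is a minimal, illustrative sketch only, not the authors' implementation: RynnVLA-002 operates on image and language tokens, and all module names, dimensions, and the simple MSE losses here are assumptions.

```python
# Illustrative sketch of a joint VLA + world-model objective.
# All names, dimensions, and losses are assumptions, not the paper's code.
import torch
import torch.nn as nn

class UnifiedVLAWorldModel(nn.Module):
    def __init__(self, obs_dim=512, act_dim=7, hidden=512):
        super().__init__()
        # Shared representation used by both heads.
        self.backbone = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU())
        # VLA head: predicts the next action from the current observation.
        self.action_head = nn.Linear(hidden, act_dim)
        # World-model head: predicts the next observation from the
        # current observation features and the executed action.
        self.dynamics_head = nn.Linear(hidden + act_dim, obs_dim)

    def forward(self, obs, action):
        feat = self.backbone(obs)
        pred_action = self.action_head(feat)
        pred_next_obs = self.dynamics_head(torch.cat([feat, action], dim=-1))
        return pred_action, pred_next_obs

def joint_loss(model, obs, action, next_obs, lam=1.0):
    # Joint objective: action imitation + future-state prediction,
    # so the two tasks shape the same shared representation.
    pred_action, pred_next_obs = model(obs, action)
    action_loss = nn.functional.mse_loss(pred_action, action)
    world_loss = nn.functional.mse_loss(pred_next_obs, next_obs)
    return action_loss + lam * world_loss

# Toy usage with random tensors standing in for encoded images and actions.
model = UnifiedVLAWorldModel()
obs, action, next_obs = torch.randn(8, 512), torch.randn(8, 7), torch.randn(8, 512)
loss = joint_loss(model, obs, action, next_obs)
loss.backward()
```

In this toy setup the demonstrated action serves both as the supervision target for the action head and as an input to the dynamics head, which is the sense in which the two components can reinforce each other during joint training.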