RynnVLA-001: Using Human Demonstrations to Improve Robot Manipulation
Yuming Jiang, Siteng Huang, Shengke Xue, Yaxi Zhao, Jun Cen, Sicong Leng, Kehan Li, Jiayan Guo, Kexiang Wang, Mingxiu Chen, Fan Wang, Deli Zhao, Xin Li
2025-09-19
Summary
This paper introduces RynnVLA-001, a vision-language-action model that interprets images, videos, and language instructions and turns that understanding into robot actions. It is designed to help robots learn manipulation skills by watching videos of humans performing tasks and following written instructions.
What's the problem?
Teaching robots to perform complex tasks is hard because they must understand what they are seeing, interpret what is being asked of them, and then translate that understanding into physical actions. Existing models struggled to combine vision, language, and action effectively, especially when learning from real-world videos of people performing everyday tasks.
What's the solution?
The researchers used a two-stage training process. First, they pretrained the model on roughly 12 million ego-centric videos of people manipulating objects, teaching it to predict future video frames from an initial frame and a text instruction. Second, they extended this training so the model also predicts the trajectories of human keypoints, linking what the model sees to the motions being performed. They also built ActionVAE, a component that compresses sequences of robot actions into compact representations, making the model's action output simpler to predict. The complete system is called RynnVLA-001.
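To make the two-stage idea concrete, below is a minimal sketch of how the two pretraining objectives could be wired together, assuming a shared sequence backbone that consumes fused frame-and-instruction tokens. All module names, tensor shapes, and loss weights are illustrative assumptions, not the paper's actual architecture.

```python
# Sketch of the two pretraining objectives, under the assumption that encoders
# elsewhere produce fused (frame + instruction) tokens of shape (B, T, d_model).
# Module names, vocab sizes, and loss weighting are illustrative only.
import torch
import torch.nn as nn

class TwoStagePretrainSketch(nn.Module):
    def __init__(self, d_model=512, frame_vocab=1024, num_keypoints=21):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        # Stage 1 head: logits over discrete visual tokens of future frames.
        self.frame_head = nn.Linear(d_model, frame_vocab)
        # Stage 2 head: 2D coordinates of keypoints along the future trajectory.
        self.traj_head = nn.Linear(d_model, num_keypoints * 2)

    def forward(self, fused_tokens):  # fused_tokens: (B, T, d_model)
        h = self.backbone(fused_tokens)
        return self.frame_head(h), self.traj_head(h)

def pretrain_loss(frame_logits, frame_targets, traj_pred, traj_targets, stage=2):
    # Stage 1 optimizes only future-frame prediction; stage 2 adds the keypoint
    # trajectory term so visual prediction is tied to action-relevant motion.
    frame_loss = nn.functional.cross_entropy(
        frame_logits.flatten(0, 1), frame_targets.flatten()
    )
    if stage == 1:
        return frame_loss
    traj_loss = nn.functional.l1_loss(traj_pred, traj_targets)
    return frame_loss + traj_loss
```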
Why does it matter?
This work matters because it demonstrates a more effective way to pretrain robot manipulation models. By learning from how humans actually perform tasks, and by grounding that learning in language instructions, robots can become more capable and adaptable in real-world settings. When fine-tuned on the same robot datasets, RynnVLA-001 outperformed state-of-the-art baselines, showing that this pretraining strategy is a significant step forward in robotics and AI.
Abstract
This paper presents RynnVLA-001, a vision-language-action (VLA) model built upon large-scale video generative pretraining from human demonstrations. We propose a novel two-stage pretraining methodology. The first stage, Ego-Centric Video Generative Pretraining, trains an Image-to-Video model on 12M ego-centric manipulation videos to predict future frames conditioned on an initial frame and a language instruction. The second stage, Human-Centric Trajectory-Aware Modeling, extends this by jointly predicting future keypoint trajectories, thereby effectively bridging visual frame prediction with action prediction. Furthermore, to enhance action representation, we propose ActionVAE, a variational autoencoder that compresses sequences of actions into compact latent embeddings, reducing the complexity of the VLA output space. When finetuned on the same downstream robotics datasets, RynnVLA-001 achieves superior performance over state-of-the-art baselines, demonstrating that the proposed pretraining strategy provides a more effective initialization for VLA models.
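As a rough illustration of the ActionVAE idea described in the abstract (compressing a chunk of low-level actions into a compact latent that the VLA model then predicts), here is a minimal variational-autoencoder sketch. The chunk length, action dimensionality, latent size, and KL weight are assumptions chosen for illustration, not values from the paper.

```python
# Minimal chunk-level action VAE in the spirit of ActionVAE.
# Dimensions (16-step chunks of 7-DoF actions, 32-d latent) are assumptions.
import torch
import torch.nn as nn

class ActionVAESketch(nn.Module):
    def __init__(self, action_dim=7, chunk_len=16, latent_dim=32, hidden=256):
        super().__init__()
        in_dim = action_dim * chunk_len
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.to_mu = nn.Linear(hidden, latent_dim)
        self.to_logvar = nn.Linear(hidden, latent_dim)
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, in_dim)
        )
        self.action_dim, self.chunk_len = action_dim, chunk_len

    def forward(self, actions):  # actions: (B, chunk_len, action_dim)
        h = self.encoder(actions.flatten(1))
        mu, logvar = self.to_mu(h), self.to_logvar(h)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterization
        recon = self.decoder(z).view(-1, self.chunk_len, self.action_dim)
        return recon, mu, logvar

def vae_loss(recon, actions, mu, logvar, beta=1e-3):
    # Reconstruction + KL regularization; beta is an assumed weighting.
    rec = nn.functional.mse_loss(recon, actions)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return rec + beta * kl
```

The point of this design is that the downstream policy only has to regress one compact latent per action chunk instead of every low-level action step, which shrinks the VLA output space; the decoder then recovers the full multi-step action sequence at execution time.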