Green-VLA: Staged Vision-Language-Action Model for Generalist Robots
I. Apanasevich, M. Artemyev, R. Babakyan, P. Fedotova, D. Grankin, E. Kupryashin, A. Misailidi, D. Nerus, A. Nutalapati, G. Sidorov, I. Efremov, M. Gerasyov, D. Pikurov, Y. Senchenko, S. Davidenko, D. Kulikov, M. Sultankin, K. Askarbek, O. Shamanin, D. Statovoy, E. Zalyaev, I. Zorin
2026-02-03
Summary
This paper introduces Green-VLA, a new system designed to help robots understand instructions involving both vision (what they see) and language (what they're told to do), and then carry out those instructions with physical actions. It's built to work with different types of robots, not just one specific model.
What's the problem?
Getting robots to reliably follow complex instructions in the real world is really hard. Existing systems often struggle when moved to a new robot body or when faced with situations they haven't specifically been trained for. They also need a lot of data and careful tuning to work well, and it's difficult to make them safe and efficient over long tasks.
What's the solution?
The researchers created Green-VLA, which learns in five stages. First, it starts from existing vision-language models as a base. Then, it learns to connect what it 'sees' to what instructions mean. Next, it's trained on data from multiple robot types to become more adaptable. After that, it's fine-tuned for a specific robot. Finally, it uses reinforcement learning to improve its success rate and robustness. They also built a pipeline that processes about 3,000 hours of robot demonstration data, and a common action interface that lets a single policy control different kinds of robots (humanoids, mobile manipulators, and fixed-base arms). During operation, the system predicts how far along it is in a task, detects situations unlike those it was trained on, and guides itself to select the correct targets.
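To make the ordering concrete, here is a minimal sketch of the five-stage curriculum as an ordered enumeration. The stage codes and descriptions come from the abstract; the Stage class and the next_stage helper are hypothetical names used only for illustration, not part of the authors' code.

```python
from enum import Enum
from typing import Optional


class Stage(Enum):
    """Training stages, listed in the order the abstract gives them."""
    L0 = "foundational VLMs"
    L1 = "multimodal grounding"
    R0 = "multi-embodiment pretraining"
    R1 = "embodiment-specific adaptation"
    R2 = "reinforcement-learning policy alignment"


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after the last one.

    Hypothetical helper: Enum iteration preserves definition order,
    so the curriculum order is simply the order of the members above.
    """
    order = list(Stage)
    index = order.index(stage)
    return order[index + 1] if index + 1 < len(order) else None


if __name__ == "__main__":
    stage: Optional[Stage] = Stage.L0
    while stage is not None:
        print(f"{stage.name}: {stage.value}")
        stage = next_stage(stage)
```

Each stage starts from the model produced by the previous one, which is why the sketch models the curriculum as a strict sequence rather than a set of independent training runs.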
Why does it matter?
This work is important because it makes robots more versatile and easier to use. By creating a system that can generalize across different robot bodies and learn from a lot of data, it brings us closer to having robots that can reliably help with everyday tasks in complex, real-world environments. The improvements in safety and efficiency are also crucial for real-world deployment.
Abstract
We introduce Green-VLA, a staged Vision-Language-Action (VLA) framework for real-world deployment on the Green humanoid robot while maintaining generalization across diverse embodiments. Green-VLA follows a five-stage curriculum: (L0) foundational VLMs, (L1) multimodal grounding, (R0) multi-embodiment pretraining, (R1) embodiment-specific adaptation, and (R2) reinforcement-learning (RL) policy alignment. We couple a scalable data-processing pipeline (3,000 hours of demonstrations) with temporal alignment and quality filtering, and use a unified, embodiment-aware action interface enabling a single policy to control humanoids, mobile manipulators, and fixed-base arms. At inference, the VLA controller is enhanced with episode-progress prediction, out-of-distribution detection, and joint-prediction-based guidance to improve safety and the precision of target selection. Experiments on Simpler BRIDGE WidowX and CALVIN ABC-D, as well as real-robot evaluations, show strong generalization and performance gains from RL alignment in success rate, robustness, and long-horizon efficiency.
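The abstract does not say how the unified, embodiment-aware action interface is laid out. One common way to let a single policy drive robots with different degrees of freedom is a fixed-size action vector with named slots plus a validity mask; the sketch below illustrates that idea only. The slot names, slot widths, embodiment table, and the pack/unpack functions are all assumptions made for this example and are not taken from the paper.

```python
from typing import Dict, List, Tuple

# ASSUMPTION: slot names and widths are invented for illustration.
SLOTS: Dict[str, int] = {
    "base": 3,
    "left_arm": 7,
    "right_arm": 7,
    "left_gripper": 1,
    "right_gripper": 1,
}
ACTION_DIM = sum(SLOTS.values())

# Start offset of each slot inside the shared vector.
OFFSETS: Dict[str, int] = {}
_start = 0
for _name, _width in SLOTS.items():
    OFFSETS[_name] = _start
    _start += _width

# ASSUMPTION: which slots each embodiment type uses is also hypothetical.
EMBODIMENTS: Dict[str, List[str]] = {
    "fixed_base_arm": ["right_arm", "right_gripper"],
    "mobile_manipulator": ["base", "right_arm", "right_gripper"],
    "humanoid": ["left_arm", "right_arm", "left_gripper", "right_gripper"],
}


def pack(embodiment: str, parts: Dict[str, List[float]]) -> Tuple[List[float], List[int]]:
    """Write per-part commands into the shared vector; unused slots stay zero.

    Returns (vector, mask), where mask[i] == 1 marks entries this robot uses.
    """
    vector = [0.0] * ACTION_DIM
    mask = [0] * ACTION_DIM
    for name in EMBODIMENTS[embodiment]:
        values = parts[name]
        if len(values) != SLOTS[name]:
            raise ValueError(f"{name} expects {SLOTS[name]} values, got {len(values)}")
        start = OFFSETS[name]
        vector[start:start + len(values)] = values
        mask[start:start + len(values)] = [1] * len(values)
    return vector, mask


def unpack(embodiment: str, vector: List[float]) -> Dict[str, List[float]]:
    """Read back only the slots that belong to this embodiment."""
    return {
        name: vector[OFFSETS[name]:OFFSETS[name] + SLOTS[name]]
        for name in EMBODIMENTS[embodiment]
    }


if __name__ == "__main__":
    command = {"right_arm": [0.1] * 7, "right_gripper": [1.0]}
    vec, mask = pack("fixed_base_arm", command)
    print(len(vec), sum(mask))
    print(unpack("fixed_base_arm", vec))
```

With a layout like this, a training loss can be restricted to the masked entries, so demonstrations from a fixed-base arm and from a humanoid can share one output head without the unused slots contributing to the error.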