RLinf-VLA: A Unified and Efficient Framework for VLA+RL Training
Hongzhi Zang, Mingjie Wei, Si Xu, Yongji Wu, Zhen Guo, Yuanqing Wang, Hao Lin, Liangzhi Shi, Yuqing Xie, Zhexuan Xu, Zhihao Liu, Kang Chen, Wenhao Tang, Quanlu Zhang, Weinan Zhang, Chao Yu, Yu Wang
2025-10-09
Summary
This paper introduces a new system called RLinf-VLA, designed to help robots learn tasks by trying things out and getting feedback, rather than only being told what to do. The goal is to make it easier and faster to train robots with reinforcement learning, connecting what they 'see' and understand with what they 'do'.
What's the problem?
Currently, training robots to perform tasks involves either showing them exactly what to do (supervised learning) or letting them learn through trial and error (reinforcement learning). Supervised learning breaks down when conditions change, and reinforcement learning is difficult to set up; because there was no standard platform, it was also hard to compare different learning methods fairly. On top of that, it's difficult to efficiently manage all the computing resources needed for robots to learn in simulated environments.
What's the solution?
The researchers created RLinf-VLA, a system that provides a unified platform for training robots with reinforcement learning. It's designed to be flexible and efficient, especially on powerful machines with many GPUs. It cleverly schedules computing resources to speed up learning, and it lets researchers test different robot models, learning algorithms, and simulated environments within the same framework. The authors also share insights into what works best when applying reinforcement learning to robots.
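To make the "unified platform" idea concrete, here is a minimal, hypothetical sketch of how a single entry point might wire together a model, an RL algorithm, and a simulator chosen by name. None of these identifiers (MODEL_REGISTRY, TrainConfig, the stub classes, etc.) come from RLinf-VLA itself; they are illustrative assumptions about the registry pattern such a framework could use.

```python
from dataclasses import dataclass

# Registries mapping config strings to component classes. The stub
# classes below are placeholders; a real system would register actual
# VLA models, RL algorithms, and simulator wrappers here.
MODEL_REGISTRY: dict = {}
ALGO_REGISTRY: dict = {}
SIM_REGISTRY: dict = {}

def register(registry, name):
    def decorator(cls):
        registry[name] = cls
        return cls
    return decorator

@register(MODEL_REGISTRY, "openvla")
class OpenVLAStub:
    def act(self, obs):
        return "noop"  # placeholder action

@register(ALGO_REGISTRY, "ppo")
class PPOStub:
    def update(self, rollout):
        print(f"PPO update on {len(rollout)} transitions")

@register(SIM_REGISTRY, "maniskill")
class ManiSkillStub:
    def __init__(self):
        self.steps = 0
    def reset(self):
        self.steps = 0
        return "initial-obs"
    def step(self, action):
        self.steps += 1
        return "obs", 1.0, self.steps >= 8  # observation, reward, done

@dataclass
class TrainConfig:
    model: str = "openvla"   # could be "openvla-oft" once registered
    algo: str = "ppo"        # could be "grpo"
    sim: str = "maniskill"   # could be "libero"
    iterations: int = 2

def train(cfg: TrainConfig):
    # The unified entry point: components are looked up by name, so
    # swapping a model, algorithm, or simulator is a one-line config change.
    model = MODEL_REGISTRY[cfg.model]()
    algo = ALGO_REGISTRY[cfg.algo]()
    sim = SIM_REGISTRY[cfg.sim]()
    for _ in range(cfg.iterations):
        obs, rollout, done = sim.reset(), [], False
        while not done:
            action = model.act(obs)
            obs, reward, done = sim.step(action)
            rollout.append((obs, action, reward))
        algo.update(rollout)

if __name__ == "__main__":
    train(TrainConfig())
```

The design choice this illustrates is that a shared interface between models, algorithms, and simulators is what makes fair, apples-to-apples comparisons possible: every combination runs through the same training loop.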
Why it matters?
This work matters because it makes it much easier for researchers to develop and test new ways to train robots. By providing a standardized system, it should accelerate progress in 'embodied intelligence' – getting robots to understand and interact with the world around them. The authors also show that, on a real robot, policies trained with reinforcement learning generalize better than those trained with traditional supervised methods.
Abstract
Recent progress in vision and language foundation models has significantly advanced multimodal understanding, reasoning, and generation, inspiring a surge of interest in extending such capabilities to embodied settings through vision-language-action (VLA) models. Yet, most VLA models are still trained with supervised fine-tuning (SFT), which struggles to generalize under distribution shifts due to error accumulation. Reinforcement learning (RL) offers a promising alternative by directly optimizing task performance through interaction, but existing attempts remain fragmented and lack a unified platform for fair and systematic comparison across model architectures and algorithmic designs. To address this gap, we introduce RLinf-VLA, a unified and efficient framework for scalable RL training of VLA models. The system adopts a highly flexible resource allocation design that addresses the challenge of integrating rendering, training, and inference in RL+VLA training. In particular, for GPU-parallelized simulators, RLinf-VLA implements a novel hybrid fine-grained pipeline allocation mode, achieving a 1.61x-1.88x speedup in training. Through a unified interface, RLinf-VLA seamlessly supports diverse VLA architectures (e.g., OpenVLA, OpenVLA-OFT), multiple RL algorithms (e.g., PPO, GRPO), and various simulators (e.g., ManiSkill, LIBERO). In simulation, a unified model achieves 98.11% across 130 LIBERO tasks and 97.66% across 25 ManiSkill tasks. Beyond empirical performance, our study distills a set of best practices for applying RL to VLA training and sheds light on emerging patterns in this integration. Furthermore, we present preliminary deployment on a real-world Franka robot, where RL-trained policies exhibit stronger generalization than those trained with SFT. We envision RLinf-VLA as a foundation to accelerate and standardize research on embodied intelligence.
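The "hybrid fine-grained pipeline allocation" the abstract credits with the 1.61x-1.88x speedup is, at a high level, about overlapping simulation/rendering, policy inference, and training instead of running them one after another. The toy sketch below shows only that general producer-consumer overlap pattern; it is not RLinf-VLA's scheduler, which is GPU-aware and pipelines at a much finer granularity. The queue size, sleep times, and batch contents are arbitrary placeholders.

```python
import queue
import threading
import time

# Toy producer-consumer pipeline: rollout generation (simulation +
# policy inference) overlaps with training updates instead of running
# serially. A bounded queue provides backpressure between the stages.
rollout_queue: queue.Queue = queue.Queue(maxsize=2)

def rollout_worker(num_batches: int):
    for _ in range(num_batches):
        time.sleep(0.1)  # stand-in for rendering + inference on one batch
        rollout_queue.put([("obs", "action", 1.0)] * 32)
    rollout_queue.put(None)  # sentinel: no more batches

def trainer():
    while (batch := rollout_queue.get()) is not None:
        time.sleep(0.1)  # stand-in for one gradient update
        print(f"updated policy on {len(batch)} transitions")

producer = threading.Thread(target=rollout_worker, args=(4,))
producer.start()
trainer()       # each update runs while the next batch is being produced
producer.join()
```

With the two 0.1s stages overlapped, the four batches finish in roughly 0.5s rather than the 0.8s a serial schedule would take, which is the essence of why pipelining the stages speeds up RL training.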