SimpleVLA-RL: Scaling VLA Training via Reinforcement Learning

Haozhan Li, Yuxin Zuo, Jiale Yu, Yuhao Zhang, Zhaohui Yang, Kaiyan Zhang, Xuekai Zhu, Yuchen Zhang, Tianxing Chen, Ganqu Cui, Dehui Wang, Dingxiang Luo, Yuchen Fan, Youbang Sun, Jia Zeng, Jiangmiao Pang, Shanghang Zhang, Yu Wang, Yao Mu, Bowen Zhou, Ning Ding

2025-09-12

Summary

This paper explores how to make robots better at doing things based on both what they see and what instructions they're given, using a type of artificial intelligence called Vision-Language-Action (VLA) models.

What's the problem?

Currently, teaching these VLA models is difficult because it requires huge amounts of example data showing humans controlling robots, which is expensive and time-consuming to collect. These models also often struggle when asked to do something even slightly different from what they were trained on – they don't generalize well to new situations.

What's the solution?

The researchers developed a new system called SimpleVLA-RL that trains the VLA model with reinforcement learning. Instead of relying on tons of human examples, the robot learns by trying things out and receiving 'rewards' when it succeeds. They made this reinforcement-learning process more efficient by improving how the robot explores different actions, how many simulated environments can run in parallel, and how the learning signal (the training loss) is computed. Applied to an existing VLA model called OpenVLA-OFT, this approach outperformed previous methods on several robotic tasks.
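The core idea of learning from trial-and-error with sparse success rewards can be illustrated with a toy REINFORCE loop. This is only a minimal sketch: the two-action policy, the "succeed only if action 1 is chosen at every step" reward rule, and all names here are illustrative assumptions, not the paper's actual SimpleVLA-RL implementation.

```python
import numpy as np

def rollout(policy_probs, rng, horizon=5):
    """Sample a trajectory of discrete actions from the current policy.
    Toy stand-in for VLA trajectory sampling: the task 'succeeds' only if
    the policy picks action 1 at every step (an illustrative reward rule)."""
    actions = [rng.choice(len(policy_probs), p=policy_probs) for _ in range(horizon)]
    reward = 1.0 if all(a == 1 for a in actions) else 0.0  # sparse outcome reward
    return actions, reward

def reinforce_update(logits, actions, reward, baseline, lr=0.5):
    """One REINFORCE step: scale the log-prob gradient of the sampled
    actions by the advantage (reward minus a moving baseline)."""
    probs = np.exp(logits) / np.exp(logits).sum()
    grad = np.zeros_like(logits)
    for a in actions:
        one_hot = np.zeros_like(logits)
        one_hot[a] = 1.0
        grad += one_hot - probs  # gradient of log pi(a)
    return logits + lr * (reward - baseline) * grad

rng = np.random.default_rng(0)
logits = np.zeros(2)  # uniform policy over two actions to start
rewards = []
for step in range(300):
    probs = np.exp(logits) / np.exp(logits).sum()
    actions, r = rollout(probs, rng)
    baseline = np.mean(rewards[-20:]) if rewards else 0.0
    logits = reinforce_update(logits, actions, r, baseline)
    rewards.append(r)

# success rate over the last 50 rollouts; should rise well above chance
print(np.mean(rewards[-50:]))
```

The key property this sketch shares with outcome-reward RL for VLA models is that no per-step supervision is needed: only a binary task-success signal at the end of each trajectory drives learning.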

Why it matters?

This work is important because it reduces the need for massive datasets of human demonstrations, making it easier and cheaper to train robots. It also makes robots more adaptable, so they perform well even on new or unexpected tasks, and it lets them discover ways of solving problems that were never explicitly demonstrated – the authors call one such emergent behavior 'pushcut'.

Abstract

Vision-Language-Action (VLA) models have recently emerged as a powerful paradigm for robotic manipulation. Despite substantial progress enabled by large-scale pretraining and supervised fine-tuning (SFT), these models face two fundamental challenges: (i) the scarcity and high cost of large-scale human-operated robotic trajectories required for SFT scaling, and (ii) limited generalization to tasks involving distribution shift. Recent breakthroughs in Large Reasoning Models (LRMs) demonstrate that reinforcement learning (RL) can dramatically enhance step-by-step reasoning capabilities, raising a natural question: Can RL similarly improve the long-horizon step-by-step action planning of VLA? In this work, we introduce SimpleVLA-RL, an efficient RL framework tailored for VLA models. Building upon veRL, we introduce VLA-specific trajectory sampling, scalable parallelization, multi-environment rendering, and optimized loss computation. When applied to OpenVLA-OFT, SimpleVLA-RL achieves SoTA performance on LIBERO and even outperforms pi_0 on RoboTwin 1.0 & 2.0 with the exploration-enhancing strategies we introduce. SimpleVLA-RL not only reduces dependence on large-scale data and enables robust generalization, but also remarkably surpasses SFT in real-world tasks. Moreover, we identify a novel phenomenon "pushcut" during RL training, wherein the policy discovers previously unseen patterns beyond those seen in the previous training process. Github: https://github.com/PRIME-RL/SimpleVLA-RL
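The abstract's "scalable parallelization, multi-environment rendering" idea can be sketched as a vectorized rollout collector that steps a batch of environments in lockstep. Everything here (the toy environment, observation shapes, and the sparse end-of-episode reward rule) is an assumption for illustration; it is not the veRL or OpenVLA-OFT API.

```python
import numpy as np

class ToyVecEnv:
    """A batch of n_envs trivial environments stepped together,
    a stand-in for parallel simulator instances (illustrative only)."""
    def __init__(self, n_envs, horizon=4, seed=0):
        self.n_envs, self.horizon = n_envs, horizon
        self.rng = np.random.default_rng(seed)
        self.t = 0

    def reset(self):
        self.t = 0
        return self.rng.normal(size=(self.n_envs, 3))  # dummy observations

    def step(self, actions):
        self.t += 1
        obs = self.rng.normal(size=(self.n_envs, 3))
        done = self.t >= self.horizon
        # sparse outcome reward, granted only at the end of the episode
        reward = (actions == 1).astype(float) if done else np.zeros(self.n_envs)
        return obs, reward, done

def collect_batch(env, policy):
    """Roll out every environment in the batch once; return per-env returns.
    With real simulators, this is where parallel rendering pays off."""
    obs = env.reset()
    returns = np.zeros(env.n_envs)
    done = False
    while not done:
        actions = policy(obs)
        obs, reward, done = env.step(actions)
        returns += reward
    return returns

env = ToyVecEnv(n_envs=8)
action_rng = np.random.default_rng(1)
random_policy = lambda obs: action_rng.integers(0, 2, size=len(obs))
rets = collect_batch(env, random_policy)
print(rets.shape)  # one scalar return per parallel environment
```

Batching rollouts this way is what makes on-policy RL practical at scale: each policy update can consume many trajectories gathered in a single parallel sweep rather than one at a time.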