RLinf-Co: Reinforcement Learning-Based Sim-Real Co-Training for VLA Models
Liangzhi Shi, Shuaihang Chen, Feng Gao, Yinuo Chen, Kang Chen, Tonghe Zhang, Hongzhi Zhang, Weinan Zhang, Chao Yu, Yu Wang
2026-02-16
Summary
This paper explores a better way to train robots to understand and follow instructions involving both vision and language, using computer simulations to help. It focuses on improving how robots learn to perform tasks in the real world by first practicing in a simulated environment.
What's the problem?
Currently, training vision-language-action (VLA) models for robots often requires a large number of expensive, real-world demonstrations. While simulation can help reduce this cost, existing methods treat it as a static source of examples: they don't let the robot actively *learn* through trial and error inside the simulator, which limits how well the robot performs and adapts in the real world. Simply showing the robot more examples isn't enough for strong real-world performance.
What's the solution?
The researchers developed a new framework called RL-Co, short for reinforcement-learning-based sim-real co-training. It works in two stages: first, the robot gets a basic understanding of the task by learning from both real and simulated demonstrations. Then, the robot continues to learn and improve by practicing the task repeatedly in simulation, using reinforcement learning. To prevent the robot from 'forgetting' what it learned from the real world, a small amount of real-world data is also mixed into the simulation training as an auxiliary supervised signal. This helps the robot maintain its real-world skills while benefiting from extensive practice in simulation.
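As a rough illustration of the second stage, here is a minimal PyTorch-style sketch. The function, field names, and the `log_prob` policy interface are our own assumptions for illustration, not taken from the paper: each update combines a policy-gradient loss on simulated rollouts with a weighted behavior-cloning loss on real demonstrations.

```python
import torch

# Hypothetical sketch of the second (RL co-training) stage.
# Names are illustrative; the actual RL algorithm and loss weights
# used in the paper may differ.

def co_training_step(policy, optimizer, sim_batch, real_batch, lam=0.1):
    # Policy-gradient loss on simulated rollouts
    # (advantages assumed to be precomputed for the batch).
    logp_sim = policy.log_prob(sim_batch["obs"], sim_batch["actions"])
    rl_loss = -(logp_sim * sim_batch["advantages"]).mean()

    # Auxiliary behavior-cloning loss on real-robot demonstrations,
    # anchoring the policy to mitigate forgetting of real-world skills.
    logp_real = policy.log_prob(real_batch["obs"], real_batch["actions"])
    bc_loss = -logp_real.mean()

    # lam is an illustrative weight on the real-data anchor term.
    loss = rl_loss + lam * bc_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```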
Why it matters?
This research is important because it provides a more effective and practical way to train robots. By combining simulation with reinforcement learning and real-world data, the robots learn faster, perform better, and can handle new variations of tasks more easily. This makes it more feasible to deploy robots in real-world situations, reducing the need for massive amounts of expensive real-world training data and improving their overall usefulness.
Abstract
Simulation offers a scalable and low-cost way to enrich vision-language-action (VLA) training, reducing reliance on expensive real-robot demonstrations. However, most sim-real co-training methods rely on supervised fine-tuning (SFT), which treats simulation as a static source of demonstrations and does not exploit large-scale closed-loop interaction. Consequently, real-world gains and generalization are often limited. In this paper, we propose an RL-based sim-real Co-training (RL-Co) framework that leverages interactive simulation while preserving real-world capabilities. Our method follows a generic two-stage design: we first warm-start the policy with SFT on a mixture of real and simulated demonstrations, then fine-tune it with reinforcement learning in simulation while adding an auxiliary supervised loss on real-world data to anchor the policy and mitigate catastrophic forgetting. We evaluate our framework on four real-world tabletop manipulation tasks using two representative VLA architectures, OpenVLA and π_0.5, and observe consistent improvements over real-only fine-tuning and SFT-based co-training, including +24% real-world success on OpenVLA and +20% on π_0.5. Beyond higher success rates, RL co-training yields stronger generalization to unseen task variations and substantially improved real-world data efficiency, providing a practical and scalable pathway for leveraging simulation to enhance real-robot deployment.
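For readers who prefer a formula, the second-stage objective described above can be summarized roughly as follows (our notation, not the paper's): an RL term computed on simulated interaction plus a weighted supervised term on real-world demonstrations.

```latex
\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{RL}}^{\mathrm{sim}}(\theta)
  \;+\; \lambda \, \mathbb{E}_{(o,a) \sim \mathcal{D}_{\mathrm{real}}}\!\left[ -\log \pi_\theta(a \mid o) \right]
```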