
π_RL: Online RL Fine-tuning for Flow-based Vision-Language-Action Models

Kang Chen, Zhihao Liu, Tonghe Zhang, Zhen Guo, Si Xu, Hao Lin, Hongzhi Zang, Quanlu Zhang, Zhaofei Yu, Guoliang Fan, Tiejun Huang, Yu Wang, Chao Yu

2025-11-03


Summary

This paper introduces π_RL, a framework for fine-tuning flow-based vision-language-action (VLA) models with online reinforcement learning. It improves how robots learn to act from what they see and the instructions they are given, ultimately making them better at performing tasks in the real world.

What's the problem?

Teaching robots to follow instructions usually requires a lot of expert demonstration data. Researchers want to use reinforcement learning, where robots improve through trial and error, to reduce this data burden, but it has been difficult to apply RL to a popular class of models called flow-based VLAs (such as pi_0 and pi_{0.5}). These models generate actions through an iterative denoising process, which makes it intractable to compute how likely a given action was, and that likelihood is exactly what standard RL algorithms need in order to improve the policy.
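To see why the likelihood is intractable, consider a minimal sketch of flow-based action generation (a toy stand-in, not the paper's actual pi_0 network): each denoising step is a deterministic Euler update, so every transition is a delta function and there is no density to plug into a policy-gradient objective.

```python
import numpy as np

def euler_denoise(velocity_fn, x_noise, num_steps=10):
    """Deterministic Euler integration of a flow-matching ODE.

    Each step is a deterministic map x <- x + v(x, t) * dt, so the
    per-step "policy" is a delta function: there is no tractable
    log-likelihood log pi(a | s) for RL. (Toy sketch; a real
    flow-based VLA conditions the velocity network on images
    and language.)
    """
    x = x_noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt
        x = x + velocity_fn(x, t) * dt  # deterministic update, no randomness
    return x

# Toy velocity field pushing samples toward the origin.
action = euler_denoise(lambda x, t: -x, np.ones(4))
```

Running the same noise sample twice yields exactly the same action, which is the crux of the problem the paper addresses.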

What's the solution?

The researchers built π_RL, an open-source framework for fine-tuning flow-based VLAs with online RL in parallel simulated environments. It implements two algorithms. The first, Flow-Noise, treats the denoising process as a discrete-time decision process and adds a learnable noise network, so the exact likelihood of each action can be computed. The second, Flow-SDE, combines the denoising steps with the robot's interaction with the environment in a two-layer decision process, converting the deterministic denoising ODE into a stochastic SDE to make exploration efficient. They evaluated the system on the LIBERO and ManiSkill benchmarks, training in hundreds of parallel environments across thousands of tasks.
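The Flow-Noise idea can be sketched as follows (our reading of the summary, not the paper's actual implementation): inject Gaussian noise with a learnable scale at every denoising step, so each step becomes a Gaussian transition whose exact log-likelihood can be summed across steps. The `sigma_fn` below is a hypothetical stand-in for the paper's learnable noise network.

```python
import numpy as np

def noisy_denoise_logprob(velocity_fn, sigma_fn, x_noise, num_steps=10, rng=None):
    """Sketch of a Flow-Noise-style stochastic denoiser.

    Each denoising step samples from N(mean, sigma_k^2 I), so the
    chain is a discrete-time MDP with an exact per-step Gaussian
    log-likelihood -- the quantity standard RL algorithms need.
    """
    rng = rng or np.random.default_rng(0)
    x, dt, logp = x_noise, 1.0 / num_steps, 0.0
    for k in range(num_steps):
        mean = x + velocity_fn(x, k * dt) * dt   # deterministic drift
        sigma = sigma_fn(k)                      # learnable per-step noise scale
        eps = rng.standard_normal(x.shape)
        x = mean + sigma * eps                   # stochastic transition
        # Exact log N(x; mean, sigma^2 I), summed over action dimensions.
        logp += np.sum(-0.5 * eps**2 - np.log(sigma) - 0.5 * np.log(2 * np.pi))
    return x, logp
```

Because each transition now has a density, the summed `logp` can serve as the action log-likelihood in a policy-gradient update.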

Why it matters?

This work is important because it makes reinforcement learning practical for flow-based VLAs, which were previously hard to improve beyond supervised fine-tuning. The resulting models succeed far more often and generalize better than their supervised-only counterparts: on LIBERO, for example, RL fine-tuning lifts pi_0 from a 57.6% to a 97.6% success rate. This means robots can potentially become more capable and adaptable in real-world scenarios while requiring less manually collected demonstration data.

Abstract

Vision-Language-Action (VLA) models enable robots to understand and perform complex tasks from multimodal input. Although recent work explores using reinforcement learning (RL) to automate the laborious data collection process in scaling supervised fine-tuning (SFT), applying large-scale RL to flow-based VLAs (e.g., pi_0, pi_{0.5}) remains challenging due to intractable action log-likelihoods from iterative denoising. We address this challenge with pi_{RL}, an open-source framework for training flow-based VLAs in parallel simulation. pi_{RL} implements two RL algorithms: (1) Flow-Noise models the denoising process as a discrete-time MDP with a learnable noise network for exact log-likelihood computation. (2) Flow-SDE integrates denoising with agent-environment interaction, formulating a two-layer MDP that employs ODE-to-SDE conversion for efficient RL exploration. We evaluate pi_{RL} on LIBERO and ManiSkill benchmarks. On LIBERO, pi_{RL} boosts few-shot SFT models pi_0 and pi_{0.5} from 57.6% to 97.6% and from 77.1% to 98.3%, respectively. In ManiSkill, we train pi_{RL} in 320 parallel environments, improving pi_0 from 41.6% to 85.7% and pi_{0.5} from 40.0% to 84.8% across 4352 pick-and-place tasks, demonstrating scalable multitask RL under heterogeneous simulation. Overall, pi_{RL} achieves significant performance gains and stronger generalization over SFT models, validating the effectiveness of online RL for flow-based VLAs.
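The ODE-to-SDE conversion mentioned in the abstract can be sketched with a standard identity from the diffusion-model literature: a probability-flow ODE dx = v(x,t)dt can be replaced by an SDE with the same marginals, dx = [v(x,t) + 0.5 g(t)^2 s(x,t)] dt + g(t) dW, where s is the score function. The diffusion term turns each step into a Gaussian transition, enabling exploration. Below is a minimal Euler-Maruyama step under these assumptions; `score_fn` is hypothetical here (in flow matching it can be derived from the velocity field), and this is not the paper's actual implementation.

```python
import numpy as np

def sde_denoise_step(x, t, velocity_fn, score_fn, g, dt, rng):
    """One Euler-Maruyama step of the SDE matching a probability-flow ODE.

    drift = v(x,t) + 0.5 * g^2 * score(x,t)   (same marginals as the ODE)
    noise = g * sqrt(dt) * N(0, I)            (stochastic, explorable)
    """
    drift = velocity_fn(x, t) + 0.5 * g**2 * score_fn(x, t)
    noise = g * np.sqrt(dt) * rng.standard_normal(x.shape)
    return x + drift * dt + noise
```

Because the per-step transition is Gaussian, its log-likelihood is exact, which is what allows RL to be layered on top of the denoising chain.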