Reinforcing Action Policies by Prophesying

Jiahui Zhang, Ze Huang, Chun Gu, Zipei Ma, Li Zhang

2025-11-27

Summary

This paper introduces a new method, called ProphRL, for improving how robots learn to follow instructions involving both vision and language. It focuses on making robots more reliable and adaptable when performing tasks in the real world, even if the situation is slightly different from what they were originally trained on.

What's the problem?

Currently, robots learning from demonstrations (like watching someone do a task) often struggle when faced with new situations, because they memorize the specific examples they saw instead of truly understanding the task. Reinforcement learning, where robots learn by trial and error, could help, but it is slow and expensive to run in the real world, and building accurate robot simulators by hand is difficult. In short, existing methods are neither data-efficient nor stable enough for real-world robot learning.

What's the solution?

The researchers developed ProphRL, which has three main parts. First, they created 'Prophet,' a system that learns to predict what will happen when a robot takes a certain action. It is trained on a large amount of diverse robot data, which lets it adapt to new robots, objects, and environments from just a few examples. Second, they adapted a reinforcement learning technique called Flow-GRPO to work with the flow-based way these robots generate actions, calling the result FA-GRPO. Finally, they introduced 'FlowScale,' a method that rebalances the learning signal across the steps of action generation to make training more stable. Prophet acts like a simulator, and the improved reinforcement learning helps the robot learn the best actions within that simulated environment.
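GRPO-style methods score each sampled action sequence by how its reward compares to other rollouts of the same task, rather than against a learned value function. As a hedged illustration (this is not the authors' code, and the function name is hypothetical), a minimal group-relative advantage computation might look like:

```python
def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: normalize each rollout's reward against
    the mean and standard deviation of its group of rollouts, all
    sampled for the same task/prompt."""
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    # Rollouts better than the group average get positive advantage.
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four simulated rollouts of one task inside the learned
# world model, each scored by a binary task-success reward.
advs = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
```

Advantages computed this way sum to roughly zero within a group, so the policy is pushed toward the rollouts that beat their peers, without needing a separate critic.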

Why it matters?

This work is important because it makes robot learning more practical and efficient. By combining a learned world model (Prophet) with a refined reinforcement learning approach, robots can learn complex tasks with less real-world experimentation and adapt more easily to changing conditions. The experiments showed significant improvements in task success, both in simulated environments and, crucially, on actual robots, paving the way for more reliable and versatile robotic assistants.

Abstract

Vision-Language-Action (VLA) policies excel in aligning language, perception, and robot control. However, most VLAs are trained purely by imitation, which overfits to demonstrations and is brittle under distribution shift. Reinforcement learning (RL) directly optimizes task reward and thus addresses this misalignment, but real-robot interaction is expensive and conventional simulators are hard to engineer and transfer. We address both data efficiency and optimization stability in VLA post-training via a learned world model and an RL procedure tailored to flow-based action heads. Specifically, we introduce Prophet, a unified action-to-video robot actuation model pretrained across large-scale, heterogeneous robot data to learn reusable action-outcome dynamics. It is able to few-shot adapt to new robots, objects, and environments, yielding a rollout-ready simulator. Upon Prophet, we reinforce action policies with Flow-action-GRPO (FA-GRPO), which adapts Flow-GRPO to operate on VLA actions, and with FlowScale, a stepwise reweighting that rescales per-step gradients in the flow head. Together, Prophet, FA-GRPO, and FlowScale constitute ProphRL, a practical, data- and compute-efficient path to VLA post-training. Experiments show 5-17% success gains on public benchmarks and 24-30% gains on real robots across different VLA variants.
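The abstract describes FlowScale as a stepwise reweighting that rescales per-step gradients in the flow head. One plausible form, sketched here purely as an assumption (the paper's actual weighting formula is not given in this summary, and all names are hypothetical), is a weighted combination of the per-step loss terms produced along the flow's integration steps:

```python
def stepwise_weighted_loss(per_step_losses, weights=None):
    """Combine per-step flow losses with stepwise weights, so that
    different integration steps of the flow-based action head can
    contribute differently to the policy gradient.
    Uniform weights recover a plain average of the step losses."""
    n = len(per_step_losses)
    if weights is None:
        weights = [1.0] * n  # uniform fallback
    total_w = sum(weights)
    # Normalize so the overall loss scale is invariant to step count.
    return sum(w * l for w, l in zip(weights, per_step_losses)) / total_w

# Example: three flow steps, with later steps weighted more heavily.
loss = stepwise_weighted_loss([0.4, 0.2, 0.1], weights=[1.0, 2.0, 3.0])
```

The design question such a scheme answers is which part of the action-generation trajectory should dominate the gradient; a uniform scheme treats all steps equally, while a skewed one emphasizes the steps that most affect the final action.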