WMPO: World Model-based Policy Optimization for Vision-Language-Action Models

Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, Song Guo

2025-11-13

Summary

This paper introduces a new method called World-Model-based Policy Optimization, or WMPO, to help robots learn manipulation tasks more effectively using both vision (what they see) and language (instructions). It aims to make robots better at improving on their own, even recovering when they make mistakes.

What's the problem?

Current robots that use vision and language often learn by watching humans demonstrate tasks. This is limiting because they can't easily learn from their own mistakes or correct themselves. While reinforcement learning allows robots to learn through trial and error, it usually requires a huge amount of practice in the real world, which is time-consuming and can even damage the robot. Basically, getting a robot to learn complex tasks without tons of human help or breaking itself is really hard.

What's the solution?

The researchers developed WMPO, which lets a robot improve its policy *without* constantly interacting with the real world. Instead, it trains inside a 'world model': a learned simulator that predicts what the robot would observe if it took certain actions. Unlike the widely used latent world models that predict compressed abstract states, WMPO predicts actual pixels, so the 'imagined' trajectories stay aligned with the visual features the robot's policy already learned from web-scale image pretraining. Crucially, training on these imagined rollouts lets WMPO run on-policy GRPO (Group Relative Policy Optimization), which the authors find gives stronger performance than the off-policy methods commonly used instead.
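To make the on-policy GRPO idea concrete, here is a minimal toy sketch of the training loop described above: sample a group of trajectories inside a (here, stand-in) world model, compute group-relative advantages by normalizing returns within the group, and apply a policy-gradient update. The `world_model_rollout` function and the reward are hypothetical placeholders, not the paper's actual pixel-based world model.

```python
import numpy as np

rng = np.random.default_rng(0)

def world_model_rollout(policy_logits, horizon=5):
    """Toy stand-in for the world model: sample actions from the policy
    and return them with a scalar return. Here action 0 is arbitrarily
    the 'good' action the imagined task rewards (an assumption)."""
    probs = np.exp(policy_logits) / np.exp(policy_logits).sum()
    actions = rng.choice(len(probs), size=horizon, p=probs)
    return actions, float((actions == 0).mean())

def grpo_advantages(returns):
    """Group-relative advantages: normalize returns within the group,
    so no learned value function (critic) is needed."""
    r = np.asarray(returns, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_step(policy_logits, group_size=8, lr=0.5, horizon=5):
    """One on-policy GRPO update using only imagined rollouts."""
    rollouts = [world_model_rollout(policy_logits, horizon)
                for _ in range(group_size)]
    advs = grpo_advantages([ret for _, ret in rollouts])
    probs = np.exp(policy_logits) / np.exp(policy_logits).sum()
    grad = np.zeros_like(policy_logits)
    for (actions, _), adv in zip(rollouts, advs):
        for a in actions:
            # Gradient of log softmax probability of the taken action.
            grad += adv * (np.eye(len(policy_logits))[a] - probs)
    return policy_logits + lr * grad / group_size

logits = np.zeros(3)       # uniform policy over 3 actions
for _ in range(50):
    logits = grpo_step(logits)
```

Because every rollout is sampled from the current policy inside the world model, the update stays on-policy while never touching the real environment, which is the core efficiency argument of WMPO.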

Why it matters?

This work is important because it makes robot learning much more efficient. Robots can learn complex tasks with less real-world practice, and they can even start to show behaviors like self-correction, meaning they can fix their own mistakes. It also means robots can continue learning over time and adapt to new situations, making them more versatile and useful in the long run.

Abstract

Vision-Language-Action (VLA) models have shown strong potential for general-purpose robotic manipulation, but their reliance on expert demonstrations limits their ability to learn from failures and perform self-corrections. Reinforcement learning (RL) addresses these limitations through self-improving interactions with the physical environment, but suffers from high sample complexity on real robots. We introduce World-Model-based Policy Optimization (WMPO), a principled framework for on-policy VLA RL without interacting with the real environment. In contrast to widely used latent world models, WMPO focuses on pixel-based predictions that align the "imagined" trajectories with the VLA features pretrained with web-scale images. Crucially, WMPO enables the policy to perform on-policy GRPO, which provides stronger performance than the often-used off-policy methods. Extensive experiments in both simulation and real-robot settings demonstrate that WMPO (i) substantially improves sample efficiency, (ii) achieves stronger overall performance, (iii) exhibits emergent behaviors such as self-correction, and (iv) demonstrates robust generalization and lifelong learning capabilities.