Do World Action Models Generalize Better than VLAs? A Robustness Study

Zhanguang Zhang, Zhiyuan Li, Behnam Rahmati, Rui Heng Yang, Yintao Ma, Amir Rasouli, Sajjad Pakdamansavoji, Yangzheng Wu, Lingfeng Zhang, Tongtong Cao, Feng Wen, Xinyu Wang, Xingyue Quan, Yingxue Zhang

2026-04-06

Summary

This paper compares two approaches to helping robots decide what to do: Vision-Language-Action models (VLAs) and World Action Models (WAMs). Both aim to help robots understand instructions and act in the real world, but they take different approaches.

What's the problem?

Robots struggle with real-world tasks because they need to understand what is happening *now* and also predict what will happen if they act. Existing systems like VLAs perform well but are limited: they work reliably only in situations similar to their training data, are easily confused by changes in their surroundings, and need large amounts of task-specific training data to handle new situations.

What's the solution?

The researchers directly compared VLAs and WAMs on standard robot manipulation benchmarks, deliberately making the tasks harder by perturbing the visual appearance of scenes and the wording of instructions. WAMs are built on world models that learn to predict how scenes evolve in video, and then use that learned dynamics knowledge to decide what actions a robot should take. The study found that WAMs were generally more reliable and adaptable than VLAs, even without needing as much robot-specific training data. Hybrid approaches that combine elements of both showed intermediate improvement, but WAMs performed best overall.

Why it matters?

This research is important because it points to a more robust way of building robots that can handle the unpredictable nature of the real world. WAMs appear to generalize better to new situations, meaning robots built on this approach could be useful across a wider range of environments and tasks without constant retraining.

Abstract

Robot action planning in the real world is challenging as it requires not only understanding the current state of the environment but also predicting how it will evolve in response to actions. Vision-language-action (VLA) models, which repurpose large-scale vision-language models for robot action generation using action experts, have achieved notable success across a variety of robotic tasks. Nevertheless, their performance remains constrained by the scope of their training data, exhibiting limited generalization to unseen scenarios and vulnerability to diverse contextual perturbations. More recently, world models have been revisited as an alternative to VLAs. These models, referred to as world action models (WAMs), are built upon world models that are trained on large corpora of video data to predict future states. With minor adaptations, their latent representations can be decoded into robot actions. It has been suggested that their explicit dynamics-prediction capacity, combined with spatiotemporal priors acquired from web-scale video pretraining, enables WAMs to generalize more effectively than VLAs. In this paper, we conduct a comparative study of prominent state-of-the-art VLA policies and recently released WAMs. We evaluate their performance on the LIBERO-Plus and RoboTwin 2.0-Plus benchmarks under various visual and language perturbations. Our results show that WAMs achieve strong robustness, with LingBot-VA reaching a 74.2% success rate on RoboTwin 2.0-Plus and Cosmos-Policy achieving 82.2% on LIBERO-Plus. While VLAs such as π_{0.5} can achieve comparable robustness on certain tasks, they typically require extensive training with diverse robotic datasets and varied learning objectives. Hybrid approaches that partially incorporate video-based dynamics learning exhibit intermediate robustness, highlighting the importance of how video priors are integrated.