VideoVLA: Video Generators Can Be Generalizable Robot Manipulators
Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, Baining Guo
2025-12-09
Summary
This research explores how to make robots better at handling new tasks, even ones they haven't been specifically trained on. It focuses on giving robots instructions in plain language and having them figure out what to do in unfamiliar situations.
What's the problem?
Current robots built on vision-language-action (VLA) models struggle to adapt to new tasks, objects, or environments. They perform well on what they've been trained on, but fall apart when faced with something unfamiliar. Essentially, they lack the ability to 'imagine' what will happen when they take an action, which limits their flexibility.
What's the solution?
The researchers created a system called VideoVLA. It uses powerful video generation models – the kind that can create realistic videos – and adapts them to help robots plan actions. Instead of just predicting *what* action to take, VideoVLA also predicts *what will happen* if the robot does that action, creating a sort of visual 'forecast'. This is all done by combining video, language instructions, and potential actions into one model.
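To make the dual-prediction idea concrete, here is a minimal sketch of what such an interface might look like. All names, shapes, and the stub logic are hypothetical illustrations, not from the paper; a real system would run the multi-modal Diffusion Transformer where the stub returns placeholder values.

```python
# Hypothetical sketch of a dual-prediction (action + imagined future) interface.
# Names and tensor shapes are illustrative assumptions, not from the paper.
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class DualPrediction:
    # A short chunk of actions, e.g. 7-DoF end-effector deltas per step.
    actions: List[List[float]]
    # One imagined future frame per action step (tiny 8x8 grids here).
    future_frames: List[List[List[float]]]

def predict(image: Optional[object], instruction: str, horizon: int = 4) -> DualPrediction:
    """Stand-in for the joint video/language/action model: given an image and
    a language instruction, return both an action sequence and the imagined
    visual outcome of executing it."""
    # Placeholder outputs of the right shape; a real model would generate
    # these jointly so the forecast and the actions stay consistent.
    actions = [[0.0] * 7 for _ in range(horizon)]
    future_frames = [[[0.0] * 8 for _ in range(8)] for _ in range(horizon)]
    return DualPrediction(actions=actions, future_frames=future_frames)

pred = predict(image=None, instruction="pick up the red block")
print(len(pred.actions), len(pred.future_frames))  # → 4 4
```

The key design point is that the two outputs come from one model call: the imagined frames act as a visual 'forecast' that can be checked against reality, rather than a separate world model bolted on after the fact.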
Why does it matter?
This work is important because it shows a new way to build robots that can truly generalize. By having robots 'imagine' the consequences of their actions, they can handle unexpected situations and learn new skills more easily. This is a big step towards robots that can operate effectively in the real world and even contribute to the development of more advanced artificial intelligence.
Abstract
Generalization in robot manipulation is essential for deploying robots in open-world environments and advancing toward artificial general intelligence. While recent Vision-Language-Action (VLA) models leverage large pre-trained understanding models for perception and instruction following, their ability to generalize to novel tasks, objects, and settings remains limited. In this work, we present VideoVLA, a simple approach that explores the potential of transforming large video generation models into robotic VLA manipulators. Given a language instruction and an image, VideoVLA predicts an action sequence as well as the future visual outcomes. Built on a multi-modal Diffusion Transformer, VideoVLA jointly models video, language, and action modalities, using pre-trained video generative models for joint visual and action forecasting. Our experiments show that high-quality imagined futures correlate with reliable action predictions and task success, highlighting the importance of visual imagination in manipulation. VideoVLA demonstrates strong generalization, including imitating other embodiments' skills and handling novel objects. This dual-prediction strategy - forecasting both actions and their visual consequences - explores a paradigm shift in robot learning and unlocks generalization capabilities in manipulation systems.