Spatial-Aware VLA Pretraining through Visual-Physical Alignment from Human Videos
Yicheng Feng, Wanpeng Zhang, Ye Wang, Hao Luo, Haoqi Yuan, Sipeng Zheng, Zongqing Lu
2025-12-16
Summary
This paper introduces a new way to train robots using both vision (what they see) and language (instructions), aiming to make them better at understanding and interacting with the 3D world around them.
What's the problem?
Currently, many robots are trained on 2D images, yet they need to *act* in a 3D environment. This creates a disconnect: the robot sees a flat picture but must perform actions in a space with depth and volume, making it hard to accurately connect what it sees to what it needs to do.
What's the solution?
The researchers developed a method called Spatial-Aware VLA Pretraining. Essentially, they showed the model many videos of humans interacting with objects and, importantly, also provided annotations of the objects' 3D locations and the 3D actions being performed. This helps the model learn to associate how things *look* in a 2D image with where they actually are and how to interact with them in 3D space. They built a specific model, VIPA-VLA, with a dual-encoder design that adds a dedicated 3D visual encoder alongside the standard semantic visual encoder.
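To make the dual-encoder idea concrete, here is a minimal sketch of how two feature streams can be fused into one policy output. This is a hypothetical illustration, not the paper's implementation: the encoders are reduced to plain linear projections, and all dimensions and weight names (`W_sem`, `W_3d`, `W_head`) are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (illustrative only, not from the paper).
D_IN, D_SEM, D_3D, D_ACT = 64, 32, 16, 7

# Stand-ins for the two encoders: the semantic (2D) visual encoder
# and the 3D-aware encoder, each reduced here to a linear projection.
W_sem = rng.standard_normal((D_IN, D_SEM))
W_3d = rng.standard_normal((D_IN, D_3D))
# Fusion head: concatenate both streams, then project to an action vector.
W_head = rng.standard_normal((D_SEM + D_3D, D_ACT))

def forward(obs):
    """Dual-encoder forward pass: semantic + 3D-aware features, fused."""
    f_sem = obs @ W_sem                       # semantic visual features
    f_3d = obs @ W_3d                         # 3D-aware spatial features
    return np.concatenate([f_sem, f_3d]) @ W_head

obs = rng.standard_normal(D_IN)
print(forward(obs).shape)  # (7,)
```

The key design point the sketch mirrors is that the 3D-aware features *augment* rather than replace the semantic ones: both streams are kept and combined before the action prediction.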
Why does it matter?
This work is important because it allows robots to better understand their surroundings and follow instructions more reliably. By bridging the gap between 2D vision and 3D action, robots can adapt more readily and perform real-world tasks with greater success, leading to more robust and broadly useful robotic systems.
Abstract
Vision-Language-Action (VLA) models provide a promising paradigm for robot learning by integrating visual perception with language-guided policy learning. However, most existing approaches rely on 2D visual inputs to perform actions in 3D physical environments, creating a significant gap between perception and action grounding. To bridge this gap, we propose a Spatial-Aware VLA Pretraining paradigm that performs explicit alignment between visual space and physical space during pretraining, enabling models to acquire 3D spatial understanding before robot policy learning. Starting from pretrained vision-language models, we leverage large-scale human demonstration videos to extract 3D visual and 3D action annotations, forming a new source of supervision that aligns 2D visual observations with 3D spatial reasoning. We instantiate this paradigm with VIPA-VLA, a dual-encoder architecture that incorporates a 3D visual encoder to augment semantic visual representations with 3D-aware features. When adapted to downstream robot tasks, VIPA-VLA achieves significantly improved grounding between 2D vision and 3D action, resulting in more robust and generalizable robotic policies.