From Spatial to Actions: Grounding Vision-Language-Action Model in Spatial Foundation Priors
Zhengshen Zhang, Hao Li, Yalun Dai, Zhengbang Zhu, Lei Zhou, Chenchen Liu, Dong Wang, Francis E. H. Tay, Sijin Chen, Ziwei Liu, Yuxiao Liu, Xinghang Li, Pan Zhou
2025-10-29
Summary
This paper introduces FALCON, a new approach that helps robots understand and interact with the 3D world around them using both visual information and language instructions.
What's the problem?
Current robots that follow instructions and perform actions in the real world often process images as flat 2D pictures, even though the world is 3D. This creates a gap in their spatial understanding, making it hard for them to adapt to new situations or accurately interpret instructions involving locations and spatial relationships. Existing attempts to add 3D understanding either need special equipment or don't provide detailed enough spatial information, which disrupts how the robot connects what it 'sees' with what it's told to do.
What's the solution?
The researchers developed FALCON, which feeds special 'spatial tokens' representing 3D information directly into the part of the robot's 'brain' that controls actions. These tokens are produced by models trained to understand 3D space from regular RGB images, and can be refined with depth or pose information when available, without retraining or rebuilding the whole system. Importantly, FALCON keeps this spatial information out of the parts of the system that understand language, so language processing stays clear and effective.
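The key design choice above can be sketched in a few lines: action queries attend to the vision-language features and, separately, to the spatial tokens, so the 3D priors never enter the language backbone. This is an illustrative toy sketch, not the paper's actual implementation; all shapes, names, and the single-head attention are assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, tokens, d):
    # Scaled dot-product cross-attention: queries (Tq, d) read from tokens (Tk, d).
    scores = queries @ tokens.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ tokens

rng = np.random.default_rng(0)
d = 8
vl_tokens = rng.normal(size=(4, d))       # features from the vision-language backbone
spatial_tokens = rng.normal(size=(6, d))  # 3D priors from a spatial foundation model (RGB-only)

# The action head queries attend to VL features and spatial tokens in
# separate streams; spatial tokens are never concatenated into the VL sequence.
action_query = rng.normal(size=(1, d))
ctx = cross_attend(action_query, vl_tokens, d)
ctx = ctx + cross_attend(action_query, spatial_tokens, d)  # spatial enhancement
print(ctx.shape)  # (1, 8) -- a real head would decode this into an action
```

Because the spatial tokens arrive only through this side channel, swapping in richer priors (e.g. fused depth or pose) changes what the tokens encode but not the head's interface.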
Why it matters?
FALCON significantly improves a robot's ability to perform tasks in both simulated environments and the real world. It works well even when things are messy or when instructions are given in different ways, and it's better at handling objects of different sizes and heights. This is a big step towards creating robots that can reliably navigate and interact with the complex 3D world around us.
Abstract
Existing vision-language-action (VLA) models act in the 3D real world but are typically built on 2D encoders, leaving a spatial reasoning gap that limits generalization and adaptability. Recent 3D integration techniques for VLAs either require specialized sensors and transfer poorly across modalities, or inject weak cues that lack geometry and degrade vision-language alignment. In this work, we introduce FALCON (From Spatial to Action), a novel paradigm that injects rich 3D spatial tokens into the action head. FALCON leverages spatial foundation models to deliver strong geometric priors from RGB alone, and includes an Embodied Spatial Model that can optionally fuse depth or pose for higher fidelity when available, without retraining or architectural changes. To preserve language reasoning, spatial tokens are consumed by a Spatial-Enhanced Action Head rather than being concatenated into the vision-language backbone. These designs enable FALCON to address limitations in spatial representation, modality transferability, and alignment. In comprehensive evaluations across three simulation benchmarks and eleven real-world tasks, our proposed FALCON achieves state-of-the-art performance, consistently surpasses competitive baselines, and remains robust under clutter, spatial-prompt conditioning, and variations in object scale and height.