Visual Spatial Tuning

Rui Yang, Ziyu Zhu, Yanwei Li, Jingjia Huang, Shen Yan, Siyuan Zhou, Zhe Liu, Xiangtai Li, Shuangye Li, Wenqian Wang, Yi Lin, Hengshuang Zhao

2025-11-10

Summary

This paper focuses on improving how well artificial intelligence systems, specifically Vision-Language Models (VLMs), understand and reason about spatial relationships – things like where objects are in relation to each other. It aims to give these models a more human-like ability to perceive and think about space.

What's the problem?

Current VLMs struggle with understanding spatial concepts as well as humans do. Previous attempts to improve this ability often involved adding complex components to the models, which made them slower and sometimes worse at other tasks. The core issue is enhancing spatial understanding *without* sacrificing the model’s overall performance.

What's the solution?

The researchers developed a framework called Visual Spatial Tuning (VST). It has two main parts: large datasets specifically designed to teach spatial skills (VST-P for perception and VST-R for reasoning) and a two-step training process. First, the model builds foundational spatial knowledge through supervised fine-tuning; then reinforcement learning refines its spatial reasoning abilities. This approach improves the existing model rather than adding new components.
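The two-step process above can be illustrated with a toy sketch: a supervised stage that pushes up the score of labeled answers, followed by a reinforcement stage that samples answers and reinforces them by reward. Everything here – the data, the reward, the update rules – is a hypothetical stand-in for illustration, not the paper's actual implementation.

```python
import random

random.seed(0)  # deterministic for the example

def sft_step(weights, example, lr=0.5):
    """Stage 1 (supervised fine-tuning): raise the score of the labeled answer."""
    question, answer = example
    weights[(question, answer)] = weights.get((question, answer), 0.0) + lr
    return weights

def rl_step(weights, question, candidates, reward_fn, lr=0.2):
    """Stage 2 (reinforcement learning): sample an answer, score it, reinforce it."""
    scores = [weights.get((question, a), 0.0) for a in candidates]
    # Mostly exploit the current best answer, occasionally explore.
    if random.random() < 0.1:
        answer = random.choice(candidates)
    else:
        answer = candidates[scores.index(max(scores))]
    reward = reward_fn(question, answer)  # e.g. +1 if spatially correct
    weights[(question, answer)] = weights.get((question, answer), 0.0) + lr * reward
    return weights

# Toy "VST-P"-style perception label and "VST-R"-style reasoning reward
# (both invented for this sketch).
perception_data = [("cup left of book?", "yes")]

def reasoning_reward(question, answer):
    return 1.0 if answer == "yes" else -1.0

weights = {}
for ex in perception_data:   # supervised phase
    weights = sft_step(weights, ex)
for _ in range(20):          # reinforcement phase
    weights = rl_step(weights, "cup left of book?", ["yes", "no"], reasoning_reward)
```

After both phases, the correct answer's score dominates: the supervised phase seeds the right behavior, and the reward signal amplifies it while penalizing wrong samples – mirroring, at a cartoon level, why the paper runs SFT before RL.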

Why it matters?

This work is important because better spatial understanding is crucial for creating AI that can interact with the physical world effectively. By improving VLMs’ ability to reason about space, we can build AI systems that are more capable in areas like robotics, navigation, and even understanding everyday scenes. The fact that this improvement doesn’t negatively impact other abilities makes it a significant step towards more versatile and generally intelligent AI.

Abstract

Capturing spatial relationships from visual inputs is a cornerstone of human-like general intelligence. Several previous studies have tried to enhance the spatial awareness of Vision-Language Models (VLMs) by adding extra expert encoders, which brings extra overhead and usually harms general capabilities. To enhance the spatial ability in general architectures, we introduce Visual Spatial Tuning (VST), a comprehensive framework to cultivate VLMs with human-like visuospatial abilities, from spatial perception to reasoning. We first attempt to enhance spatial perception in VLMs by constructing a large-scale dataset termed VST-P, which comprises 4.1 million samples spanning 19 skills across single views, multiple images, and videos. Then, we present VST-R, a curated dataset with 135K samples that instruct models to reason in space. In particular, we adopt a progressive training pipeline: supervised fine-tuning to build foundational spatial knowledge, followed by reinforcement learning to further improve spatial reasoning abilities. Without side effects on general capabilities, the proposed VST consistently achieves state-of-the-art results on several spatial benchmarks, including 34.8% on MMSI-Bench and 61.2% on VSIBench. Vision-Language-Action models can also be significantly enhanced with the proposed spatial tuning paradigm, paving the way for more physically grounded AI.