
Make Geometry Matter for Spatial Reasoning

Shihua Zhang, Qiuhong Shen, Shizun Wang, Tianbo Pan, Xinchao Wang

2026-03-31


Summary

This paper focuses on improving how well vision-language models, which are good at understanding images and videos, can understand spatial relationships – things like where objects are positioned relative to each other. It tackles the issue that these models often rely too much on just looking at the picture and don't fully use information about the 3D geometry of the scene.

What's the problem?

Current vision-language models struggle with spatial reasoning, even when given extra information about the 3D structure of a scene. Simply adding 3D information as extra data doesn't automatically make the model *use* that information effectively. The models tend to stick with what they already know – interpreting the 2D image – and ignore the helpful 3D cues, leading to inaccurate understanding of spatial relationships.

What's the solution?

The researchers developed a framework called GeoSR that forces the model to pay attention to 3D geometry. It does this in two main ways: first, it strategically masks parts of the 2D image tokens during training, so the model *has* to rely on the 3D information to figure things out. Second, a gated routing mechanism boosts the contribution of the geometry tokens in the regions where geometric evidence matters most for understanding the scene's spatial layout. Together, these designs push the model to actively reason with the geometry rather than fall back on 2D shortcuts.
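The two ideas can be sketched in a few lines of PyTorch. This is a minimal illustration based only on the summary above, not the authors' implementation: the module names, the random token masking, and the sigmoid gate are all assumptions about how such components are typically built.

```python
import torch
import torch.nn as nn


def geometry_unleashing_mask(vision_tokens, mask_ratio=0.5):
    """Hypothetical sketch of the training-time masking: zero out a random
    subset of 2D vision tokens so the model must consult geometry tokens."""
    B, N, _ = vision_tokens.shape
    keep = torch.rand(B, N, device=vision_tokens.device) >= mask_ratio
    return vision_tokens * keep.unsqueeze(-1)


class GeometryGuidedFusion(nn.Module):
    """Hypothetical sketch of gated fusion: a per-token gate in [0, 1],
    computed from both streams, scales the geometry contribution."""

    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, vision_tokens, geometry_tokens):
        # Concatenate both streams, predict a gate, and add the gated
        # geometry tokens back onto the (possibly masked) vision tokens.
        g = self.gate(torch.cat([vision_tokens, geometry_tokens], dim=-1))
        return vision_tokens + g * geometry_tokens
```

In a training loop one would mask the vision tokens first and then fuse, so that wherever 2D evidence has been hidden, the gate is the model's only route to the information it needs.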

Why it matters?

This work is important because it significantly improves the ability of AI to understand the 3D world from images and videos. Better spatial reasoning is crucial for many applications, like robotics, self-driving cars, and even helping AI understand and describe scenes more accurately. By making geometry 'matter' to these models, they can perform tasks requiring spatial understanding much more effectively.

Abstract

Empowered by large-scale training, vision-language models (VLMs) achieve strong image and video understanding, yet their ability to perform spatial reasoning in both static scenes and dynamic videos remains limited. Recent advances try to handle this limitation by injecting geometry tokens from pretrained 3D foundation models into VLMs. Nevertheless, we observe that naive token fusion followed by standard fine-tuning in this line of work often leaves such geometric cues underutilized for spatial reasoning, as VLMs tend to rely heavily on 2D visual cues. In this paper, we propose GeoSR, a framework designed to make geometry matter by encouraging VLMs to actively reason with geometry tokens. GeoSR introduces two key components: (1) Geometry-Unleashing Masking, which strategically masks portions of 2D vision tokens during training to weaken non-geometric shortcuts and force the model to consult geometry tokens for spatial reasoning; and (2) Geometry-Guided Fusion, a gated routing mechanism that adaptively amplifies geometry token contributions in regions where geometric evidence is critical. Together, these designs unleash the potential of geometry tokens for spatial reasoning tasks. Extensive experiments on both static and dynamic spatial reasoning benchmarks demonstrate that GeoSR consistently outperforms prior methods and establishes new state-of-the-art performance by effectively leveraging geometric information. The project page is available at https://suhzhang.github.io/GeoSR/.