SpatialBoost: Enhancing Visual Representation through Language-Guided Reasoning
Byungwoo Jeon, Dongyoung Kim, Huiwon Jang, Insoo Kim, Jinwoo Shin
2026-03-24
Summary
This paper introduces a new method called SpatialBoost that aims to make computer vision models better at understanding the 3D relationships between objects in images.
What's the problem?
Current computer vision models are really good at recognizing *what* is in an image, but they often struggle with understanding *where* things are in relation to each other and how they exist in a 3D space. They're mostly trained on flat 2D images, so they miss out on crucial spatial information that humans easily grasp, limiting their usefulness in real-world applications like robotics or augmented reality.
What's the solution?
SpatialBoost tackles this by taking information about the 3D arrangement of objects in a 2D image and turning it into a description using language. Then, it uses a powerful language model, similar to the ones powering chatbots, to 'teach' the vision model about these spatial relationships. It does this step-by-step, building up a more complex understanding of the scene. Essentially, it's giving the vision model a verbal explanation of the 3D layout.
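The paper does not release pseudocode in this summary, but the pipeline it describes (3D layout → linguistic spatial facts → progressive multi-turn injection) can be sketched as follows. This is an illustrative toy, not the authors' implementation: the `Object3D` structure, the 0.5-unit thresholds, and the turn-building scheme are all assumptions made for the sketch.

```python
# Illustrative sketch (not the authors' code): turn a coarse 3D object
# layout recovered from an image into linguistic spatial descriptions,
# then into progressive multi-turn prompts, the kind of coarse-to-fine
# supervision SpatialBoost feeds through an LLM.
from dataclasses import dataclass


@dataclass
class Object3D:
    name: str
    x: float  # right (+) / left (-) in the camera frame (assumed units)
    y: float  # up (+) / down (-)
    z: float  # depth: distance from the camera


def pairwise_relation(a: Object3D, b: Object3D) -> str:
    """Describe where object `a` sits relative to object `b`."""
    parts = []
    if abs(a.z - b.z) > 0.5:  # threshold chosen for the sketch
        parts.append("in front of" if a.z < b.z else "behind")
    if abs(a.x - b.x) > 0.5:
        parts.append("to the right of" if a.x > b.x else "to the left of")
    rel = " and ".join(parts) if parts else "next to"
    return f"The {a.name} is {rel} the {b.name}."


def scene_description(objects: list[Object3D]) -> list[str]:
    """Linguistic spatial facts for every ordered object pair."""
    return [pairwise_relation(a, b)
            for i, a in enumerate(objects)
            for b in objects[i + 1:]]


def cot_turns(facts: list[str]) -> list[str]:
    """Multi-turn prompts that accumulate spatial facts one at a time,
    mirroring the progressive CoT injection the paper describes."""
    return ["Known spatial facts so far: " + " ".join(facts[:k])
            for k in range(1, len(facts) + 1)]
```

For example, a chair at depth 1 to the left of a table at depth 3 yields the fact "The chair is in front of and to the left of the table.", and `cot_turns` would feed that fact (and any later ones) to the LLM one turn at a time. In the real method, the spatial facts would come from dense 3D estimates rather than hand-written coordinates.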
Why does it matter?
This work is important because it gives vision models a more human-like grasp of scene layout. With this added spatial awareness, the models become more accurate on tasks that require 3D understanding as well as on general visual tasks, even setting new performance records on certain benchmarks.
Abstract
Despite the remarkable success of large-scale pre-trained image representation models (i.e., vision encoders) across various vision tasks, they are predominantly trained on 2D image data and therefore often fail to capture 3D spatial relationships between objects and backgrounds in the real world, constraining their effectiveness in many downstream applications. To address this, we propose SpatialBoost, a scalable framework that enhances the spatial awareness of existing pre-trained vision encoders by injecting 3D spatial knowledge expressed in linguistic descriptions. The core idea involves converting dense 3D spatial information from 2D images into linguistic expressions, which are then used to inject such spatial knowledge into vision encoders through a Large Language Model (LLM). To this end, we adopt a multi-turn Chain-of-Thought (CoT) reasoning process that progressively incorporates dense spatial knowledge and builds hierarchical spatial understanding. To validate its effectiveness, we adapt SpatialBoost to state-of-the-art vision encoders such as DINOv3, and evaluate its performance gains on a wide range of benchmarks requiring both 3D perception and general vision abilities. For instance, SpatialBoost improves DINOv3 performance from 55.9 to 59.7 mIoU on ADE20K, achieving state-of-the-art performance with a 3.8-point gain over the pre-trained DINOv3.