3D Aware Region Prompted Vision Language Model

An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, Sifei Liu

2025-09-17

Summary

This paper introduces a new vision-language model called SR-3D that’s designed to understand images and 3D data together, bridging the gap between how we see the world in 2D pictures and how it exists in 3D space.

What's the problem?

Traditionally, scene understanding required either large collections of labeled 2D images or detailed 3D data, which is hard to collect. Existing methods struggled to connect information across viewpoints or to reason accurately about spatial relationships, especially when objects weren't visible in every image. Pointing out a specific region of interest in both 2D and 3D also typically demanded extensive manual labeling across views.

What's the solution?

The researchers created SR-3D, which uses a shared 'visual token space' to combine information from 2D images and 3D data. They enrich the 2D image features with 3D positional information, allowing the model to reason about space more effectively. This lets you highlight regions of interest with bounding boxes, segmentation masks on any frame, or directly in 3D, without labeling every view. The model can then draw on what it knows about 2D images to understand the 3D structure, even when an object isn't visible in every view.
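To make the core idea concrete, here is a minimal NumPy sketch of enriching 2D visual tokens with 3D positional embeddings. The function names (`sinusoidal_embed_3d`, `enrich_2d_tokens`) and the sinusoidal encoding layout are illustrative assumptions, not the paper's actual implementation; the point is simply that each 2D patch feature gets a positional signal derived from its back-projected 3D location, so tokens from different frames share one spatial frame of reference.

```python
import numpy as np

def sinusoidal_embed_3d(coords, dim):
    """Map 3D coordinates (N, 3) to sinusoidal positional embeddings (N, dim).

    Each of the 3 axes gets dim // 6 frequency bands, with one sin and one
    cos term per band (a hypothetical layout; SR-3D's exact scheme may differ).
    """
    bands = dim // 6
    freqs = 2.0 ** np.arange(bands)           # geometric frequency ladder
    angles = coords[:, :, None] * freqs       # (N, 3, bands)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(coords.shape[0], -1)   # (N, 3 * 2 * bands) == (N, dim)

def enrich_2d_tokens(tokens_2d, points_3d):
    """Add 3D positional embeddings to 2D visual tokens of the same width."""
    pos = sinusoidal_embed_3d(points_3d, tokens_2d.shape[1])
    return tokens_2d + pos

# Toy example: 4 patch tokens of width 12, each paired with a 3D point
# obtained by back-projecting the patch center (coordinates are made up).
tokens = np.random.randn(4, 12)
points = np.random.rand(4, 3)
out = enrich_2d_tokens(tokens, points)
print(out.shape)  # (4, 12)
```

Because the embedding is added rather than concatenated, the token width stays unchanged, so the enriched tokens can flow through the same language-model interface as plain 2D tokens.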

Why it matters?

This work is important because it combines the strengths of 2D and 3D data to improve scene understanding. SR-3D achieves state-of-the-art performance on various benchmarks and even works on ordinary videos without dedicated 3D sensor input, accurately inferring how objects relate to each other in space and estimating their sizes. This could be useful for robotics, virtual reality, and other computer vision applications.

Abstract

We present Spatial Region 3D (SR-3D), an aware vision-language model that connects single-view 2D images and multi-view 3D data through a shared visual token space. SR-3D supports flexible region prompting, allowing users to annotate regions with bounding boxes, segmentation masks on any frame, or directly in 3D, without the need for exhaustive multi-frame labeling. We achieve this by enriching 2D visual features with 3D positional embeddings, which allows the 3D model to draw upon strong 2D priors for more accurate spatial reasoning across frames, even when objects of interest do not co-occur within the same view. Extensive experiments on both general 2D vision-language and specialized 3D spatial benchmarks demonstrate that SR-3D achieves state-of-the-art performance, underscoring its effectiveness for unifying the 2D and 3D representation space for scene understanding. Moreover, we observe applicability to in-the-wild videos without sensory 3D inputs or ground-truth 3D annotations, where SR-3D accurately infers spatial relationships and metric measurements.