N3D-VLM: Native 3D Grounding Enables Accurate Spatial Reasoning in Vision-Language Models
Yuxin Wang, Lei Ke, Boqiang Zhang, Tianyuan Qu, Hanxun Yu, Zhenpeng Huang, Meng Yu, Dan Xu, Dong Yu
2025-12-19
Summary
This paper introduces a new system called N3D-VLM that helps computers answer questions about images, with a focus on truly understanding the 3D structure of the scene rather than just its appearance in a 2D picture.
What's the problem?
Current computer models that answer questions about images struggle with understanding depth and spatial relationships. They see a flat picture, but don't naturally 'get' how objects are positioned in 3D space, making it hard to answer questions like 'What's to the left of the chair?' or 'How far is the table from the sofa?' They rely on 2D information and miss crucial 3D cues.
What's the solution?
The researchers created N3D-VLM, which is designed to perceive objects in 3D directly. Instead of just looking at a regular image, it uses depth information to build a 3D understanding of the scene. They also created a way to automatically generate a large amount of 3D training data from existing 2D images, making it possible to train the model effectively. This allows the model to not only find objects in 3D space when asked, but also to reason about their positions relative to each other.
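The paper does not spell out the exact formulation of its data pipeline, but "lifting" a 2D annotation into 3D with estimated depth is typically done by back-projecting pixels through the pinhole camera model. Below is a minimal sketch under that assumption; the function name, the median-depth heuristic, and the specific intrinsics are illustrative, not taken from the paper.

```python
import numpy as np

def lift_box_to_3d(box_2d, depth_map, fx, fy, cx, cy):
    """Lift a 2D bounding box to a 3D point via pinhole back-projection.

    box_2d: (x_min, y_min, x_max, y_max) in pixels.
    depth_map: HxW array of metric depth (e.g. from a depth estimator).
    fx, fy, cx, cy: camera intrinsics (assumed known or estimated).
    """
    x_min, y_min, x_max, y_max = box_2d
    # Use the median depth inside the box as a simple robustness
    # heuristic against background pixels (an assumption, not the
    # paper's method).
    patch = depth_map[int(y_min):int(y_max), int(x_min):int(x_max)]
    z = float(np.median(patch))
    # Back-project the box center through the pinhole model:
    # X = (u - cx) * Z / fx,  Y = (v - cy) * Z / fy
    u = (x_min + x_max) / 2.0
    v = (y_min + y_max) / 2.0
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])

# Toy example: uniform depth of 2 m, box centered on the principal point,
# so the lifted point sits on the optical axis at Z = 2.
depth = np.full((480, 640), 2.0)
center = lift_box_to_3d((300, 220, 340, 260), depth,
                        fx=500, fy=500, cx=320, cy=240)
```

Applied at scale over existing 2D detection datasets, a step like this is what would let the authors turn abundant 2D boxes into 3D grounding supervision without manual 3D labeling.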
Why it matters?
This work is important because it moves us closer to computers that can truly 'see' and understand the world like humans do. By giving models a 3D understanding of scenes, they can answer more complex questions, perform more accurate tasks, and ultimately be more helpful in applications like robotics, augmented reality, and virtual assistants.
Abstract
While current multimodal models can answer questions based on 2D images, they lack intrinsic 3D object perception, limiting their ability to comprehend spatial relationships and depth cues in 3D scenes. In this work, we propose N3D-VLM, a novel unified framework that seamlessly integrates native 3D object perception with 3D-aware visual reasoning, enabling both precise 3D grounding and interpretable spatial understanding. Unlike conventional end-to-end models that directly predict answers from RGB/RGB-D inputs, our approach equips the model with native 3D object perception capabilities, enabling it to directly localize objects in 3D space based on textual descriptions. Building upon accurate 3D object localization, the model further performs explicit reasoning in 3D, achieving more interpretable and structured spatial understanding. To support robust training for these capabilities, we develop a scalable data construction pipeline that leverages depth estimation to lift large-scale 2D annotations into 3D space, significantly increasing the diversity and coverage of 3D object grounding data and yielding a dataset over six times larger than the largest existing single-image 3D detection dataset. Moreover, the pipeline generates spatial question-answering datasets that target chain-of-thought (CoT) reasoning in 3D, facilitating joint training for both 3D object localization and 3D spatial reasoning. Experimental results demonstrate that our unified framework not only achieves state-of-the-art performance on 3D grounding tasks, but also consistently surpasses existing methods in 3D spatial reasoning with vision-language models.