
VGGT-Det: Mining VGGT Internal Priors for Sensor-Geometry-Free Multi-View Indoor 3D Object Detection

Yang Cao, Feize Wu, Dave Zhenyu Chen, Yingji Zhong, Lanqing Hong, Dan Xu

2026-03-03


Summary

This paper introduces a new method, VGGT-Det, for detecting 3D objects in indoor scenes using images from multiple cameras, but without needing precise information about where those cameras are positioned.

What's the problem?

Current 3D object detection systems that use multiple camera views require very accurate knowledge of each camera's location and orientation, which is expensive and difficult to obtain in real-world environments. This limits where these systems can be used. The goal is to create a system that works well even without this precise camera information.

What's the solution?

The researchers built on a previous technique called the Visual Geometry Grounded Transformer (VGGT), which can infer 3D information from images alone. They integrated VGGT's encoder into a transformer-based detection pipeline, VGGT-Det, and added two key improvements. First, they use VGGT's attention maps to help the system focus on likely object regions when generating object queries, while preserving the global spatial structure of the scene. Second, they created a mechanism that lets the system actively 'ask' VGGT for the specific 3D features each object query needs, pulling information from different layers of VGGT that capture different levels of detail.
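To make the first idea concrete, here is a minimal NumPy sketch of attention-guided query initialization: patches that receive high attention are treated as likely object regions, and their features seed the object queries. All shapes, names, and the top-k selection rule are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes: V views, P patches per view, C channels, Q object queries.
V, P, C, Q = 4, 64, 32, 8

patch_feats = rng.normal(size=(V, P, C))  # patch features from the (assumed) VGGT encoder
attn_maps = rng.random(size=(V, P))       # per-patch attention scores (semantic prior)

def attention_guided_queries(patch_feats, attn_maps, num_queries):
    """Initialize object queries from the most-attended patches across all views."""
    V, P, C = patch_feats.shape
    flat_feats = patch_feats.reshape(V * P, C)
    flat_attn = attn_maps.reshape(V * P)
    # Pick the top-k attended patches as query seeds (one simple selection rule).
    top_idx = np.argsort(flat_attn)[-num_queries:]
    return flat_feats[top_idx]  # (num_queries, C)

queries = attention_guided_queries(patch_feats, attn_maps, Q)
print(queries.shape)  # (8, 32)
```

The design choice this illustrates: instead of learning query embeddings from scratch, the queries start at image locations the backbone already considers salient, which should speed up localization.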

Why it matters?

This work is important because it makes multi-view 3D object detection more practical for real-world use. By removing the need for precise camera setup, the system can be deployed in more places and situations, and it performs significantly better than existing methods in this 'sensor-geometry-free' setting.

Abstract

Current multi-view indoor 3D object detectors rely on sensor geometry that is costly to obtain (i.e., precisely calibrated multi-view camera poses) to fuse multi-view information into a global scene representation, limiting deployment in real-world scenes. We target a more practical setting: Sensor-Geometry-Free (SG-Free) multi-view indoor 3D object detection, where there are no sensor-provided geometric inputs (multi-view poses or depth). The recent Visual Geometry Grounded Transformer (VGGT) shows that strong 3D cues can be inferred directly from images. Building on this insight, we present VGGT-Det, the first framework tailored for SG-Free multi-view indoor 3D object detection. Rather than merely consuming VGGT predictions, our method integrates the VGGT encoder into a transformer-based pipeline. To effectively leverage both the semantic and geometric priors from inside VGGT, we introduce two key components: (i) Attention-Guided Query Generation (AG): exploits VGGT attention maps as semantic priors to initialize object queries, improving localization by focusing on object regions while preserving global spatial structure; (ii) Query-Driven Feature Aggregation (QD): a learnable See-Query interacts with object queries to 'see' what they need, and then dynamically aggregates multi-level geometric features across VGGT layers that progressively lift 2D features into 3D. Experiments show that VGGT-Det significantly surpasses the best-performing method in the SG-Free setting by 4.4 and 8.6 mAP@0.25 on ScanNet and ARKitScenes, respectively. Ablation studies show that VGGT's internally learned semantic and geometric priors can be effectively leveraged by our AG and QD.
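The second component, Query-Driven Feature Aggregation, can be sketched in the same spirit: each object query receives a weighted mix of features from every VGGT layer, with the weights driven by the queries themselves. The sketch below is a loose NumPy analogy under stated assumptions; the `see_query` vector, the additive mixing (a stand-in for the paper's cross-attention), and all shapes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical shapes: L VGGT layers, Q object queries, C channels.
L, Q, C = 6, 8, 32

layer_feats = rng.normal(size=(L, Q, C))  # per-layer features gathered for each query
obj_queries = rng.normal(size=(Q, C))     # object query embeddings
see_query = rng.normal(size=(C,))         # learnable "See-Query" (random here)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def query_driven_aggregation(layer_feats, obj_queries, see_query):
    """Weight each VGGT layer per query, then aggregate the multi-level features."""
    # The See-Query "sees" what each object query needs: mix it into the queries
    # (a simple additive stand-in for the actual query interaction)...
    mixed = obj_queries + see_query                       # (Q, C)
    # ...then score every layer's features against the mixed queries.
    scores = np.einsum('lqc,qc->lq', layer_feats, mixed)  # (L, Q)
    weights = softmax(scores, axis=0)                     # per-query weights over layers
    return np.einsum('lq,lqc->qc', weights, layer_feats)  # (Q, C)

agg = query_driven_aggregation(layer_feats, obj_queries, see_query)
print(agg.shape)  # (8, 32)
```

The point of the dynamic weighting is that early VGGT layers carry mostly 2D appearance while later layers are progressively lifted toward 3D, so different queries can draw on different depths of the backbone rather than a single fixed layer.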