IGGT: Instance-Grounded Geometry Transformer for Semantic 3D Reconstruction

Hao Li, Zhengyu Zou, Fangfu Liu, Xuanyang Zhang, Fangzhou Hong, Yukang Cao, Yushi Lan, Manyuan Zhang, Gang Yu, Dingwen Zhang, Ziwei Liu

2025-10-28

Summary

This paper introduces a new way for computers to understand 3D scenes from 2D images, treating the shape of objects and the identity of those objects as a single, intertwined problem, much as human perception does.

What's the problem?

Currently, most computer systems tackle 3D scene understanding in two separate steps: first, they try to reconstruct the 3D geometry, and then they try to figure out what objects are present. This is a problem because humans don't think this way – we understand shape and object identity simultaneously. Existing attempts to combine these steps often just link a 3D model to a language model, which limits the system's overall ability to adapt and perform well on different tasks.

What's the solution?

The researchers developed a system called the Instance-Grounded Geometry Transformer, or IGGT. This system is a single, unified 'brain' that learns to understand both the 3D structure and the objects within a scene at the same time, using only 2D images as input. They achieved this with a 3D-consistent contrastive learning strategy that encourages the system to build one combined representation of geometry and object identity. To train IGGT, they also created a new, large dataset called InsScene-15K, which pairs RGB images with camera poses, depth maps, and 3D-consistent instance masks.

Why it matters?

This work is important because it moves closer to how humans understand 3D scenes. By unifying geometry and object understanding, the system can potentially generalize better to new situations and perform more accurately on tasks like robotic navigation or virtual reality, ultimately leading to more intelligent and adaptable 3D-aware AI.

Abstract

Humans naturally perceive the geometric structure and semantic content of a 3D world as intertwined dimensions, enabling coherent and accurate understanding of complex scenes. However, most prior approaches prioritize training large geometry models for low-level 3D reconstruction and treat high-level spatial understanding in isolation, overlooking the crucial interplay between these two fundamental aspects of 3D-scene analysis, thereby limiting generalization and leading to poor performance in downstream 3D understanding tasks. Recent attempts have mitigated this issue by simply aligning 3D models with specific language models, thus restricting perception to the aligned model's capacity and limiting adaptability to downstream tasks. In this paper, we propose Instance-Grounded Geometry Transformer (IGGT), an end-to-end large unified transformer to unify the knowledge for both spatial reconstruction and instance-level contextual understanding. Specifically, we design a 3D-Consistent Contrastive Learning strategy that guides IGGT to encode a unified representation with geometric structures and instance-grounded clustering through only 2D visual inputs. This representation supports consistent lifting of 2D visual inputs into a coherent 3D scene with explicitly distinct object instances. To facilitate this task, we further construct InsScene-15K, a large-scale dataset with high-quality RGB images, poses, depth maps, and 3D-consistent instance-level mask annotations with a novel data curation pipeline.