Concerto: Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations
Yujia Zhang, Xiaoyang Wu, Yixing Lao, Chengyao Wang, Zhuotao Tian, Naiyan Wang, Hengshuang Zhao
2025-10-28
Summary
This research introduces Concerto, a new method for teaching computers to understand 3D spaces, inspired by how humans learn. Humans often use multiple senses together to grasp a concept and can later recall it from just one sense; Concerto tries to mimic this process.
What's the problem?
Current computer systems struggle to understand 3D environments as well as humans do. Existing methods either process 2D images and 3D data separately, or simply combine them without capturing the relationship between the two. The result is a spatial understanding that lacks fine-grained detail and consistency, making it hard for computers to perform tasks like recognizing objects or navigating.
What's the solution?
Concerto tackles this by training on 2D images and 3D point clouds *together*. It applies 'self-distillation' within the 3D data to refine its own features, and pairs this with a 2D-3D cross-modal 'joint embedding' that aligns the two kinds of data into a unified representation of the space. The design is relatively simple, yet surprisingly effective at learning detailed spatial features (see the sketch below). The authors also created a variant for video-lifted point clouds and a translator that connects the learned representation to human language.
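To make the two training signals concrete, here is a minimal sketch of what "3D intra-modal self-distillation plus 2D-3D joint embedding" could look like in PyTorch. The encoder, loss forms, and the assumption that 2D features are already projected onto the points are all simplifications for illustration; they are not Concerto's actual architecture, loss weights, or teacher-update schedule.

```python
# Hedged sketch: two training signals, not the paper's exact implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class PointEncoder(nn.Module):
    """Stand-in per-point encoder (hypothetical; Concerto uses a real 3D backbone)."""
    def __init__(self, dim=64):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, points):            # points: (N, 3)
        return self.mlp(points)           # per-point features: (N, dim)

def self_distillation_loss(student_feat, teacher_feat):
    # 3D intra-modal self-distillation: student features match a frozen teacher.
    return 1 - F.cosine_similarity(student_feat, teacher_feat.detach(), dim=-1).mean()

def joint_embedding_loss(feat_3d, feat_2d):
    # 2D-3D cross-modal joint embedding: align 3D features with 2D image
    # features lifted onto the same points (point-pixel correspondence assumed given).
    return 1 - F.cosine_similarity(feat_3d, feat_2d.detach(), dim=-1).mean()

# Toy usage on random data.
student, teacher = PointEncoder(), PointEncoder()
teacher.load_state_dict(student.state_dict())   # teacher would normally be an EMA copy
points = torch.randn(1024, 3)                   # one point cloud
feat_2d = torch.randn(1024, 64)                 # placeholder 2D features projected to points

feat_s = student(points)
with torch.no_grad():
    feat_t = teacher(points)                    # in practice the teacher sees another view

loss = self_distillation_loss(feat_s, feat_t) + joint_embedding_loss(feat_s, feat_2d)
loss.backward()
```

The key design point this illustrates is that the 3D branch learns from itself (self-distillation) while the joint embedding ties it to 2D, so neither modality is merely concatenated onto the other.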
Why it matters?
This work is important because Concerto significantly improves a computer's ability to understand 3D scenes, setting new state-of-the-art results on several standard scene understanding benchmarks. This advancement has the potential to improve applications such as robotics, virtual reality, and autonomous driving by allowing computers to perceive and interact with the world more accurately and intelligently.
Abstract
Humans learn abstract concepts through multisensory synergy, and once formed, such representations can often be recalled from a single modality. Inspired by this principle, we introduce Concerto, a minimalist simulation of human concept learning for spatial cognition, combining 3D intra-modal self-distillation with 2D-3D cross-modal joint embedding. Despite its simplicity, Concerto learns more coherent and informative spatial features, as demonstrated by zero-shot visualizations. It outperforms both standalone SOTA 2D and 3D self-supervised models by 14.2% and 4.8%, respectively, as well as their feature concatenation, in linear probing for 3D scene perception. With full fine-tuning, Concerto sets new SOTA results across multiple scene understanding benchmarks (e.g., 80.7% mIoU on ScanNet). We further present a variant of Concerto tailored for video-lifted point cloud spatial understanding, and a translator that linearly projects Concerto representations into CLIP's language space, enabling open-world perception. These results highlight that Concerto emerges spatial representations with superior fine-grained geometric and semantic consistency.
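The abstract's "translator" is described as a linear projection of Concerto features into CLIP's language space. A minimal sketch of that idea follows; the feature dimensions, the random placeholder tensors, and the prompt-matching step are assumptions for illustration, not the paper's exact setup.

```python
# Hedged sketch: per-point features linearly mapped into a CLIP-like text space,
# then labeled by cosine similarity to class-name embeddings (open-world perception).
import torch
import torch.nn as nn
import torch.nn.functional as F

concerto_dim, clip_dim = 512, 768                 # assumed feature sizes
translator = nn.Linear(concerto_dim, clip_dim)    # in the paper this is learned; here untrained

point_feats = torch.randn(2048, concerto_dim)     # placeholder Concerto per-point features
text_embeds = torch.randn(20, clip_dim)           # placeholder CLIP text embeddings (20 prompts)

lifted = F.normalize(translator(point_feats), dim=-1)
text = F.normalize(text_embeds, dim=-1)
logits = lifted @ text.t()                        # (points, classes) similarity scores
pred = logits.argmax(dim=-1)                      # open-vocabulary label per point
```

Because the mapping is a single linear layer, strong open-vocabulary results would indicate that the semantic structure already lives in the Concerto features themselves rather than being added by the translator.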