Multiview Equivariance Improves 3D Correspondence Understanding with Minimal Feature Finetuning

Yang You, Yixin Li, Congyue Deng, Yue Wang, Leonidas Guibas

2025-01-27

Summary

This paper is about improving how AI models understand 3D space in images. It focuses on making Vision Transformer (ViT) models better at recognizing objects from different angles and at matching the same points on an object across different views.

What's the problem?

While current AI models are great at understanding 2D images, they struggle with 3D relationships. It's like they can recognize a car in a photo, but have trouble understanding how that car would look from different angles or how it relates to other objects around it in 3D space.

What's the solution?

The researchers did two main things. First, they tested how well ViT models understand 3D relationships by checking whether the features they produce for the same point on an object stay consistent across different viewpoints. Then, they created a simple finetuning method based on 3D correspondences that helps the models keep those features consistent. Surprisingly, they found that even a tiny amount of this special training - finetuning on a single object for just one iteration - made the models much better at understanding 3D space.
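
To make the evaluation idea concrete, here is a minimal sketch of how cross-view feature consistency could be measured with an off-the-shelf ViT. The model choice (DINOv2 loaded via torch.hub), the `patch_features` helper, and the scoring function are illustrative assumptions, not the paper's exact protocol.

```python
# Sketch: measure multiview equivariance as the similarity of ViT patch
# features at pixels that show the same 3D point in two views.
# Assumptions: DINOv2 via torch.hub; image sides divisible by the patch size.
import torch
import torch.nn.functional as F

model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()
PATCH = 14  # DINOv2 ViT-S/14 patch size

@torch.no_grad()
def patch_features(image: torch.Tensor) -> torch.Tensor:
    """image: (1, 3, H, W), normalized. Returns an (h, w, C) feature map."""
    h, w = image.shape[-2] // PATCH, image.shape[-1] // PATCH
    tokens = model.forward_features(image)["x_norm_patchtokens"]  # (1, h*w, C)
    return tokens.reshape(h, w, -1)

def equivariance_score(img_a, img_b, corr_a, corr_b) -> float:
    """corr_a, corr_b: (N, 2) integer pixel coords (x, y) of the same 3D
    points seen in view A and view B. Returns the mean cosine similarity
    of the features at corresponding patches; higher = more 3D-equivariant."""
    fa, fb = patch_features(img_a), patch_features(img_b)
    pa = fa[corr_a[:, 1] // PATCH, corr_a[:, 0] // PATCH]  # (N, C)
    pb = fb[corr_b[:, 1] // PATCH, corr_b[:, 0] // PATCH]
    return F.cosine_similarity(pa, pb, dim=-1).mean().item()
```

A score near 1 means the model assigns nearly identical features to the same physical point regardless of viewpoint, which is the property the paper links to better pose estimation, tracking, and semantic transfer.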

Why it matters?

This matters because it could make AI much better at tasks that involve 3D understanding, like helping robots navigate rooms, improving virtual reality experiences, or even assisting in medical imaging. By making AI better at understanding 3D space with just a little extra training, we could see big improvements in many areas of technology without needing to completely rebuild our AI systems.

Abstract

Vision foundation models, particularly the ViT family, have revolutionized image understanding by providing rich semantic features. However, despite their success in 2D comprehension, their ability to grasp 3D spatial relationships is still unclear. In this work, we evaluate and enhance the 3D awareness of ViT-based models. We begin by systematically assessing their ability to learn 3D equivariant features, specifically examining the consistency of semantic embeddings across different viewpoints. Our findings indicate that improved 3D equivariance leads to better performance on various downstream tasks, including pose estimation, tracking, and semantic transfer. Building on this insight, we propose a simple yet effective finetuning strategy based on 3D correspondences, which significantly enhances the 3D correspondence understanding of existing vision models. Remarkably, even finetuning on a single object for just one iteration results in substantial performance gains. All code and resources will be made publicly available to support further advancements in 3D-aware vision models. Our code is available at https://github.com/qq456cvb/3DCorrEnhance.
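
For a concrete picture of what correspondence-based finetuning can look like, below is a minimal sketch of one training step. The InfoNCE-style loss, the hypothetical `extract` helper, and the single-iteration loop are illustrative assumptions rather than the authors' exact objective; their actual implementation is in the repository linked above.

```python
# Sketch: a correspondence-driven finetuning objective. Features at matching
# 3D points in two views are pulled together; mismatched pairs are pushed
# apart. The loss form (InfoNCE over patch features) is an assumption.
import torch
import torch.nn.functional as F

def correspondence_loss(feats_a: torch.Tensor,
                        feats_b: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """feats_a, feats_b: (N, C) features at N corresponding 3D points."""
    a = F.normalize(feats_a, dim=-1)
    b = F.normalize(feats_b, dim=-1)
    logits = a @ b.t() / temperature                      # (N, N) similarities
    targets = torch.arange(len(a), device=logits.device)  # diagonal = matches
    return F.cross_entropy(logits, targets)

# One finetuning iteration (the paper reports gains even from a single
# object and a single step); `extract` is a hypothetical helper that
# returns the model's features at the given pixel correspondences:
# optimizer.zero_grad()
# loss = correspondence_loss(extract(model, img_a, corr_a),
#                            extract(model, img_b, corr_b))
# loss.backward()
# optimizer.step()
```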