
Escaping Plato's Cave: Towards the Alignment of 3D and Text Latent Spaces

Souhail Hadgi, Luca Moschella, Andrea Santilli, Diego Gomez, Qixing Huang, Emanuele Rodolà, Simone Melzi, Maks Ovsjanikov

2025-03-11

Summary

This paper explores how to connect 3D models with text descriptions by finding hidden similarities in how computers represent them, much like translating between two languages that describe the same thing.

What's the problem?

Computers struggle to link 3D shapes (like a chair model) with text descriptions (like 'wooden chair with armrests') because they process these formats differently and don’t naturally 'speak the same language'.

What's the solution?

The researchers found that naively matching the two feature spaces works poorly. Instead, by projecting each modality's representations onto smaller, carefully chosen shared subspaces (comparing key features instead of everything), they could teach the system to match 3D models to text descriptions much more accurately.

Why it matters?

This helps AI tools better understand and link 3D designs (like video game assets or 3D-printed objects) to their real-world descriptions, making it easier to search, design, or create things using both formats.

Abstract

Recent works have shown that, when trained at scale, uni-modal 2D vision and text encoders converge to learned features that share remarkable structural properties, despite arising from different representations. However, the role of 3D encoders with respect to other modalities remains unexplored. Furthermore, existing 3D foundation models that leverage large datasets are typically trained with explicit alignment objectives with respect to frozen encoders from other representations. In this work, we investigate the possibility of a posteriori alignment of representations obtained from uni-modal 3D encoders compared to text-based feature spaces. We show that naive post-training feature alignment of uni-modal text and 3D encoders results in limited performance. We then focus on extracting subspaces of the corresponding feature spaces and discover that by projecting learned representations onto well-chosen lower-dimensional subspaces the quality of alignment becomes significantly higher, leading to improved accuracy on matching and retrieval tasks. Our analysis further sheds light on the nature of these shared subspaces, which roughly separate between semantic and geometric data representations. Overall, ours is the first work that helps to establish a baseline for post-training alignment of 3D uni-modal and text feature spaces, and helps to highlight both the shared and unique properties of 3D data compared to other representations.