CrossOver: 3D Scene Cross-Modal Alignment

Sayan Deb Sarkar, Ondrej Miksik, Marc Pollefeys, Daniel Barath, Iro Armeni

2025-02-24

Summary

This paper introduces CrossOver, a framework that understands 3D scenes by flexibly combining different types of data, such as photos, 3D scans, and text descriptions.

What's the problem?

Current methods for understanding 3D scenes from multiple types of data (such as images and 3D models) often assume that every data type is available and precisely aligned with the others, which is rarely the case in real-world situations.

What's the solution?

The researchers created CrossOver, which uses encoders tailored to the dimensionality of each data type and a multi-stage training pipeline to map every data type into one shared embedding space for scenes. This allows the system to retrieve scenes and localize objects even when some types of data are missing. It works with photos (RGB images), 3D scans (point clouds), CAD models, floorplans, and text descriptions without requiring them to be aligned for every object.
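To make the idea of a shared embedding space concrete, here is a minimal sketch of cross-modal scene retrieval when some data types are missing. It is an illustration only, not the paper's implementation: the three-dimensional vectors are toy values standing in for encoder outputs, and the function names (`fuse_scene`, `retrieve`) and the fusion rule (averaging whichever unit-length embeddings a scene has) are assumptions made for this example.

```python
import math


def normalize(vec):
    """Scale a vector to unit length so dot products equal cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def fuse_scene(embeddings):
    """Hypothetical fusion rule: average the unit-length embeddings of
    whichever modalities the scene has, then renormalize."""
    if not embeddings:
        raise ValueError("scene has no modality embeddings")
    unit = [normalize(v) for v in embeddings.values()]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def retrieve(query_vec, database):
    """Rank scene ids by cosine similarity between the query embedding
    and each scene's fused embedding (most similar first)."""
    q = normalize(query_vec)
    scored = []
    for scene_id, modalities in database.items():
        scene_vec = fuse_scene(modalities)
        score = sum(a * b for a, b in zip(q, scene_vec))
        scored.append((score, scene_id))
    scored.sort(reverse=True)
    return [scene_id for _, scene_id in scored]


# Toy database: each scene has a different subset of modalities.
database = {
    "kitchen": {"rgb": [0.9, 0.1, 0.0], "point_cloud": [1.0, 0.0, 0.1]},
    "bedroom": {"floorplan": [0.0, 1.0, 0.1]},
    "office": {"text": [0.1, 0.0, 1.0], "cad": [0.0, 0.2, 0.9]},
}

# A query embedding from one modality (say, a text description) can still
# be matched against a scene that only has a floorplan embedding.
ranking = retrieve([0.1, 0.9, 0.0], database)
print(ranking)
```

Because every modality lands in the same space, the query never needs to know which data types a stored scene has. In this toy setup the scene with only a floorplan is still ranked first for the query.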

Why does it matter?

This matters because it makes 3D scene understanding more practical for real-world use, where some types of data are often unavailable. It could help improve applications such as virtual reality and robotics, which need to understand complex 3D environments.

Abstract

Multi-modal 3D object understanding has gained significant attention, yet current approaches often assume complete data availability and rigid alignment across all modalities. We present CrossOver, a novel framework for cross-modal 3D scene understanding via flexible, scene-level modality alignment. Unlike traditional methods that require aligned modality data for every object instance, CrossOver learns a unified, modality-agnostic embedding space for scenes by aligning modalities - RGB images, point clouds, CAD models, floorplans, and text descriptions - with relaxed constraints and without explicit object semantics. Leveraging dimensionality-specific encoders, a multi-stage training pipeline, and emergent cross-modal behaviors, CrossOver supports robust scene retrieval and object localization, even with missing modalities. Evaluations on ScanNet and 3RScan datasets show its superior performance across diverse metrics, highlighting adaptability for real-world applications in 3D scene understanding.