
Forging Spatial Intelligence: A Roadmap of Multi-Modal Data Pre-Training for Autonomous Systems

Song Wang, Lingdong Kong, Xiaolu Liu, Hao Shi, Wentong Li, Jianke Zhu, Steven C. H. Hoi

2026-01-01


Summary

This paper is about building 'Spatial Intelligence' in robots and self-driving cars by helping them understand the world around them using information from different sensors like cameras and LiDAR.

What's the problem?

Currently, artificial intelligence models are really good at processing information from *one* type of sensor at a time, like just a camera or just LiDAR. The big challenge is combining information from all of a robot's different sensors into a single, complete, and accurate picture of its surroundings, similar to how humans use their senses together.

What's the solution?

The researchers mapped out how to train these AI models on data from multiple sensors at once. They surveyed the different techniques for doing this, organized them into categories, and identified what works best. They also explored adding text descriptions and 'occupancy' representations (a map of which parts of 3D space are filled) to help the AI make better decisions about navigating and interacting with the world. They also pinpointed areas where improvements are needed, like making the process faster and more efficient.
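To make the idea of multi-modal pre-training more concrete, here is a minimal, illustrative sketch. It is not the paper's method, and the toy encoders are stand-ins for real image and point-cloud backbones; it only shows one widely used ingredient, contrastively aligning camera and LiDAR features from the same scene so both sensors map into a shared representation.

```python
# Illustrative sketch only: contrastive camera/LiDAR pre-training.
# The encoders are toy stand-ins; real systems use image backbones
# (e.g. ResNets or ViTs) and sparse 3D networks for point clouds.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyImageEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, dim))

    def forward(self, img):                      # img: (B, 3, H, W)
        return F.normalize(self.net(img), dim=-1)

class ToyPointEncoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, dim))

    def forward(self, pts):                      # pts: (B, N, 3) LiDAR points
        feat = self.mlp(pts).max(dim=1).values   # simple max-pool over points
        return F.normalize(feat, dim=-1)

def contrastive_loss(img_feat, pts_feat, temperature=0.07):
    # Paired camera/LiDAR views of the same scene are positives;
    # every other pairing in the batch is a negative (InfoNCE).
    logits = img_feat @ pts_feat.t() / temperature
    targets = torch.arange(logits.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

if __name__ == "__main__":
    img_enc, pts_enc = ToyImageEncoder(), ToyPointEncoder()
    images = torch.randn(4, 3, 64, 64)           # dummy camera batch
    points = torch.randn(4, 1024, 3)             # dummy LiDAR batch
    loss = contrastive_loss(img_enc(images), pts_enc(points))
    loss.backward()                              # both encoders get gradients
    print(f"pre-training loss: {loss.item():.3f}")
```

In practice, the paired views come from synchronized camera images and LiDAR sweeps of the same scene, and the toy encoders above would be replaced by full image and point-cloud backbones trained on large driving datasets.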

Why it matters?

This research is important because it's a step towards creating truly intelligent robots and self-driving cars that can reliably operate in the real world. If we can give these machines a strong understanding of their environment, they'll be safer, more efficient, and able to handle unexpected situations, ultimately leading to more advanced autonomous systems.

Abstract

The rapid advancement of autonomous systems, including self-driving vehicles and drones, has intensified the need to forge true Spatial Intelligence from multi-modal onboard sensor data. While foundation models excel in single-modal contexts, integrating their capabilities across diverse sensors like cameras and LiDAR to create a unified understanding remains a formidable challenge. This paper presents a comprehensive framework for multi-modal pre-training, identifying the core set of techniques driving progress toward this goal. We dissect the interplay between foundational sensor characteristics and learning strategies, evaluating the role of platform-specific datasets in enabling these advancements. Our central contribution is the formulation of a unified taxonomy for pre-training paradigms: ranging from single-modality baselines to sophisticated unified frameworks that learn holistic representations for advanced tasks like 3D object detection and semantic occupancy prediction. Furthermore, we investigate the integration of textual inputs and occupancy representations to facilitate open-world perception and planning. Finally, we identify critical bottlenecks, such as computational efficiency and model scalability, and propose a roadmap toward general-purpose multi-modal foundation models capable of achieving robust Spatial Intelligence for real-world deployment.
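As a concrete illustration of the open-world direction mentioned above, the sketch below shows how per-voxel occupancy features could be matched against text embeddings of free-form class names, so new categories can be queried without retraining. This is an assumption-laden example, not the paper's implementation: the helper `open_vocab_occupancy` is hypothetical, and the text embeddings are assumed to come precomputed from a CLIP-style text encoder.

```python
# Illustrative sketch (not the paper's implementation): open-vocabulary
# labeling of occupancy features by cosine similarity to text embeddings.
import torch
import torch.nn.functional as F

def open_vocab_occupancy(voxel_feats, text_embeds, class_names):
    """voxel_feats: (V, D) features for occupied voxels.
    text_embeds: (C, D) embeddings of class-name prompts, assumed to be
    precomputed by a CLIP-style text encoder (hypothetical here)."""
    v = F.normalize(voxel_feats, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    sim = v @ t.t()                     # cosine similarity, shape (V, C)
    labels = sim.argmax(dim=-1)         # best-matching class per voxel
    return [class_names[i] for i in labels.tolist()]

if __name__ == "__main__":
    names = ["car", "pedestrian", "traffic cone", "vegetation"]
    voxel_feats = torch.randn(5, 64)            # dummy per-voxel features
    text_embeds = torch.randn(len(names), 64)   # dummy text embeddings
    print(open_vocab_occupancy(voxel_feats, text_embeds, names))
```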