Segment Anything with Multiple Modalities

Aoran Xiao, Weihao Xuan, Heli Qi, Yun Xing, Naoto Yokoya, Shijian Lu

2024-08-20

Summary

This paper introduces MM-SAM, an extension of the Segment Anything Model (SAM) that handles sensor modalities beyond RGB, such as LiDAR, depth, and thermal data, for more robust scene segmentation.

What's the problem?

The original SAM was designed for single-modal RGB images, so it cannot directly handle data from other sensors such as LiDAR, depth, or thermal cameras, nor fuse multiple modalities together. This limits its usefulness with the multi-sensor suites common in real-world applications.

What's the solution?

MM-SAM extends SAM with two key designs: unsupervised cross-modal transfer, which adapts the model to individual non-RGB sensors, and weakly-supervised multi-modal fusion, which combines data from multiple sensors for stronger segmentation. Both designs are label-efficient and parameter-efficient, so MM-SAM can be adapted to new sensor modalities without dense mask annotations, as sketched below.
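To make the two ideas concrete, here is a minimal PyTorch sketch of what "adapt a frozen RGB encoder to a new modality, then fuse the two feature streams" could look like. All module names (FrozenRGBEncoder, XModalAdapter, GatedFusion), shapes, and losses are illustrative assumptions, not the authors' architecture; the real MM-SAM adapts the actual SAM encoder and decoder.

```python
# Conceptual sketch only: frozen RGB encoder + lightweight non-RGB adapter
# (cross-modal transfer) + a gated fusion of the two feature maps
# (multi-modal fusion). Not the official MM-SAM implementation.

import torch
import torch.nn as nn


class FrozenRGBEncoder(nn.Module):
    """Stand-in for SAM's pretrained RGB image encoder (kept frozen)."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.backbone = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)
        for p in self.parameters():
            p.requires_grad = False  # frozen: parameter-efficient adaptation

    def forward(self, rgb: torch.Tensor) -> torch.Tensor:
        return self.backbone(rgb)  # (B, C, H/16, W/16) feature map


class XModalAdapter(nn.Module):
    """Lightweight encoder for a non-RGB modality (e.g. thermal or depth),
    trained so its features align with the frozen RGB features on paired,
    unlabeled data -- a simple stand-in for unsupervised cross-modal transfer."""

    def __init__(self, in_channels: int, embed_dim: int = 256):
        super().__init__()
        self.proj = nn.Conv2d(in_channels, embed_dim, kernel_size=16, stride=16)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)


class GatedFusion(nn.Module):
    """Per-location gate that blends RGB and non-RGB features; in a
    weakly-supervised setup the gate would be trained from a task signal
    rather than dense ground-truth masks."""

    def __init__(self, embed_dim: int = 256):
        super().__init__()
        self.gate = nn.Conv2d(2 * embed_dim, 1, kernel_size=1)

    def forward(self, f_rgb: torch.Tensor, f_x: torch.Tensor) -> torch.Tensor:
        w = torch.sigmoid(self.gate(torch.cat([f_rgb, f_x], dim=1)))
        return w * f_rgb + (1 - w) * f_x


if __name__ == "__main__":
    rgb_enc, x_enc, fusion = FrozenRGBEncoder(), XModalAdapter(1), GatedFusion()
    rgb = torch.randn(2, 3, 256, 256)      # RGB frame
    thermal = torch.randn(2, 1, 256, 256)  # paired non-RGB frame

    f_rgb, f_x = rgb_enc(rgb), x_enc(thermal)

    # Cross-modal transfer: pull non-RGB features toward frozen RGB features.
    transfer_loss = nn.functional.mse_loss(f_x, f_rgb.detach())

    fused = fusion(f_rgb, f_x)  # would feed SAM's mask decoder downstream
    print(transfer_loss.item(), fused.shape)
```

The key design choice illustrated here is that only the small adapter and fusion modules are trainable while the pretrained RGB encoder stays frozen, which is what makes the adaptation both parameter-efficient and feasible without large labeled multi-modal datasets.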

Why it matters?

This research matters because it makes segmentation models more versatile across real-world sensor setups. By processing and fusing multi-modal data directly, MM-SAM can benefit fields such as autonomous driving, robotics, and environmental monitoring.

Abstract

Robust and accurate segmentation of scenes has become one core functionality in various visual recognition and navigation tasks. This has inspired the recent development of Segment Anything Model (SAM), a foundation model for general mask segmentation. However, SAM is largely tailored for single-modal RGB images, limiting its applicability to multi-modal data captured with widely-adopted sensor suites, such as LiDAR plus RGB, depth plus RGB, thermal plus RGB, etc. We develop MM-SAM, an extension and expansion of SAM that supports cross-modal and multi-modal processing for robust and enhanced segmentation with different sensor suites. MM-SAM features two key designs, namely, unsupervised cross-modal transfer and weakly-supervised multi-modal fusion, enabling label-efficient and parameter-efficient adaptation toward various sensor modalities. It addresses three main challenges: 1) adaptation toward diverse non-RGB sensors for single-modal processing, 2) synergistic processing of multi-modal data via sensor fusion, and 3) mask-free training for different downstream tasks. Extensive experiments show that MM-SAM consistently outperforms SAM by large margins, demonstrating its effectiveness and robustness across various sensors and data modalities.