SegEarth-OV3: Exploring SAM 3 for Open-Vocabulary Semantic Segmentation in Remote Sensing Images
Kaiyu Li, Shengqi Zhang, Yupeng Deng, Zhi Wang, Deyu Meng, Xiangyong Cao
2025-12-10
Summary
This paper explores using a new AI model called Segment Anything Model 3 (SAM 3) for identifying objects in satellite and aerial images, without needing to train the model specifically for this task.
What's the problem?
Current methods for automatically labeling objects in images, especially detailed satellite images, often struggle to precisely localize small or densely packed objects. Many approaches also require complicated pipelines with multiple separate modules, making them difficult to use. In addition, existing methods rely heavily on a model called CLIP, which isn't always well suited to the specific challenges of remote sensing data.
What's the solution?
The researchers tested SAM 3 on remote sensing images without any additional training. They fused the outputs of two parts of SAM 3, one that labels broad regions (the semantic segmentation head) and one that outlines individual objects (the instance head), to combine the strengths of both and achieve better land coverage. They also used SAM 3's presence score, an estimate of whether a category actually appears in the scene, to filter out labels for absent categories. This matters because SAM 3's vast vocabulary would otherwise produce many false positives.
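The two steps above, fusing the two heads' masks and then suppressing absent categories with the presence score, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the maximum-based fusion rule, and the presence threshold of 0.5 are all assumptions made here for clarity.

```python
import numpy as np

def fuse_and_filter(semantic_probs, instance_masks, instance_labels,
                    presence_scores, presence_thresh=0.5):
    """Sketch of a SAM-3-style fusion + presence filter (hypothetical).

    semantic_probs : (C, H, W) per-category probabilities, semantic head
    instance_masks : (N, H, W) binary masks, instance head
    instance_labels: (N,) category index of each instance mask
    presence_scores: (C,) scene-level confidence that each category exists
    """
    fused = semantic_probs.copy()

    # Overlay instance masks on the semantic map: where an instance mask
    # fires, raise that category's probability to (at least) 1.0.
    for mask, label in zip(instance_masks, instance_labels):
        fused[label] = np.maximum(fused[label], mask.astype(fused.dtype))

    # Suppress categories the presence head deems absent from the scene,
    # which removes false positives from the large open vocabulary.
    fused[presence_scores < presence_thresh] = 0.0

    # Per-pixel argmax over the surviving categories gives the final map.
    return fused.argmax(axis=0)
```

In this toy fusion rule, the instance head can only reinforce the semantic head; the real method may weight or merge the two heads differently.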
Why it matters?
This work shows that SAM 3 has the potential to be a powerful tool for automatically understanding satellite and aerial imagery. Because it doesn't require specific training, it could be used to quickly and easily analyze images for various applications like mapping, environmental monitoring, and urban planning.
Abstract
Most existing methods for training-free Open-Vocabulary Semantic Segmentation (OVSS) are based on CLIP. While these approaches have made progress, they often face challenges in precise localization or require complex pipelines to combine separate modules, especially in remote sensing scenarios where numerous dense and small targets are present. Recently, Segment Anything Model 3 (SAM 3) was proposed, unifying segmentation and recognition in a promptable framework. In this paper, we present a preliminary exploration of applying SAM 3 to the remote sensing OVSS task without any training. First, we implement a mask fusion strategy that combines the outputs from SAM 3's semantic segmentation head and the Transformer decoder (instance head). This allows us to leverage the strengths of both heads for better land coverage. Second, we utilize the presence score from the presence head to filter out categories that do not exist in the scene, reducing false positives caused by the vast vocabulary sizes and patch-level processing in geospatial scenes. We evaluate our method on extensive remote sensing datasets. Experiments show that this simple adaptation achieves promising performance, demonstrating the potential of SAM 3 for remote sensing OVSS. Our code is released at https://github.com/earth-insights/SegEarth-OV-3.