
Toward Stable Semi-Supervised Remote Sensing Segmentation via Co-Guidance and Co-Fusion

Yi Zhou, Xuechao Zou, Shun Zhang, Kai Li, Shiying Wang, Jingming Chen, Congyan Lang, Tengfei Cao, Pin Tao, Yuanchun Shi

2026-01-06

Summary

This paper introduces a new method, called Co2S, for improving how we automatically identify objects in satellite or aerial images. It focuses on 'semantic segmentation,' which means labeling every pixel in an image with what it represents – like 'building,' 'road,' or 'forest.' The goal is to do this with less manual labeling, which is time-consuming and expensive.

What's the problem?

Currently, when a computer is trained for this task with only a small amount of labeled data (a setting called 'semi-supervised learning'), a problem called 'pseudo-label drift' occurs. Imagine you're teaching a friend something and they misunderstand; if you keep building on that misunderstanding, their errors just keep growing. That's what happens here: the model makes initial guesses on unlabeled data (so-called pseudo-labels) and then trains on those guesses, so wrong guesses reinforce themselves during training and lead to increasingly inaccurate results.
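To make the snowball effect concrete, here is a toy numeric illustration (my own, not from the paper): if a model keeps retraining on its own pseudo-labels, a modest initial error rate can compound each round instead of shrinking. The 10% starting error and 1.5x growth factor are made-up numbers chosen only to show the shape of the problem.

```python
# Toy illustration of pseudo-label drift (hypothetical numbers, not from the paper).
# Each round, the model trains on its own pseudo-labels, so label errors in the
# training targets carry over and are assumed to amplify by a constant factor.
error = 0.10  # assume 10% of the initial pseudo-labels are wrong
for rnd in range(4):
    error = min(1.0, error * 1.5)  # assumed amplification per self-training round
    print(f"round {rnd}: ~{error:.0%} of pseudo-labels wrong")
```

The exact factor is fictional; the point is that without an external correction signal, self-training has no mechanism to pull the error back down, which is what Co2S's heterogeneous students are meant to provide.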

What's the solution?

Co2S tackles this problem by using two different 'student' models, both built on Vision Transformers (ViTs), a powerful image-understanding architecture. One student is initialized from CLIP, a model trained to link images with text descriptions; the other is initialized from DINOv3, a model trained to understand images on their own (self-supervised). They learn from each other in a clever way: the text-based side provides a broad, class-level idea of what things *should* look like, while the self-supervised side captures fine visual details. A separate fusion step then combines information from large areas (global context) with small details (local features) to make more accurate predictions. Together, these pieces prevent errors from building up and keep the model on track.
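The dual-student idea above can be sketched in a few lines. This is a minimal hypothetical toy, not the paper's implementation: the function names (`pseudo_labels`, `fuse`), the confidence threshold, the blend weight, and the random "predictions" are all my own stand-ins. It shows the two core moves in miniature: each student is supervised on the *other* student's confident pseudo-labels, and two feature maps (one global, one local) are blended before predicting.

```python
import numpy as np

def pseudo_labels(probs, threshold=0.7):
    """Keep only confident per-pixel predictions as pseudo-labels; -1 means 'ignore'."""
    conf = probs.max(axis=-1)          # confidence of the best class per pixel
    labels = probs.argmax(axis=-1)     # best class per pixel
    labels[conf < threshold] = -1      # drop low-confidence pixels from supervision
    return labels

def fuse(global_feat, local_feat, alpha=0.5):
    """Toy global-local fusion: a simple weighted blend of two feature maps."""
    return alpha * global_feat + (1 - alpha) * local_feat

# Fake per-pixel class probabilities from two students (a 2x2 image, 3 classes).
rng = np.random.default_rng(0)
student_a = rng.dirichlet([1, 1, 1], size=(2, 2))  # stand-in for the CLIP-initialized student
student_b = rng.dirichlet([1, 1, 1], size=(2, 2))  # stand-in for the DINOv3-initialized student

# Cross-supervision: each student trains on the OTHER's confident pseudo-labels,
# so two heterogeneous students can correct rather than amplify each other's drift.
targets_for_a = pseudo_labels(student_b)
targets_for_b = pseudo_labels(student_a)
print(targets_for_a.shape, targets_for_b.shape)  # (2, 2) (2, 2)
```

The key design choice this sketches is heterogeneity: because the two students start from different pretraining (vision-language vs. self-supervised), they tend to make *different* mistakes, so one student's wrong guess is less likely to be confirmed by the other.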

Why it matters?

This research is important because accurately labeling satellite and aerial images has many real-world applications, like urban planning, environmental monitoring, and disaster response. Reducing the need for manual labeling makes these applications more practical and affordable. Co2S demonstrates a significant improvement in performance compared to existing methods, meaning we can get more reliable results with less human effort.

Abstract

Semi-supervised remote sensing (RS) image semantic segmentation offers a promising solution to alleviate the burden of exhaustive annotation, yet it fundamentally struggles with pseudo-label drift, a phenomenon where confirmation bias leads to the accumulation of errors during training. In this work, we propose Co2S, a stable semi-supervised RS segmentation framework that synergistically fuses priors from vision-language models and self-supervised models. Specifically, we construct a heterogeneous dual-student architecture comprising two distinct ViT-based vision foundation models initialized with pretrained CLIP and DINOv3 to mitigate error accumulation and pseudo-label drift. To effectively incorporate these distinct priors, an explicit-implicit semantic co-guidance mechanism is introduced that utilizes text embeddings and learnable queries to provide explicit and implicit class-level guidance, respectively, thereby jointly enhancing semantic consistency. Furthermore, a global-local feature collaborative fusion strategy is developed to effectively fuse the global contextual information captured by CLIP with the local details produced by DINOv3, enabling the model to generate highly precise segmentation results. Extensive experiments on six popular datasets demonstrate the superiority of the proposed method, which consistently achieves leading performance across various partition protocols and diverse scenarios. Project page is available at https://xavierjiezou.github.io/Co2S/.