UAGLNet: Uncertainty-Aggregated Global-Local Fusion Network with Cooperative CNN-Transformer for Building Extraction
Siyuan Yao, Dongxiu Liu, Taotao Li, Shengjie Li, Wenqi Ren, Xiaochun Cao
2025-12-17
Summary
This paper focuses on improving how computers automatically identify and outline buildings in images taken from above, like satellite photos.
What's the problem?
Currently, identifying buildings in these images is difficult because buildings come in so many different shapes and sizes. Existing computer programs struggle to combine information from both close-up details and the bigger picture, and often have trouble with blurry or unclear areas, leading to inaccurate building outlines.
What's the solution?
The researchers created a new system called UAGLNet. It combines two main techniques: first, it uses both convolutional neural networks (which excel at capturing fine local details) and transformers (which excel at capturing global context), so the model sees both the close-up structure of a building and the bigger picture around it. Second, it explicitly estimates how *uncertain* the model is about each part of the image, and uses that uncertainty to refine the final building outlines, making them more precise in tricky areas such as blurry edges. In effect, they built a system that 'knows what it doesn't know' and adjusts accordingly.
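To get an intuition for the uncertainty idea, here is a minimal sketch (not the paper's actual implementation; the real model is at the GitHub repository linked below). For a binary building/background probability map, a common pixel-wise uncertainty measure is the binary entropy, which peaks where the predicted probability is near 0.5:

```python
import numpy as np

def binary_entropy(p, eps=1e-8):
    """Pixel-wise binary entropy: high where the model is unsure (p near 0.5)."""
    p = np.clip(p, eps, 1.0 - eps)
    return -(p * np.log(p) + (1.0 - p) * np.log(1.0 - p))

# Toy "building probability" map: a confident background pixel, a confident
# building pixel, and two ambiguous pixels (e.g., on a blurry building edge).
probs = np.array([0.02, 0.98, 0.55, 0.45])
uncertainty = binary_entropy(probs)

# The ambiguous pixels get the highest uncertainty scores, so a decoder
# could focus extra refinement effort there.
print(uncertainty.round(3))
```

How the paper's Uncertainty-Aggregated Decoder actually computes and aggregates uncertainty is specific to their architecture; this only illustrates why an explicit uncertainty map highlights exactly the ambiguous regions where naive segmentation goes wrong.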
Why it matters?
This research is important because accurately identifying buildings from aerial images has many real-world applications, such as urban planning, disaster response, and map creation. A more accurate system means better data for these applications, leading to more informed decisions and more effective responses to real-world challenges.
Abstract
Building extraction from remote sensing images is a challenging task due to the complex structural variations of buildings. Existing methods employ convolutional or self-attention blocks to capture multi-scale features in segmentation models, but the inherent gap between feature-pyramid levels and insufficient global-local feature integration lead to inaccurate, ambiguous extraction results. To address this issue, in this paper we present an Uncertainty-Aggregated Global-Local Fusion Network (UAGLNet), which is capable of exploiting high-quality global-local visual semantics under the guidance of uncertainty modeling. Specifically, we propose a novel cooperative encoder that adopts hybrid CNN and transformer layers at different stages to capture local and global visual semantics, respectively. An intermediate cooperative interaction block (CIB) is designed to narrow the gap between the local and global features as the network deepens. Afterwards, we propose a Global-Local Fusion (GLF) module to complementarily fuse the global and local representations. Moreover, to mitigate segmentation ambiguity in uncertain regions, we propose an Uncertainty-Aggregated Decoder (UAD) that explicitly estimates pixel-wise uncertainty to enhance segmentation accuracy. Extensive experiments demonstrate that our method outperforms other state-of-the-art methods. Our code is available at https://github.com/Dstate/UAGLNet
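The abstract's Global-Local Fusion idea, complementarily combining a local (CNN-derived) feature map with a global (transformer-derived) one, can be illustrated with a simple per-pixel gating scheme. This is a hypothetical sketch of the general technique, not the paper's GLF module: the gate weights `w_gate` and the convex-combination form are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(local_feat, global_feat, w_gate):
    """Fuse local and global feature maps with a per-pixel gate:
    out = g * local + (1 - g) * global, where g is predicted from both inputs.

    local_feat, global_feat: (C, H, W) arrays.
    w_gate: (2C,) weights projecting the concatenated features to one
    scalar gate value per pixel (a stand-in for a learned 1x1 conv).
    """
    stacked = np.concatenate([local_feat, global_feat], axis=0)        # (2C, H, W)
    gate = sigmoid(np.tensordot(w_gate, stacked, axes=([0], [0])))     # (H, W)
    return gate[None] * local_feat + (1.0 - gate[None]) * global_feat  # (C, H, W)

rng = np.random.default_rng(0)
C, H, W = 4, 8, 8
local = rng.standard_normal((C, H, W))
global_ = rng.standard_normal((C, H, W))
fused = gated_fusion(local, global_, rng.standard_normal(2 * C))
print(fused.shape)  # (4, 8, 8)
```

Because the gate is a per-pixel convex combination, every fused value lies between the corresponding local and global values: the network can lean on local detail at building edges and on global context in large homogeneous regions.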