Mask-to-Height: A YOLOv11-Based Architecture for Joint Building Instance Segmentation and Height Classification from Satellite Imagery
Mahmoud El Hussieni, Bahadır K. Güntürk, Hasan F. Ateş, Oğuz Hanoğlu
2025-11-03
Summary
This paper evaluates YOLOv11, a recent version of the YOLO computer vision model, on how well it can automatically identify buildings in satellite images and classify how tall they are.
What's the problem?
Automatically understanding what's in satellite images of cities is really hard, especially when it comes to buildings. We need to be able to not only find each building (instance segmentation) but also classify how tall it is. Existing methods often struggle with complex city layouts, buildings blocking each other, and accurately identifying taller, less common buildings.
What's the solution?
The researchers used YOLOv11, a recently improved version of the popular YOLO object detection system. This model is designed to be more efficient at combining information from different parts of an image, leading to more accurate building identification and height classification. They tested it on a large dataset of over 125,000 buildings across twelve different cities, measuring how well it performed using metrics like precision, recall, and a combined score called mAP (mean average precision). They specifically looked at how well it handled tricky situations like overlapping buildings and rare, tall structures.
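As a rough illustration of the evaluation metrics mentioned above (this is not the authors' code), the sketch below shows how precision, recall, and F1 are computed from matched detections, and how a detection is matched to a ground-truth building via intersection over union (IoU); at mAP@50, a detection counts as correct when its IoU with a ground-truth instance is at least 0.5:

```python
# Illustrative sketch of detection metrics (not the authors' implementation).

def iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Two 10x10 boxes overlapping by half share 50 of 150 units of area:
print(iou((0, 0, 10, 10), (5, 0, 15, 10)))      # -> 0.333...
print(precision_recall_f1(tp=80, fp=20, fn=40))  # -> (0.8, 0.666..., 0.727...)
```

mAP then averages precision over recall levels and (for mAP@50–95) over a range of IoU thresholds, but the per-detection matching above is the core ingredient.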
Why it matters?
This work is important because accurately mapping buildings and their heights is crucial for things like city planning, creating 3D models of cities, and monitoring infrastructure. YOLOv11's ability to quickly and accurately perform this task, even in complex urban environments, makes it a valuable tool for creating up-to-date and detailed maps, and could help with things like disaster response and urban development.
Abstract
Accurate building instance segmentation and height classification are critical for urban planning, 3D city modeling, and infrastructure monitoring. This paper presents a detailed analysis of YOLOv11, the recent advancement in the YOLO series of deep learning models, focusing on its application to joint building extraction and discrete height classification from satellite imagery. YOLOv11 builds on the strengths of earlier YOLO models by introducing a more efficient architecture that better combines features at different scales, improves object localization accuracy, and enhances performance in complex urban scenes. Using the DFC2023 Track 2 dataset, which includes over 125,000 annotated buildings across 12 cities, we evaluate YOLOv11's performance using metrics such as precision, recall, F1 score, and mean average precision (mAP). Our findings demonstrate that YOLOv11 achieves strong instance segmentation performance with 60.4% mAP@50 and 38.3% mAP@50–95 while maintaining robust classification accuracy across five predefined height tiers. The model excels in handling occlusions, complex building shapes, and class imbalance, particularly for rare high-rise structures. Comparative analysis confirms that YOLOv11 outperforms earlier multitask frameworks in both detection accuracy and inference speed, making it well-suited for real-time, large-scale urban mapping. This research highlights YOLOv11's potential to advance semantic urban reconstruction through streamlined categorical height modeling, offering actionable insights for future developments in remote sensing and geospatial intelligence.
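The "discrete height classification" in the abstract amounts to binning a continuous building height into one of five categories. A minimal sketch of that idea is below; note that the tier boundaries and labels here are hypothetical examples, since the abstract does not state the paper's actual thresholds:

```python
# Hedged sketch: mapping a continuous building height (meters) to one of
# five discrete tiers, in the spirit of the paper's categorical height
# modeling. The edges and names below are HYPOTHETICAL illustrations;
# the paper's real thresholds are not given in this abstract.
import bisect

TIER_EDGES = [6.0, 12.0, 24.0, 48.0]   # hypothetical upper bounds (m)
TIER_NAMES = ["low", "mid-low", "mid", "mid-high", "high-rise"]

def height_tier(height_m: float) -> str:
    """Return the label of the tier whose range contains height_m."""
    return TIER_NAMES[bisect.bisect_right(TIER_EDGES, height_m)]

print(height_tier(4.0))    # -> "low"
print(height_tier(60.0))   # -> "high-rise"
```

Framing height as classification over such tiers, rather than regression to meters, is what lets a detector like YOLOv11 treat height as just another per-instance class, at the cost of coarser output.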