
SARChat-Bench-2M: A Multi-Task Vision-Language Benchmark for SAR Image Interpretation

Zhiming Ma, Xiayang Xiao, Sihao Dong, Peidong Wang, HaiPeng Wang, Qingyun Pan

2025-02-13


Summary

This paper introduces SARChat-Bench-2M, a new dataset and benchmark for helping AI understand and interpret synthetic aperture radar (SAR) images. It's like creating a huge picture book with special radar images and detailed explanations to teach AI how to 'see' and understand these complex images.

What's the problem?

AI has gotten really good at understanding regular photos and text, but it struggles with specialized images like SAR, which are used in things like satellite imaging and remote sensing. This is because AI hasn't had enough examples of SAR images to learn from, especially ones with good explanations of what's in the image.

What's the solution?

The researchers created SARChat-2M, a massive collection of about 2 million SAR images paired with detailed text explanations. This dataset covers lots of different scenarios and includes precise information about the objects in each image. They used this to test 16 different AI models on tasks like understanding what's in the SAR images and finding specific objects. They also made sure their dataset could be used to teach AI how to have conversations about SAR images.
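To make the benchmarking idea more concrete, here is a minimal sketch of what one SAR image-text record and a simple exact-match scoring loop could look like. The field names and the `predict` callable are illustrative assumptions, not the actual SARChat-2M schema or evaluation code.

```python
# A minimal sketch (hypothetical field names; the real SARChat-2M schema may
# differ) of one SAR image-text record and a per-task accuracy check.

from typing import Callable

# Hypothetical record: a SAR image path, the task it belongs to,
# a prompt, and the reference answer used for scoring.
sample_record = {
    "image": "images/ship_0001.png",      # path to a SAR image chip
    "task": "classification",             # e.g. classification, detection, captioning
    "question": "What category of target appears in this SAR image?",
    "answer": "ship",
}

def evaluate_classification(records: list[dict],
                            predict: Callable[[str, str], str]) -> float:
    """Exact-match accuracy for classification-style records.

    `predict(image_path, question)` stands in for any VLM inference call.
    """
    correct = 0
    for rec in records:
        pred = predict(rec["image"], rec["question"])
        correct += int(pred.strip().lower() == rec["answer"].lower())
    return correct / len(records) if records else 0.0

if __name__ == "__main__":
    # Dummy model that always answers "ship", just to show the loop runs.
    dummy_predict = lambda image, question: "ship"
    print(evaluate_classification([sample_record], dummy_predict))
```

In practice, `predict` would wrap one of the 16 evaluated VLMs, and each task (classification, detection, captioning, and so on) would use its own metric rather than exact match.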

Why it matters?

This matters because SAR images are super important for things like monitoring the environment, urban planning, and even military operations. By helping AI understand these images better, we could make big improvements in how we use satellite data. It could lead to faster and more accurate analysis of SAR images, which could help with everything from predicting natural disasters to tracking changes in forests or cities over time. Plus, the way they made this dataset could be used as a model for teaching AI about other types of specialized images in the future.

Abstract

In the field of synthetic aperture radar (SAR) remote sensing image interpretation, although vision-language models (VLMs) have made remarkable progress in natural language processing and image understanding, their applications remain limited in professional domains due to insufficient domain expertise. This paper innovatively proposes the first large-scale multimodal dialogue dataset for SAR images, named SARChat-2M, which contains approximately 2 million high-quality image-text pairs and encompasses diverse scenarios with detailed target annotations. This dataset not only supports several key tasks such as visual understanding and object detection, but also has unique innovative aspects: this study develops a visual-language dataset and benchmark for the SAR domain, enabling and evaluating VLMs' capabilities in SAR image interpretation, which provides a paradigmatic framework for constructing multimodal datasets across various remote sensing vertical domains. Through experiments on 16 mainstream VLMs, the effectiveness of the dataset has been fully verified, and the first multi-task dialogue benchmark in the SAR field has been successfully established. The project will be released at https://github.com/JimmyMa99/SARChat, aiming to promote the in-depth development and wide application of SAR visual language models.
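As a rough illustration of the "paradigmatic framework" idea, the sketch below shows one way detection-style target annotations could be converted into dialogue-format image-text pairs. The annotation fields and conversation layout are assumptions for illustration only, not the paper's actual construction pipeline.

```python
# A minimal sketch, under assumed annotation fields, of turning per-image SAR
# target annotations into a dialogue-style image-text pair.
# Field names ("category", "bbox") are illustrative, not the paper's schema.

def annotation_to_dialogue(image_path: str, targets: list[dict]) -> dict:
    """Build one instruction/response pair from per-image target annotations."""
    lines = [
        f"- {t['category']} at box {tuple(t['bbox'])}"  # bbox as (x, y, w, h)
        for t in targets
    ]
    return {
        "image": image_path,
        "conversation": [
            {"role": "user",
             "content": "List every target visible in this SAR image with its bounding box."},
            {"role": "assistant",
             "content": "\n".join(lines) if lines else "No targets are visible."},
        ],
    }

if __name__ == "__main__":
    demo = annotation_to_dialogue(
        "images/harbor_0042.png",
        [{"category": "ship", "bbox": [112, 80, 34, 18]},
         {"category": "ship", "bbox": [201, 150, 40, 22]}],
    )
    print(demo["conversation"][1]["content"])
```

Repeating this kind of conversion across tasks (captioning, counting, grounding) is how a detection-style corpus can be reshaped into the multi-task dialogue data that VLMs are trained and evaluated on.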