DynamicVis: An Efficient and General Visual Foundation Model for Remote Sensing Image Understanding

Keyan Chen, Chenyang Liu, Bowen Chen, Wenyuan Li, Zhengxia Zou, Zhenwei Shi

2025-03-25

Summary

This paper introduces a new AI model that's really good at understanding satellite images, even when those images are extremely detailed and show lots of different things.

What's the problem?

Existing AI models struggle with these highly detailed satellite images because the images are too large and complex to process efficiently. The models also have trouble adapting to different tasks, like identifying different types of objects in the images.

What's the solution?

The researchers created a new AI model called DynamicVis that's specifically designed to handle these challenges. Inspired by how human vision works, it focuses on the important parts of an image while skipping less informative regions, which keeps it efficient even on very large images. It's also trained to adapt to many different tasks.

Why does it matter?

This work matters because it can help us better understand our planet using satellite imagery, which is important for things like monitoring the environment, managing resources, and responding to disasters.

Abstract

The advancement of remote sensing technology has improved the spatial resolution of satellite imagery, facilitating more detailed visual representations for diverse interpretations. However, existing methods exhibit limited generalization capabilities across varied applications. While some contemporary foundation models demonstrate potential, they are hindered by insufficient cross-task adaptability and primarily process low-resolution imagery of restricted sizes, thus failing to fully exploit high-resolution data or leverage comprehensive large-scene semantics. Crucially, remote sensing imagery differs fundamentally from natural images, as key foreground targets (e.g., maritime objects, artificial structures) often occupy minimal spatial proportions (~1%) and exhibit sparse distributions. Efficiently modeling cross-task generalizable knowledge from lengthy 2D tokens (~100,000) poses a significant challenge yet remains critical for remote sensing image understanding. Motivated by the selective attention mechanisms inherent to the human visual system, we propose DynamicVis, a dynamic visual perception foundation model for remote sensing imagery. The framework integrates a novel dynamic region perception backbone based on the selective state space model, which strategically balances localized detail extraction with global contextual integration, enabling computationally efficient encoding of large-scale data while maintaining architectural scalability. To enhance cross-task knowledge transfer, we introduce a multi-instance learning paradigm utilizing meta-embedding representations, trained on million-scale region-level annotations. Evaluations across nine downstream tasks demonstrate the model's versatility. DynamicVis achieves multi-level feature modeling with exceptional efficiency, processing 2048x2048-pixel inputs with 97 ms latency (6% of ViT's) and 833 MB GPU memory (3% of ViT's).
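To make the core idea more concrete, here is a toy sketch of dynamic token selection: score all patch tokens, keep only a small salient fraction, and route them through a state-space recurrence. This mirrors the general shape of the dynamic region perception backbone described in the abstract, but the scoring function (an L2-norm saliency proxy), the keep ratio, and the fixed-decay recurrence are all illustrative assumptions, not the paper's actual design (the real model learns its scoring and uses an input-dependent selective SSM).

```python
import numpy as np

def select_tokens(tokens, keep_ratio=0.05):
    """Keep only the highest-scoring fraction of patch tokens.

    Hypothetical scorer: L2 norm as a saliency proxy (sparse, bright
    foreground targets tend to produce large activations).
    """
    scores = np.linalg.norm(tokens, axis=-1)        # (N,) saliency per token
    k = max(1, int(len(tokens) * keep_ratio))
    keep_idx = np.sort(np.argsort(scores)[-k:])     # top-k, in spatial order
    return keep_idx, tokens[keep_idx]

def ssm_scan(x, decay=0.9):
    """Minimal linear state-space recurrence over the kept tokens:
    h_t = decay * h_{t-1} + x_t. A stand-in for the selective SSM
    block, which would make the decay input-dependent."""
    h = np.zeros(x.shape[-1])
    out = []
    for x_t in x:
        h = decay * h + x_t
        out.append(h.copy())
    return np.stack(out)

# A 16x16 grid of 32-dim patch tokens; ~1% are "foreground" (large norm),
# echoing the sparse-target statistics the abstract describes.
rng = np.random.default_rng(0)
tokens = rng.normal(scale=0.1, size=(256, 32))
tokens[[5, 77, 200]] *= 50                          # three sparse targets
idx, kept = select_tokens(tokens, keep_ratio=0.05)
features = ssm_scan(kept)
print(idx, kept.shape, features.shape)
```

With a 5% keep ratio, only 12 of the 256 tokens reach the sequence model, and the three injected foreground tokens are among them; this token-dropping is where the latency and memory savings over a full-attention ViT would come from in a scheme like this.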