TerraScope: Pixel-Grounded Visual Reasoning for Earth Observation

Yan Shu, Bin Ren, Zhitong Xiong, Xiao Xiang Zhu, Begüm Demir, Nicu Sebe, Paolo Rota

2026-03-23

Summary

This paper introduces TerraScope, a new vision-language model designed to better understand Earth observation imagery, such as satellite photos, and answer questions about it with pixel-level precision.

What's the problem?

Current vision-language models aren't very good at tasks that require knowing precisely *where* things are in an Earth observation image. They struggle to connect what they 'see' to specific pixels, and they also have trouble tracking changes over time across multiple images. Basically, they can say that something is happening, but not *exactly* where or how it changed.

What's the solution?

The researchers created TerraScope, a model that can work with different types of Earth imagery (such as regular optical photos or radar images) and combine information from both when available. It can also analyze a series of images taken at different times to understand change. To train the model, they built Terra-CoT, a large-scale dataset of one million examples, along with a new benchmark, TerraScope-Bench, that specifically measures how well a model can pinpoint locations and track changes accurately.

Why it matters?

This work is important because it improves our ability to automatically analyze Earth observation data. This has lots of real-world applications, like monitoring deforestation, tracking urban growth, responding to natural disasters, and understanding climate change, all with more precise and reliable information.

Abstract

Vision-language models (VLMs) have shown promise in Earth observation (EO), yet they struggle with tasks that require grounding complex spatial reasoning in precise pixel-level visual representations. To address this problem, we introduce TerraScope, a unified VLM that delivers pixel-grounded geospatial reasoning with two key capabilities: (1) modality-flexible reasoning: it handles single-modality inputs (optical or SAR) and adaptively fuses different modalities into the reasoning process when both are available; (2) multi-temporal reasoning: it integrates temporal sequences for change analysis across multiple time points. In addition, we curate Terra-CoT, a large-scale dataset containing 1 million samples with pixel-level masks embedded in reasoning chains across multiple sources. We also propose TerraScope-Bench, the first benchmark for pixel-grounded geospatial reasoning, with six sub-tasks evaluating both answer accuracy and mask quality to ensure authentic pixel-grounded reasoning. Experiments show that TerraScope significantly outperforms existing VLMs on pixel-grounded geospatial reasoning while providing interpretable visual evidence.
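To make the "modality-flexible reasoning" idea concrete, here is a minimal toy sketch of the input-handling behavior the abstract describes: accept optical only, SAR only, or both, and fuse when both are present. This is purely illustrative; the function name and the simple averaging fusion are assumptions, not the paper's actual learned fusion module.

```python
import numpy as np

def fuse_modalities(optical=None, sar=None):
    """Toy sketch of modality-flexible input handling (hypothetical).

    - If only one modality's feature vector is given, pass it through.
    - If both are given, fuse them (here: a plain average; the real
      model would use a learned fusion mechanism instead).
    """
    feats = [f for f in (optical, sar) if f is not None]
    if not feats:
        raise ValueError("at least one modality (optical or SAR) is required")
    return np.mean(feats, axis=0)

# Example with dummy 4-dim "features"
opt = np.ones(4)    # stand-in for optical image features
sar = np.zeros(4)   # stand-in for SAR image features
fused = fuse_modalities(optical=opt, sar=sar)   # averages the two
single = fuse_modalities(optical=opt)           # passes optical through
```

The point is only the control flow: the same entry point serves single- and dual-modality inputs, which is what lets one model cover both settings.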