Asking like Socrates: Socrates helps VLMs understand remote sensing images
Run Shao, Ziyu Li, Zhaoyang Zhang, Linrui Xu, Xinran He, Hongyuan Yuan, Bolei He, Yongxing Dai, Yiming Yan, Yijun Chen, Wang Guo, Haifeng Li
2025-12-02
Summary
This paper focuses on improving how AI models understand and reason about images from satellites and aerial views, specifically addressing a problem where they often 'hallucinate' reasoning instead of actually looking at the image for answers.
What's the problem?
Current AI models are good at combining images and text, but they often struggle with detailed remote sensing images such as satellite photos. They tend to glance quickly at the image, form a general impression, and then generate an answer based on what *sounds* logical rather than what is actually visible. The authors call this 'pseudo reasoning' and trace it to the 'Glance Effect': the image is so large and complex that the model never fully analyzes it before answering.
What's the solution?
The researchers developed a new approach called RS-EoT (Remote Sensing Evidence-of-Thought). This system forces the AI to think like a detective: iteratively asking questions, searching the image for specific visual evidence to support its reasoning, and then refining its answer. To teach this behavior, they built 'SocraticAgent', a self-play system in which two AI agents work against each other, one proposing a reasoning step and the other challenging it to produce visual proof. They then applied a two-stage training process with reinforcement learning: first training on grounding tasks (pinpointing objects in images), then generalizing that skill to answering broader questions about the images.
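The alternating reason-then-verify cycle described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the paper's implementation: the two agent roles and all helper functions (`propose_step`, `seek_evidence`, `socratic_loop`) are hypothetical stand-ins for calls to a vision-language model.

```python
# Minimal sketch of an iterative evidence-seeking loop (hypothetical,
# not the RS-EoT implementation). One agent proposes reasoning steps;
# a second agent challenges each step by looking for visual proof.

def propose_step(question, evidence):
    """'Reasoner' agent: propose the next reasoning step (stubbed)."""
    return f"hypothesis grounded in {len(evidence)} piece(s) of evidence"

def seek_evidence(image, claim):
    """'Inspector' agent: challenge the claim by searching the image for
    supporting visual proof (stubbed: always finds a region)."""
    return {"region": (10, 20, 50, 60), "supports_claim": True}

def socratic_loop(image, question, max_rounds=3):
    evidence = []
    for _ in range(max_rounds):
        claim = propose_step(question, evidence)   # reasoning turn
        proof = seek_evidence(image, claim)        # inspection turn
        if not proof["supports_claim"]:
            continue  # claim rejected: reason again without this step
        evidence.append(proof)                     # claim is now grounded
    # final answer must cite the accumulated visual evidence
    return {"answer": propose_step(question, evidence), "evidence": evidence}

result = socratic_loop(image="scene.tif", question="How many ships?")
print(len(result["evidence"]))  # one grounded claim per round: 3
```

The key design point the sketch tries to capture is that the answer is produced only after each intermediate claim has been checked against the image, rather than from a single glance.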
Why does it matter?
This work is important because it makes AI much more reliable when analyzing satellite and aerial imagery. This has huge implications for things like environmental monitoring, disaster response, and urban planning, where accurate interpretation of these images is crucial. By forcing the AI to actually *see* the evidence, instead of just making things up, we can trust its conclusions more and make better decisions.
Abstract
Recent multimodal reasoning models, inspired by DeepSeek-R1, have significantly advanced vision-language systems. However, in remote sensing (RS) tasks, we observe widespread pseudo reasoning: models narrate the process of reasoning rather than genuinely reason toward the correct answer based on visual evidence. We attribute this to the Glance Effect, where a single, coarse perception of large-scale RS imagery results in incomplete understanding and reasoning based on linguistic self-consistency instead of visual evidence. To address this, we propose RS-EoT (Remote Sensing Evidence-of-Thought), a language-driven, iterative visual evidence-seeking paradigm. To instill this paradigm, we propose SocraticAgent, a self-play multi-agent system that synthesizes reasoning traces via alternating cycles of reasoning and visual inspection. To enhance and generalize these patterns, we propose a two-stage progressive RL strategy: first, RL on fine-grained Grounding tasks to enhance RS-EoT capabilities, followed by RL on RS VQA to generalize to broader understanding scenarios. Experiments show RS-EoT achieves state-of-the-art performance on multiple RS VQA and grounding benchmarks. Analyses reveal clear iterative cycles of reasoning and evidence seeking, confirming RS-EoT mitigates the Glance Effect and enables genuine evidence-grounded reasoning. Our code, data, and models are available at https://geox-lab.github.io/Asking_like_Socrates