GeoVista: Web-Augmented Agentic Visual Reasoning for Geolocalization
Yikun Wang, Zuyan Liu, Ziyi Wang, Pengfei Liu, Han Hu, Yongming Rao
2025-11-24
Summary
This paper introduces a new benchmark and model for 'agentic visual reasoning': building AI that can not only 'see' and understand images but also actively investigate and solve problems using tools such as web search. The authors focus on geolocalization, the task of figuring out where a picture was taken.
What's the problem?
Current AI models that reason over images are often evaluated on simple tasks, or on tasks that don't demand real-world knowledge. Existing geolocalization benchmarks weren't challenging enough: they lacked high-quality imagery and didn't require complex reasoning. To really test an AI's ability to think and explore, you need a task that demands both visual understanding *and* the ability to gather information from the web to confirm or refine hypotheses.
What's the solution?
The researchers created a new, more difficult geolocalization benchmark called GeoBench, which includes detailed photos, panoramic views, and even satellite images. They also built a new AI model called GeoVista. GeoVista doesn't just look at a picture; it can zoom in on details and search the internet for clues. It was trained in two stages: first, supervised fine-tuning taught it basic reasoning patterns and how to use its tools; then, reinforcement learning refined those skills using a reward that encourages it to get multiple levels of geographical information right.
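The "look, zoom, search" behavior described above can be sketched as a simple act-observe loop. The tool names, signatures, and scripted rollout below are illustrative assumptions for this sketch, not GeoVista's actual interface:

```python
# Minimal sketch of an agentic tool-use loop. The tools and the scripted
# actions are hypothetical stand-ins for what the model would choose.
from typing import Callable, Dict

def zoom_in(image: str, box: tuple) -> str:
    """Hypothetical image-zoom-in tool: crop a region of interest."""
    return f"crop of {image} at {box}"

def web_search(query: str) -> str:
    """Hypothetical web-search tool: return a text snippet."""
    return f"search results for '{query}'"

TOOLS: Dict[str, Callable] = {"zoom_in": zoom_in, "web_search": web_search}

def run_agent(actions, max_steps: int = 8):
    """Interleave tool calls with reasoning until a final answer is emitted.

    `actions` stands in for the model's step-by-step decisions; each tool
    result would normally be fed back into the model's context.
    """
    observations = []
    for step, (name, args) in enumerate(actions):
        if step >= max_steps:
            break
        if name == "answer":                      # model decides to stop
            return args, observations
        observations.append(TOOLS[name](*args))   # observe the tool result
    return None, observations

# Scripted rollout standing in for model-chosen actions.
final, obs = run_agent([
    ("zoom_in", ("street.jpg", (120, 80, 320, 240))),
    ("web_search", ("blue-and-white street sign font country",)),
    ("answer", "Lisbon, Portugal"),
])
```

In the real system the loop is driven by the model's own generated tool calls rather than a fixed script; the point of the sketch is the structure: act, observe, and fold each observation back into the next reasoning step.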
Why it matters?
This work is important because it pushes AI beyond simply recognizing objects in images. It shows how AI can be built to actively *investigate* and *reason* like a human would when trying to solve a problem. GeoVista performs as well as some of the most advanced, but not publicly available, AI models, demonstrating a significant step forward in creating more capable and generally useful AI systems.
Abstract
Current research on agentic visual reasoning enables deep multimodal understanding but primarily focuses on image manipulation tools, leaving a gap toward more general-purpose agentic models. In this work, we revisit the geolocalization task, which requires not only nuanced visual grounding but also web search to confirm or refine hypotheses during reasoning. Since existing geolocalization benchmarks fail to provide the high-resolution imagery and the localization challenge needed for deep agentic reasoning, we curate GeoBench, a benchmark that includes photos and panoramas from around the world, along with a subset of satellite images of different cities, to rigorously evaluate the geolocalization ability of agentic models. We also propose GeoVista, an agentic model that seamlessly integrates tool invocation within the reasoning loop, including an image-zoom-in tool to magnify regions of interest and a web-search tool to retrieve related web information. We develop a complete training pipeline for it, including a cold-start supervised fine-tuning (SFT) stage to learn reasoning patterns and tool-use priors, followed by a reinforcement learning (RL) stage to further enhance reasoning ability. We adopt a hierarchical reward to leverage multi-level geographical information and improve overall geolocalization performance. Experimental results show that GeoVista surpasses other open-source agentic models on the geolocalization task by a large margin and achieves performance comparable to closed-source models such as Gemini-2.5-flash and GPT-5 on most metrics.
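A "hierarchical reward" over multi-level geographical information could look roughly like the following sketch. The country/city/coordinate levels, the coarse-to-fine gating, the weights, and the 25 km distance tolerance are all assumptions made for illustration; the paper's exact reward design may differ.

```python
# Sketch of a hierarchical geolocalization reward: coarser levels gate
# finer ones, so coordinate credit is only earned once country and city
# are correct. All weights and tolerances here are illustrative.
import math

def haversine(a, b):
    """Great-circle distance in km between (lat, lon) pairs in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (*a, *b))
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = math.sin(dlat / 2) ** 2 + \
        math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(h))

def hierarchical_reward(pred, gold, weights=(0.2, 0.3, 0.5)):
    """Score a prediction at country, city, and coordinate level.

    pred/gold: dicts with 'country', 'city', and 'coords' (lat, lon).
    """
    w_country, w_city, w_coord = weights
    reward = 0.0
    if pred["country"] == gold["country"]:
        reward += w_country
        if pred["city"] == gold["city"]:
            reward += w_city
            # Linearly decaying credit within a 25 km radius (assumption).
            dist_km = haversine(pred["coords"], gold["coords"])
            reward += w_coord * max(0.0, 1.0 - dist_km / 25.0)
    return reward
```

The gating is the key design point such a reward captures: partial credit for coarse correctness gives the RL stage a denser learning signal than an all-or-nothing exact-location check.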