Geolocation with Real Human Gameplay Data: A Large-Scale Dataset and Human-Like Reasoning Framework
Zirui Song, Jingpu Yang, Yuan Huang, Jonathan Tonglet, Zeyu Zhang, Tao Cheng, Meng Fang, Iryna Gurevych, Xiuying Chen
2025-02-21
Summary
This paper presents a new way to teach computers to figure out where a photo was taken, using a huge collection of data from a geography game and a step-by-step reasoning process that mimics how humans solve these puzzles.
What's the problem?
Current computer systems for guessing where photos were taken aren't very accurate and don't explain their reasoning well. The datasets used to train these systems are often small and not very good, with some photos being too easy to guess and others being too hard.
What's the solution?
The researchers created three new tools: GeoComp, a massive dataset from a geography game played by 740,000 people; GeoCoT, a new way for computers to think through location clues step-by-step like humans do; and GeoEval, a metric for testing how well these tools work. GeoComp provides millions of photos with location information, GeoCoT helps computers use clues from the photos more effectively, and GeoEval measures how accurate and understandable the results are.
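The paper does not spell out GeoCoT's individual reasoning steps here, but a step-by-step "broad to narrow" prompt chain can be sketched in a few lines. The step wording and the `build_geocot_prompts` helper below are illustrative assumptions, not the paper's actual prompts:

```python
# Hypothetical sketch of a GeoCoT-style multi-step prompt chain.
# The step descriptions are illustrative assumptions, not the paper's prompts.

GEOCOT_STEPS = [
    "Describe the overall scene: climate, terrain, vegetation.",
    "Identify regional cues: language on signs, driving side, architecture.",
    "Narrow down to a country or region consistent with those cues.",
    "Pinpoint the most likely city or area and state the final guess.",
]

def build_geocot_prompts(image_caption: str) -> list[str]:
    """Build one prompt per reasoning step, each conditioned on the image."""
    prompts = []
    for i, step in enumerate(GEOCOT_STEPS, start=1):
        prompts.append(f"Image: {image_caption}\nStep {i}: {step}")
    return prompts

prompts = build_geocot_prompts("A street with Cyrillic shop signs and birch trees")
```

In practice each step's answer would be fed back into the next prompt so the model refines its guess from continent down to city, mirroring how a human player reasons.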
Why it matters?
This matters because it could make computers much better at figuring out where photos were taken, which is useful for things like navigation, monitoring changes in different places, and preserving cultural information. The new method is up to 25% more accurate than previous ones and can explain its reasoning, making it more trustworthy and useful for real-world applications.
Abstract
Geolocation, the task of identifying an image's location, requires complex reasoning and is crucial for navigation, monitoring, and cultural preservation. However, current methods often produce coarse, imprecise, and non-interpretable localization. A major challenge lies in the quality and scale of existing geolocation datasets. These datasets are typically small-scale and automatically constructed, leading to noisy data and inconsistent task difficulty, with images that either reveal answers too easily or lack sufficient clues for reliable inference. To address these challenges, we introduce a comprehensive geolocation framework with three key components: GeoComp, a large-scale dataset; GeoCoT, a novel reasoning method; and GeoEval, an evaluation metric, collectively designed to address critical challenges and drive advancements in geolocation research. At the core of this framework is GeoComp (Geolocation Competition Dataset), a large-scale dataset collected from a geolocation game platform involving 740K users over two years. It comprises 25 million entries of metadata and 3 million geo-tagged locations spanning much of the globe, with each location annotated thousands to tens of thousands of times by human users. The dataset offers diverse difficulty levels for detailed analysis and highlights key gaps in current models. Building on this dataset, we propose Geographical Chain-of-Thought (GeoCoT), a novel multi-step reasoning framework designed to enhance the reasoning capabilities of Large Vision Models (LVMs) in geolocation tasks. GeoCoT improves performance by integrating contextual and spatial cues through a multi-step process that mimics human geolocation reasoning. Finally, using the GeoEval metric, we demonstrate that GeoCoT significantly boosts geolocation accuracy by up to 25% while enhancing interpretability.
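The abstract does not give GeoEval's formula, but geolocation accuracy is conventionally scored by the great-circle (haversine) distance between the predicted and true coordinates, counting a prediction as correct if it falls within a distance threshold. A minimal sketch of such a metric (the 25 km threshold is an illustrative assumption, not the paper's definition):

```python
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2)
    return 2 * r * math.asin(math.sqrt(a))

def accuracy_at_km(predictions, truths, threshold_km=25.0):
    """Fraction of predicted (lat, lon) pairs within threshold_km of the truth."""
    hits = sum(haversine_km(*p, *t) <= threshold_km
               for p, t in zip(predictions, truths))
    return hits / len(predictions)
```

Sweeping the threshold (e.g. city-, region-, and country-level radii) yields accuracy at several granularities, which matches the dataset's emphasis on diverse difficulty levels.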