Around the World in 80 Timesteps: A Generative Approach to Global Visual Geolocation
Nicolas Dufour, David Picard, Vicky Kalogeiton, Loic Landrieu
2024-12-10

Summary
This paper presents a new generative method for global visual geolocation, the task of predicting where on Earth an image was taken, that explicitly accounts for the uncertainty inherent in localizing images.
What's the problem?
Predicting the exact location of an image is challenging because images vary in how clearly they reveal where they were taken: a famous landmark pins down a location precisely, while a generic beach or forest could be almost anywhere. Most existing methods output a single point estimate for the location, ignoring this uncertainty and producing less reliable results.
What's the solution?
The authors propose a generative approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth's surface rather than in a flat Euclidean space. Instead of providing a single location, their method predicts a probability distribution over all possible locations, assigning a likelihood to each. This lets the model handle ambiguous images gracefully and yields a more reliable estimate of where a photo was taken. Tested on several benchmarks, the method outperformed previous techniques.
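To build intuition for what "operating directly on the Earth's surface" means, the sketch below shows geodesic (great-circle) interpolation between a random noise point on the unit sphere and a target location, which is the kind of conditional path Riemannian flow matching trains against. This is an illustrative simplification, not the paper's implementation; the `slerp` and `latlon_to_unit` helpers and the Paris target are assumptions for the example.

```python
import numpy as np

def slerp(x0, x1, t):
    """Geodesic (great-circle) interpolation between unit vectors x0 and x1."""
    omega = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return x0.copy()
    return (np.sin((1 - t) * omega) * x0 + np.sin(t * omega) * x1) / np.sin(omega)

def latlon_to_unit(lat_deg, lon_deg):
    """Map latitude/longitude (degrees) to a 3D unit vector on the sphere."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.array([np.cos(lat) * np.cos(lon),
                     np.cos(lat) * np.sin(lon),
                     np.sin(lat)])

# A noise sample (random point on the sphere) flows toward a target location.
rng = np.random.default_rng(0)
x0 = rng.normal(size=3)
x0 /= np.linalg.norm(x0)
x1 = latlon_to_unit(48.8566, 2.3522)  # example target: Paris

# Every intermediate point stays exactly on the sphere -- unlike linear
# interpolation in 3D, which would cut through the Earth's interior.
path = [slerp(x0, x1, t) for t in np.linspace(0.0, 1.0, 5)]
for p in path:
    assert np.isclose(np.linalg.norm(p), 1.0)
```

Keeping the denoising trajectory on the manifold is the key design point: the model never has to project invalid off-sphere states back onto the Earth, and distances along the path are true geodesic distances.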
Why it matters?
This research is important because it enhances how we can determine the locations of images, which has applications in mapping, tourism, and even social media. By providing a more nuanced understanding of where photos are taken, this approach can help improve technologies that rely on visual data, making them more effective and user-friendly.
Abstract
Global visual geolocation predicts where an image was captured on Earth. Since images vary in how precisely they can be localized, this task inherently involves a significant degree of ambiguity. However, existing approaches are deterministic and overlook this aspect. In this paper, we aim to close the gap between traditional geolocalization and modern generative methods. We propose the first generative geolocation approach based on diffusion and Riemannian flow matching, where the denoising process operates directly on the Earth's surface. Our model achieves state-of-the-art performance on three visual geolocation benchmarks: OpenStreetView-5M, YFCC-100M, and iNat21. In addition, we introduce the task of probabilistic visual geolocation, where the model predicts a probability distribution over all possible locations instead of a single point. We introduce new metrics and baselines for this task, demonstrating the advantages of our diffusion-based approach. Codes and models will be made available.