NAVIG: Natural Language-guided Analysis with Vision Language Models for Image Geo-localization

Zheyuan Zhang, Runze Li, Tasnim Kabir, Jordan Boyd-Graber

2025-02-21

Summary

This paper introduces NAVIG, a system that predicts where a photo was taken by combining visual clues with language-based reasoning, much as human players solve geography puzzles.

What's the problem?

Current AI models are good at recognizing what's in an image, but they're not great at figuring out exactly where the photo was taken. This is because they lack the ability to reason about cultural and geographical clues in the same way humans do. There also aren't many good datasets that show how experts think through this process.

What's the solution?

The researchers created two things to solve this problem. First, they made NaviClues, a dataset based on the game GeoGuessr, which shows how expert players figure out where photos were taken. Then, they used this dataset to create NAVIG, an AI system that combines looking at the image with thinking through clues using language, just like a human would. This helps the AI understand not just what it sees, but what that means about where the photo might have been taken.

Why does it matter?

This matters because it makes AI better at figuring out where photos were taken, which could be useful for things like organizing travel photos, helping with geography education, or even assisting in emergency situations where knowing the exact location of an image is crucial. NAVIG reduces the average distance error by 14% compared with previous state-of-the-art models, and it does so with fewer than 1000 training samples, which means it could be more efficient and easier to use in real-world applications.

Abstract

Image geo-localization is the task of predicting the specific location of an image and requires complex reasoning across visual, geographical, and cultural contexts. While prior Vision Language Models (VLMs) have the best accuracy at this task, there is a dearth of high-quality datasets and models for analytical reasoning. We first create NaviClues, a high-quality dataset derived from GeoGuessr, a popular geography game, to supply examples of expert reasoning from language. Using this dataset, we present NAVIG, a comprehensive image geo-localization framework integrating global and fine-grained image information. By reasoning with language, NAVIG reduces the average distance error by 14% compared to previous state-of-the-art models while requiring fewer than 1000 training samples. Our dataset and code are available at https://github.com/SparrowZheyuan18/Navig/.
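
The headline result above is a relative reduction in average distance error. As a rough illustration of how such a metric could be computed, the sketch below assumes great-circle (haversine) distance between predicted and true coordinates. The function names, the Earth-radius constant, and the sample error values are hypothetical and are not taken from the NAVIG code base or the paper's evaluation protocol.

```python
import math

# Assumed mean Earth radius in kilometres (not specified by the source).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def mean_distance_error_km(predictions, ground_truth):
    """Average haversine error over paired lists of (lat, lon) tuples."""
    if not predictions or len(predictions) != len(ground_truth):
        raise ValueError("need two non-empty lists of equal length")
    total = sum(
        haversine_km(p_lat, p_lon, t_lat, t_lon)
        for (p_lat, p_lon), (t_lat, t_lon) in zip(predictions, ground_truth)
    )
    return total / len(predictions)


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Made-up numbers for illustration only: a drop from 1000 km to 860 km
# average error corresponds to a 14% relative reduction.
example_reduction = relative_reduction(1000.0, 860.0)
```

Under this reading, a 14% reduction means the new model's average error is 0.86 times the baseline's; it does not mean that 14% more images are located correctly.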
