The Sonar Moment: Benchmarking Audio-Language Models in Audio Geo-Localization
Ruixing Zhang, Zihan Liu, Leilei Sun, Tongyu Zhu, Weifeng Lv
2026-01-07
Summary
This research introduces a new benchmark for testing how well audio-language models can figure out *where* a sound was recorded.
What's the problem?
Determining where a sound was recorded is much harder than inferring location from a picture, largely because there are few good datasets pairing sounds with their exact locations. Without enough reliable audio samples linked to specific places, it is difficult to train and test these kinds of models.
What's the solution?
The researchers created a new dataset called AGL1K, containing 1,444 sound clips from 72 countries and territories. To build it, they developed a measure, the Audio Localizability metric, that identifies recordings which actually carry clues about where they were made. They then tested 16 different audio-language models on this dataset to see how well each could pinpoint the origin of the sounds.
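The curation step above, scoring each recording and keeping only the localizable ones, can be sketched as a simple filtering pipeline. This is a hypothetical illustration: the `score` field stands in for the paper's Audio Localizability metric, whose actual definition is not given here, and the threshold and example clips are invented for demonstration.

```python
from dataclasses import dataclass

@dataclass
class Clip:
    clip_id: str
    country: str
    score: float  # hypothetical localizability score in [0, 1]; placeholder
                  # for the paper's actual Audio Localizability metric

def filter_localizable(clips, threshold=0.5):
    """Keep only clips whose (assumed) localizability score passes a threshold."""
    return [c for c in clips if c.score >= threshold]

clips = [
    Clip("c1", "JP", 0.9),  # e.g. a station announcement: strong linguistic clue
    Clip("c2", "US", 0.2),  # e.g. generic traffic noise: weak clue
    Clip("c3", "FR", 0.7),  # e.g. street music with speech fragments
]
kept = filter_localizable(clips)
print([c.clip_id for c in kept])  # → ['c1', 'c3']
```

In this sketch, raising the threshold trades dataset size for reliability, which mirrors the tension the authors face when extracting usable samples from a noisy crowd-sourced platform.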
Why it matters?
This work provides a standard test for audio geo-localization, so researchers can now compare different models fairly. It shows that these models *can* infer location from sound, but that some (especially those not publicly available) are much better than others. Understanding how these models reason, and where they fail, can help improve their geospatial reasoning and support applications such as public safety.
Abstract
Geo-localization aims to infer the geographic origin of a given signal. In computer vision, geo-localization has served as a demanding benchmark for compositional reasoning and is relevant to public safety. In contrast, progress on audio geo-localization has been constrained by the lack of high-quality audio-location pairs. To address this gap, we introduce AGL1K, the first audio geo-localization benchmark for audio language models (ALMs), spanning 72 countries and territories. To extract reliably localizable samples from a crowd-sourced platform, we propose the Audio Localizability metric, which quantifies the informativeness of each recording, yielding 1,444 curated audio clips. Evaluations on 16 ALMs show that audio geo-localization capability has emerged in these models. We find that closed-source models substantially outperform open-source models, and that linguistic clues often dominate as a scaffold for prediction. We further analyze ALMs' reasoning traces, regional bias, error causes, and the interpretability of the localizability metric. Overall, AGL1K establishes a benchmark for audio geo-localization and may advance ALMs toward better geospatial reasoning capability.