"Does the cafe entrance look accessible? Where is the door?" Towards Geospatial AI Agents for Visual Inquiries

Jon E. Froehlich, Jared Hwang, Zeyu Wang, John S. O'Meara, Xia Su, William Huang, Yang Zhang, Alex Fiannaca, Philip Nelson, Shaun Kane

2025-08-22

"Does the cafe entrance look accessible? Where is the door?" Towards Geospatial AI Agents for Visual Inquiries

Summary

This paper introduces the idea of 'Geo-Visual Agents,' which are AI systems designed to understand and answer questions about what places actually *look* like, not just where things are located.

What's the problem?

Current digital maps are great for navigation and finding specific addresses, but they struggle with questions about the visual characteristics of a place. Because they rely on structured data like road networks and business listings, they can't easily handle requests like 'show me restaurants with outdoor seating' or questions like 'what do buildings in this neighborhood look like?' — that information isn't captured in existing databases.

What's the solution?

The authors propose building AI agents that can 'see' and understand the world through images. These agents would analyze large collections of photos from sources like Google Street View, travel and review websites, and satellite imagery, and combine that visual evidence with traditional map data to answer nuanced questions about a location's appearance and features. The paper presents an initial vision for how such agents could work, along with three exemplar applications.
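At a high level, such an agent would retrieve imagery for a place, pose a question to a vision-language model, and fuse the answer with structured GIS attributes. The sketch below illustrates that flow in Python; it is a minimal illustration of the idea, not the paper's implementation, and every name in it (`fetch_streetscape_photos`, `visual_qa`, the `wheelchair` tag) is a hypothetical stand-in with stubbed logic where a real system would call imagery and model APIs.

```python
from dataclasses import dataclass

@dataclass
class Place:
    name: str
    lat: float
    lng: float
    tags: dict  # structured GIS attributes, e.g. from a POI index

def fetch_streetscape_photos(place: Place) -> list[str]:
    """Hypothetical retrieval step: return references to imagery near the
    place (street-level panoramas, user-contributed photos, aerial tiles)."""
    return [f"streetview://{place.lat},{place.lng}/pano_{i}" for i in range(3)]

def visual_qa(image_ref: str, question: str) -> str:
    """Hypothetical vision-language model call. A real agent would send the
    image and question to a multimodal model; here we return a canned answer."""
    return "A step-free entrance with a wide door is visible."

def answer_geo_visual_query(place: Place, question: str) -> str:
    """Fuse visual evidence extracted from imagery with traditional GIS data."""
    photos = fetch_streetscape_photos(place)
    visual_evidence = [visual_qa(p, question) for p in photos]
    gis_evidence = place.tags.get("wheelchair", "unknown")
    return (f"{place.name}: analysis of {len(photos)} photos suggests: "
            f"{visual_evidence[0]} GIS records list wheelchair access "
            f"as '{gis_evidence}'.")

cafe = Place("Corner Cafe", 47.6062, -122.3321, {"wheelchair": "yes"})
print(answer_geo_visual_query(cafe, "Does the cafe entrance look accessible?"))
```

The key design point the paper argues for is the final fusion step: neither the imagery nor the GIS database alone can answer the question, but together they provide grounded visual evidence plus structured metadata.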

Why it matters?

This research is important because it moves beyond simply knowing *where* things are to understanding *what* things are like. This could lead to much more useful and intuitive map applications, helping people explore the world, plan trips, and gain a better understanding of different places based on visual evidence.

Abstract

Interactive digital maps have revolutionized how people travel and learn about the world; however, they rely on pre-existing structured data in GIS databases (e.g., road networks, POI indices), limiting their ability to address geo-visual questions related to what the world looks like. We introduce our vision for Geo-Visual Agents--multimodal AI agents capable of understanding and responding to nuanced visual-spatial inquiries about the world by analyzing large-scale repositories of geospatial images, including streetscapes (e.g., Google Street View), place-based photos (e.g., TripAdvisor, Yelp), and aerial imagery (e.g., satellite photos) combined with traditional GIS data sources. We define our vision, describe sensing and interaction approaches, provide three exemplars, and enumerate key challenges and opportunities for future work.