GeoVista is evaluated on GeoBench, a benchmark that includes photos and panoramas from around the world, along with satellite images of different cities. The evaluation pipeline has two parts: a level-wise evaluation, and a nuanced evaluation that extracts the predicted address and computes the haversine distance to the ground-truth location. On the geolocation task, GeoVista surpasses other open-source agentic models and achieves performance comparable to closed-source models.
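The haversine distance used in the nuanced evaluation is the great-circle distance between two latitude/longitude points. The following is a minimal sketch of that formula, not the benchmark's own code: the function name `haversine_km` and the mean Earth radius constant are assumptions chosen for illustration, and the step that turns a predicted address into coordinates is not shown.

```python
import math

# Mean Earth radius in kilometres (an assumed constant; the value used
# by the benchmark is not stated in the text).
EARTH_RADIUS_KM = 6371.0088


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in degrees."""
    phi1 = math.radians(lat1)
    phi2 = math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)

    # Haversine formula: a is the squared half-chord length between the points.
    a = (
        math.sin(d_phi / 2) ** 2
        + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2
    )
    # Clamp to 1.0 to guard against floating-point overshoot near antipodes.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


# Usage: distance between a predicted and a ground-truth location.
predicted = (48.8566, 2.3522)      # Paris
ground_truth = (51.5074, -0.1278)  # London
error_km = haversine_km(*predicted, *ground_truth)
```

A smaller `error_km` means the prediction is closer to the ground truth, which gives a finer-grained signal than checking only whether the country or city matches.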
GeoVista uses a hierarchical reward system that leverages multi-level geographical information to improve overall geolocation performance. The model iteratively generates thoughts and actions; each action is parsed and executed to yield a new observation. This loop repeats until the model outputs a final geolocation prediction or reaches the maximum number of interaction turns. A demo video showcases GeoVista's capabilities on geolocation tasks.
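The interaction loop described above can be sketched as follows. This is an illustrative outline under stated assumptions, not GeoVista's implementation: `run_episode`, the `policy` callable, the `tools` mapping, the step format, and the tool name `web_search` are all hypothetical stand-ins for the model and its action executor.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical step format: ("action", tool_name, argument) to call a tool,
# or ("final", prediction, "") to stop with a geolocation prediction.
Step = Tuple[str, str, str]


def run_episode(
    policy: Callable[[List[str]], Step],
    tools: Dict[str, Callable[[str], str]],
    max_turns: int = 6,
) -> Optional[str]:
    """Run the loop until a final prediction or the turn limit is reached."""
    history: List[str] = []
    for _ in range(max_turns):
        # The policy stands in for the model: it reads the history so far
        # and returns the next parsed step.
        kind, name, arg = policy(history)
        if kind == "final":
            return name  # final geolocation prediction

        # Execute the parsed action and record the resulting observation.
        tool = tools.get(name)
        observation = tool(arg) if tool else f"unknown tool: {name}"
        history.append(f"action={name}({arg}) -> observation={observation}")

    # Turn limit reached without a final prediction.
    return None


# Usage with a scripted stand-in policy and a fake tool.
def scripted_policy(history: List[str]) -> Step:
    if not history:
        return ("action", "web_search", "Eiffel Tower")
    return ("final", "Paris, France", "")


prediction = run_episode(
    scripted_policy, {"web_search": lambda query: "landmark in Paris"}
)
```

Returning `None` at the turn limit keeps the two stopping conditions distinguishable, so an evaluation script can count episodes that ran out of turns separately from those that produced a prediction.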

