
Can MLLMs Guide Me Home? A Benchmark Study on Fine-Grained Visual Reasoning from Transit Maps

Sicheng Feng, Song Wang, Shuyi Ouyang, Lingdong Kong, Zikai Song, Jianke Zhu, Huan Wang, Xinchao Wang

2025-05-27

Summary

This paper introduces ReasonMap, a new benchmark for testing how well advanced AI models can understand and reason about detailed information in transit maps, which combine both pictures and words.

What's the problem?

The problem is that while multimodal large language models are supposed to be good at handling both images and text, it's not clear whether they can actually solve complicated visual tasks, like helping someone navigate using a subway map, especially when the details are fine-grained and tricky.

What's the solution?

The researchers created ReasonMap, a benchmark that checks how well these models handle fine-grained visual and spatial reasoning. Surprisingly, they found that base models sometimes do better than variants designed specifically for reasoning, and that genuine visual understanding is still a challenge for these AIs.

Why it matters?

This is important because if we want AI to help people with real-world navigation and other visually complex tasks, we need to know how well these models actually understand images, so that we can build better and more reliable systems.

Abstract

ReasonMap evaluates the fine-grained visual understanding and spatial reasoning abilities of multimodal large language models, revealing that base models often outperform reasoning variants and highlighting the importance of genuine visual perception for complex tasks.