Multimodal Spatial Reasoning in the Large Model Era: A Survey and Benchmarks

Xu Zheng, Zihao Dongfang, Lutao Jiang, Boyuan Zheng, Yulong Guo, Zhenquan Zhang, Giuliano Albanese, Runyi Yang, Mengjiao Ma, Zixin Zhang, Chenfei Liao, Dingcheng Zhen, Yuanhuiyi Lyu, Yuqian Fu, Bin Ren, Linfeng Zhang, Danda Pani Paudel, Nicu Sebe, Luc Van Gool, Xuming Hu

2025-10-30

Summary

This paper is a comprehensive overview of how artificial intelligence, specifically large models that can process multiple types of information such as images and text, is learning to understand and reason about space the way humans do.

What's the problem?

While AI models are getting better at understanding space, there has been no systematic way to track their progress or fairly compare different models. The field lacks standardized benchmarks and a clear picture of which techniques work best for teaching these models spatial reasoning.

What's the solution?

The authors reviewed a broad body of recent research in this area, categorizing the different approaches to spatial reasoning in AI. They examined how models handle tasks such as understanding relationships between objects, interpreting scenes and layouts, and navigating 3D environments. Importantly, they also created new, publicly available benchmarks so researchers can evaluate and compare these models more effectively, and they released the code for these benchmarks online.

Why it matters?

This work matters because it provides a foundation for further research on spatial reasoning in AI. By organizing the current state of the field and providing tools for evaluation, it helps researchers build more capable systems that can interact with the physical world in meaningful ways, potentially advancing areas such as robotics, self-driving cars, and virtual reality.

Abstract

Humans possess spatial reasoning abilities that enable them to understand spaces through multimodal observations, such as vision and sound. Large multimodal reasoning models extend these abilities by learning to perceive and reason, showing promising performance across diverse spatial tasks. However, systematic reviews and publicly available benchmarks for these models remain limited. In this survey, we provide a comprehensive review of multimodal spatial reasoning tasks with large models, categorizing recent progress in multimodal large language models (MLLMs) and introducing open benchmarks for evaluation. We begin by outlining general spatial reasoning, focusing on post-training techniques, explainability, and architecture. Beyond classical 2D tasks, we examine spatial relationship reasoning, scene and layout understanding, as well as visual question answering and grounding in 3D space. We also review advances in embodied AI, including vision-language navigation and action models. Additionally, we consider emerging modalities such as audio and egocentric video, which contribute to novel spatial understanding through new sensors. We believe this survey establishes a solid foundation and offers insights into the growing field of multimodal spatial reasoning. Updated information about this survey, codes and implementation of the open benchmarks can be found at https://github.com/zhengxuJosh/Awesome-Spatial-Reasoning.