Spatial Mental Modeling from Limited Views
Baiqiao Yin, Qineng Wang, Pingyue Zhang, Jianshu Zhang, Kangrui Wang, Zihan Wang, Jieyu Zhang, Keshigeyan Chandrasegaran, Han Liu, Ranjay Krishna, Saining Xie, Manling Li, Jiajun Wu, Li Fei-Fei
2025-06-30
Summary
This paper introduces MindCube, a new benchmark designed to test and improve how vision-language models understand and reason about spaces when they see only limited views, much like how humans imagine a whole room after just peeking inside.
What's the problem?
AI models often struggle to infer what lies in the unseen parts of a scene when given only a few images or camera angles, which makes it hard for them to accurately understand layouts, adopt other perspectives, or predict what would happen if objects move.
What's the solution?
MindCube provides a large set of images and spatial reasoning questions that encourage models to build internal mental maps and reason about positions, viewpoints, and hypothetical changes. Training models to first generate an internal cognitive map and then reason over it significantly improves their spatial understanding; a minimal sketch of this two-stage scaffold is shown below.
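To make the map-then-reason idea concrete, here is a minimal sketch of such a two-stage pipeline. It assumes a hypothetical `query_vlm` helper standing in for whatever vision-language-model API is actually used, and the prompts and JSON map format are illustrative rather than the paper's exact protocol.

```python
from typing import List

def query_vlm(images: List[str], prompt: str) -> str:
    """Hypothetical helper: send images plus a text prompt to a
    vision-language model and return its text response."""
    raise NotImplementedError  # replace with your model's actual API

def map_then_reason(images: List[str], question: str) -> str:
    # Stage 1: ask the model to externalize a cognitive map of the scene,
    # e.g., each object with rough top-down grid coordinates, as JSON.
    map_prompt = (
        "From these views, build a top-down cognitive map of the scene. "
        "Output JSON mapping each object name to approximate (x, y) "
        "grid coordinates."
    )
    cognitive_map = query_vlm(images, map_prompt)

    # Stage 2: answer the spatial question conditioned on the generated
    # map, so the model reasons over its own internal representation
    # rather than over raw pixels alone.
    reason_prompt = (
        f"Cognitive map of the scene:\n{cognitive_map}\n\n"
        f"Using this map and the images, answer: {question}"
    )
    return query_vlm(images, reason_prompt)
```

Generating the map as an explicit intermediate artifact also makes the model's spatial beliefs inspectable, which is useful for diagnosing where reasoning about unseen regions goes wrong.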
Why it matters?
This matters because better spatial reasoning helps AI behave more like humans in navigation, virtual reality, robotics, and any task where it must understand or make decisions about spaces it cannot fully see.
Abstract
A new benchmark, MindCube, shows that VLMs can improve their understanding of unseen spaces by forming internal spatial representations and reasoning over them.