SpatialTree: How Spatial Abilities Branch Out in MLLMs
Yuxi Xiao, Longfei Li, Shen Yan, Xinhang Liu, Sida Peng, Yunchao Wei, Xiaowei Zhou, Bingyi Kang
2025-12-24
Summary
This research investigates how well large multimodal models (AI systems that understand both images and text) perform on tasks that require spatial reasoning, such as judging where things are in relation to each other. It proposes a way to organize these spatial skills into a hierarchy, tests existing models at each level to see how they stack up, and explores how to make them better at these kinds of tasks.
What's the problem?
Current research on spatial reasoning in these models is pretty scattered. We don't have a good understanding of *how* spatial abilities develop in these AI systems, or how different spatial skills relate to each other. Most studies only look at a few specific tasks, making it hard to get a complete picture. Essentially, we need a way to systematically measure and improve spatial intelligence in AI.
What's the solution?
The researchers created something called 'SpatialTree,' which breaks spatial reasoning down into four levels, starting with basic visual perception and moving up to complex problem-solving and acting in a space. They then built a set of 27 tests based on this framework to evaluate how well current models perform at each level. They also experimented with 'fine-tuning' (retraining the models on specific skills) and found a surprising pattern: training on one basic perception skill could actually hurt other perception skills, yet improving lower-level skills *did* boost higher-level abilities. Finally, they found that simply encouraging the AI to 'think' more helped with complex reasoning but hurt quick, intuitive perception, so they developed a method that lets the model deliberate only when a task actually calls for it.
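To make the structure concrete, here is a minimal Python sketch of how such a four-level hierarchy could be represented and scored. The level names come from the paper's abstract below, but the sub-ability names and accuracy numbers are hypothetical placeholders for illustration, not the paper's actual 27 sub-abilities or reported results.

```python
# A minimal sketch (not the authors' code) of scoring a model against a
# four-level spatial hierarchy. Level names follow the paper's abstract;
# the sub-abilities and accuracies below are hypothetical placeholders.
from statistics import mean

SPATIAL_TREE = {
    "L1 low-level perception": ["depth_ordering", "relative_size"],
    "L2 mental mapping":       ["viewpoint_change", "map_building"],
    "L3 simulation":           ["trajectory_prediction", "mental_rotation"],
    "L4 agentic competence":   ["navigation_planning", "object_rearrangement"],
}

def level_scores(per_task_accuracy: dict[str, float]) -> dict[str, float]:
    """Aggregate per-sub-ability accuracies into one average score per level."""
    return {
        level: mean(per_task_accuracy.get(task, 0.0) for task in tasks)
        for level, tasks in SPATIAL_TREE.items()
    }

if __name__ == "__main__":
    # Hypothetical evaluation results for one model.
    accuracies = {
        "depth_ordering": 0.81, "relative_size": 0.74,
        "viewpoint_change": 0.62, "map_building": 0.58,
        "trajectory_prediction": 0.49, "mental_rotation": 0.51,
        "navigation_planning": 0.40, "object_rearrangement": 0.37,
    }
    for level, score in level_scores(accuracies).items():
        print(f"{level}: {score:.2f}")
```

Aggregating per-task accuracies by level in this way is one simple way to ask whether skills within a level move together or independently, which is the kind of structure the evaluation in the paper examines.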
Why it matters?
This work is important because it provides a structured way to understand and improve spatial reasoning in AI. As AI systems become more integrated into our world – think self-driving cars or robots – their ability to understand and interact with physical space becomes crucial. By identifying the different levels of spatial ability and how they relate, and by developing methods to improve them, this research helps pave the way for more capable and reliable AI systems.
Abstract
Cognitive science suggests that spatial ability develops progressively, from perception to reasoning and interaction. Yet in multimodal LLMs (MLLMs), this hierarchy remains poorly understood, as most studies focus on a narrow set of tasks. We introduce SpatialTree, a cognitive-science-inspired hierarchy that organizes spatial abilities into four levels: low-level perception (L1), mental mapping (L2), simulation (L3), and agentic competence (L4). Based on this taxonomy, we construct the first capability-centric hierarchical benchmark, thoroughly evaluating mainstream MLLMs across 27 sub-abilities. The evaluation results reveal a clear structure: L1 skills are largely orthogonal, whereas higher-level skills are strongly correlated, indicating increasing interdependency. Through targeted supervised fine-tuning, we uncover a surprising transfer dynamic: negative transfer within L1, but strong cross-level transfer from low- to high-level abilities with notable synergy. Finally, we explore how to improve the entire hierarchy. We find that naive RL that encourages extensive "thinking" is unreliable: it helps complex reasoning but hurts intuitive perception. We propose a simple auto-think strategy that suppresses unnecessary deliberation, enabling RL to consistently improve performance across all levels. By building SpatialTree, we provide a proof-of-concept framework for understanding and systematically scaling spatial abilities in MLLMs.
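The abstract does not spell out how the auto-think strategy decides when to suppress deliberation, but one plausible reading is a simple gate on the prompt: answer low-level perception questions directly and reserve step-by-step reasoning for higher-level tasks. The sketch below illustrates only that idea; the level labels, prompt wording, and gating rule are assumptions for illustration, not the paper's actual implementation.

```python
# A minimal sketch of one possible "gate the deliberation" scheme, assuming a
# model that can be prompted either to answer directly or to reason step by
# step. The actual auto-think strategy in the paper may work differently.

DIRECT_LEVELS = {"L1", "L2"}   # perception / mapping: answer directly
THINK_LEVELS = {"L3", "L4"}    # simulation / agentic tasks: allow deliberation

def build_prompt(question: str, level: str) -> str:
    """Suppress chain-of-thought for low-level tasks, allow it for high-level ones."""
    if level in DIRECT_LEVELS:
        return f"{question}\nAnswer directly with the final answer only."
    return f"{question}\nThink step by step, then give the final answer."

print(build_prompt("Which object is closer to the camera?", "L1"))
print(build_prompt("Plan a route from the kitchen to the front door.", "L4"))
```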