SpatialThinker: Reinforcing 3D Reasoning in Multimodal LLMs via Spatial Rewards

Hunar Batra, Haoqin Tu, Hardy Chen, Yuanze Lin, Cihang Xie, Ronald Clark

2025-11-17

Summary

This paper introduces SpatialThinker, an AI model that understands images and language together, with a particular focus on spatial relationships like 'left of,' 'behind,' or 'inside.' The goal is to make such models reason about the 3D world more like humans do.

What's the problem?

Current AI models that combine vision and language struggle with understanding how things relate to each other in space. They often need explicit 3D inputs or complicated changes to their architecture to even begin to grasp these concepts, and they usually require massive amounts of training data. Essentially, they aren't very good at 'thinking' about space the way people do, and they need far too many examples to learn even basic spatial ideas.

What's the solution?

The researchers created SpatialThinker, which learns through a trial-and-error process called reinforcement learning. It first builds a kind of map of the scene (a scene graph) that identifies the relevant objects and how they're positioned relative to each other. Then it uses this map to reason toward an answer, receiving dense 'rewards' for correctly grounding objects and spatial relationships, not just for the final answer. They also created a new dataset of images and questions, STVQA-7K, specifically designed to train and test spatial understanding. This combination of a targeted dataset and a learning process that rewards spatial grounding makes SpatialThinker much better at these tasks.
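To make the idea of a "dense spatial reward" concrete, here is a minimal sketch in Python. It assumes a simplified scene-graph representation and illustrative reward weights; the class names, the F1-based overlap score, and the weighting scheme are assumptions for exposition, not the paper's actual implementation.

```python
# Hedged sketch: a multi-objective dense spatial reward that credits the
# model for grounding the right objects and relations, plus a terminal
# bonus for the correct final answer. All names/weights are illustrative.
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Relation:
    subject: str
    predicate: str  # e.g. "left of", "behind", "on"
    obj: str

@dataclass
class SceneGraph:
    objects: set = field(default_factory=set)     # set of object labels
    relations: set = field(default_factory=set)   # set of Relation triples

def f1(pred: set, gold: set) -> float:
    """F1 overlap between predicted and ground-truth sets."""
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def spatial_reward(pred: SceneGraph, gold: SceneGraph,
                   answer_correct: bool,
                   w_obj: float = 0.25, w_rel: float = 0.35,
                   w_ans: float = 0.4) -> float:
    """Dense reward: partial credit for spatial grounding even when the
    final answer is wrong, unlike a sparse answer-only reward."""
    return (w_obj * f1(pred.objects, gold.objects)
            + w_rel * f1(pred.relations, gold.relations)
            + w_ans * float(answer_correct))

# Usage: a prediction that grounds only one of two objects and no
# relations still earns partial reward, giving the RL learner a signal.
gold = SceneGraph(objects={"cup", "table"},
                  relations={Relation("cup", "on", "table")})
perfect = SceneGraph(objects={"cup", "table"},
                     relations={Relation("cup", "on", "table")})
partial = SceneGraph(objects={"cup"}, relations=set())

print(spatial_reward(perfect, gold, answer_correct=True))   # full reward
print(spatial_reward(partial, gold, answer_correct=False))  # partial credit
```

The key contrast with a sparse reward is that `spatial_reward` remains nonzero for partially correct scene graphs, so the policy gets a gradient toward correct grounding even before it can answer the question reliably.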

Why it matters?

This work matters because it shows a way to teach AI models 3D understanding more effectively and with less data. SpatialThinker outperforms comparable models on spatial-reasoning tasks, even surpassing very advanced models like GPT-4o. This is a step towards AI that can truly 'see' and understand the world around it, which is crucial for applications like robotics, self-driving cars, and assistive tools for people with visual impairments.

Abstract

Multimodal large language models (MLLMs) have achieved remarkable progress in vision-language tasks, but they continue to struggle with spatial understanding. Existing spatial MLLMs often rely on explicit 3D inputs or architecture-specific modifications, and remain constrained by large-scale datasets or sparse supervision. To address these limitations, we introduce SpatialThinker, a 3D-aware MLLM trained with RL to integrate structured spatial grounding with multi-step reasoning. The model simulates human-like spatial perception by constructing a scene graph of task-relevant objects and spatial relations, and reasoning towards an answer via dense spatial rewards. SpatialThinker consists of two key contributions: (1) a data synthesis pipeline that generates STVQA-7K, a high-quality spatial VQA dataset, and (2) online RL with a multi-objective dense spatial reward enforcing spatial grounding. SpatialThinker-7B outperforms supervised fine-tuning and the sparse RL baseline on spatial understanding and real-world VQA benchmarks, nearly doubling the base-model gain compared to sparse RL, and surpassing GPT-4o. These results showcase the effectiveness of combining spatial supervision with reward-aligned reasoning in enabling robust 3D spatial understanding with limited data and advancing MLLMs towards human-level visual reasoning.