
Think3D: Thinking with Space for Spatial Reasoning

Zaibin Zhang, Yuhan Wu, Lianjie Jia, Yifan Wang, Zhongbo Zhang, Yijiang Li, Binghao Ran, Fuxi Zhang, Zhuohan Sun, Zhenfei Yin, Lijun Wang, Huchuan Lu

2026-01-21

Summary

This paper introduces a new way to help AI models that 'see' and understand images (called Vision Large Models, or VLMs) reason better about the 3D world around them.

What's the problem?

Current AI models are very good at understanding what's in a 2D picture, but they struggle with tasks that require understanding depth, spatial relationships, and how objects exist in three-dimensional space. They essentially 'see' the world as flat, even though it isn't, which limits their ability to solve real-world problems that require 3D thinking.

What's the solution?

The researchers created a framework called Think3D. Think3D lets the AI actively explore a scene by virtually moving a camera around and switching between egocentric and global viewpoints. It relies on 3D reconstruction models that recover point clouds and camera poses from images or videos, letting the AI 'think' through a problem in 3D, step by step. They found that even without extra training, Think3D significantly improved the performance of powerful models like GPT-4.1 and Gemini 2.5 Pro. For smaller, less capable models, they used reinforcement learning to teach the AI to choose the most informative viewpoints, further boosting performance.
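
To make the idea concrete, here is a minimal, self-contained sketch (not the authors' code) of what such an exploration loop could look like: a random point cloud stands in for the reconstructed scene, and a stand-in policy picks camera positions whose 2D renderings would be fed back to the model. All names here (orbit_camera, project_points, choose_next_view) are hypothetical illustrations, not the Think3D API.

```python
# Hypothetical sketch of an interactive "3D chain-of-thought" loop; not the released code.
import numpy as np

def orbit_camera(azimuth_deg: float, elevation_deg: float, radius: float = 3.0) -> np.ndarray:
    """Place a camera on a sphere around the scene origin."""
    az, el = np.radians(azimuth_deg), np.radians(elevation_deg)
    return radius * np.array([np.cos(el) * np.cos(az),
                              np.cos(el) * np.sin(az),
                              np.sin(el)])

def project_points(points: np.ndarray, cam_pos: np.ndarray) -> np.ndarray:
    """Orthographic projection of the point cloud into the camera's image plane."""
    forward = -cam_pos / np.linalg.norm(cam_pos)             # camera looks at the origin
    up = np.array([0.0, 0.0, 1.0])
    right = np.cross(forward, up)
    right /= np.linalg.norm(right)
    true_up = np.cross(right, forward)
    rel = points - cam_pos
    return np.stack([rel @ right, rel @ true_up], axis=-1)   # 2D view coordinates

def choose_next_view(step: int) -> tuple[float, float]:
    """Stand-in for the agent's decision (a VLM prompt, or an RL policy in the paper)."""
    return 45.0 * step, 20.0                                  # sweep azimuth in 45-degree steps

# Toy stand-in for a reconstructed point cloud (normally recovered from images or video).
scene = np.random.default_rng(0).normal(size=(500, 3))

for step in range(4):                                         # a few steps of spatial exploration
    azimuth, elevation = choose_next_view(step)
    cam = orbit_camera(azimuth, elevation)
    view_2d = project_points(scene, cam)                      # this rendering would go back to the VLM
    print(f"step {step}: azimuth={azimuth:.0f}, elevation={elevation:.0f}, "
          f"view extent={np.ptp(view_2d, axis=0).round(2)}")
```

In the actual framework the renderings come from real reconstructed scenes and the viewpoint choices are made by the language model (or a learned policy), but the overall loop of "pick a view, look, reason, repeat" is the same.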

Why it matters?

This work is important because it shows that we can improve an AI's 3D reasoning abilities simply by giving it tools to explore and interact with a scene, rather than needing to completely retrain the AI. This is a step towards creating AI that can understand and interact with the physical world more like humans do, opening up possibilities for more versatile and intelligent robots and virtual assistants.

Abstract

Understanding and reasoning about the physical world requires spatial intelligence: the ability to interpret geometry, perspective, and spatial relations beyond 2D perception. While recent vision large models (VLMs) excel at visual understanding, they remain fundamentally 2D perceivers and struggle with genuine 3D reasoning. We introduce Think3D, a framework that enables VLM agents to think with 3D space. By leveraging 3D reconstruction models that recover point clouds and camera poses from images or videos, Think3D allows the agent to actively manipulate space through camera-based operations and ego/global-view switching, transforming spatial reasoning into an interactive 3D chain-of-thought process. Without additional training, Think3D significantly improves the spatial reasoning performance of advanced models such as GPT-4.1 and Gemini 2.5 Pro, yielding average gains of +7.8% on BLINK Multi-view and MindCube, and +4.7% on VSI-Bench. We further show that smaller models, which struggle with spatial exploration, benefit significantly from a reinforcement learning policy that enables the model to select informative viewpoints and operations. With RL, the benefit from tool usage increases from +0.7% to +6.8%. Our findings demonstrate that training-free, tool-augmented spatial exploration is a viable path toward more flexible and human-like 3D reasoning in multimodal agents, establishing a new dimension of multimodal intelligence. Code and weights are released at https://github.com/zhangzaibin/spagent.
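
As a rough illustration of the reinforcement-learning stage mentioned in the abstract, the toy sketch below treats viewpoint selection as a gradient-bandit problem: a softmax policy over a few hypothetical camera operations is nudged toward the ones that, in this simulation, lead to correct answers more often. The action names and success rates are invented for illustration; this is not the released training code.

```python
# Toy REINFORCE-style bandit over hypothetical camera operations; purely illustrative.
import numpy as np

rng = np.random.default_rng(0)
ACTIONS = ["global_view", "ego_view", "orbit_left", "orbit_right", "zoom_in"]  # hypothetical ops
TRUE_SUCCESS = np.array([0.55, 0.30, 0.50, 0.50, 0.80])  # made-up chance each view yields a correct answer

logits = np.zeros(len(ACTIONS))   # parameters of a softmax policy over operations
lr = 0.5

for _ in range(2000):
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    a = rng.choice(len(ACTIONS), p=probs)
    reward = float(rng.random() < TRUE_SUCCESS[a])   # 1 if the simulated model answers correctly
    grad_log_pi = -probs                             # REINFORCE: grad log pi(a) = onehot(a) - probs
    grad_log_pi[a] += 1.0
    logits += lr * reward * grad_log_pi

probs = np.exp(logits - logits.max())
probs /= probs.sum()
print("learned viewpoint preferences:", dict(zip(ACTIONS, np.round(probs, 2))))
```

The policy concentrates on whichever operations pay off most often, which mirrors the paper's finding that smaller models need a learned policy to pick informative viewpoints before the camera tools start to help.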