
Exploring Spatial Intelligence from a Generative Perspective

Muzhi Zhu, Shunyao Jiang, Huanyi Zheng, Zekai Luo, Hao Zhong, Anzhou Li, Kaijun Wang, Jintao Rong, Yang Liu, Hao Chen, Tao Lin, Chunhua Shen

2026-04-23


Summary

This paper investigates whether advanced AI models that can both understand and *create* images actually grasp 3D space, and whether this ability can be measured and improved.

What's the problem?

Current tests for AI's understanding of spatial relationships mostly check if the AI can *recognize* things in images. However, it's unclear if these models can actually *use* that understanding to generate images that follow the rules of 3D space – like making sure objects are positioned realistically relative to each other. There wasn't a good way to test this 'generative spatial intelligence' or to train models to improve it.

What's the solution?

The researchers created a new testing ground called GSI-Bench. It includes two datasets: one with real-world images carefully curated to have clear 3D structure (GSI-Real), and another with computer-generated images where the spatial relationships can be precisely controlled and automatically labeled (GSI-Syn). They also developed a standard evaluation protocol for measuring how well models respect and change spatial arrangements when editing images. Finally, they showed that fine-tuning models on the GSI-Syn dataset significantly improved their ability to follow spatial instructions, on both synthetic and real images, and even boosted their understanding of spatial relationships in existing images.
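To make the idea of checking "spatial compliance" concrete, here is a minimal sketch of the kind of automated check such a benchmark could run. The class and function names, the depth-based occlusion test, and the scoring rule are all illustrative assumptions, not the paper's actual protocol.

```python
from dataclasses import dataclass


@dataclass
class Object3D:
    """A scene object with an estimated distance from the camera (hypothetical)."""
    name: str
    depth: float  # metres from the camera


def complies_with_occlusion(front: Object3D, back: Object3D) -> bool:
    """An instruction like 'place the cup in front of the laptop' is
    spatially compliant only if the cup ends up closer to the camera."""
    return front.depth < back.depth


def spatial_compliance(pairs: list[tuple[Object3D, Object3D]]) -> float:
    """Fraction of instructed (front, back) constraints the edited scene
    satisfies; 1.0 means every constraint holds."""
    if not pairs:
        return 1.0
    return sum(complies_with_occlusion(f, b) for f, b in pairs) / len(pairs)


cup = Object3D("cup", depth=1.2)
laptop = Object3D("laptop", depth=1.8)
print(spatial_compliance([(cup, laptop)]))  # 1.0: the cup is in front
```

In a real pipeline the depths would come from a monocular depth estimator or, in the synthetic case, directly from the renderer, which is what makes fully automated labeling possible for computer-generated scenes.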

Why it matters?

This work is important because it shows that AI models can actually learn to *reason* about 3D space, not just memorize patterns. By developing a way to train models to generate images that follow spatial rules, we can build AI that's better at understanding and interacting with the real world, which has implications for robotics, design, and many other fields.

Abstract

Spatial intelligence is essential for multimodal large language models, yet current benchmarks largely assess it only from an understanding perspective. We ask whether modern generative or unified multimodal models also possess generative spatial intelligence (GSI), the ability to respect and manipulate 3D spatial constraints during image generation, and whether such capability can be measured or improved. We introduce GSI-Bench, the first benchmark designed to quantify GSI through spatially grounded image editing. It consists of two complementary components: GSI-Real, a high-quality real-world dataset built via a 3D-prior-guided generation and filtering pipeline, and GSI-Syn, a large-scale synthetic benchmark with controllable spatial operations and fully automated labeling. Together with a unified evaluation protocol, GSI-Bench enables scalable, model-agnostic assessment of spatial compliance and editing fidelity. Experiments show that fine-tuning unified multimodal models on GSI-Syn yields substantial gains on both synthetic and real tasks and, strikingly, also improves downstream spatial understanding. This provides the first clear evidence that generative training can tangibly strengthen spatial reasoning, establishing a new pathway for advancing spatial intelligence in multimodal models.