Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Xianjin Wu, Dingkang Liang, Tianrui Feng, Kui Xia, Yumeng Zhang, Xiaofan Li, Xiao Tan, Xiang Bai

2026-03-20

Summary

This paper addresses the issue of Multimodal Large Language Models (MLLMs) not being very good at understanding the physical world, specifically things like shapes, sizes, and how objects move through space. It proposes a new way to give these models a better grasp of 3D space without needing a lot of extra 3D training data.

What's the problem?

MLLMs are really good at understanding what things *are* based on text and images, but they struggle with understanding where things are in space and how they interact physically. Current solutions to this problem often require a lot of 3D data, which is hard to get and doesn't always work well in new situations. Basically, they lack 'spatial awareness' and can't reason about the physical world very well.

What's the solution?

The researchers realized that models designed to *create* realistic videos already have a pretty good understanding of 3D space and physics, because making videos look believable forces them to learn how scenes are structured and how objects move. They developed a system called VEGA-3D that takes a pre-trained video generation (diffusion) model and uses it as a 'virtual world simulator'. It extracts features describing shape and motion from the video model's intermediate layers and fuses them, token by token, with the text and image features the MLLM already has, helping the MLLM understand the 3D world without needing explicit 3D training data.
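The summary does not spell out the fusion math, but the abstract's "token-level adaptive gated fusion" can be sketched in a few lines: for each token, a learned gate decides how much of the geometric signal (from the video model) to mix into the semantic token (from the MLLM). Everything below is an illustrative assumption — the shapes, weight names, and the exact gating formula are not taken from the VEGA-3D paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def adaptive_gated_fusion(sem, geo, W_proj, W_gate, b_gate):
    """Illustrative sketch of token-level adaptive gated fusion.

    sem: (T, D)   semantic token features from the MLLM's encoder
    geo: (T, Dg)  spatiotemporal features from the video diffusion model
    NOTE: shapes and the gating formula are assumptions, not the
    paper's actual implementation.
    """
    geo_proj = geo @ W_proj  # (T, D): project geometry into the semantic space
    # Per-token scalar gate in (0, 1), conditioned on both feature streams
    gate = sigmoid(np.concatenate([sem, geo_proj], axis=-1) @ W_gate + b_gate)  # (T, 1)
    # Residual injection: each token keeps its semantics and adds gated geometry
    return sem + gate * geo_proj

# Toy usage with random weights (stand-ins for learned parameters)
rng = np.random.default_rng(0)
T, D, Dg = 4, 8, 6
sem = rng.standard_normal((T, D))
geo = rng.standard_normal((T, Dg))
W_proj = rng.standard_normal((Dg, D)) * 0.1
W_gate = rng.standard_normal((2 * D, 1)) * 0.1
b_gate = np.zeros(1)

fused = adaptive_gated_fusion(sem, geo, W_proj, W_gate, b_gate)
print(fused.shape)  # same shape as the semantic tokens: (4, 8)
```

The residual form matters for the "plug-and-play" claim: if the gate outputs values near zero, the fused tokens reduce to the original semantic tokens, so bolting the module onto a pre-trained MLLM starts from a safe no-op-like state.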

Why it matters?

This work is important because it offers a more practical and scalable way to improve MLLMs' understanding of the physical world. Instead of relying on scarce 3D data, it leverages the knowledge already embedded in video generation models. This could lead to better performance in tasks like robotics, virtual reality, and any application where understanding the 3D environment is crucial.

Abstract

While Multimodal Large Language Models demonstrate impressive semantic capabilities, they often suffer from spatial blindness, struggling with fine-grained geometric reasoning and physical dynamics. Existing solutions typically rely on explicit 3D modalities or complex geometric scaffolding, which are limited by data scarcity and generalization challenges. In this work, we propose a paradigm shift by leveraging the implicit spatial prior within large-scale video generation models. We posit that to synthesize temporally coherent videos, these models inherently learn robust 3D structural priors and physical laws. We introduce VEGA-3D (Video Extracted Generative Awareness), a plug-and-play framework that repurposes a pre-trained video diffusion model as a Latent World Simulator. By extracting spatiotemporal features from intermediate noise levels and integrating them with semantic representations via a token-level adaptive gated fusion mechanism, we enrich MLLMs with dense geometric cues without explicit 3D supervision. Extensive experiments across 3D scene understanding, spatial reasoning, and embodied manipulation benchmarks demonstrate that our method outperforms state-of-the-art baselines, validating that generative priors provide a scalable foundation for physical-world understanding. Code is publicly available at https://github.com/H-EmbodVis/VEGA-3D.