Imaginarium: Vision-guided High-Quality 3D Scene Layout Generation

Xiaoming Zhu, Xu Huang, Qinghongbing Xie, Zhi Deng, Junsheng Yu, Yirui Guan, Zhongyuan Liu, Lin Zhu, Qijun Zhao, Ligang Liu, Long Zeng

2025-10-20

Summary

This paper introduces a new system for automatically creating realistic and visually appealing 3D scenes from text prompts, aiming to make digital content creation easier.

What's the problem?

Creating good 3D scenes is hard. Traditionally, artists had to hand-craft many placement rules, which is time-consuming. AI can help, but existing AI methods often struggle to create scenes that are both detailed and varied, or they fail to understand how objects should be positioned relative to one another in a realistic way.

What's the solution?

The researchers built a system that combines several techniques. First, they assembled a large library of 3D assets. Then, they fine-tuned an AI image generator to turn text descriptions into images that match the style of those assets. Next, they developed an image-parsing module that 'reads' each generated image and figures out where the 3D objects should go in the scene. Finally, they refined the layout using a scene graph, a structure that captures relationships between objects, to make sure everything stays logical and consistent with the original image and description.
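To make the flow of the pipeline concrete, here is a minimal sketch of the four stages described above. Every function name and data shape is an illustrative stub of my own invention, not the paper's actual API; the real system uses a fine-tuned image generation model and a geometric image-parsing module in place of these placeholders.

```python
def expand_prompt_to_image(prompt: str) -> dict:
    """Stage 1 (stubbed): a fine-tuned image model renders the prompt
    in the asset library's style; faked here as a list of 2D detections."""
    return {"prompt": prompt,
            "detections": [("sofa", (0.2, 0.5)), ("lamp", (0.7, 0.3))]}

def parse_image_to_layout(image: dict) -> list:
    """Stage 2 (stubbed): recover 3D placements from visual semantics;
    here we simply drop each detection onto the floor plane (z = 0)."""
    return [(name, (x, y, 0.0)) for name, (x, y) in image["detections"]]

def refine_with_scene_graph(layout: list) -> list:
    """Stage 3 (stubbed): enforce simple pairwise relations, e.g.
    push apart any two objects that overlap on the floor plane."""
    refined = list(layout)
    for i in range(len(refined)):
        for j in range(i + 1, len(refined)):
            (_, (xa, ya, _)), (nb, (xb, yb, zb)) = refined[i], refined[j]
            if abs(xa - xb) < 0.1 and abs(ya - yb) < 0.1:
                refined[j] = (nb, (xb + 0.1, yb, zb))  # nudge apart
    return refined

def generate_scene(prompt: str) -> list:
    """Compose the stages: prompt -> image -> layout -> refined layout."""
    image = expand_prompt_to_image(prompt)
    return refine_with_scene_graph(parse_image_to_layout(image))

layout = generate_scene("a cozy living room")
print(layout)  # each entry is (asset_name, (x, y, z))
```

The key design idea the sketch tries to capture is that the image acts as an intermediate representation: the text prompt is never mapped to 3D coordinates directly, which is where language-model-only approaches tend to lose spatial accuracy.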

Why it matters?

This work is important because it significantly improves on existing methods for generating 3D scenes. In user testing, the scenes this system produced were judged richer and higher quality, which could be really useful for things like video game development, virtual reality, and visual effects for movies.

Abstract

Generating artistic and coherent 3D scene layouts is crucial in digital content creation. Traditional optimization-based methods are often constrained by cumbersome manual rules, while deep generative models face challenges in producing content with richness and diversity. Furthermore, approaches that utilize large language models frequently lack robustness and fail to accurately capture complex spatial relationships. To address these challenges, this paper presents a novel vision-guided 3D layout generation system. We first construct a high-quality asset library containing 2,037 scene assets and 147 3D scene layouts. Subsequently, we employ an image generation model to expand prompt representations into images, fine-tuning it to align with our asset library. We then develop a robust image parsing module to recover the 3D layout of scenes based on visual semantics and geometric information. Finally, we optimize the scene layout using scene graphs and overall visual semantics to ensure logical coherence and alignment with the images. Extensive user testing demonstrates that our algorithm significantly outperforms existing methods in terms of layout richness and quality. The code and dataset will be available at https://github.com/HiHiAllen/Imaginarium.