HoloDreamer: Holistic 3D Panoramic World Generation from Text Descriptions
Haiyang Zhou, Xinhua Cheng, Wangbo Yu, Yonghong Tian, Li Yuan
2024-07-23

Summary
This paper introduces HoloDreamer, a system that creates detailed, fully enclosed 3D panoramic worlds from simple text descriptions, targeting applications like virtual reality and gaming that need complete, view-consistent scenes.
What's the problem?
Generating 3D scenes from text is challenging because existing methods often produce inconsistent results and fail to capture the full details of a scene. Traditional approaches typically start with a single small image and repeatedly outpaint it outward, which can leave gaps or mismatches between regions of the final scene. This limits the quality and usability of the generated 3D environments.
What's the solution?
HoloDreamer addresses these issues by first creating a high-definition panoramic image as a complete starting point for the 3D scene. It then uses a technique called 3D Gaussian Splatting to efficiently build the full 3D environment. The system combines multiple diffusion models to generate stylized and detailed panoramas from complex text prompts. Additionally, it employs a two-stage process to fill in any missing areas and enhance the overall quality of the scene.
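For intuition on why a panorama makes a good starting point: in an equirectangular image, each pixel column corresponds to a longitude and each row to a latitude, so every pixel defines a viewing ray from the scene center. Given an estimated depth along each ray, the panorama can be back-projected into a colored point cloud that can seed the 3D Gaussians. The sketch below is our own minimal illustration of that conversion, not the paper's code; it assumes a metric depth map from a separate panoramic depth estimator.

```python
import numpy as np

def equirect_rays(height: int, width: int) -> np.ndarray:
    """Unit viewing ray for each pixel of an equirectangular panorama.

    Columns span longitude [-pi, pi); rows span latitude [pi/2, -pi/2].
    """
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    lon = (u + 0.5) / width * 2.0 * np.pi - np.pi
    lat = np.pi / 2.0 - (v + 0.5) / height * np.pi
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)
    return np.stack([x, y, z], axis=-1)  # shape (H, W, 3), unit vectors

def panorama_to_point_cloud(rgb: np.ndarray, depth: np.ndarray):
    """Back-project an RGB-D panorama into a colored 3D point cloud.

    rgb: (H, W, 3) colors; depth: (H, W) distance along each ray.
    The output points and colors can initialize the positions and
    colors of the 3D Gaussians.
    """
    rays = equirect_rays(*depth.shape)
    points = rays * depth[..., None]  # scale each unit ray by its depth
    return points.reshape(-1, 3), rgb.reshape(-1, 3)
```

From such a point cloud, each point can become the center of a Gaussian whose attributes are then optimized against renderings of the panorama.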
Why it matters?
This research is important because it allows for the creation of immersive and realistic 3D environments from just a few words. By improving how we generate these scenes, HoloDreamer opens up new possibilities for virtual reality experiences, video games, and film production, making it easier for creators to bring their ideas to life in a visually stunning way.
Abstract
3D scene generation is in high demand across various domains, including virtual reality, gaming, and the film industry. Owing to the powerful generative capabilities of text-to-image diffusion models, which provide reliable priors, creating 3D scenes from text prompts alone has become viable, significantly advancing research in text-driven 3D scene generation. To obtain multi-view supervision from 2D diffusion models, prevailing methods typically employ a diffusion model to generate an initial local image and then iteratively outpaint it to gradually build up the scene. Nevertheless, these outpainting-based approaches are prone to producing globally inconsistent scenes with a low degree of completeness, restricting their broader application. To tackle these problems, we introduce HoloDreamer, a framework that first generates a high-definition panorama as a holistic initialization of the full 3D scene and then leverages 3D Gaussian Splatting (3D-GS) to quickly reconstruct the 3D scene, thereby facilitating the creation of view-consistent and fully enclosed 3D scenes. Specifically, we propose Stylized Equirectangular Panorama Generation, a pipeline that combines multiple diffusion models to enable stylized and detailed equirectangular panorama generation from complex text prompts. Subsequently, we introduce Enhanced Two-Stage Panorama Reconstruction, which conducts a two-stage optimization of 3D-GS to inpaint missing regions and enhance the integrity of the scene. Comprehensive experiments demonstrate that our method outperforms prior works in overall visual consistency and harmony, reconstruction quality, and rendering robustness when generating fully enclosed scenes.
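To make the two-stage reconstruction concrete, the control flow described in the abstract can be read as: optimize 3D-GS against views derived from the panorama, render from novel cameras to expose unobserved regions, inpaint those renders with a 2D model, then optimize again with the augmented supervision. Below is a minimal PyTorch sketch of that loop; `ToyGaussianScene`, `render_fn`, and `inpaint_fn` are simplified stand-ins of our own (a real 3D-GS scene also carries per-Gaussian covariances and uses a differentiable rasterizer), not the authors' implementation.

```python
import torch
import torch.nn.functional as F

class ToyGaussianScene(torch.nn.Module):
    """Toy stand-in for a 3D-GS scene: per-point positions, colors, and
    opacities optimized by gradient descent."""
    def __init__(self, points: torch.Tensor, colors: torch.Tensor):
        super().__init__()
        self.positions = torch.nn.Parameter(points.clone())
        self.colors = torch.nn.Parameter(colors.clone())
        self.opacities = torch.nn.Parameter(torch.zeros(points.shape[0]))

def optimize_stage(scene, views, render_fn, steps=200, lr=1e-2):
    """One optimization stage: fit the scene to (camera, image) pairs
    with a simple photometric L1 loss."""
    opt = torch.optim.Adam(scene.parameters(), lr=lr)
    for _ in range(steps):
        for camera, target in views:
            loss = F.l1_loss(render_fn(scene, camera), target)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return scene

def two_stage_reconstruction(scene, panorama_views, novel_cameras,
                             render_fn, inpaint_fn):
    # Stage 1: fit the scene to views derived from the generated panorama.
    scene = optimize_stage(scene, panorama_views, render_fn)
    # Render from cameras the panorama never covered, fill the exposed
    # holes with a 2D inpainting model, and keep the results as targets.
    extra_views = [(cam, inpaint_fn(render_fn(scene, cam).detach()))
                   for cam in novel_cameras]
    # Stage 2: re-optimize on the augmented view set for completeness.
    return optimize_stage(scene, panorama_views + extra_views, render_fn)
```

The key design point this sketch captures is that the second stage reuses the same optimizer and loss as the first; only the supervision set changes, so completeness improves without sacrificing the consistency established by the panoramic initialization.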