Wonderland: Navigating 3D Scenes from a Single Image
Hanwen Liang, Junli Cao, Vidit Goel, Guocheng Qian, Sergei Korolev, Demetri Terzopoulos, Konstantinos N. Plataniotis, Sergey Tulyakov, Jian Ren
2024-12-17

Summary
This paper presents Wonderland, a system that generates a detailed 3D scene from a single image. The resulting scene can be explored from new viewpoints, much like a virtual environment, making it easier to visualize spaces and objects.
What's the problem?
Creating 3D scenes from a single image is challenging: most existing methods either need multiple input views or rely on slow per-scene optimization. They also tend to produce low-quality backgrounds and distorted geometry in regions that were not visible in the original image.
What's the solution?
Wonderland uses a camera-guided video diffusion model to expand the single input image into compressed video latents that encode multi-view information while remaining 3D-consistent. A large-scale reconstruction model then predicts a 3D Gaussian Splatting representation directly from these latents in a single feed-forward pass, so no per-scene optimization is needed. Together, the two stages produce high-quality, wide-scope 3D scenes quickly from just one input image.
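To make the two-stage design concrete, here is a minimal sketch of the data flow it describes: a camera-conditioned video diffusion model turns one image plus a camera trajectory into compressed video latents, and a feed-forward head decodes those latents into 3D Gaussian parameters. All class names, layer choices, and shapes below are illustrative assumptions, not the paper's actual architecture or API.

```python
# Minimal sketch of the two-stage pipeline described above (hypothetical names,
# not the paper's implementation).
import torch
import torch.nn as nn


class CameraGuidedVideoDiffusion(nn.Module):
    """Stage 1 (assumed interface): denoise compressed video latents
    conditioned on a single input image and a camera trajectory."""

    def __init__(self, latent_dim=16):
        super().__init__()
        # Placeholder denoiser; the real model is a large video diffusion network.
        self.denoiser = nn.Sequential(nn.Linear(latent_dim + 6, 256),
                                      nn.SiLU(),
                                      nn.Linear(256, latent_dim))

    @torch.no_grad()
    def sample(self, image_latents, camera_rays, steps=50):
        # image_latents: (T, H, W, C) latents seeded from the single input image
        # camera_rays:   (T, H, W, 6) per-frame ray encoding of the trajectory
        x = image_latents
        for _ in range(steps):  # highly simplified stand-in for a denoising loop
            x = x - 0.1 * self.denoiser(torch.cat([x, camera_rays], dim=-1))
        return x  # compressed, multi-view-consistent video latents


class LatentToGaussians(nn.Module):
    """Stage 2 (assumed interface): feed-forward head mapping video latents
    directly to 3D Gaussian parameters, with no per-scene optimization."""

    def __init__(self, latent_dim=16):
        super().__init__()
        # 3 position + 3 scale + 4 rotation + 3 color + 1 opacity = 14 values/Gaussian
        self.head = nn.Linear(latent_dim, 14)

    def forward(self, video_latents):
        params = self.head(video_latents)   # (T, H, W, 14)
        return params.reshape(-1, 14)       # one Gaussian per latent "pixel"


# Usage sketch: one image and one trajectory in, a set of Gaussians out.
T, H, W, C = 8, 32, 32, 16
diffusion = CameraGuidedVideoDiffusion(latent_dim=C)
reconstructor = LatentToGaussians(latent_dim=C)
latents = diffusion.sample(torch.randn(T, H, W, C), torch.randn(T, H, W, 6))
gaussians = reconstructor(latents)          # (T*H*W, 14) Gaussian parameters
```

The key point the sketch illustrates is that the reconstruction model reads the diffusion model's latent space directly, rather than decoded video frames, which is what keeps the pipeline efficient.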
Why it matters?
This technology is significant because it opens up new possibilities for various applications, such as virtual reality, gaming, and architectural visualization. By enabling high-quality 3D scene generation from a single image, it can enhance how we interact with digital content and improve our ability to visualize complex environments.
Abstract
This paper addresses a challenging question: How can we efficiently create high-quality, wide-scope 3D scenes from a single arbitrary image? Existing methods face several constraints, such as requiring multi-view data, time-consuming per-scene optimization, low visual quality in backgrounds, and distorted reconstructions in unseen areas. We propose a novel pipeline to overcome these limitations. Specifically, we introduce a large-scale reconstruction model that uses latents from a video diffusion model to predict 3D Gaussian Splattings for the scenes in a feed-forward manner. The video diffusion model is designed to create videos precisely following specified camera trajectories, allowing it to generate compressed video latents that contain multi-view information while maintaining 3D consistency. We train the 3D reconstruction model to operate on the video latent space with a progressive training strategy, enabling the efficient generation of high-quality, wide-scope, and generic 3D scenes. Extensive evaluations across various datasets demonstrate that our model significantly outperforms existing methods for single-view 3D scene generation, particularly with out-of-domain images. For the first time, we demonstrate that a 3D reconstruction model can be effectively built upon the latent space of a diffusion model to realize efficient 3D scene generation.
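The abstract emphasizes that the video diffusion model is designed to follow specified camera trajectories. One common way to expose a trajectory to such a model is to encode each camera as a per-pixel ray map in Plücker coordinates; the sketch below shows that encoding as an assumed example of this kind of conditioning, not as the paper's confirmed mechanism, and the function name and shapes are placeholders.

```python
# Hedged sketch: encoding a camera trajectory as per-pixel ray maps
# (Plücker coordinates), a common conditioning signal for camera-guided
# video models. The paper's exact conditioning mechanism may differ.
import torch


def plucker_ray_map(K, cam_to_world, height, width):
    """Return a (height, width, 6) ray encoding for one camera.

    K:            (3, 3) intrinsics
    cam_to_world: (4, 4) camera-to-world extrinsics
    """
    # Pixel grid sampled at pixel centers.
    ys, xs = torch.meshgrid(torch.arange(height, dtype=torch.float32) + 0.5,
                            torch.arange(width, dtype=torch.float32) + 0.5,
                            indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)          # (H, W, 3)

    # Ray directions in camera space, rotated into world space and normalized.
    dirs_cam = pix @ torch.linalg.inv(K).T                            # (H, W, 3)
    dirs_world = dirs_cam @ cam_to_world[:3, :3].T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)

    # Plücker coordinates: (direction, origin x direction).
    origin = cam_to_world[:3, 3].expand_as(dirs_world)
    moment = torch.cross(origin, dirs_world, dim=-1)
    return torch.cat([dirs_world, moment], dim=-1)                    # (H, W, 6)


# Example: a trajectory of 8 cameras becomes an (8, 64, 64, 6) conditioning
# tensor that could be fed alongside the video latents.
K = torch.tensor([[128.0, 0.0, 64.0], [0.0, 128.0, 64.0], [0.0, 0.0, 1.0]])
poses = [torch.eye(4) for _ in range(8)]
ray_maps = torch.stack([plucker_ray_map(K, p, 64, 64) for p in poses])
```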