
Sampling 3D Gaussian Scenes in Seconds with Latent Diffusion Models

Paul Henderson, Melonie de Almeida, Daniela Ivanova, Titas Anciukevičius

2024-06-21


Summary

This paper introduces a new method for generating 3D scenes in a fraction of a second with latent diffusion models that can be trained using only 2D images.

What's the problem?

Creating detailed 3D scenes usually requires extra supervision such as object masks or depth data, which is complicated and time-consuming to collect, especially for complex scenes captured from arbitrary camera angles. Earlier generative approaches, including non-latent diffusion models and NeRF-based methods, are also slow to sample from, making it hard to generate scenes quickly and efficiently.

What's the solution?

The researchers developed a latent diffusion model that starts by using an autoencoder to convert multiple 2D views of a scene into a 3D representation made of Gaussian splats, while simultaneously building a compressed latent version of that representation that is easier to work with. They then train a multi-view diffusion model on this latent space, allowing the system to generate 3D scenes rapidly. Their method does not require object masks or depth data, making it suitable for complex scenes with arbitrary camera positions. In experiments on two large real-world datasets, MVImgNet and RealEstate10K, they showed that their model can create a 3D scene in as little as 0.2 seconds, whether starting from scratch or using one or more input views.
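To make the two-stage idea concrete, here is a minimal, hypothetical PyTorch sketch of the pipeline described above: an autoencoder compresses multi-view images into a latent and decodes it to per-pixel Gaussian-splat parameters, and a small denoiser is trained over that latent space. The layer sizes, the 14-dimensional splat parameterization, and the simple noising schedule are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch (not the authors' code): autoencoder to Gaussian splats
# plus a diffusion model over the compressed latent. Shapes are placeholders.
import torch
import torch.nn as nn

class SplatAutoencoder(nn.Module):
    """Maps posed views to a compressed latent, then decodes the latent to
    per-pixel splat parameters (e.g. mean, scale, rotation, opacity, colour)."""
    def __init__(self, latent_dim=64, splat_params=14):
        super().__init__()
        self.encoder = nn.Sequential(          # images -> compressed latent map
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, latent_dim, 4, stride=2, padding=1),
        )
        self.decoder = nn.Sequential(          # latent -> per-pixel splat params
            nn.ConvTranspose2d(latent_dim, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, splat_params, 4, stride=2, padding=1),
        )

    def forward(self, views):                  # views: (B, V, 3, H, W)
        b, v, c, h, w = views.shape
        z = self.encoder(views.reshape(b * v, c, h, w))
        splats = self.decoder(z)
        return z.reshape(b, v, *z.shape[1:]), splats.reshape(b, v, *splats.shape[1:])

class LatentDenoiser(nn.Module):
    """Predicts the noise added to the latent at a given diffusion timestep."""
    def __init__(self, latent_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(latent_dim + 1, 128, 3, padding=1), nn.ReLU(),
            nn.Conv2d(128, latent_dim, 3, padding=1),
        )

    def forward(self, z_noisy, t):             # timestep broadcast as an extra channel
        t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *z_noisy.shape[-2:])
        return self.net(torch.cat([z_noisy, t_map], dim=1))

# Toy training step: encode views, add noise to the latent, predict the noise.
views = torch.randn(2, 4, 3, 64, 64)           # 2 scenes, 4 views each
ae, denoiser = SplatAutoencoder(), LatentDenoiser()
z, splats = ae(views)
z = z.reshape(-1, *z.shape[2:])                # flatten scene/view dimensions
t = torch.rand(z.shape[0])                     # random timesteps in [0, 1]
noise = torch.randn_like(z)
z_noisy = torch.sqrt(1 - t).view(-1, 1, 1, 1) * z + torch.sqrt(t).view(-1, 1, 1, 1) * noise
loss = nn.functional.mse_loss(denoiser(z_noisy, t), noise)
print(loss.item())
```

The key design point this sketch illustrates is that the diffusion model never sees raw splats or images at training time; it only operates on the much smaller latent, which is what makes sampling fast.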

Why it matters?

This research is important because it significantly speeds up the process of generating realistic 3D scenes, making it much more accessible for applications like virtual reality, gaming, and architectural visualization. By simplifying the requirements for creating these scenes, this method opens up new possibilities for developers and artists to create rich visual environments quickly and efficiently.

Abstract

We present a latent diffusion model over 3D scenes that can be trained using only 2D image data. To achieve this, we first design an autoencoder that maps multi-view images to 3D Gaussian splats, and simultaneously builds a compressed latent representation of these splats. Then, we train a multi-view diffusion model over the latent space to learn an efficient generative model. This pipeline does not require object masks nor depths, and is suitable for complex scenes with arbitrary camera positions. We conduct careful experiments on two large-scale datasets of complex real-world scenes -- MVImgNet and RealEstate10K. We show that our approach enables generating 3D scenes in as little as 0.2 seconds, either from scratch, from a single input view, or from sparse input views. It produces diverse and high-quality results while running an order of magnitude faster than non-latent diffusion models and earlier NeRF-based generative models.
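The abstract highlights three generation modes (from scratch, from a single view, and from sparse views), all served by the same latent sampler. The sketch below shows, under stated assumptions, what that inference loop could look like: iteratively denoise a random latent, optionally conditioning on latents encoded from input views, then decode to splats. The denoiser/decoder interfaces, the conditioning argument, the Euler-style update, and the 50-step schedule are assumptions for illustration, not the paper's actual sampler.

```python
# Hypothetical inference sketch: reverse diffusion over the latent, then decode.
import torch

@torch.no_grad()
def sample_scene(denoiser, decoder, cond_latents=None, steps=50, shape=(1, 64, 16, 16)):
    """Sample a scene latent by iterative denoising and decode it to splats.
    cond_latents: None for unconditional generation, or latents encoded from
    one or more input views to condition on (single-view or sparse-view)."""
    z = torch.randn(shape)                       # start from pure noise
    for i in reversed(range(1, steps + 1)):
        t = torch.full((shape[0],), i / steps)   # normalised timestep in (0, 1]
        eps = denoiser(z, t, cond_latents)       # predicted noise (conditioned or not)
        z = z - eps / steps                      # crude Euler-style update (illustrative)
    return decoder(z)                            # latent -> 3D Gaussian splat parameters

# Toy demo with stand-in networks, just to show the call pattern.
toy_denoiser = lambda z, t, cond: torch.zeros_like(z)
toy_decoder = lambda z: z
splats_uncond = sample_scene(toy_denoiser, toy_decoder)                      # "from scratch"
splats_cond = sample_scene(toy_denoiser, toy_decoder,
                           cond_latents=torch.randn(1, 64, 16, 16))          # conditioned
```

Because the loop runs entirely in the compressed latent space and the splats are decoded once at the end, a small number of denoising steps is enough, which is consistent with the sub-second sampling times the paper reports.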