
Zero-Shot Novel View and Depth Synthesis with Multi-View Geometric Diffusion

Vitor Guizilini, Muhammad Zubair Irshad, Dian Chen, Greg Shakhnarovich, Rares Ambrus

2025-02-03


Summary

This paper introduces MVGD (Multi-View Geometric Diffusion), a new way to create views of a 3D scene from just a few 2D images. Unlike other methods that build an explicit 3D model first, MVGD directly generates new images and depth maps of the scene from different viewpoints.

What's the problem?

Current methods for making 3D scenes from a few images usually need to build an intermediate 3D model first, which can be complicated and time-consuming. These methods can also struggle to keep the scene looking consistent from different viewpoints.

What's the solution?

The researchers developed MVGD, which uses diffusion to generate new views of a scene directly at the pixel level. It works by combining information from the input images with spatial information about where each camera is positioned and where every pixel is looking (a small sketch of this idea follows below). MVGD generates images and depth maps at the same time, using learnable task embeddings to tell the model which of the two it should produce. It is trained on more than 60 million multi-view samples, and the researchers also found an efficient way to train larger, more powerful versions of MVGD by starting from smaller trained models and gradually growing them.
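The "spatial information" the paper conditions on is a raymap: for every pixel of the view being generated, a ray origin and direction derived from the camera pose and intrinsics. The sketch below is a minimal, hypothetical illustration of how such a raymap could be built in PyTorch; the exact parameterization MVGD uses (for example, how scene scale is normalized) may differ.

```python
import torch

def make_raymap(K: torch.Tensor, cam_to_world: torch.Tensor,
                height: int, width: int) -> torch.Tensor:
    """Per-pixel raymap (ray origin + unit direction, 6 channels) for one camera.

    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) camera-to-world pose
    Returns a (6, H, W) tensor.
    """
    # Pixel grid sampled at pixel centers.
    ys, xs = torch.meshgrid(
        torch.arange(height, dtype=torch.float32) + 0.5,
        torch.arange(width, dtype=torch.float32) + 0.5,
        indexing="ij",
    )
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1)   # (H, W, 3)

    # Unproject pixels to camera-space ray directions, then rotate to world space.
    dirs_cam = pix @ torch.linalg.inv(K).T                     # (H, W, 3)
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs_world = dirs_cam @ R.T
    dirs_world = dirs_world / dirs_world.norm(dim=-1, keepdim=True)

    origins = t.expand_as(dirs_world)                          # camera center, repeated per pixel
    raymap = torch.cat([origins, dirs_world], dim=-1)          # (H, W, 6)
    return raymap.permute(2, 0, 1)                             # (6, H, W)
```

A raymap like this can be computed for both the input views and the novel view to be generated, giving the diffusion model a shared spatial frame of reference across all cameras.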

Why it matters?

This matters because it could make creating 3D content much easier and faster. It could be used in virtual reality, video games, or even to help robots understand their surroundings better. The ability to create accurate depth maps along with images could be especially useful for tasks that need to understand the 3D structure of a scene. Plus, because MVGD can work with any number of input images, it's very flexible and could be used in many different situations.

Abstract

Current methods for 3D scene reconstruction from sparse posed images employ intermediate 3D representations such as neural fields, voxel grids, or 3D Gaussians, to achieve multi-view consistent scene appearance and geometry. In this paper we introduce MVGD, a diffusion-based architecture capable of direct pixel-level generation of images and depth maps from novel viewpoints, given an arbitrary number of input views. Our method uses raymap conditioning to both augment visual features with spatial information from different viewpoints, as well as to guide the generation of images and depth maps from novel views. A key aspect of our approach is the multi-task generation of images and depth maps, using learnable task embeddings to guide the diffusion process towards specific modalities. We train this model on a collection of more than 60 million multi-view samples from publicly available datasets, and propose techniques to enable efficient and consistent learning in such diverse conditions. We also propose a novel strategy that enables the efficient training of larger models by incrementally fine-tuning smaller ones, with promising scaling behavior. Through extensive experiments, we report state-of-the-art results in multiple novel view synthesis benchmarks, as well as multi-view stereo and video depth estimation.
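The abstract describes the incremental fine-tuning strategy only at a high level. As a rough illustration (not the paper's actual procedure), one common way to grow a transformer-based diffusion model is to initialize the extra layers of the larger model from a smaller trained one and then fine-tune the whole stack; the helper below is a hypothetical sketch of that idea.

```python
import copy
import torch.nn as nn

def grow_depth(trained_blocks: nn.ModuleList, extra_blocks: int) -> nn.ModuleList:
    """Build a deeper stack of transformer blocks initialized from a smaller
    trained model, so the larger model starts near the smaller one's solution
    and only needs fine-tuning (illustrative; MVGD's actual scheme may differ).
    """
    blocks = [copy.deepcopy(b) for b in trained_blocks]
    for _ in range(extra_blocks):
        # Duplicate the last trained block as a simple depth-growing heuristic.
        blocks.append(copy.deepcopy(trained_blocks[-1]))
    return nn.ModuleList(blocks)

# Usage sketch: fine-tune a 24-block model initialized from a trained 12-block one.
# large_blocks = grow_depth(small_model.blocks, extra_blocks=12)
```

The appeal of this kind of warm-start is that the larger model inherits most of the smaller model's behavior, so scaling up costs far less compute than training the big model from scratch.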