
Vid3D: Synthesis of Dynamic 3D Scenes using 2D Video Diffusion

Rishab Parthasarathy, Zack Ankner, Aaron Gokaslan

2024-06-18


Summary

This paper introduces Vid3D, a new model that uses 2D video diffusion to generate dynamic 3D videos. It explores whether it is actually necessary to enforce consistency across multiple views over time when generating 3D scenes, or whether each moment can be handled independently.

What's the problem?

Generating 3D videos is difficult because existing methods must maintain consistency across both viewpoints and time, which is complex and resource-intensive. Producing a high-quality 3D animation therefore involves jointly optimizing the scene so that it looks right from every angle at every moment.

What's the solution?

The authors propose Vid3D, which simplifies the process by first generating a 2D "seed" video that captures the scene's motion, and then independently generating a 3D representation for each timestep of that seed video. This lets the model handle each frame on its own, without enforcing consistency with the other frames. They tested Vid3D against two state-of-the-art 3D video generation methods and found that it produced comparable quality without explicitly modeling how the 3D scene changes over time.
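
To make the two-stage idea concrete, here is a minimal Python sketch of the pipeline's control flow. The function names and placeholder types are illustrative assumptions, not the authors' code: in the actual system the stand-ins below would be a 2D video diffusion model producing the seed video, a multiview generator applied to each frame, and a 3D reconstruction step (for example, Gaussian splatting) fit to each frame's views.

```python
# Hypothetical sketch of a Vid3D-style pipeline; all stages are placeholders.
from dataclasses import dataclass
from typing import List


@dataclass
class Frame:
    """One frame of the 2D seed video (placeholder)."""
    t: int


@dataclass
class Scene3D:
    """A static 3D representation for one timestep (placeholder)."""
    t: int
    num_views_used: int


def generate_seed_video(image, num_frames: int) -> List[Frame]:
    # Stand-in for stage 1: a 2D video diffusion model that turns the
    # input into a short "seed" video capturing the temporal dynamics.
    return [Frame(t) for t in range(num_frames)]


def generate_views(frame: Frame, num_views: int) -> List[str]:
    # Stand-in for stage 2: synthesizing several views of one timestep.
    return [f"view_{i}_of_frame_{frame.t}" for i in range(num_views)]


def fit_3d_representation(frame: Frame, views: List[str]) -> Scene3D:
    # Stand-in for stage 3: fitting a static 3D scene (for example,
    # Gaussian splats) to the views of this single timestep.
    return Scene3D(t=frame.t, num_views_used=len(views))


def vid3d_style_pipeline(image, num_frames: int = 16,
                         views_per_frame: int = 8) -> List[Scene3D]:
    seed = generate_seed_video(image, num_frames)  # 2D temporal "seed"
    # Each timestep is lifted to 3D independently; nothing ties
    # neighboring timesteps together after the seed video is made.
    return [
        fit_3d_representation(frame, generate_views(frame, views_per_frame))
        for frame in seed
    ]


if __name__ == "__main__":
    scenes = vid3d_style_pipeline(image=None)
    print(f"Generated {len(scenes)} independent per-timestep 3D scenes")
```

The point the sketch highlights is the loop body: every timestep is converted to 3D on its own, with no consistency term coupling it to its neighbors.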

Why it matters?

This research is important because it suggests that creating high-quality dynamic 3D scenes can be done more simply than previously thought. By not requiring complex consistency checks across time and views, Vid3D could make it easier and faster to generate realistic 3D animations. This advancement could benefit fields like gaming, virtual reality, and film production, where quick and effective 3D content creation is essential.

Abstract

A recent frontier in computer vision has been the task of 3D video generation, which consists of generating a time-varying 3D representation of a scene. To generate dynamic 3D scenes, current methods explicitly model 3D temporal dynamics by jointly optimizing for consistency across both time and views of the scene. In this paper, we instead investigate whether it is necessary to explicitly enforce multiview consistency over time, as current approaches do, or if it is sufficient for a model to generate 3D representations of each timestep independently. We hence propose a model, Vid3D, that leverages 2D video diffusion to generate 3D videos by first generating a 2D "seed" of the video's temporal dynamics and then independently generating a 3D representation for each timestep in the seed video. We evaluate Vid3D against two state-of-the-art 3D video generation methods and find that Vid3D achieves comparable results despite not explicitly modeling 3D temporal dynamics. We further ablate how the quality of Vid3D depends on the number of views generated per frame. While we observe some degradation with fewer views, performance degradation remains minor. Our results thus suggest that 3D temporal knowledge may not be necessary to generate high-quality dynamic 3D scenes, potentially enabling simpler generative algorithms for this task.