World-consistent Video Diffusion with Explicit 3D Modeling

Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista, Kevin Miao, Alexander Toshev, Joshua Susskind, Jiatao Gu

2024-12-03

Summary

This paper introduces World-consistent Video Diffusion (WVD), a new framework that improves video generation by explicitly modeling 3D information, so that the generated videos stay geometrically consistent and realistic across views.

What's the problem?

Current diffusion models have made great strides in generating images and videos, but they often struggle to keep that content consistent in three dimensions (3D). The inconsistency can lead to unrealistic or confusing visuals, especially when the videos are meant to depict real-world scenes. Generating such videos efficiently while maintaining quality is a further challenge.

What's the solution?

WVD addresses this problem by adding explicit 3D supervision in the form of XYZ images, which record the global 3D coordinate of every pixel in each frame. The model learns the joint relationship between color (RGB) and spatial (XYZ) information, so the videos it generates remain geometrically consistent when viewed from different angles. It also uses a flexible inpainting strategy: given some RGB or XYZ frames, it can fill in the missing ones. This lets a single model handle a range of tasks, such as turning one image into a 3D representation or generating video that follows a specified camera trajectory (see the sketch after this paragraph for what an XYZ image is).
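To make the idea of an XYZ image concrete, here is a minimal sketch (not the authors' code; the depth-based construction, function name, and shapes are illustrative assumptions) of how a per-pixel map of global 3D coordinates can be formed by unprojecting a depth map with known camera intrinsics and pose:

```python
# Minimal sketch of an "XYZ image": for each pixel, the global 3D coordinate of
# the surface point it observes, obtained here by unprojecting a depth map with
# intrinsics K and a camera-to-world pose. Names and shapes are illustrative.
import numpy as np

def xyz_image_from_depth(depth, K, cam_to_world):
    """depth: (H, W) metric depth; K: (3, 3) intrinsics; cam_to_world: (4, 4) pose."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))           # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)          # (H, W, 3) homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                           # camera-space viewing rays
    pts_cam = rays * depth[..., None]                         # scale by depth -> camera coords
    pts_cam_h = np.concatenate([pts_cam, np.ones_like(depth)[..., None]], axis=-1)
    pts_world = pts_cam_h @ cam_to_world.T                    # transform to the global frame
    return pts_world[..., :3]                                 # (H, W, 3) XYZ image

# Example: a flat plane 2 m in front of an identity camera.
K = np.array([[500.0, 0, 320], [0, 500.0, 240], [0, 0, 1]])
xyz = xyz_image_from_depth(np.full((480, 640), 2.0), K, np.eye(4))
```

Because every frame's XYZ image lives in the same global coordinate system, the model gets a shared 3D reference that ties all the views of a scene together.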

Why it matters?

This research is significant because it improves the ability of AI to generate high-quality, realistic videos whose different views agree with a single underlying 3D scene. That kind of consistency makes WVD useful in areas such as film production, video games, virtual reality, and education, where accurate representation of 3D space is essential.

Abstract

Recent advancements in diffusion models have set new benchmarks in image and video generation, enabling realistic visual synthesis across single- and multi-frame contexts. However, these models still struggle with efficiently and explicitly generating 3D-consistent content. To address this, we propose World-consistent Video Diffusion (WVD), a novel framework that incorporates explicit 3D supervision using XYZ images, which encode global 3D coordinates for each image pixel. More specifically, we train a diffusion transformer to learn the joint distribution of RGB and XYZ frames. This approach supports multi-task adaptability via a flexible inpainting strategy. For example, WVD can estimate XYZ frames from ground-truth RGB or generate novel RGB frames using XYZ projections along a specified camera trajectory. In doing so, WVD unifies tasks like single-image-to-3D generation, multi-view stereo, and camera-controlled video generation. Our approach demonstrates competitive performance across multiple benchmarks, providing a scalable solution for 3D-consistent video and image generation with a single pretrained model.
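The multi-task adaptability described in the abstract comes from treating every task as inpainting over the joint RGB and XYZ frames: the observed parts are held fixed while the rest are denoised. Below is a minimal sketch of that conditioning idea, assuming a simple per-channel binary mask; the 6-channel layout, task labels, and function names are illustrative assumptions, not the paper's implementation.

```python
# Sketch of inpainting-style conditioning: each frame carries 6 channels
# (RGB + XYZ), and a binary mask marks which entries are observed (kept fixed
# during sampling) versus generated (denoised from noise).
import numpy as np

def conditioning_mask(num_frames, H, W, task):
    """Return a (num_frames, 6, H, W) mask: 1 = observed/conditioned, 0 = generated."""
    mask = np.zeros((num_frames, 6, H, W), dtype=np.float32)
    if task == "xyz_from_rgb":           # geometry estimation: RGB given, XYZ generated
        mask[:, :3] = 1.0
    elif task == "rgb_from_xyz":         # camera-controlled generation: XYZ given, RGB generated
        mask[:, 3:] = 1.0
    elif task == "single_image_to_3d":   # only the first frame's RGB is observed
        mask[0, :3] = 1.0
    return mask

def apply_conditioning(x_t, x_observed, mask):
    """At each denoising step, overwrite observed entries with their known values."""
    return mask * x_observed + (1.0 - mask) * x_t
```

Changing only the mask switches the same pretrained model between XYZ estimation, camera-controlled video generation, and single-image-to-3D, which is the unification the abstract describes.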