
You See it, You Got it: Learning 3D Creation on Pose-Free Videos at Scale

Baorui Ma, Huachen Gao, Haoge Deng, Zhengxiong Luo, Tiejun Huang, Lulu Tang, Xinlong Wang

2024-12-10


Summary

This paper introduces See3D, a model that learns to create 3D content from ordinary videos without needing camera pose information, enabling easier and more scalable 3D generation.

What's the problem?

Most current methods for creating 3D models rely on limited 3D datasets and require precise information about how the camera was positioned for each frame. This makes it hard to generate 3D content from everyday videos found online, which rarely come with such annotations.

What's the solution?

The authors introduce See3D, which learns 3D creation from a large collection of Internet videos. They built a new dataset called WebVi3D, containing 320 million frames curated from 16 million video clips. Instead of requiring exact camera positions, See3D is conditioned on a purely 2D visual signal, created by adding time-dependent noise to masked video frames, which lets the model absorb 3D knowledge directly from the footage. This allows it to generate high-quality, consistent multi-view images, and from them 3D content, without complex pose annotations.
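To make the idea concrete, here is a minimal sketch (not the authors' released code) of how such a time-dependent, masked-noise visual condition could be constructed. The mask ratio, noise schedule, and tensor shapes are illustrative assumptions.

```python
# Minimal sketch of a pose-free "visual condition": mask video frames and add
# time-dependent noise, so the conditioning signal stays purely 2D.
# All specifics (mask ratio, linear schedule, shapes) are assumptions for illustration.
import torch

def make_visual_condition(frames: torch.Tensor, t: torch.Tensor, mask_ratio: float = 0.75):
    """frames: (B, T, C, H, W) video clip in [-1, 1]; t: (B,) diffusion timesteps in [0, 1]."""
    b, T, c, h, w = frames.shape
    # Randomly mask frames (a coarse stand-in for the paper's masking of video data),
    # so the model cannot simply copy pixels from the condition.
    mask = (torch.rand(b, T, 1, 1, 1) > mask_ratio).float()
    masked = frames * mask
    # Time-dependent corruption: more noise at larger t.
    noise = torch.randn_like(frames)
    alpha = (1.0 - t).view(b, 1, 1, 1, 1)  # simple linear schedule (assumption)
    visual_condition = alpha * masked + (1.0 - alpha) * noise
    return visual_condition, mask

# Usage: an 8-frame clip conditions a multi-view diffusion model without any camera poses.
frames = torch.rand(2, 8, 3, 64, 64) * 2 - 1
t = torch.rand(2)
cond, mask = make_visual_condition(frames, t)
print(cond.shape)  # torch.Size([2, 8, 3, 64, 64])
```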

Why it matters?

This research is important because it makes it easier to create 3D models from readily available video data, opening up new possibilities for applications in gaming, virtual reality, and more. By simplifying the process of generating 3D content, See3D can help artists and developers create immersive experiences without needing extensive resources or technical expertise.

Abstract

Recent 3D generation models typically rely on limited-scale 3D 'gold-labels' or 2D diffusion priors for 3D content creation. However, their performance is upper-bounded by constrained 3D priors due to the lack of scalable learning paradigms. In this work, we present See3D, a visual-conditional multi-view diffusion model trained on large-scale Internet videos for open-world 3D creation. The model aims to Get 3D knowledge by solely Seeing the visual contents from the vast and rapidly growing video data -- You See it, You Got it. To achieve this, we first scale up the training data using a proposed data curation pipeline that automatically filters out multi-view inconsistencies and insufficient observations from source videos. This results in a high-quality, richly diverse, large-scale dataset of multi-view images, termed WebVi3D, containing 320M frames from 16M video clips. Nevertheless, learning generic 3D priors from videos without explicit 3D geometry or camera pose annotations is nontrivial, and annotating poses for web-scale videos is prohibitively expensive. To eliminate the need for pose conditions, we introduce an innovative visual-condition - a purely 2D-inductive visual signal generated by adding time-dependent noise to the masked video data. Finally, we introduce a novel visual-conditional 3D generation framework by integrating See3D into a warping-based pipeline for high-fidelity 3D generation. Our numerical and visual comparisons on single and sparse reconstruction benchmarks show that See3D, trained on cost-effective and scalable video data, achieves notable zero-shot and open-world generation capabilities, markedly outperforming models trained on costly and constrained 3D datasets. Please refer to our project page at: https://vision.baai.ac.cn/see3d
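The abstract also mentions a warping-based generation pipeline. As a rough illustration of what warping typically means in this setting (the function below is an assumption, not See3D's actual implementation), a reference view can be re-projected into a nearby target camera using an estimated depth map, leaving holes for the diffusion model to fill in.

```python
# Minimal sketch (assumption, not the released pipeline): re-project a reference view
# into a target camera using a depth map; unfilled pixels are left for inpainting.
import numpy as np

def warp_to_target(image, depth, K, T_ref_to_tgt):
    """image: (H, W, 3); depth: (H, W); K: (3, 3) intrinsics; T_ref_to_tgt: (4, 4) relative pose."""
    h, w = depth.shape
    ys, xs = np.mgrid[0:h, 0:w]
    pix = np.stack([xs, ys, np.ones_like(xs)], axis=-1).reshape(-1, 3).T  # (3, N) homogeneous pixels
    # Back-project reference pixels to 3D, move them into the target frame, re-project.
    pts = np.linalg.inv(K) @ pix * depth.reshape(1, -1)
    pts_h = np.vstack([pts, np.ones((1, pts.shape[1]))])
    proj = K @ (T_ref_to_tgt @ pts_h)[:3]
    u = (proj[0] / proj[2]).round().astype(int)
    v = (proj[1] / proj[2]).round().astype(int)
    target = np.zeros_like(image)
    valid = (u >= 0) & (u < w) & (v >= 0) & (v < h) & (proj[2] > 0)
    target[v[valid], u[valid]] = image.reshape(-1, 3)[valid]  # holes stay black for inpainting
    return target

# Usage with toy inputs: an identity pose keeps the image in place (up to rounding).
img = np.random.rand(48, 64, 3)
d = np.ones((48, 64))
K = np.array([[50.0, 0, 32], [0, 50.0, 24], [0, 0, 1]])
print(warp_to_target(img, d, K, np.eye(4)).shape)  # (48, 64, 3)
```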