Droplet3D: Commonsense Priors from Videos Facilitate 3D Generation

Xiaochuan Li, Guoguang Du, Runze Zhang, Liang Jin, Qi Jia, Lihua Lu, Zhenhua Guo, Yaqian Zhao, Haiyang Liu, Tianqi Wang, Changsheng Li, Xiaoli Gong, Rengang Li, Baoyu Fan

2025-09-01

Summary

This paper explores a new way to create 3D models by using videos as a learning tool, overcoming the challenge of limited 3D data available for training AI.

What's the problem?

Creating realistic 3D models requires a lot of 3D data, but there simply isn't much of it online compared to things like images or videos. This lack of data makes it hard for AI to learn to generate good 3D content and understand how things should look from all angles, leading to unrealistic or inconsistent results.

What's the solution?

The researchers observed that videos already encode a great deal of commonsense about the real world: how objects look from different viewpoints (a spatial-consistency prior) and what those objects are semantically. They built a large-scale dataset called Droplet3D-4M, consisting of videos with multi-view-level annotations, and trained a new AI model, Droplet3D, that generates 3D content from an image together with a dense text description, drawing on the priors learned from video. The result is 3D objects that are more spatially consistent and more semantically plausible.
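To make the pipeline concrete, here is a minimal, illustrative-only Python sketch of the flow the paragraph describes: an image plus a text prompt conditions a video-style model that emits multiple views of an object, and a reconstruction step aggregates those views into a 3D asset. Every function and name below is a hypothetical stand-in for exposition, not the released Droplet3D API.

```python
import numpy as np

def generate_multiview_frames(image: np.ndarray, prompt: str, n_views: int = 8) -> np.ndarray:
    """Hypothetical stand-in for the video backbone: returns n_views frames
    'circling' the object, conditioned on the input image and text prompt."""
    rng = np.random.default_rng(len(prompt))           # deterministic toy randomness
    views = np.stack([image for _ in range(n_views)])  # one frame per viewpoint
    return views + rng.normal(0.0, 0.01, views.shape)  # toy per-view variation

def reconstruct_3d(frames: np.ndarray) -> dict:
    """Hypothetical stand-in for multi-view reconstruction: aggregates the
    generated frames into a (placeholder) 3D asset representation."""
    return {"n_views": frames.shape[0], "mean_view": frames.mean(axis=0)}

image = np.zeros((64, 64, 3))                          # toy conditioning image
frames = generate_multiview_frames(image, "a ceramic teapot")
asset = reconstruct_3d(frames)
print(asset["n_views"])                                # 8
```

The point of the sketch is the division of labor: the video model supplies cross-view consistency, and reconstruction only has to fuse views that already agree with one another.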

Why it matters?

This work is important because it opens up a new path for creating 3D content, especially when you don't have a lot of existing 3D data. By leveraging the abundance of videos, we can build AI that can generate more realistic and complex 3D scenes, potentially moving beyond just creating individual objects to entire environments. The researchers also made all their resources publicly available, allowing others to build upon their work.

Abstract

Scaling laws have validated the success and promise of large-data-trained models in creative generation across text, image, and video domains. However, this paradigm faces data scarcity in the 3D domain, as there is far less of it available on the internet compared to the aforementioned modalities. Fortunately, there exist adequate videos that inherently contain commonsense priors, offering an alternative supervisory signal to mitigate the generalization bottleneck caused by limited native 3D data. On the one hand, videos capturing multiple views of an object or scene provide a spatial consistency prior for 3D generation. On the other hand, the rich semantic information contained within the videos enables the generated content to be more faithful to the text prompts and semantically plausible. This paper explores how to apply the video modality in 3D asset generation, spanning datasets to models. We introduce Droplet3D-4M, the first large-scale video dataset with multi-view level annotations, and train Droplet3D, a generative model supporting both image and dense text input. Extensive experiments validate the effectiveness of our approach, demonstrating its ability to produce spatially consistent and semantically plausible content. Moreover, in contrast to the prevailing 3D solutions, our approach exhibits the potential for extension to scene-level applications. This indicates that the commonsense priors from the videos significantly facilitate 3D creation. We have open-sourced all resources including the dataset, code, technical framework, and model weights: https://dropletx.github.io/.