DiffSplat: Repurposing Image Diffusion Models for Scalable Gaussian Splat Generation
Chenguo Lin, Panwang Pan, Bangbang Yang, Zeming Li, Yadong Mu
2025-01-29

Summary
This paper introduces DiffSplat, a new way to create 3D content by repurposing AI models that are normally used to make 2D images. It's like teaching an artist who's really good at drawing pictures how to sculpt 3D objects.
What's the problem?
Creating 3D content from a text description or a single image is hard for two reasons: there aren't many high-quality 3D examples to learn from, and when AI generates several 2D views of an object and tries to stitch them together, the views often don't match up. It's like trying to build a 3D model of a house when you've only ever seen flat pictures of houses and have very few real 3D house models to study.
What's the solution?
The researchers created DiffSplat, which adapts AI that is already great at generating 2D images to produce 3D content directly. It represents 3D scenes with 'Gaussian splats', which are like smart 3D pixels (sketched below). Because the splats are arranged in image-like grids, DiffSplat can learn from the huge number of 2D images on the internet while keeping everything consistent in 3D. The researchers also built a lightweight reconstruction model that quickly turns 2D views into splat grids for training data, and added a rendering check during training to make sure the generated 3D output looks right from any angle.
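To make 'smart 3D pixels' concrete: each Gaussian splat is a small, soft 3D blob described by a handful of numbers. Below is a minimal Python sketch of the standard 3D Gaussian Splatting parameterization; the class and field names are illustrative assumptions, not DiffSplat's exact data layout.

    # Illustrative only: the standard 3D Gaussian splat parameterization.
    # Field names are assumptions, not DiffSplat's exact per-pixel format.
    from dataclasses import dataclass
    import numpy as np

    @dataclass
    class GaussianSplat:
        position: np.ndarray  # (3,) center of the Gaussian in 3D space
        rotation: np.ndarray  # (4,) unit quaternion orienting the blob
        scale: np.ndarray     # (3,) per-axis radii of the blob
        opacity: float        # in [0, 1]: how strongly it occludes
        color: np.ndarray     # (3,) RGB (full 3DGS uses spherical harmonics)

    # One red, semi-transparent splat at the origin.
    splat = GaussianSplat(
        position=np.zeros(3),
        rotation=np.array([1.0, 0.0, 0.0, 0.0]),  # identity rotation
        scale=np.full(3, 0.05),
        opacity=0.8,
        color=np.array([0.9, 0.2, 0.2]),
    )

DiffSplat packs many such splats into image-like multi-view grids, which is what lets an image diffusion model generate them directly.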
Why does it matter?
This matters because it could make creating 3D content much easier and faster. Instead of needing lots of 3D models to learn from, DiffSplat can use the huge amount of 2D images available online. This could lead to better virtual reality experiences, more realistic video games, and new ways to design products or plan buildings. It's a big step towards making 3D creation as easy as describing what you want or showing a single picture.
Abstract
Recent methods for 3D content generation from text or a single image struggle with limited high-quality 3D datasets and inconsistency from 2D multi-view generation. We introduce DiffSplat, a novel 3D generative framework that natively generates 3D Gaussian splats by taming large-scale text-to-image diffusion models. It differs from previous 3D generative models by effectively utilizing web-scale 2D priors while maintaining 3D consistency in a unified model. To bootstrap the training, a lightweight reconstruction model is proposed to instantly produce multi-view Gaussian splat grids for scalable dataset curation. In conjunction with the regular diffusion loss on these grids, a 3D rendering loss is introduced to facilitate 3D coherence across arbitrary views. The compatibility with image diffusion models enables seamless adaptation of numerous image generation techniques to the 3D realm. Extensive experiments reveal the superiority of DiffSplat in text- and image-conditioned generation tasks and downstream applications. Thorough ablation studies validate the efficacy of each critical design choice and provide insights into the underlying mechanism.
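To make the training recipe in the abstract concrete, here is a hedged sketch of how the regular diffusion loss on splat grids might be combined with the 3D rendering loss. Everything here is an illustrative assumption: the function names (denoiser, render_views), the simplified noising scheme, and the plain MSE rendering loss stand in for the paper's actual implementation.

    # A minimal sketch (assumed names and noising scheme, not the paper's code)
    # of combining a denoising loss on Gaussian-splat grids with a 3D rendering loss.
    import torch
    import torch.nn.functional as F

    def diffsplat_style_loss(denoiser, render_views, splat_grid, gt_views,
                             cameras, sigma, lam=1.0):
        """splat_grid: (B, C, H, W) image-like tensor of splat parameters."""
        # Regular diffusion loss: corrupt the grid with noise, predict the noise.
        noise = torch.randn_like(splat_grid)
        noisy_grid = splat_grid + sigma * noise      # simplified forward process
        pred_noise = denoiser(noisy_grid, sigma)
        l_diffusion = F.mse_loss(pred_noise, noise)

        # 3D rendering loss: estimate the clean grid, render the splats from
        # arbitrary cameras with a differentiable rasterizer (assumed to be
        # provided as render_views), and compare against ground-truth views.
        est_grid = noisy_grid - sigma * pred_noise   # rough clean-grid estimate
        renders = render_views(est_grid, cameras)    # (B, V, 3, H, W)
        l_render = F.mse_loss(renders, gt_views)

        return l_diffusion + lam * l_render

The key design point is that the denoiser consumes and produces splat grids exactly as an image diffusion model consumes and produces images, so web-scale 2D priors transfer, while the rendering term enforces that the decoded splats stay coherent from arbitrary viewpoints.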