
3DreamBooth: High-Fidelity 3D Subject-Driven Video Generation Model

Hyun-kyu Ko, Jihyeon Park, Younghyun Kim, Dongheok Park, Eunbyung Park

2026-03-20

Summary

This paper introduces a new method for generating videos with a customized subject, with one key guarantee: the customized object looks correct from *any* angle, as if it were a real 3D object rather than a flat picture.

What's the problem?

Current methods for customizing videos treat objects as flat pictures rather than actual 3D shapes. This works well enough for a single viewpoint, but when the object is shown from a different angle, the system has to *guess* what the unseen parts look like, which leads to inconsistencies and a loss of the object's true identity. On top of that, multi-view video footage for training such systems is scarce, and fine-tuning on a handful of video clips tends to make the system memorize those specific videos instead of learning a generalizable 3D shape.

What's the solution?

The researchers developed a two-part system called 3DreamBooth and 3Dapter. 3DreamBooth focuses on learning the 3D shape of an object from just *one* picture by carefully updating the model's understanding of the object's form, while keeping its movement consistent. Then, 3Dapter adds detailed textures and speeds up the process by cleverly using information from a few reference views to guide the customization, acting like a smart helper that knows what details to add based on the viewing angle.
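The core trick in 3DreamBooth, per the abstract, is to restrict fine-tuning updates to the model's spatial representations while leaving its temporal (motion) components frozen, so a 3D prior can be learned from a single frame without video-based training. Here is a minimal conceptual sketch of that parameter-partitioning idea; it is not the authors' code, and all names (`partition_params`, `one_frame_update`, the `"spatial"`/`"temporal"` naming scheme) are hypothetical illustrations:

```python
# Conceptual sketch of "1-frame optimization": update only spatial
# parameters, keep temporal parameters frozen. Hypothetical names.

def partition_params(params):
    """Split a name->value dict into spatial (trainable) and temporal (frozen)."""
    spatial = {k: v for k, v in params.items() if "spatial" in k}
    temporal = {k: v for k, v in params.items() if "temporal" in k}
    return spatial, temporal

def one_frame_update(params, grads, lr=0.1):
    """Apply a gradient step only to spatial parameters."""
    spatial, _ = partition_params(params)
    return {
        k: (v - lr * grads[k]) if k in spatial else v
        for k, v in params.items()
    }

params = {"spatial.attn.w": 1.0, "temporal.attn.w": 1.0}
grads = {"spatial.attn.w": 0.5, "temporal.attn.w": 0.5}
updated = one_frame_update(params, grads)
# The spatial weight moves (1.0 -> 0.95); the temporal weight stays at 1.0,
# which is how the method avoids overfitting to temporal patterns.
```

In a real diffusion backbone the split would be between spatial attention/convolution layers and temporal attention layers, but the update rule follows the same pattern: the optimizer simply never sees the frozen parameter group.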

Why it matters?

This work matters because it opens the door to more realistic and immersive experiences in virtual reality, augmented reality, virtual production (filming with digital sets), and e-commerce, where a shopper could inspect a product from any angle before buying. By preserving the true 3D nature of objects, it enables far more convincing and flexible video customization.

Abstract

Creating dynamic, view-consistent videos of customized subjects is highly sought after for a wide range of emerging applications, including immersive VR/AR, virtual production, and next-generation e-commerce. However, despite rapid progress in subject-driven video generation, existing methods predominantly treat subjects as 2D entities, focusing on transferring identity through single-view visual features or textual prompts. Because real-world subjects are inherently 3D, applying these 2D-centric approaches to 3D object customization reveals a fundamental limitation: they lack the comprehensive spatial priors necessary to reconstruct the 3D geometry. Consequently, when synthesizing novel views, they must rely on generating plausible but arbitrary details for unseen regions, rather than preserving the true 3D identity. Achieving genuine 3D-aware customization remains challenging due to the scarcity of multi-view video datasets. While one might attempt to fine-tune models on limited video sequences, this often leads to temporal overfitting. To resolve these issues, we introduce a novel framework for 3D-aware video customization, comprising 3DreamBooth and 3Dapter. 3DreamBooth decouples spatial geometry from temporal motion through a 1-frame optimization paradigm. By restricting updates to spatial representations, it effectively bakes a robust 3D prior into the model without the need for exhaustive video-based training. To enhance fine-grained textures and accelerate convergence, we incorporate 3Dapter, a visual conditioning module. Following single-view pre-training, 3Dapter undergoes multi-view joint optimization with the main generation branch via an asymmetrical conditioning strategy. This design allows the module to act as a dynamic selective router, querying view-specific geometric hints from a minimal reference set. Project page: https://ko-lani.github.io/3DreamBooth/