Turbo3D: Ultra-fast Text-to-3D Generation
Hanzhe Hu, Tianwei Yin, Fujun Luan, Yiwei Hu, Hao Tan, Zexiang Xu, Sai Bi, Shubham Tulsiani, Kai Zhang
2024-12-10

Summary
This paper introduces Turbo3D, a system that generates high-quality 3D models from text descriptions in under one second.
What's the problem?
Generating realistic 3D models from text has traditionally been slow and complicated, often requiring many denoising steps and long processing times. Previous methods struggled to produce high-quality results quickly, making them difficult to use in real-time applications.
What's the solution?
The authors developed Turbo3D, which first generates four views of an object with a fast 4-step diffusion model and then turns those views into a 3D Gaussian splatting asset using an efficient feed-forward reconstructor. The diffusion model is distilled from two teachers: a multi-view teacher that enforces consistency across views and a single-view teacher that enforces photo-realism in each view. Both stages operate in a compact 'latent space' rather than directly on images, which skips the costly image decoding step and shortens the sequences the networks must process. Together, these choices let Turbo3D create detailed 3D assets much faster than previous methods.
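To make the two-stage pipeline concrete, here is a minimal, hedged sketch in PyTorch-style Python. The class names (MultiViewLatentDiffusion, LatentGaussianReconstructor), latent shapes, and the toy denoising update are illustrative assumptions for exposition only, not the authors' actual architecture or code.

```python
# Illustrative sketch of a latent-space text-to-3D pipeline: a few-step multi-view
# latent generator followed by a feed-forward Gaussian reconstructor. All module
# names, shapes, and update rules are placeholders, not Turbo3D's real components.
import torch
import torch.nn as nn


class MultiViewLatentDiffusion(nn.Module):
    """Stand-in for the distilled 4-step, 4-view latent diffusion generator."""

    def __init__(self, latent_dim: int = 4, latent_res: int = 32, num_views: int = 4):
        super().__init__()
        self.latent_dim, self.latent_res, self.num_views = latent_dim, latent_res, num_views
        self.denoiser = nn.Conv2d(latent_dim, latent_dim, kernel_size=3, padding=1)

    def forward(self, text_embedding: torch.Tensor, num_steps: int = 4) -> torch.Tensor:
        # Text conditioning is omitted in this toy sketch.
        # Start from Gaussian noise in latent space, one latent per view.
        latents = torch.randn(self.num_views, self.latent_dim, self.latent_res, self.latent_res)
        for _ in range(num_steps):  # only a handful of denoising steps after distillation
            latents = latents - 0.1 * self.denoiser(latents)  # toy update, not a real sampler
        return latents  # (num_views, C, H, W) multi-view latents


class LatentGaussianReconstructor(nn.Module):
    """Stand-in for a feed-forward reconstructor mapping latents to 3D Gaussians."""

    def __init__(self, latent_dim: int = 4, gaussian_params: int = 14):
        super().__init__()
        # 14 parameters per Gaussian: position (3), scale (3), rotation (4), opacity (1), color (3).
        self.head = nn.Conv2d(latent_dim, gaussian_params, kernel_size=1)

    def forward(self, latents: torch.Tensor) -> torch.Tensor:
        # Predict one Gaussian per latent token; no decoding back to pixel images is
        # needed, which is where the reported speedup comes from.
        return self.head(latents).flatten(2).transpose(1, 2)  # (views, tokens, 14)


if __name__ == "__main__":
    text_embedding = torch.randn(1, 77, 768)  # placeholder text-encoder output
    generator = MultiViewLatentDiffusion()
    reconstructor = LatentGaussianReconstructor()
    gaussians = reconstructor(generator(text_embedding))
    print(gaussians.shape)  # e.g. torch.Size([4, 1024, 14])
```

Running the script end to end produces one set of per-view Gaussian parameters per forward pass; the key design point the sketch mirrors is that nothing in the pipeline ever leaves latent space until the Gaussians are rendered.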
Why it matters?
This research is important because it makes it easier and faster to create 3D models from simple text prompts. This advancement can benefit fields such as gaming, virtual reality, and design, where fast, realistic 3D content generation directly improves user experiences.
Abstract
We present Turbo3D, an ultra-fast text-to-3D system capable of generating high-quality Gaussian splatting assets in under one second. Turbo3D employs a rapid 4-step, 4-view diffusion generator and an efficient feed-forward Gaussian reconstructor, both operating in latent space. The 4-step, 4-view generator is a student model distilled through a novel Dual-Teacher approach, which encourages the student to learn view consistency from a multi-view teacher and photo-realism from a single-view teacher. By shifting the Gaussian reconstructor's inputs from pixel space to latent space, we eliminate the extra image decoding time and halve the transformer sequence length for maximum efficiency. Our method demonstrates superior 3D generation results compared to previous baselines, while operating in a fraction of their runtime.
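The Dual-Teacher distillation mentioned in the abstract can be summarized as a student loss with two supervision signals. The following Python sketch shows one plausible form of such an objective; the teacher interfaces, the use of an MSE-style matching loss, and the loss weights are assumptions made for illustration, not the paper's exact formulation.

```python
# Hedged sketch of a dual-teacher distillation objective: a multi-view teacher
# supervises cross-view consistency, a single-view teacher supervises per-view
# realism. Function names and loss form are illustrative assumptions.
import torch
import torch.nn.functional as F


def dual_teacher_loss(student_latents: torch.Tensor,
                      multi_view_teacher,
                      single_view_teacher,
                      lambda_mv: float = 1.0,
                      lambda_sv: float = 1.0) -> torch.Tensor:
    """student_latents: (num_views, C, H, W) latents produced by the few-step student."""
    # Multi-view teacher scores all views jointly, encouraging view consistency.
    mv_target = multi_view_teacher(student_latents).detach()
    loss_mv = F.mse_loss(student_latents, mv_target)

    # Single-view teacher scores each view independently, encouraging photo-realism.
    sv_target = torch.stack([single_view_teacher(v.unsqueeze(0)).squeeze(0)
                             for v in student_latents]).detach()
    loss_sv = F.mse_loss(student_latents, sv_target)

    return lambda_mv * loss_mv + lambda_sv * loss_sv


if __name__ == "__main__":
    # Smoke test with dummy teachers that slightly perturb their input.
    latents = torch.randn(4, 4, 32, 32, requires_grad=True)
    dummy_teacher = lambda x: x + 0.01 * torch.randn_like(x)
    loss = dual_teacher_loss(latents, dummy_teacher, dummy_teacher)
    loss.backward()
    print(loss.item())
```

The intent of the split is that neither teacher alone is enough: the multi-view teacher keeps the four generated views geometrically consistent, while the single-view teacher keeps each individual view sharp and realistic.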